Pandora.Apache.Avro.IDL.To.Apache.Parquet 0.11.32

.NET 6.0

dotnet add package Pandora.Apache.Avro.IDL.To.Apache.Parquet --version 0.11.32

NuGet\Install-Package Pandora.Apache.Avro.IDL.To.Apache.Parquet -Version 0.11.32

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Pandora.Apache.Avro.IDL.To.Apache.Parquet" Version="0.11.32" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add Pandora.Apache.Avro.IDL.To.Apache.Parquet --version 0.11.32

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Pandora.Apache.Avro.IDL.To.Apache.Parquet, 0.11.32"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install Pandora.Apache.Avro.IDL.To.Apache.Parquet as a Cake Addin
#addin nuget:?package=Pandora.Apache.Avro.IDL.To.Apache.Parquet&version=0.11.32

// Install Pandora.Apache.Avro.IDL.To.Apache.Parquet as a Cake Tool
#tool nuget:?package=Pandora.Apache.Avro.IDL.To.Apache.Parquet&version=0.11.32

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Pandora.Apache.Avro.IDL.To.Apache.Parquet

Background
How to use the library
How to contribute
Project dependencies

Background

When working with Apache Kafka® and Azure Databricks® (Apache Spark®), there is a built-in mechanism for transforming Apache Avro® data into Apache Parquet® files. The issue with this approach, seen from the perspective of a medallion lakehouse architecture, is that AVRO with nested data is persisted as a single PARQUET file in the bronze layer (the full, raw and unprocessed history of each dataset), relying on ArrayType, MapType and StructType to represent the nested data. This makes it more tedious to post-process the data in the subsequent layers: silver (validated and deduplicated data) and gold (data as knowledge).


Figure 1: Delta lake medallion architecture and data mesh

To avoid this issue, we present an open-source library that helps transform AVRO with nested data into multiple PARQUET files, where each nested data element is represented as an extension table (a separate file). This makes it possible to merge the bronze and silver layers (the full, raw history of each dataset combined with a defined structure, enforced schemas, and validated and deduplicated data), so that data engineers, data scientists and business analysts can combine data using logic and tools they already know (SQL joins).
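The idea of extension tables can be illustrated with a minimal Python sketch. It is not the library's implementation: the `flatten` helper, the `<table>_<field>` naming scheme and the `id` / `parent_id` surrogate keys are all assumptions made for illustration. Scalar fields stay in the parent row, while nested records and lists of records are moved to their own flat tables that point back at the owning row.

```python
from itertools import count


def flatten(record, table, tables, ids, parent=None):
    """Split one nested record into rows of several flat tables.

    Hypothetical helper, not part of the library. Scalars stay in the row of
    `table`; each nested dict or non-empty list of dicts goes to an extension
    table named `<table>_<field>` whose rows carry a `parent_id`.
    """
    row_id = next(ids)  # one shared counter, so ids are unique across tables
    row = {"id": row_id}
    if parent is not None:
        row["parent_id"] = parent
    for field, value in record.items():
        if isinstance(value, dict):
            flatten(value, f"{table}_{field}", tables, ids, row_id)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            for item in value:
                flatten(item, f"{table}_{field}", tables, ids, row_id)
        else:
            row[field] = value  # scalars (and lists of scalars) stay in place
    tables.setdefault(table, []).append(row)
    return row_id


# Made-up sample record with one nested record and one list of records.
order = {
    "customer": "Ada",
    "address": {"city": "Copenhagen", "zip": "2100"},
    "lines": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

tables = {}
flatten(order, "order", tables, count(1))
for name, rows in sorted(tables.items()):
    print(name, rows)
```

Each key of `tables` would then correspond to one flat file, and the nested structure can be rebuilt by joining `parent_id` against `id`.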


Figure 2: Azure Databricks `python` notebook and `SQL` cell
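As a rough local stand-in for querying extension tables with plain SQL, the following sketch uses Python's built-in `sqlite3` module instead of Databricks. The table names (`orders`, `order_lines`), the columns and the `parent_id` key are hypothetical and chosen only to show that a parent table and its extension table can be recombined with an ordinary join.

```python
import sqlite3

# In-memory database standing in for a set of flat PARQUET-backed tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,  -- assumed key pointing at orders.id
        sku TEXT,
        qty INTEGER
    );
    INSERT INTO orders VALUES (1, 'Ada');
    INSERT INTO order_lines VALUES (3, 1, 'A1', 2), (4, 1, 'B7', 1);
""")

# Recombine the parent table with its extension table using a regular join.
rows = con.execute("""
    SELECT o.customer, l.sku, l.qty
    FROM orders AS o
    JOIN order_lines AS l ON l.parent_id = o.id
    ORDER BY l.sku
""").fetchall()
print(rows)
```

No array or struct functions are needed in the query, which is the practical benefit of storing nested elements as separate tables.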

Because two of the three medallion layers are combined into one, this might save about ⅓ of disk usage and thereby require fewer servers and less computing power. Furthermore, since we do not rely on a naive approach when flattening and storing data, it could lead to further savings and a more sustainable and environmentally friendly solution.


Figure 3: Green Software Foundation with the Linux Foundation to put sustainability at the core of software engineering

Project dependencies

Library

Dependency	Author	License
FSharp.Core	Microsoft	MIT License
Apache.Avro	The Apache Software Foundation	Apache License 2.0
Newtonsoft.Json	James Newton-King	MIT License
Parquet.Net	Ivan G	MIT License

Unit Tests

Dependency	Author	License
Microsoft.NET.Test.Sdk	Microsoft	MIT License
coverlet.collector	.NET Foundation	MIT License
xunit	.NET Foundation	Apache License 2.0
xunit.runner.visualstudio	.NET Foundation	Apache License 2.0

Version	Downloads	Last updated
0.11.32	225	5/10/2023
0.11.31	199	4/17/2023
0.11.30	246	3/21/2023
0.11.29	244	3/14/2023
0.11.28	251	3/6/2023
0.11.27	250	3/6/2023
0.11.26	268	3/6/2023
0.11.25	250	3/4/2023
0.11.24	264	3/4/2023
0.11.23	242	3/4/2023
0.11.22	241	2/24/2023
0.11.21	267	2/16/2023
0.11.20	259	2/16/2023
0.11.19	258	2/15/2023
0.11.18	267	2/15/2023
0.11.17	258	2/15/2023
0.11.16	250	2/15/2023
0.11.15	266	2/14/2023
0.11.14	261	2/14/2023
0.11.13	269	2/14/2023
0.11.12	268	2/14/2023
0.11.11	257	2/14/2023
0.11.10	254	2/14/2023
0.11.9	251	2/14/2023
0.11.8	255	2/14/2023
0.11.7	272	2/14/2023
0.11.6	281	2/13/2023
0.11.5	282	2/13/2023
0.11.4	293	2/8/2023
0.11.3	282	2/8/2023
0.11.2	289	2/6/2023
0.11.1	287	2/6/2023
0.11.0	305	2/3/2023


