Dbarone.Net.Document 2.0.2

Dbarone.Net.Document

A .NET document library offering the following services:

  1. A document model based on the document model in LiteDB
  2. A serialisation / data interchange format, including automatic compression of the serialised data
  3. A query language to manipulate documents

This library is being used for another of my projects, Dbarone.Net.Database, and has been heavily influenced by the LiteDB project.

Library Reference

For a full API reference of this library, please refer to the documentation.

Document Model

At the core of this library is the DocumentValue type. This represents a value. Values can be simple / native types, or complex types:

Simple / Native Types

The following native types are supported:

DocType    Description
Null       Special type representing null values
Boolean    Boolean type
Byte       Single byte
SByte      Signed byte
Char       A single UTF-16 code unit
Decimal    A 16-byte decimal floating-point numeric type
Double     An 8-byte floating-point numeric type
Single     A 4-byte floating-point numeric type
VarInt     A variable-length integer
Int16      A signed 16-bit integer
UInt16     An unsigned 16-bit integer
Int32      A signed 32-bit integer
UInt32     An unsigned 32-bit integer
Int64      A signed 64-bit integer
UInt64     An unsigned 64-bit integer
DateTime   A date/time structure
Guid       A globally unique identifier (GUID)
Blob       A variable-length byte array
String     A variable-length string

Complex Types

The following complex types are supported:

DocType    Description
Array      An array or collection of values. Supports indexing of elements
Document   An associative array of key / value elements

Creating Documents

Simple documents can be created by assigning a value of the appropriate type to a new DocumentValue variable, for example:

    DocumentValue text = "foobar";       // text.Type = DocumentType.String
    DocumentValue number = (Int32)123;   // number.Type = DocumentType.Int32
    DocumentValue now = DateTime.Now;    // now.Type = DocumentType.DateTime

Arrays can be created using the DocumentArray class:

    int[] arr = new int[] { 1, 2, 3, 4, 5 };
    // Select requires System.Linq; each int is converted to a DocumentValue
    DocumentArray docArr = new DocumentArray(arr.Select(a => (DocumentValue)a));  // docArr.Type = DocumentType.Array

Objects can be modelled using the DictionaryDocument class:

    DictionaryDocument dictDoc = new DictionaryDocument();  // dictDoc.Type = DocumentType.Document
    dictDoc["foo"] = 123;
    dictDoc["bar"] = DateTime.Now;

Document Schema

Documents can be schema-less, meaning that they can take on any arbitrary structure. However, you can also constrain the structure of a document by creating a schema.

A schema is a set of rules that define the structure of a document. Schema rules include:

  • The permitted data type of a value (one of the above native or complex types)
  • Whether a null value is permitted
  • The permitted document key values with their associated data types
  • Whether an array of elements is permitted, with the optional data type of each element

Two classes are used to define schemas: SchemaElement and SchemaAttribute.

The example below creates a schema, then validates two documents against it: one that passes and one that fails.

    // Schema shared by both documents below
    SchemaElement schema = new SchemaElement(DocumentType.Document, false, null, new List<SchemaAttribute>{
        new SchemaAttribute(1, "a", new SchemaElement(DocumentType.String, false)),
        new SchemaAttribute(2, "b", new SchemaElement(DocumentType.DateTime, false)),
        new SchemaAttribute(3, "c", new SchemaElement(DocumentType.Int32, false))
    });

    // Document passing schema validation
    DictionaryDocument valid = new DictionaryDocument();
    valid["a"] = new DocumentValue("foobar");
    valid["b"] = new DocumentValue(DateTime.Now);
    valid["c"] = new DocumentValue((int)123);

    Assert.True(schema.Validate(valid));   // returns 'true'. Document successfully validated.

    // Document failing schema validation
    DictionaryDocument invalid = new DictionaryDocument();
    invalid["a"] = new DocumentValue("foobar");
    invalid["b"] = new DocumentValue(DateTime.Now);
    invalid["c"] = new DocumentValue("baz"); // this should be an Int32!

    schema.Validate(invalid);   // throws an exception. Document not validated.

Serialisation

Documents can be serialised to / deserialised from byte arrays. The IDocumentSerializer interface defines serialisation operations.

Multiple serialisation methods are supported, depending on whether the document has a predefined schema or is schema-less.

Variable Length Integers (VarInt)

Before discussing serialisation in more depth, two topics need covering. The first is variable-length integers (VarInts), a mechanism for storing integers in as few bytes as possible. VarInts are used extensively in this project to store things such as data types and data sizes, which keeps the serialised data compact. For example, storing data sizes as Int32 values would require 4 bytes each, even for small values, whereas a VarInt can encode a small value in a single byte. VarInts are also used in systems like SQLite, and you can read more about them here.
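The general idea can be sketched as follows. This is a minimal illustration of one common scheme (7 bits of payload per byte, least-significant group first, with the high bit flagging that more bytes follow). It is an assumption for illustration only and is not necessarily the byte layout this library uses; the helper names EncodeVarInt and DecodeVarInt are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Encode an unsigned integer in 7-bit groups, least-significant group first.
// The high bit of each byte is set when more bytes follow.
static byte[] EncodeVarInt(ulong value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        bytes.Add((byte)((value & 0x7F) | 0x80));
        value >>= 7;
    }
    bytes.Add((byte)value);
    return bytes.ToArray();
}

// Decode a value written by EncodeVarInt and report how many bytes were consumed.
// No validation is done here: truncated or over-long input is not handled.
static ulong DecodeVarInt(byte[] data, out int bytesRead)
{
    ulong result = 0;
    int shift = 0;
    bytesRead = 0;
    while (true)
    {
        byte b = data[bytesRead++];
        result |= (ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return result;
        shift += 7;
    }
}

Console.WriteLine(EncodeVarInt(5).Length);              // 1
Console.WriteLine(EncodeVarInt(300).Length);            // 2
Console.WriteLine(EncodeVarInt(int.MaxValue).Length);   // 5
Console.WriteLine(DecodeVarInt(EncodeVarInt(300), out _)); // 300
```

Under this scheme, values below 128 take a single byte, and an Int32-sized value only reaches 5 bytes when it is large.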

Serial Types

When encoding documents, metadata such as the data type and data size must be written alongside the data so that readers can deserialise it afterwards. The 'Serial Type' encodes the data type and data size of the serialised value that follows it. Serial types are stored as VarInts to keep them compact. The following table describes how serial type values are calculated:

Serial Type         Meaning
0                   Value is NULL
1                   Value is Boolean
2                   Value is Byte
3                   Value is SByte
4                   Value is Char
5                   Value is Decimal
6                   Value is Double
7                   Value is Single
8                   Value is Int16
9                   Value is UInt16
10                  Value is Int32
11                  Value is UInt32
12                  Value is Int64
13                  Value is UInt64
14                  Value is DateTime
15                  Value is Guid
N>=20 and N%5==0    Array. Value is an array that is (N-20)/5 bytes long.
N>=21 and N%5==1    Blob. Value is a byte array that is (N-21)/5 bytes long.
N>=22 and N%5==2    String. Value is a string that is (N-22)/5 bytes long, stored in the text encoding of the database.
N>=23 and N%5==3    Document. Value is a document that is (N-23)/5 bytes long.
N>=24 and N%5==4    VarInt. Value is (N-24)/5 bytes long.
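The arithmetic for the variable-length rows of the table can be sketched as below. The five variable-length types share one formula, serial type = 20 + 5 * length + kind, where kind is the remainder listed in the table (0 = Array, 1 = Blob, 2 = String, 3 = Document, 4 = VarInt). The helper names ToSerialType and FromSerialType are hypothetical and are not part of the library's API.

```csharp
using System;

// Serial type for a variable-length value: 20 + 5 * length + kind,
// where kind is 0 = Array, 1 = Blob, 2 = String, 3 = Document, 4 = VarInt.
static long ToSerialType(int kind, long byteLength)
{
    if (kind < 0 || kind > 4) throw new ArgumentOutOfRangeException(nameof(kind));
    if (byteLength < 0) throw new ArgumentOutOfRangeException(nameof(byteLength));
    return 20 + 5 * byteLength + kind;
}

// Recover the type name and payload length from a variable-length serial type (N >= 20).
static (string Kind, long ByteLength) FromSerialType(long serialType)
{
    if (serialType < 20)
        throw new ArgumentOutOfRangeException(nameof(serialType), "Values below 20 are not variable-length serial types.");
    string[] kinds = { "Array", "Blob", "String", "Document", "VarInt" };
    long kind = serialType % 5;
    return (kinds[kind], (serialType - 20 - kind) / 5);
}

// A 6-byte string: 20 + 5 * 6 + 2 = 52.
Console.WriteLine(ToSerialType(2, 6));    // 52
Console.WriteLine(FromSerialType(52));    // (String, 6)
```

Because the length is folded into the serial type, a short string needs only one VarInt of metadata in front of its bytes.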

Schema-Defined Document

Documents can be schema-less or schema-bound. This affects how the document is serialized.

Schema-less documents are those without any fixed schema. Schema-less DictionaryDocument objects can contain any arbitrary keys and values, which allows flexible / unstructured data to be stored. When these documents are serialised, the key associated with each value is serialised with the value, in much the same fashion as text serialisation formats like JSON or XML. This technique makes unstructured data fully self-describing. However, it is not efficient in terms of storage.

Alternatively, documents can be serialised with a predefined document schema. If a schema is defined, the document is also validated against the schema before being serialised. When schema-bound documents are serialised, the schema is encoded at the start of the serialised output. The schema includes all dictionary key names and types (attributes). Each attribute must be assigned a unique AttributeId. The data is serialised after the schema. When serialising the data, the attribute / key names are replaced with the AttributeId, which is stored as a VarInt. This results in a much more compact serialised output.

ColumnStoreDocumentArray

A common document structure is a tabular model: a two-dimensional array of rows and columns. This can be modelled using a DocumentArray containing zero or more DictionaryDocument objects, which can be thought of as a 'row-based' document. Column storage instead keeps all the values of a particular column adjacent. This is like having a DictionaryDocument in which each element is an array, with every array holding the same number of elements. When tables are stored in columnar format, some additional storage optimisations can be done:
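The difference between the two layouts can be sketched with plain .NET collections. Here a list of dictionaries stands in for a DocumentArray of DictionaryDocument objects, and a dictionary of lists stands in for the columnar form. The helper name ToColumns is hypothetical, and none of this is the library's own API.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Pivot a row-based table (one dictionary per row) into a column-based one
// (one list of values per column). Missing values are padded with null so
// that every column ends up with the same number of elements.
static Dictionary<string, List<object?>> ToColumns(IReadOnlyList<Dictionary<string, object?>> rowList)
{
    var result = new Dictionary<string, List<object?>>();
    for (int i = 0; i < rowList.Count; i++)
    {
        foreach (var (key, value) in rowList[i])
        {
            if (!result.TryGetValue(key, out var col))
            {
                // Column first seen on a later row: pad the earlier rows.
                col = new List<object?>();
                for (int j = 0; j < i; j++) col.Add(null);
                result[key] = col;
            }
            col.Add(value);
        }
        // Pad columns that this row did not supply.
        foreach (var padded in result.Values)
            if (padded.Count <= i) padded.Add(null);
    }
    return result;
}

var table = new List<Dictionary<string, object?>>
{
    new() { ["id"] = 1, ["name"] = "foo" },
    new() { ["id"] = 2, ["name"] = "bar" },
};
var pivoted = ToColumns(table);
Console.WriteLine(string.Join(", ", pivoted["id"]));    // 1, 2
Console.WriteLine(string.Join(", ", pivoted["name"]));  // foo, bar
```

Once the values of a column sit next to each other, they share a type and often repeat, which is what makes per-column encodings worthwhile.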

TO DO

  • CSV library
  • Byte compression - Huffman
  • ColumnStore class - dictionary encoding / RLE / Huffman encoding / VarInt encoding

ColumnStore

  • Run Length Encoding
  • Dictionary Encoding
  • Huffman (string / byte) encoding
  • Delta Encoding???
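As a rough illustration of the first two techniques in these notes, the sketch below run-length encodes and dictionary encodes a single column. It is only an outline of the general techniques, not code from this library (the ColumnStore work is still to do), and the helper names RunLengthEncode and DictionaryEncode are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Run-length encoding: collapse consecutive repeats into (value, count) pairs.
static List<(T Value, int Count)> RunLengthEncode<T>(IEnumerable<T> column)
{
    var runs = new List<(T Value, int Count)>();
    var comparer = EqualityComparer<T>.Default;
    foreach (var item in column)
    {
        if (runs.Count > 0 && comparer.Equals(runs[^1].Value, item))
            runs[^1] = (item, runs[^1].Count + 1);   // extend the current run
        else
            runs.Add((item, 1));                     // start a new run
    }
    return runs;
}

// Dictionary encoding: store each distinct value once and replace the
// column with small integer codes that index into that list.
static (List<string> Dictionary, List<int> Codes) DictionaryEncode(IEnumerable<string> column)
{
    var dictionary = new List<string>();
    var lookup = new Dictionary<string, int>();
    var codes = new List<int>();
    foreach (var item in column)
    {
        if (!lookup.TryGetValue(item, out int code))
        {
            code = dictionary.Count;
            lookup[item] = code;
            dictionary.Add(item);
        }
        codes.Add(code);
    }
    return (dictionary, codes);
}

Console.WriteLine(string.Join(" ", RunLengthEncode(new[] { "a", "a", "a", "b", "b", "a" })));
// (a, 3) (b, 2) (a, 1)
Console.WriteLine(string.Join(",", DictionaryEncode(new[] { "red", "blue", "red" }).Codes));
// 0,1,0
```

The two combine naturally: dictionary codes are small integers, which suit VarInt storage, and sorted or low-cardinality columns produce long runs.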

  • Only write page once - when it gets to 1M rows, 2^20 (1,048,576)
  • No updates - but rows can be deleted? (logical delete)

N columns, split into M row groups

Each row group = 2^20 rows

Metadata

  • Location of all column metadata start locations

Column chunk = 1 column x 2^20 rows = multiple pages

File format: https://parquet.apache.org/docs/file-format/
Model: https://parquet.apache.org/docs/file-format/metadata/

https://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/

Footer:

  • Schema (columns + types)
  • All row group info (size, rows, min/max/null for each column)

SQL Server

https://sqlespresso.com/2019/06/26/understanding-columnstore-indexes-in-sql-server-part-1/#:~:text=Columnstore%20is%20simply%20the%20way%20the%20data%20is,columns%20and%20logically%20organized%20in%20rows%20and%20columns.

  • Min row group 102,400 rows, max 1M rows
  • Column segments
  • Delta group = remainder of rows on b-tree index
  • Delta store = multiple row groups
  • Tuple-Mover process = moves rows from delta store to columnstore index; looks for delta groups > 1M rows (closed group)

Target frameworks

net8.0 is compatible. net9.0, net10.0 and the platform-specific variants of net8.0, net9.0 and net10.0 (android, browser, ios, maccatalyst, macos, tvos, windows) are computed as compatible.

Version   Downloads   Last Updated
2.0.2     118         11/15/2024
2.0.1     153         7/6/2024
2.0.0     130         6/16/2024
1.0.2     185         1/21/2024
1.0.1     121         1/20/2024
1.0.0     149         1/14/2024