DataLinq.Spark
LINQ-native Apache Spark integration for DataLinq.NET.
Migrating from dotnet/spark? Microsoft deprecated it in March 2025. DataLinq.Spark is the maintained successor: same cluster, cleaner API, no `spark-submit` required. Migration guide →
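For teams coming from dotnet/spark, the shape of the change is roughly this (a sketch: the "before" side uses the real Microsoft.Spark `SparkSession` API; the "after" side uses the calls shown in the Quick Start below):

```csharp
// Before (dotnet/spark): launched via spark-submit with the .NET worker,
// queries written against DataFrame/Column APIs.
// var spark = SparkSession.Builder().AppName("MyApp").GetOrCreate();
// var df = spark.Read().Table("sales.orders").Filter("amount > 1000");

// After (DataLinq.Spark): plain LINQ, no spark-submit.
using DataLinq.Spark;

using var context = Spark.Connect("local[*]", "MyApp");
var highValue = context.Read.Table<Order>("sales.orders")
    .Where(o => o.Amount > 1000)   // translated and executed cluster-side
    .ToList();
```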
dotnet add package DataLinq.Spark --version 1.1.0
Free dev tier included: 1,000 rows, no license key, no credit card. The core DataLinq.NET package (streaming, SUPRA pattern, Cases, EF Core) is free under Apache 2.0 and is included as a dependency.
📖 LINQ-to-Spark Guide | DataLinq.NET on GitHub | 🌐 Product Website
Features
- Native LINQ Translation - Write C# LINQ, execute distributed Spark
- Streaming Results - Efficient processing with DataFrames
- Type Safety - Strong typing with automatic column mapping
- Distributed Processing - Scale to petabytes with Apache Spark
- O(1) Memory Writes - Batched streaming for table writes
- Window Functions - Rank, Lead, Lag, running aggregates with expression syntax
- Cases Pattern - Multi-output conditional routing
- Auto-UDF — Custom methods in Where/Select auto-translate to Spark UDFs (static, instance, lambda)
- ForEach — Distributed side effects with automatic field sync-back to the driver
- In-Memory Push - `context.Push(data)` for test data injection
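Two of the features above, Auto-UDF and In-Memory Push, combine naturally in tests. A minimal sketch, assuming the `Order` type and the `Spark.Connect`/`context.Push` calls documented on this page (the `Order` record definition here is illustrative):

```csharp
using DataLinq.Spark;

public record Order(string Region, decimal Amount);

// A plain static helper used inside Where/Select is auto-translated
// to a Spark UDF (the Auto-UDF feature above).
static bool IsHighValue(decimal amount) => amount > 1000m;

using var context = Spark.Connect("local[*]", "FeatureDemo");

// In-Memory Push: inject test rows without a real table.
context.Push(new[] { new Order("EU", 1500m), new Order("US", 200m) });

var flagged = context.Read.Table<Order>("sales.orders")
    .Where(o => IsHighValue(o.Amount))  // becomes a Spark UDF automatically
    .ToList();
```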
Quick Start
using DataLinq.Spark;
// Connect to Spark (local mode)
using var context = Spark.Connect("local[*]", "MyApp");
// Production cluster examples:
// using var context = Spark.Connect("spark://spark-master:7077", "MyApp");
// using var context = Spark.Connect("yarn", "MyApp");
// Query with LINQ (cluster-side execution)
var stats = context.Read.Table<Order>("sales.orders")
.Where(o => o.Amount > 1000)
.GroupBy(o => o.Region)
.Select(g => new { Region = g.Key, Total = g.Sum(o => o.Amount) })
.ToList();
// Side effects with ForEach (executes on Spark executors - NOT locally!)
int processed = 0;
context.Read.Table<Order>("sales.orders")
.ForEach(o => processed++)
.Do(); // ← Triggers distributed execution; field sync-back happens here
Console.WriteLine($"Processed {processed} orders");
Write Operations
using DataLinq.Spark;
// From SparkQuery (server-side)
await context.Read.Table<Order>("orders")
.Where(o => o.Amount > 1000)
.WriteParquet("/output/high_value");
await context.Read.Table<Order>("orders")
.WriteTable("analytics.summary", overwrite: true);
// From local IEnumerable (client → server, context required)
await data.WriteTable(context, "orders", overwrite: true, bufferSize: 10_000);
await data.WriteParquet(context, "hdfs://data/orders.parquet", bufferSize: 50_000);
Test Coverage
| Tier | Tests | Pass | Fail | Coverage |
|---|---|---|---|---|
| Unit Tests | 122 | 122 | 0 | 100% |
| Integration Tests | 250 | 250 | 0 | 100% |
| Adversarial Audit | 306 | 306 | 0 | 100% |
| TOTAL | 678 | 678 | 0 | 100% |
Requirements
- .NET 8.0+
- DataLinq.NET 1.0.0+
- Apache Spark 3.5.0+
- DataLinq.Spark license for production
Before You Run
DataLinq.Spark is the developer layer — your DevOps/infra team owns the Spark cluster setup.
# Verify Spark is available:
spark-submit --version
# Or verify your cluster master is reachable:
curl http://spark-master:8080
If Spark.Connect(...) fails immediately, the issue is most likely your Spark environment. See the Apache Spark installation guide.
ForEach — Distributed Side Effects with Sync-Back
ForEach runs your code on Spark executors, then automatically syncs field mutations back to the driver:
// Static fields sync back after Do():
query.ForEach(OrderStats.ProcessOrder).Do();
Console.WriteLine(OrderStats.TotalAmount); // ← Updated correctly
// Lambda closures sync back:
int count = 0;
query.ForEach(o => count++).Do();
Console.WriteLine(count); // ← Updated correctly
// Instance fields sync back:
var processor = new OrderProcessor();
query.ForEach(processor.Process).Do();
Console.WriteLine(processor.Processed); // ← Updated correctly
Limitations: Collections (`List<T>`, arrays) are not synchronized — use scalar accumulators. The Roslyn analyzer warns at compile time (DFSP001, DFSP002).
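Since collections don't sync back, aggregate with scalars on the executors and materialize rows separately if you need them. A sketch under the same API as above (assuming `ForEach` accepts a multi-statement lambda, as the closure examples suggest):

```csharp
// Scalars sync back after Do(); a List<Order> would not.
int count = 0;
decimal total = 0m;

context.Read.Table<Order>("sales.orders")
    .ForEach(o => { count++; total += o.Amount; })
    .Do();

Console.WriteLine($"{count} orders, {total:C} total");

// If you actually need the rows on the driver, materialize instead:
// var rows = context.Read.Table<Order>("sales.orders").ToList();
```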
Support & Issues
📧 Contact: support@get-datalinq.net
🐛 Report Issues: github.com/improveTheWorld/DataLinq.NET/issues
License
Free Tier (No Setup Required)
DataLinq.Spark works out of the box with no license and no configuration. The free tier allows up to 1,000 rows per query — exceeding this throws a LicenseException. No environment variables, no opt-in needed. Just install and run.
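In practice, a free-tier overrun surfaces as an exception at query time. One way to handle it explicitly in dev code (a sketch; this page names `LicenseException` but not its namespace or members, so the `catch` shape is an assumption):

```csharp
try
{
    var all = context.Read.Table<Order>("sales.orders").ToList();
}
catch (LicenseException ex)  // thrown when a query exceeds 1,000 rows on the free tier
{
    Console.WriteLine($"Free-tier row limit hit: {ex.Message}");
    // Either narrow the query or set DATALINQ_LICENSE_KEY for production.
}
```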
Production License
For production workloads (unlimited rows), obtain a license at:
- 🌐 Pricing: https://get-datalinq.net/pricing
- 📧 Contact: support@get-datalinq.net
Set your license key as an environment variable (auto-detected at runtime):
# PowerShell
$env:DATALINQ_LICENSE_KEY="your-license-key"
# Bash/Linux/macOS
export DATALINQ_LICENSE_KEY="your-license-key"
# Docker / Kubernetes
ENV DATALINQ_LICENSE_KEY=your-license-key
Security: The license key is never in source code. Set it in your deployment environment (CI/CD secrets, Azure Key Vault, AWS Secrets Manager, etc.)
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net9.0 and net10.0 were computed, along with the platform-specific TFMs (android, browser, ios, maccatalyst, macos, tvos, windows) for net8.0 through net10.0. |
Dependencies (net8.0):
- DataLinq.Net (>= 1.0.0)
- Microsoft.Spark (>= 2.3.0)
Release Notes
v1.1.0: Task-returning Write API (CS4014 safety), expression-based MergeTable updateOnly. BREAKING: SaveMode enum replaced by bool overwrite/createIfMissing parameters. Full notes: https://github.com/improveTheWorld/DataLinq.NET/blob/main/releasenotes/DataLinq.Spark_1.1.0.md
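The breaking change reads roughly like this in code (the removed `SaveMode` signature is inferred from the note above, not shown on this page):

```csharp
// 1.0.x (removed): enum-based save mode
// await query.WriteTable("analytics.summary", SaveMode.Overwrite);

// 1.1.0: boolean parameters; Task-returning, so await it (CS4014 safety)
await query.WriteTable("analytics.summary", overwrite: true);
await data.WriteTable(context, "orders", overwrite: true, createIfMissing: true);
```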