DataFlow.Spark
1.2.0
This package has been renamed to DataLinq.Spark. All development, bug fixes, and new features continue under the new name. The rename aligns with the parent framework rebrand from DataFlow.NET to DataLinq.NET, resolving naming conflicts with Google Cloud Dataflow and System.Threading.Tasks.Dataflow (TPL).
To migrate: install DataLinq.Spark, then update your using statements from `using DataFlow;` to `using DataLinq;` and from `using DataFlow.Spark;` to `using DataLinq.Spark;`.
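For most codebases the rename is mechanical. A minimal before/after sketch of the using statements, assuming your code references only the two namespaces mentioned above:

```csharp
// Before (DataFlow.Spark)
using DataFlow;
using DataFlow.Spark;

// After (DataLinq.Spark)
using DataLinq;
using DataLinq.Spark;
```

Swap the package reference as well, e.g. `dotnet remove package DataFlow.Spark` followed by `dotnet add package DataLinq.Spark`.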
# .NET CLI
dotnet add package DataFlow.Spark --version 1.2.0

# Package Manager
NuGet\Install-Package DataFlow.Spark -Version 1.2.0

<!-- PackageReference (project file) -->
<PackageReference Include="DataFlow.Spark" Version="1.2.0" />

<!-- Central package management (Directory.Packages.props + project file) -->
<PackageVersion Include="DataFlow.Spark" Version="1.2.0" />
<PackageReference Include="DataFlow.Spark" />

# Paket CLI
paket add DataFlow.Spark --version 1.2.0

# Script & Interactive
#r "nuget: DataFlow.Spark, 1.2.0"

# File-based apps
#:package DataFlow.Spark@1.2.0

# Cake
#addin nuget:?package=DataFlow.Spark&version=1.2.0
#tool nuget:?package=DataFlow.Spark&version=1.2.0
DataFlow.Spark
LINQ-native Apache Spark integration for DataFlow.NET.
Features
- Native LINQ Translation - Write C# LINQ, execute distributed Spark
- Streaming Results - Efficient processing with DataFrames
- Type Safety - Strong typing with automatic column mapping
- Distributed Processing - Scale to petabytes with Apache Spark
- O(1) Memory Writes - Batched streaming for table writes
- Window Functions - Rank, Lead, Lag, running aggregates with expression syntax
- Cases Pattern - Multi-output conditional routing
- In-Memory Push - `context.Push(data)` for test data injection
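The in-memory push feature lets tests exercise a pipeline without a live table. A hypothetical sketch, assuming only the documented `context.Push(data)` call; the `Order` type and the test values are illustrative:

```csharp
using DataFlow.Spark;

public class Order
{
    public string Region { get; set; }
    public decimal Amount { get; set; }
}

// Hypothetical test setup: inject local rows instead of reading a table.
using var context = Spark.Connect("local[*]", "Tests");
context.Push(new[]
{
    new Order { Region = "EU", Amount = 1500m },
    new Order { Region = "US", Amount = 500m },
});
```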
Quick Start
using DataFlow.Spark;
// Connect to Spark (local mode)
using var context = Spark.Connect("local[*]", "MyApp");
// Production cluster examples:
// using var context = Spark.Connect("spark://spark-master:7077", "MyApp");
// using var context = Spark.Connect("yarn", "MyApp");
// Query with LINQ (cluster-side execution)
var stats = context.Read.Table<Order>("sales.orders")
    .Where(o => o.Amount > 1000)
    .GroupBy(o => o.Region)
    .Select(g => new { Region = g.Key, Total = g.Sum(o => o.Amount) })
    .ToList();

// Side effects with ForEach (executes on Spark executors - NOT locally!)
context.Read.Table<Order>("sales.orders")
    .ForEach(o => Metrics.Increment("orders_processed")) // Runs on cluster
    .Show();
Write Operations
// From SparkQuery (server-side, no context needed)
await context.Read.Table<Order>("orders")
    .Where(o => o.Amount > 1000)
    .WriteParquet("/output/high_value");

await context.Read.Table<Order>("orders")
    .WriteTable("analytics.summary").Overwrite();

// From local IEnumerable (client → server, context required)
await data.WriteTable(context, "orders", bufferSize: 10_000).Overwrite();
await data.WriteParquet(context, "hdfs://data/orders.parquet", bufferSize: 50_000);

// From IAsyncEnumerable (streaming client → server)
await asyncStream.WriteParquet(context, "path.parquet",
    bufferSize: 5_000,
    flushInterval: TimeSpan.FromSeconds(30));
Test Coverage
| Tier | Tests | Pass | Fail | Skip | Coverage |
|---|---|---|---|---|---|
| Unit Tests | 76 | 76 | 0 | 0 | 100% |
| Integration Tests | 157 | 153 | 4 | 0 | 97.5% |
| Package Audit | 118 | 112 | 0 | 6 | 94.9% |
| TOTAL | 351 | 341 | 4 | 6 | 97.2% |
Known Issues: the 4 expected failures track open bugs:
- BUG-001 (2 tests): anonymous type materialization - use DTO classes with `{ get; set; }` properties instead
- BUG-004 (2 tests): ternary expressions - use the `Cases` pattern instead
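The BUG-001 workaround in practice: materialize into a named DTO with settable properties rather than an anonymous type. A sketch assuming the same query shape as the Quick Start; the `RegionTotal` class is illustrative:

```csharp
// DTO with { get; set; } properties instead of an anonymous type (BUG-001)
public class RegionTotal
{
    public string Region { get; set; }
    public decimal Total { get; set; }
}

var totals = context.Read.Table<Order>("sales.orders")
    .GroupBy(o => o.Region)
    .Select(g => new RegionTotal { Region = g.Key, Total = g.Sum(o => o.Amount) })
    .ToList();

// For BUG-004, replace ternary expressions (a ? b : c) in queries with the
// Cases pattern listed under Features; see the package docs for its shape.
```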
Requirements
- .NET 8.0+
- DataFlow.Net 1.1.0+
- Apache Spark 3.5.0+
- DataFlow.Spark license for production
Support & Issues
📧 Contact: tecnet.paris@gmail.com
🐛 Report Issues: github.com/improveTheWorld/DataFlow.NET/issues
License
Development Tier (Free)
Use DataFlow.Spark free for development and testing up to 1,000 rows per query:
| Environment | How It's Detected | Limit |
|---|---|---|
| Debugger Attached | Visual Studio, Rider, VS Code | 1,000 rows |
| ASPNETCORE_ENVIRONMENT=Development | ASP.NET apps | 1,000 rows |
| DOTNET_ENVIRONMENT=Development | Console apps | 1,000 rows |
| DATAFLOW_ENVIRONMENT=Development | Explicit opt-in | 1,000 rows |
Examples:
# Option 1: Set environment variable
$env:DATAFLOW_ENVIRONMENT="Development" # PowerShell
export DATAFLOW_ENVIRONMENT=Development # Bash
# Option 2: Launch with debugger attached (auto-detects)
dotnet run --launch-profile "Development" # Uses launchSettings.json
# Or simply press F5 in Visual Studio/Rider
Production License
For production workloads (unlimited rows), obtain a license at:
- 🌐 Pricing: https://get-dataflow.net/pricing
- 📧 Contact: tecnet.paris@gmail.com
Set your license key as an environment variable (auto-detected at runtime):
# PowerShell
$env:DATAFLOW_LICENSE_KEY="your-license-key"
# Bash/Linux/macOS
export DATAFLOW_LICENSE_KEY="your-license-key"
# Docker / Kubernetes
ENV DATAFLOW_LICENSE_KEY=your-license-key
Security: The license key is never in source code. Set it in your deployment environment (CI/CD secrets, Azure Key Vault, AWS Secrets Manager, etc.)
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies

net8.0
- DataFlow.Net (>= 1.1.0)
- Microsoft.Spark (>= 2.3.0)
Version History

v1.2.0: RSA licensing, dev tier auto-detection, Cases pattern. Full changelog: https://github.com/improveTheWorld/DataFlow.NET/tree/main/docs/changelog