FilePrepper 0.6.0

Install via your preferred package manager:

- .NET CLI: dotnet add package FilePrepper --version 0.6.0
- Package Manager: NuGet\Install-Package FilePrepper -Version 0.6.0
- PackageReference: <PackageReference Include="FilePrepper" Version="0.6.0" />
- Central package management: <PackageVersion Include="FilePrepper" Version="0.6.0" /> in Directory.Packages.props, plus <PackageReference Include="FilePrepper" /> in the project file
- Paket: paket add FilePrepper --version 0.6.0
- F# Interactive: #r "nuget: FilePrepper, 0.6.0"
- File-based apps: #:package FilePrepper@0.6.0
- Cake addin: #addin nuget:?package=FilePrepper&version=0.6.0
- Cake tool: #tool nuget:?package=FilePrepper&version=0.6.0
FilePrepper
A powerful .NET library and CLI tool for data preprocessing. Features a Pipeline API for efficient in-memory data transformations with 67-90% reduction in file I/O. Perfect for ML data preparation, ETL pipelines, and data analysis workflows.
Quick Start
SDK Installation
# Install FilePrepper SDK for programmatic use
dotnet add package FilePrepper
# Or install CLI tool globally
dotnet tool install -g fileprepper-cli
SDK Usage (Recommended)
using FilePrepper.Pipeline;
// CSV Processing: Only 2 file I/O operations (read + write)
await DataPipeline
.FromCsvAsync("data.csv")
.Normalize(columns: new[] { "Age", "Salary", "Score" },
method: NormalizationMethod.MinMax)
.FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
.FilterRows(row => int.Parse(row["Age"]) >= 30)
.ToCsvAsync("output.csv");
// Multi-Format Support: Excel → Transform → JSON
await DataPipeline
.FromExcelAsync("sales.xlsx")
.AddColumn("Total", row =>
(double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
.FilterRows(row => double.Parse(row["Total"]) >= 1000)
.ToJsonAsync("high_value_sales.json");
// Multi-File CSV Concatenation: Merge 33 files ⭐ NEW
await DataPipeline
.ConcatCsvAsync("kemp-*.csv", "dataset/")
.ParseKoreanTime("Time", "ParsedTime") // Korean time format ⭐ NEW
.ExtractDateFeatures("ParsedTime", DateFeatures.Hour | DateFeatures.Minute)
.ToCsvAsync("processed.csv");
CLI Usage
# Normalize numeric columns
fileprepper normalize-data --input data.csv --output normalized.csv \
--columns "Age,Salary,Score" --method MinMax
# Fill missing values
fileprepper fill-missing-values --input data.csv --output filled.csv \
--columns "Age,Salary" --method Mean
# Get help
fileprepper --help
fileprepper <command> --help
Supported Formats
Process data in multiple formats:
- CSV (Comma-Separated Values)
- TSV (Tab-Separated Values)
- JSON (JavaScript Object Notation)
- XML (Extensible Markup Language)
- Excel (XLSX/XLS files)
Feature Matrix (30 Tasks)
| Category | CLI Command | Task | Description |
|---|---|---|---|
| Data Transformation | normalize | NormalizeData | MinMax, ZScore normalization |
| | scale | ScaleData | StandardScaler, MinMaxScaler, RobustScaler |
| | one-hot-encoding | OneHotEncoding | Categorical → binary columns |
| | convert-type | DataTypeConvert | Column data type conversion |
| | extract-date | DateExtraction | Extract Year, Month, Day, DayOfWeek |
| | datetime | DateTimeOps | Parse datetime and extract features |
| | string | StringOps | upper, lower, trim, substring, concat, replace |
| | conditional | Conditional | If-then-else column creation |
| | expression | Expression | Arithmetic expression-based columns |
| Data Cleaning | fill-missing | FillMissingValues | Mean, Median, Mode, Forward, Backward, Constant |
| | drop-duplicates | DropDuplicates | Remove duplicate rows by key columns |
| | replace | ValueReplace | Replace values in columns |
| | remove-constants | RemoveConstants | Remove constant/near-constant columns |
| | clean | CSVCleaner | Thousand separators, whitespace, \r strip |
| Column Operations | add-columns | AddColumns | Add computed columns |
| | remove-columns | RemoveColumns | Delete columns |
| | rename-columns | RenameColumns | Rename column headers |
| | reorder-columns | ReorderColumns | Change column order |
| | column-interaction | ColumnInteraction | Create interaction features between columns |
| Data Organization | merge | Merge | Vertical (concat) / Horizontal (join), glob support |
| | merge-asof | MergeAsOf | Time-series merge with tolerance |
| | data-sampling | DataSampling | Random, Stratified, Systematic sampling |
| | convert-format | FileFormatConvert | CSV ↔ TSV ↔ JSON ↔ XML ↔ Excel |
| | unpivot | Unpivot | Wide → Long format reshape |
| | filter-rows | FilterRows | Row filtering by conditions |
| Data Analysis | stats | BasicStatistics | Mean, Median, StdDev, ZScore |
| | aggregate | Aggregate | Group-by aggregations |
| Feature Engineering | create-lag-features | CreateLagFeatures | Time-series lag features |
| | window | WindowOps | Resample, rolling aggregations |
| Common Options | — | — | --skip-rows, --has-header, --encoding, --ignore-errors |
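The MinMax and ZScore methods named in the table reduce to simple arithmetic. A minimal sketch in plain C# of what each transform computes (no FilePrepper dependency; the sample values are made up for illustration):

```csharp
using System;
using System.Linq;

// MinMax: (x - min) / (max - min) scales values into [0, 1].
// ZScore: (x - mean) / stddev centers values at 0 with unit variance.
double[] ages = { 20, 30, 40, 50 };

double min = ages.Min(), max = ages.Max();
double[] minMax = ages.Select(x => (x - min) / (max - min)).ToArray();

double mean = ages.Average();
double std = Math.Sqrt(ages.Select(x => (x - mean) * (x - mean)).Average());
double[] zScore = ages.Select(x => (x - mean) / std).ToArray();

Console.WriteLine(minMax[0]); // 0 (the minimum maps to 0, the maximum to 1)
```

MinMax preserves the shape of the distribution inside a fixed range, while ZScore is the better default when columns have very different scales and outliers matter.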
ML Data Preparation Cookbook
Common scenarios for machine learning data preparation:
Large Dataset Sampling (100K+ rows → 10K sample)
# Random sampling with fixed seed for reproducibility
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
--method Random --sample-size 10000 --seed 42
# Stratified sampling (preserve label distribution)
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
--method Stratified --sample-size 10000 --stratify-column "label"
// Pipeline API
await DataPipeline
.FromCsvAsync("large_dataset.csv")
.Sample(10000, SamplingMethod.Random, seed: 42)
.ToCsvAsync("sampled.csv");
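For intuition, seeded random sampling can be sketched in plain C# without FilePrepper; the fixed seed is what makes a run reproducible (the row data below is made up):

```csharp
using System;
using System.Linq;

// Reproducible random sampling: shuffle with a seeded RNG, keep the first N.
var rows = Enumerable.Range(1, 100_000).Select(i => $"row{i}").ToList();

var rng = new Random(42);                   // fixed seed => identical sample on every run
var sample = rows.OrderBy(_ => rng.Next())  // shuffle by random sort key
                 .Take(10_000)              // keep the first 10,000 rows
                 .ToList();

Console.WriteLine(sample.Count); // 10000
```

Stratified sampling additionally groups rows by the label column first and samples each group proportionally, which is why it preserves the label distribution.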
Merging X/Y Split Files (Features + Labels)
# Horizontal merge: combine X_train.csv (features) + Y_train.csv (labels) by row index
fileprepper merge -i X_train.csv Y_train.csv -o merged_train.csv --direction Horizontal
// Pipeline API
var features = await DataPipeline.FromCsvAsync("X_train.csv");
var labels = await DataPipeline.FromCsvAsync("Y_train.csv");
await features
.Join(labels, JoinType.Full, leftKey: null, rightKey: null) // row-by-row join
.ToCsvAsync("merged_train.csv");
Multi-Row Header Files (Skip metadata rows)
# Skip first row (category header), use second row as actual column names
fileprepper filter-rows -i messy_data.csv -o clean_data.csv --skip-rows 1
# No header in file → use numeric column indices
fileprepper normalize -i raw.csv -o normalized.csv \
--columns "0,1,2" --method MinMax --has-header false
Cleaning External Data (Mixed line endings)
# Strip \r from quoted fields + remove thousand separators
fileprepper clean -i external_export.csv -o cleaned.csv --strip-cr -s ','
// Pipeline API
await DataPipeline
.FromCsvAsync("external_export.csv")
.StripCarriageReturn()
.ToCsvAsync("cleaned.csv");
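What this cleanup does to an individual field can be sketched in a few lines of plain C# (illustrative only; the real CSVCleaner operates on parsed CSV, and CleanField here is a hypothetical helper, not part of the FilePrepper API):

```csharp
using System;

// Per-field cleanup: strip stray \r left by mixed line endings,
// drop thousand separators, and trim surrounding whitespace.
// Note: removing commas is only safe AFTER CSV parsing, one field at a time.
static string CleanField(string field) =>
    field.Replace("\r", "")
         .Replace(",", "")
         .Trim();

Console.WriteLine(CleanField(" 1,234,567\r ")); // 1234567
```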
Common Use Cases
Data Cleaning Pipeline (CLI)
# 1. Remove unnecessary columns
fileprepper remove-columns --input raw.csv --output step1.csv \
--columns "Debug,TempCol,Notes"
# 2. Drop duplicates
fileprepper drop-duplicates --input step1.csv --output step2.csv \
--columns "Email" --keep First
# 3. Fill missing values
fileprepper fill-missing-values --input step2.csv --output step3.csv \
--columns "Age,Salary" --method Mean
# 4. Normalize numeric columns
fileprepper normalize-data --input step3.csv --output clean.csv \
--columns "Age,Salary,Score" --method MinMax
Time-Series Processing
# 5-minute window aggregation for sensor data
fileprepper window --input sensor_current.csv --output aggregated.csv \
  --type resample --method mean \
  --columns "RMS[A]" --time-column "Time_s[s]" \
  --window 5T --header
# Rolling window for smoothing
fileprepper window --input noisy_data.csv --output smoothed.csv \
  --type rolling --method mean \
  --columns temperature,humidity --window-size 3 \
  --suffix "_smooth" --header
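The rolling mean with window size 3 used above is just an average over each 3-sample window; a minimal sketch in plain C# with made-up sensor values:

```csharp
using System;
using System.Linq;

// Rolling mean, window size 3: each output is the average of 3 consecutive inputs.
double[] noisy = { 10, 12, 11, 30, 13, 12 };

double[] smooth = Enumerable.Range(0, noisy.Length - 2)
    .Select(i => (noisy[i] + noisy[i + 1] + noisy[i + 2]) / 3.0)
    .ToArray();

Console.WriteLine(smooth[0]); // 11 (average of 10, 12, 11)
```

Note the output is shorter than the input by window-size minus one; real implementations choose how to pad or align the edges.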
ML Feature Engineering (SDK - Efficient!)
using FilePrepper.Pipeline;
// Single pipeline: Only 2 file I/O operations instead of 8!
await DataPipeline
.FromCsvAsync("orders.csv")
.AddColumn("Year", row => DateTime.Parse(row["OrderDate"]).Year.ToString())
.AddColumn("Month", row => DateTime.Parse(row["OrderDate"]).Month.ToString())
.Normalize(columns: new[] { "Revenue", "Quantity" },
method: NormalizationMethod.MinMax)
.FilterRows(row => int.Parse(row["Year"]) >= 2023)
.ToCsvAsync("features.csv");
// 67-90% reduction in file I/O compared to CLI approach!
Format Conversion
# CSV to JSON
fileprepper file-format-convert --input data.csv --output data.json --format JSON
# Excel to CSV
fileprepper file-format-convert --input report.xlsx --output report.csv --format CSV
# CSV to XML
fileprepper file-format-convert --input data.csv --output data.xml --format XML
Data Analysis
# Calculate statistics
fileprepper basic-statistics --input data.csv --output stats.csv \
--columns "Age,Salary,Score" --statistics Mean,Median,StdDev,ZScore
# Aggregate by group
fileprepper aggregate --input sales.csv --output summary.csv \
--group-by "Region,Category" --agg-columns "Revenue:Sum,Quantity:Mean"
# Sample data
fileprepper data-sampling --input large.csv --output sample.csv \
--method Random --sample-size 1000
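The aggregate spec above ("Revenue:Sum,Quantity:Mean") maps onto an ordinary group-by; a sketch with plain LINQ and made-up sales rows (not the FilePrepper implementation):

```csharp
using System;
using System.Linq;

// Group by Region, then sum Revenue and average Quantity per group.
var sales = new[]
{
    (Region: "East", Revenue: 100.0, Quantity: 2.0),
    (Region: "East", Revenue: 200.0, Quantity: 4.0),
    (Region: "West", Revenue: 50.0,  Quantity: 1.0),
};

var summary = sales
    .GroupBy(s => s.Region)
    .Select(g => (g.Key,
                  RevenueSum: g.Sum(s => s.Revenue),       // Revenue:Sum
                  QuantityMean: g.Average(s => s.Quantity) // Quantity:Mean
                 ))
    .ToList();

foreach (var (region, rev, qty) in summary)
    Console.WriteLine($"{region}: {rev}, {qty}");
```

Grouping on multiple columns (as in --group-by "Region,Category") works the same way with a composite key.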
Programmatic Usage (SDK)
FilePrepper provides a powerful SDK with Pipeline API for efficient data processing:
dotnet add package FilePrepper
Pipeline API (Recommended)
Benefits: 67-90% reduction in file I/O, fluent API, in-memory processing
using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData;
// Efficient: Only 2 file I/O operations (read + write)
await DataPipeline
.FromCsvAsync("data.csv")
.Normalize(columns: new[] { "Age", "Salary", "Score" },
method: NormalizationMethod.MinMax)
.FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
.FilterRows(row => int.Parse(row["Age"]) >= 30)
.AddColumn("ProcessedDate", _ => DateTime.Now.ToString())
.ToCsvAsync("output.csv");
// Or work in-memory without any file I/O
var result = DataPipeline
.FromData(inMemoryData)
.Normalize(columns: new[] { "Age", "Salary" },
method: NormalizationMethod.MinMax)
.ToDataFrame(); // Get immutable snapshot
Advanced Pipeline Features
// Chain multiple transformations
var pipeline = await DataPipeline
.FromCsvAsync("sales.csv")
.RemoveColumns(new[] { "Debug", "TempCol" })
.RenameColumn("OldName", "NewName")
.AddColumn("Total", row =>
(double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
.FilterRows(row => double.Parse(row["Total"]) > 100)
.Normalize(columns: new[] { "Total" }, method: NormalizationMethod.MinMax);
// Get intermediate results without file I/O
var dataFrame = pipeline.ToDataFrame();
Console.WriteLine($"Processed {dataFrame.RowCount} rows");
// Continue processing
await pipeline
.AddColumn("ProcessedAt", _ => DateTime.UtcNow.ToString("o"))
.ToCsvAsync("output.csv");
In-Memory Processing
// Work entirely in memory - zero file I/O
var data = new List<Dictionary<string, string>>
{
new() { ["Name"] = "Alice", ["Age"] = "25", ["Salary"] = "50000" },
new() { ["Name"] = "Bob", ["Age"] = "30", ["Salary"] = "60000" }
};
var result = DataPipeline
.FromData(data)
.Normalize(columns: new[] { "Age", "Salary" },
method: NormalizationMethod.MinMax)
.AddColumn("Category", row =>
int.Parse(row["Age"]) < 30 ? "Junior" : "Senior")
.ToDataFrame();
// Access results directly
foreach (var row in result.Rows)
{
Console.WriteLine($"{row["Name"]}: {row["Category"]}");
}
Traditional Task API
using FilePrepper.Tasks.NormalizeData;
using Microsoft.Extensions.Logging;
var options = new NormalizeDataOption
{
InputPath = "data.csv",
OutputPath = "normalized.csv",
TargetColumns = new[] { "Age", "Salary", "Score" },
Method = NormalizationMethod.MinMax
};
var task = new NormalizeDataTask(logger);
var context = new TaskContext(options);
bool success = await task.ExecuteAsync(context);
See SDK Usage Guide for comprehensive examples and best practices.
Documentation
Getting Started
- Quick Start Guide - Get started in 5 minutes
- CLI Guide - Complete command reference
- Installation Guide - Detailed installation
SDK & Programming
- API Reference - Pipeline API and Task API reference
- Quick Start Guide - Get started with SDK in 5 minutes
Advanced Features
- Phase 2 Complete Guide - Window operations, datetime, string, conditional features
- Common Scenarios - Real-world use cases
For more documentation, see the docs/ directory.
Use Cases
- Machine Learning - Prepare datasets for training (normalization, encoding, feature engineering)
- Time-Series Analysis - Window aggregations, resampling, lag features
- Data Analysis - Clean and transform data for analysis
- ETL Pipelines - Extract, transform, and load data workflows with minimal I/O overhead
- Data Migration - Convert between formats and clean legacy data
- Automation - Script data processing with SDK or CLI
- In-Memory Processing - Chain transformations without file I/O costs
Requirements
- .NET 10.0 or later
- Cross-platform - Windows, Linux, macOS
- Flexible Usage - CLI tool (no coding) or SDK (programmatic)
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Links
- SDK NuGet Package: https://www.nuget.org/packages/FilePrepper
- CLI NuGet Package: https://www.nuget.org/packages/fileprepper-cli
- GitHub Repository: https://github.com/iyulab/FilePrepper
- Issues: https://github.com/iyulab/FilePrepper/issues
- Documentation: docs/
- Changelog: CHANGELOG.md
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies (net10.0):
- CsvHelper (>= 33.1.0)
- EPPlus (>= 8.5.0)
- ExcelDataReader (>= 3.8.0)
- ExcelDataReader.DataSet (>= 3.8.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.5)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.5)
- Microsoft.Extensions.Options (>= 10.0.5)
- Scrutor (>= 7.0.0)
NuGet packages (1)
Showing the top 1 NuGet package that depends on FilePrepper:
- DataLens: Exploratory data analysis engine for CSV/Excel datasets. Produces JSON analysis results including profiling, descriptive statistics, correlation, regression, clustering, outlier detection, PCA, and feature importance.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.6.0 | 0 | 3/20/2026 |
| 0.5.0 | 146 | 2/21/2026 |
| 0.4.9 | 239 | 1/10/2026 |
| 0.4.8 | 169 | 11/16/2025 |
| 0.4.7 | 278 | 11/14/2025 |
| 0.4.5 | 318 | 11/13/2025 |
| 0.4.3 | 291 | 11/10/2025 |
| 0.4.0 | 220 | 11/3/2025 |
| 0.2.3 | 221 | 11/3/2025 |
| 0.2.2 | 181 | 1/17/2025 |
| 0.2.1 | 158 | 1/16/2025 |
| 0.2.0 | 191 | 1/11/2025 |
| 0.1.1 | 195 | 12/16/2024 |
| 0.1.0 | 188 | 12/6/2024 |