FilePrepper 0.6.0

Install via your preferred package manager:

- .NET CLI: dotnet add package FilePrepper --version 0.6.0
- Package Manager: NuGet\Install-Package FilePrepper -Version 0.6.0
- PackageReference: <PackageReference Include="FilePrepper" Version="0.6.0" />
- Central package management: <PackageVersion Include="FilePrepper" Version="0.6.0" /> in Directory.Packages.props, plus <PackageReference Include="FilePrepper" /> in the project file
- Paket: paket add FilePrepper --version 0.6.0
- F# Interactive: #r "nuget: FilePrepper, 0.6.0"
- File-based apps: #:package FilePrepper@0.6.0
- Cake addin: #addin nuget:?package=FilePrepper&version=0.6.0
- Cake tool: #tool nuget:?package=FilePrepper&version=0.6.0
FilePrepper
A powerful .NET library and CLI tool for data preprocessing. Features a Pipeline API for efficient in-memory data transformations with 67-90% reduction in file I/O. Perfect for ML data preparation, ETL pipelines, and data analysis workflows.
Quick Start
SDK Installation
# Install FilePrepper SDK for programmatic use
dotnet add package FilePrepper
# Or install CLI tool globally
dotnet tool install -g fileprepper-cli
SDK Usage (Recommended)
using FilePrepper.Pipeline;
// CSV Processing: Only 2 file I/O operations (read + write)
await DataPipeline
.FromCsvAsync("data.csv")
.Normalize(columns: new[] { "Age", "Salary", "Score" },
method: NormalizationMethod.MinMax)
.FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
.FilterRows(row => int.Parse(row["Age"]) >= 30)
.ToCsvAsync("output.csv");
// Multi-Format Support: Excel → Transform → JSON
await DataPipeline
.FromExcelAsync("sales.xlsx")
.AddColumn("Total", row =>
(double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
.FilterRows(row => double.Parse(row["Total"]) >= 1000)
.ToJsonAsync("high_value_sales.json");
// Multi-File CSV Concatenation: Merge 33 files ⭐ NEW
await DataPipeline
.ConcatCsvAsync("kemp-*.csv", "dataset/")
.ParseKoreanTime("Time", "ParsedTime") // Korean time format ⭐ NEW
.ExtractDateFeatures("ParsedTime", DateFeatures.Hour | DateFeatures.Minute)
.ToCsvAsync("processed.csv");
CLI Usage
# Normalize numeric columns
fileprepper normalize-data --input data.csv --output normalized.csv \
--columns "Age,Salary,Score" --method MinMax
# Fill missing values
fileprepper fill-missing-values --input data.csv --output filled.csv \
--columns "Age,Salary" --method Mean
# Get help
fileprepper --help
fileprepper <command> --help
Supported Formats
Process data in multiple formats:
- CSV (Comma-Separated Values)
- TSV (Tab-Separated Values)
- JSON (JavaScript Object Notation)
- XML (Extensible Markup Language)
- Excel (XLSX/XLS files)
Feature Matrix (30 Tasks)
| Category | CLI Command | Task | Description |
|---|---|---|---|
| Data Transformation | normalize | NormalizeData | MinMax, ZScore normalization |
| | scale | ScaleData | StandardScaler, MinMaxScaler, RobustScaler |
| | one-hot-encoding | OneHotEncoding | Categorical → binary columns |
| | convert-type | DataTypeConvert | Column data type conversion |
| | extract-date | DateExtraction | Extract Year, Month, Day, DayOfWeek |
| | datetime | DateTimeOps | Parse datetime and extract features |
| | string | StringOps | upper, lower, trim, substring, concat, replace |
| | conditional | Conditional | If-then-else column creation |
| | expression | Expression | Arithmetic expression-based columns |
| Data Cleaning | fill-missing | FillMissingValues | Mean, Median, Mode, Forward, Backward, Constant |
| | drop-duplicates | DropDuplicates | Remove duplicate rows by key columns |
| | replace | ValueReplace | Replace values in columns |
| | remove-constants | RemoveConstants | Remove constant/near-constant columns |
| | clean | CSVCleaner | Thousand separators, whitespace, \r strip |
| Column Operations | add-columns | AddColumns | Add computed columns |
| | remove-columns | RemoveColumns | Delete columns |
| | rename-columns | RenameColumns | Rename column headers |
| | reorder-columns | ReorderColumns | Change column order |
| | column-interaction | ColumnInteraction | Create interaction features between columns |
| Data Organization | merge | Merge | Vertical (concat) / Horizontal (join), glob support |
| | merge-asof | MergeAsOf | Time-series merge with tolerance |
| | data-sampling | DataSampling | Random, Stratified, Systematic sampling |
| | convert-format | FileFormatConvert | CSV ↔ TSV ↔ JSON ↔ XML ↔ Excel |
| | unpivot | Unpivot | Wide → Long format reshape |
| | filter-rows | FilterRows | Row filtering by conditions |
| Data Analysis | stats | BasicStatistics | Mean, Median, StdDev, ZScore |
| | aggregate | Aggregate | Group-by aggregations |
| Feature Engineering | create-lag-features | CreateLagFeatures | Time-series lag features |
| | window | WindowOps | Resample, rolling aggregations |
| Common Options | — | — | --skip-rows, --has-header, --encoding, --ignore-errors |
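The MinMax and ZScore methods named in the table reduce to simple arithmetic. A minimal sketch in plain C# of what each transform computes (no FilePrepper dependency; the sample values are made up for illustration):

```csharp
using System;
using System.Linq;

// MinMax: (x - min) / (max - min) scales values into [0, 1].
// ZScore: (x - mean) / stddev centers values at 0 with unit variance.
double[] ages = { 20, 30, 40, 50 };

double min = ages.Min(), max = ages.Max();
double[] minMax = ages.Select(x => (x - min) / (max - min)).ToArray();

double mean = ages.Average();
double std = Math.Sqrt(ages.Select(x => (x - mean) * (x - mean)).Average());
double[] zScore = ages.Select(x => (x - mean) / std).ToArray();

Console.WriteLine(minMax[0]); // 0 (the minimum maps to 0, the maximum to 1)
```

MinMax preserves the shape of the distribution inside a fixed range, while ZScore is the better default when columns have very different scales and outliers matter.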
ML Data Preparation Cookbook
Common scenarios for machine learning data preparation:
Large Dataset Sampling (100K+ rows → 10K sample)
# Random sampling with fixed seed for reproducibility
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
--method Random --sample-size 10000 --seed 42
# Stratified sampling (preserve label distribution)
fileprepper data-sampling -i large_dataset.csv -o sampled.csv \
--method Stratified --sample-size 10000 --stratify-column "label"
// Pipeline API
await DataPipeline
.FromCsvAsync("large_dataset.csv")
.Sample(10000, SamplingMethod.Random, seed: 42)
.ToCsvAsync("sampled.csv");
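For intuition, seeded random sampling can be sketched in plain C# without FilePrepper; the fixed seed is what makes a run reproducible (the row data below is made up):

```csharp
using System;
using System.Linq;

// Reproducible random sampling: shuffle with a seeded RNG, keep the first N.
var rows = Enumerable.Range(1, 100_000).Select(i => $"row{i}").ToList();

var rng = new Random(42);                   // fixed seed => identical sample on every run
var sample = rows.OrderBy(_ => rng.Next())  // shuffle by random sort key
                 .Take(10_000)              // keep the first 10,000 rows
                 .ToList();

Console.WriteLine(sample.Count); // 10000
```

Stratified sampling additionally groups rows by the label column first and samples each group proportionally, which is why it preserves the label distribution.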
Merging X/Y Split Files (Features + Labels)
# Horizontal merge: combine X_train.csv (features) + Y_train.csv (labels) by row index
fileprepper merge -i X_train.csv Y_train.csv -o merged_train.csv --direction Horizontal
// Pipeline API
var features = await DataPipeline.FromCsvAsync("X_train.csv");
var labels = await DataPipeline.FromCsvAsync("Y_train.csv");
await features
.Join(labels, JoinType.Full, leftKey: null, rightKey: null) // row-by-row join
.ToCsvAsync("merged_train.csv");
Multi-Row Header Files (Skip metadata rows)
# Skip first row (category header), use second row as actual column names
fileprepper filter-rows -i messy_data.csv -o clean_data.csv --skip-rows 1
# No header in file → use numeric column indices
fileprepper normalize -i raw.csv -o normalized.csv \
--columns "0,1,2" --method MinMax --has-header false
Cleaning External Data (Mixed line endings)
# Strip \r from quoted fields + remove thousand separators
fileprepper clean -i external_export.csv -o cleaned.csv --strip-cr -s ','
// Pipeline API
await DataPipeline
.FromCsvAsync("external_export.csv")
.StripCarriageReturn()
.ToCsvAsync("cleaned.csv");
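What this cleanup does to an individual field can be sketched in a few lines of plain C# (illustrative only; the real CSVCleaner operates on parsed CSV, and CleanField here is a hypothetical helper, not part of the FilePrepper API):

```csharp
using System;

// Per-field cleanup: strip stray \r left by mixed line endings,
// drop thousand separators, and trim surrounding whitespace.
// Note: removing commas is only safe AFTER CSV parsing, one field at a time.
static string CleanField(string field) =>
    field.Replace("\r", "")
         .Replace(",", "")
         .Trim();

Console.WriteLine(CleanField(" 1,234,567\r ")); // 1234567
```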
Common Use Cases
Data Cleaning Pipeline (CLI)
# 1. Remove unnecessary columns
fileprepper remove-columns --input raw.csv --output step1.csv \
--columns "Debug,TempCol,Notes"
# 2. Drop duplicates
fileprepper drop-duplicates --input step1.csv --output step2.csv \
--columns "Email" --keep First
# 3. Fill missing values
fileprepper fill-missing-values --input step2.csv --output step3.csv \
--columns "Age,Salary" --method Mean
# 4. Normalize numeric columns
fileprepper normalize-data --input step3.csv --output clean.csv \
--columns "Age,Salary,Score" --method MinMax
Time-Series Processing
# 5-minute window aggregation for sensor data
fileprepper window --input sensor_current.csv --output aggregated.csv \
  --type resample --method mean \
  --columns "RMS[A]" --time-column "Time_s[s]" \
  --window 5T --header
# Rolling window for smoothing
fileprepper window --input noisy_data.csv --output smoothed.csv \
  --type rolling --method mean \
  --columns temperature,humidity --window-size 3 \
  --suffix "_smooth" --header
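The rolling mean with window size 3 used above is just an average over each 3-sample window; a minimal sketch in plain C# with made-up sensor values:

```csharp
using System;
using System.Linq;

// Rolling mean, window size 3: each output is the average of 3 consecutive inputs.
double[] noisy = { 10, 12, 11, 30, 13, 12 };

double[] smooth = Enumerable.Range(0, noisy.Length - 2)
    .Select(i => (noisy[i] + noisy[i + 1] + noisy[i + 2]) / 3.0)
    .ToArray();

Console.WriteLine(smooth[0]); // 11 (average of 10, 12, 11)
```

Note the output is shorter than the input by window-size minus one; real implementations choose how to pad or align the edges.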
ML Feature Engineering (SDK - Efficient!)
using FilePrepper.Pipeline;
// Single pipeline: Only 2 file I/O operations instead of 8!
await DataPipeline
.FromCsvAsync("orders.csv")
.AddColumn("Year", row => DateTime.Parse(row["OrderDate"]).Year.ToString())
.AddColumn("Month", row => DateTime.Parse(row["OrderDate"]).Month.ToString())
.Normalize(columns: new[] { "Revenue", "Quantity" },
method: NormalizationMethod.MinMax)
.FilterRows(row => int.Parse(row["Year"]) >= 2023)
.ToCsvAsync("features.csv");
// 67-90% reduction in file I/O compared to CLI approach!
Format Conversion
# CSV to JSON
fileprepper file-format-convert --input data.csv --output data.json --format JSON
# Excel to CSV
fileprepper file-format-convert --input report.xlsx --output report.csv --format CSV
# CSV to XML
fileprepper file-format-convert --input data.csv --output data.xml --format XML
Data Analysis
# Calculate statistics
fileprepper basic-statistics --input data.csv --output stats.csv \
--columns "Age,Salary,Score" --statistics Mean,Median,StdDev,ZScore
# Aggregate by group
fileprepper aggregate --input sales.csv --output summary.csv \
--group-by "Region,Category" --agg-columns "Revenue:Sum,Quantity:Mean"
# Sample data
fileprepper data-sampling --input large.csv --output sample.csv \
--method Random --sample-size 1000
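The aggregate spec above ("Revenue:Sum,Quantity:Mean") maps onto an ordinary group-by; a sketch with plain LINQ and made-up sales rows (not the FilePrepper implementation):

```csharp
using System;
using System.Linq;

// Group by Region, then sum Revenue and average Quantity per group.
var sales = new[]
{
    (Region: "East", Revenue: 100.0, Quantity: 2.0),
    (Region: "East", Revenue: 200.0, Quantity: 4.0),
    (Region: "West", Revenue: 50.0,  Quantity: 1.0),
};

var summary = sales
    .GroupBy(s => s.Region)
    .Select(g => (g.Key,
                  RevenueSum: g.Sum(s => s.Revenue),       // Revenue:Sum
                  QuantityMean: g.Average(s => s.Quantity) // Quantity:Mean
                 ))
    .ToList();

foreach (var (region, rev, qty) in summary)
    Console.WriteLine($"{region}: {rev}, {qty}");
```

Grouping on multiple columns (as in --group-by "Region,Category") works the same way with a composite key.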
Programmatic Usage (SDK)
FilePrepper provides a powerful SDK with Pipeline API for efficient data processing:
dotnet add package FilePrepper
Pipeline API (Recommended)
Benefits: 67-90% reduction in file I/O, fluent API, in-memory processing
using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData;
// Efficient: Only 2 file I/O operations (read + write)
await DataPipeline
.FromCsvAsync("data.csv")
.Normalize(columns: new[] { "Age", "Salary", "Score" },
method: NormalizationMethod.MinMax)
.FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
.FilterRows(row => int.Parse(row["Age"]) >= 30)
.AddColumn("ProcessedDate", _ => DateTime.Now.ToString())
.ToCsvAsync("output.csv");
// Or work in-memory without any file I/O
var result = DataPipeline
.FromData(inMemoryData)
.Normalize(columns: new[] { "Age", "Salary" },
method: NormalizationMethod.MinMax)
.ToDataFrame(); // Get immutable snapshot
Advanced Pipeline Features
// Chain multiple transformations
var pipeline = await DataPipeline
.FromCsvAsync("sales.csv")
.RemoveColumns(new[] { "Debug", "TempCol" })
.RenameColumn("OldName", "NewName")
.AddColumn("Total", row =>
(double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
.FilterRows(row => double.Parse(row["Total"]) > 100)
.Normalize(columns: new[] { "Total" }, method: NormalizationMethod.MinMax);
// Get intermediate results without file I/O
var dataFrame = pipeline.ToDataFrame();
Console.WriteLine($"Processed {dataFrame.RowCount} rows");
// Continue processing
await pipeline
.AddColumn("ProcessedAt", _ => DateTime.UtcNow.ToString("o"))
.ToCsvAsync("output.csv");
In-Memory Processing
// Work entirely in memory - zero file I/O
var data = new List<Dictionary<string, string>>
{
new() { ["Name"] = "Alice", ["Age"] = "25", ["Salary"] = "50000" },
new() { ["Name"] = "Bob", ["Age"] = "30", ["Salary"] = "60000" }
};
var result = DataPipeline
.FromData(data)
.Normalize(columns: new[] { "Age", "Salary" },
method: NormalizationMethod.MinMax)
.AddColumn("Category", row =>
int.Parse(row["Age"]) < 30 ? "Junior" : "Senior")
.ToDataFrame();
// Access results directly
foreach (var row in result.Rows)
{
Console.WriteLine($"{row["Name"]}: {row["Category"]}");
}
Traditional Task API
using FilePrepper.Tasks.NormalizeData;
using Microsoft.Extensions.Logging;
var options = new NormalizeDataOption
{
InputPath = "data.csv",
OutputPath = "normalized.csv",
TargetColumns = new[] { "Age", "Salary", "Score" },
Method = NormalizationMethod.MinMax
};
var task = new NormalizeDataTask(logger);
var context = new TaskContext(options);
bool success = await task.ExecuteAsync(context);
See SDK Usage Guide for comprehensive examples and best practices.
Documentation
Getting Started
- Quick Start Guide - Get started in 5 minutes
- CLI Guide - Complete command reference
- Installation Guide - Detailed installation
SDK & Programming
- API Reference - Pipeline API and Task API reference
- Quick Start Guide - Get started with SDK in 5 minutes
Advanced Features
- Phase 2 Complete Guide - Window operations, datetime, string, conditional features
- Common Scenarios - Real-world use cases
For more documentation, see the docs/ directory.
Use Cases
- Machine Learning - Prepare datasets for training (normalization, encoding, feature engineering)
- Time-Series Analysis - Window aggregations, resampling, lag features
- Data Analysis - Clean and transform data for analysis
- ETL Pipelines - Extract, transform, and load data workflows with minimal I/O overhead
- Data Migration - Convert between formats and clean legacy data
- Automation - Script data processing with SDK or CLI
- In-Memory Processing - Chain transformations without file I/O costs
Requirements
- .NET 10.0 or later
- Cross-platform - Windows, Linux, macOS
- Flexible Usage - CLI tool (no coding) or SDK (programmatic)
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Links
- SDK NuGet Package: https://www.nuget.org/packages/FilePrepper
- CLI NuGet Package: https://www.nuget.org/packages/fileprepper-cli
- GitHub Repository: https://github.com/iyulab/FilePrepper
- Issues: https://github.com/iyulab/FilePrepper/issues
- Documentation: docs/
- Changelog: CHANGELOG.md
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies (net10.0):
- CsvHelper (>= 33.1.0)
- EPPlus (>= 8.5.0)
- ExcelDataReader (>= 3.8.0)
- ExcelDataReader.DataSet (>= 3.8.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.5)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.5)
- Microsoft.Extensions.Options (>= 10.0.5)
- Scrutor (>= 7.0.0)
NuGet packages (1)
Showing the top 1 NuGet package that depends on FilePrepper:
- DataLens: Exploratory data analysis engine for CSV/Excel datasets. Produces JSON analysis results including profiling, descriptive statistics, correlation, regression, clustering, outlier detection, PCA, and feature importance.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.6.0 | 0 | 3/20/2026 |
| 0.5.0 | 146 | 2/21/2026 |
| 0.4.9 | 239 | 1/10/2026 |
| 0.4.8 | 169 | 11/16/2025 |
| 0.4.7 | 278 | 11/14/2025 |
| 0.4.5 | 318 | 11/13/2025 |
| 0.4.3 | 291 | 11/10/2025 |
| 0.4.0 | 220 | 11/3/2025 |
| 0.2.3 | 221 | 11/3/2025 |
| 0.2.2 | 181 | 1/17/2025 |
| 0.2.1 | 158 | 1/16/2025 |
| 0.2.0 | 191 | 1/11/2025 |
| 0.1.1 | 195 | 12/16/2024 |
| 0.1.0 | 188 | 12/6/2024 |