InPage.Format 1.1.0


inpage-format

<div align="center">

Open research for the InPage .INP binary file format

The only documented, tested, multi-language implementation of the InPage decoder


</div>


Why this exists

InPage has been the dominant Urdu/Arabic word processor across Pakistan, India, and the Middle East for 30+ years. Newspapers, government records, books, and legal documents are locked in .INP files. No open tooling exists to read them.

| The problem | Impact |
|---|---|
| InPage 3.x cannot open InPage 2.x files | Millions of archived documents are stranded |
| InPage is Windows-only | No Linux, macOS, mobile, or server-side processing |
| The format is completely undocumented | No importers in LibreOffice, Pandoc, or anywhere else |
| InPage web version does not support local file import | Archives can't be migrated |

This repository is the community's answer: a complete format specification derived from reverse engineering, with reference implementations in TypeScript and C# that anyone can use to build their own tools.


What's inside

inpage-format/
│
├── 📖  specs/              9 detailed specification documents
│   ├── 01-problem-statement.md
│   ├── 02-container-format.md      OLE2/CFB container layout
│   ├── 03-encoding-legacy.md       v1/v2 byte-pair encoding
│   ├── 04-encoding-v3.md           v3 UTF-16LE + struct array
│   ├── 05-character-maps.md        110+ character mappings
│   ├── 06-formatting-structures.md Style/font/alignment binary structures
│   ├── 07-text-filtering.md        Noise separation algorithm
│   ├── 08-security.md              CVE-2017-12824 & threat model
│   └── 09-write-feasibility.md     Feasibility: generating .INP files
│
├── 📦  lib/javascript/     TypeScript library — Node.js & browser
├── 📦  lib/dotnet/         C# library — .NET 9+
└── 🧪  test-fixtures/      Minimal binary fixtures for testing

Quick start

JavaScript / TypeScript

npm install inpage-format

import * as CFB from 'cfb';
import { decodeV1V2, decodeV3, filterParagraphsWithMeta } from 'inpage-format';

// 1. Parse the OLE2 container (use cfb or any OLE2 library).
//    fileBuffer holds the raw bytes of the .INP file (for example an ArrayBuffer).
const cfbFile = CFB.read(new Uint8Array(fileBuffer), { type: 'array' });

// 2. Detect version from stream name (the v3 stream wins when both exist)
const entry200 = CFB.find(cfbFile, '/InPage200');
const entry300 = CFB.find(cfbFile, '/InPage300');
const entry = entry300 ?? entry200;
if (!entry?.content) throw new Error('No InPage200 or InPage300 stream found');
const stream = new Uint8Array(entry.content);
const version = entry300 ? 3 : 2;

// 3. Decode
const decoded = version === 3 ? decodeV3(stream) : decodeV1V2(stream);

// 4. Filter noise
const { paragraphs, filteredCount } = filterParagraphsWithMeta(
  decoded.paragraphs,
  decoded.paragraphMeta,
);

console.log(`Extracted ${paragraphs.length} paragraphs (${filteredCount} noise paragraphs removed)`);
paragraphs.forEach(p => console.log(p));

C# / .NET

dotnet add package InPage.Format

using OpenMcdf;
using InPage.Format;

// 1. Parse the OLE2 container
using var cf = new CompoundFile("document.inp");

// 2. Detect version
string streamName = cf.RootStorage.TryGetStream("InPage300") != null
    ? "InPage300" : "InPage200";
int version = streamName == "InPage300" ? 3 : 2;

byte[] content = cf.RootStorage.GetStream(streamName).GetData();

// 3. Decode
var decoded = version == 3
    ? InPageDecoder.DecodeV3(content)
    : InPageDecoder.DecodeV1V2(content);

// 4. Filter noise
var (paragraphs, _, filteredCount) = TextFilter.FilterWithMeta(
    decoded.Paragraphs,
    decoded.ParagraphMeta
);

Console.WriteLine($"Extracted {paragraphs.Count} paragraphs ({filteredCount} filtered)");
foreach (var para in paragraphs)
    Console.WriteLine(para);

Format overview

InPage files are OLE2/CFB containers (same format as legacy .doc / .xls) with one named content stream:

| Stream name | InPage version | Encoding |
|---|---|---|
| InPage100 | 1.x | Proprietary byte-pair (0x04 prefix) |
| InPage200 | 2.x | Proprietary byte-pair (0x04 prefix) |
| InPage300 | 3.x | UTF-16LE with struct array |
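
The stream table above can be turned into a small lookup. This is a sketch only: `detectVersion` and `STREAM_VERSIONS` are hypothetical names, not part of the library, and it assumes the caller has already listed the container's top-level stream names without a leading slash. Checking the newest stream first mirrors the quick start, which prefers InPage300 when it is present.

```typescript
type InPageVersion = 1 | 2 | 3;

// Stream names from the table above, newest first.
const STREAM_VERSIONS: ReadonlyArray<[string, InPageVersion]> = [
  ['InPage300', 3],
  ['InPage200', 2],
  ['InPage100', 1],
];

// Hypothetical helper: pick the content stream and version from a list of
// stream names. Returns null when no known InPage stream is present.
function detectVersion(
  streamNames: string[],
): { stream: string; version: InPageVersion } | null {
  for (const [stream, version] of STREAM_VERSIONS) {
    if (streamNames.indexOf(stream) !== -1) {
      return { stream, version };
    }
  }
  return null;
}
```

Versions 1 and 2 share the byte-pair encoding, so a caller would presumably route both to the same legacy decoder and only version 3 to the UTF-16LE path.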

V1/V2 encoding at a glance

Every Urdu character is stored as a 2-byte pair. The first byte is always 0x04; the second byte indexes into a 110-entry lookup table:

04 81 → ا  (Alef)        04 9C → ک  (Kaf)
04 82 → ب  (Beh)         04 A4 → ی  (Yeh)
04 A5 → ے  (Yeh Barree)  04 F3 → ۔  (Urdu Full Stop)
04 F6 → ﷺ  (PBUH)        04 D1 → ۱  (Urdu 1)

Composite sequences use 4 bytes (base + modifier):

04 81 04 BF → أ  (Alef + Hamza Above)
04 81 04 B3 → آ  (Alef + Madda)

Word boundaries are implicit: a non-0x04 control byte between character sequences signals a word break.
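
To make the byte-pair scheme concrete, here is a minimal sketch of a decoder loop. It is not the library's implementation: `decodeSample`, `SAMPLE_MAP`, and `SAMPLE_COMPOSITES` are hypothetical names, the table holds only the entries quoted above (the full table is in specs/05-character-maps.md), and it assumes every non-0x04 byte counts as a single-byte control code, which real files may not honor.

```typescript
// Partial lookup table: only the eight entries quoted in this section.
const SAMPLE_MAP: Record<number, string> = {
  0x81: '\u0627', // ا Alef
  0x82: '\u0628', // ب Beh
  0x9c: '\u06a9', // ک Kaf
  0xa4: '\u06cc', // ی Yeh
  0xa5: '\u06d2', // ے Yeh Barree
  0xd1: '\u06f1', // ۱ Urdu 1
  0xf3: '\u06d4', // ۔ Urdu Full Stop
  0xf6: '\ufdfa', // ﷺ PBUH
};

// Composite sequences keyed as "<base>:<modifier>" in lowercase hex.
const SAMPLE_COMPOSITES: Record<string, string> = {
  '81:bf': '\u0623', // أ Alef + Hamza Above
  '81:b3': '\u0622', // آ Alef + Madda
};

function decodeSample(bytes: Uint8Array): string {
  let out = '';
  let pendingBreak = false; // set when a control byte follows decoded text
  let i = 0;
  while (i < bytes.length) {
    if (bytes[i] !== 0x04 || i + 1 >= bytes.length) {
      // Assumption: any non-0x04 byte is a control byte marking a word break.
      if (out.length > 0) pendingBreak = true;
      i += 1;
      continue;
    }
    const base = bytes[i + 1];
    let glyph: string | undefined = undefined;
    let consumed = 2;
    // Try the 4-byte composite form first (base pair + modifier pair).
    if (i + 3 < bytes.length && bytes[i + 2] === 0x04) {
      const key = base.toString(16) + ':' + bytes[i + 3].toString(16);
      if (key in SAMPLE_COMPOSITES) {
        glyph = SAMPLE_COMPOSITES[key];
        consumed = 4;
      }
    }
    if (glyph === undefined) glyph = SAMPLE_MAP[base];
    if (glyph !== undefined) {
      if (pendingBreak) out += ' ';
      out += glyph;
      pendingBreak = false;
    }
    i += consumed; // unmapped pairs are skipped silently
  }
  return out;
}
```

With this table, `decodeSample(new Uint8Array([0x04, 0x81, 0x04, 0xb3]))` returns the single composite character آ rather than Alef followed by an unmapped pair, because the composite lookup runs before the single-pair lookup.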

V3 encoding at a glance

Text is standard UTF-16LE. Before the text, an array of [styleId: u32, byteLength: u32] structs maps formatting to text spans. The boundary between the struct array and the text is the 6-byte marker FF FF FF FF 0D 00.
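
The layout described above can be sketched as follows. Several details here are assumptions rather than facts from the specification: the u32 fields are read as little-endian, the text is taken to start immediately after the 6-byte marker, the first occurrence of the marker is treated as the boundary, and the caller supplies the offset where the struct array begins (`structStart`). The names `splitV3`, `findBoundary`, and `StyleSpan` are hypothetical.

```typescript
const V3_BOUNDARY = [0xff, 0xff, 0xff, 0xff, 0x0d, 0x00];

interface StyleSpan {
  styleId: number;
  byteLength: number;
}

// Return the offset of the first boundary marker at or after `from`, or -1.
function findBoundary(data: Uint8Array, from: number): number {
  outer: for (let i = from; i + V3_BOUNDARY.length <= data.length; i++) {
    for (let j = 0; j < V3_BOUNDARY.length; j++) {
      if (data[i + j] !== V3_BOUNDARY[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Split a v3 stream into its style spans and its text.
// Returns null when no boundary marker is found.
function splitV3(
  data: Uint8Array,
  structStart: number,
): { spans: StyleSpan[]; text: string } | null {
  const boundary = findBoundary(data, structStart);
  if (boundary < 0) return null;

  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);

  // Each struct is 8 bytes: styleId (u32) then byteLength (u32).
  const spans: StyleSpan[] = [];
  for (let off = structStart; off + 8 <= boundary; off += 8) {
    spans.push({
      styleId: view.getUint32(off, true),
      byteLength: view.getUint32(off + 4, true),
    });
  }

  // Decode the remaining bytes as UTF-16LE code units.
  let text = '';
  for (let off = boundary + V3_BOUNDARY.length; off + 1 < data.length; off += 2) {
    text += String.fromCharCode(view.getUint16(off, true));
  }
  return { spans, text };
}
```

Returning null instead of throwing leaves room for a caller to try a different strategy when the marker is missing.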


Supported characters

| Category | Count | Notes |
|---|---|---|
| Urdu/Arabic consonants | 39 | Including Urdu-specific: پ ٹ ڈ ڑ گ ں ے ہ ھ |
| Diacritical marks (harakat) | 14 | Zabar, zer, pesh, shadda, sukun + alternates |
| Urdu numerals (Extended Arabic-Indic) | 10 | ۰–۹ (U+06F0–U+06F9) |
| Arabic-Indic digits (Arabic mode) | 10 | ٠–٩ (U+0660–U+0669) |
| Punctuation & symbols | 22 | Including ۔ ، ؟ ؛ ﴾ ﴿ ﷺ |
| Religious symbols | 4 | ؑ ؔ ؓ ؒ |
| Composite sequences | 4 | أ آ ؤ یئ |

Full table: specs/05-character-maps.md


Known limitations

| Area | Status |
|---|---|
| Text extraction (v1/v2) | ✅ ~85–90% accuracy |
| Text extraction (v3) | ✅ Working |
| Word spacing (v1/v2) | ✅ Recovered via pendingSpace heuristic |
| Page breaks | ✅ Form Feed → PAGE_BREAK_MARKER |
| Font name extraction | ✅ Pattern-matched from header |
| Font size / alignment | ⚠️ Partial: most files work, some edge cases |
| Bold / italic | ⚠️ Partially reverse-engineered |
| Embedded images | ❌ Not implemented |
| Tables / columns | ❌ Layout structures unknown |
| Headers / footers | ❌ Master page structure unknown |

Security

InPage files have been used in APT campaigns targeting Pakistani civil society (CVE-2017-12824 — stack overflow in InPage). The library includes:

  • OLE2 magic signature validation
  • File size limits (50 MB max)
  • Exploit pattern scanning (68 72 68 72 egg-hunter + LuNdLuNd shellcode marker)
  • Strict bounds checking on all binary reads
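
A pre-parse check along the lines of the first three items might look like the sketch below. It is illustrative, not the library's code: `precheck` and `indexOfBytes` are hypothetical names, "50 MB" is assumed to mean 50 × 1024 × 1024 bytes, and rejecting the file outright on a pattern hit is an assumed policy. The 8-byte OLE2 signature is the standard compound file header; the two scan patterns are the ones listed above.

```typescript
const OLE2_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
const MAX_FILE_SIZE = 50 * 1024 * 1024; // assumed interpretation of "50 MB"
const EGG_HUNTER = [0x68, 0x72, 0x68, 0x72];
// ASCII bytes of "LuNdLuNd"
const SHELLCODE_MARKER = [0x4c, 0x75, 0x4e, 0x64, 0x4c, 0x75, 0x4e, 0x64];

// Naive byte-sequence search; returns the first match offset or -1.
function indexOfBytes(haystack: Uint8Array, needle: number[]): number {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

type PrecheckResult = { ok: true } | { ok: false; reason: string };

// Hypothetical gate to run before handing bytes to an OLE2 parser.
function precheck(file: Uint8Array): PrecheckResult {
  if (file.length > MAX_FILE_SIZE) {
    return { ok: false, reason: 'file exceeds size limit' };
  }
  const hasMagic =
    file.length >= OLE2_MAGIC.length &&
    OLE2_MAGIC.every((b, i) => file[i] === b);
  if (!hasMagic) {
    return { ok: false, reason: 'missing OLE2 signature' };
  }
  if (
    indexOfBytes(file, EGG_HUNTER) >= 0 ||
    indexOfBytes(file, SHELLCODE_MARKER) >= 0
  ) {
    return { ok: false, reason: 'known exploit pattern found' };
  }
  return { ok: true };
}
```

The size check runs first so that oversized inputs are rejected before any scanning work is done.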

Details: specs/08-security.md


Used in the wild

| Project | Description |
|---|---|
| ViewAnyFile | Browser-based file viewer supporting dozens of formats |
| InPage Viewer | Online InPage .INP file viewer: open any InPage document directly in the browser, no install needed |

Built something with this library? Open a PR to add it here.


Contributing

All contributions are welcome:

  • 🔤 New character mappings — found a wrong glyph? Open an issue with the hex bytes
  • 🧪 Test fixtures — minimal binary snippets demonstrating edge cases
  • 🌐 New language ports — Python, Go, Rust, Java all welcome
  • 📝 Format discoveries — binary analysis of unknown byte sequences

See CONTRIBUTING.md for the full guide.


License

MIT © inpage-format contributors

Character mapping data is factual Unicode assignment data — not copyrightable. Attribution to prior researchers is maintained in source comments.

Target frameworks

Included in the package: net9.0, with no dependencies. Computed as compatible: net10.0, plus the platform-specific net9.0 and net10.0 targets (android, browser, ios, maccatalyst, macos, tvos, windows).


| Version | Downloads | Last Updated |
|---|---|---|
| 1.1.0 | 104 | 4/6/2026 |
| 1.0.0 | 108 | 4/3/2026 |

v1.1.0 — Add FormatExtractor (font table, color palette, default style, per-paragraph overrides) and DecodeV3Fallback for files without boundary marker.