AI-Optimized Document Conversion Service Design Document

Generated: 2025-07-22 09:28 UTC
Status: Complete

Overview

This document outlines the design for an AI-optimized document conversion service (code-name "MarkdownifyAI") that converts complex business documents into clean, AI-ready Markdown with rich metadata.

1. Purpose

Provide developers and data teams with the fastest, simplest way to convert complex business documents—including macro-enabled spreadsheets—into clean, AI-ready Markdown plus rich metadata. The service (code-name "MarkdownifyAI") focuses on minimal configuration, one-endpoint access, and output specifically structured for Retrieval-Augmented Generation (RAG) workflows.

2. Background / Problem Statement

Large-language-model pipelines need source documents broken into semantically meaningful Markdown chunks. Existing converters rarely handle spreadsheets with formulas, macros, or multiple sheets, and few attach the metadata required for token-efficient chunking. Users are forced to maintain brittle scripts or use heavyweight SDKs.

3. Goals & Success Metrics

Goal	Metric	Target
Fast conversion	Median turnaround time (≤10 MB)	< 3 s
High fidelity	Structural accuracy score*	≥ 97 %
AI readiness	% docs with valid front-matter & token counts	100 %
Developer love	Net Promoter Score (NPS)	≥ 60

*Structural accuracy is measured against golden Markdown snapshots.
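
The scoring formula is not defined in this document. As one possible interpretation, the sketch below reduces each Markdown file to a sequence of coarse structural labels (heading level, table row with column count, list item, paragraph) and compares the converted output with the golden snapshot using Python's difflib. The function names and the labeling scheme are assumptions made for illustration only.

```python
import difflib
import re


def structural_tokens(markdown: str) -> list[str]:
    """Map each non-blank line to a coarse structural label."""
    tokens = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            tokens.append(f"h{len(heading.group(1))}")
        elif stripped.startswith("|"):
            # Table rows are labeled with their column count.
            tokens.append(f"row{stripped.strip('|').count('|') + 1}")
        elif re.match(r"([-*+]|\d+\.)\s", stripped):
            tokens.append("li")
        else:
            tokens.append("p")
    return tokens


def structural_accuracy(converted: str, golden: str) -> float:
    """Similarity of the two label sequences, from 0.0 to 1.0."""
    matcher = difflib.SequenceMatcher(
        None,
        structural_tokens(converted),
        structural_tokens(golden),
        autojunk=False,
    )
    return matcher.ratio()
```

Under this definition, a conversion that reproduces every table row but demotes one heading level would score below 1.0, and the 97 % target would be checked as structural_accuracy(...) >= 0.97.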

4. Target Users

Data Engineers / ML Ops – need pipeline-ready Markdown
LLM Integrators / Consultants – rapid POCs without custom ETL
Technical Writers – export source docs to Markdown knowledge bases

5. Competitive Landscape (abridged)

Competitor	XLSM computed values	AI-ready Markdown	API simplicity	Notes
Pandoc	✗ (formulas only)	Partial	CLI only	Generic converter
CloudConvert	(values, no macros)	Raw tables	REST, pay-per-task	Generic format gateway
Aspose.Cells	(code)	No metadata	Heavy SDK	License cost
Unstructured.io	✗	Partitioned JSON	REST & OSS	Powerful but complex
Docparser	✗	JSON/CSV	Template rules	PDF-centric
Docling	✗	Markdown	OSS	Multi-sheet aggregation issue
Aryn DocParse		Markdown/JSON	REST	PDF-only focus
Docugami	✗	Knowledge Graph	SaaS	Enterprise-only
Mammoth	N/A (docx only)	HTML	Library	No spreadsheets

Opportunity: No existing service simultaneously 1) evaluates formulas/macros, 2) outputs AI-optimized Markdown with metadata, and 3) offers a zero-config API.

6. Key Differentiators

Macro/Formula Evaluation – run spreadsheets in a sandbox to emit resolved values and optionally inline original formulas as comments
AI-Optimized Output – clean Markdown with YAML front-matter (source path, author, page, tokens)
One-Endpoint API – POST /convert with auto-type detection; synchronous stream or async job
Smart Chunking – optional chunk=true returns folders of chunks of at most 2k tokens each, with the heading hierarchy preserved
Developer Ergonomics – CLI (markdownify file.docx) and typed SDKs
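
To make the AI-Optimized Output and Smart Chunking items more concrete, here is a minimal sketch of heading-aware chunking with YAML front-matter. It is not the service's implementation: whitespace-separated words stand in for real tokenizer counts, the front-matter keys source, section, and tokens are illustrative, and a single section larger than the budget is emitted as-is rather than split further.

```python
import json
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer output.
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[list[str], str]]:
    """Split at ATX headings; return (heading path, section text) pairs."""
    sections = []
    path: list[str] = []
    current_path: list[str] = []
    buffer: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            if any(item.strip() for item in buffer):
                sections.append((current_path, "\n".join(buffer).strip()))
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2).strip()]
            current_path = list(path)
            buffer = [line]
        else:
            buffer.append(line)
    if any(item.strip() for item in buffer):
        sections.append((current_path, "\n".join(buffer).strip()))
    return sections


def group_sections(sections, target_tokens: int):
    """Greedily pack consecutive sections without exceeding the budget."""
    groups, current, used = [], [], 0
    for path, text in sections:
        size = count_tokens(text)
        if current and used + size > target_tokens:
            groups.append(current)
            current, used = [], 0
        current.append((path, text))
        used += size
    if current:
        groups.append(current)
    return groups


def render_chunk(group, source: str) -> str:
    """Prefix a chunk body with YAML front-matter."""
    body = "\n\n".join(text for _, text in group)
    front = [
        "---",
        f"source: {json.dumps(source)}",
        f"section: {json.dumps(' > '.join(group[0][0]))}",
        f"tokens: {count_tokens(body)}",
        "---",
    ]
    return "\n".join(front) + "\n\n" + body + "\n"


def chunk_markdown(markdown: str, source: str, target_tokens: int = 2000) -> list[str]:
    groups = group_sections(split_sections(markdown), target_tokens)
    return [render_chunk(group, source) for group in groups]
```

Each chunk records the heading path of its first section, so a downstream retriever can still tell where in the original document the text came from.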

7. User Stories (sample)

DE-01 As a data engineer, I POST a 4-sheet .xlsm and receive a ZIP containing four .md files with computed values so I can upload them to my vector database
WR-02 As a technical writer, I drag-and-drop a PDF and download a Markdown file with images extracted to /assets

8. Functional Requirements

FR-01 Accept DOCX, PDF, PPTX, XLSX/XLSM, CSV, HTML, TXT
FR-02 Evaluate spreadsheets including VBA macros in a secure VM
FR-03 Extract and relink images, charts, and embedded objects
FR-04 Generate YAML front-matter with configurable keys
FR-05 Provide zip bundle or tar.gz streaming
FR-06 Offer Webhooks for async jobs

9. Non-Functional Requirements

Performance: < 5 s P50 for ≤10 MB files
Scalability: 100 concurrent jobs per tenant
Security & Compliance: Data encrypted at rest; uploaded files auto-purged after 30 min; ISO 27001 / SOC 2 on the roadmap
Reliability: 99.9 % monthly uptime SLA
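
As a quick sanity check on the reliability figure, the arithmetic below converts a monthly uptime percentage into a downtime budget. The 30-day month is an assumption; the document does not say how the SLA period is counted.

```python
from decimal import Decimal


def allowed_downtime_minutes(uptime_percent: str, days_in_month: int = 30) -> Decimal:
    """Minutes of downtime permitted per month at the given uptime level."""
    total_minutes = Decimal(days_in_month * 24 * 60)
    return total_minutes * (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)


# 30 days = 43,200 minutes; 0.1 % of that is 43.2 minutes.
budget = allowed_downtime_minutes("99.9")
```

So a 99.9 % monthly SLA leaves roughly 43 minutes of total downtime in a 30-day month.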

10. API Specification (high-level)

POST /convert
Headers: Authorization, Accept: application/zip
Body:
  file (binary)
  options: {
    output_format: "markdown",
    chunk: boolean,
    target_tokens: int,
    include_formulas: boolean
  }
Response:
  200 application/zip  (sync)
  202 Location: /jobs/{id} (async)

11. Architecture Diagram

graph TD
  A[Client] -->|POST /convert| B[API Gateway]
  B --> C{File Type Router}
  C -->|XLSX/XLSM| D[Excel Converter]
  C -->|PDF| E[PDF Converter]
  C -->|DOCX| F[Word Converter]
  C -->|PPTX| G[PowerPoint Converter]
  D --> H[Formula Evaluator]
  H --> I[Sandbox VM]
  D --> J[Markdown Generator]
  E --> J
  F --> J
  G --> J
  J --> K[Metadata Extractor]
  K --> L[Token Counter]
  L --> M{Chunking?}
  M -->|Yes| N[Smart Chunker]
  M -->|No| O[Single File Output]
  N --> P[ZIP/TAR Generator]
  O --> P
  P --> Q[Response Handler]
  Q -->|Sync| R[Direct Response]
  Q -->|Async| S[Job Queue]
  S --> T[Webhook Callback]
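
The File Type Router step could be sketched as follows. This is an illustration under assumptions, not the planned implementation: the converter names are placeholders, only the four branches shown in the diagram are covered, and detection combines the file extension with a check of the leading bytes (Office Open XML files are ZIP containers; PDF files begin with "%PDF-").

```python
from pathlib import PurePath

# Extension-to-converter table for the four branches in the diagram.
CONVERTERS = {
    ".xlsx": "excel",
    ".xlsm": "excel",
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
}

ZIP_MAGIC = b"PK\x03\x04"  # DOCX, XLSX, XLSM, and PPTX are ZIP archives
PDF_MAGIC = b"%PDF-"


def route(filename: str, head: bytes) -> str:
    """Pick a converter from the extension and verify the leading bytes."""
    extension = PurePath(filename).suffix.lower()
    converter = CONVERTERS.get(extension)
    if converter is None:
        raise ValueError(f"no converter for extension {extension!r}")
    expected = PDF_MAGIC if converter == "pdf" else ZIP_MAGIC
    if not head.startswith(expected):
        raise ValueError("file content does not match its extension")
    return converter
```

Checking the leading bytes as well as the extension guards against a renamed file reaching the wrong converter, which matters most for the branch that leads to the sandbox.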

12. Metrics & Analytics

Conversion success rate (automatic alert when it falls below 95 %)
Median processing time per MIME type
Token count distribution for outputs
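
The first two metrics can be computed from per-job records along these lines. The record shape (MIME type, success flag, processing seconds) is hypothetical, and the sketch counts failed jobs in the timing figures, which the document does not specify.

```python
from collections import defaultdict
from statistics import median

ALERT_THRESHOLD = 0.95  # alert when the success rate falls below 95 %


def summarize(jobs):
    """jobs: iterable of (mime_type, succeeded, seconds) tuples."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("no jobs to summarize")
    succeeded = sum(1 for _, ok, _ in jobs if ok)
    success_rate = succeeded / len(jobs)
    durations = defaultdict(list)
    for mime, _, seconds in jobs:
        durations[mime].append(seconds)
    return {
        "success_rate": success_rate,
        "alert": success_rate < ALERT_THRESHOLD,
        "median_seconds": {mime: median(values) for mime, values in durations.items()},
    }
```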

13. Pricing & Packaging (draft)

Tier	Pages/month	Overages	Features
Free	100	n/a	Single file, community support
Pro	50 k	$0.0005/page	Batch API, webhook
Enterprise	Custom	Volume	VPC deployment, SLA
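
As a worked example of the draft Pro overage rate, the calculation below charges only for pages beyond the 50 k included. The base subscription price is not given in this document, so it is left out; the function name is illustrative.

```python
from decimal import Decimal

PRO_INCLUDED_PAGES = 50_000
PRO_OVERAGE_PER_PAGE = Decimal("0.0005")  # USD per page, from the table above


def pro_overage_charge(pages_used: int) -> Decimal:
    """Overage in USD for a Pro tenant; zero at or below the included pages."""
    extra_pages = max(0, pages_used - PRO_INCLUDED_PAGES)
    return extra_pages * PRO_OVERAGE_PER_PAGE


# 62,000 pages used: 12,000 extra pages x $0.0005 = $6.00.
charge = pro_overage_charge(62_000)
```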

14. Roadmap (MVP)

Milestone	Date
Alpha internal	2025-08-01
Public Beta (Web + API)	2025-09-15
GA v1.0	2025-11-30
AI-native Chunking v1.1	2026-01-15

15. Risks & Mitigations

Risk	Probability	Impact	Mitigation
Macro sandbox escape	Medium	High	Reuse hardened LibreOffice in Docker with seccomp + AppArmor
Vendor rate limits	Low	Medium	Autoscaling, queue back-pressure
Competing OSS clone	Medium	Medium	Dual-license core, focus on hosted UX
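
To illustrate the sandbox mitigation, the sketch below assembles a docker run command line with common hardening flags around a headless LibreOffice invocation. The image name, profile names, and resource limits are hypothetical, the csv conversion target is only a placeholder, and the command has not been validated against a real image.

```python
def sandbox_command(
    input_dir: str,
    output_dir: str,
    filename: str,
    *,
    image: str = "example/libreoffice-hardened:latest",  # hypothetical image
    seccomp_profile: str = "seccomp.json",               # hypothetical profile file
    apparmor_profile: str = "libreoffice-sandbox",       # hypothetical profile name
) -> list[str]:
    """Return an argv list for running LibreOffice in a locked-down container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                  # no network access from the sandbox
        "--read-only",                        # immutable root filesystem
        "--cap-drop", "ALL",                  # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}",
        "--security-opt", f"apparmor={apparmor_profile}",
        "--pids-limit", "128",
        "--memory", "1g",
        "--tmpfs", "/tmp",                    # scratch space for the user profile
        "-e", "HOME=/tmp",
        "-v", f"{input_dir}:/in:ro",          # input mounted read-only
        "-v", f"{output_dir}:/out",
        image,
        "soffice", "--headless",
        "--convert-to", "csv",
        "--outdir", "/out",
        f"/in/{filename}",
    ]
```

Passing the list to subprocess.run with a timeout would add a further limit on runaway macros; that step is omitted here so the sketch stays free of side effects.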

16. Open Questions

Should EPUB be supported as an input format?
Should the service detect languages automatically and tag code blocks for syntax highlighting?
Should DOCX comments be included as footnotes?

Document History

Date	Author	Changes
2025-07-22	Claude	Initial conversion from PRD format to design document format
2025-07-22	Original PRD	Last updated: 22 Jul 2025