AI-Optimized Document Conversion Service Design Document
Generated: 2025-07-22 09:28 UTC
Status: Complete
Overview
This document outlines the design for an AI-optimized document conversion service (code-name "MarkdownifyAI") that converts complex business documents into clean, AI-ready Markdown with rich metadata.
1. Purpose
Provide developers and data teams with the fastest, simplest way to convert complex business documents—including macro-enabled spreadsheets—into clean, AI-ready Markdown plus rich metadata. The service (code-name "MarkdownifyAI") focuses on minimal configuration, one-endpoint access, and output specifically structured for Retrieval-Augmented Generation (RAG) workflows.
2. Background / Problem Statement
Large-language-model pipelines need source documents broken into semantically meaningful Markdown chunks. Existing converters rarely handle spreadsheets with formulas, macros, or multiple sheets, and few attach the metadata required for token-efficient chunking. Users are forced to maintain brittle scripts or use heavyweight SDKs.
3. Goals & Success Metrics
Goal | Metric | Target | Status |
---|---|---|---|
Fast conversion | Median turnaround time (≤10 MB) | < 3 s | |
High fidelity | Structural accuracy score* | ≥ 97 % | |
AI readiness | % docs with valid front-matter & token counts | 100 % | |
Developer love | Net Promoter Score (NPS) | ≥ 60 | |
*Structural accuracy is measured against golden Markdown snapshots.
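The document does not define how the structural accuracy score is computed. As one possible reading, the sketch below reduces both the converter output and the golden snapshot to a sequence of structural markers (heading levels, list items, table rows, paragraphs) and compares the sequences with `difflib`. The function names and the marker scheme are assumptions made for illustration, not the project's actual metric.

```python
import difflib
import re


def structural_signature(markdown: str) -> list[str]:
    """Reduce Markdown to structural markers, ignoring the wording itself."""
    signature = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            signature.append(f"h{len(heading.group(1))}")
        elif re.match(r"([-*+]|\d+\.)\s", stripped):
            signature.append("li")
        elif stripped.startswith("|"):
            # Assumed: table rows start with a pipe; column count is part of the marker.
            signature.append(f"row{stripped.count('|')}")
        else:
            signature.append("p")
    return signature


def structural_accuracy(output_md: str, golden_md: str) -> float:
    """Similarity in [0, 1] between the two structural signatures."""
    matcher = difflib.SequenceMatcher(
        None,
        structural_signature(output_md),
        structural_signature(golden_md),
        autojunk=False,
    )
    return matcher.ratio()
```

Under this reading, the ≥ 97 % target would correspond to `structural_accuracy(...) >= 0.97` averaged over the snapshot suite.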
4. Target Users
- Data Engineers / ML Ops – need pipeline-ready Markdown
- LLM Integrators / Consultants – rapid POCs without custom ETL
- Technical Writers – export source docs to Markdown knowledge bases
5. Competitive Landscape (abridged)
Competitor | XLSM computed values | AI-ready Markdown | API simplicity | Notes |
---|---|---|---|---|
Pandoc | ✗ (formulas only) | Partial | CLI only | Generic converter |
CloudConvert | Values only (no macros) | Raw tables | REST, pay-per-task | Generic format gateway |
Aspose.Cells | Via code | No metadata | Heavy SDK | License cost |
Unstructured.io | ✗ | Partitioned JSON | REST & OSS | Powerful but complex |
Docparser | ✗ | JSON/CSV | Template rules | PDF-centric |
Docling | ✗ | Markdown | OSS | Multi-sheet aggregation issue |
Aryn DocParse | | Markdown/JSON | REST | PDF-only focus |
Docugami | ✗ | Knowledge Graph | SaaS | Enterprise-only |
Mammoth | N/A (docx only) | HTML | Library | No spreadsheets |
Opportunity: No existing service simultaneously 1) evaluates formulas/macros, 2) outputs AI-optimized Markdown with metadata, and 3) offers a zero-config API.
6. Key Differentiators
- Macro/Formula Evaluation – run spreadsheets in a sandbox to emit resolved values and optionally inline original formulas as comments
- AI-Optimized Output – clean Markdown with YAML front-matter (source path, author, page, tokens)
- One-Endpoint API – `POST /convert` with auto-type detection; synchronous stream or async job
- Smart Chunking – optional `chunk=true` returns folders of ≤2 k-token chunks with hierarchy preserved
- Developer Ergonomics – CLI (`markdownify file.docx`) and typed SDKs
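To make the Smart Chunking differentiator concrete, here is a minimal sketch of heading-aware chunking. It assumes that "hierarchy preserved" means each chunk carries the breadcrumb of headings above it, and it counts tokens by whitespace splitting as a stand-in for a real tokenizer. `chunk_markdown`, `count_tokens`, and the chunk dictionary shape are hypothetical names, not the service's actual interface.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in only: whitespace-separated words. A real tokenizer would differ.
    return len(text.split())


def chunk_markdown(markdown: str, target_tokens: int = 2000) -> list[dict]:
    """Split Markdown at headings and at the token budget, keeping heading paths."""
    chunks = []
    path = []          # heading breadcrumb at the current position
    buffer = []        # lines collected for the chunk being built
    buffer_path = []   # breadcrumb that applies to the buffered lines

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(
                {"path": list(buffer_path), "tokens": count_tokens(text), "text": text}
            )
        buffer.clear()

    for line in markdown.splitlines():
        heading = re.match(r"(#{1,6})\s+(.*)", line)
        if heading:
            flush()  # a new heading always starts a new chunk
            level = len(heading.group(1))
            del path[level - 1:]
            path.append(heading.group(2).strip())
            buffer_path = list(path)
        elif buffer and count_tokens("\n".join(buffer + [line])) > target_tokens:
            flush()  # budget exceeded: continue under the same breadcrumb
        # Note: a single line longer than the budget is kept whole in this sketch.
        buffer.append(line)
    flush()
    return chunks
```

Each returned chunk could then be written to its own file, with the `path` deciding the folder it lands in.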
7. User Stories (sample)
- DE-01 As a data engineer, I POST a 4-sheet .xlsm and receive a ZIP containing four `.md` files with computed values so I can upload them to my vector database
- WR-02 As a technical writer, I drag-and-drop a PDF and download a Markdown file with images extracted to `/assets`
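For story DE-01, the client side of "receive a ZIP containing four `.md` files" might look like the sketch below. It only unpacks the bundle in memory. The archive layout (one `.md` per sheet, other entries such as assets ignored) is an assumption, and `markdown_files_from_zip` is a hypothetical helper name.

```python
import io
import zipfile


def markdown_files_from_zip(zip_bytes: bytes) -> dict[str, str]:
    """Return {archive path: Markdown text} for every .md entry in the bundle."""
    documents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        for name in bundle.namelist():
            if name.lower().endswith(".md"):
                documents[name] = bundle.read(name).decode("utf-8")
    return documents
```

The resulting dictionary can be passed to whatever embedding or vector-database upload step the pipeline already uses.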
8. Functional Requirements
- FR-01 Accept DOCX, PDF, PPTX, XLSX/XLSM, CSV, HTML, TXT
- FR-02 Evaluate spreadsheets including VBA macros in a secure VM
- FR-03 Extract and relink images, charts, and embedded objects
- FR-04 Generate YAML front-matter with configurable keys
- FR-05 Provide `zip` bundle or `tar.gz` streaming
- FR-06 Offer Webhooks for async jobs
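As an illustration of FR-04, the following sketch renders YAML front-matter from a metadata dictionary with a configurable key list. The key names (`source`, `author`, `page`, `tokens`) are assumed from the differentiators listed earlier, and the function names are hypothetical. Values are serialized with `json.dumps`, which relies on JSON scalars also being valid YAML.

```python
import json

# Assumed default keys; the requirement only says the keys are configurable.
DEFAULT_KEYS = ("source", "author", "page", "tokens")


def render_front_matter(metadata: dict, keys=DEFAULT_KEYS) -> str:
    """Render the selected metadata keys as a YAML front-matter block."""
    lines = ["---"]
    for key in keys:
        if key in metadata:
            # JSON scalars are valid YAML scalars, so this gives safe quoting.
            lines.append(f"{key}: {json.dumps(metadata[key], ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def with_front_matter(markdown: str, metadata: dict, keys=DEFAULT_KEYS) -> str:
    """Prefix a Markdown document with its front-matter block."""
    return render_front_matter(metadata, keys) + "\n" + markdown
```

Passing a different `keys` tuple is how a caller would select or reorder fields in this sketch. Metadata entries not named in `keys` are dropped.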
9. Non-Functional Requirements
- Performance: < 5 s P50 for ≤10 MB files
- Scalability: 100 concurrent jobs per tenant
- Security & Compliance: Data encrypted at rest; auto-purge after 30 min; ISO 27001 / SOC 2 on the roadmap
- Reliability: 99.9 % monthly uptime SLA
10. API Specification (high-level)
POST /convert
Headers: Authorization, Accept: application/zip
Body:
    file (binary)
    options: {
        output_format: "markdown",
        chunk: boolean,
        target_tokens: int,
        include_formulas: boolean
    }
Response:
    200 application/zip (sync)
    202 Location: /jobs/{id} (async)
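A client for this endpoint could be sketched as follows using only the standard library. The specification above does not say how `file` and `options` are encoded, so the multipart/form-data body with a JSON `options` part is an assumption, as are the Bearer authorization scheme, the base URL argument, and the helper names. The request is only built here, not sent.

```python
import json
import urllib.request
import uuid


def build_convert_request(base_url, token, filename, file_bytes, options):
    """Build (but do not send) a POST /convert request."""
    boundary = uuid.uuid4().hex
    options_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="options"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(options)}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/convert",
        data=options_part + file_header + file_bytes + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed scheme
            "Accept": "application/zip",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def interpret_response(status: int, headers: dict, body: bytes):
    """Map the two documented outcomes to ('done', zip) or ('pending', job URL)."""
    if status == 200:
        return ("done", body)
    if status == 202:
        return ("pending", headers["Location"])
    raise RuntimeError(f"unexpected status {status}")
```

A caller would send the request with `urllib.request.urlopen` and, on a `pending` result, poll the returned job location or wait for a webhook.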
11. Architecture Diagram
12. Metrics & Analytics
- Conversion success rate (automatic alert when below 95 %)
- Median processing time per MIME type
- Token count distribution for outputs
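The first two metrics above could be computed from job records roughly as shown below. The record shape `(mime_type, succeeded, seconds)` and the function name are assumptions, and the sketch takes the median over successful jobs only, which the document does not specify.

```python
from collections import defaultdict
from statistics import median


def summarize(jobs, alert_threshold: float = 0.95) -> dict:
    """Summarize jobs given as (mime_type, succeeded, seconds) tuples."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("no jobs to summarize")
    success_rate = sum(1 for _, ok, _ in jobs if ok) / len(jobs)
    durations = defaultdict(list)
    for mime, ok, seconds in jobs:
        if ok:  # assumed: failed jobs are excluded from timing statistics
            durations[mime].append(seconds)
    return {
        "success_rate": success_rate,
        "alert": success_rate < alert_threshold,
        "median_seconds": {mime: median(values) for mime, values in durations.items()},
    }
```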
13. Pricing & Packaging (draft)
Tier | Pages/month | Overages | Features |
---|---|---|---|
Free | 100 | n/a | Single file, community support |
Pro | 50 k | $0.0005/page | Batch API, webhook |
Enterprise | Custom | Volume | VPC deployment, SLA |
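As a worked example of the Pro overage rate, the sketch below computes the overage charge for a month. It assumes overage is billed per page beyond the monthly allowance; the draft does not state a base subscription price, so none is included. `overage_charge` is a hypothetical helper.

```python
from decimal import Decimal


def overage_charge(pages_used: int, included_pages: int, rate_per_page: Decimal) -> Decimal:
    """Charge for pages beyond the monthly allowance (zero if within it)."""
    extra_pages = max(0, pages_used - included_pages)
    return extra_pages * rate_per_page


# Pro tier from the table: 50 k pages included, $0.0005 per extra page.
# 60,000 pages used -> 10,000 extra pages -> $5.00 overage.
pro_charge = overage_charge(60_000, 50_000, Decimal("0.0005"))
```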
14. Roadmap (MVP)
Milestone | Date |
---|---|
Alpha internal | 2025-08-01 |
Public Beta (Web + API) | 2025-09-15 |
GA v1.0 | 2025-11-30 |
AI-native Chunking v1.1 | 2026-01-15 |
15. Risks & Mitigations
Risk | Probability | Impact | Mitigation |
---|---|---|---|
Macro sandbox escape | Medium | High | Reuse hardened LibreOffice in Docker with seccomp + AppArmor |
Vendor rate limits | Low | Medium | Autoscaling, queue back-pressure |
Competing OSS clone | Medium | Medium | Dual-license core, focus on hosted UX |
16. Open Questions
- Support for EPUB input?
- Automatic language detection & code block highlighting?
- Do we include DOCX comments as footnotes?
Document History
Date | Author | Changes |
---|---|---|
2025-07-22 | Claude | Initial conversion from PRD format to design document format |
2025-07-22 | Original PRD | Last updated: 22 Jul 2025 |