Last updated: Jul 22, 2025, 11:23 AM UTC

AI-Optimized Document Conversion Service Design Document

Generated: 2025-07-22 09:28 UTC
Status: Complete

Overview

This document outlines the design for an AI-optimized document conversion service (code-name "MarkdownifyAI") that converts complex business documents into clean, AI-ready Markdown with rich metadata.

1. Purpose

Provide developers and data teams with the fastest, simplest way to convert complex business documents—including macro-enabled spreadsheets—into clean, AI-ready Markdown plus rich metadata. The service (code-name "MarkdownifyAI") focuses on minimal configuration, one-endpoint access, and output specifically structured for Retrieval-Augmented Generation (RAG) workflows.

2. Background / Problem Statement

Large language model pipelines need source documents broken into semantically meaningful Markdown chunks. Existing converters rarely handle spreadsheets with formulas, macros, or multiple sheets, and few attach the metadata required for token-efficient chunking. Users are forced to maintain brittle scripts or use heavyweight SDKs.

3. Goals & Success Metrics

| Goal | Metric | Target | Status |
|---|---|---|---|
| Fast conversion | Median turnaround time (≤10 MB) | < 3 s | |
| High fidelity | Structural accuracy score* | ≥ 97 % | |
| AI readiness | % docs with valid front-matter & token counts | 100 % | |
| Developer love | Net Promoter Score (NPS) | ≥ 60 | |

*Structural accuracy is measured against golden Markdown snapshots.
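The document does not define how the structural accuracy score is computed. One possible reading, offered purely as an assumption, is to reduce both the converter output and the golden snapshot to a sequence of block types (headings, list items, table rows, code fences, text) and compare the two sequences. The function names and the choice of `difflib.SequenceMatcher` below are illustrative, not part of the design.

```python
import difflib
import re


def structure_signature(markdown: str) -> list[str]:
    """Reduce a Markdown document to a sequence of block-type labels."""
    labels = []
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            labels.append("fence")
            continue
        if in_fence or not stripped:
            continue  # ignore code bodies and blank lines
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            labels.append(f"h{len(heading.group(1))}")
        elif stripped.startswith("|"):
            labels.append("table_row")
        elif re.match(r"([-*+]|\d+\.)\s", stripped):
            labels.append("list_item")
        else:
            labels.append("text")
    return labels


def structural_accuracy(output_md: str, golden_md: str) -> float:
    """Similarity in [0, 1] between the two block-type sequences (assumed metric)."""
    a = structure_signature(output_md)
    b = structure_signature(golden_md)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Under this reading, the ≥ 97 % target would correspond to `structural_accuracy(...) >= 0.97` averaged over the golden set; the actual metric may weight block types differently.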

4. Target Users

  • Data Engineers / ML Ops – need pipeline-ready Markdown
  • LLM Integrators / Consultants – rapid POCs without custom ETL
  • Technical Writers – export source docs to Markdown knowledge bases

5. Competitive Landscape (abridged)

| Competitor | XLSM computed values | AI-ready Markdown | API simplicity | Notes |
|---|---|---|---|---|
| Pandoc | ✗ (formulas only) | Partial | CLI only | Generic converter |
| CloudConvert | (values, no macros) | Raw tables | REST, pay-per-task | Generic format gateway |
| Aspose.Cells | (code) | No metadata | Heavy SDK | License cost |
| Unstructured.io | ✗ | Partitioned JSON | REST & OSS | Powerful but complex |
| Docparser | ✗ | JSON/CSV | Template rules | PDF-centric |
| Docling | ✗ | Markdown | OSS | Multi-sheet aggregation issue |
| Aryn DocParse | | Markdown/JSON | REST | PDF-only focus |
| Docugami | ✗ | Knowledge Graph | SaaS | Enterprise-only |
| Mammoth | N/A (docx only) | HTML | Library | No spreadsheets |

Opportunity: No existing service simultaneously 1) evaluates formulas/macros, 2) outputs AI-optimized Markdown with metadata, and 3) offers a zero-config API.

6. Key Differentiators

  1. Macro/Formula Evaluation – run spreadsheets in a sandbox to emit resolved values and optionally inline original formulas as comments
  2. AI-Optimized Output – clean Markdown with YAML front-matter (source path, author, page, tokens)
  3. One-Endpoint API – POST /convert with auto-type detection; synchronous stream or async job
  4. Smart Chunking – optional chunk=true returns folders of ≤2k-token chunks with hierarchy preserved
  5. Developer Ergonomics – CLI (markdownify file.docx) and typed SDKs
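As a rough illustration of the Smart Chunking differentiator, the sketch below splits Markdown at headings, records each chunk's heading path so the hierarchy is preserved, and starts a new chunk when the token budget would be exceeded. It rests on several assumptions: `count_tokens` is a whitespace-based stand-in for a real tokenizer, the chunk dictionary shape is hypothetical, and a single line longer than the budget is emitted as an oversized chunk rather than split.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in only: counts whitespace-delimited words, not model tokens.
    return len(text.split())


def chunk_markdown(markdown: str, target_tokens: int = 2000) -> list[dict]:
    """Split Markdown into chunks of at most target_tokens, keeping heading paths."""
    chunks = []
    path = []         # current heading hierarchy, e.g. ["Report", "Q1"]
    buffer = []       # lines collected for the chunk being built
    buffer_path = []  # heading path that applies to the buffered lines

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "path": list(buffer_path),
                "text": text,
                "tokens": count_tokens(text),
            })
        buffer.clear()

    for line in markdown.splitlines():
        heading = re.match(r"(#{1,6})\s+(.*)", line)
        if heading:
            flush()  # a heading always starts a new chunk
            level = len(heading.group(1))
            del path[level - 1:]  # drop deeper or sibling headings
            path.append(heading.group(2).strip())
            buffer_path = list(path)
        elif buffer:
            buffered = sum(count_tokens(item) for item in buffer)
            if buffered + count_tokens(line) > target_tokens:
                flush()  # budget would be exceeded; same heading path continues
        buffer.append(line)
    flush()
    return chunks
```

Carrying the heading path alongside each chunk lets a downstream RAG index show where a chunk came from without re-parsing the source document.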

7. User Stories (sample)

  • DE-01 As a data engineer, I POST a 4-sheet .xlsm and receive a ZIP containing four .md files with computed values so I can upload them to my vector database
  • WR-02 As a technical writer, I drag-and-drop a PDF and download a Markdown file with images extracted to /assets

8. Functional Requirements

  • FR-01 Accept DOCX, PDF, PPTX, XLSX/XLSM, CSV, HTML, TXT
  • FR-02 Evaluate spreadsheets including VBA macros in a secure VM
  • FR-03 Extract and relink images, charts, and embedded objects
  • FR-04 Generate YAML front-matter with configurable keys
  • FR-05 Provide zip bundle or tar.gz streaming
  • FR-06 Offer Webhooks for async jobs
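To make FR-04 concrete, here is a minimal sketch of rendering YAML front-matter from a metadata dictionary with a configurable key list. The key names (`source`, `author`, `page`, `tokens`) are assumed from the fields named under Key Differentiators, and values are serialized with `json.dumps`, relying on JSON scalars being valid YAML; a production implementation would more likely use a YAML library.

```python
import json

# Assumed default keys, taken from the fields listed under Key Differentiators.
DEFAULT_KEYS = ("source", "author", "page", "tokens")


def render_front_matter(metadata: dict, keys=DEFAULT_KEYS) -> str:
    """Render the selected metadata keys as a YAML front-matter block."""
    lines = ["---"]
    for key in keys:
        if key in metadata:  # skip keys the converter could not extract
            value = json.dumps(metadata[key], ensure_ascii=False)
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_front_matter({
        "source": "reports/q1.xlsm",
        "author": "A. Writer",
        "page": 1,
        "tokens": 512,
    }))
```

Passing a different `keys` tuple is one way the "configurable keys" requirement could be met; keys not listed are silently dropped.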

9. Non-Functional Requirements

  • Performance: < 5 s P50 for ≤10 MB files
  • Scalability: 100 concurrent jobs per tenant
  • Security & Compliance: Data encrypted at rest; data auto-purged after 30 min; ISO 27001 / SOC 2 on the roadmap
  • Reliability: 99.9 % monthly uptime SLA

10. API Specification (high-level)

POST /convert
Headers: Authorization, Accept: application/zip
Body:
  file (binary)
  options: {
    output_format: "markdown",
    chunk: boolean,
    target_tokens: int,
    include_formulas: boolean
  }
Response:
  200 application/zip  (sync)
  202 Location: /jobs/{id} (async)
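The sketch below shows how a client might assemble a request matching this specification and interpret the two documented responses. Several details are assumptions not stated above: the host `api.example.com`, the `Bearer` authorization scheme, sending the body as multipart/form-data, and encoding `options` as a JSON form field. The request is only built, not sent.

```python
import json
import urllib.request
import uuid


def build_convert_request(base_url, api_key, filename, file_bytes, options):
    """Build (but do not send) a multipart POST /convert request."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="options"',
        b"Content-Type: application/json",
        b"",
        json.dumps(options).encode(),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # scheme is an assumption
            "Accept": "application/zip",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def interpret_response(status, headers):
    """Map the documented status codes to ('sync', None) or ('async', job_url)."""
    if status == 200:
        return ("sync", None)  # body is the ZIP archive
    if status == 202:
        return ("async", headers.get("Location"))  # poll /jobs/{id}
    raise ValueError(f"unexpected status {status}")
```

A caller would pass the request to `urllib.request.urlopen` and branch on `interpret_response(resp.status, resp.headers)`; how the service decides between the sync and async paths is not specified here.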

11. Architecture Diagram

graph TD
    A[Client] -->|POST /convert| B[API Gateway]
    B --> C{File Type Router}
    C -->|XLSX/XLSM| D[Excel Converter]
    C -->|PDF| E[PDF Converter]
    C -->|DOCX| F[Word Converter]
    C -->|PPTX| G[PowerPoint Converter]
    D --> H[Formula Evaluator]
    H --> I[Sandbox VM]
    D --> J[Markdown Generator]
    E --> J
    F --> J
    G --> J
    J --> K[Metadata Extractor]
    K --> L[Token Counter]
    L --> M{Chunking?}
    M -->|Yes| N[Smart Chunker]
    M -->|No| O[Single File Output]
    N --> P[ZIP/TAR Generator]
    O --> P
    P --> Q[Response Handler]
    Q -->|Sync| R[Direct Response]
    Q -->|Async| S[Job Queue]
    S --> T[Webhook Callback]
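The File Type Router step in the diagram could be sketched as a simple lookup from file extension to converter. This is a simplification: the design calls for auto-type detection, which may inspect file contents rather than names, and the converter labels below are hypothetical identifiers for the four converters shown. CSV, HTML, and TXT appear in FR-01 but not in the diagram, so they are left unrouted here.

```python
from pathlib import PurePath

# Extension-to-converter table mirroring the four branches in the diagram.
ROUTES = {
    ".xlsx": "excel",
    ".xlsm": "excel",
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
}


def route(filename: str) -> str:
    """Return the converter label for a file, based on its extension only."""
    suffix = PurePath(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename!r}") from None


if __name__ == "__main__":
    for name in ("Budget.XLSM", "deck.pptx", "report.pdf"):
        print(name, "->", route(name))
```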

12. Metrics & Analytics

  • Conversion success rate (automatic alert when below 95 %)
  • Median processing time per MIME type
  • Token count distribution for outputs
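A minimal sketch of how the first two metrics might be derived from job records follows. The record shape (`mime`, `ok`, `seconds`) is hypothetical, and the median here includes failed jobs, which the document does not specify either way.

```python
from collections import defaultdict
from statistics import median

ALERT_THRESHOLD = 0.95  # alert when the success rate drops below 95 %


def summarize(jobs: list[dict]) -> dict:
    """Compute success rate, alert flag, and median seconds per MIME type."""
    durations = defaultdict(list)
    succeeded = 0
    for job in jobs:
        durations[job["mime"]].append(job["seconds"])
        succeeded += 1 if job["ok"] else 0
    rate = succeeded / len(jobs) if jobs else None
    return {
        "success_rate": rate,
        "alert": rate is not None and rate < ALERT_THRESHOLD,
        "median_seconds": {mime: median(vals) for mime, vals in durations.items()},
    }
```

With no jobs in the window the sketch reports no rate and raises no alert, which avoids paging on an idle service.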

13. Pricing & Packaging (draft)

| Tier | Pages/month | Overages | Features |
|---|---|---|---|
| Free | 100 | n/a | Single file, community support |
| Pro | 50 k | $0.0005/page | Batch API, webhook |
| Enterprise | Custom | Volume | VPC deployment, SLA |
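As a worked example of the Pro overage rate: a tenant converting 60,000 pages in a month exceeds the 50 k allowance by 10,000 pages, which at $0.0005/page comes to $5.00 in overage. The base subscription price is not given in this draft, so the sketch computes the overage portion only.

```python
from decimal import Decimal

PRO_INCLUDED_PAGES = 50_000
PRO_OVERAGE_PER_PAGE = Decimal("0.0005")  # USD per page beyond the allowance


def pro_overage_charge(pages_used: int) -> Decimal:
    """Overage charge in USD for the Pro tier (base price excluded)."""
    extra_pages = max(0, pages_used - PRO_INCLUDED_PAGES)
    return extra_pages * PRO_OVERAGE_PER_PAGE


if __name__ == "__main__":
    print(pro_overage_charge(60_000))
```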

14. Roadmap (MVP)

| Milestone | Date |
|---|---|
| Alpha internal | 2025-08-01 |
| Public Beta (Web + API) | 2025-09-15 |
| GA v1.0 | 2025-11-30 |
| AI-native Chunking v1.1 | 2026-01-15 |

15. Risks & Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Macro sandbox escape | Medium | High | Reuse hardened LibreOffice in Docker with seccomp + AppArmor |
| Vendor rate limits | Low | Medium | Autoscaling, queue back-pressure |
| Competing OSS clone | Medium | Medium | Dual-license core, focus on hosted UX |
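For the sandbox-escape mitigation, the sketch below assembles a `docker run` command line that applies a seccomp profile and an AppArmor profile, alongside a few common hardening flags. The image name, profile path, profile name, resource limits, and the inner command are all hypothetical placeholders; the document only states that LibreOffice runs in Docker with seccomp and AppArmor.

```python
def sandbox_command(image, workdir, seccomp_profile, apparmor_profile, inner_cmd):
    """Build the argv for running a converter inside a locked-down container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                       # no network access for macros
        "--cap-drop", "ALL",                       # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}",
        "--security-opt", f"apparmor={apparmor_profile}",
        "--pids-limit", "128",                     # illustrative limits only
        "--memory", "512m",
        "--volume", f"{workdir}:/work",
        image,
        *inner_cmd,
    ]


if __name__ == "__main__":
    argv = sandbox_command(
        image="example/libreoffice-sandbox:latest",   # hypothetical image
        workdir="/tmp/job-123",
        seccomp_profile="/etc/sandbox/seccomp.json",  # hypothetical path
        apparmor_profile="markdownify-sandbox",       # hypothetical profile
        inner_cmd=["soffice", "--headless", "--convert-to", "csv",
                   "--outdir", "/work/out", "/work/in/book.xlsm"],
    )
    print(" ".join(argv))
```

Returning an argv list rather than a shell string keeps untrusted filenames out of shell parsing when the command is handed to `subprocess.run`.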

16. Open Questions

  1. Support for epub input?
  2. Automatic language detection & code block highlighting?
  3. Do we include Docx comments as footnotes?

Document History

| Date | Author | Changes |
|---|---|---|
| 2025-07-22 | Claude | Initial conversion from PRD format to design document format |
| 2025-07-22 | Original PRD | Last updated: 22 Jul 2025 |