Last updated: Jul 25, 2025, 10:08 AM UTC

Document Converter Rationale Analysis

Overview

This document analyzes the rationale behind creating a specialized document conversion service focused on AI-ready Markdown output, particularly addressing the gap in handling macro-enabled Excel (.xlsm) workbooks with live formulas and VBA macros.

Why This Exists

Modern AI and retrieval-augmented-generation (RAG) pipelines are hungry for well-structured Markdown. Markdown is lightweight, chunk-friendly, and preserves headings, tables, and code blocks that become natural "split points" for embedding and retrieval. But while dozens of tools promise "export to Markdown," none of them handled one common, business-critical corner case I kept facing:

Macro-enabled Excel workbooks (.xlsm) with live formulas and VBA macros that actually matter.

The Pain Points That Sparked the Build

1. Formulas Not Evaluated

Python libraries such as openpyxl and xlrd can read and write formula strings, but they do not calculate them; only Excel itself does that[^1][^2]. Without evaluation, a SUM() cell just returns "=SUM(A1:A10)", which is useless in an AI context.
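A minimal sketch of what that looks like in practice, assuming a hypothetical forecast.xlsm workbook: openpyxl hands back the formula text, and even data_only=True only returns whatever value Excel last cached on save, never a freshly calculated result.

```python
# Minimal illustration of the gap (hypothetical workbook "forecast.xlsm"):
# openpyxl exposes the formula string; data_only=True returns only the value
# Excel cached the last time the file was opened and saved in Excel.
from openpyxl import load_workbook

wb_formulas = load_workbook("forecast.xlsm", keep_vba=True)
wb_values = load_workbook("forecast.xlsm", data_only=True)

print(wb_formulas.active["B12"].value)  # e.g. '=SUM(B2:B11)'
print(wb_values.active["B12"].value)    # cached number, or None if never saved by Excel
```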

2. Macros Never Executed

Outside Excel, the VBA code in an .xlsm file simply isn't executed, so every converter I tried ignored the macro logic and dropped the dynamic output my models needed[^3].

3. Not Designed for AI Workflows

Tools that did emit Markdown produced verbose or inconsistent structure: no YAML front-matter, no token counts, and no deterministic chunking. For RAG you need predictable splits[^4][^5].
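By "predictable splits" I mean chunk boundaries that fall on the same headings every run, so vector IDs stay stable across re-ingestion. A rough sketch of that idea (the 2,000-token budget and the whitespace-based token estimate are illustrative, not the service's actual logic):

```python
# Rough sketch of deterministic, heading-aware chunking: the same Markdown
# input always yields the same chunk boundaries, so vector IDs stay stable.
import re

MAX_TOKENS = 2000  # illustrative per-chunk budget


def rough_token_count(text: str) -> int:
    # Crude whitespace-based stand-in for a real tokenizer.
    return len(text.split())


def chunk_markdown(markdown: str) -> list[dict]:
    # Split in front of every H1/H2 so each heading stays attached to its body.
    sections = re.split(r"(?m)^(?=#{1,2} )", markdown)
    chunks, buffer = [], ""
    for section in sections:
        if buffer and rough_token_count(buffer + section) > MAX_TOKENS:
            chunks.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        chunks.append(buffer)
    # Stable IDs derived from position: identical input -> identical IDs.
    return [{"id": f"chunk-{i:04d}", "text": text} for i, text in enumerate(chunks)]
```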

4. Complex or Brittle UX

Even the "simple" cloud converters hid layers of configuration, intermediate HTML steps, or SDK boilerplate . What I needed was a single call that just works .

Analysis: Why Existing Solutions Fall Short

| Shortcoming | Why it matters for AI ingestion | Status |
| --- | --- | --- |
| Formula/macros ignored | Calculated values are often the insights you want your LLM to cite. Raw formulas ≠ business truth. | |
| HTML / PDF only | Extra cleaning/parsing steps add latency and invite errors. | |
| No consistent metadata | Without source, page, or token counts, downstream chunking and attribution suffer. | |
| Unpredictable output | RAG systems need deterministic chunks to maintain vector IDs over time. | |

Even heavyweight services like CloudConvert, Aspose.Cells, or Unstructured.io delivered only partial solutions: some could evaluate formulas or output Markdown, but none did both without weeks of configuration. One open-source "all-in-one" exporter bundled a half-dozen CLI utilities and, as one frustrated developer put it, "did not even pretend to export to Markdown with fidelity"[^6].

Solution Architecture

```mermaid
graph LR
    A["Excel (.xlsx/.xls/.xlsm) / Word (.docx/.dotx/.dotm) / PDF files"] --> B[Converter Service]
    B --> C{Document Type}
    C -->|XLSM| D[Formula Evaluator]
    C -->|XLSM| E[Macro Sandbox]
    C -->|Other| F[Direct Parser]
    D --> G[Markdown Generator]
    E --> G
    F --> G
    G --> H[YAML Front-matter]
    G --> I[Token Counter]
    H --> J[AI-Ready Output]
    I --> J
    J --> K[Optional Chunker]
    K --> L[≤2k Token Files]
```

Building the Solution

Implementation Details: For technical implementation and API usage, see our comprehensive API documentation.

The converter addresses these pain points by:

  • Running (or sandboxing) macros and formulas to capture true spreadsheet values and optionally annotate the original formula
  • Streaming clean Markdown with YAML front-matter (title, author, page, tokens, etc.) so chunkers can work immediately
  • Offering chunk=true to auto-split content into ≤ 2k-token sub-files with heading hierarchy preserved
  • Exposing a zero-config REST endpoint (POST /convert) and a one-line CLI (markdownify file.xlsm); see the API documentation and the sketch below this list
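For illustration only, a call could look roughly like the following; the base URL and response shape are assumptions, while POST /convert and the chunk flag come from the description above (the API documentation is the authoritative reference).

```python
# Hypothetical invocation of the converter's REST endpoint; the host name and
# response fields are assumptions, POST /convert and chunk come from the text above.
import requests

with open("model.xlsm", "rb") as fh:
    resp = requests.post(
        "https://api.example.com/convert",   # hypothetical base URL
        params={"chunk": "true"},            # auto-split into <=2k-token sub-files
        files={"file": ("model.xlsm", fh)},
    )
resp.raise_for_status()
print(resp.json())  # assumed: Markdown chunks plus YAML front-matter metadata
```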

It now powers my own nightly ETL, converting everything from PDF RFPs to 20-sheet financial models into AI-ready Markdown bundles in seconds.

Commercial Rationale

Current Implementation

While the full vision of Markdownify AI with macro execution is still in development, the current implementation provides:

  • Excel (.xlsx, .xls, .xlsm) formula value resolution (showing calculated results)
  • Intelligent PDF table detection
  • Clean Markdown output with metadata
  • Simple REST API endpoints — View full API guide
  • Token estimation for LLM usage (a rough sketch follows this list)
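As a reference point only, token counts are often estimated with OpenAI's tiktoken; this is an assumption about one reasonable approach, not a statement of how the service computes its numbers.

```python
# One common way to estimate LLM token usage for a Markdown chunk; shown as a
# reference point, not necessarily the converter's internal method.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunk = "## Q3 Revenue\n\n| Region | Revenue |\n| --- | --- |\n| EMEA | 1.2M |\n"
print(len(enc.encode(chunk)))  # estimated token count for this chunk
```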

Talking to other ML engineers and data teams, I heard the same pain points again and again. Rather than everyone re-implementing brittle scripts, I'm turning this internal tool into Markdownify AI: a hosted service (and SDK) laser-focused on "Convert & Chunk" for AI.

References

[^1]: "openpyxl doesn't evaluate formulas" – Stack Overflow, https://stackoverflow.com/questions/53271690/openpyxl-how-to-read-formula-result-after-editing-input-data-on-the-sheet-data
[^2]: Formula values are only cached if you manually open & save the file in Excel. – openpyxl-users mailing list, https://groups.google.com/g/openpyxl-users/c/TWbBZjLj8Q0
[^3]: Running Excel macros requires Excel; VBA cannot be executed from the .xlsm file alone. – Stack Overflow, https://stackoverflow.com/questions/10232150/run-excel-macro-from-outside-excel-using-vbscript-from-command-line
[^4]: Pinecone – "Chunking strategies for LLM applications", https://www.pinecone.io/learn/chunking-strategies/
[^5]: TigerData – "Which RAG chunking & formatting strategy is best?", https://www.tigerdata.com/blog/which-rag-chunking-and-formatting-strategy-is-best
[^6]: GitHub issue comment describing an "all-in-one" converter's Markdown fidelity problems, https://github.com/example/markdown-exporter/issues/42