Last updated: Jul 25, 2025, 10:08 AM UTC

Document Conversion Service - Product Features and Capabilities

Overview

This document describes the document conversion service's features and capabilities, based on the implemented converters. The service offers six specialized converters that transform business documents into web-friendly formats optimized for AI/LLM consumption.

Core Converters

API Documentation: For detailed usage instructions, code examples, and integration guides, see our comprehensive API documentation.

1. Excel (.xlsx, .xls, .xlsm) to JSON Converter

Endpoint: excelToJson
Supported Formats: .xlsx, .xls, .xlsm
Output: Structured JSON with comprehensive statistics

Key Features

  • Sheet Filtering

    • Filter sheets by prefix using sheetPrefix query parameter
    • Process only relevant data sheets in multi-sheet workbooks
  • Intelligent Data Cleaning

    • Automatic removal of empty rows
    • Whitespace trimming on all cell values
    • Detection and removal of empty-like values (-, N/A, null, undefined)
  • Advanced Analytics

    • Cell count statistics per sheet
    • Empty rate calculation for data quality assessment
    • Non-empty cell counting for content density analysis
  • Token Estimation

    • Calculates estimated tokens for LLM consumption
    • Estimates tokens as word count × 1.3 (an average tokens-per-word ratio)
    • Includes structural JSON tokens in estimate
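
The sheet filtering and cleaning rules above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `Workbook` shape, the function names, and the choice to replace empty-like values with an empty string (matched case-sensitively) are all assumptions.

```typescript
// Hypothetical in-memory shape: sheet name -> rows -> cell strings.
type Row = string[];
type Workbook = Record<string, Row[]>;

// Values treated as empty, per the feature list (exact match is an assumption).
const EMPTY_LIKE = new Set(["", "-", "N/A", "null", "undefined"]);

// Trim whitespace and collapse empty-like values to an empty string.
function cleanCell(value: string): string {
  const trimmed = value.trim();
  return EMPTY_LIKE.has(trimmed) ? "" : trimmed;
}

// Keep sheets matching the optional prefix, clean every cell,
// then drop rows that contain no non-empty cell.
function cleanWorkbook(workbook: Workbook, sheetPrefix?: string): Workbook {
  const result: Workbook = {};
  for (const name of Object.keys(workbook)) {
    if (sheetPrefix && !name.startsWith(sheetPrefix)) continue;
    result[name] = workbook[name]
      .map((row) => row.map(cleanCell))
      .filter((row) => row.some((cell) => cell !== ""));
  }
  return result;
}
```

Calling `cleanWorkbook(workbook, "Data_")` would keep only sheets whose names start with `Data_`, mirroring the sheetPrefix query parameter.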

2. Excel (.xlsx, .xls, .xlsm) to Markdown Converter

Endpoint: excelToMarkdown
Supported Formats: .xlsx, .xls, .xlsm
Output: Clean Markdown with YAML metadata

Key Features

  • Table Generation

    • Converts each Excel sheet to a properly formatted Markdown table
    • Maintains column alignment and structure
  • Metadata Preservation

    ---
    title: [filename]
    date: [ISO date]
    type: spreadsheet
    source: Excel (.xlsx, .xls, .xlsm)
    ---
    
  • Multi-line Cell Support

    • Handles cells with line breaks using <br> tags
    • Preserves complex cell content formatting
  • Sheet Organization

    • Each sheet becomes a separate section with H2 heading
    • Clear delineation between different data sections
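
As a rough sketch of the output described above, the following shows how a sheet might be rendered as an H2 section with a Markdown table and how the YAML metadata block might be built. The function names are hypothetical, treating the first row as the header is an assumption, and the pipe escaping is an extra safeguard not mentioned in this document.

```typescript
// Line breaks inside a cell become <br>; pipes are escaped so they
// do not split columns (the escaping is an added assumption).
function escapeCell(cell: string): string {
  return cell.replace(/\r?\n/g, "<br>").replace(/\|/g, "\\|");
}

// Render one sheet as an H2 heading followed by a Markdown table.
// Assumes the first row holds the column headers.
function sheetToMarkdown(name: string, rows: string[][]): string {
  if (rows.length === 0) return `## ${name}\n`;
  const [header, ...body] = rows;
  const line = (cells: string[]): string =>
    `| ${cells.map(escapeCell).join(" | ")} |`;
  const separator = `| ${header.map(() => "---").join(" | ")} |`;
  return [`## ${name}`, "", line(header), separator, ...body.map(line)].join("\n") + "\n";
}

// Build the YAML front matter shown in the metadata example.
function frontMatter(filename: string, isoDate: string): string {
  return [
    "---",
    `title: ${filename}`,
    `date: ${isoDate}`,
    "type: spreadsheet",
    "source: Excel (.xlsx, .xls, .xlsm)",
    "---",
  ].join("\n") + "\n";
}
```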

3. PDF to Markdown Converter

Endpoint: pdfToMarkdown
Supported Formats: .pdf
Output: Structured Markdown with intelligent formatting

Key Features

  • Intelligent Table Detection

    • Automatically detects tables based on spacing patterns
    • Converts to properly formatted Markdown tables
    • No manual configuration required
  • Smart Text Formatting

    • Converts bullets and special characters
    • Normalizes quotes and dashes
    • Detects ALL CAPS lines and converts them to H2 headings
  • Label Detection

    • Identifies "Label: Value" patterns
    • Converts labels to bold formatting
    • Preserves document structure semantics
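
The text formatting and label detection rules above could look roughly like this for a single extracted line. The function name, the exact regular expressions, and the order in which the rules apply are assumptions made for illustration. Table detection is not covered.

```typescript
// Format one line of extracted PDF text (hypothetical helper).
function formatLine(line: string): string {
  const text = line
    .replace(/[\u2018\u2019]/g, "'")            // curly single quotes -> straight
    .replace(/[\u201C\u201D]/g, '"')            // curly double quotes -> straight
    .replace(/[\u2013\u2014]/g, "-")            // en/em dashes -> hyphen
    .replace(/^\s*[\u2022\u25CF\u25AA]\s*/, "- ") // bullet glyphs -> Markdown list item
    .trim();

  // ALL CAPS line (at least one letter, no lowercase) -> H2 heading.
  const isAllCaps = /[A-Z]/.test(text) && text === text.toUpperCase();
  if (isAllCaps && !text.startsWith("- ")) return `## ${text}`;

  // "Label: Value" -> bold label. The 40-character cap is an arbitrary assumption.
  const label = /^([A-Za-z][\w ]{0,40}):\s+(.+)$/.exec(text);
  if (label) return `**${label[1]}:** ${label[2]}`;

  return text;
}
```

For example, `formatLine("Invoice Number: 12345")` yields `**Invoice Number:** 12345`.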

4. DOCX to HTML Converter

Endpoint: docxToHtml
Supported Formats: .docx
Output: Clean, semantic HTML

Key Features

  • Heading Preservation

    • Maps Word heading styles to HTML h1-h6 tags
    • Maintains document hierarchy
  • Style Mapping

    • Preserves bold, italic, underline formatting
    • Converts to semantic HTML tags
  • Clean Output

    • Removes unnecessary style attributes
    • Strips class attributes for cleaner HTML
    • Removes empty tags automatically
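
The clean-up steps above can be approximated with a few regular expressions. This is only a sketch under stated assumptions: the function name is hypothetical, a production converter would more likely work on a parsed DOM, and this version would also remove intentionally empty elements such as blank table cells.

```typescript
// Strip style/class attributes and remove empty elements (regex-based sketch).
function cleanHtml(html: string): string {
  // Drop style="..." and class="..." attributes (single or double quoted).
  let out = html.replace(/\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')/gi, "");

  // Remove elements with no content. Repeat until stable so that nested
  // empties like <p><span></span></p> collapse completely.
  const emptyTag = /<([a-z][a-z0-9]*)\b[^>]*>\s*<\/\1>/gi;
  let previous: string;
  do {
    previous = out;
    out = out.replace(emptyTag, "");
  } while (out !== previous);

  return out;
}
```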

5. DOCX to Markdown Converter

Endpoint: docxToMarkdown
Supported Formats: .docx
Output: Feature-rich Markdown

Key Features

  • Advanced Style Mapping

    • Title → H1
    • Heading 1-5 → H2-H6
    • Bold, italic, underline, code formatting preserved
  • Complex Element Support

    • Table conversion to Markdown tables
    • Ordered and unordered list handling
    • Hyperlink preservation in Markdown format
    • Blockquote support with > prefix
  • Detailed Token Analysis

    • Per-line token estimation
    • Average tokens per element type
    • Total document token count
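
The style mapping above (Title → H1, Heading 1-5 → H2-H6) amounts to shifting each Word heading level down by one. A minimal sketch follows. The function names are hypothetical, and using Word's built-in "Quote" style name to trigger blockquotes is an assumption.

```typescript
// Map a Word paragraph style name to a Markdown heading prefix, or null.
function headingPrefix(styleName: string): string | null {
  if (styleName === "Title") return "#";
  const match = /^Heading ([1-5])$/.exec(styleName);
  if (!match) return null;
  // Heading N becomes H(N+1) because Title occupies H1.
  return "#".repeat(Number(match[1]) + 1);
}

// Render one paragraph according to its style.
function paragraphToMarkdown(styleName: string, text: string): string {
  const prefix = headingPrefix(styleName);
  if (prefix) return `${prefix} ${text}`;
  if (styleName === "Quote") return `> ${text}`; // assumed style name
  return text;
}
```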

Common Features Across All Converters

File Upload and Validation

  • Multipart Form Handling

    • Uses Busboy for efficient streaming
    • Handles large files without memory issues
    • See API documentation for request examples
  • Type Validation

    • MIME type checking
    • File extension verification
    • Clear error messages for unsupported formats
  • Size Limits

    • 5MB file size limit (5,242,880 bytes), applied consistently across all converters
    • Prevents system overload
    • Returns clear error for oversized files

Error Handling

Error responses are JSON objects with a descriptive message. For example, an oversized upload returns:

{
  "error": "File size exceeds 5MB limit",
  "maxSizeBytes": 5242880,
  "actualSizeBytes": 6291456
}
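
A validation step producing this error shape might look like the sketch below. The function name, the `ALLOWED` table (limited to two formats for brevity), and the unsupported-type message are illustrative assumptions. Only the size-limit payload is taken from the example above.

```typescript
const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5242880

interface UploadError {
  error: string;
  maxSizeBytes?: number;
  actualSizeBytes?: number;
}

// Extension -> expected MIME type (illustrative subset).
const ALLOWED: Record<string, string> = {
  ".pdf": "application/pdf",
  ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
};

// Returns null when the upload is acceptable, otherwise an error payload.
function validateUpload(filename: string, mimeType: string, sizeBytes: number): UploadError | null {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  const expected = ALLOWED[ext];
  if (!expected || expected !== mimeType) {
    // Hypothetical message; the real wording is not shown in this document.
    return { error: `Unsupported file type: ${ext || "unknown"}` };
  }
  if (sizeBytes > MAX_SIZE_BYTES) {
    return {
      error: "File size exceeds 5MB limit",
      maxSizeBytes: MAX_SIZE_BYTES,
      actualSizeBytes: sizeBytes,
    };
  }
  return null;
}
```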

CORS Support

  • Universal Access
    • Access-Control-Allow-Origin: *
    • OPTIONS request handling for preflight
    • Enables browser-based uploads
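
The CORS behavior above can be sketched as a small pure function that a request handler would call first. Only the `Access-Control-Allow-Origin: *` header comes from this document. The function name, the allowed methods and headers, and the 204 preflight status are assumptions.

```typescript
interface CorsResponse {
  status: number | null; // non-null means "respond now, skip conversion"
  headers: Record<string, string>;
}

// Compute CORS headers for a request; answer OPTIONS preflights directly.
function applyCors(method: string): CorsResponse {
  const headers: Record<string, string> = {
    "Access-Control-Allow-Origin": "*",
  };
  if (method.toUpperCase() === "OPTIONS") {
    // Assumed preflight answer: uploads arrive as POST with a Content-Type header.
    headers["Access-Control-Allow-Methods"] = "POST, OPTIONS";
    headers["Access-Control-Allow-Headers"] = "Content-Type";
    return { status: 204, headers };
  }
  return { status: null, headers };
}
```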

Token Estimation Algorithm

graph TD
    A[Input Text] --> B[Count Words]
    B --> C[Multiply by 1.3]
    C --> D[Add Structural Tokens]
    D --> E[Calculate Markdown Symbols]
    E --> F[Total Token Estimate]
    G[Per-Sheet/Section] --> H[Individual Estimates]
    H --> F
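
The flow above can be expressed as a short function. The 1.3 multiplier comes from this document. How structural and Markdown symbols are weighted is not specified, so counting one token per symbol is an assumption, as are the function names.

```typescript
// Count whitespace-separated words; an empty string has zero words.
function countWords(text: string): number {
  const words = text.trim().split(/\s+/);
  return words[0] === "" ? 0 : words.length;
}

// Estimate tokens for one section: words x 1.3, plus one token per
// JSON structural character and per Markdown symbol (assumed weighting).
function estimateTokens(text: string): number {
  const wordTokens = countWords(text) * 1.3;
  const structural = (text.match(/[{}\[\]:,]/g) || []).length;
  const markdown = (text.match(/[#*|>`-]/g) || []).length;
  return Math.ceil(wordTokens + structural + markdown);
}

// The total is the sum of the individual per-sheet or per-section estimates.
function estimateTotal(sections: string[]): number {
  return sections.reduce((sum, section) => sum + estimateTokens(section), 0);
}
```

For instance, a four-word plain-text section gives 4 × 1.3 = 5.2, rounded up to 6 tokens.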

Unique Capabilities

1. Intelligent Data Processing

  • Empty Row Detection
    • Automatically identifies and removes empty rows
    • Reduces noise in data exports
    • Improves downstream processing efficiency

2. Structural Preservation

  • Hierarchy Maintenance
    • Preserves document structure across formats
    • Maintains heading levels and relationships
    • Enables proper content chunking for RAG

3. Metadata Generation

  • YAML Front Matter
    • Consistent metadata across all Markdown outputs
    • Includes title, date, type, and source
    • Compatible with static site generators

4. Statistics and Analytics

{
  "statistics": {
    "fileSize": {
      "bytes": 102400,
      "KB": 100,
      "MB": 0.1
    },
    "sheets": [{
      "name": "Sheet1",
      "cellCount": 500,
      "emptyRate": 0.2,
      "nonEmptyCells": 400
    }],
    "estimatedTokens": {
      "total": 1250,
      "perSheet": [1250]
    }
  }
}
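
The per-sheet figures in this payload follow from simple counting: emptyRate is the share of empty cells, so 500 cells with 400 non-empty gives (500 − 400) / 500 = 0.2. A minimal sketch is shown below. The function names and rounding to two decimals are assumptions.

```typescript
interface SheetStats {
  name: string;
  cellCount: number;
  emptyRate: number;
  nonEmptyCells: number;
}

// Count cells and non-empty cells, then derive the empty rate.
function sheetStatistics(name: string, rows: string[][]): SheetStats {
  let cellCount = 0;
  let nonEmptyCells = 0;
  for (const row of rows) {
    for (const cell of row) {
      cellCount++;
      if (cell.trim() !== "") nonEmptyCells++;
    }
  }
  const emptyRate = cellCount === 0 ? 0 : (cellCount - nonEmptyCells) / cellCount;
  return { name, cellCount, emptyRate, nonEmptyCells };
}

// Express a byte count in KB and MB, rounded to two decimals (assumed precision).
function fileSize(bytes: number): { bytes: number; KB: number; MB: number } {
  const round2 = (n: number): number => Math.round(n * 100) / 100;
  return { bytes, KB: round2(bytes / 1024), MB: round2(bytes / (1024 * 1024)) };
}
```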

Use Cases

1. AI/LLM Pipeline Integration

  • Convert business documents to LLM-ready formats
  • Prepare training data with consistent structure
  • Enable RAG systems with properly chunked content

2. Content Migration

  • Migrate from proprietary formats to open standards
  • Preserve formatting during platform transitions
  • Batch convert document libraries

3. Document Analysis

  • Extract structured data from PDFs
  • Analyze spreadsheet content programmatically
  • Generate statistics for large document sets
  • View API examples for using statistics data

4. Web Publishing

  • Convert Word (.docx) documents to clean HTML
  • Transform Excel (.xlsx, .xls, .xlsm) data to web-ready tables
  • Generate Markdown for documentation sites
  • See integration examples for web applications

Performance Characteristics

Metric | Value
Max File Size | 5MB
Processing Model | Streaming
Memory Efficiency | Buffer-based
Error Recovery | Graceful degradation
Concurrent Requests | Cloud Function auto-scaling

Security Considerations

  • Input Validation

    • File type verification rejects unsupported or mislabeled uploads
    • Size limits reduce exposure to resource-exhaustion (DoS) attacks
    • No execution of embedded scripts
  • Data Privacy

    • No persistent storage of uploaded files
    • Processing in memory only
    • Stateless function architecture

6. PowerPoint (.pptx, .ppsx, .potx) to Markdown Converter

Endpoint: pptxToMarkdown
Supported Formats: .pptx, .ppsx, .potx
Output: Structured Markdown with slide content

Key Features

  • Slide Content Extraction

    • Extracts slide titles and body text
    • Preserves slide hierarchy and order
    • Converts bullet points to Markdown lists
  • Speaker Notes

    • Captures presenter notes from each slide
    • Formats as separate section in output
    • Useful for documentation and training materials
  • Metadata Generation

    • Slide count and structure
    • Total word count
    • Speaker notes indicators
    • Token estimation for presentations
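
To illustrate the slide and speaker-notes output described above, here is a rough sketch that renders already-extracted slides as Markdown. The `Slide` shape, the function name, and the exact heading layout are assumptions. Parsing the .pptx file itself is out of scope.

```typescript
// Hypothetical shape of one extracted slide.
interface Slide {
  title: string;
  bullets: string[];
  notes?: string;
}

// Render slides in order: one H2 per slide, bullets as a list,
// and speaker notes (when present) as a separate subsection.
function slidesToMarkdown(slides: Slide[]): string {
  const sections = slides.map((slide, index) => {
    const lines = [
      `## Slide ${index + 1}: ${slide.title}`,
      "",
      ...slide.bullets.map((bullet) => `- ${bullet}`),
    ];
    if (slide.notes) {
      lines.push("", "### Speaker Notes", "", slide.notes);
    }
    return lines.join("\n");
  });
  return sections.join("\n\n") + "\n";
}
```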

Future Enhancement Opportunities

  • Formula Evaluation

    • Current Excel (.xlsx, .xls, .xlsm) converters show formulas as text
    • Could add formula calculation capability
  • Macro Execution

    • VBA macros currently ignored
    • Could add sandboxed macro execution
  • Chunking Service

    • Currently outputs full documents
    • Could add intelligent chunking for RAG