Document Conversion Service - Product Features and Capabilities

Overview

This document describes the features and capabilities of the document conversion service, based on the implemented converters. The service offers six specialized converters that transform business documents into web-friendly formats optimized for AI/LLM consumption.

Core Converters

API Documentation: For usage instructions, code examples, and integration guides, see the API documentation.

1. Excel (.xlsx, .xls, .xlsm) to JSON Converter

Endpoint: excelToJson
Supported Formats: .xlsx, .xls, .xlsm
Output: Structured JSON with comprehensive statistics

Key Features

Sheet Filtering
- Filter sheets by prefix using sheetPrefix query parameter
- Process only relevant data sheets in multi-sheet workbooks
Intelligent Data Cleaning
- Automatic removal of empty rows
- Whitespace trimming on all cell values
- Detection and removal of empty-like values (-, N/A, null, undefined)
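
The filtering and cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the helper names (filterSheets, cleanCell, cleanRows) are hypothetical, and exact, case-sensitive matching of empty-like values is an assumption.

```typescript
// Hypothetical helpers; the service's real internals are not shown in this document.
type CellValue = string | number | boolean | null | undefined;
type CleanValue = string | number | boolean | null;

// Values treated as "empty-like", taken from the list above (assumed exact match).
const EMPTY_LIKE: string[] = ["", "-", "N/A", "null", "undefined"];

// Keep only sheets whose name starts with the prefix (all sheets when no prefix is given).
function filterSheets(sheetNames: string[], sheetPrefix?: string): string[] {
  if (!sheetPrefix) return sheetNames;
  const prefix: string = sheetPrefix;
  return sheetNames.filter((n) => n.indexOf(prefix) === 0);
}

// Trim string cells and map empty-like values to null.
function cleanCell(value: CellValue): CleanValue {
  if (value === null || value === undefined) return null;
  if (typeof value !== "string") return value;
  const trimmed = value.trim();
  return EMPTY_LIKE.indexOf(trimmed) !== -1 ? null : trimmed;
}

// Clean every cell, then drop rows that have no remaining values.
function cleanRows(rows: CellValue[][]): CleanValue[][] {
  return rows
    .map((row) => row.map((cell) => cleanCell(cell)))
    .filter((row) => row.some((cell) => cell !== null));
}
```

With these definitions, filterSheets(["Data_Q1", "Notes", "Data_Q2"], "Data_") keeps the two Data_ sheets, and a row such as ["N/A", "  "] is dropped entirely.
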
Advanced Analytics
- Cell count statistics per sheet
- Empty rate calculation for data quality assessment
- Non-empty cell counting for content density analysis
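
As a rough sketch of how these per-sheet figures relate to each other, the function below counts cells and derives the empty rate as the fraction of empty cells. The SheetStats shape and the computeSheetStats name are assumptions for illustration; only the field names cellCount, emptyRate, and nonEmptyCells come from this document.

```typescript
interface SheetStats {
  name: string;
  cellCount: number;
  nonEmptyCells: number;
  emptyRate: number; // fraction of cells that are empty, between 0 and 1
}

// Count all cells and the non-empty ones, then derive the empty rate.
function computeSheetStats(
  sheetName: string,
  rows: Array<Array<string | number | null>>
): SheetStats {
  let cellCount = 0;
  let nonEmptyCells = 0;
  rows.forEach((row) => {
    row.forEach((cell) => {
      cellCount += 1;
      if (cell !== null && String(cell).trim() !== "") nonEmptyCells += 1;
    });
  });
  // Avoid dividing by zero for a sheet with no cells.
  const emptyRate = cellCount === 0 ? 0 : (cellCount - nonEmptyCells) / cellCount;
  return { name: sheetName, cellCount, nonEmptyCells, emptyRate };
}
```

A sheet with 10 cells, 2 of them empty, gives cellCount 10, nonEmptyCells 8, and emptyRate 0.2.
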
Token Estimation
- Calculates estimated tokens for LLM consumption
- Multiplies the word count by 1.3 to approximate the token count
- Includes structural JSON tokens in estimate

2. Excel (.xlsx, .xls, .xlsm) to Markdown Converter

Endpoint: excelToMarkdown
Supported Formats: .xlsx, .xls, .xlsm
Output: Clean Markdown with YAML metadata

Key Features

Table Generation
- Converts each Excel (.xlsx, .xls, .xlsm) sheet to a properly formatted Markdown table
- Maintains column alignment and structure
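
A minimal sketch of table generation, assuming the first row of a sheet is the header row (the document does not say how headers are chosen) and using a hypothetical function name. It keeps the column count consistent by padding short rows; it does not pad cell widths for visual alignment.

```typescript
// Render rows as a Markdown table; the first row is assumed to be the header.
function rowsToMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const header = rows[0];
  const body = rows.slice(1);
  const formatRow = (cells: string[]): string => {
    // Pad short rows so every row has as many columns as the header.
    const padded: string[] = header.map((_, i) => cells[i] || "");
    // Escape pipes so cell text cannot break the column structure.
    return "| " + padded.map((c) => c.replace(/\|/g, "\\|")).join(" | ") + " |";
  };
  const separator = "| " + header.map(() => "---").join(" | ") + " |";
  return [formatRow(header), separator].concat(body.map(formatRow)).join("\n");
}
```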

Metadata Preservation

---
title: [filename]
date: [ISO date]
type: spreadsheet
source: Excel (.xlsx, .xls, .xlsm)
---
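
The front matter above could be produced by a small helper along these lines. The function and interface names are hypothetical, and quoting every value is a defensive choice of this sketch (so that a filename containing ":" or "#" stays valid YAML), not something the document specifies.

```typescript
interface FrontMatterFields {
  title: string;
  date: string; // ISO 8601 date string
  type: string;
  source: string;
}

// Build a YAML front matter block with double-quoted values.
function buildFrontMatter(fields: FrontMatterFields): string {
  // Escape backslashes and double quotes inside the quoted YAML scalar.
  const quote = (v: string): string =>
    '"' + v.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
  return [
    "---",
    "title: " + quote(fields.title),
    "date: " + quote(fields.date),
    "type: " + quote(fields.type),
    "source: " + quote(fields.source),
    "---",
  ].join("\n");
}
```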

Multi-line Cell Support
- Handles cells with line breaks using <br> tags
- Preserves complex cell content formatting
Sheet Organization
- Each sheet becomes a separate section with H2 heading
- Clear delineation between different data sections
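
Multi-line cells and per-sheet sections might be handled as in the sketch below. Both function names are hypothetical; the only behavior taken from this document is the use of <br> for line breaks and an H2 heading per sheet.

```typescript
// Replace line breaks inside a cell with <br> so the table row stays on one line,
// and escape pipes so they are not read as column separators.
function escapeCell(text: string): string {
  return text.replace(/\r\n|\r|\n/g, "<br>").replace(/\|/g, "\\|");
}

// Emit one H2 section per sheet; tableMarkdown is the already-rendered table.
function renderSheetSection(sheetName: string, tableMarkdown: string): string {
  return "## " + sheetName + "\n\n" + tableMarkdown + "\n";
}
```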

3. PDF to Markdown Converter

Endpoint: pdfToMarkdown
Supported Formats: .pdf
Output: Structured Markdown with intelligent formatting

Key Features

Intelligent Table Detection
- Automatically detects tables based on spacing patterns
- Converts to properly formatted Markdown tables
- No manual configuration required
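
One plausible way to detect tables from spacing patterns is shown below. The heuristic (columns separated by two or more spaces, and at least two consecutive lines with the same column count) is an assumption for illustration; the document does not describe the service's actual rules.

```typescript
// Split a line into columns wherever two or more whitespace characters (or a tab) appear.
function splitColumns(line: string): string[] {
  return line.trim().split(/\s{2,}|\t/).filter((c) => c.length > 0);
}

interface TableSpan { start: number; end: number; columns: number; }

// Treat a run of at least minRows consecutive lines that share the same
// column count (two or more) as a table. Indices are zero-based and inclusive.
function detectTables(lines: string[], minRows: number = 2): TableSpan[] {
  const tables: TableSpan[] = [];
  let start = -1;
  let columns = 0;
  // Iterate one past the end so a table that runs to the last line is closed.
  for (let i = 0; i <= lines.length; i++) {
    const count = i < lines.length ? splitColumns(lines[i]).length : 0;
    if (start !== -1 && count === columns) continue; // still inside the current run
    if (start !== -1 && i - start >= minRows) {
      tables.push({ start, end: i - 1, columns });
    }
    if (count >= 2) {
      start = i;
      columns = count;
    } else {
      start = -1;
      columns = 0;
    }
  }
  return tables;
}
```

A heuristic like this would misread prose that happens to contain double spaces, so a real detector would likely also use position data from the PDF.
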
Smart Text Formatting
- Converts bullets and special characters
- Normalizes quotes and dashes
- Detects ALL CAPS lines and converts to H2 headings
Label Detection
- Identifies "Label: Value" patterns
- Converts labels to bold formatting
- Preserves document structure semantics
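
The text formatting and label rules above could look roughly like this for a single line. The specific character sets, the three-character minimum for ALL CAPS headings, and the label pattern are assumptions of this sketch, and formatLine is a hypothetical name.

```typescript
// Apply normalization, heading detection, and label bolding to one line of text.
function formatLine(line: string): string {
  // Normalize curly quotes, en/em dashes, and common bullet glyphs.
  const text = line
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"')
    .replace(/[\u2013\u2014]/g, "-")
    .replace(/^\s*[\u2022\u25CF\u25AA]\s*/, "- ");
  const trimmed = text.trim();
  // ALL CAPS line: starts with an uppercase letter, is at least three characters,
  // and contains only uppercase letters, digits, spaces, "&", "/", or "-".
  if (/^[A-Z][A-Z0-9 &\/-]{2,}$/.test(trimmed)) {
    return "## " + trimmed;
  }
  // "Label: Value" line: bold the label and keep the value as-is.
  const match = /^([A-Za-z][A-Za-z0-9 ]{0,40}):\s+(.+)$/.exec(trimmed);
  if (match) return "**" + match[1] + ":** " + match[2];
  return text;
}
```

For instance, "EXECUTIVE SUMMARY" becomes "## EXECUTIVE SUMMARY" and "Invoice Number: 12345" becomes "**Invoice Number:** 12345".
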

4. DOCX to HTML Converter

Endpoint: docxToHtml
Supported Formats: .docx
Output: Clean, semantic HTML

Key Features

Heading Preservation
- Maps Word heading styles to HTML h1-h6 tags
- Maintains document hierarchy
Style Mapping
- Preserves bold, italic, underline formatting
- Converts to semantic HTML tags
Clean Output
- Removes unnecessary style attributes
- Strips class attributes for cleaner HTML
- Removes empty tags automatically
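
The cleanup steps above are sketched below with regular expressions to keep the example short. This is an illustration under that simplifying assumption: cleanHtml is a hypothetical name, and a production implementation would more likely use an HTML parser, since regexes cannot handle every valid HTML input.

```typescript
// Strip style/class attributes and remove tags that contain only whitespace.
function cleanHtml(html: string): string {
  // Drop style="..." and class="..." attributes (double or single quoted).
  let out = html.replace(/\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')/gi, "");
  // Remove empty tags repeatedly, since removing one can leave its parent empty.
  const emptyTag = /<([a-z][a-z0-9]*)\b[^>]*>\s*<\/\1>/gi;
  let previous = "";
  while (previous !== out) {
    previous = out;
    out = out.replace(emptyTag, "");
  }
  return out;
}
```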

5. DOCX to Markdown Converter

Endpoint: docxToMarkdown
Supported Formats: .docx
Output: Feature-rich Markdown

Key Features

Advanced Style Mapping
- Title → H1
- Heading 1-5 → H2-H6
- Bold, italic, underline, code formatting preserved
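
The style mapping above can be expressed as a small lookup. The function names and the RunStyle shape are hypothetical, and rendering underline as inline HTML is an assumption, since Markdown has no native underline syntax and the document does not say how underline is preserved.

```typescript
// Map a Word paragraph style name to a Markdown heading prefix ("" if not a heading).
function headingPrefix(styleName: string): string {
  if (styleName === "Title") return "# ";
  const match = /^Heading ([1-5])$/.exec(styleName);
  if (!match) return "";
  const level = parseInt(match[1], 10) + 1; // Heading 1 -> H2 ... Heading 5 -> H6
  let hashes = "";
  for (let i = 0; i < level; i++) hashes += "#";
  return hashes + " ";
}

interface RunStyle { bold?: boolean; italic?: boolean; underline?: boolean; code?: boolean; }

// Wrap a text run in Markdown inline markers, innermost first.
function formatRun(text: string, style: RunStyle): string {
  let out = text;
  if (style.code) out = "`" + out + "`";
  if (style.underline) out = "<u>" + out + "</u>"; // assumption: inline HTML for underline
  if (style.italic) out = "*" + out + "*";
  if (style.bold) out = "**" + out + "**";
  return out;
}
```
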
Complex Element Support
- Table conversion to Markdown tables
- Ordered and unordered list handling
- Hyperlink preservation in Markdown format
- Blockquote support with > prefix
Detailed Token Analysis
- Per-line token estimation
- Average tokens per element type
- Total document token count

Common Features Across All Converters

File Upload and Validation

Multipart Form Handling
- Uses Busboy for efficient streaming
- Handles large files without memory issues
- See API documentation for request examples
Type Validation
- MIME type checking
- File extension verification
- Clear error messages for unsupported formats
Size Limits
- Consistent 5MB file size limit
- Prevents system overload
- Returns clear error for oversized files

Error Handling

{
  "error": "File size exceeds 5MB limit",
  "maxSizeBytes": 5242880,
  "actualSizeBytes": 6291456
}
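
A sketch of the validation checks that could produce an error like the one above. The validateUpload name, the UploadInfo shape, and the wording of the extension and MIME type errors are assumptions; only the size error payload mirrors this document.

```typescript
const MAX_SIZE_BYTES = 5 * 1024 * 1024; // 5242880, matching maxSizeBytes above

interface UploadInfo { filename: string; mimeType: string; sizeBytes: number; }
interface ValidationError { error: string; maxSizeBytes?: number; actualSizeBytes?: number; }

// Check extension, MIME type, and size; return null when the upload is acceptable.
// allowed maps a lowercase extension (with dot) to its accepted MIME types.
function validateUpload(
  upload: UploadInfo,
  allowed: { [ext: string]: string[] }
): ValidationError | null {
  const dot = upload.filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : upload.filename.slice(dot).toLowerCase();
  const mimeTypes = allowed[ext];
  if (!mimeTypes) {
    return { error: "Unsupported file extension: " + (ext || "(none)") };
  }
  if (mimeTypes.indexOf(upload.mimeType) === -1) {
    return { error: "Unsupported MIME type: " + upload.mimeType };
  }
  if (upload.sizeBytes > MAX_SIZE_BYTES) {
    return {
      error: "File size exceeds 5MB limit",
      maxSizeBytes: MAX_SIZE_BYTES,
      actualSizeBytes: upload.sizeBytes,
    };
  }
  return null;
}
```

In a streaming upload the final size is only known once the stream ends or the limit is hit, so a real handler would likely enforce the limit while reading rather than afterwards.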

CORS Support

Universal Access
- Access-Control-Allow-Origin: *
- OPTIONS request handling for preflight
- Enables browser-based uploads
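
The CORS behavior above might be implemented along these lines, shown here as a pure function so it does not depend on any particular HTTP framework. The applyCors name, the 204 preflight status, and the Allow-Methods, Allow-Headers, and Max-Age values are assumptions; only the wildcard origin and OPTIONS handling come from this document.

```typescript
interface CorsResult {
  statusCode: number;
  headers: { [headerName: string]: string };
  handled: boolean; // true when the request was a preflight and needs no further work
}

// Compute CORS headers for a request method and short-circuit preflight requests.
function applyCors(method: string): CorsResult {
  const headers: { [headerName: string]: string } = {
    "Access-Control-Allow-Origin": "*",
  };
  if (method.toUpperCase() === "OPTIONS") {
    // Assumed preflight values; adjust to the methods and headers the API accepts.
    headers["Access-Control-Allow-Methods"] = "POST, OPTIONS";
    headers["Access-Control-Allow-Headers"] = "Content-Type";
    headers["Access-Control-Max-Age"] = "3600";
    return { statusCode: 204, headers, handled: true };
  }
  return { statusCode: 200, headers, handled: false };
}
```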

Token Estimation Algorithm

graph TD
    A[Input Text] --> B[Count Words]
    B --> C[Multiply by 1.3]
    C --> D[Add Structural Tokens]
    D --> E[Calculate Markdown Symbols]
    E --> F[Total Token Estimate]
    G[Per-Sheet/Section] --> H[Individual Estimates]
    H --> F
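
The flow above could be sketched as follows. The 1.3 word multiplier comes from this document; counting each structural or Markdown symbol as one token, the exact symbol set, and rounding to the nearest integer are assumptions of this sketch, as are the function names.

```typescript
// Count whitespace-separated words.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

// Count JSON punctuation and Markdown symbols (assumed: one token each).
function countStructuralSymbols(text: string): number {
  const matches = text.match(/[{}\[\]:,"|#*>`]/g);
  return matches ? matches.length : 0;
}

// Words x 1.3, rounded, plus structural and Markdown symbols.
function estimateTokens(text: string): number {
  return Math.round(countWords(text) * 1.3) + countStructuralSymbols(text);
}

// Estimate each sheet or section separately, then sum into the document total.
function estimateDocumentTokens(sections: string[]): { total: number; perSection: number[] } {
  const perSection = sections.map((s) => estimateTokens(s));
  const total = perSection.reduce((sum, n) => sum + n, 0);
  return { total, perSection };
}
```

Under these assumptions, "hello world foo bar" is estimated at 5 tokens (4 words x 1.3, rounded), and "## Title" also at 5 (2 words x 1.3 rounds to 3, plus 2 "#" symbols).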

Unique Capabilities

1. Intelligent Data Processing

Empty Row Detection
- Automatically identifies and removes empty rows
- Reduces noise in data exports
- Improves downstream processing efficiency

2. Structural Preservation

Hierarchy Maintenance
- Preserves document structure across formats
- Maintains heading levels and relationships
- Enables proper content chunking for RAG

3. Metadata Generation

YAML Front Matter
- Consistent metadata across all Markdown outputs
- Includes title, date, type, and source
- Compatible with static site generators

4. Statistics and Analytics

{
  "statistics": {
    "fileSize": {
      "bytes": 102400,
      "KB": 100,
      "MB": 0.1
    },
    "sheets": [{
      "name": "Sheet1",
      "cellCount": 500,
      "emptyRate": 0.2,
      "nonEmptyCells": 400
    }],
    "estimatedTokens": {
      "total": 1250,
      "perSheet": [650, 600]
    }
  }
}

Use Cases

1. AI/LLM Pipeline Integration

- Convert business documents to LLM-ready formats
- Prepare training data with consistent structure
- Feed RAG systems with structured content that is straightforward to chunk

2. Content Migration

- Migrate from proprietary formats to open standards
- Preserve formatting during platform transitions
- Batch convert document libraries

3. Document Analysis

- Extract structured data from PDFs
- Analyze spreadsheet content programmatically
- Generate statistics for large document sets
- See the API documentation for examples of using statistics data

4. Web Publishing

- Convert Word (.docx) documents to clean HTML
- Transform Excel (.xlsx, .xls, .xlsm) data to web-ready tables
- Generate Markdown for documentation sites
- See the API documentation for web application integration examples

Performance Characteristics

Metric	Value
Max File Size	5MB
Processing Model	Streaming
Memory Efficiency	Buffer-based
Error Recovery	Graceful degradation
Concurrent Requests	Cloud Function auto-scaling

Security Considerations

Input Validation
- File type verification rejects unsupported uploads before processing
- Size limits reduce exposure to resource-exhaustion (DoS) attacks
- No execution of embedded scripts
Data Privacy
- No persistent storage of uploaded files
- Processing in memory only
- Stateless function architecture

6. PowerPoint (.pptx, .ppsx, .potx) to Markdown Converter

Endpoint: pptxToMarkdown
Supported Formats: .pptx, .ppsx, .potx
Output: Structured Markdown with slide content

Key Features

Slide Content Extraction
- Extracts slide titles and body text
- Preserves slide hierarchy and order
- Converts bullet points to Markdown lists
Speaker Notes
- Captures presenter notes from each slide
- Formats as separate section in output
- Useful for documentation and training materials
Metadata Generation
- Slide count and structure
- Total word count
- Speaker notes indicators
- Token estimation for presentations
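
A rough sketch of how extracted slide content might be rendered. The SlideContent shape, the "Slide N: Title" heading format, and the "Speaker Notes" subsection heading are assumptions for illustration; parsing the .pptx file itself is out of scope here.

```typescript
interface SlideContent { title: string; bullets: string[]; notes?: string; }

// Render slides in order: an H2 per slide, bullets as a list, notes as a subsection.
function slidesToMarkdown(slides: SlideContent[]): string {
  const parts: string[] = [];
  slides.forEach((slide, index) => {
    parts.push("## Slide " + (index + 1) + ": " + slide.title);
    if (slide.bullets.length > 0) {
      parts.push(slide.bullets.map((b) => "- " + b).join("\n"));
    }
    if (slide.notes && slide.notes.trim() !== "") {
      parts.push("### Speaker Notes\n\n" + slide.notes.trim());
    }
  });
  return parts.join("\n\n") + "\n";
}

// Summarize slide count, total words, and how many slides carry speaker notes.
function presentationMetadata(slides: SlideContent[]): {
  slideCount: number; wordCount: number; slidesWithNotes: number;
} {
  let wordCount = 0;
  let slidesWithNotes = 0;
  slides.forEach((slide) => {
    const notes = slide.notes ? slide.notes.trim() : "";
    if (notes !== "") slidesWithNotes += 1;
    const allText = [slide.title].concat(slide.bullets).concat([notes]).join(" ");
    wordCount += allText.trim().split(/\s+/).filter((w) => w.length > 0).length;
  });
  return { slideCount: slides.length, wordCount, slidesWithNotes };
}
```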

Future Enhancement Opportunities

Formula Evaluation
- Current Excel (.xlsx, .xls, .xlsm) converters show formulas as text
- Could add formula calculation capability
Macro Execution
- VBA macros currently ignored
- Could add sandboxed macro execution
Chunking Service
- Currently outputs full documents
- Could add intelligent chunking for RAG