Document Conversion Service - Product Features and Capabilities
Overview
This document describes the features and capabilities of the document conversion service, based on the implemented converters. The service offers six specialized converters that transform business documents into web-friendly formats optimized for AI/LLM consumption.
Core Converters
API Documentation: For detailed usage instructions, code examples, and integration guides, see our comprehensive API documentation.
1. Excel (.xlsx, .xls, .xlsm) to JSON Converter
Endpoint: excelToJson
Supported Formats: .xlsx, .xls, .xlsm
Output: Structured JSON with comprehensive statistics
Key Features
Sheet Filtering
- Filter sheets by prefix using the sheetPrefix query parameter
- Process only relevant data sheets in multi-sheet workbooks
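As a rough illustration of prefix-based sheet filtering, the sketch below keeps only sheet names that start with a given prefix. The function name `filter_sheets` is hypothetical, and case-sensitive matching is an assumption; the service's actual matching rules are not stated here.

```python
def filter_sheets(sheet_names, sheet_prefix=None):
    """Return the sheet names that start with sheet_prefix.

    Assumption: matching is case-sensitive and an empty or missing
    prefix means "keep every sheet".
    """
    if not sheet_prefix:
        return list(sheet_names)
    return [name for name in sheet_names if name.startswith(sheet_prefix)]


if __name__ == "__main__":
    print(filter_sheets(["Data_Q1", "Data_Q2", "Notes"], "Data_"))
```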
Intelligent Data Cleaning
- Automatic removal of empty rows
- Whitespace trimming on all cell values
- Detection and removal of empty-like values (-, N/A, null, undefined)
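The cleaning rules above can be sketched as follows. This is a minimal illustration, not the service's implementation: the names `EMPTY_LIKE`, `clean_cell`, and `clean_rows` are hypothetical, and representing removed values as `None` is an assumption.

```python
# Values treated as empty after trimming, taken from the list above.
EMPTY_LIKE = {"", "-", "N/A", "null", "undefined"}


def clean_cell(value):
    """Trim whitespace on strings and map empty-like values to None."""
    if value is None:
        return None
    if not isinstance(value, str):
        return value  # numbers, booleans, dates pass through unchanged
    text = value.strip()
    return None if text in EMPTY_LIKE else text


def clean_rows(rows):
    """Clean every cell, then drop rows that contain no remaining values."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    return [row for row in cleaned if any(cell is not None for cell in row)]


if __name__ == "__main__":
    print(clean_rows([[" Alice ", "N/A"], ["-", "  "], ["null", 42]]))
```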
Advanced Analytics
- Cell count statistics per sheet
- Empty rate calculation for data quality assessment
- Non-empty cell counting for content density analysis
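One plausible way to derive these figures is shown below. The field names follow the sample statistics object later in this document; the function name `sheet_statistics`, the two-decimal rounding, and the choice to treat `None` and empty strings as empty cells are assumptions.

```python
def sheet_statistics(name, rows):
    """Compute cell count, non-empty cell count, and empty rate for a sheet."""
    cell_count = sum(len(row) for row in rows)
    non_empty = sum(1 for row in rows for cell in row if cell not in (None, ""))
    # Guard against division by zero for sheets without any cells.
    empty_rate = (cell_count - non_empty) / cell_count if cell_count else 0.0
    return {
        "name": name,
        "cellCount": cell_count,
        "emptyRate": round(empty_rate, 2),
        "nonEmptyCells": non_empty,
    }


if __name__ == "__main__":
    rows = [["a", "b", None, "c", "d"], ["e", "", "f", "g", "h"]]
    print(sheet_statistics("Sheet1", rows))
```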
Token Estimation
- Calculates estimated tokens for LLM consumption
- Uses 1.3x word multiplier for average token count
- Includes structural JSON tokens in estimate
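The estimate described above can be approximated as in the sketch below. Only the 1.3x word multiplier comes from this document; counting one token per JSON punctuation character, the word regex, and the name `estimate_tokens` are assumptions made for illustration.

```python
import json
import math
import re

WORD_MULTIPLIER = 1.3  # from the feature description above


def estimate_tokens(data):
    """Rough token estimate for a JSON-serializable value."""
    serialized = json.dumps(data, ensure_ascii=False)
    words = re.findall(r"\w+", serialized)
    # Assumption: each structural character counts as one token. Punctuation
    # inside string values is counted too, so this is only an approximation.
    structural = sum(serialized.count(ch) for ch in "{}[]:,")
    return math.ceil(len(words) * WORD_MULTIPLIER) + structural


if __name__ == "__main__":
    print(estimate_tokens({"name": "Ada Lovelace", "age": 36}))
```

A heuristic like this is cheap and tokenizer-independent; for exact counts a caller would need to run the target model's own tokenizer.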
2. Excel (.xlsx, .xls, .xlsm) to Markdown Converter
Endpoint: excelToMarkdown
Supported Formats: .xlsx, .xls, .xlsm
Output: Clean Markdown with YAML metadata
Key Features
Table Generation
- Converts each Excel sheet to a properly formatted Markdown table
- Maintains column alignment and structure
Metadata Preservation
---
title: [filename]
date: [ISO date]
type: spreadsheet
source: Excel (.xlsx, .xls, .xlsm)
---
Multi-line Cell Support
- Handles cells with line breaks using <br> tags
- Preserves complex cell content formatting
Sheet Organization
- Each sheet becomes a separate section with H2 heading
- Clear delineation between different data sections
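To make the table generation concrete, here is a minimal sketch that renders one sheet as an H2 section with a Markdown table and replaces line breaks inside cells with `<br>`. The function names are hypothetical, treating the first row as the header is an assumption, and escaping pipe characters is an added safeguard that this document does not mention.

```python
def cell_to_markdown(value):
    """Render one cell: newlines become <br>, pipes are escaped."""
    text = "" if value is None else str(value)
    text = text.replace("\r\n", "\n").replace("\n", "<br>")
    return text.replace("|", "\\|")  # assumption: keep pipes from breaking columns


def sheet_to_markdown(name, rows):
    """Render a sheet as an H2 heading followed by a Markdown table."""
    if not rows:
        return f"## {name}"
    header, *body = rows  # assumption: the first row holds column headers
    lines = [f"## {name}", ""]
    lines.append("| " + " | ".join(cell_to_markdown(c) for c in header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(cell_to_markdown(c) for c in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(sheet_to_markdown("Sales", [["Item", "Note"], ["Widget", "line1\nline2"]]))
```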
3. PDF to Markdown Converter
Endpoint: pdfToMarkdown
Supported Formats: .pdf
Output: Structured Markdown with intelligent formatting
Key Features
Intelligent Table Detection
- Automatically detects tables based on spacing patterns
- Converts to properly formatted Markdown tables
- No manual configuration required
Smart Text Formatting
- Converts bullets and special characters
- Normalizes quotes and dashes
- Detects ALL CAPS lines and converts to H2 headings
Label Detection
- Identifies "Label: Value" patterns
- Converts labels to bold formatting
- Preserves document structure semantics
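The formatting rules above could be approximated line by line as in this sketch. It is illustrative only: the function names, the bullet characters handled, the three-letter minimum for ALL CAPS headings, the label regex, and the two-space column heuristic are all assumptions rather than the converter's actual rules.

```python
import re

# Assumption: a label is a short phrase starting with a letter, followed by ": value".
LABEL_PATTERN = re.compile(r"^([A-Za-z][\w /&-]{0,40}):\s+(\S.*)$")


def split_columns(line):
    """Split a line on runs of two or more spaces (a simple table heuristic)."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def format_line(line):
    """Apply quote/dash normalization, bullet, heading, and label rules."""
    text = line.strip()
    for curly, plain in (("\u201c", '"'), ("\u201d", '"'),
                         ("\u2018", "'"), ("\u2019", "'"),
                         ("\u2013", "-"), ("\u2014", "-")):
        text = text.replace(curly, plain)
    if text[:1] in ("\u2022", "\u25cf", "\u25aa"):  # common bullet glyphs
        return "- " + text[1:].strip()
    letters = [ch for ch in text if ch.isalpha()]
    if len(letters) >= 3 and all(ch.isupper() for ch in letters):
        return "## " + text
    match = LABEL_PATTERN.match(text)
    if match:
        return f"**{match.group(1)}:** {match.group(2)}"
    return text


if __name__ == "__main__":
    for sample in ["EXECUTIVE SUMMARY", "Invoice Number: 12345", "\u2022 First item"]:
        print(format_line(sample))
    print(split_columns("Name    Qty   Price"))
```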
4. DOCX to HTML Converter
Endpoint: docxToHtml
Supported Formats: .docx
Output: Clean, semantic HTML
Key Features
Heading Preservation
- Maps Word heading styles to HTML h1-h6 tags
- Maintains document hierarchy
Style Mapping
- Preserves bold, italic, underline formatting
- Converts to semantic HTML tags
Clean Output
- Removes unnecessary style attributes
- Strips class attributes for cleaner HTML
- Removes empty tags automatically
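The cleanup steps above can be illustrated with a small regex-based pass. This is a simplified sketch under stated assumptions: `clean_html` is a hypothetical name, a production converter would more likely use a real HTML parser, and leaving empty `td`/`th` cells in place is a design choice made here so tables keep their shape.

```python
import re

# Matches style="..." or class="..." attributes, with single or double quotes.
ATTRIBUTE_PATTERN = re.compile(r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')""")
# Matches an element with no content, except table cells (assumption).
EMPTY_TAG_PATTERN = re.compile(r"<(?!td\b|th\b)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>\s*</\1>")


def clean_html(html):
    """Strip style/class attributes, then remove empty elements repeatedly."""
    html = ATTRIBUTE_PATTERN.sub("", html)
    previous = None
    while previous != html:  # repeat so newly emptied parents are removed too
        previous = html
        html = EMPTY_TAG_PATTERN.sub("", html)
    return html


if __name__ == "__main__":
    raw = '<p class="MsoNormal" style="margin:0">Hello <strong>world</strong></p><p><span></span></p>'
    print(clean_html(raw))
```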
5. DOCX to Markdown Converter
Endpoint: docxToMarkdown
Supported Formats: .docx
Output: Feature-rich Markdown
Key Features
Advanced Style Mapping
- Title → H1
- Heading 1-5 → H2-H6
- Bold, italic, underline, code formatting preserved
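The heading mapping above amounts to shifting Word heading levels down by one so that the document title can occupy H1. A minimal sketch, assuming the style names are exactly "Title" and "Heading 1" through "Heading 5" (the function names are hypothetical):

```python
import re


def style_to_heading_level(style_name):
    """Map a Word paragraph style name to a Markdown heading level, or None."""
    if style_name == "Title":
        return 1
    match = re.fullmatch(r"Heading ([1-5])", style_name)
    if match:
        return int(match.group(1)) + 1  # Heading 1 -> H2, ..., Heading 5 -> H6
    return None


def paragraph_to_markdown(style_name, text):
    """Prefix text with '#' characters when the style maps to a heading."""
    level = style_to_heading_level(style_name)
    return f"{'#' * level} {text}" if level else text


if __name__ == "__main__":
    print(paragraph_to_markdown("Title", "Quarterly Report"))
    print(paragraph_to_markdown("Heading 1", "Introduction"))
```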
Complex Element Support
- Table conversion to Markdown tables
- Ordered and unordered list handling
- Hyperlink preservation in Markdown format
- Blockquote support with > prefix
Detailed Token Analysis
- Per-line token estimation
- Average tokens per element type
- Total document token count
Common Features Across All Converters
File Upload and Validation
Multipart Form Handling
- Uses Busboy for efficient streaming
- Handles large files without memory issues
- See API documentation for request examples
Type Validation
- MIME type checking
- File extension verification
- Clear error messages for unsupported formats
Size Limits
- Consistent 5MB file size limit
- Prevents system overload
- Returns clear error for oversized files
Error Handling
{
  "error": "File size exceeds 5MB limit",
  "maxSizeBytes": 5242880,
  "actualSizeBytes": 6291456
}
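The validation checks and the error shape above could fit together roughly as follows. This is a sketch, not the service's code: `validate_upload` and `ALLOWED_TYPES` are hypothetical names, only three formats are listed for brevity, and the wording of the unsupported-type message is an assumption. The oversized-file response mirrors the sample error object.

```python
import os

MAX_SIZE_BYTES = 5 * 1024 * 1024  # 5242880, matching the documented 5MB limit

# Assumption: each extension is paired with its standard MIME type.
ALLOWED_TYPES = {
    ".pdf": {"application/pdf"},
    ".docx": {"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    ".xlsx": {"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"},
}


def validate_upload(filename, mime_type, size_bytes):
    """Return an error dict for a rejected upload, or None if it is acceptable."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_TYPES or mime_type not in ALLOWED_TYPES[extension]:
        # Hypothetical message; the real wording is not documented here.
        return {"error": f"Unsupported file type: {extension or 'unknown'}"}
    if size_bytes > MAX_SIZE_BYTES:
        return {
            "error": "File size exceeds 5MB limit",
            "maxSizeBytes": MAX_SIZE_BYTES,
            "actualSizeBytes": size_bytes,
        }
    return None


if __name__ == "__main__":
    print(validate_upload("report.pdf", "application/pdf", 6291456))
```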
CORS Support
- Universal access via Access-Control-Allow-Origin: *
- OPTIONS request handling for preflight
- Enables browser-based uploads
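A framework-agnostic sketch of the preflight handling described above is given below. Only the `Access-Control-Allow-Origin: *` header comes from this document; the allowed methods and headers, the status codes, and the `handle_request` function are assumptions for illustration.

```python
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",              # from the feature list above
    "Access-Control-Allow-Methods": "POST, OPTIONS",  # assumption
    "Access-Control-Allow-Headers": "Content-Type",   # assumption
}


def handle_request(method, convert):
    """Return (status, headers, body) for a request.

    `convert` is a hypothetical callable that performs the conversion
    and returns the response body.
    """
    if method == "OPTIONS":
        # Preflight: answer with CORS headers and no body.
        return 204, dict(CORS_HEADERS), ""
    if method != "POST":
        return 405, dict(CORS_HEADERS), "Method Not Allowed"
    return 200, dict(CORS_HEADERS), convert()


if __name__ == "__main__":
    print(handle_request("OPTIONS", lambda: "unused"))
```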
Token Estimation Algorithm
Unique Capabilities
1. Intelligent Data Processing
- Empty Row Detection
- Automatically identifies and removes empty rows
- Reduces noise in data exports
- Improves downstream processing efficiency
2. Structural Preservation
- Hierarchy Maintenance
- Preserves document structure across formats
- Maintains heading levels and relationships
- Enables proper content chunking for RAG
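Because heading levels survive conversion, a downstream consumer can split the Markdown output at headings before indexing it. The sketch below shows one such client-side approach; it is not part of the service, and `chunk_by_headings` is a hypothetical helper.

```python
def chunk_by_headings(markdown, level=2):
    """Split Markdown into chunks, starting a new chunk at each heading of `level`."""
    marker = "#" * level + " "
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith(marker) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    doc = "# Doc\nintro\n## A\ntext a\n## B\ntext b"
    for chunk in chunk_by_headings(doc):
        print(repr(chunk))
```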
3. Metadata Generation
- YAML Front Matter
- Consistent metadata across all Markdown outputs
- Includes title, date, type, and source
- Compatible with static site generators
4. Statistics and Analytics
{
  "statistics": {
    "fileSize": {
      "bytes": 102400,
      "KB": 100,
      "MB": 0.1
    },
    "sheets": [{
      "name": "Sheet1",
      "cellCount": 500,
      "emptyRate": 0.2,
      "nonEmptyCells": 400
    }],
    "estimatedTokens": {
      "total": 1250,
      "perSheet": [650, 600]
    }
  }
}
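The `fileSize` block in the sample above is consistent with binary units (1 KB = 1024 bytes) rounded to a couple of decimals. A minimal sketch under that assumption, with a hypothetical function name:

```python
def file_size_stats(size_bytes):
    """Express a byte count in KB and MB using 1024-based units."""
    return {
        "bytes": size_bytes,
        "KB": round(size_bytes / 1024, 2),           # assumption: two decimals
        "MB": round(size_bytes / (1024 * 1024), 2),  # assumption: two decimals
    }


if __name__ == "__main__":
    print(file_size_stats(102400))
```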
Use Cases
1. AI/LLM Pipeline Integration
- Convert business documents to LLM-ready formats
- Prepare training data with consistent structure
- Enable RAG systems with properly chunked content
2. Content Migration
- Migrate from proprietary formats to open standards
- Preserve formatting during platform transitions
- Batch convert document libraries
3. Document Analysis
- Extract structured data from PDFs
- Analyze spreadsheet content programmatically
- Generate statistics for large document sets
- View API examples for using statistics data
4. Web Publishing
- Convert Word documents to clean HTML
- Transform Excel data to web-ready tables
- Generate Markdown for documentation sites
- See integration examples for web applications
Performance Characteristics
Metric | Value |
---|---|
Max File Size | 5MB |
Processing Model | Streaming |
Memory Efficiency | Buffer-based |
Error Recovery | Graceful degradation |
Concurrent Requests | Cloud Function auto-scaling |
Security Considerations
Input Validation
- File type verification prevents malicious uploads
- Size limits prevent DoS attacks
- No execution of embedded scripts
Data Privacy
- No persistent storage of uploaded files
- Processing in memory only
- Stateless function architecture
6. PowerPoint (.pptx, .ppsx, .potx) to Markdown Converter
Endpoint: pptxToMarkdown
Supported Formats: .pptx
, .ppsx
, .potx
Output: Structured Markdown with slide content
Key Features
Slide Content Extraction
- Extracts slide titles and body text
- Preserves slide hierarchy and order
- Converts bullet points to Markdown lists
Speaker Notes
- Captures presenter notes from each slide
- Formats as separate section in output
- Useful for documentation and training materials
Metadata Generation
- Slide count and structure
- Total word count
- Speaker notes indicators
- Token estimation for presentations
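The slide-to-Markdown mapping described above might look like the following sketch. The input shape (a list of dicts with `title`, `bullets`, and `notes` keys), the heading levels, and the "Speaker Notes" label are assumptions chosen for illustration; the converter's real output layout may differ.

```python
def slides_to_markdown(slides):
    """Render slides in order: title as H2, bullets as a list, notes as a subsection."""
    lines = []
    for number, slide in enumerate(slides, start=1):
        title = slide.get("title") or f"Slide {number}"  # fallback for untitled slides
        lines.append(f"## {title}")
        lines.append("")
        for bullet in slide.get("bullets", []):
            lines.append(f"- {bullet}")
        notes = slide.get("notes")
        if notes:
            lines.extend(["", "### Speaker Notes", "", notes])
        lines.append("")
    return "\n".join(lines).strip()


if __name__ == "__main__":
    deck = [{"title": "Intro", "bullets": ["One", "Two"], "notes": "Greet audience"}]
    print(slides_to_markdown(deck))
```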
Future Enhancement Opportunities
Formula Evaluation
- Current Excel converters show formulas as text
- Could add formula calculation capability
Macro Execution
- VBA macros currently ignored
- Could add sandboxed macro execution
Chunking Service
- Currently outputs full documents
- Could add intelligent chunking for RAG