Last updated: Jul 25, 2025, 10:08 AM UTC

Core Conversion Service PRD

Product Name: Convert to Markdown
Document Version: 1.0
Last Updated: 2025-07-23
Status: Implemented
Author: Product Team

Executive Summary

Convert to Markdown is a cloud-based document conversion service that transforms business documents (Excel, PDF, Word, PowerPoint) into web-friendly formats (JSON, Markdown, HTML). It lets AI/LLM systems, developers, and businesses extract and process information from traditional office documents efficiently.

Key Objectives

  1. Provide high-quality, accurate document conversion
  2. Support all major business document formats
  3. Optimize output for AI/LLM consumption
  4. Maintain zero-storage architecture for security
  5. Deliver sub-3-second conversion times for files up to 5MB

User Stories

As a Developer

  • I want to convert Excel files to JSON so I can process spreadsheet data programmatically
  • I want to extract text from PDFs as Markdown to feed into my AI system
  • I want to preserve document structure and formatting during conversion
  • I want clear error messages when conversions fail

As a Business User

  • I want to convert my reports to web-friendly formats for online publishing
  • I want to extract data from multiple sheets in Excel files
  • I want to maintain table structures when converting to Markdown
  • I want to process large documents without timeout errors

As an AI/LLM User

  • I want document content optimized for token efficiency
  • I want metadata about the conversion (token count, file size)
  • I want clean, well-structured output without artifacts
  • I want to handle documents up to 50MB for comprehensive analysis

Functional Requirements

1. File Format Support

Excel Conversion

  • Input Formats: .xlsx, .xls, .xlsm, .xlsb
  • Output Formats: JSON, Markdown
  • Features:
    • Multi-sheet support with sheet filtering
    • Preserve formulas and calculated values
    • Handle merged cells and formatting
    • Support for pivot tables and charts (as data)
    • Macro-enabled file support (content only)

PDF Conversion

  • Input Formats: .pdf (all versions)
  • Output Format: Markdown
  • Features:
    • Text extraction with layout preservation
    • Table detection and conversion
    • Image placeholder generation
    • Multi-column layout handling
    • OCR support for scanned documents (future)

Word Conversion

  • Input Formats: .docx, .doc, .dotx, .dotm
  • Output Formats: HTML, Markdown
  • Features:
    • Style preservation (headings, bold, italic)
    • Table conversion with alignment
    • List handling (bullet, numbered)
    • Footnote and endnote support
    • Track changes extraction (future)

PowerPoint Conversion

  • Input Formats: .pptx, .ppt, .ppsx, .potx
  • Output Format: Markdown
  • Features:
    • Slide-by-slide conversion
    • Speaker notes extraction
    • Text from shapes and text boxes
    • Table and chart data extraction
    • Slide metadata (title, number)

2. Conversion Quality Standards

Text Fidelity

  • 99.9% character accuracy for standard text
  • Proper Unicode support for international characters
  • Whitespace normalization without content loss
  • Special character preservation

Structure Preservation

  • Maintain document hierarchy (headings, sections)
  • Preserve table structures with proper alignment
  • Keep list formatting and nesting
  • Retain hyperlinks and references
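
As an illustration of table preservation, the following is a minimal sketch of rendering a parsed table as a pipe table in the GitHub Flavored Markdown style. The function names (escapeCell, toMarkdownTable) and the choice of <br> for in-cell line breaks are assumptions made for this example, not the service's actual implementation.

```typescript
// Hypothetical helper: escape characters that would break a Markdown pipe table.
function escapeCell(cell: string): string {
  // A bare pipe would end the cell early; a newline would end the row.
  return cell.replace(/\|/g, "\\|").replace(/\r?\n/g, "<br>");
}

// Hypothetical helper: render a header and rows as a Markdown pipe table.
function toMarkdownTable(header: string[], rows: string[][]): string {
  // Use the widest row so that no trailing cells are silently dropped.
  const width = Math.max(header.length, ...rows.map((row) => row.length));
  const line = (cells: string[]): string => {
    // Pad short rows with empty cells so every row has the same column count.
    const padded = Array.from({ length: width }, (_, i) => escapeCell(cells[i] ?? ""));
    return `| ${padded.join(" | ")} |`;
  };
  const separator = `| ${Array.from({ length: width }, () => "---").join(" | ")} |`;
  return [line(header), separator, ...rows.map(line)].join("\n");
}
```

Padding to the widest row rather than truncating to the header width is a deliberate choice here, in keeping with the "no data loss" requirement under Data Integrity.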

Data Integrity

  • No data loss during conversion
  • Preserve numerical precision
  • Maintain date/time formats
  • Keep formula results in Excel files

3. Processing Features

Token Estimation

  • Calculate approximate token count for LLM usage
  • Provide character and word counts
  • Estimate processing costs for AI systems
  • Recommend token optimizations
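
The counts above could be produced by something like the following sketch. The field names mirror the statistics object in the Response Format section; the estimator itself is an assumption. It uses a characters-per-token ratio (default 4, a common rule of thumb for English text), because this document does not say which tokenizer or heuristic the service uses.

```typescript
interface TextStats {
  characterCount: number;
  wordCount: number;
  estimatedTokens: number;
}

// Hypothetical estimator. ASSUMPTION: tokens are approximated as
// characters / charsPerToken; a real tokenizer will give different counts.
function estimateTextStats(text: string, charsPerToken = 4): TextStats {
  if (charsPerToken <= 0) {
    throw new RangeError("charsPerToken must be positive");
  }
  const characterCount = text.length; // counts UTF-16 code units
  const trimmed = text.trim();
  // Words are runs of non-whitespace characters.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  const estimatedTokens = Math.ceil(characterCount / charsPerToken);
  return { characterCount, wordCount, estimatedTokens };
}
```

The sample response later in this document pairs 45,000 characters with 15,000 estimated tokens, which would correspond to calling this sketch with charsPerToken set to 3.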

Content Filtering

  • Remove empty rows/columns from tables
  • Strip excessive whitespace
  • Filter specific sheets in Excel files
  • Optionally remove hidden content
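
A minimal sketch of the first two filters, assuming a table has already been parsed into rows of string cells. The Table type and function names are hypothetical, and treating whitespace-only cells as empty is an assumption of this example.

```typescript
type Table = string[][];

// A cell counts as blank if it is missing or contains only whitespace.
function isBlank(cell: string | undefined): boolean {
  return cell === undefined || cell.trim() === "";
}

// Hypothetical filter: drop rows and columns that contain no content.
function stripEmptyRowsAndColumns(table: Table): Table {
  // Drop rows in which every cell is blank.
  const rows = table.filter((row) => row.some((cell) => !isBlank(cell)));
  // Rows may be ragged, so measure the widest remaining row.
  const width = rows.reduce((max, row) => Math.max(max, row.length), 0);
  // Keep a column only if at least one row has content in it.
  const keep: number[] = [];
  for (let col = 0; col < width; col++) {
    if (rows.some((row) => !isBlank(row[col]))) {
      keep.push(col);
    }
  }
  return rows.map((row) => keep.map((col) => row[col] ?? ""));
}
```

Only fully empty rows and columns are removed, so a blank cell inside an otherwise populated row keeps its position and the table stays aligned.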

Metadata Generation

  • File size and type information
  • Conversion statistics
  • Document properties extraction
  • YAML frontmatter for Markdown
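
One way the YAML frontmatter could be assembled is sketched below. The helper names and the choice of metadata keys are assumptions for illustration; the sketch also assumes keys are plain identifiers that need no quoting.

```typescript
type FrontmatterValue = string | number | boolean;

// Hypothetical helper: build a YAML frontmatter block from flat metadata.
function toFrontmatter(meta: Record<string, FrontmatterValue>): string {
  const lines = Object.entries(meta).map(([key, value]) => {
    // JSON string literals are valid YAML double-quoted scalars, so
    // JSON.stringify provides quoting and escaping for string values.
    const rendered = typeof value === "string" ? JSON.stringify(value) : String(value);
    return `${key}: ${rendered}`;
  });
  return ["---", ...lines, "---", ""].join("\n");
}

// Prepend the frontmatter to a Markdown body, separated by a blank line.
function withFrontmatter(markdown: string, meta: Record<string, FrontmatterValue>): string {
  return toFrontmatter(meta) + "\n" + markdown;
}
```

Quoting every string value costs a few characters, but it avoids surprises with values that YAML would otherwise read as booleans, numbers, or mappings (for example a title containing a colon).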

Non-Functional Requirements

Performance

  • Response Time: < 3 seconds for files up to 5MB
  • Throughput: 100 concurrent conversions
  • File Size Limits:
    • Free tier: 5MB
    • Pro tier: 50MB
    • Enterprise: 500MB (custom)
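
The tier limits above can be expressed as a small lookup, sketched here. The names are hypothetical, and interpreting "MB" as 1024 × 1024 bytes is an assumption; the document does not state whether limits are decimal or binary. The Enterprise value is only the listed default, since that tier is marked as custom.

```typescript
type Tier = "free" | "pro" | "enterprise";

const MB = 1024 * 1024; // ASSUMPTION: binary megabytes

// Limits taken from the File Size Limits list above.
const MAX_UPLOAD_BYTES: Record<Tier, number> = {
  free: 5 * MB,
  pro: 50 * MB,
  enterprise: 500 * MB, // listed default; may be customised per customer
};

// Hypothetical check: report whether an upload fits within the tier's limit.
function checkFileSize(tier: Tier, sizeBytes: number): { ok: boolean; limitBytes: number } {
  const limitBytes = MAX_UPLOAD_BYTES[tier];
  return { ok: sizeBytes >= 0 && sizeBytes <= limitBytes, limitBytes };
}
```

Returning the limit alongside the verdict lets a caller include it in an error message without a second lookup.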

Scalability

  • Auto-scale from 0 to 1000 instances
  • Handle 10,000 conversions per minute (peak)
  • Support global deployment across regions
  • Queue-based processing for large files

Reliability

  • Uptime: 99.9% availability
  • Error Rate: < 0.1% failed conversions
  • Recovery: Automatic retry for transient failures
  • Monitoring: Real-time error tracking
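
Automatic retry for transient failures is commonly implemented with capped exponential backoff. The sketch below shows that pattern; the function names, the attempt count, and the delay values are all assumptions, as the document does not specify the retry policy.

```typescript
// Delay before the next attempt: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical wrapper: rerun a task while it fails with a transient error.
async function withRetry<T>(
  task: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      // Give up on permanent errors or once the attempt budget is spent.
      if (lastAttempt || !isTransient(err)) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The isTransient predicate is what keeps permanent errors, such as an unsupported file type, from being retried pointlessly; the injectable sleep makes the wrapper easy to exercise in tests without real delays.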

Security

  • Zero-storage architecture (no file retention)
  • In-memory processing only
  • TLS encryption for all transfers
  • No logging of file contents
  • Input validation and sanitization
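
Part of input validation can be done by checking a file's leading bytes rather than trusting its extension. The sketch below recognises three well-known signatures: "%PDF-" for PDF, the ZIP local file header used by the Office Open XML formats (.xlsx, .docx, .pptx and relatives), and the OLE compound file header used by legacy .xls, .doc, and .ppt files. The type and function names are hypothetical, and whether the service validates uploads this way is not stated in this document.

```typescript
type Container = "pdf" | "zip-ooxml" | "ole-legacy" | "unknown";

const SIGNATURES: Array<{ kind: Container; bytes: number[] }> = [
  { kind: "pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { kind: "zip-ooxml", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK" 0x03 0x04
  { kind: "ole-legacy", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] },
];

// Hypothetical sniffer: classify an upload by its first few bytes.
function sniffContainer(data: Uint8Array): Container {
  for (const { kind, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return kind;
    }
  }
  return "unknown";
}
```

A ZIP signature only identifies the container, not whether the archive holds a spreadsheet, a document, or a presentation, so a real validator would still need to inspect the archive contents.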

Technical Specifications

API Endpoints

POST /xlsx-converter
  Description: Convert Excel to JSON
  Input: multipart/form-data with file
  Output: JSON with sheets array
  
POST /xlsx-to-md
  Description: Convert Excel to Markdown
  Input: multipart/form-data with file
  Optional: sheetPrefix parameter
  Output: JSON with markdown string

POST /pdf-to-md
  Description: Convert PDF to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string

POST /docx-to-html
  Description: Convert Word to HTML
  Input: multipart/form-data with file
  Output: JSON with html string

POST /docx-to-md
  Description: Convert Word to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string

POST /pptx-to-md
  Description: Convert PowerPoint to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string
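
A client might choose among these endpoints with a lookup like the sketch below. The paths are taken from the list above; the helper name, the format labels, and the decision to map only the primary extensions are assumptions, since the document does not say how the other listed input formats (such as .xls or .ppt) are routed.

```typescript
type OutputFormat = "json" | "markdown" | "html";

// Keys are "<extension>:<format>"; values are the endpoint paths listed above.
const ENDPOINTS: Record<string, string> = {
  "xlsx:json": "/xlsx-converter",
  "xlsx:markdown": "/xlsx-to-md",
  "pdf:markdown": "/pdf-to-md",
  "docx:html": "/docx-to-html",
  "docx:markdown": "/docx-to-md",
  "pptx:markdown": "/pptx-to-md",
};

// Hypothetical helper: pick the endpoint for a file and target format,
// or undefined when the combination is not offered.
function resolveEndpoint(filename: string, format: OutputFormat): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) {
    return undefined; // no extension to route on
  }
  const ext = filename.slice(dot + 1).toLowerCase();
  return ENDPOINTS[`${ext}:${format}`];
}
```

For example, resolveEndpoint("report.xlsx", "json") yields "/xlsx-converter", while a PDF requested as HTML yields undefined because no such endpoint is listed.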

Response Format

{
  "success": true,
  "filename": "report.xlsx",
  "processingTime": 1245,
  "statistics": {
    "inputSize": 1048576,
    "outputSize": 524288,
    "estimatedTokens": 15000,
    "characterCount": 45000,
    "wordCount": 7500
  },
  "sheets": [...],   // For Excel
  "markdown": "...", // For Markdown output
  "html": "..."      // For HTML output
}

Error Handling

{
  "success": false,
  "error": "File type not supported",
  "details": "The file appears to be corrupted or is not a valid XLSX file",
  "code": "INVALID_FILE_TYPE",
  "statusCode": 400
}
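
On the client side, the success and error shapes can be modelled as a discriminated union on the success field. The sketch below is one possible typing, not a published SDK. The field names come from the two samples above; the interface names and the element type of sheets are assumptions.

```typescript
interface ConversionStatistics {
  inputSize: number;
  outputSize: number;
  estimatedTokens: number;
  characterCount: number;
  wordCount: number;
}

interface ConversionSuccess {
  success: true;
  filename: string;
  processingTime: number; // units are not stated in the source
  statistics: ConversionStatistics;
  sheets?: unknown[]; // present for Excel to JSON; element shape not specified
  markdown?: string; // present for Markdown output
  html?: string; // present for HTML output
}

interface ConversionFailure {
  success: false;
  error: string;
  details: string;
  code: string;
  statusCode: number;
}

type ConversionResponse = ConversionSuccess | ConversionFailure;

// Hypothetical consumer: the check on `success` narrows the union type.
function describeResponse(res: ConversionResponse): string {
  if (res.success) {
    return `${res.filename}: about ${res.statistics.estimatedTokens} tokens`;
  }
  return `${res.code} (${res.statusCode}): ${res.error}`;
}
```

Branching on success lets the compiler reject code that reads statistics from a failed conversion or an error code from a successful one.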

Success Metrics

Quality Metrics

  • Conversion Accuracy: > 99.5%
  • Customer Satisfaction: > 4.5/5 rating
  • Support Tickets: < 1% of conversions
  • Error Rate: < 0.1%

Performance Metrics

  • Average Response Time: < 2 seconds
  • P95 Response Time: < 5 seconds
  • Uptime: > 99.9%
  • Successful Conversions: > 99%

Business Metrics

  • Daily Active Users: 1,000+
  • Monthly Conversions: 100,000+
  • API Adoption Rate: 50% of users
  • Retention Rate: > 80% monthly

Dependencies

External Libraries

  • xlsx: Excel file parsing
  • pdf-parse: PDF text extraction
  • mammoth: Word document conversion
  • busboy: Multipart form parsing

Infrastructure

  • Google Cloud Functions (Gen2)
  • Node.js 20.x runtime
  • Cloud Storage for temporary processing
  • Firestore for usage tracking

Third-Party Services

  • Token counting algorithms
  • Character encoding libraries
  • Image processing services (future)

Timeline & Milestones

Phase 1: Core Conversion (Completed)

  • Excel to JSON/Markdown
  • PDF to Markdown
  • Word to HTML/Markdown
  • Basic error handling

Phase 2: Enhanced Features (Completed)

  • PowerPoint support
  • Token estimation
  • Sheet filtering
  • Performance optimization

Phase 3: Advanced Features (Q1 2025)

  • OCR support for scanned PDFs
  • Batch conversion API
  • Webhook notifications
  • Custom output formatting

Phase 4: Enterprise Features (Q2 2025)

  • On-premises deployment
  • Custom file size limits
  • Priority processing queues
  • Advanced security options

Risk Mitigation

Technical Risks

  • Large File Handling: Implement streaming for files > 50MB
  • Memory Limitations: Use worker processes for complex conversions
  • Format Compatibility: Maintain test suite with edge cases
  • Performance Degradation: Auto-scaling and load balancing

Business Risks

  • Competitor Features: Regular feature parity analysis
  • Pricing Pressure: Flexible pricing tiers
  • Technology Changes: Modular architecture for adaptability

Future Considerations

Format Expansion

  • CAD file support (DWG, DXF)
  • Email formats (MSG, EML)
  • Archive formats (ZIP with batch processing)
  • Image-based documents (with OCR)

AI Integration

  • Built-in summarization
  • Content classification
  • Language detection
  • Sentiment analysis

Platform Features

  • Desktop application
  • Browser extension
  • Mobile SDK
  • Offline conversion support

Advanced Processing

  • Custom conversion templates
  • Format transformation rules
  • Content redaction/filtering
  • Watermarking support