Core Conversion Service PRD
Product Name: Convert to Markdown
Document Version: 1.0
Last Updated: 2025-07-23
Status: Implemented
Author: Product Team
Executive Summary
Convert to Markdown is a cloud-based document conversion service that transforms business documents (Excel, PDF, Word, PowerPoint) into web-friendly formats (JSON, Markdown, HTML). The service is designed to enable AI/LLM systems, developers, and businesses to extract and process information from traditional office documents efficiently.
Key Objectives
- Provide high-quality, accurate document conversion
- Support all major business document formats
- Optimize output for AI/LLM consumption
- Maintain zero-storage architecture for security
- Deliver sub-3-second conversion times for files up to 5MB
User Stories
As a Developer
- I want to convert Excel files to JSON so I can process spreadsheet data programmatically
- I want to extract text from PDFs as Markdown to feed into my AI system
- I want to preserve document structure and formatting during conversion
- I want clear error messages when conversions fail
As a Business User
- I want to convert my reports to web-friendly formats for online publishing
- I want to extract data from multiple sheets in Excel files
- I want to maintain table structures when converting to Markdown
- I want to process large documents without timeout errors
As an AI/LLM User
- I want document content optimized for token efficiency
- I want metadata about the conversion (token count, file size)
- I want clean, well-structured output without artifacts
- I want to handle documents up to 50MB for comprehensive analysis
Functional Requirements
1. File Format Support
Excel Conversion
- Input Formats: .xlsx, .xls, .xlsm, .xlsb
- Output Formats: JSON, Markdown
- Features:
  - Multi-sheet support with sheet filtering
  - Preserve formulas and calculated values
  - Handle merged cells and formatting
  - Support for pivot tables and charts (as data)
  - Macro-enabled file support (content only)
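Sheet filtering could look roughly like the sketch below. It assumes that the `sheetPrefix` parameter keeps only sheets whose names start with the given prefix, which this document does not spell out; `SheetData` and `filterSheetsByPrefix` are hypothetical names used for illustration.

```typescript
// Hypothetical shape for a parsed sheet; the real service may differ.
interface SheetData {
  name: string;
  rows: unknown[][];
}

// Assumption: `sheetPrefix` selects sheets whose names start with the
// prefix, compared case-insensitively. No prefix means "keep everything".
function filterSheetsByPrefix(sheets: SheetData[], sheetPrefix?: string): SheetData[] {
  if (sheetPrefix === undefined || sheetPrefix === "") return sheets;
  const wanted = sheetPrefix.toLowerCase();
  return sheets.filter((sheet) => sheet.name.toLowerCase().startsWith(wanted));
}
```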
PDF Conversion
- Input Formats: .pdf (all versions)
- Output Format: Markdown
- Features:
  - Text extraction with layout preservation
  - Table detection and conversion
  - Image placeholder generation
  - Multi-column layout handling
  - OCR support for scanned documents (future)
Word Conversion
- Input Formats: .docx, .doc, .dotx, .dotm
- Output Formats: HTML, Markdown
- Features:
  - Style preservation (headings, bold, italic)
  - Table conversion with alignment
  - List handling (bullet, numbered)
  - Footnote and endnote support
  - Track changes extraction (future)
PowerPoint Conversion
- Input Formats: .pptx, .ppt, .ppsx, .potx
- Output Format: Markdown
- Features:
  - Slide-by-slide conversion
  - Speaker notes extraction
  - Text from shapes and text boxes
  - Table and chart data extraction
  - Slide metadata (title, number)
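As an illustration of slide-by-slide output, the sketch below renders already-extracted slides as Markdown. The `Slide` shape, the heading layout, and the `slidesToMarkdown` name are assumptions made for this example; the document does not define the exact Markdown layout the service emits.

```typescript
// Hypothetical shape for an extracted slide.
interface Slide {
  number: number;
  title: string;
  texts: string[]; // text from shapes and text boxes
  notes?: string;  // speaker notes, if any
}

// Render each slide as a heading, a bullet list of its text, and an
// optional speaker-notes blockquote. Slides are separated by a rule.
function slidesToMarkdown(slides: Slide[]): string {
  return slides
    .map((slide) => {
      const parts: string[] = [`## Slide ${slide.number}: ${slide.title}`];
      if (slide.texts.length > 0) {
        parts.push(slide.texts.map((text) => `- ${text}`).join("\n"));
      }
      if (slide.notes !== undefined && slide.notes !== "") {
        parts.push(`> Notes: ${slide.notes}`);
      }
      return parts.join("\n\n");
    })
    .join("\n\n---\n\n");
}
```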
2. Conversion Quality Standards
Text Fidelity
- 99.9% character accuracy for standard text
- Proper Unicode support for international characters
- Whitespace normalization without content loss
- Special character preservation
Structure Preservation
- Maintain document hierarchy (headings, sections)
- Preserve table structures with proper alignment
- Keep list formatting and nesting
- Retain hyperlinks and references
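Table structure and alignment can be carried into Markdown along the following lines. This is a minimal sketch, assuming tables arrive as arrays of string cells with the first row as the header; `toMarkdownTable` and the `Align` type are hypothetical names, not part of the service.

```typescript
type Align = "left" | "center" | "right";

// Escape characters that would otherwise break a Markdown table cell.
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/\|/g, "\\|")
    .replace(/\r?\n/g, "<br>");
}

function alignmentMarker(align: Align): string {
  if (align === "center") return ":---:";
  if (align === "right") return "---:";
  return ":---";
}

// First row is treated as the header; short rows are padded with empty cells.
function toMarkdownTable(rows: string[][], aligns: Align[] = []): string {
  if (rows.length === 0) return "";
  const width = rows.reduce((max, row) => Math.max(max, row.length), 0);
  const renderRow = (row: string[]): string => {
    const cells: string[] = [];
    for (let i = 0; i < width; i++) cells.push(escapeCell(row[i] ?? ""));
    return `| ${cells.join(" | ")} |`;
  };
  const markers: string[] = [];
  for (let i = 0; i < width; i++) markers.push(alignmentMarker(aligns[i] ?? "left"));
  const header = rows[0] ?? [];
  const body = rows.slice(1);
  return [renderRow(header), `| ${markers.join(" | ")} |`, ...body.map(renderRow)].join("\n");
}
```

Escaping the pipe character matters because an unescaped `|` inside a cell would be read as a column boundary and shift every following cell.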
Data Integrity
- No data loss during conversion
- Preserve numerical precision
- Maintain date/time formats
- Keep formula results in Excel files
3. Processing Features
Token Estimation
- Calculate approximate token count for LLM usage
- Provide character and word counts
- Estimate processing costs for AI systems
- Recommend token optimizations
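The counts listed above could be approximated as in the sketch below. It assumes a fixed characters-per-token ratio (4 by default, a common rule of thumb for English text) rather than a real tokenizer; the document does not say which method the service uses, and `estimateStatistics` is a hypothetical name.

```typescript
interface TextStatistics {
  characterCount: number;
  wordCount: number;
  estimatedTokens: number;
}

// Assumption: tokens are estimated from character count with a fixed ratio.
// Real tokenizers vary by model and language, so treat this as approximate.
function estimateStatistics(text: string, charsPerToken = 4): TextStatistics {
  const trimmed = text.trim();
  // Words are runs of non-whitespace characters.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return {
    characterCount: text.length, // UTF-16 code units, not grapheme clusters
    wordCount,
    estimatedTokens: Math.ceil(text.length / charsPerToken),
  };
}
```

Keeping the ratio as a parameter lets callers tune the estimate per model instead of relying on one fixed constant.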
Content Filtering
- Remove empty rows/columns from tables
- Strip excessive whitespace
- Filter specific sheets in Excel files
- Options to remove hidden content
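Removing empty rows and columns can be sketched as follows, assuming a table is a two-dimensional array of cells and that `null`, missing, and whitespace-only cells count as empty. `Cell` and `dropEmptyRowsAndColumns` are hypothetical names for this example.

```typescript
type Cell = string | number | boolean | null;

// A cell is empty when it is null, missing, or a whitespace-only string.
// Note that 0 and false are real values and are kept.
function isEmptyCell(cell: Cell | undefined): boolean {
  return cell === null || cell === undefined || (typeof cell === "string" && cell.trim() === "");
}

function dropEmptyRowsAndColumns(rows: Cell[][]): Cell[][] {
  // Drop rows that contain no non-empty cell.
  const nonEmptyRows = rows.filter((row) => row.some((cell) => !isEmptyCell(cell)));
  const width = nonEmptyRows.reduce((max, row) => Math.max(max, row.length), 0);
  // Keep only columns that have a non-empty cell in at least one row.
  const keptColumns: number[] = [];
  for (let col = 0; col < width; col++) {
    if (nonEmptyRows.some((row) => !isEmptyCell(row[col]))) keptColumns.push(col);
  }
  return nonEmptyRows.map((row) => keptColumns.map((col) => row[col] ?? null));
}
```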
Metadata Generation
- File size and type information
- Conversion statistics
- Document properties extraction
- YAML frontmatter for Markdown
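YAML frontmatter for Markdown output might be assembled as in this sketch. The metadata keys are whatever the caller passes in; the document does not list which fields the service writes, and `withFrontmatter` is a hypothetical helper. Keys are assumed to be simple identifiers that need no quoting.

```typescript
type FrontmatterValue = string | number | boolean;

// Strings are quoted with JSON.stringify, since a JSON string is also a
// valid YAML double-quoted scalar. Numbers and booleans are written bare.
function formatFrontmatterValue(value: FrontmatterValue): string {
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

// Prepend a frontmatter block delimited by "---" lines to the Markdown body.
function withFrontmatter(markdown: string, metadata: Record<string, FrontmatterValue>): string {
  const lines = Object.keys(metadata).map(
    (key) => `${key}: ${formatFrontmatterValue(metadata[key] as FrontmatterValue)}`,
  );
  return ["---", ...lines, "---", "", markdown].join("\n");
}
```

For example, `withFrontmatter("# Title", { filename: "report.xlsx", estimatedTokens: 15000 })` yields a document that starts with a three-line metadata block followed by a blank line and the original Markdown.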
Non-Functional Requirements
Performance
- Response Time: < 3 seconds for files up to 5MB
- Throughput: 100 concurrent conversions
- File Size Limits:
  - Free tier: 5MB
  - Pro tier: 50MB
  - Enterprise: 500MB (custom)
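The tier limits above could be enforced with a check like the one sketched here. It assumes 1MB means 1,048,576 bytes and treats the Enterprise figure as a default, since the limit is marked as custom; `Tier`, `TIER_LIMIT_BYTES`, and `checkFileSize` are hypothetical names.

```typescript
type Tier = "free" | "pro" | "enterprise";

// Assumption: binary megabytes (1024 * 1024 bytes).
const BYTES_PER_MB = 1024 * 1024;

const TIER_LIMIT_BYTES: Record<Tier, number> = {
  free: 5 * BYTES_PER_MB,
  pro: 50 * BYTES_PER_MB,
  enterprise: 500 * BYTES_PER_MB, // default only; the PRD marks this limit as custom
};

// Reject empty uploads and anything above the tier's limit.
function checkFileSize(tier: Tier, sizeBytes: number): { ok: boolean; limitBytes: number } {
  const limitBytes = TIER_LIMIT_BYTES[tier];
  return { ok: sizeBytes > 0 && sizeBytes <= limitBytes, limitBytes };
}
```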
Scalability
- Auto-scale from 0 to 1000 instances
- Handle 10,000 conversions per minute (peak)
- Support global deployment across regions
- Queue-based processing for large files
Reliability
- Uptime: 99.9% availability
- Error Rate: < 0.1% failed conversions
- Recovery: Automatic retry for transient failures
- Monitoring: Real-time error tracking
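Automatic retry for transient failures is often implemented with exponential backoff, as in the sketch below. Which failures count as transient is not defined in this document, so `isTransient` is a caller-supplied predicate; the retry count and delays are illustrative defaults, and `backoffDelays` and `withRetry` are hypothetical names.

```typescript
// Delay before each retry: baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
function backoffDelays(retries: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) delays.push(Math.min(baseMs * Math.pow(2, i), maxMs));
  return delays;
}

// Run `task`, retrying after each delay in `delaysMs` while errors are transient.
async function withRetry<T>(
  task: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  delaysMs: number[] = backoffDelays(3, 200, 2000),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const delay = delaysMs[attempt];
      // Give up on permanent errors or once the delay schedule is exhausted.
      if (delay === undefined || !isTransient(error)) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Capping the delay keeps the worst-case wait bounded, which matters when the overall response-time target is measured in seconds.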
Security
- Zero-storage architecture (no file retention)
- In-memory processing only
- TLS encryption for all transfers
- No logging of file contents
- Input validation and sanitization
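One common form of input validation is to compare a file's leading bytes against the signature expected for its extension. The sketch below uses three well-known signatures: `%PDF` for PDF, `PK\x03\x04` for the ZIP container used by OOXML files, and the OLE2 compound-file header used by legacy Office binaries. The extension mapping and the `detectContainer` and `extensionMatchesContent` names are assumptions for illustration; the document does not describe how the service validates uploads.

```typescript
type DetectedContainer = "pdf" | "zip" | "ole2" | "unknown";

const FILE_SIGNATURES: Array<{ kind: DetectedContainer; bytes: number[] }> = [
  { kind: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { kind: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP local file header
  { kind: "ole2", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // OLE2 compound file
];

// Assumed mapping from extension to container, based on the listed input formats.
const EXPECTED_CONTAINER = new Map<string, DetectedContainer>([
  ["pdf", "pdf"],
  ["xlsx", "zip"], ["xlsm", "zip"], ["xlsb", "zip"],
  ["docx", "zip"], ["dotx", "zip"], ["dotm", "zip"],
  ["pptx", "zip"], ["ppsx", "zip"], ["potx", "zip"],
  ["xls", "ole2"], ["doc", "ole2"], ["ppt", "ole2"],
]);

// Identify the container by checking the file's leading bytes.
function detectContainer(data: Uint8Array): DetectedContainer {
  for (const signature of FILE_SIGNATURES) {
    const { kind, bytes } = signature;
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) return kind;
  }
  return "unknown";
}

// True only when the extension is supported and the content matches it.
function extensionMatchesContent(filename: string, data: Uint8Array): boolean {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return false;
  const expected = EXPECTED_CONTAINER.get(filename.slice(dot + 1).toLowerCase());
  return expected !== undefined && expected === detectContainer(data);
}
```

A signature check only confirms the container type. A ZIP header does not prove the archive is a valid workbook, so the parser still needs its own error handling.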
Technical Specifications
API Endpoints
POST /xlsx-converter
  Description: Convert Excel to JSON
  Input: multipart/form-data with file
  Output: JSON with sheets array
POST /xlsx-to-md
  Description: Convert Excel to Markdown
  Input: multipart/form-data with file
  Optional: sheetPrefix parameter
  Output: JSON with markdown string
POST /pdf-to-md
  Description: Convert PDF to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string
POST /docx-to-html
  Description: Convert Word to HTML
  Input: multipart/form-data with file
  Output: JSON with html string
POST /docx-to-md
  Description: Convert Word to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string
POST /pptx-to-md
  Description: Convert PowerPoint to Markdown
  Input: multipart/form-data with file
  Output: JSON with markdown string
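A client could pick the endpoint from the file extension and the desired output, as sketched below. The endpoint paths come from the list above. The assumption that every listed extension in a family (for example .xls or .xlsb) is accepted by that family's endpoints, and the `endpointFor` helper itself, are illustrative rather than confirmed behavior.

```typescript
type DocumentFamily = "excel" | "pdf" | "word" | "powerpoint";
type OutputFormat = "json" | "markdown" | "html";

// Endpoint paths as listed in this document, keyed by family and output.
const ENDPOINT_PATHS: Record<DocumentFamily, Partial<Record<OutputFormat, string>>> = {
  excel: { json: "/xlsx-converter", markdown: "/xlsx-to-md" },
  pdf: { markdown: "/pdf-to-md" },
  word: { html: "/docx-to-html", markdown: "/docx-to-md" },
  powerpoint: { markdown: "/pptx-to-md" },
};

// Assumption: all listed input extensions route to their family's endpoints.
const FAMILY_BY_EXTENSION = new Map<string, DocumentFamily>([
  ["xlsx", "excel"], ["xls", "excel"], ["xlsm", "excel"], ["xlsb", "excel"],
  ["pdf", "pdf"],
  ["docx", "word"], ["doc", "word"], ["dotx", "word"], ["dotm", "word"],
  ["pptx", "powerpoint"], ["ppt", "powerpoint"], ["ppsx", "powerpoint"], ["potx", "powerpoint"],
]);

// Returns the endpoint path, or null when the combination is unsupported.
function endpointFor(filename: string, output: OutputFormat): string | null {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null;
  const family = FAMILY_BY_EXTENSION.get(filename.slice(dot + 1).toLowerCase());
  if (family === undefined) return null;
  return ENDPOINT_PATHS[family][output] ?? null;
}
```

The returned path would be appended to the service's base URL, which this document does not specify, and the file would be sent as multipart/form-data.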
Response Format
{
  "success": true,
  "filename": "report.xlsx",
  "processingTime": 1245,
  "statistics": {
    "inputSize": 1048576,
    "outputSize": 524288,
    "estimatedTokens": 15000,
    "characterCount": 45000,
    "wordCount": 7500
  },
  "sheets": [...], // For Excel
  "markdown": "...", // For Markdown output
  "html": "..." // For HTML output
}
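On the client side, the response shape above can be captured in types with a runtime guard, as in this sketch. The field names come from the example. Treating `sheets`, `markdown`, and `html` as optional (one per endpoint) is an assumption, the units of `processingTime` and the size fields are not stated in this document, and `isConversionSuccess` is a hypothetical helper.

```typescript
interface ConversionStatistics {
  inputSize: number;
  outputSize: number;
  estimatedTokens: number;
  characterCount: number;
  wordCount: number;
}

interface ConversionSuccess {
  success: true;
  filename: string;
  processingTime: number; // unit not stated in the PRD
  statistics: ConversionStatistics;
  sheets?: unknown[]; // Excel to JSON
  markdown?: string;  // Markdown output
  html?: string;      // HTML output
}

const STATISTIC_KEYS = ["inputSize", "outputSize", "estimatedTokens", "characterCount", "wordCount"] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Runtime check that parsed JSON has the required success fields.
function isConversionSuccess(value: unknown): value is ConversionSuccess {
  if (!isRecord(value) || value["success"] !== true) return false;
  if (typeof value["filename"] !== "string" || typeof value["processingTime"] !== "number") return false;
  const stats = value["statistics"];
  if (!isRecord(stats)) return false;
  return STATISTIC_KEYS.every((key) => typeof stats[key] === "number");
}
```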
Error Handling
{
  "success": false,
  "error": "File type not supported",
  "details": "The file appears to be corrupted or is not a valid XLSX file",
  "code": "INVALID_FILE_TYPE",
  "statusCode": 400
}
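A caller might turn the error body above into a typed exception, as in the following sketch. `ConversionError` and `toConversionError` are hypothetical names, and the `isClientFault` flag (true for 4xx status codes) is an illustrative convenience, not something this document defines.

```typescript
interface ConversionFailure {
  success: false;
  error: string;
  details: string;
  code: string;
  statusCode: number;
}

class ConversionError extends Error {
  readonly code: string;
  readonly statusCode: number;
  readonly details: string;
  readonly isClientFault: boolean;

  constructor(failure: ConversionFailure) {
    super(failure.error);
    this.name = "ConversionError";
    this.code = failure.code;
    this.statusCode = failure.statusCode;
    this.details = failure.details;
    // Assumption: 4xx means the request was at fault, so retrying will not help.
    this.isClientFault = failure.statusCode >= 400 && failure.statusCode < 500;
  }
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a ConversionError for a failure body, or null for anything else.
function toConversionError(body: unknown): ConversionError | null {
  if (!isPlainObject(body) || body["success"] !== false) return null;
  const error = body["error"];
  const details = body["details"];
  const code = body["code"];
  const statusCode = body["statusCode"];
  if (typeof error !== "string" || typeof code !== "string" || typeof statusCode !== "number") return null;
  return new ConversionError({
    success: false,
    error,
    code,
    statusCode,
    details: typeof details === "string" ? details : "",
  });
}
```

Branching on the machine-readable `code` rather than the human-readable `error` text keeps client logic stable if the wording of messages changes.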
Success Metrics
Quality Metrics
- Conversion Accuracy: > 99.5%
- Customer Satisfaction: > 4.5/5 rating
- Support Tickets: < 1% of conversions
- Error Rate: < 0.1%
Performance Metrics
- Average Response Time: < 2 seconds
- P95 Response Time: < 5 seconds
- Uptime: > 99.9%
- Successful Conversions: > 99%
Business Metrics
- Daily Active Users: 1,000+
- Monthly Conversions: 100,000+
- API Adoption Rate: 50% of users
- Retention Rate: > 80% monthly
Dependencies
External Libraries
- xlsx: Excel file parsing
- pdf-parse: PDF text extraction
- mammoth: Word document conversion
- busboy: Multipart form parsing
Infrastructure
- Google Cloud Functions (Gen2)
- Node.js 20.x runtime
- Cloud Storage for temporary processing
- Firestore for usage tracking
Third-Party Services
- Token counting algorithms
- Character encoding libraries
- Image processing services (future)
Timeline & Milestones
Phase 1: Core Conversion (Completed)
- Excel to JSON/Markdown
- PDF to Markdown
- Word to HTML/Markdown
- Basic error handling
Phase 2: Enhanced Features (Completed)
- PowerPoint support
- Token estimation
- Sheet filtering
- Performance optimization
Phase 3: Advanced Features (Q1 2025)
- OCR support for scanned PDFs
- Batch conversion API
- Webhook notifications
- Custom output formatting
Phase 4: Enterprise Features (Q2 2025)
- On-premise deployment
- Custom file size limits
- Priority processing queues
- Advanced security options
Risk Mitigation
Technical Risks
- Large File Handling: Implement streaming for files > 50MB
- Memory Limitations: Use worker processes for complex conversions
- Format Compatibility: Maintain test suite with edge cases
- Performance Degradation: Auto-scaling and load balancing
Business Risks
- Competitor Features: Regular feature parity analysis
- Pricing Pressure: Flexible pricing tiers
- Technology Changes: Modular architecture for adaptability
Future Considerations
Format Expansion
- CAD file support (DWG, DXF)
- Email formats (MSG, EML)
- Archive formats (ZIP with batch processing)
- Image-based documents (with OCR)
AI Integration
- Built-in summarization
- Content classification
- Language detection
- Sentiment analysis
Platform Features
- Desktop application
- Browser extension
- Mobile SDK
- Offline conversion support
Advanced Processing
- Custom conversion templates
- Format transformation rules
- Content redaction/filtering
- Watermarking support