Debug Guide: PDF-to-MD 500 Internal Server Error

Generated: 2025-07-23
Status: In Progress
Endpoint: https://convert-to-markdown.knowcode.tech/pdf-to-md
Error: 500 Internal Server Error

Issue Description

The pdf-to-md endpoint is returning a 500 Internal Server Error when attempting to convert PDF files to Markdown.

Debug Steps

1. Check Cloud Function Logs

First, let's view the recent logs to identify the exact error:

# View recent logs
gcloud functions logs read pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4 --limit=50

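# Stream logs in real time (open in a separate terminal).
# `gcloud functions logs read` has no --follow flag; the function also runs as the
# Cloud Run service pdf-to-md, so tail that service instead (beta command).
gcloud beta run services logs tail pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4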

2. Function Configuration Check

Verify the function configuration:

gcloud functions describe pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4

Expected configuration:

Memory: 512MB
Timeout: 60s
Runtime: nodejs20
Entry point: pdfToMarkdown

3. Local Testing

Test the function locally to isolate cloud-specific issues:

cd /Users/lindsaysmith/Documents/lambda1.nosync/xlsx-docx-ppt-convert-to-md

# Start local function server
npx @google-cloud/functions-framework --target=pdfToMarkdown --port=8080

# In another terminal, test with a PDF file
curl -X POST http://localhost:8080 -F "file=@test.pdf"
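
The curl call above lets curl pick the part's Content-Type. Since the validation depends on the MIME type, it can help to send the same file with different declared types. The following is a minimal sketch, assuming Node.js 18+ (global fetch, FormData, and Blob) and the form field name "file" from the curl example; buildUploadForm and postPdf are hypothetical helper names, not part of the project.

```javascript
// Build a multipart form whose file part declares the given MIME type.
function buildUploadForm(pdfBytes, filename, mimeType) {
  const form = new FormData();
  // The Blob's type becomes the Content-Type of this multipart part.
  form.append('file', new Blob([pdfBytes], { type: mimeType }), filename);
  return form;
}

// POST the form and return the status code and raw response body.
async function postPdf(url, pdfBytes, filename, mimeType) {
  const response = await fetch(url, {
    method: 'POST',
    body: buildUploadForm(pdfBytes, filename, mimeType),
  });
  return { status: response.status, body: await response.text() };
}

// Usage (requires the local server from the step above to be running):
//   const fs = require('node:fs');
//   const bytes = fs.readFileSync('test.pdf');
//   for (const type of ['application/pdf', 'application/x-pdf', 'application/octet-stream']) {
//     postPdf('http://localhost:8080', bytes, 'test.pdf', type)
//       .then((result) => console.log(type, result.status));
//   }
```

If one declared type succeeds and another fails, the problem is in validation rather than in PDF parsing.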

4. Common Error Points to Check

Based on source code analysis:

MIME Type Validation (src/pdfConverterToMD.js:23)
- Must be application/pdf
- Some PDFs may have different MIME types
File Size Limit (src/pdfConverterToMD.js:4)
- Maximum 5MB
- Check if file exceeds limit
PDF Parsing (lib/converters/pdf.js:6)
- Uses pdf-parse library
- May have compatibility issues with certain PDF versions
Memory Issues
- Function allocated 512MB
- Large or complex PDFs may exceed memory
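
Two of the points above can be checked on a file before it is uploaded: its size and whether it looks like a PDF at all. The sketch below is an illustration only. It assumes "5MB" means 5 * 1024 * 1024 bytes (the constant in src/pdfConverterToMD.js may be defined differently), and preflightPdf is a hypothetical helper name.

```javascript
// Assumed interpretation of the 5MB limit; check the source for the real value.
const MAX_BYTES = 5 * 1024 * 1024;

// Inspect a Buffer and report obvious reasons the upload could be rejected.
function preflightPdf(buffer) {
  const problems = [];
  if (buffer.length > MAX_BYTES) {
    problems.push(`file is ${buffer.length} bytes, over the ${MAX_BYTES}-byte limit`);
  }
  // A PDF normally begins with "%PDF-" followed by its version, e.g. "%PDF-1.7".
  const header = buffer.subarray(0, 8).toString('latin1');
  const match = /^%PDF-(\d\.\d)/.exec(header);
  if (!match) {
    problems.push('file does not start with a %PDF-x.y header');
  }
  return { version: match ? match[1] : null, problems };
}

// Usage:
//   const fs = require('node:fs');
//   console.log(preflightPdf(fs.readFileSync('test.pdf')));
```

The reported version is also useful for the PDF-version compatibility question, since it shows which version a failing file declares.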

5. Test with Different PDF Types

Create test PDFs to isolate the issue:

# Create test directory
mkdir -p test-pdfs
cd test-pdfs

# Test 1: Simple text PDF
echo "Simple test content" > test-simple.txt
# Convert to PDF using online tool or local converter

# Test 2: PDF with tables
# Create a document with tables

# Test 3: Small PDF (<100KB)
# Test 4: Different PDF versions (1.4, 1.5, 1.7)
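
As an alternative to an online converter, small test PDFs with a chosen version header can be generated directly. This is a rough sketch under several assumptions: a single page, the built-in Helvetica font, uncompressed content, and ASCII text only. makeMinimalPdf is a hypothetical helper, and whether pdf-parse accepts its output is exactly what the test would show.

```javascript
// Build a one-page PDF containing a single line of text.
// The version string only changes the header; the body uses basic features
// that all of the listed versions share.
function makeMinimalPdf(version, text) {
  // Escape characters that are special inside a PDF literal string.
  const escaped = text.replace(/([\\()])/g, '\\$1');
  const stream = `BT /F1 12 Tf 72 720 Td (${escaped}) Tj ET`;
  const objects = [
    '<< /Type /Catalog /Pages 2 0 R >>',
    '<< /Type /Pages /Kids [3 0 R] /Count 1 >>',
    '<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] ' +
      '/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>',
    `<< /Length ${stream.length} >>\nstream\n${stream}\nendstream`,
    '<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>',
  ];

  let pdf = `%PDF-${version}\n`;
  const offsets = [];
  objects.forEach((body, index) => {
    offsets.push(pdf.length); // byte offset of this object (ASCII: 1 char = 1 byte)
    pdf += `${index + 1} 0 obj\n${body}\nendobj\n`;
  });

  // Cross-reference table: one 20-byte entry per object, plus the free entry 0.
  const xrefOffset = pdf.length;
  pdf += `xref\n0 ${objects.length + 1}\n0000000000 65535 f \n`;
  for (const offset of offsets) {
    pdf += `${String(offset).padStart(10, '0')} 00000 n \n`;
  }
  pdf += `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefOffset}\n%%EOF\n`;
  return Buffer.from(pdf, 'latin1');
}

// Usage:
//   const fs = require('node:fs');
//   for (const v of ['1.4', '1.5', '1.7']) {
//     fs.writeFileSync(`test-v${v}.pdf`, makeMinimalPdf(v, 'Simple test content'));
//   }
```

These files cover the "simple text" and "small PDF" cases. PDFs with tables or version-specific features still need a real authoring tool.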

6. Debug Script Usage

Run the debug deployment script:

cd deploy-gcp
# Modify debug script for pdf-to-md
# (-i.bak works with both GNU sed and the BSD sed shipped with macOS; plain -i fails on macOS)
sed -i.bak 's/docx-to-html/pdf-to-md/g' debug-deployment.sh
./debug-deployment.sh

# Check generated files
cat pdf-function-details.json
cat pdf-run-details.json
cat deployment-logs.json

7. Production Test

Use the official test script:

cd deploy-gcp
./07-test-public-functions.sh

# Look specifically for pdf-to-md test results

8. Enhanced Logging

If needed, add debug logging to key points:

In src/pdfConverterToMD.js:

Before file validation
After file buffer creation
Before calling convertPdfToMarkdown

In lib/converters/pdf.js:

Before pdf-parse call
After successful parsing
In catch blocks
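
One way to add the log points listed above is a small helper that writes one JSON object per line. Cloud Functions and Cloud Run forward stdout to Cloud Logging, which treats single-line JSON as a structured entry and reads the severity and message fields. The sketch below is an assumption about how this could look; logStep and the field names other than severity and message are hypothetical.

```javascript
// Write one structured log line and return it (returning it makes testing easy).
function logStep(step, details = {}, severity = 'DEBUG') {
  // Spread details first so they cannot overwrite severity or message.
  const line = JSON.stringify({
    ...details,
    severity,
    message: `pdf-to-md: ${step}`,
  });
  console.log(line);
  return line;
}

// Usage at the points listed above, for example:
//   logStep('before file validation', { mimeType, filename });
//   logStep('after file buffer creation', { bytes: buffer.length });
//   logStep('pdf-parse failed', { error: err.message }, 'ERROR');
```

Logging the received MIME type and filename before validation would have made the failure in this incident visible in the first log read.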

9. Cloud Console Monitoring

Check in GCP Console:

Cloud Functions → pdf-to-md → Metrics
Look for:
- Error rate spikes
- Memory usage patterns
- Cold start frequency
- Request/response sizes

10. Common Solutions

Based on typical 500 errors:

Dependency Issues

npm list pdf-parse
npm install pdf-parse@latest

Memory Increase

gcloud functions deploy pdf-to-md \
  --memory=1GB \
  --region=us-east4 \
  --project=convert-to-markdown-us-east4

Timeout Increase

gcloud functions deploy pdf-to-md \
  --timeout=120s \
  --region=us-east4 \
  --project=convert-to-markdown-us-east4

Error Log Analysis

Error Details

Error: Invalid file type. Only PDF files are allowed.
    at Multipart.<anonymous> (/workspace/src/pdfConverterToMD.js:24:26)

The error occurs because the MIME type validation was too strict: it rejected every upload whose MIME type did not contain application/pdf.

Root Cause

The MIME type check in pdfConverterToMD.js line 23 was using:

if (!mimeType.includes('application/pdf'))

This fails when:

Browsers send different MIME types (e.g., application/x-pdf)
MIME type is missing or empty
Different tools use variant PDF MIME types

Solution Applied

Updated the MIME type validation to accept multiple PDF MIME types and to fall back to the filename extension:

const validPdfTypes = ['application/pdf', 'application/x-pdf', 'application/acrobat', 
                       'applications/vnd.pdf', 'text/pdf', 'text/x-pdf'];
const isPdf = validPdfTypes.some(type => mimeType && mimeType.includes(type)) || 
              (filename && filename.toLowerCase().endsWith('.pdf'));

The error message now also includes the MIME type actually received, to aid debugging.
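
To make the difference between the two checks concrete, the sketch below wraps each one in a function and runs both on a few sample inputs. The function names oldCheck and newCheck are hypothetical wrappers; the expressions inside them are copied from the snippets above.

```javascript
const validPdfTypes = ['application/pdf', 'application/x-pdf', 'application/acrobat',
                       'applications/vnd.pdf', 'text/pdf', 'text/x-pdf'];

// Original check: substring match on a single MIME type.
function oldCheck(mimeType) {
  return mimeType.includes('application/pdf');
}

// Updated check: any known PDF MIME type, or a .pdf filename extension.
function newCheck(mimeType, filename) {
  const isPdf = validPdfTypes.some(type => mimeType && mimeType.includes(type)) ||
                (filename && filename.toLowerCase().endsWith('.pdf'));
  return Boolean(isPdf);
}

console.log(oldCheck('application/pdf'));                    // true
console.log(oldCheck('application/x-pdf'));                  // false (rejected before the fix)
console.log(newCheck('application/x-pdf', 'report.pdf'));    // true
console.log(newCheck('application/octet-stream', 'a.PDF'));  // true (extension fallback)
console.log(newCheck('application/octet-stream', 'a.bin'));  // false
console.log(newCheck(undefined, 'report.pdf'));              // true
// oldCheck(undefined) throws a TypeError instead of returning false.
```

Note that the extension fallback trusts the client-supplied filename, so a non-PDF file named x.pdf passes validation and is only rejected later, if at all, by the parser.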

Test Results

Before Fix

Error: 500 Internal Server Error
Failure rate: 100%
Error message: "Invalid file type. Only PDF files are allowed."

After Fix (Local Testing)

Status: 200 OK
Success rate: 100%
Successfully converts PDF to Markdown

After Fix (Production)

Cloud Run URL: Works perfectly (https://pdf-to-md-qpg64cvnga-uk.a.run.app)
Public Domain: Still returns 500 error
Root Cause: Domain mapping issue - the domain is mapped to xlsx-converter service only

Resolution

The fix works at the Cloud Run level. The custom domain still returns a 500 error for the following reasons:

The domain convert-to-markdown.knowcode.tech is mapped only to the xlsx-converter service
Other functions (like pdf-to-md) are not accessible through the custom domain
They work via their individual Cloud Run URLs but not through the shared domain

Verified Working URLs

Direct Cloud Run: https://pdf-to-md-qpg64cvnga-uk.a.run.app
Cloud Functions URL: https://us-east4-convert-to-markdown-us-east4.cloudfunctions.net/pdf-to-md
Custom Domain: https://convert-to-markdown.knowcode.tech/pdf-to-md (not working: domain mapping issue)

Prevention

Add better error handling
Implement request validation
Add health check endpoint
Set up alerts for error rates > 1%
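
In this incident a client-side problem (an unexpected MIME type) surfaced as a 500. One possible shape for the "better error handling" item is to mark validation failures explicitly and map them to 4xx responses, so that 500 is reserved for genuine server faults. The sketch below is an assumption about how that could be done; ValidationError and toHttpError are hypothetical names, and the status codes are the standard HTTP ones (415 Unsupported Media Type, 413 Content Too Large).

```javascript
// Error type for problems caused by the request itself.
class ValidationError extends Error {
  constructor(message, status = 400) {
    super(message);
    this.name = 'ValidationError';
    this.status = status;
  }
}

// Translate any thrown error into a status code and JSON body.
function toHttpError(err) {
  if (err instanceof ValidationError) {
    return { status: err.status, body: { error: err.message } };
  }
  // Do not leak internal details for unexpected failures.
  return { status: 500, body: { error: 'Internal Server Error' } };
}

// Usage inside a handler (sketch):
//   try {
//     if (!isPdf) throw new ValidationError(`Invalid file type: ${mimeType}`, 415);
//     if (tooLarge) throw new ValidationError('File exceeds the size limit', 413);
//     ...
//   } catch (err) {
//     const { status, body } = toHttpError(err);
//     res.status(status).json(body);
//   }
```

With this split, an alert on the 5xx rate would fire only for server-side failures, not for rejected uploads.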

Related Issues

Similar issues with other converters: [None identified yet]
Dependencies requiring updates: [To be determined]

Next Steps: Fix the custom-domain routing so that /pdf-to-md reaches the pdf-to-md service, then re-run step 7 to confirm.