Last updated: Jul 25, 2025, 10:08 AM UTC

Debug Guide: PDF-to-MD 500 Internal Server Error

Generated: 2025-07-23
Status: In Progress
Endpoint: https://convert-to-markdown.knowcode.tech/pdf-to-md
Error: 500 Internal Server Error

Issue Description

The pdf-to-md endpoint is returning a 500 Internal Server Error when attempting to convert PDF files to Markdown.

Debug Steps

1. Check Cloud Function Logs

First, let's view the recent logs to identify the exact error:

# View recent logs
gcloud functions logs read pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4 --limit=50

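# Stream logs in real time (open in a separate terminal).
# `gcloud functions logs read` has no --follow flag; because the function also runs as a
# Cloud Run service, tail that service instead (requires the gcloud beta component):
gcloud beta run services logs tail pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4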

2. Function Configuration Check

Verify the function configuration:

gcloud functions describe pdf-to-md --region=us-east4 --project=convert-to-markdown-us-east4

Expected configuration:

  • Memory: 512MB
  • Timeout: 60s
  • Runtime: nodejs20
  • Entry point: pdfToMarkdown

3. Local Testing

Test the function locally to isolate cloud-specific issues:

cd /Users/lindsaysmith/Documents/lambda1.nosync/xlsx-docx-ppt-convert-to-md

# Start local function server
npx @google-cloud/functions-framework --target=pdfToMarkdown --port=8080

# In another terminal, test with a PDF file
curl -X POST http://localhost:8080 -F "file=@test.pdf"
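
To reproduce MIME-related failures locally, it can help to control the Content-Type of the uploaded part rather than leave the choice to the client. The following is a minimal sketch, assuming Node 18+ (global fetch, FormData, and Blob), the field name "file" from the curl command above, and the local server listening on port 8080. The helper names buildPdfForm and postPdf are hypothetical and not part of the project.

```javascript
// Hypothetical helpers for local testing; not part of the project source.
// Requires Node 18+ for the global fetch, FormData, and Blob.

// Build a multipart form whose "file" part carries an explicit MIME type.
function buildPdfForm(bytes, mimeType, filename) {
  const form = new FormData();
  form.set('file', new Blob([bytes], { type: mimeType }), filename);
  return form;
}

// POST the form and return the status and body so failures can be inspected.
async function postPdf(url, form) {
  const res = await fetch(url, { method: 'POST', body: form });
  return { status: res.status, body: await res.text() };
}

// Example: send the same bytes under several MIME types.
// const bytes = require('node:fs').readFileSync('test.pdf');
// for (const type of ['application/pdf', 'application/x-pdf', 'application/octet-stream']) {
//   postPdf('http://localhost:8080', buildPdfForm(bytes, type, 'test.pdf')).then(console.log);
// }
```

Sending one known-good file under different MIME types separates validation failures from parsing failures: if only some types produce an error, the PDF content itself is not the problem.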

4. Common Error Points to Check

Based on source code analysis:

  1. MIME Type Validation (src/pdfConverterToMD.js:23)

    • Must be application/pdf
    • Some PDFs may have different MIME types
  2. File Size Limit (src/pdfConverterToMD.js:4)

    • Maximum 5MB
    • Check if file exceeds limit
  3. PDF Parsing (lib/converters/pdf.js:6)

    • Uses pdf-parse library
    • May have compatibility issues with certain PDF versions
  4. Memory Issues

    • Function allocated 512MB
    • Large or complex PDFs may exceed memory

5. Test with Different PDF Types

Create test PDFs to isolate the issue:

# Create test directory
mkdir -p test-pdfs
cd test-pdfs

# Test 1: Simple text PDF
echo "Simple test content" > test-simple.txt
# Convert to PDF using online tool or local converter

# Test 2: PDF with tables
# Create a document with tables

# Test 3: Small PDF (<100KB)
# Test 4: Different PDF versions (1.4, 1.5, 1.7)
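
If no converter is at hand, a small single-page PDF can also be generated directly. The sketch below is an assumption-laden illustration: buildMinimalPdf is a hypothetical helper, it handles ASCII text only, and the version argument changes only the declared header version, not the features used. It has not been verified against pdf-parse, so treat a parse failure on its output as inconclusive.

```javascript
// Hypothetical helper: builds a one-page PDF containing a single line of text.
// ASCII input only, so string length equals byte length for the offsets below.
function buildMinimalPdf(text, version = '1.4') {
  // Escape the characters that are special inside a PDF literal string.
  const escaped = text.replace(/([\\()])/g, '\\$1');
  const content = `BT /F1 24 Tf 72 720 Td (${escaped}) Tj ET`;
  const objects = [
    '<< /Type /Catalog /Pages 2 0 R >>',
    '<< /Type /Pages /Kids [3 0 R] /Count 1 >>',
    '<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] ' +
      '/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>',
    `<< /Length ${content.length} >>\nstream\n${content}\nendstream`,
    '<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>',
  ];

  let pdf = `%PDF-${version}\n`;
  const offsets = [];
  objects.forEach((body, i) => {
    offsets.push(pdf.length); // byte offset where object i + 1 starts
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  // Cross-reference table: each entry is exactly 20 bytes long.
  const xrefStart = pdf.length;
  pdf += `xref\n0 ${objects.length + 1}\n`;
  pdf += '0000000000 65535 f \n';
  for (const off of offsets) {
    pdf += `${String(off).padStart(10, '0')} 00000 n \n`;
  }
  pdf += `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefStart}\n%%EOF\n`;
  return Buffer.from(pdf, 'latin1');
}
```

The returned Buffer can be written to disk, for example with fs.writeFileSync('test-simple.pdf', buildMinimalPdf('Simple test content')), and then uploaded like any other test file. Calling it with '1.4', '1.5', and '1.7' gives files that differ only in the header, which helps tell version sniffing apart from genuine feature incompatibilities.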

6. Debug Script Usage

Run the debug deployment script:

cd deploy-gcp
# Modify debug script for pdf-to-md (-i.bak works with both GNU sed and BSD/macOS sed)
sed -i.bak 's/docx-to-html/pdf-to-md/g' debug-deployment.sh
./debug-deployment.sh

# Check generated files
cat pdf-function-details.json
cat pdf-run-details.json
cat deployment-logs.json

7. Production Test

Use the official test script:

cd deploy-gcp
./07-test-public-functions.sh

# Look specifically for pdf-to-md test results

8. Enhanced Logging

If needed, add debug logging to key points:

In src/pdfConverterToMD.js:

  • Before file validation
  • After file buffer creation
  • Before calling convertPdfToMarkdown

In lib/converters/pdf.js:

  • Before pdf-parse call
  • After successful parsing
  • In catch blocks
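
One possible shape for these log statements is a single-line JSON object on stdout, which Cloud Logging treats as a structured entry and reads the "severity" and "message" fields from. This is a minimal sketch: the helper names formatLogEntry and logStage, the "pdf-to-md:" message prefix, and the extra fields are illustrative assumptions, not existing project code.

```javascript
// Hypothetical logging helpers; names and extra fields are illustrative.

// Serialize one log entry as a single JSON line.
function formatLogEntry(stage, details = {}, severity = 'DEBUG') {
  return JSON.stringify({ severity, message: `pdf-to-md: ${stage}`, ...details });
}

// Write the entry to stdout, where the platform's log collector picks it up.
function logStage(stage, details, severity) {
  console.log(formatLogEntry(stage, details, severity));
}

// Example calls at the points listed above (values are placeholders):
logStage('before file validation', { mimeType: 'application/pdf', filename: 'test.pdf' });
logStage('after file buffer creation', { bytes: 1024 });
logStage('pdf-parse failed', { error: 'example message' }, 'ERROR');
```

Logging the received MIME type and the buffer size at each stage shows in a single log read how far a request got before it failed.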

9. Cloud Console Monitoring

Check in GCP Console:

  1. Cloud Functions → pdf-to-md → Metrics
  2. Look for:
    • Error rate spikes
    • Memory usage patterns
    • Cold start frequency
    • Request/response sizes

10. Common Solutions

Based on typical 500 errors:

  1. Dependency Issues

    npm list pdf-parse
    npm install pdf-parse@latest
    
  2. Memory Increase

    gcloud functions deploy pdf-to-md \
      --memory=1GB \
      --region=us-east4
    
  3. Timeout Increase

    gcloud functions deploy pdf-to-md \
      --timeout=120s \
      --region=us-east4
    

Error Log Analysis

Error Details

Error: Invalid file type. Only PDF files are allowed.
    at Multipart.<anonymous> (/workspace/src/pdfConverterToMD.js:24:26)

The error occurs because the MIME type validation was too strict: it accepted only MIME types that contain application/pdf.

Root Cause

The MIME type check in pdfConverterToMD.js line 23 was using:

if (!mimeType.includes('application/pdf'))

This fails when:

  • Browsers send different MIME types (e.g., application/x-pdf)
  • MIME type is missing or empty
  • Different tools use variant PDF MIME types
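
The behaviour of the old check can be seen in isolation. The sketch below wraps the condition in a hypothetical function, oldCheckAccepts, and runs it against a few sample values. The sample MIME types are illustrative and are not taken from the production logs.

```javascript
// The original condition, wrapped in a function for illustration only.
function oldCheckAccepts(mimeType) {
  return mimeType.includes('application/pdf');
}

const samples = [
  'application/pdf',
  'application/pdf; charset=binary',
  'application/x-pdf',
  'application/octet-stream',
  '',
];
for (const type of samples) {
  const verdict = oldCheckAccepts(type) ? 'accepted' : 'rejected';
  console.log(JSON.stringify(type), '->', verdict);
}

// If the value were undefined, the check would not reach the
// "Invalid file type" branch at all: calling .includes on undefined
// throws a TypeError instead.
try {
  oldCheckAccepts(undefined);
} catch (err) {
  console.log('undefined ->', err.name);
}
```

Because includes is a substring match, only values that contain the exact text application/pdf pass. Variants such as application/x-pdf and generic types such as application/octet-stream are rejected even when the file is a valid PDF.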

Solution Applied

Updated the MIME type validation to accept multiple PDF MIME types and to fall back to the filename extension:

const validPdfTypes = ['application/pdf', 'application/x-pdf', 'application/acrobat', 
                       'applications/vnd.pdf', 'text/pdf', 'text/x-pdf'];
const isPdf = validPdfTypes.some(type => mimeType && mimeType.includes(type)) || 
              (filename && filename.toLowerCase().endsWith('.pdf'));

Also improved the error message to show the actual MIME type received, which makes rejected uploads easier to debug.
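
For testing outside the request handler, the same logic can be extracted into a pure function. This is a sketch under assumptions rather than the deployed code: the names isPdfUpload and rejectionMessage are hypothetical, the lowercase normalization of the MIME type is an addition not present in the fix above, and the exact wording of the improved error message is assumed.

```javascript
// Same list of accepted types as in the applied fix.
const VALID_PDF_TYPES = ['application/pdf', 'application/x-pdf', 'application/acrobat',
                         'applications/vnd.pdf', 'text/pdf', 'text/x-pdf'];

// Hypothetical helper: true if the MIME type or the filename indicates a PDF.
// Lowercasing the MIME type is an assumption added for this sketch.
function isPdfUpload(mimeType, filename) {
  const typeOk = typeof mimeType === 'string' &&
    VALID_PDF_TYPES.some((type) => mimeType.toLowerCase().includes(type));
  const nameOk = typeof filename === 'string' &&
    filename.toLowerCase().endsWith('.pdf');
  return typeOk || nameOk;
}

// Hypothetical helper: the message wording is assumed, not copied from the source.
function rejectionMessage(mimeType, filename) {
  return 'Invalid file type. Only PDF files are allowed. ' +
    `Received MIME type: ${mimeType || '(none)'}, filename: ${filename || '(none)'}`;
}

console.log(isPdfUpload('application/x-pdf', 'upload.bin'));
console.log(isPdfUpload('application/octet-stream', 'Report.PDF'));
console.log(rejectionMessage('image/png', 'photo.png'));
```

The extension fallback is a trade-off: any file named *.pdf is accepted regardless of its content, so malformed input is caught later, at the parsing step, instead of at validation.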

Test Results

Before Fix

  • Error: 500 Internal Server Error
  • Failure rate: 100%
  • Error message: "Invalid file type. Only PDF files are allowed."

After Fix (Local Testing)

  • Status: 200 OK
  • Success rate: 100%
  • Successfully converts PDF to Markdown

After Fix (Production)

Resolution

The fix works correctly at the Cloud Run level. The 500 error on the custom domain persists because:

  1. The domain convert-to-markdown.knowcode.tech is mapped only to the xlsx-converter service
  2. Other functions (like pdf-to-md) are not accessible through the custom domain
  3. They work via their individual Cloud Run URLs but not through the shared domain

Verified Working URLs

  • Direct Cloud Run: https://pdf-to-md-qpg64cvnga-uk.a.run.app
  • Cloud Functions URL: https://us-east4-convert-to-markdown-us-east4.cloudfunctions.net/pdf-to-md
  • Custom Domain (not working, domain mapping issue): https://convert-to-markdown.knowcode.tech/pdf-to-md

Prevention

  1. Add better error handling
  2. Implement request validation
  3. Add health check endpoint
  4. Set up alerts for error rates > 1%
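
In this incident a rejected upload surfaced as a 500, which made a client-side problem look like a server fault. One way to approach the first two items is to mark validation failures explicitly and map them to 4xx responses. The sketch below is a suggestion, not existing project code: ValidationError and toHttpError are hypothetical names, and the choice of status codes is an assumption.

```javascript
// Hypothetical error type for problems caused by the request itself.
class ValidationError extends Error {
  constructor(message, statusCode = 400) {
    super(message);
    this.name = 'ValidationError';
    this.statusCode = statusCode;
  }
}

// Map an error to an HTTP status and a JSON body.
// Unexpected errors stay 500 and do not leak internal details.
function toHttpError(err) {
  if (err instanceof ValidationError) {
    return { status: err.statusCode, body: { error: err.message } };
  }
  return { status: 500, body: { error: 'Internal Server Error' } };
}

// In an Express-style handler, as used by the Functions Framework:
//   const { status, body } = toHttpError(err);
//   res.status(status).json(body);

console.log(toHttpError(new ValidationError('Invalid file type', 415)));
console.log(toHttpError(new Error('unexpected parser failure')));
```

With this split, an alert on 5xx rates fires only for genuine server faults, while rejected uploads appear as 4xx responses that the caller can act on.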

Related Issues

  • Similar issues with other converters: [None identified yet]
  • Dependencies requiring updates: [To be determined]

Next Steps: Execute steps 1-3 to gather initial error information.