chartdb_chartdb/parser-comparison-analysis.md

5.3 KiB

PostgreSQL vs MySQL Parser Comparison Analysis

Overview

This document compares how the PostgreSQL and MySQL parsers in ChartDB handle SQL parsing, focusing on the differences that could cause the same SQL file to produce different results.

1. SQL Sanitization and Comment Handling

PostgreSQL Parser (postgresql-improved.ts)

Comment Removal Strategy:

  1. Order: Comments are removed FIRST, before any other processing
  2. Multi-line comments: Removed using regex: /\/\*[\s\S]*?\*\//g
  3. Single-line comments: Removed line-by-line, checking for -- while respecting string boundaries
  4. String-aware: Preserves -- inside quoted strings
// PostgreSQL approach (lines 60-100)
// 1. First removes ALL multi-line comments
cleanedSQL = cleanedSQL.replace(/\/\*[\s\S]*?\*\//g, '');

// 2. Then processes single-line comments while respecting strings
for (let i = 0; i < line.length; i++) {
    // Tracks if we're inside a string to avoid removing -- inside quotes
}

MySQL Parser (mysql-improved.ts)

Comment Removal Strategy:

  1. Order: Comments are sanitized but with special handling for problematic patterns
  2. Special handling: Specifically fixes multi-line comments that contain quotes or JSON
  3. Line-by-line: Processes comments line by line, removing lines that start with -- or #
// MySQL approach (lines 35-67)
// 1. First fixes specific problematic patterns
result = result.replace(/--\s*"[^"]*",?\s*\n\s*"[^"]*".*$/gm, function(match) {
    return match.replace(/\n/g, ' ');
});

// 2. Then removes comment lines entirely
.map((line) => {
    if (trimmed.startsWith('--') || trimmed.startsWith('#')) {
        return '';
    }
    return line;
})

Key Difference: PostgreSQL removes ALL comments upfront, while MySQL tries to fix problematic comment patterns first, then removes comment lines.

2. Order of Operations

PostgreSQL Parser

  1. Preprocess SQL (removes all comments first)
  2. Split statements by semicolons (handles dollar quotes)
  3. Categorize statements (table, index, alter, etc.)
  4. Parse with node-sql-parser
  5. Fallback to regex if parser fails
  6. Extract relationships

MySQL Parser

  1. Validate syntax (checks for known issues)
  2. Sanitize SQL (fixes problematic patterns)
  3. Extract statements by semicolons
  4. Parse with node-sql-parser
  5. Fallback to regex if parser fails
  6. Process relationships

Key Difference: MySQL validates BEFORE sanitizing, while PostgreSQL sanitizes first. This means MySQL can detect and report issues that PostgreSQL might silently fix.

3. Multi-line Comment Handling

PostgreSQL

  • Removes ALL multi-line comments using [\s\S]*? pattern
  • No special handling for comments containing quotes or JSON
  • Clean removal before any parsing

MySQL

  • Specifically detects and fixes multi-line comments with quotes:
    -- "Beliebt",
    "Empfohlen"  -- This breaks MySQL parser
    
  • Detects JSON arrays in comments spanning lines:
    -- [
        "Ubuntu 22.04",
        "CentOS 8"
    ]  -- This also breaks MySQL parser
    
  • Converts these to single-line comments before parsing

Key Difference: MySQL has specific handling for problematic comment patterns that PostgreSQL simply removes entirely.

4. Statement Splitting

PostgreSQL

  • Handles PostgreSQL-specific dollar quotes ($$ ... $$)
  • Tracks quote depth for proper splitting
  • Supports function bodies with dollar quotes

MySQL

  • Simple quote tracking (single, double, backtick)
  • Handles escape sequences (\)
  • No special quote constructs

5. Validation Approach

PostgreSQL

  • No pre-validation
  • Relies on parser and fallback regex
  • Reports warnings for unsupported features

MySQL

  • Pre-validates SQL before parsing
  • Detects known problematic patterns:
    • Multi-line comments with quotes
    • JSON arrays in comments
    • Inline REFERENCES (PostgreSQL syntax)
    • Missing semicolons
  • Can reject SQL before attempting to parse

6. Why Same SQL Gives Different Results

Example Problematic SQL:

CREATE TABLE products (
    id INT PRIMARY KEY,
    status VARCHAR(50), -- "active",
"inactive", "pending"
    data JSON -- [
    {"key": "value"},
    {"key": "value2"}
]
);

PostgreSQL Result:

  • Successfully parses (comments are removed entirely)
  • Table created with proper columns

MySQL Result:

  • Validation fails with errors:
    • MULTILINE_COMMENT_QUOTE at line 3
    • MULTILINE_JSON_COMMENT at line 5
  • Import blocked unless validation is skipped

7. Recommendations

  1. For Cross-Database Compatibility:

    • Avoid multi-line comments with quotes or JSON
    • Keep comments on single lines
    • Use proper FOREIGN KEY syntax instead of inline REFERENCES
  2. For MySQL Import:

    • Fix validation errors before import
    • Or use skipValidation: true option if SQL is known to work
  3. For PostgreSQL Import:

    • Be aware that comments are stripped entirely
    • Complex comments might hide syntax issues

Conclusion

The main difference is that PostgreSQL takes a "remove all comments first" approach, while MySQL tries to detect and handle problematic comment patterns. This makes PostgreSQL more forgiving but MySQL more explicit about potential issues. The same SQL file can succeed in PostgreSQL but fail in MySQL if it contains multi-line comments with special characters.