Secure Parser Development Guidelines

This document provides guidelines for developing secure parsers in valid8r to prevent Denial of Service (DoS) vulnerabilities and other security issues.

Table of Contents

  1. Overview

  2. DoS Prevention

  3. ReDoS Prevention

  4. Input Validation Order

  5. Testing Requirements

  6. Error Messages

  7. Common Patterns

  8. References

Overview

All parsers in valid8r must follow security-first design principles to prevent resource exhaustion attacks. The primary threat model is:

  • Attacker Goal: Exhaust server resources (CPU, memory) by sending malicious inputs

  • Attack Vector: Extremely long strings that trigger expensive operations

  • Impact: Service degradation or complete denial of service

DoS Prevention

Principle: Early Length Guards

ALWAYS validate input length BEFORE expensive operations:

def parse_example(text: str) -> Maybe[Example]:
    """Parse example with DoS protection."""
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')

    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')

    # 3. CRITICAL: Early length guard (DoS mitigation)
    # Check BEFORE any expensive operations (regex, external libraries, etc.)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # 4. Now safe to perform expensive operations
    # - Regex matching
    # - External library calls
    # - Complex parsing logic
    ...

Maximum Input Lengths

Use RFC standards and industry best practices to determine limits:

Parser

Max Length

Rationale

parse_email()

254 chars

RFC 5321 maximum email address length

parse_url()

2048 chars

Browser URL length limit

parse_uuid()

45 chars

Standard UUID format (36) + safety margin

parse_ip()

45 chars

IPv6 max length

parse_cidr()

50 chars

IPv6 + CIDR notation

parse_phone()

100 chars

NANP format + extensions + safety margin

parse_slug()

255 chars

Database VARCHAR(255) limits

parse_json()

1,000,000 chars

1MB (reasonable API payload)

parse_jwt()

10,000 chars

10KB (typical JWT size)

parse_base64()

10,000,000 chars

10MB (reasonable file size)

Performance Threshold

All parsers must reject malicious input in < 10ms (< 1ms for simple parsers).

This ensures that even under attack, the server can handle thousands of malicious requests per second without significant resource impact.

ReDoS Prevention

Overview of ReDoS Attacks

Regular Expression Denial of Service (ReDoS) attacks exploit the exponential time complexity of certain regex patterns. When a regex engine encounters nested quantifiers or overlapping patterns, it may experience “catastrophic backtracking,” causing processing time to grow exponentially with input length.

Automated ReDoS Detection

valid8r includes automated ReDoS detection in the CI/CD pipeline using regexploit:

# Scan a single file
uv run python scripts/check_regex_safety.py valid8r/core/parsers.py

# Scan entire directory
uv run python scripts/check_regex_safety.py valid8r/

# Run via tox
uv run tox -e security

Vulnerable Pattern Examples

# ❌ UNSAFE: Nested quantifiers cause exponential backtracking
re.compile(r'(a+)+')      # O(2^n) complexity
re.compile(r'(a*)*')      # O(2^n) complexity
re.compile(r'(a+)*')      # O(2^n) complexity

# ✅ SAFE: No nested quantifiers
re.compile(r'a+')         # O(n) linear complexity
re.compile(r'[a-z]+')     # O(n) linear complexity
re.compile(r'\d{3,10}')   # O(n) with fixed bounds

Safe Regex Patterns in valid8r

All regex patterns in valid8r have been verified safe:

# Phone extension pattern - safe alternation
_PHONE_EXTENSION_PATTERN = re.compile(
    r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$',
    re.IGNORECASE
)

# Phone valid chars - character class (inherently safe)
_PHONE_VALID_CHARS_PATTERN = re.compile(r'^[\d\s()\-+.]+$', re.MULTILINE)

# Slug validation - anchored with single quantifier
SLUG_PATTERN = re.compile(r'^[a-z0-9-]+$')

CI/CD Integration

The ReDoS scanner runs automatically on every PR via .github/workflows/security-checks.yml:

  • Trigger: Push to main/develop, pull requests

  • Action: Scans all Python files for vulnerable regex patterns

  • Result: PR is blocked if vulnerabilities are detected

Example output:

Safe patterns:

✅ All 4 regex pattern(s) are safe (no ReDoS vulnerabilities detected)

Vulnerable patterns (blocks PR):

❌ Found 1 vulnerable regex pattern(s):

  File: valid8r/core/parsers.py:123
  Pattern: (a+)+
  Reason: Exponential complexity (⭐×11) - catastrophic backtracking
  Attack string: a * 3456

Testing for ReDoS

Performance tests verify regex patterns are safe:

def it_phone_extension_pattern_is_safe(self) -> None:
    """Phone extension regex pattern is not vulnerable to catastrophic backtracking."""
    import re
    import time

    pattern = re.compile(r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$', re.IGNORECASE)

    # Test with adversarial input (many spaces)
    adversarial = ' ' * 1000 + 'x123'

    start = time.perf_counter()
    result = pattern.search(adversarial)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Should complete quickly (no catastrophic backtracking)
    assert elapsed_ms < 10, f'Pattern matching took {elapsed_ms:.2f}ms, should be < 10ms'

Input Validation Order

Follow this exact order for all parsers:

  1. Type Check: Verify input is the expected type (usually str)

  2. Empty Check: Reject empty/None inputs

  3. Length Guard: Check length BEFORE expensive operations (DoS protection)

  4. Library Check: Verify optional dependencies are available

  5. Expensive Operations: Regex, external libraries, complex parsing

Example:

def parse_secure(text: str) -> Maybe[Result]:
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')

    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')

    # 3. Length guard (CRITICAL - DoS protection)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # 4. Library check (if needed)
    if not HAS_LIBRARY:
        return Maybe.failure('Required library not installed')

    # 5. Expensive operations (NOW SAFE)
    try:
        result = expensive_operation(s)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))

Testing Requirements

Correctness Tests

Every parser must have tests verifying:

  1. Valid inputs (success cases)

  2. Invalid inputs (failure cases with correct error messages)

  3. Edge cases (boundary conditions, empty, whitespace)

Security Tests

Every parser must have DoS protection tests:

def it_rejects_excessively_long_input(self) -> None:
    """Reject extremely long input to prevent DoS attacks."""
    import time

    # Create malicious input exceeding maximum length
    malicious_input = 'a' * (MAX_LENGTH + 100)

    # Measure rejection time
    start = time.perf_counter()
    result = parse_function(malicious_input)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Verify correctness
    assert result.is_failure()
    assert 'too long' in result.error_or('').lower()

    # Verify performance (DoS protection)
    assert elapsed_ms < 10, f'Rejection took {elapsed_ms:.2f}ms, should be < 10ms'

Coverage Requirements

  • Line coverage: 100%

  • Branch coverage: 100%

  • Mutation coverage: Strongly recommended

Error Messages

User-Facing Messages

Error messages must be:

  1. Clear and actionable: Tell the user what went wrong and how to fix it

  2. Deterministic: Same input always produces same error

  3. Length-limited: Avoid revealing the entire malicious input in error messages

  4. Security-conscious: Don’t leak implementation details

Good:

return Maybe.failure('Email address is too long (maximum 254 characters)')
return Maybe.failure('Phone number must have 10 digits, got 8')
return Maybe.failure('Invalid format: phone number contains invalid characters')

Bad:

return Maybe.failure(f'Input too long: {text}')  # Leaks malicious input
return Maybe.failure('regex match failed')  # Implementation detail
return Maybe.failure('Error')  # Not actionable

Docstring Documentation

Every parser must document:

  1. Rules: What constitutes valid/invalid input

  2. Maximum length: Document the limit in the docstring

  3. Failure messages: List all possible error messages

  4. Examples: Provide usage examples in doctests

def parse_example(text: str) -> Maybe[Example]:
    """Parse example with security constraints.

    Rules:
    - Maximum length: 254 characters
    - Must contain '@' symbol
    - Cannot start with whitespace

    Failure messages:
    - Input must be a string
    - Input must not be empty
    - Input is too long (maximum 254 characters)
    - Invalid format: missing '@' symbol

    Args:
        text: Input string to parse

    Returns:
        Maybe[Example]: Success with parsed value or Failure with error

    Examples:
        >>> parse_example('valid@input').is_success()
        True
        >>> parse_example('a' * 300).is_failure()
        True
    """

Common Patterns

Pattern 1: External Library Validation

When using external libraries for validation (e.g., email-validator):

def parse_with_library(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE calling library
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    try:
        result = external_library.validate(text)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))

Pattern 2: Regex Operations

When using regex for validation/parsing:

# Compile patterns at module level for performance
_PATTERN = re.compile(r'^[a-z0-9-]+$')

def parse_with_regex(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE regex operations
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # Now safe to use regex
    if not _PATTERN.match(text):
        return Maybe.failure('Invalid format')

    return Maybe.success(text)

Pattern 3: Multi-Stage Parsing

When parsing involves multiple stages:

def parse_complex(text: str) -> Maybe[Result]:
    # Check length ONCE at the beginning
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # All subsequent operations are safe
    stage1 = _parse_stage1(text)  # No need to recheck length
    if stage1.is_failure():
        return stage1

    stage2 = _parse_stage2(stage1.value_or(''))
    return stage2

References

Security Standards

  • OWASP Top 10 2021: A04 - Insecure Design

  • CWE-400: Uncontrolled Resource Consumption

  • RFC 5321: SMTP (email address length limits)

Project References

  • Issue #134: ReDoS detection automation in CI/CD (this issue)

  • Issue #132: Comprehensive security audit of all parsers

  • Issue #131: Phone parser DoS vulnerability (fixed in v0.9.1)

Testing

  • Security Tests: /tests/security/test_redos_detection.py

  • Scanner Tests: /tests/security/test_check_regex_safety.py

  • ReDoS Scanner: /scripts/check_regex_safety.py

Checklist for New Parsers

When creating a new parser, verify:

  • Type check is first validation

  • Empty/None input is rejected

  • Length guard is present BEFORE expensive operations

  • Maximum length is documented in docstring

  • Error messages are clear and actionable

  • DoS protection test exists (< 10ms rejection)

  • ReDoS scanner passes for any regex patterns

  • Regex performance test exists if using patterns

  • Test coverage is 100% (line + branch)

  • Performance is measured and verified

Maintenance

This document should be updated when:

  • New parsers are added to the library

  • New security threats are identified

  • Performance thresholds change

  • RFC standards are updated

  • Best practices evolve

Last Updated: 2025-11-11 (Issue #134 ReDoS detection automation)