Secure Parser Development Guidelines

This document provides guidelines for developing secure parsers in valid8r to prevent Denial of Service (DoS) vulnerabilities and other security issues.

Table of Contents

Overview
DoS Prevention
ReDoS Prevention
Input Validation Order
Testing Requirements
Error Messages
Common Patterns
References
Maintenance

Overview

All parsers in valid8r must follow security-first design principles to prevent resource exhaustion attacks. The primary threat model is:

Attacker Goal: Exhaust server resources (CPU, memory) by sending malicious inputs
Attack Vector: Extremely long strings that trigger expensive operations
Impact: Service degradation or complete denial of service

DoS Prevention

Principle: Early Length Guards

ALWAYS validate input length BEFORE expensive operations:

def parse_example(text: str) -> Maybe[Example]:
    """Parse example with DoS protection."""
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')

    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')

    # 3. CRITICAL: Early length guard (DoS mitigation)
    # Check BEFORE any expensive operations (regex, external libraries, etc.)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # 4. Now safe to perform expensive operations
    # - Regex matching
    # - External library calls
    # - Complex parsing logic
    ...

Maximum Input Lengths

Use RFC standards and industry best practices to determine limits:

| Parser | Max Length | Rationale |
| --- | --- | --- |
| `parse_email()` | 254 chars | RFC 5321 maximum email address length |
| `parse_url()` | 2048 chars | Common de facto browser URL length limit |
| `parse_uuid()` | 45 chars | Standard UUID format (36) + safety margin |
| `parse_ip()` | 45 chars | Maximum IPv6 text length (IPv4-mapped form) |
| `parse_cidr()` | 50 chars | IPv6 address (45) + prefix such as /128 + safety margin |
| `parse_phone()` | 100 chars | NANP format + extensions + safety margin |
| `parse_slug()` | 255 chars | Database VARCHAR(255) limits |
| `parse_json()` | 1,000,000 chars | About 1 MB (reasonable API payload) |
| `parse_jwt()` | 10,000 chars | About 10 KB (generous bound for typical JWTs) |
| `parse_base64()` | 10,000,000 chars | About 10 MB (reasonable file size) |
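
One way to keep these limits consistent across parsers is to hold them in a single mapping and derive both the guard and its error message from it. The sketch below assumes a hypothetical `MAX_LENGTHS` mapping and `length_error` helper; neither name comes from valid8r, and the values simply mirror the table above.

```python
from typing import Dict, Optional

# Hypothetical registry of limits (illustrative only, not valid8r's actual constants).
MAX_LENGTHS: Dict[str, int] = {
    'email': 254,
    'url': 2048,
    'uuid': 45,
    'ip': 45,
    'cidr': 50,
    'phone': 100,
    'slug': 255,
    'json': 1_000_000,
    'jwt': 10_000,
    'base64': 10_000_000,
}


def length_error(kind: str, text: str) -> Optional[str]:
    """Return an error message if text exceeds the limit for kind, else None."""
    limit = MAX_LENGTHS[kind]
    # len() on a str is O(1), so this check costs the same for any input size.
    if len(text) > limit:
        return f'Input is too long (maximum {limit} characters)'
    return None
```

A parser would call `length_error('email', text)` before any regex or library call and return a failure when the result is not `None`.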

Performance Threshold

All parsers must reject malicious input in < 10ms (< 1ms for simple parsers).

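This bounds the cost of an attack. Even at the 10ms ceiling a single worker rejects at least 100 malicious requests per second, and a rejection that stops at the length guard is far cheaper because len() is O(1) in Python, so the server can absorb thousands of malicious requests per second without significant resource impact.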
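
A threshold like this is easiest to enforce when it is measured the same way everywhere. The following is a minimal sketch of such a measurement, assuming a hypothetical `median_rejection_ms` helper and a stand-in `guarded_parse` function (neither is part of valid8r). It takes the median of several runs so that a single scheduling hiccup does not dominate the result.

```python
import statistics
import time
from typing import Callable, Optional

MAX_LENGTH = 254  # hypothetical limit for the stand-in parser below


def guarded_parse(text: str) -> Optional[str]:
    """Stand-in parser: returns None when the length guard rejects the input."""
    if len(text) > MAX_LENGTH:
        return None
    return text


def median_rejection_ms(parse: Callable[[str], object], payload: str, repeats: int = 25) -> float:
    """Return the median wall-clock time of parse(payload) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse(payload)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Build the payload outside the timed region so only the rejection is measured.
oversized = 'a' * 1_000_000
elapsed_ms = median_rejection_ms(guarded_parse, oversized)
print(f'median rejection time: {elapsed_ms:.4f}ms')
```

Absolute timings depend on the machine, so the printed value is only meaningful relative to the 10ms and 1ms budgets.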

ReDoS Prevention

Overview of ReDoS Attacks

Regular Expression Denial of Service (ReDoS) attacks exploit the worst-case time complexity of backtracking regex engines such as Python's re module. When a pattern contains nested quantifiers or overlapping alternatives, an input that almost matches can force the engine to try every way of dividing the input between them. This “catastrophic backtracking” causes processing time to grow exponentially with input length.
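
The effect can be observed directly with the standard library. The sketch below times a nested-quantifier pattern against a single-quantifier pattern that accepts the same strings; the helper name `time_match_ms` is hypothetical, and the input sizes are kept deliberately small because each extra character roughly doubles the unsafe timing.

```python
import re
import time

# Nested quantifier: the inner and outer '+' can divide a run of 'a' characters
# in exponentially many ways, and a failing match tries all of them.
UNSAFE = re.compile(r'(a+)+$')

# Accepts the same strings with a single quantifier.
SAFE = re.compile(r'a+$')


def time_match_ms(pattern: re.Pattern, text: str) -> float:
    """Return the wall-clock time of one match attempt in milliseconds."""
    start = time.perf_counter()
    pattern.match(text)
    return (time.perf_counter() - start) * 1000


# Keep n small: the unsafe pattern becomes impractically slow well before n = 30.
for n in (10, 14, 18):
    payload = 'a' * n + '!'  # the trailing '!' forces the overall match to fail
    unsafe_ms = time_match_ms(UNSAFE, payload)
    safe_ms = time_match_ms(SAFE, payload)
    print(f'n={n:2d}  unsafe={unsafe_ms:8.3f}ms  safe={safe_ms:8.3f}ms')
```

Exact numbers vary by machine and Python version; the point is the growth rate of the unsafe column compared with the flat safe column.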

Automated ReDoS Detection

valid8r includes automated ReDoS detection in the CI/CD pipeline using regexploit:

# Scan a single file
uv run python scripts/check_regex_safety.py valid8r/core/parsers.py

# Scan entire directory
uv run python scripts/check_regex_safety.py valid8r/

# Run via tox
uv run tox -e security

Vulnerable Pattern Examples

# ❌ UNSAFE: Nested quantifiers cause exponential backtracking when the rest of the match fails
re.compile(r'(a+)+')      # O(2^n) complexity
re.compile(r'(a*)*')      # O(2^n) complexity
re.compile(r'(a+)*')      # O(2^n) complexity

# ✅ SAFE: No nested quantifiers
re.compile(r'a+')         # O(n) linear complexity
re.compile(r'[a-z]+')     # O(n) linear complexity
re.compile(r'\d{3,10}')   # O(n) with fixed bounds
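
Real patterns rarely look like `(a+)+`, but the same shape appears whenever a repeated group contains its own repetition. The sketch below, using hypothetical names `UNSAFE_WORDS` and `SAFE_WORDS`, rewrites one such pattern so that each character can be consumed in only one way. The loop checks that both versions agree on a handful of samples; it does not prove equivalence.

```python
import re

# Nested quantifier: a run of word characters can be split between the inner
# \w+ and the outer + in exponentially many ways when the overall match fails.
UNSAFE_WORDS = re.compile(r'(?:\w+\s?)+')

# Intended to accept the same strings: words separated by single whitespace
# characters, with an optional trailing whitespace character.
SAFE_WORDS = re.compile(r'\w+(?:\s\w+)*\s?')

SAMPLES = ['hello world', 'hello  world', 'hello ', ' hello', '', 'a b c', 'hello!']

for sample in SAMPLES:
    unsafe_ok = UNSAFE_WORDS.fullmatch(sample) is not None
    safe_ok = SAFE_WORDS.fullmatch(sample) is not None
    # Both patterns should classify every short sample the same way.
    assert unsafe_ok == safe_ok, sample

# The rewritten pattern rejects a long near-miss without exponential backtracking.
# (Do not run UNSAFE_WORDS on this input: it would effectively never finish.)
assert SAFE_WORDS.fullmatch('a' * 5000 + '!') is None
```

When a rewrite like this is not obvious, the early length guard remains the fallback that caps how much input the pattern can ever see.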

Safe Regex Patterns in valid8r

All regex patterns in valid8r have been verified safe:

# Phone extension pattern - safe alternation
_PHONE_EXTENSION_PATTERN = re.compile(
    r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$',
    re.IGNORECASE
)

# Phone valid chars - character class (inherently safe)
_PHONE_VALID_CHARS_PATTERN = re.compile(r'^[\d\s()\-+.]+$', re.MULTILINE)

# Slug validation - anchored with single quantifier
SLUG_PATTERN = re.compile(r'^[a-z0-9-]+$')

CI/CD Integration

The ReDoS scanner runs automatically on every PR via .github/workflows/security-checks.yml:

Trigger: Push to main/develop, pull requests
Action: Scans all Python files for vulnerable regex patterns
Result: PR is blocked if vulnerabilities are detected

Example output:

Safe patterns:

✅ All 4 regex pattern(s) are safe (no ReDoS vulnerabilities detected)

Vulnerable patterns (blocks PR):

❌ Found 1 vulnerable regex pattern(s):

  File: valid8r/core/parsers.py:123
  Pattern: (a+)+
  Reason: Exponential complexity (⭐×11) - catastrophic backtracking
  Attack string: a * 3456

Testing for ReDoS

Performance tests verify regex patterns are safe:

def it_phone_extension_pattern_is_safe(self) -> None:
    """Phone extension regex pattern is not vulnerable to catastrophic backtracking."""
    import re
    import time

    pattern = re.compile(r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$', re.IGNORECASE)

    # Test with adversarial input (many spaces)
    adversarial = ' ' * 1000 + 'x123'

    start = time.perf_counter()
    result = pattern.search(adversarial)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # The extension is still found despite the long run of leading spaces
    assert result is not None

    # Should complete quickly (no catastrophic backtracking)
    assert elapsed_ms < 10, f'Pattern matching took {elapsed_ms:.2f}ms, should be < 10ms'

Input Validation Order

Follow this exact order for all parsers:

1. Type Check: Verify input is the expected type (usually str); this also rejects None
2. Empty Check: Reject empty and whitespace-only inputs
3. Length Guard: Check length BEFORE expensive operations (DoS protection)
4. Library Check: Verify optional dependencies are available
5. Expensive Operations: Regex, external libraries, complex parsing

Example:

def parse_secure(text: str) -> Maybe[Result]:
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')

    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')

    # 3. Length guard (CRITICAL - DoS protection)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # 4. Library check (if needed)
    if not HAS_LIBRARY:
        return Maybe.failure('Required library not installed')

    # 5. Expensive operations (NOW SAFE)
    try:
        result = expensive_operation(s)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))

Testing Requirements

Correctness Tests

Every parser must have tests verifying:

Valid inputs (success cases)
Invalid inputs (failure cases with correct error messages)
Edge cases (boundary conditions, empty, whitespace)
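
The three categories above can be sketched as follows. Everything here is illustrative: `Outcome` is a stand-in for valid8r's Maybe, `parse_slug_sketch` is a hypothetical parser written only for this example, and the `it_` function names assume the test runner is configured to collect that prefix, as the other tests in this document suggest.

```python
import re
from dataclasses import dataclass
from typing import Optional

MAX_LENGTH = 255  # slug limit used elsewhere in this document
_SLUG = re.compile(r'[a-z0-9-]+')


@dataclass(frozen=True)
class Outcome:
    """Stand-in for valid8r's Maybe (hypothetical, for illustration only)."""
    value: Optional[str] = None
    error: Optional[str] = None

    def is_success(self) -> bool:
        return self.error is None


def parse_slug_sketch(text: object) -> Outcome:
    """Hypothetical parser following the type, empty, length, regex order."""
    if not isinstance(text, str):
        return Outcome(error='Input must be a string')
    s = text.strip()
    if s == '':
        return Outcome(error='Input must not be empty')
    if len(text) > MAX_LENGTH:
        return Outcome(error=f'Input is too long (maximum {MAX_LENGTH} characters)')
    if _SLUG.fullmatch(s) is None:
        return Outcome(error='Invalid format: slug contains invalid characters')
    return Outcome(value=s)


def it_accepts_valid_input() -> None:
    assert parse_slug_sketch('my-post-1').value == 'my-post-1'


def it_reports_a_specific_error_for_invalid_input() -> None:
    result = parse_slug_sketch('My Post')
    assert result.error == 'Invalid format: slug contains invalid characters'


def it_handles_edge_cases() -> None:
    assert parse_slug_sketch('   ').error == 'Input must not be empty'
    assert parse_slug_sketch(None).error == 'Input must be a string'
    assert parse_slug_sketch(' my-post ').value == 'my-post'
    # Boundary: exactly MAX_LENGTH is accepted, one more character is rejected.
    assert parse_slug_sketch('a' * MAX_LENGTH).is_success()
    assert not parse_slug_sketch('a' * (MAX_LENGTH + 1)).is_success()
```

Testing both sides of the length boundary catches off-by-one mistakes such as writing `>=` instead of `>` in the guard.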

Security Tests

Every parser must have DoS protection tests:

def it_rejects_excessively_long_input(self) -> None:
    """Reject extremely long input to prevent DoS attacks."""
    import time

    # Create malicious input exceeding maximum length
    malicious_input = 'a' * (MAX_LENGTH + 100)

    # Measure rejection time
    start = time.perf_counter()
    result = parse_function(malicious_input)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Verify correctness
    assert result.is_failure()
    assert 'too long' in result.error_or('').lower()

    # Verify performance (DoS protection)
    assert elapsed_ms < 10, f'Rejection took {elapsed_ms:.2f}ms, should be < 10ms'

Coverage Requirements

Line coverage: 100%
Branch coverage: 100%
Mutation coverage: Strongly recommended

Error Messages

User-Facing Messages

Error messages must be:

Clear and actionable: Tell the user what went wrong and how to fix it
Deterministic: Same input always produces same error
Length-limited: Avoid revealing the entire malicious input in error messages
Security-conscious: Don’t leak implementation details

Good:

return Maybe.failure('Email address is too long (maximum 254 characters)')
return Maybe.failure('Phone number must have 10 digits, got 8')
return Maybe.failure('Invalid format: phone number contains invalid characters')

Bad:

return Maybe.failure(f'Input too long: {text}')  # Leaks malicious input
return Maybe.failure('regex match failed')  # Implementation detail
return Maybe.failure('Error')  # Not actionable
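
If some context about the rejected input is genuinely useful, for example in logs, it can be capped and escaped rather than echoed in full. The helpers below are a minimal sketch under that assumption; `safe_preview` and `too_long_message` are hypothetical names, not part of valid8r.

```python
def safe_preview(text: str, limit: int = 20) -> str:
    """Return an escaped, length-capped preview of untrusted input."""
    head = text[:limit]
    suffix = '...' if len(text) > limit else ''
    # repr() escapes control characters such as newlines, keeping the preview on one line.
    return repr(head) + suffix


def too_long_message(text: str, max_length: int) -> str:
    """Describe a length violation using sizes only; the input is never echoed."""
    return f'Input is too long (maximum {max_length} characters, got {len(text)})'


print(safe_preview('a' * 100))
print(too_long_message('a' * 300, 254))
```

Reporting the observed length gives the user something actionable while keeping the message size constant regardless of what the attacker sent.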

Docstring Documentation

Every parser must document:

Rules: What constitutes valid/invalid input
Maximum length: Document the limit in the docstring
Failure messages: List all possible error messages
Examples: Provide usage examples in doctests

def parse_example(text: str) -> Maybe[Example]:
    """Parse example with security constraints.

    Rules:
    - Maximum length: 254 characters
    - Must contain '@' symbol
    - Cannot start with whitespace

    Failure messages:
    - Input must be a string
    - Input must not be empty
    - Input is too long (maximum 254 characters)
    - Invalid format: missing '@' symbol

    Args:
        text: Input string to parse

    Returns:
        Maybe[Example]: Success with parsed value or Failure with error

    Examples:
        >>> parse_example('valid@input').is_success()
        True
        >>> parse_example('a' * 300).is_failure()
        True
    """

Common Patterns

Pattern 1: External Library Validation

When using external libraries for validation (e.g., email-validator):

def parse_with_library(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE calling library
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    try:
        result = external_library.validate(text)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))

Pattern 2: Regex Operations

When using regex for validation/parsing:

# Compile patterns at module level for performance
_PATTERN = re.compile(r'^[a-z0-9-]+$')

def parse_with_regex(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE regex operations
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # Now safe to use regex
    if not _PATTERN.match(text):
        return Maybe.failure('Invalid format')

    return Maybe.success(text)

Pattern 3: Multi-Stage Parsing

When parsing involves multiple stages:

def parse_complex(text: str) -> Maybe[Result]:
    # Check length ONCE at the beginning
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')

    # All subsequent operations are safe
    stage1 = _parse_stage1(text)  # No need to recheck length
    if stage1.is_failure():
        return stage1

    stage2 = _parse_stage2(stage1.value_or(''))
    return stage2

References

Security Standards

OWASP Top 10 2021: A04 - Insecure Design
CWE-400: Uncontrolled Resource Consumption
RFC 5321: SMTP (email address length limits)

Project References

Issue #134: ReDoS detection automation in CI/CD
Issue #132: Comprehensive security audit of all parsers
Issue #131: Phone parser DoS vulnerability (fixed in v0.9.1)

Testing

Security Tests: /tests/security/test_redos_detection.py
Scanner Tests: /tests/security/test_check_regex_safety.py
ReDoS Scanner: /scripts/check_regex_safety.py

Maintenance

This document should be updated when:

New parsers are added to the library
New security threats are identified
Performance thresholds change
RFC standards are updated
Best practices evolve

Last Updated: 2025-11-11 (Issue #134 ReDoS detection automation)