Secure Parser Development Guidelines
This document provides guidelines for developing secure parsers in valid8r to prevent Denial of Service (DoS) vulnerabilities and other security issues.
Overview
All parsers in valid8r must follow security-first design principles to prevent resource exhaustion attacks. The primary threat model is:
Attacker Goal: Exhaust server resources (CPU, memory) by sending malicious inputs
Attack Vector: Extremely long strings that trigger expensive operations
Impact: Service degradation or complete denial of service
DoS Prevention
Principle: Early Length Guards
ALWAYS validate input length BEFORE expensive operations:
def parse_example(text: str) -> Maybe[Example]:
    """Parse example with DoS protection."""
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')
    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')
    # 3. CRITICAL: Early length guard (DoS mitigation)
    # Check BEFORE any expensive operations (regex, external libraries, etc.)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')
    # 4. Now safe to perform expensive operations
    # - Regex matching
    # - External library calls
    # - Complex parsing logic
    ...
Maximum Input Lengths
Use RFC standards and industry best practices to determine limits:
| Parser | Max Length | Rationale |
|---|---|---|
| | 254 chars | RFC 5321 maximum email address length |
| | 2048 chars | Browser URL length limit |
| | 45 chars | Standard UUID format (36) + safety margin |
| | 45 chars | IPv6 max length |
| | 50 chars | IPv6 + CIDR notation |
| | 100 chars | NANP format + extensions + safety margin |
| | 255 chars | Database VARCHAR(255) limits |
| | 1,000,000 chars | 1MB (reasonable API payload) |
| | 10,000 chars | 10KB (typical JWT size) |
| | 10,000,000 chars | 10MB (reasonable file size) |
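One way to keep limits like these consistent is to define them once and route every check through a single guard. The sketch below is illustrative only: `MAX_LENGTHS`, its keys, and `check_length` are hypothetical names rather than valid8r APIs, and the guard returns a plain error string (or `None`) instead of a `Maybe`.

```python
from typing import Optional

# Hypothetical registry: the keys are illustrative labels taken from the
# Rationale column above, not valid8r parser names.
MAX_LENGTHS = {
    'email': 254,  # RFC 5321 maximum email address length
    'url': 2048,   # Browser URL length limit
    'uuid': 45,    # Standard UUID format (36) + safety margin
}


def check_length(kind: str, text: str) -> Optional[str]:
    """Return an error message if text exceeds the limit for kind, else None."""
    limit = MAX_LENGTHS[kind]
    # len() on a str is O(1) in CPython, so the check costs the same
    # for 10 characters as for 10 million.
    if len(text) > limit:
        return f'Input is too long (maximum {limit} characters)'
    return None


print(check_length('email', 'user@example.com'))  # None
print(check_length('uuid', 'a' * 46))  # Input is too long (maximum 45 characters)
```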
Performance Threshold
All parsers must reject over-length input in < 10ms (< 1ms for simple parsers), measured from the parser call to the returned failure.
This keeps the cost of each rejected request small and bounded, so that even under attack a flood of malicious requests does not consume significant CPU or memory.
ReDoS Prevention
Overview of ReDoS Attacks
Regular Expression Denial of Service (ReDoS) attacks exploit the exponential worst-case time complexity of certain regex patterns. When a backtracking regex engine encounters nested quantifiers or overlapping alternatives and the overall match fails, it may retry many equivalent ways of matching the same text. This “catastrophic backtracking” causes processing time to grow exponentially with input length.
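To see where the exponential growth comes from, consider `(a+)+` applied to a run of `a` characters followed by a character that forces the match to fail. Before giving up, a backtracking engine may try every way of splitting the run into non-empty groups. The sketch below does not run a regex at all; it only counts those splits with a hypothetical helper, `count_groupings`, to show that the number doubles with each extra character.

```python
def count_groupings(n: int) -> int:
    """Count the ways to split a run of n characters into non-empty groups."""
    if n == 0:
        return 1  # nothing left to split: one complete grouping found
    total = 0
    # Choose the size of the first group, then split the remainder recursively.
    for first_group in range(1, n + 1):
        total += count_groupings(n - first_group)
    return total


for n in (5, 10, 15):
    print(n, count_groupings(n))
# 5 16
# 10 512
# 15 16384
```

The count is 2^(n-1), which is why a few dozen extra characters can turn a fast failure into one that effectively never returns.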
Automated ReDoS Detection
valid8r includes automated ReDoS detection in the CI/CD pipeline using regexploit:
# Scan a single file
uv run python scripts/check_regex_safety.py valid8r/core/parsers.py
# Scan entire directory
uv run python scripts/check_regex_safety.py valid8r/
# Run via tox
uv run tox -e security
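As a rough illustration of the first step such a scanner needs, the sketch below collects string literals passed to `re.compile` using the standard `ast` module. It is not the code in `scripts/check_regex_safety.py`, whose internals are not shown here; `find_regex_literals` is a hypothetical helper, and it only recognizes the literal form `re.compile('...')`.

```python
import ast


def find_regex_literals(source: str) -> list[tuple[int, str]]:
    """Return (line number, pattern) for each re.compile('<literal>') call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_re_compile = (
            isinstance(func, ast.Attribute)
            and func.attr == 'compile'
            and isinstance(func.value, ast.Name)
            and func.value.id == 're'
        )
        if not is_re_compile or not node.args:
            continue
        first_arg = node.args[0]
        # Only plain string literals are collected; patterns built at
        # runtime are skipped by this sketch.
        if isinstance(first_arg, ast.Constant) and isinstance(first_arg.value, str):
            found.append((node.lineno, first_arg.value))
    return sorted(found)


SAMPLE = '''import re
SLUG = re.compile(r'^[a-z0-9-]+$')
BAD = re.compile(r'(a+)+')
'''

for lineno, pattern in find_regex_literals(SAMPLE):
    print(f'line {lineno}: {pattern}')
# line 2: ^[a-z0-9-]+$
# line 3: (a+)+
```

Each collected pattern would then be handed to an analyzer, the role this document assigns to regexploit.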
Vulnerable Pattern Examples
# ❌ UNSAFE: Nested quantifiers cause exponential backtracking
re.compile(r'(a+)+') # O(2^n) complexity
re.compile(r'(a*)*') # O(2^n) complexity
re.compile(r'(a+)*') # O(2^n) complexity
# ✅ SAFE: No nested quantifiers
re.compile(r'a+') # O(n) linear complexity
re.compile(r'[a-z]+') # O(n) linear complexity
re.compile(r'\d{3,10}') # O(n) with fixed bounds
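A nested quantifier can often be removed without changing which strings match. As a minimal sketch, `(a+)+` and `a+` accept exactly the same inputs; the check below compares them on short strings only, so the unsafe pattern never has room to backtrack noticeably.

```python
import re
from itertools import product

UNSAFE = re.compile(r'(a+)+')  # nested quantifier
SAFE = re.compile(r'a+')       # same language, single quantifier

# Every string over {'a', 'b'} up to length 6, including the empty string.
samples = [''.join(chars) for n in range(7) for chars in product('ab', repeat=n)]

mismatches = [
    s for s in samples
    if bool(UNSAFE.fullmatch(s)) != bool(SAFE.fullmatch(s))
]
print(len(samples), mismatches)  # 127 []
```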
Safe Regex Patterns in valid8r
All regex patterns in valid8r have been verified safe:
# Phone extension pattern - safe alternation
_PHONE_EXTENSION_PATTERN = re.compile(
    r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$',
    re.IGNORECASE,
)
# Phone valid chars - character class (inherently safe)
_PHONE_VALID_CHARS_PATTERN = re.compile(r'^[\d\s()\-+.]+$', re.MULTILINE)
# Slug validation - anchored with single quantifier
SLUG_PATTERN = re.compile(r'^[a-z0-9-]+$')
CI/CD Integration
The ReDoS scanner runs automatically on every PR via .github/workflows/security-checks.yml:
Trigger: Push to main/develop, pull requests
Action: Scans all Python files for vulnerable regex patterns
Result: PR is blocked if vulnerabilities are detected
Example output:
Safe patterns:
✅ All 4 regex pattern(s) are safe (no ReDoS vulnerabilities detected)
Vulnerable patterns (blocks PR):
❌ Found 1 vulnerable regex pattern(s):
File: valid8r/core/parsers.py:123
Pattern: (a+)+
Reason: Exponential complexity (⭐×11) - catastrophic backtracking
Attack string: a * 3456
Testing for ReDoS
Performance tests verify regex patterns are safe:
def it_phone_extension_pattern_is_safe(self) -> None:
    """Phone extension regex pattern is not vulnerable to catastrophic backtracking."""
    import re
    import time

    pattern = re.compile(r'\s*[,;]\s*(\d+)$|\s+(?:x|ext\.?|extension)\s*(\d+)$', re.IGNORECASE)
    # Test with adversarial input (many spaces)
    adversarial = ' ' * 1000 + 'x123'
    start = time.perf_counter()
    result = pattern.search(adversarial)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Should complete quickly (no catastrophic backtracking)
    assert elapsed_ms < 10, f'Pattern matching took {elapsed_ms:.2f}ms, should be < 10ms'
Input Validation Order
Follow this exact order for all parsers:
Type Check: Verify input is the expected type (usually str)
Empty Check: Reject empty/None inputs
Length Guard: Check length BEFORE expensive operations (DoS protection)
Library Check: Verify optional dependencies are available
Expensive Operations: Regex, external libraries, complex parsing
Example:
def parse_secure(text: str) -> Maybe[Result]:
    # 1. Type check
    if not isinstance(text, str):
        return Maybe.failure('Input must be a string')
    # 2. Empty check
    s = text.strip()
    if s == '':
        return Maybe.failure('Input must not be empty')
    # 3. Length guard (CRITICAL - DoS protection)
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')
    # 4. Library check (if needed)
    if not HAS_LIBRARY:
        return Maybe.failure('Required library not installed')
    # 5. Expensive operations (NOW SAFE)
    try:
        result = expensive_operation(s)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))
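The template above relies on placeholders such as `MAX_LENGTH` and `expensive_operation`. For a version that runs as-is, here is a sketch of the same five steps applied to a slug. `Outcome` is a minimal stand-in for `Maybe` so the example needs only the standard library, and `parse_slug_sketch` and `MAX_SLUG_LENGTH` are hypothetical names, not valid8r's.

```python
import re
from dataclasses import dataclass
from typing import Optional

MAX_SLUG_LENGTH = 255  # hypothetical limit chosen for this sketch
_SLUG_PATTERN = re.compile(r'^[a-z0-9-]+$')


@dataclass(frozen=True)
class Outcome:
    """Minimal stand-in for Maybe so the sketch runs without valid8r."""

    value: Optional[str] = None
    error: Optional[str] = None

    def is_failure(self) -> bool:
        return self.error is not None


def parse_slug_sketch(text: object) -> Outcome:
    # 1. Type check
    if not isinstance(text, str):
        return Outcome(error='Input must be a string')
    # 2. Empty check
    s = text.strip()
    if s == '':
        return Outcome(error='Input must not be empty')
    # 3. Length guard, before the regex runs
    if len(text) > MAX_SLUG_LENGTH:
        return Outcome(error=f'Input is too long (maximum {MAX_SLUG_LENGTH} characters)')
    # 4. Library check: skipped, this sketch only needs the standard library
    # 5. Expensive operation (here, a regex match)
    if not _SLUG_PATTERN.match(s):
        return Outcome(error='Invalid format: slug may contain only a-z, 0-9 and hyphens')
    return Outcome(value=s)


print(parse_slug_sketch('my-post-1').value)   # my-post-1
print(parse_slug_sketch('a' * 300).error)     # Input is too long (maximum 255 characters)
```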
Testing Requirements
Correctness Tests
Every parser must have tests verifying:
Valid inputs (success cases)
Invalid inputs (failure cases with correct error messages)
Edge cases (boundary conditions, empty, whitespace)
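Edge cases are easiest to keep complete when they are listed as data. The sketch below drives a hypothetical `validate` function through empty, whitespace-only, and length-boundary inputs; `validate` and the small `MAX_LENGTH` of 10 are assumptions made so the boundary is easy to see, and a real test would call the parser under test instead.

```python
from typing import Optional

MAX_LENGTH = 10  # assumed small limit for illustration


def validate(text: str) -> Optional[str]:
    """Hypothetical validator: return an error message, or None if text is acceptable."""
    if text.strip() == '':
        return 'Input must not be empty'
    if len(text) > MAX_LENGTH:
        return f'Input is too long (maximum {MAX_LENGTH} characters)'
    return None


# (input, expected error) pairs covering the edge cases listed above.
CASES = [
    ('', 'Input must not be empty'),
    ('   ', 'Input must not be empty'),
    ('a' * MAX_LENGTH, None),  # last accepted length
    ('a' * (MAX_LENGTH + 1), 'Input is too long (maximum 10 characters)'),  # first rejected length
]

for text, expected in CASES:
    assert validate(text) == expected, (text, expected)
print(f'{len(CASES)} edge cases passed')  # 4 edge cases passed
```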
Security Tests
Every parser must have DoS protection tests:
def it_rejects_excessively_long_input(self) -> None:
    """Reject extremely long input to prevent DoS attacks."""
    import time

    # Create malicious input exceeding maximum length
    malicious_input = 'a' * (MAX_LENGTH + 100)
    # Measure rejection time
    start = time.perf_counter()
    result = parse_function(malicious_input)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Verify correctness
    assert result.is_failure()
    assert 'too long' in result.error_or('').lower()
    # Verify performance (DoS protection)
    assert elapsed_ms < 10, f'Rejection took {elapsed_ms:.2f}ms, should be < 10ms'
Coverage Requirements
Line coverage: 100%
Branch coverage: 100%
Mutation coverage: Strongly recommended
Error Messages
User-Facing Messages
Error messages must be:
Clear and actionable: Tell the user what went wrong and how to fix it
Deterministic: Same input always produces same error
Length-limited: Avoid revealing the entire malicious input in error messages
Security-conscious: Don’t leak implementation details
Good:
return Maybe.failure('Email address is too long (maximum 254 characters)')
return Maybe.failure('Phone number must have 10 digits, got 8')
return Maybe.failure('Invalid format: phone number contains invalid characters')
Bad:
return Maybe.failure(f'Input too long: {text}') # Leaks malicious input
return Maybe.failure('regex match failed') # Implementation detail
return Maybe.failure('Error') # Not actionable
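When an error message genuinely benefits from echoing part of the input, one option is to include only a short, escaped preview. This is a sketch of that idea rather than a valid8r helper: `safe_preview` and its 20-character default are assumptions.

```python
def safe_preview(text: str, limit: int = 20) -> str:
    """Return an escaped preview of text, truncated to at most limit characters."""
    truncated = text[:limit]
    suffix = '...' if len(text) > limit else ''
    # repr() escapes control characters such as newlines, which helps
    # keep each log entry on a single line.
    return repr(truncated) + suffix


huge = 'a' * 1_000_000
message = f'Input is too long (maximum 254 characters), starts with {safe_preview(huge)}'
print(message)
# Input is too long (maximum 254 characters), starts with 'aaaaaaaaaaaaaaaaaaaa'...
```

The message length stays bounded no matter how large the input is.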
Docstring Documentation
Every parser must document:
Rules: What constitutes valid/invalid input
Maximum length: Document the limit in the docstring
Failure messages: List all possible error messages
Examples: Provide usage examples in doctests
def parse_example(text: str) -> Maybe[Example]:
    """Parse example with security constraints.

    Rules:
        - Maximum length: 254 characters
        - Must contain '@' symbol
        - Cannot start with whitespace

    Failure messages:
        - Input must be a string
        - Input must not be empty
        - Input is too long (maximum 254 characters)
        - Invalid format: missing '@' symbol

    Args:
        text: Input string to parse

    Returns:
        Maybe[Example]: Success with parsed value or Failure with error

    Examples:
        >>> parse_example('valid@input').is_success()
        True
        >>> parse_example('a' * 300).is_failure()
        True
    """
Common Patterns
Pattern 1: External Library Validation
When using external libraries for validation (e.g., email-validator):
def parse_with_library(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE calling library
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')
    try:
        result = external_library.validate(text)
        return Maybe.success(result)
    except LibraryError as e:
        return Maybe.failure(str(e))
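For an optional dependency, the length guard can be combined with an import-time availability flag. The sketch below assumes the email-validator package, which may or may not be installed; `email_error`, `MAX_EMAIL_LENGTH`, and `HAS_EMAIL_VALIDATOR` are hypothetical names, and the function returns an error string (or `None`) rather than a `Maybe`.

```python
from typing import Optional

MAX_EMAIL_LENGTH = 254

try:
    # email-validator is optional; the import fails cleanly if it is missing.
    from email_validator import EmailNotValidError, validate_email
    HAS_EMAIL_VALIDATOR = True
except ImportError:
    HAS_EMAIL_VALIDATOR = False


def email_error(text: str) -> Optional[str]:
    """Return an error message for text, or None if it passes validation."""
    # Length guard first: over-length input never reaches the library.
    if len(text) > MAX_EMAIL_LENGTH:
        return f'Input is too long (maximum {MAX_EMAIL_LENGTH} characters)'
    if not HAS_EMAIL_VALIDATOR:
        return 'Required library not installed'
    try:
        # Skip DNS lookups so validation stays local and fast.
        validate_email(text, check_deliverability=False)
    except EmailNotValidError as exc:
        return str(exc)
    return None


print(email_error('a' * 300))  # Input is too long (maximum 254 characters)
```

Because the guard runs before the availability check, over-length input gets the same fast rejection whether or not the library is present.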
Pattern 2: Regex Operations
When using regex for validation/parsing:
# Compile patterns at module level for performance
_PATTERN = re.compile(r'^[a-z0-9-]+$')
def parse_with_regex(text: str) -> Maybe[Result]:
    # ALWAYS check length BEFORE regex operations
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')
    # Now safe to use regex
    if not _PATTERN.match(text):
        return Maybe.failure('Invalid format')
    return Maybe.success(text)
Pattern 3: Multi-Stage Parsing
When parsing involves multiple stages:
def parse_complex(text: str) -> Maybe[Result]:
    # Check length ONCE at the beginning
    if len(text) > MAX_LENGTH:
        return Maybe.failure(f'Input is too long (maximum {MAX_LENGTH} characters)')
    # All subsequent operations are safe
    stage1 = _parse_stage1(text)  # No need to recheck length
    if stage1.is_failure():
        return stage1
    stage2 = _parse_stage2(stage1.value_or(''))
    return stage2
References
Security Standards
OWASP Top 10 2021: A04 - Insecure Design
CWE-400: Uncontrolled Resource Consumption
RFC 5321: SMTP (email address length limits)
Project References
Issue #134: ReDoS detection automation in CI/CD (this issue)
Issue #132: Comprehensive security audit of all parsers
Issue #131: Phone parser DoS vulnerability (fixed in v0.9.1)
Testing
Security Tests: /tests/security/test_redos_detection.py
Scanner Tests: /tests/security/test_check_regex_safety.py
ReDoS Scanner: /scripts/check_regex_safety.py
Checklist for New Parsers
When creating a new parser, verify:
Type check is first validation
Empty/None input is rejected
Length guard is present BEFORE expensive operations
Maximum length is documented in docstring
Error messages are clear and actionable
DoS protection test exists (< 10ms rejection)
ReDoS scanner passes for any regex patterns
Regex performance test exists if using patterns
Test coverage is 100% (line + branch)
Performance is measured and verified
Maintenance
This document should be updated when:
New parsers are added to the library
New security threats are identified
Performance thresholds change
RFC standards are updated
Best practices evolve
Last Updated: 2025-11-11 (Issue #134 ReDoS detection automation)