Normalization

While parsing various document types, Ocrolus encounters data in diverse formats. Validation as a Service (VaaS) manages these data types: it checks that each field adheres to its expected format and modifies values to match that standard. Commonly handled data types include amounts, numbers, phone numbers, email addresses, ZIP codes, Social Security numbers, and dates.

Responses include both the original and the normalized value, so you can choose between the raw and standardized formats.

Info: Normalization functions more efficiently in the Instant workflow than in the Complete workflow.

"drivers_license-General:state": {
  "value": "TEXAS",
  "normalized_value": "TX",
  "is_empty": false,
  "alias_used": null,
  "source_filename": "DRIVING-LICENSE.pdf",
  "confidence": 1
}
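
As a rough illustration of choosing between the two values, the sketch below parses a field shaped like the example above and prefers `normalized_value` when one is present. The helper name `pick_value` and its fallback behavior are assumptions made for this example and are not part of the Ocrolus API; the enclosing braces are added only to make the fragment valid JSON.

```python
import json

# Field payload copied from the example above, wrapped in braces
# so that it parses as a standalone JSON object.
RESPONSE_FRAGMENT = """
{
  "drivers_license-General:state": {
    "value": "TEXAS",
    "normalized_value": "TX",
    "is_empty": false,
    "alias_used": null,
    "source_filename": "DRIVING-LICENSE.pdf",
    "confidence": 1
  }
}
"""


def pick_value(field: dict, prefer_normalized: bool = True):
    """Hypothetical helper: return the normalized value if available,
    otherwise fall back to the raw value. Returns None for empty fields."""
    if field.get("is_empty"):
        return None
    normalized = field.get("normalized_value")
    if prefer_normalized and normalized is not None:
        return normalized
    return field.get("value")


fields = json.loads(RESPONSE_FRAGMENT)
state_field = fields["drivers_license-General:state"]

print(pick_value(state_field))                           # TX
print(pick_value(state_field, prefer_normalized=False))  # TEXAS
```

Falling back to the raw value keeps the caller working even when a field has no normalized form.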

Normalization rules

The following are essential normalization formats for key data types:

  • Amount: Currency symbol, commas (e.g., 1,000), decimal point, and two decimal places.
  • Multiple Choice: Selection from predefined dropdown values.
  • Number: Digits with optional decimals, supporting negatives.
  • Integer: Numeric values without decimals, including negatives and zero.
  • Phone Number: Formatted as (XXX) XXX-XXXX; optional support for international formats like +1 123-456-7890.
  • Email Address: Standard email format (e.g., user@example.com).
  • State: Two-letter abbreviation from a predefined dropdown list.
  • ZIP Code: Standard five-digit (XXXXX) or ZIP+4 format (XXXXX-XXXX) for the US; Canadian format A1A 1A1; British postal formats as needed.
  • Social Security Number: XXX-XX-XXXX, exactly nine digits.
  • SSN Last 4: Exactly four digits.
  • Routing Number: Exactly nine digits.
  • Comma Amount: Commas every three digits in the integer part; a mandatory decimal point; no commas in the fractional part.
  • Decimal Required: A decimal point followed by at least one digit is mandatory; integers without a decimal point are invalid.
  • Date (US): MM/DD/YYYY
  • Date (en-GB): DD/MM/YYYY
  • Percentage: Numeric value followed by %.
  • EIN: Formatted as XX-XXXXXXX; replaces "O" with "0" and removes whitespace.
  • TIN: Formatted as 9XX-XX-XXXX, nine digits starting with "9".
  • Alphanumeric: Letters and numbers without special characters or spaces.
  • Unix Timestamp: Integer count of seconds elapsed since 00:00:00 UTC on January 1, 1970.
  • Website: Standard URL formats, accommodating various structures such as http(s), paths, query parameters, and subdomains.
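
To make a few of the rules above concrete, here is a minimal Python sketch of how such normalizers could look. It is not the VaaS implementation. The function names, the choice to return `None` for values that cannot be normalized, the `$` currency symbol, and the accepted input date formats are all assumptions made for illustration.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation


def _digits(raw: str) -> str:
    """Keep only the digit characters of a string."""
    return re.sub(r"\D", "", raw)


def normalize_ein(raw: str):
    """EIN rule: replace "O" with "0", remove whitespace, format XX-XXXXXXX."""
    cleaned = re.sub(r"\s+", "", raw).replace("O", "0")
    digits = _digits(cleaned)  # assumption: an existing hyphen is dropped too
    return f"{digits[:2]}-{digits[2:]}" if len(digits) == 9 else None


def normalize_phone(raw: str):
    """Phone rule: format ten digits as (XXX) XXX-XXXX."""
    digits = _digits(raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: drop a leading +1 country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def normalize_ssn(raw: str):
    """SSN rule: exactly nine digits, formatted XXX-XX-XXXX."""
    digits = _digits(raw)
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def normalize_us_date(raw: str):
    """Date (US) rule: emit MM/DD/YYYY. Accepted inputs are an assumption."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def normalize_amount(raw: str, symbol: str = "$"):
    """Amount rule: currency symbol, comma grouping, two decimal places."""
    try:
        value = Decimal(raw.replace(symbol, "").replace(",", "").strip())
    except InvalidOperation:
        return None
    return f"{symbol}{value:,.2f}"


print(normalize_ein("12 3456O89"))
print(normalize_phone("+1 123-456-7890"))
print(normalize_ssn("123456789"))
print(normalize_us_date("2024-01-05"))
print(normalize_amount("1000"))
```

Returning `None` instead of raising lets a caller keep the original value alongside a missing normalized one, which mirrors the `value` and `normalized_value` pairing in the response.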