Skip to content

Contains Social Security Number

Use the Contains Social Security Number rule when a field is expected to store a valid Social Security Number (SSN).

What Is Contains Social Security Number?

Think of Contains Social Security Number as a data integrity guard for sensitive identity fields.

It checks whether a field actually contains a valid SSN, instead of missing, malformed, or incomplete values that may look correct but fail compliance or downstream usage.

In simple terms: It ensures SSN data is present, usable, and consistently formatted.

Add Contains Social Security Number Check

Use this check to validate that a specific field contains a valid SSN and to detect records where the SSN is missing or incorrectly stored.

What Does This Check Do?

The rule scans a single field and confirms:

  • An SSN value exists
  • The SSN follows a recognizable pattern (for example: XXX-XX-XXXX)
  • The value is not null, truncated, or malformed

If a record fails any of these conditions, it is flagged as an anomaly.

How Does Contains Social Security Number Work?

Step 1: Select the Field

Choose the field that is expected to contain an SSN (for example: C_SSN).

Step 2: Validate the Content

Qualytics verifies whether the field value matches a valid SSN format.

Step 3: Review the Results

  • Pass – Field contains a valid SSN
  • Fail – SSN is missing, malformed, or not detected

Why Should You Use This Check?

1. Prevent Compliance Gaps

Missing or invalid SSNs can lead to regulatory violations, audit failures, and processing issues.

2. Catch Issues Early

Instead of discovering problems during reporting or audits, this check flags them as soon as data is scanned.

3. Protect Sensitive Workflows

Payroll, identity verification, and compliance systems depend on accurate SSN data. This rule helps keep those workflows reliable.

Real-Life Example: Customer Identity Data Validation

The Situation

BrightMart is a retail company that maintains a central customer database containing identity information required for compliance and verification.

Each customer record is expected to include a valid Social Security Number (SSN).

This data is used for:

  • Customer identity verification
  • Regulatory and compliance reporting
  • Internal audits and analytics

The Problem They Had

As BrightMart’s customer base grew, the data team relied on manual reviews to verify SSN values.

During a routine compliance review, the team noticed that some customer records could not be validated, even though SSN data appeared to be present. This raised an important question:

Why are some records failing verification when they seem correct at first glance?

After investigating, the team discovered:

  • Some SSN values were missing or empty
  • Some values were incorrectly structured
  • Manually reviewing thousands of records took significant time
  • Human review made it easy to miss subtle issues

These problems had gone unnoticed for weeks, creating a serious compliance risk and delaying reporting.

The Solution: Contains Social Security Number

To eliminate manual checks, the team implemented the Contains Social Security Number rule type.

Instead of reviewing records one by one, the rule automatically scanned all customer records and validated whether each one actually contained a valid SSN.

What the Check Reviewed

After running the check, Qualytics produced the following Source Records view. Records that failed SSN validation were automatically highlighted, making them easy to identify.

C_SSN_JSON _CHECKSUM C_CUSTKEY
{ "ssn": "966-15-3666" } 00a7b86dc440f5799515d58dafdb5be 43661
{ "ssn": "938-35-4653" } 0097c6fd693f6b42207cf0cfa38f53d 93248
{ "ssn": "980-04-5020" } 009abea2080b1713b168bc998c52f0f 83639
{ "ssn": "918-58-8770" } 0076beb855ab8742a5b0ef7fc1442b 131926
{ "ssn": "933-20-6278" } 006b9646dcba176709eb7a322e61e2b7 21691
{ "ssn": "989-09-1445" } 00650747d878eeca4d468a880de38b2 11860
{ "ssn": "963-62-7104" } 0061f5779e53186143b077a516e5a0c4 137432
{ "ssn": "961-06-0718" } 004bd16c4a27e11bed1d9e9d13d94c9 126126
{ "ssn": "913-38-6794" } 004881ebe3826fc16cfd124968f2a5bb 12223
{ "ssn": "796-00-1624" } 001d507e80c4e4d2ce4aba05590f8313 80971

anomaly-result

What Contains Social Security Number Discovered

ALERT GENERATED:

SSN VALIDATION FAILED

  • Rule Applied: Contains Social Security Number
  • Anomalous Records: 50
  • Violation Rate: 0.335% of 14,941 records

Although the failure rate was small, each failed record represented a potential compliance issue.

🔍 Summary

  • Most customer records contained valid SSNs
  • A small but critical subset did not meet SSN requirements
  • The failed records were clearly marked in red, eliminating guesswork

The Outcome

Immediate Benefits

  • Manual SSN reviews were eliminated
  • Invalid records were identified instantly
  • Review time dropped from hours to minutes

Long-Term Benefits

  • Continuous monitoring of SSN data quality
  • Reduced audit and compliance risk
  • Higher confidence in customer identity data

Why This Matters

Just like missing orders can impact revenue, missing or invalid SSNs can impact compliance.

The Contains Social Security Number check acts as an automated safeguard—ensuring sensitive identity data is always complete, valid, and audit-ready, without relying on slow and error-prone manual reviews.

Field Scope

Single: The rule evaluates a single specified field.

Accepted Types

Type
String

General Properties

Name Supported
Filter
Allows the targeting of specific data based on conditions
Coverage Customization
Allows adjusting the percentage of records that must meet the rule's conditions

The filter allows you to define a subset of data upon which the rule will operate.

It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.

Examples

Direct Conditions

Simply specify the condition you want to be met.

Correct usage" collapsible="true
O_TOTALPRICE > 1000
C_MKTSEGMENT = 'BUILDING'
Incorrect usage" collapsible="true
WHERE O_TOTALPRICE > 1000
WHERE C_MKTSEGMENT = 'BUILDING'

Combining Conditions

Combine multiple conditions using logical operators like AND and OR.

Correct usage" collapsible="true
O_ORDERPRIORITY = '1-URGENT' AND O_ORDERSTATUS = 'O'
(L_SHIPDATE = '1998-09-02' OR L_RECEIPTDATE = '1998-09-01') AND L_RETURNFLAG = 'R'
Incorrect usage" collapsible="true
WHERE O_ORDERPRIORITY = '1-URGENT' AND O_ORDERSTATUS = 'O'
O_TOTALPRICE > 1000, O_ORDERSTATUS = 'O'

Utilizing Functions

Leverage Spark SQL functions to refine and enhance your conditions.

Correct usage" collapsible="true
RIGHT(
    O_ORDERPRIORITY,
    LENGTH(O_ORDERPRIORITY) - INSTR('-', O_ORDERPRIORITY)
) = 'URGENT'
LEVENSHTEIN(C_NAME, 'Supplier#000000001') < 7
Incorrect usage" collapsible="true
RIGHT(
    O_ORDERPRIORITY,
    LENGTH(O_ORDERPRIORITY) - CHARINDEX('-', O_ORDERPRIORITY)
) = 'URGENT'
EDITDISTANCE(C_NAME, 'Supplier#000000001') < 7

Using scan-time variables

To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}.

Correct usage" collapsible="true
O_ORDERSTATUS IN (
    SELECT DISTINCT O_ORDERSTATUS
    FROM {{ _qualytics_self }}
    WHERE O_TOTALPRICE > 1000
)
Incorrect usage" collapsible="true
O_ORDERSTATUS IN (
    SELECT DISTINCT O_ORDERSTATUS
    FROM ORDERS
    WHERE O_TOTALPRICE > 1000
)

While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.

Important Note on {{ _qualytics_self }}

The {{ _qualytics_self }} keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }} may not encompass all entries from the target container.

Anomaly Types

Type Supported
Record
Flag inconsistencies at the row level
Shape
Flag inconsistencies in the overall patterns and distributions of a field

Example

Objective: Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers.

Sample Data

C_CUSTKEY C_CONTACT_DETAILS
1 {"name": "John Doe", "ssn": "234567890"}
2 {"name": "Amy Lu", "ssn": "666-12-3456"}
3 {"name": "Jane Smith", "ssn": "429-14-2216"}
{
    "description": "Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers",
    "coverage": 1,
    "properties": {},
    "tags": [],
    "fields": ["C_CONTACT_DETAILS"],
    "additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
    "rule": "containsSocialSecurityNumber",
    "container_id": {container_id},
    "template_id": {template_id},
    "filter": "1=1"
}

Anomaly Explanation

In the sample data above, the entry with C_CUSTKEY 2 does not satisfy the rule because its C_CONTACT_DETAILS value does not contain the typical social security number format.

graph TD
A[Start] --> B[Retrieve C_CONTACT_DETAILS]
B --> C{Does C_CONTACT_DETAILS contain a valid SSN format?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query demonstrating the rule applied to example dataset(s).
select
    c_custkey,
    c_contact_details
from customer 
where
    not regexp_like(ssn, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')

Potential Violation Messages

Record Anomaly

The C_CONTACT_DETAILS value of {"name": "Amy Lu", "ssn": "666-12-3456"} does not contain a social security number.

Shape Anomaly

In C_CONTACT_DETAILS, 33.333% of 3 filtered records (1) do not contain social security numbers.