Scan Operation

The Scan Operation in Qualytics is performed on a datastore to enforce data quality checks for various data collections such as tables, views, and files. This operation has several key functions:

  • Record Anomalies: Identifies a single record (row) as anomalous and provides specific details regarding why it is considered anomalous. The simplest form of a record anomaly is a row that lacks an expected value for a field.

  • Shape Anomalies: Identifies structural issues within a dataset at the column or schema level. It highlights broader patterns or distributions that deviate from expected norms. If a dataset is expected to have certain fields and one or more fields are missing or contain inconsistent patterns, this would be flagged as a shape anomaly.

  • Anomaly Data Recording: All identified anomalies, along with related analytical data, are recorded in the associated Enrichment Datastore for further examination.
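
The difference between the two anomaly types can be illustrated with a small, self-contained sketch. It is not how Qualytics detects anomalies internally; the helper names `find_record_anomalies` and `find_missing_fields` and the sample rows are hypothetical and only show a row-level problem next to a schema-level one.

```python
from typing import Any

Row = dict[str, Any]


def find_record_anomalies(rows: list[Row], required_field: str) -> list[int]:
    """Return the indexes of rows whose required field is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(required_field) is None]


def find_missing_fields(rows: list[Row], expected_fields: set[str]) -> set[str]:
    """Return expected fields that appear in no row at all (a schema-level gap)."""
    seen: set[str] = set()
    for row in rows:
        seen.update(row.keys())
    return expected_fields - seen


# Hypothetical sample data: row 1 lacks an email, and no row has "created_at".
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

record_anomalies = find_record_anomalies(rows, "email")
shape_anomalies = find_missing_fields(rows, {"id", "email", "created_at"})
print(record_anomalies)  # [1]
print(shape_anomalies)   # {'created_at'}
```

The first result points at one specific row, which is what a record anomaly reports; the second describes the dataset as a whole, which is what a shape anomaly reports.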

Additionally, the Scan Operation offers flexible options, including the ability to:

  • Perform checks on incremental loads versus full loads.
  • Limit the number of records scanned.
  • Run scans on a selected list of tables or files.
  • Schedule scans for future execution.

Let's get started! 🚀

Step 1: Select a source datastore from the side menu on which you would like to perform the scan operation.

Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Scan to initiate the scan operation.

Note

The scan operation can be started only after the catalog and profile operations have completed.

Configuration

Step 1: Click on the Run button to initiate the scan operation.

Step 2: Select tables (in your JDBC datastore) or file patterns (in your DFS datastore) and tags you would like to be scanned.

1. All Tables/File Patterns

This option includes all tables or file patterns currently available for scanning in the datastore. It means that every table or file pattern recognized in your datastore will be subjected to the defined data quality checks. Use this when you want to perform a comprehensive scan covering all the available data without any exclusions.

2. Specific Tables/File Patterns

This option allows you to manually select the individual table(s) or file pattern(s) in your datastore to scan. Upon selecting this option, all the tables or file patterns associated with your datastore will be automatically populated allowing you to select the datasets you want to scan.

You can also search for the tables or file patterns you want to scan using the search bar. Use this option when you need to target particular datasets, or when you want to exclude certain files from the scan for focused analysis or testing purposes.

3. Tag

This option enables you to automatically scan the tables or file patterns associated with the selected tags. Tags can be predefined or created to categorize and manage tables and file patterns effectively.

Step 3: Click on the Next button to Configure Read Settings.

Step 4: Configure Read Settings, Starting Threshold (Optional), and the Record Limit.

1. Select the Read Strategy for your scan operation.

  • Incremental: This strategy is used to scan only the new or updated records since the last scan operation. On the initial run, a full scan is conducted unless a specific starting threshold is set. For subsequent scans, only the records that have changed since the last scan are processed. If tables or views do not have a defined incremental key, a full scan will be performed. Ideal for regular scans where only changes need to be tracked, saving time and computational resources.

  • Full: This strategy performs a comprehensive scan of all records within the specified data collections, regardless of any previous changes or scans. Every scan operation will include all records, ensuring a complete check each time. Suitable for periodic comprehensive checks or when incremental scanning is not feasible due to the nature of the data.

Warning

If any selected tables do not have an incremental identifier, a full scan will be performed for those tables.

Info

When running an Incremental Scan for the first time, Qualytics automatically performs a full scan, saving the incremental field for subsequent runs.

  • This ensures that the system establishes a baseline and captures all relevant data.

  • Once the initial full scan is completed, the system intelligently uses the saved incremental field to execute future Incremental Scans efficiently, focusing only on the new or updated data since the last scan.

  • This approach optimizes the scanning process while maintaining data quality and consistency.

2. Define the Starting Threshold (Optional): specify a minimum incremental identifier value to set a starting point for the scan.

  • Greater Than Time: This option applies only to tables with an incremental timestamp strategy. Users can specify a timestamp to scan records that were modified after this time.

  • Greater Than Batch: This option applies to tables with an incremental batch strategy. Users can set a batch value, ensuring that only records with a batch identifier greater than the specified value are scanned.

3. Define the Record Limit: the maximum number of records to be scanned per table after any initial filtering.
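
How the three read settings combine can be sketched in a few lines of Python. This is an illustration of the behavior described above, not Qualytics' implementation: the function `select_records` and its parameters are hypothetical, and the rule that a value saved by a previous scan takes precedence over the starting threshold is an assumption.

```python
from typing import Any, Optional

Row = dict[str, Any]


def select_records(
    records: list[Row],
    incremental_field: Optional[str],
    last_scanned_value: Optional[Any] = None,
    starting_threshold: Optional[Any] = None,
    record_limit: Optional[int] = None,
) -> list[Row]:
    """Pick the records a scan would read under the given read settings."""
    selected = records
    if incremental_field is not None:
        # Assumption: the value saved by a previous scan wins; on a first run
        # the optional starting threshold is used instead.
        lower_bound = (
            last_scanned_value if last_scanned_value is not None else starting_threshold
        )
        if lower_bound is not None:
            selected = [r for r in selected if r[incremental_field] > lower_bound]
    # Without an incremental field, every record stays a candidate (full scan).
    if record_limit is not None:
        selected = selected[:record_limit]  # limit applies after filtering
    return selected


# Hypothetical table with a batch identifier from 1 to 6.
records = [{"id": i, "batch": i} for i in range(1, 7)]

first_run = select_records(records, "batch")
with_threshold = select_records(records, "batch", starting_threshold=2)
next_run = select_records(records, "batch", last_scanned_value=4, record_limit=1)
no_key = select_records(records, None, last_scanned_value=4)

print(len(first_run), len(with_threshold), len(next_run), len(no_key))  # 6 4 1 6
```

The last call shows the fallback mentioned in the warning: without an incremental identifier, the saved value cannot be applied and all records are read.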

Step 5: Click on the Next button to Configure the Scan Settings.

Step 6: Configure the Scan Settings.

  1. Check Categories: Users can choose one or more check categories when initiating a scan. This allows for flexible selection based on the desired scope of the operation:

    • Metadata: Includes checks that define the expected properties of the table, such as volume. These checks belong to the Volumetric rule type.

    • Data Integrity: Includes checks that specify the expected values for the data stored in the table. These checks cover all rule types except Volumetric.

2. Anomaly Options: Choose how the scan handles anomalies that duplicate findings from previous scans, which keeps the anomaly record organized and free of redundant entries.

  • Archive Duplicate Anomalies: Automatically archive duplicate anomalies from previous scans that overlap with the current scan to enhance data management efficiency.

Step 7: Click on the Next button to Configure the Enrichment Settings.

Step 8: Configure the Enrichment Settings.

  1. Remediation Strategy: This strategy dictates how your source tables are replicated in your enrichment datastore:

    • None: This option does not replicate source tables. It only writes anomalies and associated source data to the enrichment datastore. This is useful when the primary goal is to track anomalies without duplicating the entire dataset.

    • Append: This option replicates source tables using an append-first strategy. It adds new records to the enrichment datastore, maintaining a history of all data changes over time. This approach is beneficial for auditing and historical analysis.

    • Overwrite: This option replicates source tables using an overwrite strategy, replacing existing data in the enrichment datastore with the latest data from the source. This method ensures the enrichment datastore always contains the most current data, which is useful for real-time analysis and reporting.

2. Source Record Limit: Sets a maximum limit on the number of records written to the enrichment datastore for each detected anomaly. This helps manage storage and processing requirements effectively.
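
The effect of each enrichment setting can be pictured with in-memory lists standing in for tables. This is a simplified sketch of the semantics described above, not the way Qualytics writes to an enrichment datastore; `replicate` and `cap_source_records` are hypothetical helpers, and plain concatenation is only an approximation of the append-first strategy.

```python
from typing import Any

Row = dict[str, Any]


def replicate(strategy: str, enrichment_table: list[Row], source_rows: list[Row]) -> list[Row]:
    """Return the replicated table contents after one scan for a given strategy."""
    if strategy == "none":
        return list(enrichment_table)          # source tables are not replicated
    if strategy == "append":
        return enrichment_table + source_rows  # history is kept, new rows are added
    if strategy == "overwrite":
        return list(source_rows)               # only the latest source data remains
    raise ValueError(f"unknown remediation strategy: {strategy}")


def cap_source_records(anomalous_rows: list[Row], source_record_limit: int) -> list[Row]:
    """Keep at most `source_record_limit` source rows for a single anomaly."""
    return anomalous_rows[:source_record_limit]


# Hypothetical data: one row replicated earlier, two rows in the latest source.
previous = [{"id": 1, "value": "old"}]
latest = [{"id": 1, "value": "new"}, {"id": 2, "value": "new"}]

sizes = {s: len(replicate(s, previous, latest)) for s in ("none", "append", "overwrite")}
print(sizes)  # {'none': 1, 'append': 3, 'overwrite': 2}

# An anomaly that touches five rows, stored with a Source Record Limit of 2.
stored = cap_source_records([{"id": i} for i in range(5)], source_record_limit=2)
print(len(stored))  # 2
```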

Run Instantly

Click on the Run Now button to perform the scan operation immediately.

Schedule

Step 1: Click on the Schedule button to configure the available schedule options for your scan operation.

Step 2: Set the scheduling preferences for the scan operation.

1. Hourly: This option allows you to schedule the scan to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the scan should start. Example: If set to Every 1 hour(s) on minute 0, the scan will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).

2. Daily: This option schedules the scan to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to Every 1 day(s) at 00:00 UTC, the scan will run every day at midnight UTC.

3. Weekly: This option schedules the scan to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the scan to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.

4. Monthly: This option schedules the scan to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. Example: If set to "On the 1st day of every 1 month(s), at 00:00 UTC", the scan will run on the first day of each month at midnight UTC.

5. Advanced: The advanced section allows users to set up more complex, custom schedules using Cron expressions. This option is particularly useful for defining precise times and intervals for scan operations.

Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:

  • Minute (0 - 59)
  • Hour (0 - 23)
  • Day of the month (1 - 31)
  • Month (1 - 12)
  • Day of the week (0 - 6) (Sunday to Saturday)

Each field can be defined using specific values, ranges, or special characters to create the desired schedule.

Example: The Cron expression 0 0 * * * schedules the scan operation to run at midnight (00:00) every day. Here is a breakdown of this expression:

  • 0 (Minute) - The task will run at the 0th minute.
  • 0 (Hour) - The task will run at the 0th hour (midnight).
  • * (Day of the month) - The task will run every day of the month.
  • * (Month) - The task will run every month.
  • * (Day of the week) - The task will run every day of the week.

Users can define other specific schedules by adjusting the Cron expression. For example:

  • 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
  • 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
  • 0 22 * * 6 - Runs at 10:00 PM every Saturday.
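
To check what a five-field expression matches before entering it, the fields can be expanded with a short script. The sketch below is a simplified parser written for this page, not the scheduler Qualytics uses; it handles only `*`, single values, ranges, comma-separated lists, and `/step` on `*` or a range, and the names `FIELDS`, `expand_field`, and `parse_cron` are hypothetical.

```python
# (name, lowest value, highest value) for each of the five fields, in order.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_field(token: str, low: int, high: int) -> list[int]:
    """Expand one cron field into the sorted list of values it matches."""
    values: set[int] = set()
    for part in token.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be positive in {part!r}")
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if not low <= start <= end <= high:
            raise ValueError(f"{part!r} is outside the range {low}-{high}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_cron(expression: str) -> dict[str, list[int]]:
    """Split a five-field cron expression and expand every field."""
    tokens = expression.split()
    if len(tokens) != len(FIELDS):
        raise ValueError("expected five space-separated fields")
    return {
        name: expand_field(token, low, high)
        for token, (name, low, high) in zip(tokens, FIELDS)
    }


weekdays_noon = parse_cron("0 12 * * 1-5")
print(weekdays_noon["hour"], weekdays_noon["day_of_week"])  # [12] [1, 2, 3, 4, 5]
```

An out-of-range value such as an hour of 24 raises `ValueError`, which makes the script usable as a quick sanity check for a custom schedule.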

To define a custom schedule, enter the appropriate Cron expression in the "Custom Cron Schedule (UTC)" field before specifying the schedule name. This gives you precise control over when the scan operation runs.

Step 3: Define the Schedule Name, which identifies the scheduled operation when it runs.

Step 4: Click on the Schedule button to schedule your scan operation.

Note

You will receive a notification when the scan operation is completed.

Advanced Options

The advanced use cases described below require options that are not yet exposed in the user interface but are available through the Qualytics API.

Runtime Variable Assignment

It is possible to reference a variable in a check definition (declared in double curly braces) and then assign that variable a value when a Scan operation is initiated. Variables are supported within any Spark SQL expression and are most commonly used in a check filter.

If a Scan is meant to assert a check with a variable, a value for that variable must be supplied as part of the Scan operation's check_variables property.

For example, a check might include the filter transaction_date == {{ checked_date }}, which is asserted against any records where transaction_date equals the value supplied when the Scan operation is initiated. In this case, the value is assigned by passing the following payload when calling /api/operations/run:

{
    "type": "scan",
    "datastore_id": 42,
    "container_names": ["my_container"],
    "incremental": true,
    "remediation": "none",
    "max_records_analyzed_per_partition": 0,
    "check_variables": {
        "checked_date": "2023-10-15"
    },
    "high_count_rollup_threshold": 10
}
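
The substitution itself can be pictured with a small sketch. It only illustrates the idea of resolving double-curly-brace variables from `check_variables`; it is not the Qualytics implementation. The function `render_expression` is hypothetical, and rendering each value as a quoted string literal is an assumption.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_expression(expression: str, check_variables: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder with its value as a quoted literal."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in check_variables:
            # Mirrors the rule above: a referenced variable must be supplied.
            raise KeyError(f"no value supplied for variable {name!r}")
        return f"'{check_variables[name]}'"

    return VARIABLE.sub(substitute, expression)


rendered = render_expression(
    "transaction_date == {{ checked_date }}",
    {"checked_date": "2023-10-15"},
)
print(rendered)  # transaction_date == '2023-10-15'
```

Raising an error for an unassigned variable reflects the requirement that every variable used by a check has a value in the Scan operation's `check_variables` property.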

Operations Insights

When the scan operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the scan operation to review the scan results.

Top Panel

1. Runs (Default View): Provides insights into the operations that have been performed.

2. Schedule: Provides insights into the scheduled operations.

3. Search: Search for any operation (including scans) by entering the operation ID.

4. Sort by: Organize the list of operations based on the Created Date or the Duration.

5. Filter: Narrow down the list of operations based on:

  • Operation Type
  • Operation Status
  • Table

Activity Heatmap

The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.

Tip

You can click on any of the squares from the Activity Heatmap to filter operations.

activity activity

Operation Detail

Running

This status indicates that the scan operation is still running at the moment and is yet to be completed. A scan operation having a running status reflects the following details and actions:

No. | Parameter | Interpretation
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan).
2 | Timestamp | Timestamp when the operation was started.
3 | Progress Bar | The progress of the operation.
4 | Triggered By | The author who triggered the operation.
5 | Schedule | Indicates whether the operation was scheduled or not.
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation.
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation.
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the running operation.
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering.
10 | Check Categories | Indicates which categories should be included in the scan (e.g., Metadata, Data Integrity).
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation.
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly.
13 | Results | View the details of the ongoing scan operation. This includes information on which tables are currently being scanned, the anomalies identified so far (if any), and other related data collected during the active scan.
14 | Abort | The Abort button enables you to stop the ongoing scan operation.
15 | Summary | The summary section provides an overview of the scan operation in progress. It includes:
  • Tables Requested: The total number of tables that were scheduled for scanning. Click on the adjacent magnifying glass icon to view the tables requested.
  • Tables Scanned: The number of tables that have been scanned so far. Click on the adjacent magnifying glass icon to view the tables scanned.
  • Partitions Scanned: The number of partitions scanned during the ongoing operation.
  • Records Scanned: The total number of records processed up to this point.
  • Anomalies Identified: The number of anomalies detected so far during the scan.

Aborted

This status indicates that the scan operation was manually stopped before it could be completed. A scan operation having an aborted status reflects the following details and actions:

No. | Parameter | Interpretation
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan).
2 | Timestamp | Timestamp when the operation was started.
3 | Progress Bar | The progress of the operation.
4 | Aborted By | The user who aborted the operation.
5 | Schedule | Indicates whether the operation was scheduled or not.
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation.
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation.
8 | Anomalies Identified | Provides a count of the number of anomalies detected before the operation was aborted.
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering.
10 | Check Categories | Indicates which categories should be included in the scan (e.g., Metadata, Data Integrity).
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation.
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly.
13 | Results | View the details of the scan operation that was aborted, including tables scanned and anomalies identified.
14 | Resume | Provides an option to continue the scan operation from where it left off.
15 | Rerun | The Rerun button allows you to start a new scan operation using the same settings as the aborted scan.
16 | Delete | Removes the record of the aborted scan operation from the system, permanently deleting scan results and anomalies.
17 | Summary | The summary section provides an overview of the scan operation up to the point it was aborted. It includes:
  • Tables Requested: The total number of tables that were scheduled for scanning. Click on the adjacent magnifying glass icon to view the tables requested.
  • Tables Scanned: The number of tables that were scanned before the operation was aborted. Click on the adjacent magnifying glass icon to view the tables scanned.
  • Partitions Scanned: The number of partitions scanned before the operation was aborted.
  • Records Scanned: The total number of records processed before the scan was stopped.
  • Anomalies Identified: The number of anomalies detected during the partial scan.

Warning

This status signals that the scan operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A scan operation having a completed with warning status reflects the following details and actions:

No. | Parameter | Interpretation
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan).
2 | Timestamp | Timestamp when the operation was started.
3 | Progress Bar | The progress of the operation.
4 | Triggered By | The author who triggered the operation.
5 | Schedule | Indicates whether the operation was scheduled or not.
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation.
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation.
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the operation.
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering.
10 | Check Categories | Indicates which categories should be included in the scan (e.g., Metadata, Data Integrity).
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation.
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly.
13 | Results | View the details of the scan operation that completed with warnings, including tables scanned and anomalies identified.
14 | Rerun | The Rerun button allows you to start a new scan operation using the same settings as the scan that completed with warnings.
15 | Delete | Removes the record of the scan operation from the system, permanently deleting scan results and anomalies.
16 | Summary | The summary section provides an overview of the scan operation, highlighting any warnings encountered. It includes:
  • Tables Requested: The total number of tables that were scheduled for scanning. Click on the adjacent magnifying glass icon to view the tables requested.
  • Tables Scanned: The number of tables that were scanned. Click on the adjacent magnifying glass icon to view the tables scanned.
  • Partitions Scanned: The number of partitions scanned during the operation, including any partitions that triggered warnings.
  • Records Scanned: The total number of records processed during the scan, along with any records that raised warnings.
  • Anomalies Identified: The number of anomalies detected during the scan.
17 | Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Scan Operation.

Success

This status confirms that the scan operation completed successfully without any issues. A scan operation having a success status reflects the following details and actions:

No. | Parameter | Interpretation
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan).
2 | Timestamp | Timestamp when the operation was started.
3 | Progress Bar | The progress of the operation.
4 | Triggered By | The author who triggered the operation.
5 | Schedule | Indicates whether the operation was scheduled or not.
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation.
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation.
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the operation.
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering.
10 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation.
11 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly.
12 | Results | View the details of the completed scan operation. This includes information on which tables were scanned, the anomalies identified (if any), and other relevant data collected during the scan.
13 | Rerun | The Rerun button allows you to start a new scan operation using the same settings as the successful scan.
14 | Delete | Removes the record of the scan operation from the system, permanently deleting scan results and anomalies.
15 | Summary | The summary section provides an overview of the scan operation upon successful completion. It includes:
  • Tables Requested: The total number of tables that were scheduled for scanning. Click on the adjacent magnifying glass icon to view the tables requested.
  • Tables Scanned: The number of tables that have been scanned successfully. Click on the adjacent magnifying glass icon to view the tables scanned.
  • Partitions Scanned: The number of partitions scanned.
  • Records Scanned: The total number of records processed.
  • Anomalies Identified: The number of anomalies detected.

Full View of Metrics in Operation Summary

Users can now hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Scanned field to display the full value.

records-scan-operation records-scan-operation

Post Operation Details

Step 1: Click on any of the successful Scan Operations from the list and hit the Results button.

Step 2: The Scan Results modal demonstrates the highlighted anomalies (if any) identified in your datastore with the following properties:

Ref. | Scan Properties | Description
1. | Table/File | The table or file where the anomaly is found.
2. | Field | The field(s) where the anomaly is present.
3. | Location | Fully qualified location of the anomaly.
4. | Rule | Inferred and authored checks that failed assertions.
5. | Description | Human-readable, auto-generated description of the anomaly.
6. | Status | The status of the anomaly: Active, Acknowledged, Resolved, or Invalid.
7. | Type | The type of anomaly (e.g., Record or Shape).
8. | Date time | The date and time when the anomaly was found.

API Payload Examples

This section provides payload examples for running, scheduling, and checking the status of scan operations. Replace the placeholder values with data specific to your setup.

Running a Scan operation

To run a scan operation, use the API payload example below and replace the placeholder values with your specific values.

Endpoint (Post):

/api/operations/run (post)

  • container_names: [] means that it will scan all containers.
  • max_records_analyzed_per_partition: null means that it will scan all records of all containers.
  • remediation: append replicates source containers using an append-first strategy.

{
    "type":"scan",
    "name":null,
    "datastore_id": datastore-id,
    "container_names":[],
    "remediation":"append",
    "incremental":false,
    "max_records_analyzed_per_partition":null,
    "enrichment_source_record_limit":10
}

  • container_names: ["table_name_1", "table_name_2"] means that it will scan only the tables table_name_1 and table_name_2.
  • max_records_analyzed_per_partition: 1000000 means that it will scan a maximum of 1 million records per partition.
  • remediation: overwrite replicates source containers using an overwrite strategy.

{
    "type":"scan",
    "name":null,
    "datastore_id":datastore-id,
    "container_names":[
      "table_name_1",
      "table_name_2"
    ],
    "remediation":"overwrite",
    "max_records_analyzed_per_partition":1000000,
    "enrichment_source_record_limit":10
}
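
The payloads above can also be assembled and posted from a script. The following is a minimal sketch using only the Python standard library. The base URL, the bearer-token `Authorization` header, and the helper names `build_scan_payload` and `build_run_request` are assumptions made for illustration; check your deployment's API documentation for the actual authentication scheme.

```python
import json
import urllib.request
from typing import Optional

# Remediation values taken from the strategies described on this page.
REMEDIATION_STRATEGIES = {"none", "append", "overwrite"}


def build_scan_payload(
    datastore_id: int,
    container_names: Optional[list[str]] = None,
    remediation: str = "none",
    incremental: bool = False,
    max_records_analyzed_per_partition: Optional[int] = None,
    enrichment_source_record_limit: int = 10,
) -> dict:
    """Build the JSON body for POST /api/operations/run."""
    if remediation not in REMEDIATION_STRATEGIES:
        raise ValueError(f"unknown remediation strategy: {remediation}")
    return {
        "type": "scan",
        "name": None,
        "datastore_id": datastore_id,
        "container_names": container_names or [],  # [] scans all containers
        "remediation": remediation,
        "incremental": incremental,
        "max_records_analyzed_per_partition": max_records_analyzed_per_partition,
        "enrichment_source_record_limit": enrichment_source_record_limit,
    }


def build_run_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request; the auth header is an assumption."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/operations/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


payload = build_scan_payload(
    datastore_id=42,
    container_names=["table_name_1", "table_name_2"],
    remediation="overwrite",
    max_records_analyzed_per_partition=1_000_000,
)
request = build_run_request("https://qualytics.example.com", "YOUR_TOKEN", payload)
print(request.get_method(), request.full_url)
# To actually submit the operation: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the body before an operation is triggered.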

Scheduling a Scan operation of all containers

To schedule a scan operation, use the API payload example below and replace the placeholder values with your specific values.

Endpoint (Post):

/api/operations/schedule (post)

This payload schedules a scan operation to run at 00:00 UTC on every second day of the month, as set by the crontab value 00 00 */2 * *.

{
    "type":"scan",
    "name":"My scheduled Scan operation",
    "datastore_id":"datastore-id",
    "container_names":[],
    "remediation": "overwrite",
    "incremental": false,
    "max_records_analyzed_per_partition":null,
    "enrichment_source_record_limit":10,
    "crontab":"00 00 */2 * *"
}
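
The crontab value can be derived from the same choices offered by the Hourly, Daily, Weekly, and Monthly options. The helpers below are a sketch of one plausible mapping onto standard five-field cron syntax; whether Qualytics generates exactly these strings for its presets is an assumption, and the function names are hypothetical.

```python
import json


def _check(value: int, low: int, high: int, label: str) -> int:
    """Validate that a schedule component is within its allowed range."""
    if not low <= value <= high:
        raise ValueError(f"{label} must be between {low} and {high}, got {value}")
    return value


def hourly(every_hours: int, minute: int) -> str:
    return f"{_check(minute, 0, 59, 'minute')} */{_check(every_hours, 1, 23, 'every_hours')} * * *"


def daily(every_days: int, hour: int, minute: int) -> str:
    return (
        f"{_check(minute, 0, 59, 'minute')} {_check(hour, 0, 23, 'hour')} "
        f"*/{_check(every_days, 1, 31, 'every_days')} * *"
    )


def weekly(days_of_week: list[int], hour: int, minute: int) -> str:
    if not days_of_week:
        raise ValueError("select at least one day of the week")
    days = ",".join(str(_check(d, 0, 6, "day_of_week")) for d in sorted(set(days_of_week)))
    return f"{_check(minute, 0, 59, 'minute')} {_check(hour, 0, 23, 'hour')} * * {days}"


def monthly(day_of_month: int, every_months: int, hour: int, minute: int) -> str:
    return (
        f"{_check(minute, 0, 59, 'minute')} {_check(hour, 0, 23, 'hour')} "
        f"{_check(day_of_month, 1, 31, 'day_of_month')} */{_check(every_months, 1, 12, 'every_months')} *"
    )


schedule_payload = {
    "type": "scan",
    "name": "My scheduled Scan operation",
    "datastore_id": 42,  # placeholder value
    "container_names": [],
    "remediation": "overwrite",
    "incremental": False,
    "max_records_analyzed_per_partition": None,
    "enrichment_source_record_limit": 10,
    "crontab": daily(every_days=2, hour=0, minute=0),
}
print(schedule_payload["crontab"])  # 0 0 */2 * *
body = json.dumps(schedule_payload)  # ready to POST to /api/operations/schedule
```

Note that a step such as `*/2` in the day-of-month field restarts at the first day of each month, so it is not a strict two-day interval across month boundaries.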

Retrieving Scan Operation Information

Endpoint (Get)

/api/operations/{id} (get)

{
    "items": [
        {
            "id": 12345,
            "created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
            "type": "scan",
            "start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
            "end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
            "result": "success",
            "message": null,
            "triggered_by": "user@example.com",
            "datastore": {
                "id": 101,
                "name": "Datastore-Sample",
                "store_type": "jdbc",
                "type": "db_type",
                "enrich_only": false,
                "enrich_container_prefix": "data_prefix",
                "favorite": false
            },
            "schedule": null,
            "incremental": false,
            "remediation": "none",
            "max_records_analyzed_per_partition": -1,
            "greater_than_time": null,
            "greater_than_batch": null,
            "high_count_rollup_threshold": 10,
            "enrichment_source_record_limit": 10,
            "status": {
                "total_containers": 2,
                "containers_analyzed": 2,
                "partitions_scanned": 2,
                "records_processed": 28,
                "anomalies_identified": 2
            },
            "containers": [
                {
                "id": 234,
                "name": "Container1",
                "container_type": "table",
                "table_type": "table"
                },
                {
                "id": 235,
                "name": "Container2",
                "container_type": "table",
                "table_type": "table"
                }
            ],
            "container_scans": [
                {
                "id": 456,
                "created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "container": {
                    "id": 235,
                    "name": "Container2",
                    "container_type": "table",
                    "table_type": "table"
                },
                "start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "records_processed": 8,
                "anomaly_count": 1,
                "result": "success",
                "message": null
                },
                {
                "id": 457,
                "created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "container": {
                    "id": 234,
                    "name": "Container1",
                    "container_type": "table",
                    "table_type": "table"
                },
                "start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
                "records_processed": 20,
                "anomaly_count": 1,
                "result": "success",
                "message": null
                }
            ],
            "tags": []
        }
    ],
    "total": 1,
    "page": 1,
    "size": 50,
    "pages": 1
}
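
A response of this shape can be reduced to the figures shown in the operation summary. The sketch below works on a trimmed copy of the sample response above; `summarize_operation` is a hypothetical helper, and it relies only on the field names that appear in that sample.

```python
def summarize_operation(operation: dict) -> dict:
    """Condense one operation item into overall and per-container figures."""
    status = operation["status"]
    per_container = {
        scan["container"]["name"]: {
            "records_processed": scan["records_processed"],
            "anomaly_count": scan["anomaly_count"],
            "result": scan["result"],
        }
        for scan in operation["container_scans"]
    }
    return {
        "result": operation["result"],
        "records_processed": status["records_processed"],
        "anomalies_identified": status["anomalies_identified"],
        "all_containers_analyzed": status["containers_analyzed"] == status["total_containers"],
        "per_container": per_container,
    }


# Trimmed copy of the sample response shown above.
response = {
    "items": [
        {
            "id": 12345,
            "type": "scan",
            "result": "success",
            "status": {
                "total_containers": 2,
                "containers_analyzed": 2,
                "partitions_scanned": 2,
                "records_processed": 28,
                "anomalies_identified": 2,
            },
            "container_scans": [
                {"container": {"name": "Container2"}, "records_processed": 8,
                 "anomaly_count": 1, "result": "success"},
                {"container": {"name": "Container1"}, "records_processed": 20,
                 "anomaly_count": 1, "result": "success"},
            ],
        }
    ]
}

summary = summarize_operation(response["items"][0])
print(summary["records_processed"], summary["anomalies_identified"])  # 28 2
print(summary["per_container"]["Container1"]["records_processed"])    # 20
```

In the sample, the per-container counts (8 and 20 records, one anomaly each) add up to the totals reported under `status`.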