User Guide
Description | User Guide for the Qualytics data quality platform |
Author(s) | Qualytics Team |
Repository | https://github.com/Qualytics/userguide |
Copyright | Copyright © 2024 Qualytics |
User Guide: Introduction to Qualytics
Qualytics is the Active Data Quality Platform that enables teams to manage data quality at scale through advanced automation. Qualytics analyzes your historic data for its shapes and patterns in order to infer contextual data quality rules that are then asserted against new data (often in incremental loads) to identify anomalies. When an anomaly is identified, Qualytics provides your team with everything needed to take corrective actions using their existing data tooling & preferred monitoring solutions.
Managing Data Quality
With Qualytics, your data teams can quickly address data issues in a proactive way by automating the discovery and maintenance of data quality measures you need.
Here's how it works:
- Analyzing Historical Data: Qualytics examines your historical data to understand its patterns and characteristics, allowing it to create rules that define good data quality.
- Finding Anomalies: These rules, combined with any rules you create yourself, are used to identify any abnormalities or inconsistencies in your historical data or new data (even when new data is added incrementally).
- Taking Corrective Actions: When an anomaly is detected, Qualytics helps your team take appropriate actions. Utilizing tags, it can send notifications through the platforms you use (such as Teams, Slack, or PagerDuty), trigger workflows in tools (like Airflow, Fivetran or Airbyte), provide additional information about the anomaly to your chosen datastore (compatible with SQL-based integrations like dbt), and even suggest the best course of action through its user interface and API.
- Continuous Monitoring and Improvement: Qualytics continuously monitors and scores your data quality. It keeps your quality checks up to date, taking into account any changes in your actual data and your business needs. This ongoing process helps improve your overall data quality and boosts trust and confidence in your organization's data.
By leveraging Qualytics, you can efficiently manage data quality, proactively address issues, and enhance trust in the data driving your organization.
Key Features
Qualytics offers a range of powerful features designed to enhance your data quality management:
- Automated Data Profiling: Qualytics leverages your existing data to automatically generate profiles for each of your data assets. These profiles provide valuable insights into your data and serve as the foundation for maintaining data quality.
- Rule Inference: Crafting and maintaining data quality rules at scale can be a daunting task. Qualytics simplifies this process by automatically inferring appropriate data quality rules based on your data profiles. This saves you time and effort while ensuring accurate anomaly detection.
- Anomaly Detection: Identifying anomalies within your data is crucial for maintaining data quality. Qualytics excels in detecting anomalies at rest and in flight throughout your data ecosystem. By highlighting outliers and irregularities, it helps you identify and address data quality issues effectively.
- Anomaly Remediation: Once anomalies are detected, Qualytics provides the necessary tools to take corrective actions. It enables you to seamlessly integrate with your preferred data tooling and initiate remediation workflows. This ensures that data outliers are addressed promptly and efficiently.
- Freshness Monitoring: Qualytics includes functionality for monitoring data freshness Service Level Agreements (SLAs). It allows you to define and track SLAs for the timeliness of data updates, ensuring that your data remains up-to-date and meets the required service level agreements.
- Insights Dashboard: Qualytics provides an intuitive executive dashboard called Insights. This dashboard gives you a holistic view of the health and quality of your data. You can easily visualize key data quality metrics, track progress, and gain actionable insights. With the executive dashboard, you can make informed decisions and drive data-driven strategies for your organization.
Seamless Integration and Deployment
Qualytics offers flexible integration options to fit your data infrastructure seamlessly:
- Deployment Options: Whether you prefer an on-premise, single-tenant cloud, or SaaS deployment, Qualytics adapts to your specific needs. It meets you where your data resides, ensuring a hassle-free integration process.
- Support for Modern & Legacy Data Stacks: Qualytics seamlessly integrates with a wide range of data platforms. From modern solutions like Snowflake and Amazon S3 to legacy systems like Oracle and MSSQL, Qualytics supports your data stack. This versatility ensures that data quality remains a priority across all your data sources.
Demo
Here is a short video demonstrating the platform with a quick walkthrough:
Embarking on Your Journey
This user guide will walk you through the key functionalities of Qualytics and provide step-by-step instructions to help you make the most of this powerful platform. Whether you are new to Qualytics or looking to deepen your understanding, this guide will be your companion in optimizing your data quality management.
Let's embark on this journey to empower your organization with accurate, reliable, and trustworthy data using Qualytics!
Getting Started ↵
Onboarding
Qualytics is a comprehensive data quality management solution, designed to help enterprises proactively manage their full data quality lifecycle at scale through automated profiling, contextual data quality checks, quality rule inference, anomaly detection, remediation, tailored notifications, and more.
This comprehensive document is designed to help enterprises get started with Qualytics, ensuring a smooth and efficient onboarding process.
Let’s get started 🚀
Onboarding Process
Qualytics onboarding begins with understanding your enterprise's requirements. Based on your data records, it offers a tailored approach to smoothly onboard you to the platform.
1. Screening & Criteria Gathering
Schedule a demo with us to help our team understand your enterprise data. During this session, the Qualytics team will create the plan to identify key success criteria and tailor the deployment to meet your specific needs, exploring relevant use cases.
2. User Invitations
Once the deployment setup is complete, Qualytics sends invitations to the provided email addresses. These invitations include instructions for accessing the platform and assigning admin or member roles based on your preferences. Admins have full access to configure and manage the platform, while members have access according to the permissions set by the admins.
Deployment Options
Qualytics offers flexible deployment options designed to fit your data infrastructure requirements seamlessly.
1. SaaS Deployment (Default)
The Software as a Service (SaaS) deployment is a fully managed service hosted by Qualytics. This option provides ease of use and minimal maintenance, allowing your team to focus on data quality management without worrying about infrastructure upkeep. SaaS deployment offers rapid scalability and seamless updates, ensuring you always have access to the latest features and improvements.
2. On-Premise Deployment
This option is ideal for organizations that prefer to keep their data within their own data centers. By deploying Qualytics on-premise, you maintain complete control over your data and its security, ensuring compliance with internal policies and regulations.
Tip
This deployment option is recommended for customers with sensitive data.
Frequently Asked Questions (FAQs)
Q 1: What type of support is provided during a POC?
A 1: A dedicated Customer Success Manager, with mandatory weekly check-ins.
Q 2: What are the deployment options for POC?
A 2: Qualytics offers deployment options for Proof of Concept (POC) primarily as a Software as a Service (SaaS) solution.
Q 3: What type of data should we use for a POC?
A 3: In most cases, potential customers use their actual data during a POC. This provides the best representation of a live instance of Qualytics. Some customers use cleaned data to remove PII or sample test data.
Q 4: Are there limitations to data size for POC?
A 4: There are no limitations to data size for a Proof of Concept (POC).
Q 5: What type of support is provided during the Onboarding process?
A 5: A dedicated Customer Success Manager, with mandatory weekly check-ins.
Q 6: What types of data stacks does Qualytics support?
A 6: Qualytics supports two types of data stacks: modern solutions and legacy systems.
- Modern Solutions: Qualytics supports modern data platforms to ensure robust data quality management across infrastructures. This includes solutions like Snowflake, Amazon S3, BigQuery, etc.
- Legacy Systems: Qualytics also integrates with legacy systems to maintain high standards of data quality across all data sources. This includes reliable and scalable relational database management systems such as MySQL, Microsoft SQL Server, etc.
To integrate these data stacks, refer to the quick start guide.
Q 7: What types of database technology can you connect in Qualytics?
A 7: Qualytics supports connecting to any Apache Spark-compatible datastore, including relational databases (RDBMS) and raw file formats such as CSV, XLSX, JSON, Avro, and Parquet.
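Because any Apache Spark-compatible source qualifies, a generic PySpark snippet gives a feel for what reading these raw file formats looks like. This is not Qualytics code; the bucket and file paths below are placeholders, and reading from object storage requires the appropriate connector libraries on the Spark classpath.

```python
# Generic PySpark reads of two of the file formats listed above.
# Paths are placeholders; this is illustrative, not Qualytics internals.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-file-example").getOrCreate()

csv_df = spark.read.option("header", True).csv("s3a://example-bucket/landing/orders.csv")
parquet_df = spark.read.parquet("s3a://example-bucket/landing/orders.parquet")

csv_df.printSchema()
parquet_df.printSchema()
```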
Q 8: What is an enrichment datastore?
A 8: An Enrichment Datastore is a user-managed storage location where the Qualytics platform records and accesses metadata through a set of system-defined tables. It is purpose-built to capture metadata generated by the platform's profiling and scanning operations.
Q 9: Can I download my metadata and data quality checks?
A 9: Yes, Qualytics's metadata export feature is specifically designed to capture the mutable states of various data entities. This functionality enables the export of Quality Checks, Field Profiles, and Anomalies metadata from selected profiles into a designated enrichment datastore.
Q 10: How is the Quality Score calculated?
A 10: Quality Scores are measures of data quality calculated at the field, container, and datastore level and recorded as a time series enabling you to track movement over time. A quality score ranges from 0-100, with higher scores indicating higher quality.
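The exact scoring model is internal to Qualytics, but as a rough mental model, field-level scores can be imagined rolling up into container and datastore scores via a weighted average, where weights reflect importance. The sketch below is purely illustrative; the scores and weights are made up and this is not the actual Qualytics formula.

```python
# Illustrative only: a simplified weighted roll-up of field scores (0-100)
# into a container score. Not the actual Qualytics scoring formula.
def weighted_score(scores_and_weights):
    """Aggregate (score, weight) pairs into a single 0-100 score."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scores_and_weights) / total_weight

# Hypothetical field-level scores for one container, weighted by importance.
field_scores = [(98.0, 1), (75.0, 3), (100.0, 1)]
print(f"Container quality score: {weighted_score(field_scores):.1f}")
```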
Q 11: What is a catalog operation?
A 11: The Catalog Operation is run on a datastore to import the named collections (e.g., tables, views and files) of data available within it. The operation will also attempt to automatically identify the best way to support:
- Incremental scanning
- Data partitioning
- Record identification
Q 12: What is a profiling operation?
A 12: A Profile Operation will analyze every available record in all available containers in a datastore. Full Profiles provide the benefit of generating metadata with 100% fidelity at the cost of maximum compute time.
Q 13: What is a scan operation?
A 13: The Scan Operation is executed on a datastore to assert the data quality checks defined for the named collections of data (e.g., tables, views and files) within it. The operation will:
- Produce a record anomaly for any record where anomalous values are detected.
- Produce a shape anomaly for anomalous values that span multiple records.
- Record the anomaly data along with related analysis in the associated Enrichment Datastore.
Quick Start Guide
This guide is designed to help you quickly get started with using Qualytics. From onboarding to the platform and signing in to configuring your first datastore and performing essential operations, this quick start guide will walk you through every step of the process to ensure you can effectively utilize the powerful features of Qualytics.
Let's get started 🚀
Onboarding
The Qualytics onboarding process ensures a smooth setup and deployment tailored to your needs. This streamlined process ensures that your environment is set up according to your specifications, facilitating a quick and efficient start with Qualytics.
1. Screening and Criteria Gathering
Qualytics conducts a screening to determine if sample data (e.g. customer records), should be included and gathers the primary customer success criteria for the new environment, such as exploring specific use cases.
2. Deployment URL Creation
Based on the domain name (DNS) information you provide, Qualytics creates and provides a customized deployment URL.
3. Cloud Provider and Region Selection
The Qualytics team will ask for your preferred cloud providers and deployment region to finalize the deployment and go-live process.
4. User Invitations
After deployment, Qualytics sends invitations to the provided email addresses, assigning admin or member roles based on your preferences.
Signing In
After onboarding to Qualytics, you will receive login credentials to access the Qualytics dashboard.
Method 1: Using Sign-in Credentials
This method outlines an approach for customers who have not yet integrated their Identity Provider (IdP), thereby not benefiting from Single Sign-On (SSO). Typically, this approach is used during a trial period or Proof of Concept (POC). Once the customer transitions to a paid plan, they generally move to an SSO configuration for enhanced security and convenience.
For instance, the step involving the invitation link (as explained in the onboarding process above) is predominantly associated with this method (Using Sign-in Credentials), which relies on standard email and password credentials.
This allows users to access the system without the need for an integrated IdP during the initial trial phase. This approach is intended to provide ease of access and evaluate the platform's capabilities before committing to full integration with their IdP for SSO.
Once the customer completes onboarding through the invitation link sent to their email, credentials are produced that can be used to sign in to your Qualytics account and access the dashboard.
Method 2: Qualytics SSO
With SSO (Single Sign-On), you can access Qualytics more quickly and conveniently without having to go through separate authentication processes for each session.
Most customers will have their own SSO integration. Typically, the login screen will display two buttons:
- Qualytics SSO: Intended for use by Qualytics employees to provide support to customers.
- Customer SSO: Used by the organization's users, leveraging their own SSO for seamless access.
Datastore
Adding a datastore in Qualytics builds a symbolic link to the location where your data is stored, such as a database or file system. During operations like cataloging, profiling, and scanning, Qualytics reads data from this source location you connect to the platform.
Additionally, Qualytics supports an "Enrichment Datastore," used solely for writing metadata. Even though Qualytics writes to this location, it is still managed by the user, ensuring full control over their data.
When the user is authenticated, the Qualytics onboarding screen appears, where you can click and add your datastore to the Qualytics platform.
Configuring Datastores
Qualytics allows you to configure the datastore according to where your data is stored. These datastores are categorized into two types based on their characteristics:
1. JDBC Datastores
JDBC datastores are relational databases that support connectivity through the JDBC API, providing universal data access and integration with relational databases.
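For orientation, the snippet below shows what generic JDBC connectivity looks like from Apache Spark, which is the kind of access a JDBC datastore provides. It is not Qualytics configuration; the URL, credentials, and table name are placeholders, and the relevant JDBC driver must be available on the Spark classpath.

```python
# A generic PySpark JDBC read, shown only to illustrate JDBC connectivity.
# Hostname, credentials, and table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/analytics")  # JDBC URL
    .option("dbtable", "public.orders")                                # table (container) to read
    .option("user", "readonly_user")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")                         # driver jar must be on the classpath
    .load()
)
df.printSchema()
```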
Here is a list of available JDBC datastores you can add and configure in Qualytics:
REF | ADD DATASTORE | DESCRIPTION |
---|---|---|
1. | BigQuery | A fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. |
2. | Databricks | A unified analytics platform that stores your data to accelerate innovation by unifying data science, engineering, and business. |
3. | DB2 | It is an IBM database known for its scalability, performance, and availability, primarily used for large enterprises. |
4. | Hive | A data warehouse infrastructure built on top of Hadoop for providing data query and analysis. |
5. | MariaDB | An open-source relational database management system, a fork of MySQL, renowned for its performance and reliability. |
6. | Microsoft SQL Server | A relational database management system developed by Microsoft, offering a wide range of data tools and services. |
7. | MySQL | An open-source relational database management system widely used for web applications and various other uses. |
8. | Oracle | A multi-model database management system widely used for running online transaction processing and data warehousing. |
9. | PostgreSQL | An advanced, open-source relational database known for its robustness, extensibility, and standards compliance. |
10. | Presto | A distributed SQL query engine for big data, allowing users to run interactive analytic queries against data sources. |
11. | Amazon RedShift | A fully managed data warehouse service in the cloud, designed to handle large-scale data sets and analytics. |
12. | Snowflake | A cloud-based data warehousing solution that provides data storage, processing, and analytics. |
13. | Synapse | An analytics service that brings together big data and data warehousing. |
14. | Timescale DB | A relational database for time-series data, built on PostgreSQL. |
15. | Trino | A distributed SQL query engine for big data, designed to query large data sets across multiple data sources. |
2. DFS (Distributed File Systems) Datastores
A distributed file system datastore manages files and directories across different servers, designed for scalability and high availability in Qualytics.
Here is a list of available DFS datastores you can configure in the Qualytics platform:
REF | ADD DATASTORE | DESCRIPTION |
---|---|---|
1. | Amazon S3 | A scalable object storage service from Amazon Web Services, used for storing and retrieving any amount of data. |
2. | Azure Blob Storage | A Microsoft Azure service for storing large amounts of unstructured data, such as text or binary data. |
3. | Azure DataLake Storage | A scalable data storage and analytics service from Microsoft Azure designed for big data analytics. |
4. | Google Cloud Storage | A scalable, fully managed object storage service for unstructured data in Google Cloud. |
5. | Qualytics File System (QFS) | A custom file system designed by Qualytics for optimized data storage and retrieval within the platform. |
Operations
When you configure and add your datastore in the Qualytics platform, you will be redirected to the data assets section where you can perform data operations to analyze metadata, gather statistics, and create profiles. These operations help identify data fitness and anomalies to improve data quality through feedback loops. The operations are categorized as follows.
1. Catalog Operation
The Catalog operation involves systematically collecting data structures along with their corresponding metadata. This process also includes a thorough analysis of the existing metadata within the datastore. This ensures a solid foundation for the subsequent Profile and Scan operations.
2. Profile Operation
The Profile operation enables training of the collected data structures and their associated metadata values. This is crucial for gathering comprehensive aggregating statistics on the selected data, providing deeper insights, and preparing the data for quality assessment.
3. Scan Operation
The Scan operation asserts rigorous quality checks to identify any anomalies within the data. This step ensures data integrity and reliability by recording the analyzed data in your configured enrichment datastore, facilitating continuous data quality improvement.
Checks & Rules
Checks and rules are essential components for maintaining data quality in Qualytics. A check encapsulates a data quality rule along with additional contexts such as tags, filters, and tolerances. Rules define the criteria that data must meet, and checks enforce these rules to ensure data integrity.
In Qualytics, you will come across two types of checks:
1. Inferred Checks
Qualytics automatically generates inferred checks during a Profile operation. These checks typically cover 80-90% of the rules needed by users. They are created and maintained through profiling, which involves statistical analysis and machine learning methods.
2. Authored Checks
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
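To give a sense of what an authored rule can express, the sketch below evaluates a hypothetical Spark SQL predicate against a small DataFrame and surfaces the rows that violate it. The rule text and column names are made up for illustration; in practice, checks are authored through the Qualytics UI or API rather than in a script like this.

```python
# A sketch of the kind of Spark SQL predicate an authored check might express.
# The rule and data are hypothetical; checks are authored in the Qualytics UI or API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("check-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, 120.0, "USD"), (2, -5.0, "USD"), (3, 80.0, None)],
    ["order_id", "amount", "currency"],
)

# Example rule: amount must be positive and currency must be populated.
rule = "amount > 0 AND currency IS NOT NULL"

anomalies = orders.filter(~F.expr(rule))  # rows that violate the rule
anomalies.show()
```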
Explore
The Explore dashboard helps manage and analyze your data effectively. It includes several sections, each offering specific functionalities:
1. Insights
The Insights section provides an overview of anomaly detection and comprehensive data monitoring options. You can fine-tune the view by source datastores, tags, and dates. You can also check the profile data, applied checks, quality scores, and records scanned for your connected source datastores.
2. Activity
The Activity section offers an in-depth look at activities across source datastores. It includes a heatmap to visualize the daily volume of operations, along with any detected anomalies.
3. Profiles
The Profiles section provides a unified view of all containers under one roof including:
- Tables
- Views
- Computed Tables
- Computed Files
- Fields
It also offers search, sort, and filter functionality to help you efficiently find what you need.
4. Checks
The Checks section provides an overview of all applied checks, including both inferred and authored checks, across all source datastores. This allows you to monitor and manage the rules ensuring your data quality.
5. Anomalies
The Anomalies section gives an overview of all detected anomalies across your source datastores. This helps in quickly identifying and addressing any issues.
Library
The library dashboard offers various options for managing check templates and editing applied checks in your configured source datastores. It includes the following functionalities:
1. Add Check Templates
Easily add new check templates to manage and apply standardized checks across different source datastores efficiently.
2. Export Check Templates
The export feature is an operation that writes metadata to a specified Enrichment datastore. In the case of “Check Templates”, Qualytics will write check template metadata to the selected Enrichment datastore from the dropdown list.
Settings
This section allows you to manage global configurations. You can configure various settings, as explained below:
1. Tags
Tags allow users to categorize and organize entities effectively, while also providing the ability to assign weights for prioritization. They can drive notifications and downstream workflows, and users can configure tags, associate notifications with tags, and associate tags with specific properties.
2. Notifications
You can set up notifications for when an operation is completed (e.g., catalog, profile, or scan) or when anomalies are identified.
3. Connection
Delete, edit, or add new datastore sources, ensuring efficient management of your configured datastores.
4. Integration
Configure necessary parameters to integrate external tools with the Qualytics dashboard.
5. Security
Delete, edit, or add new teams, and assign roles to users for better access control and management.
6. Tokens
Create tokens to enable secure and direct interaction with the Qualytics API.
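As a rough illustration of how a token is used, the snippet below sends an authenticated request to a deployment's API with Python's requests library. The endpoint path and the bearer-token header are assumptions made for the example; consult your deployment's API documentation (https://<your-deployment>.qualytics.io/api/docs) for the actual routes and authentication details.

```python
# A minimal sketch of calling the Qualytics API with a token created under
# Settings -> Tokens. The route and bearer-token auth shown here are assumptions;
# see your deployment's /api/docs for the real API.
import requests

BASE_URL = "https://acme.qualytics.io/api"   # your deployment's URL
TOKEN = "YOUR_API_TOKEN"

response = requests.get(
    f"{BASE_URL}/datastores",                # hypothetical example route
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```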
7. Health
This page provides an easy way to monitor the health of the Qualytics deployment while also providing the option to restart the Analytics engine.
Technical Quick Start Guide
Accessing your Qualytics Deployment
Each Qualytics deployment is a single-tenant, dedicated cloud instance, configured per requirements discussed with your organization. Therefore, your deployment will be accessible from a custom URL specific to your organization.
For example, ACME's Qualytics deployment might be published at https://acme.qualytics.io, and the corresponding API documentation would be available at https://acme.qualytics.io/api/docs. Note that your specific credentials and URL will be provided to you automatically by email.
Tip
Please check your spam folder if you don’t see the invite.
After you've obtained access to your deployment, you'll want to:
- Connect a Datastore
- Initiate a profiling on the source datastore by running a Profile Operation. This step will automatically infer a set of data quality checks from your data.
- Assert those checks to detect data anomalies
Connecting a Datastore
The first step of configuring a Qualytics instance is to Add Source Datastore. In order to add a Source Datastore via Qualytics, you need to select the specific Connector. This is necessary so that the appropriate form for collecting connection details can be rendered.
As you provide the required connection details, the UI verifies network connectivity and indicates whether the combination is accessible. This feature assists in diagnosing any network routing restrictions.
While configuring the connection, you'll also come across an option to automatically trigger an asynchronous Catalog operation upon successful Datastore creation.
Once the connection details are confirmed, the "Add Datastore" process moves to a second optional but strongly recommended step: the configuration of an Enrichment Datastore.
The Enrichment Datastore serves as a location to record enrichment data (anomalies and metadata for a Source Datastore). This is a crucial step as it significantly enhances Qualytics's ability to detect anomalies.
Warning
While it is optional, not setting an Enrichment Datastore may limit Qualytics's features.
In this step, you'll have two options:
- Configure a new Enrichment Datastore
- Use the dropdown list to select an existing Enrichment Datastore that was previously configured
The process of configuring a new Enrichment Datastore is similar to that of a Source Datastore, with one key difference: the connection details you provide must have the ability to write enrichment data.
Note
If you don't have a specific location to store these results, you can request the QFS (Qualytics File System) connector provided by Qualytics for this purpose.
During the Source Datastore and Enrichment Datastore configuration steps, you'll find an option to Test Connection. This initiates a synchronous operation that verifies whether the indicated Datastore can be appropriately accessed from the Compute Daemon:
- If the operation is successful, you can proceed with the configuration. Any issues during this Test Connection process will result in an error message being displayed on the current step of the form, be it the Source Datastore or Enrichment Datastore step.
Note
If any future operation fails to establish a connection with the Datastore, the UI will provide warnings to guide you in resolving the connectivity issues.
Generate a Profile
The majority of a data scientist's work revolves around upfront curation of data, which involves taking steps to determine which type of ML modeling might be beneficial for the given data. Our Data Compute Daemon begins this process much like a data scientist initiating a new modeling effort. It profiles customer data through systematic computational analysis, executed in a fully automated and scalable manner.
We track a wide range of metadata in addition to standard field metadata, which includes:
- `type`
- `min` / `max`
- `min_length` / `max_length`
- `completeness` / `sparsity`
- `histograms` (ratios of values)
- Our Compute Daemon also calculates more sophisticated statistical measures such as `skewness`, `kurtosis`, and `pearson correlation coefficients`.
While these numerical analysis techniques do not strictly fall under Machine Learning, their role in statistical analysis and preparatory work is indispensable. They facilitate ML by generating appropriate and statistically relevant data quality rules at scale.
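For readers who want to connect these terms to something concrete, the toy example below computes the same kinds of statistics with pandas and SciPy. It is only an illustration of the measures named above, not the Compute Daemon's implementation.

```python
# Toy illustration of the profiling statistics mentioned above,
# computed with pandas/SciPy (not the Qualytics Compute Daemon).
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "amount":   [10.0, 12.5, 11.0, 13.2, 250.0],   # contains one extreme value
    "quantity": [1, 2, 1, 2, 40],
})

print("min/max:      ", df["amount"].min(), "/", df["amount"].max())
print("completeness: ", df["amount"].notna().mean())        # fraction of non-null values
print("skewness:     ", stats.skew(df["amount"]))
print("kurtosis:     ", stats.kurtosis(df["amount"]))
print("pearson r:    ", stats.pearsonr(df["amount"], df["quantity"])[0])
```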
One crucial part of profiling is identifying column data types, which is typically a tedious task in large tables. A machine learning model trained to infer the data types and properties can help accelerate this task by automatically identifying key phrases and linking them to commonly associated attributes.
Upon completing the initial profiling and metadata generation, the Qualytics Inference Engine carries out ML via various learning methods such as inductive and unsupervised learning. The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints. The inferred constraints are then filtered by testing them against the held-out testing set & only those that assert true are converted to data quality Checks.
Two concrete examples of sophisticated rule types automatically inferred at this stage are:
- the application of a robust normality test: this is applied to each numeric field to discover whether certain types of anomaly checks are applicable & bases its quality check recommendations upon that learning.
- the generation of linear regression models: this is automatically generated to fit any highly correlated fields in the same table. If a good fit model is identified, it's recorded as a predicting model for those correlated fields and used to identify future anomalies.
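The sketch below approximates these two ideas on toy data with SciPy and scikit-learn: a normality test gates whether normality-based anomaly checks make sense, and a regression fit between correlated fields flags records with large residuals. The thresholds and models are illustrative assumptions, not the Inference Engine's actual logic.

```python
# Illustrative approximations of the two inferred rule types described above.
# Thresholds and models are assumptions, not the Qualytics Inference Engine.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 1. Robust-normality idea: only recommend normality-based checks if the field
#    actually looks normally distributed.
field = rng.normal(loc=100, scale=5, size=500)
_, p_value = stats.normaltest(field)
print("normality-based checks applicable?", p_value > 0.05)

# 2. Linear-regression idea: fit highly correlated fields and treat large
#    residuals on future records as potential anomalies.
x = rng.uniform(0, 100, size=500).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=2.0, size=500)
model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
threshold = 3 * residuals.std()
print("r^2:", round(model.score(x, y), 3))
print("rows flagged:", int((np.abs(residuals) > threshold).sum()))
```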
Now that you have a deeper understanding of how our profiling operation works, you're ready to take action. To initiate a Profile Operation, navigate to the details of the specific source datastore you've created. There, you'll find a step to start the Profile Operation.
Initiating and Reviewing a Scan for Anomalies
After the initial Profile Operation is complete, you can start a Scan Operation. By default, Qualytics initiates a Full Scan for the first operation. This comprehensive scan establishes a baseline for generating Quality Scores and facilitates the validation of all defined checks.
As the Scan Operation progresses, you can monitor its status in real-time. If you choose, you can set up in-app notifications to alert you when the operation is complete, whether you're currently signed in or you log back in later.
Upon completion of the Scan operation, you can review the following data points:
- `Start time` and `Finish Time` of the operation
- `Total counts` for all scanned Containers, including:
    - Records scanned
    - Anomalies detected
    - Total Records
Info
Any issues, such as the failure to scan any Container of the source datastore, will be indicated, along with suggestions on how to address the issue, such as assigning an identifier or setting a record limit.
Dashboard
Upon signing in to Qualytics, users are greeted with a thoughtfully designed dashboard that offers intuitive navigation and quick access to essential features and datasets, ensuring an efficient and comprehensive data quality management experience.
In this documentation, we will explore every component of the Qualytics dashboard.
Let’s get started 🚀
Global Search
The Global Search feature in Qualytics is designed to streamline the process of finding crucial assets such as Datastores, Containers, and Fields. This enhancement provides quick and precise search results, significantly improving navigation and user interaction. By entering keywords in the search bar located at the top of the dashboard, users can efficiently locate specific data elements, facilitating better data management and access. This functionality is especially useful for large datasets, ensuring users can swiftly find the information they need without navigating through multiple layers of the interface.
Tip
Press the shortcut key: Ctrl+K for quick access to Global Search
In-App Notifications
In-app notifications in Qualytics are real-time alerts that keep users informed about various events related to their data operations and quality checks. These notifications are displayed within the Qualytics interface and cover a range of activities, including operation completions and anomaly detections.
Discover
The Discover option in Qualytics features a dropdown menu that provides access to various resources and tools to help users navigate and utilize the platform effectively. The menu includes the following options:
Resources:
- User Guide: Opens the comprehensive user guide for Qualytics, which provides detailed instructions and information on how to use the platform effectively.
- SparkSQL: Directs users to resources or documentation related to using SparkSQL within the Qualytics platform, aiding in advanced data querying and analysis.
API:
- Docs: Opens the API documentation, offering detailed information on how to interact programmatically with the Qualytics platform. This is essential for developers looking to integrate Qualytics with other systems or automate tasks.
- Playground: Provides access to an interactive environment where users can test and experiment with API calls. This feature is particularly useful for developers who want to understand how the API works and try out different queries before implementing them in their applications.
Support:
- Qualytics Helpdesk: Qualytics Helpdesk provides users with access to a support environment where they can get assistance with any issues or questions related to the platform.
Theme
Qualytics offers both dark mode and light mode to enhance user experience and cater to different preferences and environments.
Light Mode:
- This is the default visual theme of Qualytics, featuring a light background with dark text.
- It provides a clean and bright interface, which is ideal for use in well-lit environments.
- To switch from dark mode to light mode, click the Light Mode button.
Dark Mode:
- Dark mode features a dark background with light text, reducing eye strain and glare, especially in low-light environments.
- It is designed to be easier on the eyes during prolonged usage and can help save battery life on devices.
- To activate dark mode, click the Dark Mode button.
View Mode
In Qualytics, users have the option to switch between two display modes: List View and Card View. These modes are available on the Source Datastore page, Enrichment Datastore page, and Library page, allowing users to choose their preferred method of displaying information.
- List View: List View arranges items in a linear, vertical list format. This mode focuses on providing detailed information in a compact and organized manner. To activate List View, click the "List View" button (represented by an icon with three horizontal lines) located at the top of the page.
- Card View: Card View displays items as individual cards arranged in a grid. Each card typically includes a summary of the most important information about the item. To switch to Card View, click the "Card View" button (represented by an icon with a grid of squares) located at the top of the page.
User Profile
The user profile section in Qualytics provides essential information and settings related to the user's account. Here's an explanation of each element:
- Name: Displays the user's email address used as the account identifier.
- Role: Indicates the user's role within the Qualytics platform (e.g., Admin), which defines their level of access and permissions.
- Teams: Shows the teams to which the user belongs (e.g., Public), helping organize users and manage permissions based on group membership.
- Preview Features: A toggle switch that enables or disables preview features. When turned on, it adds an AI Readiness Benchmark for the Quality Score specifically on the Explore page.
- Logout: A button that logs the user out of their Qualytics account, ending the current session and returning them to the login page.
- Version: Displays the current version of the Qualytics platform being used, which is helpful for troubleshooting and ensuring compatibility with other tools and features.
Navigation Menu (Left Sidebar)
The left sidebar of the dashboard contains the primary navigation menu, which allows users to quickly access various functionalities of the Qualytics platform. The menu items include:
Source Datastores (Default View)
Lists all the source datastores connected to Qualytics in the left sidebar. Also provides the option to:
- Add a new source datastore
- Search from existing source datastores
- Sort existing datastores based on the name, records, checks, etc.
- Filter source datastores
Enrichment Datastores
Lists all the enrichment datastores connected to Qualytics in the left sidebar. Also provides the option to:
- Add an enrichment datastore
- Search from existing enrichment datastores
- Sort existing datastores based on the name, records, checks, etc.
Explore
The Explore dashboard in Qualytics enables effective data management and analysis through several key sections:
- Insights: Offers an overview of anomaly detection and data monitoring, allowing users to filter by source datastores, tags, and dates. It displays profile data, applied checks, quality scores, records scanned, and more. Moreover, you can also export the insight reports into a PDF format.
- Activity: Provides a detailed view of operations (catalog, profile, and scan) across source datastores with a heatmap to visualize daily activities and detected anomalies.
- Profiles: Unifies all containers, including tables, views, computed tables, computed files, and fields, with search, sort, and filter functionalities.
- Observability: Observability gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric.
- Checks: Shows all applied checks, both inferred and authored, across source datastores to monitor and manage data quality rules.
- Anomalies: Lists all detected anomalies across source datastores for quick identification and resolution of issues.
Library
The library dashboard allows for managing check templates and editing applied checks in source datastores with two main functionalities:
- Add Check Templates: Easily add new templates to apply standardized checks across datastores.
- Export Check Templates: Export template metadata to a specified Enrichment datastore.
Tip
You can also search, sort, and filter checks across the source datastores
Tags
Tags help users organize and prioritize data assets by categorizing them. They can be applied to Datastores, Profiles, Fields, Checks, and Anomalies, improving data management and workflows.
Notification Rules
Qualytics allows users to set up notification rules with specific triggers and channels, ensuring timely alerts for critical events. Notifications can be delivered through multiple channels, including in-app, email, Slack, Microsoft Teams, and more, helping users stay informed and manage data quality issues in real time.
Global Settings
Manage global configurations with the following options:
- Connection: Manage datastore sources (add, edit, delete).
- Integration: Configure parameters for integrating external tools.
- Security: Manage teams, roles, and user access.
- Tokens: Create tokens for secure API interactions.
- Health: Monitor and restart the Qualytics deployment.
Anomaly
Something that deviates from the standard, normal, or expected. This can be in the form of a single data point, record, or a batch of data
Accuracy
The data represents the real-world values they are expected to model.
Catalog Operation
Used to read fundamental metadata from a Datastore required for the proper functioning of subsequent Operations such as Profile, Hash and Scan
Comparison
An evaluation to determine if the structure and content of the source and target Datastores match
Comparison Runs
An action to perform a comparison
Completeness
Required fields are fully populated.
Conformity
Alignment of the content to the required standards, schemas, and formats.
Connectors
Components that can be easily connected to and used to integrate with other applications and databases. Common uses include sending and receiving data.
Info
We can connect to any Apache Spark accessible datastore. If you have a datastore we don’t yet support, talk to us! We currently support: Files (CSV, JSON, XLSX, Parquet) on Object Storage (S3, Azure Blob, GCS); ETL/ELT Providers (Fivetran, Stitch, Airbyte, Matillion – and any of their connectors!); Data Warehouses (BigQuery, Snowflake, Redshift); Data Pipelining (Airflow, DBT, Prefect), Databases (MySQL, PostgreSQL, MSSQL, SQLite, etc.) and any other JDBC source
Consistency
The value is the same across all datastores within the organization.
Container (of a Datastore)
The uniquely named abstractions within a Datastore that hold data adhering to a known schema. The Containers within an RDBMS are tables, the containers in a filesystem are well-formatted files, etc.
Data-at-rest
Data that is stored in a database, warehouse, file system, data lake, or other datastore.
Data Drift
Changes in a data set’s properties or characteristics over time.
Data-in-flight
Data that is on the move, transporting from one location to another, such as through a message queue, API, or other pipeline
Data Lake
A centralized repository that allows you to store all your structured and unstructured data at any scale.
Data Quality
Ensuring data is free from errors, including duplicates, inaccuracies, inappropriate fields, irrelevant data, missing elements, non-conforming data, and poor data entry.
Data Quality Check
aka "Check" is an expression regarding the values of a Container that can be evaluated to determine whether the actual values are expected or not.
Datastore
Where data is persisted in a database, file system, or other connected retrieval systems. You can check more in Datastore Overview.
Data Warehouse
A system that aggregates data from different sources into a single, central, consistent datastore to support data analysis, data mining, artificial intelligence (AI), and machine learning.
Distinctness (of a Field)
The fraction of distinct values (appear at least once) to total values that appear in a Field
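A quick worked example of this definition (illustrative values only):

```python
# Distinctness as defined above: distinct values divided by total values.
values = ["a", "b", "a", "c", "b", "a"]
distinctness = len(set(values)) / len(values)
print(distinctness)  # 3 distinct values / 6 total = 0.5
```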
Enrichment Datastore
Additional properties that are added to a data set to enhance its meaning. Qualytics enrichment includes whether a record is anomalous, what caused it to be an anomaly, what characteristics it was expected to have, and flags that allow other systems to act upon the data.
Favorite
Users can mark instances of an abstraction (Field, Container, Datastore, Check, Anomaly, etc..) as a personalized favorite to ensure it ranks higher in default ordering and is prioritized in other personalized views & workflows.
Compute Daemon
An application that protects a system from contamination due to inputs, reducing the likelihood of contamination from an outside source. The Compute Daemon will quarantine data that is problematic, allowing the user to act upon quarantined items.
Incremental Identifier
A Field that can be used to group the records in the Table Container into distinct ordered Qualytics Partitions in support of incremental operations upon those partitions:
- a whole number - then all records with the same partition_id value are considered part of the same partition
- a float or timestamp - then all records between two defined values are considered part of the same partition (the defining values will be set by incremental scan/profile business logic)

Since Qualytics Partitions are required to support Incremental Operations, an Incremental Identifier is required for a Table Container to support incremental Operations.
Incremental Scan Operation
A Scan Operation where only new records (inserted since the last Scan Operation) are analyzed. The underlying Container must support determining which records are new for incremental scanning to be a valid option for it.
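Conceptually, an incremental scan only reads records past the previously processed high-water mark of the incremental identifier. The sketch below illustrates that idea with a hypothetical table and column; it is not how Qualytics composes its queries internally.

```python
# Sketch of the incremental idea: read only records newer than the last
# processed high-water mark. Table and column names are hypothetical.
last_high_water_mark = "2024-06-01 00:00:00"   # persisted after the previous scan

incremental_query = f"""
    SELECT *
    FROM sales.orders
    WHERE updated_at > '{last_high_water_mark}'   -- the incremental identifier
"""
print(incremental_query)
```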
Inference Engine
After Compute Daemon gathers all the metadata generated by a profiling operation, it feeds that metadata into our Inference Engine. The inference engine then initiates a "true machine learning" (specifically, this is referred to as Inductive Learning) process whereby the available customer data is partitioned into a training set and a testing set. The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints. Those inferred constraints are then filtered by testing them against the held out testing set & only those that assert true are converted to inferred data quality Checks.
Metadata
Data about other data, including descriptions and additional information.
Object Storage
A type of data storage used for handling large amounts of unstructured data managed as objects.
Operation
The asynchronous (often long-running) tasks that operate on Datastores are collectively referred to as "Operations." Examples include Catalog, Profile, Hash, and Scan.
Partition Identifier
A Field that can be used by Spark to group the records in a Dataframe into smaller sets that fit within our Spark worker’s memory. The ideal Partition Identifier is an Incremental Identifier of type datetime, since that can serve as both, but we identify alternatives should that not be available.
Pipeline
A workflow that processes and moves data between systems.
Precision
Your data is at the resolution that is expected: how tightly can you define your data?
Profile Operation
An operation that generates metadata describing the characteristics of your actual data values.
Profiling
The process of collecting statistics on the characteristics of a dataset involving examining, analyzing, and reviewing the data.
Proprietary Algorithms
A procedure utilizing a combination of processes, tools, or systems of interrelated connections that are the property of a business or individual in order to solve a problem.
Quality Score
A measure of data quality calculated at the Field, Container, and Datastore level. Quality Scores are recorded as time-series enabling you to track movement over time. You can read more in Quality Scoring.
Qualytics App
aka "App" this is the user interface for our Product delivered as a web application
Qualytics Deployment
A single instance of our product (the k8s cluster, postgres database, hub/app/compute daemon pods, etc…)
Qualytics Compute Daemon
aka "Compute Daemon" this is the layer of our Product that connects to Datastores and directly operates on users’ data.
Qualytics Implementation
A customer’s Deployment plus any associated integrations
Qualytics Surveillance Hub
aka "Hub" this is the layer of our Product that exposes an Application Programming Interface (API).
Qualytics Partition
The smallest grouping of records that can be incrementally processed. For DFS datastores, each file is a Qualytics Partition. For JDBC datastores, partitions are defined by each table’s incremental identifier values.
Record (of a Container)
A distinct set of values for all Fields defined for a Container (e.g. a row of a table)
Schema
The organization of data in a datastore. This could be the columns of a table, the header of a CSV file, the fields in a JSON file, or other structural constraints.
Schema Differences
Differences in the organization of information between two datastores that are supposed to hold the same content.
Source
The origin of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets extracted.
Tag
Users can assign Tags to Datastores, Profiles (Files, Tables, Containers), Checks, and Anomalies, add a Description, and assign a Weight. The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
Target
The destination of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets loaded.
Third-party data
Data acquired from a source outside of your company which may not be controlled by the same data quality processes. You may not have the same level of confidence in the data and it may not be as trustworthy as internally vetted datasets.
Timeliness
Timeliness can be calculated as the time between when information should be available and when it is actually available, focusing on whether data is available when it is expected.
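A minimal worked example of that calculation (the times are made up):

```python
# Timeliness as defined above: the gap between when data should be available
# and when it actually became available.
from datetime import datetime

expected_at = datetime(2024, 6, 1, 6, 0)   # data due at 06:00 per the SLA
arrived_at = datetime(2024, 6, 1, 7, 30)   # actual arrival time

print("delay:", arrived_at - expected_at)   # 1:30:00
print("met SLA?", arrived_at <= expected_at)
```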
Volumetrics
Data has the same size and shape across similar cycles. It includes statistics about the size of a data set including calculations or predictions on the rate of change over time.
Weight
The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
Add Datastores ↵
Datastores Overview
A `Datastore` can be any Apache Spark-compatible data source, such as:

- Traditional `RDBMS`.
- Raw files (`CSV`, `XLSX`, `JSON`, `Avro`, `Parquet`) on:
    - AWS S3.
    - Azure Blob Storage.
    - GCP Cloud Storage.

A `Datastore` is a medium holding structured data. Qualytics supports Spark-compatible Datastores via the conceptual layers depicted below.
Configuration
The first step of configuring a Qualytics instance is to add a source datastore:
- In the `main` menu, select the `Datastores` tab.
- Click on the `Add Source Datastore` button.
Info
A datastore can be any Apache Spark-compatible data source:
- traditional RDBMS,
- raw files (`CSV`, `XLSX`, `JSON`, `Avro`, `Parquet`, etc.) on:
    - `AWS S3`
    - `Azure Blob Storage`
    - `GCP Cloud Storage`
Credentials
Configuring a datastore will require you to enter configuration credentials dependent upon each datastore. Here is an example of a Snowflake datastore being added:
When a datastore is added, it’ll be populated in the home screen along with other datastores:
Clicking into a datastore will guide the user through the capabilities and operations of the platform.
When a user configures a datastore for the first time, they’ll see an empty Activity tab.
Heatmap view
Running a Catalog of the Datastore
The first operation of Catalog will automatically kick off. You can see this through the Activity tab.
- This operation typically takes a short amount of time to complete.
- After this is completed, they’ll need to run a Profile operation (under `Run` -> `Profile`) to generate metadata and infer data quality checks.
JDBC Datastores ↵
JDBC Datastore Overview
The JDBC Datastore in Qualytics facilitates seamless integration with relational databases using the Java Database Connectivity (JDBC) API. This allows users to connect, analyze, and profile data stored in various relational databases.
Supported Databases
Qualytics supports a range of relational databases, including but not limited to:
- BigQuery
- Databricks
- DB2
- Hive
- MariaDB
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Presto
- Amazon Redshift
- Snowflake
- Synapse
- Timescale DB
- Trino
- Athena
Connection Details
Users are required to provide specific connection details such as Host/Port or URI for the JDBC Datastore.
The connection details are used to establish a secure and reliable connection to the target database.
Catalog Operation
Upon successful verification, a Catalog operation can be initiated, providing metadata about the JDBC Datastore, including containers, field names, and record counts.
Field Types Inference
During the Catalog operation, Qualytics infers field types by weighted histogram analysis. This allows for automatic detection of data types within the JDBC Datastore, facilitating more accurate data profiling.
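As a rough intuition for what histogram-based type inference means, the toy sketch below lets sampled values "vote" for the types they parse as and tallies the result. This is a simplification made up for illustration; the actual Qualytics analysis is weighted and considerably more sophisticated.

```python
# Toy sketch of histogram-based type inference: sample values "vote" for the
# types they parse as. Illustrative only; not the Qualytics implementation.
from collections import Counter
from datetime import datetime

def candidate_types(value: str):
    types = []
    try:
        int(value)
        types.append("integer")
    except ValueError:
        pass
    try:
        float(value)
        types.append("fractional")
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        types.append("date")
    except ValueError:
        pass
    types.append("string")  # every value is at least a string
    return types

sample = ["42", "17", "3", "2024-01-05", "7"]
histogram = Counter(t for v in sample for t in candidate_types(v))
print(histogram)  # a real inference would prefer the most specific well-supported type
```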
Containers Overview
For a more detailed understanding of how Qualytics manages and interacts with containers in JDBC Datastores, please refer to the Containers section in our comprehensive user guide.
This section covers topics such as container deletion, field deletion, and the initial profile of a Datastore's containers.
Athena
Adding and configuring an Amazon Athena connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Athena as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Athena environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Athena is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Athena datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore. (The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Athena from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Athena connector from the dropdown list and add connection properties such as Secrets Management, host, port, username, and password, along with datastore properties like catalog, database, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
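The fields above describe a generic two-step Vault flow: authenticate to obtain a client token, then fetch the secret with that token. The sketch below approximates that flow with the requests and jsonpath-ng libraries; the URLs, payloads, and paths are placeholders, and this is not the platform's internal code.

```python
# Approximation of the retrieval flow the fields above describe. All values
# are placeholders; this is illustrative, not Qualytics internals.
import requests
from jsonpath_ng import parse  # pip install jsonpath-ng

login_url = "https://vault.example.com/v1/auth/approle/login"    # Login URL
credentials_payload = {"role_id": "YOUR_ROLE_ID", "secret_id": "YOUR_SECRET_ID"}  # Credentials Payload
token_jsonpath = "$.auth.client_token"                           # Token JSONPath
secret_url = "https://vault.example.com/v1/secret/data/athena"   # Secret URL
token_header_name = "X-Vault-Token"                              # Token Header Name
data_jsonpath = "$.data"                                         # Data JSONPath

login = requests.post(login_url, json=credentials_payload, timeout=30)
login.raise_for_status()
token = parse(token_jsonpath).find(login.json())[0].value

secret = requests.get(secret_url, headers={token_header_name: token}, timeout=30)
secret.raise_for_status()
data = parse(data_jsonpath).find(secret.json())[0].value
print(list(data))  # keys of the retrieved secret payload
```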
Step 2: The configuration form, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get the hostname from your Athena account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | S3 Output Location | Define the S3 bucket location where the output will be stored. This is specific to AWS Athena and specifies where query results are saved. |
6. | Catalog | Enter the catalog name. In AWS Athena, this refers to the data catalog that contains database and table metadata. |
7. | Database | Specify the database name. |
8. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Catalog, Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
Click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including anomalies and related metadata. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Athena connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Athena as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use Enrichment Datastore option is selected from the caret button, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Athena as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: The server address where the selected enrichment datastore is hosted. It is the endpoint used to connect to that environment.
- Database: Refers to the specific database within the selected enrichment datastore where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating an Athena datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "athena_catalog",
"schema": "athena_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"host": "athena_host",
"port": 443,
"username": "athena_user",
"password": "athena_password",
"parameters": { "output": "s3://<bucket_name>" },
"type": "athena"
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
BigQuery
Adding and configuring BigQuery connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding BigQuery as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their BigQuery environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let's get started 🚀
BigQuery Setup Guide
This guide explains how to create and use a temporary dataset with an expiration time in BigQuery. This dataset helps manage intermediate query results and temporary tables when using the Google BigQuery JDBC driver.
It is recommended for efficient data management, performance optimization, and automatic reduction of storage costs by deleting data when it is no longer needed.
Access the BigQuery Console
Step 1: Navigate to the BigQuery console within your Google Cloud Platform (GCP) account.
Step 2: Click on the vertical ellipsis; this opens a popup menu for creating a dataset. Click on Create dataset to set up a new dataset.
Step 3: Fill in the following fields to create a new dataset.
Info
- Dataset Location: Select the location that aligns with where your other datasets reside to minimize data transfer delays.
- Default Table Expiration: Set the expiration to 1 day to ensure any table created in this dataset is automatically deleted one day after its creation.
Step 4: Click the Create Dataset button to apply the configuration and create the dataset.
Step 5: Navigate to the created dataset and find the Dataset ID in the Dataset Info.
The Dataset Info section contains the Dataset ID and other information related to the created dataset. This generated Dataset ID is used to configure the BigQuery datastore.
BigQuery Roles and Permissions
This section explains the roles required for viewing, editing, and running jobs in BigQuery. To integrate BigQuery with Qualytics, you need specific roles and permissions.
Assigning these roles ensures Qualytics can perform data discovery, management, and analytics tasks efficiently while maintaining security and access control.
BigQuery Roles
- BigQuery Data Editor (roles/bigquery.dataEditor): Allows modification of data within BigQuery, including adding new tables and changing table schemas. It is suitable if you want to regularly update or insert data.
- BigQuery Data Viewer (roles/bigquery.dataViewer): Enables viewing datasets, tables, and their contents. It is essential if you need to read data structures and information.
- BigQuery Job User (roles/bigquery.jobUser): Allows creating and managing jobs in BigQuery, such as queries, data imports, and data exports. It is important if you want to run automated queries.
- BigQuery Read Session User (roles/bigquery.readSessionUser): Allows usage of the BigQuery Storage API for efficient retrieval of large data volumes. It provides capabilities to create and manage read sessions, facilitating large-scale data transfers.
Warning
If a temporary dataset already exists in BigQuery and users want to use it when creating a new datastore connection, the service account must have the bigquery.tables.create permission to perform the test connection and proceed to the datastore creation.
Datastore BigQuery Privileges
The following table outlines the privileges associated with BigQuery roles when configuring datastore connections in Qualytics:
Source Datastore Permissions (Read-Only)
Provides read access to view table data and metadata:
REF | READ-ONLY PERMISSIONS | DESCRIPTION |
---|---|---|
1. | roles/bigquery.dataViewer | Allows viewing of datasets, tables, and their data. |
2. | roles/bigquery.jobUser | Enables running of jobs such as queries and data loading. |
3. | roles/bigquery.readSessionUser | Facilitates the creation of read sessions for efficient data retrieval. |
Enrichment Datastore Permissions (Read-Write)
Grants read and write access for data editing and management:
REF | READ-WRITE PERMISSIONS | DESCRIPTION |
---|---|---|
1. | roles/bigquery.dataEditor | Provides editing permissions for table data and schemas. |
2. | roles/bigquery.dataViewer | Allows viewing of datasets, tables, and their data. |
3. | roles/bigquery.jobUser | Enables running of jobs such as queries and data loading. |
4. | roles/bigquery.readSessionUser | Facilitates the creation of read sessions for efficient data retrieval. |
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. BigQuery is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore. The specified name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select BigQuery from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the BigQuery connector from the dropdown list and add connection details such as temp dataset ID, service account key, project ID, and dataset ID.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Temp Dataset ID (Optional) | Enter a temporary Dataset ID for intermediate data storage during BigQuery operations. |
2. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing BigQuery. |
3. | Project ID (Required) | Enter the Project ID associated with BigQuery. |
4. | Dataset ID (Required) | Enter the Dataset ID (schema name) associated with BigQuery. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
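For reference, the Service Account Key file downloaded from Google Cloud is a JSON document shaped roughly as follows; the values are placeholders and the service account name shown is hypothetical.
{
  "type": "service_account",
  "project_id": "your_project_id",
  "private_key_id": "your_private_key_id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "qualytics-sa@your_project_id.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "token_uri": "https://oauth2.googleapis.com/token"
}
Upload this file as-is; the service account it identifies must hold the BigQuery roles listed earlier in this section.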
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Project ID, Dataset ID, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata for your selected datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Temp Dataset ID (Optional) | Enter a temporary Dataset ID for intermediate data storage during BigQuery operations. |
2. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing BigQuery. |
3. | Project ID (Required) | Enter the Project ID associated with BigQuery. |
4. | Dataset ID (Required) | Enter the Dataset ID (schema name) associated with BigQuery. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the "Use an existing enrichment datastore" option is selected from the dropdown menu, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the BigQuery instance is hosted. It is the endpoint used to connect to the BigQuery environment.
- Database: Refers to the specific database within the BigQuery environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a BigQuery datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
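The sketch below follows the payload shape used for the other connectors in this guide (name, teams, database, schema, enrich_only, trigger_catalog, connection). The BigQuery-specific connection fields shown inside parameters (project_id, dataset_id, temp_dataset_id, service_account_key) are illustrative assumptions only; confirm the exact field names against the Qualytics API reference.
{
  "name": "your_datastore_name",
  "teams": ["Public"],
  "database": "bigquery_project_id",
  "schema": "bigquery_dataset_id",
  "enrich_only": false,
  "trigger_catalog": true,
  "connection": {
    "name": "your_connection_name",
    "type": "bigquery",
    "parameters": {
      "project_id": "your_project_id",
      "dataset_id": "your_dataset_id",
      "temp_dataset_id": "your_temp_dataset_id",
      "service_account_key": "contents_of_service_account_key_json"
    }
  }
}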
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
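As with the source payload above, the following is a hedged sketch that mirrors the enrichment payloads of the other connectors in this guide (enrich_only set to true, no trigger_catalog); the BigQuery-specific connection fields are assumptions.
{
  "name": "your_enrichment_datastore_name",
  "teams": ["Public"],
  "database": "bigquery_project_id",
  "schema": "bigquery_enrichment_dataset_id",
  "enrich_only": true,
  "connection": {
    "name": "your_connection_name",
    "type": "bigquery",
    "parameters": {
      "project_id": "your_project_id",
      "dataset_id": "your_enrichment_dataset_id",
      "service_account_key": "contents_of_service_account_key_json"
    }
  }
}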
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Databricks
Adding and configuring Databricks connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Databricks as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Databricks environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let's get started 🚀
Databricks Setup Guide
This guide provides a comprehensive walkthrough for setting up Databricks. It highlights the distinction between SQL Warehouses and All-Purpose Compute, the functionality of node pools, and the enhancements they offer.
Additionally, it details the process for attaching compute resources to node pools and explains the minimum requirements for effective operation.
Understanding SQL Warehouses and All-Purpose Compute
SQL Warehouses (Serverless)
SQL Warehouses (Serverless) in Databricks utilize serverless SQL endpoints for running SQL queries.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Cost-effectiveness | Serverless SQL endpoints allow you to pay only for the queries you execute, without the need to provision or manage dedicated infrastructure, making it more cost-effective for ad-hoc or sporadic queries. |
2. | Scalability | Serverless architectures automatically scale resources based on demand, ensuring optimal performance for varying workloads. |
3. | Simplified Management | With serverless SQL endpoints, you don't need to manage clusters or infrastructure, reducing operational overhead. |
4. | Minimum Requirements | The minimum requirements for using SQL Warehouse with serverless typically include access to a Databricks workspace and appropriate permissions to create and run SQL queries. |
All-Purpose Compute
All-purpose compute in Databricks refers to clusters that are not optimized for specific tasks. While they offer flexibility, they may not provide the best performance or cost-effectiveness for certain workloads.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Slow Spin-up Time | All-purpose compute clusters may take longer to spin up compared to specialized clusters, resulting in delays before processing can begin. |
2. | Timeout Connections | Due to longer spin-up times, there's a risk of timeout connections, especially for applications or services that expect quick responses. |
Node Pool and Its Usage
A node pool in Databricks is a set of homogeneous virtual machines (VMs) within a cluster. It allows you to have a fixed set of instances dedicated to specific tasks, ensuring consistent performance and resource isolation.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Resource Isolation | Node pools provide resource isolation, allowing different workloads or applications to run without impacting each other's performance. |
2. | Optimized Performance | By dedicating specific nodes to particular tasks, you can optimize performance for those workloads. |
3. | Cost-effectiveness | Node pools can be more cost-effective than using all-purpose compute for certain workloads, as you can scale resources according to the specific requirements of each task. |
Improving All-Purpose Compute with Node Pools
To improve the performance of all-purpose compute using node pools, you can follow these steps:
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Define Workload-Specific Node Pools | Identify the specific tasks or workloads that require optimized performance and create dedicated node pools for them. |
2. | Specify Minimum Requirements | Determine the minimum resources (such as CPU, memory, and disk) required for each workload and configure the node pools accordingly. |
3. | Monitor and Adjust | Continuously monitor the performance of your node pools and adjust resource allocations as needed to ensure optimal performance. |
Step 1: Configure details for Qualytics Node Pool.
Step 2: Attach Compute details with the Node Pool.
Retrieve the Connection Details
This section explains how to retrieve the connection details that you need to connect to Databricks.
Credentials to Connect with Qualytics
To configure Databricks, you need the following credentials:
REF | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add HTTP Path (web address) to fetch data from your Databricks account. |
3. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
4. | Database (Required) | Specify the database name to be accessed. |
5. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
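To illustrate what these values typically look like, the fragment below shows representative placeholder formats for a Databricks workspace hostname, a SQL Warehouse HTTP path, and a personal access token. These are examples of value formats only, not the exact API field names (the API payload later in this section passes the token as password and the HTTP path under parameters.path); a cluster's HTTP path follows a different pattern, visible under JDBC/ODBC in Advanced Options.
{
  "host": "dbc-a1b2c3d4-e5f6.cloud.databricks.com",
  "http_path": "/sql/1.0/warehouses/abc123def456",
  "catalog": "your_catalog",
  "database": "your_database",
  "personal_access_token": "dapiXXXXXXXXXXXXXXXXXXXXXXXX"
}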
Get Connection Details for the SQL Warehouse
Follow the given steps to get the connection details for the SQL warehouse:
- Click on the SQL Warehouses in the sidebar.
- Choose a warehouse to connect to.
- Navigate to the Connection Details tab.
- Copy the connection details.
Get Connection Details for the Cluster
Follow the given steps to get the connection details for the cluster:
- Click on the Compute in the sidebar.
- Choose a cluster to connect to.
- Navigate to the Advanced Options.
- Click on the JDBC/ODBC tab.
- Copy the connection details.
Get the Access Token
Step 1: In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.
Note
Refer to the Databricks Official Docs to generate the Access Token.
Step 2: In the Settings page, select the Developer option in the User section.
Step 3: In the Developer page, click on Manage in Access Tokens.
Step 4: In the Access Tokens page, click on the Generate new token button.
Step 5: You will see a modal to add a description and validation time (in days) for the token.
Step 6: After adding the contents, click on Generate, and it will show the token.
Warning
Before closing the modal window by clicking on the Done button, ensure the Personal Access Token is saved to a secure location.
Step 7: You can see the new token on the Access Tokens page.
You can also revoke a token on the Access Tokens page by clicking on the Revoke token button.
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Databricks is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name. This name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Databricks from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Databricks connector from the dropdown list and add connection details such as Secrets Management, host, HTTP path, database, and personal access token.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1 | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2 | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3 | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4 | Secret URL | Enter the URL where the secret is stored in Vault. |
5 | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6 | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Host (Required) | Get the hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add the HTTP Path (web address) to fetch data from your Databricks account. |
3. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
4. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
5. | Database (Optional) | Specify the database name to be accessed. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Catalog, Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Host (Required) | Get the hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add the HTTP Path (web address) to fetch data from your Databricks account. |
3. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
4. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
5. | Database (Optional) | Specify the database name |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the Databricks instance is hosted. It is the endpoint used to connect to the Databricks environment.
- Database: Refers to the specific database within the Databricks environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Databricks datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "databricks_database",
"schema": "databricks_catalog",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "databricks",
"host": "databricks_host",
"password": "databricks_token",
"parameters": {
"path": "databricks_http_path"
}
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "databricks_database",
"schema": "databricks_enrichment_catalog",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "databricks",
"host": "databricks_host",
"password": "databricks_token",
"parameters": {
"path": "databricks_http_path"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
DB2
Adding and configuring a DB2 connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add DB2 as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their DB2 environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. DB2 is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name. This name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select DB2 from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the DB2 connector from the dropdown list and add connection details such as Secrets Management, host, port, user, password, SSL connection, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your DB2 account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
6. | Database (Required) | Specify the database name. |
7. | Schema (Required) | Define the schema within the database that should be used. |
8. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform the catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your DB2 account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User to connect. |
4. | Password | Enter the password to connect to the database. |
5. | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
6. | Database | Specify the database name. |
7. | Schema | Define the schema within the database that should be used. |
8. | Teams | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the DB2 instance is hosted. It is the endpoint used to connect to the DB2 environment.
- Database: Refers to the specific database within the DB2 environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a DB2 datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "db2_database",
"schema": "db2_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "db2",
"host": "db2_host",
"port": "db2_port",
"username": "db2_username",
"password": "db2_password",
"parameters": {
"ssl": true
}
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "db2_database",
"schema": "db2_enrichment_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "db2",
"host": "db2_host",
"port": "db2_port",
"username": "db2_username",
"password": "db2_password",
"parameters": {
"ssl": true
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Hive
Steps to setup Hive
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
By clicking on the Finish button, the Datastore will be created and the configuration of an Enrichment Datastore will be skipped.
- To configure an Enrichment Datastore later, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
Warning
Qualytics does not support the Hive connector as an enrichment datastore, but you can point to a different connector.
- To configure an Enrichment Datastore later, please refer to this section.
- If you have an Enrichment Datastore already set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page.
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
By clicking on the Finish button, the Datastore will be created and the Enrichment Datastore will be linked or created.
Fields
- Name (required): The datastore name to be created in the Qualytics App.
- Hostname (required): The address of the server to connect to. This address can be a DNS name or an IP address.
- Port (required): The port to connect to on the server. The default is 10000. Note: If you're using the default, you don't have to specify the port.
- Schema (required): The schema name to be connected.
- User (required): The user to connect to Hive.
- Password (required): The password to connect to Hive.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "hive_database",
"schema": "hive_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "hive",
"host": "hive_host",
"port": "hive_port",
"username": "hive_username",
"password": "hive_password",
"parameters": {
"zookeeper": false
}
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
MariaDB
Steps to setup MariaDB
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, test the connection. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Name (required)
- The datastore name to be created in the Qualytics App.
Host (required)
- The host to connect to the MariaDB server.
Port (required)
- The TCP/IP port number to use for the connection. The default is 3306.
Database (required)
- The database name of the MariaDB instance you want to connect to.
User (required)
- The MariaDB user name to use when connecting to the server.
Password (required)
- The password of the MariaDB account.
Information on how to connect with MariaDB
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mariadb_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "mariadb",
"host": "mariadb_host",
"port": "mariadb_port",
"username": "mariadb_username",
"password": "mariadb_password"
}
}
Creating an Enrichment Datastore
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
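The payload below is a minimal sketch modeled on the MariaDB source datastore example above; it assumes the same connection fields apply, with enrich_only set to true. Adjust the placeholder values for your environment.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mariadb_enrichment_database",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "mariadb",
"host": "mariadb_host",
"port": "mariadb_port",
"username": "mariadb_username",
"password": "mariadb_password"
}
}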
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Microsoft SQL Server
Adding and configuring a Microsoft SQL Server connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Microsoft SQL Server as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Microsoft SQL Server environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Microsoft SQL Server is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name (this name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Microsoft SQL Server from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Microsoft SQL Server connector from the dropdown list and add connection details such as Secret Management, host, port, username, password, and database.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
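To illustrate how the JSONPath fields map onto Vault responses, the snippets below show a simplified login response and secret payload with hypothetical values; the exact shape depends on your Vault auth method and secrets engine. With responses like these, $.auth.client_token extracts the client token and $.data extracts the stored credentials.
Login response (excerpt):
{
"auth": {
"client_token": "hvs.example-client-token"
}
}
Secret read response (excerpt):
{
"data": {
"username": "stored_username",
"password": "stored_password"
}
}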
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Microsoft SQL Server account and add it to this field. |
2. | Port (Optional) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure to add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Microsoft SQL Server account and add it to this field. |
2. | Port (Optional) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the Password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Teams: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the Microsoft SQL Server instance is hosted. It is the endpoint used to connect to the Microsoft SQL Server environment.
-
Database: Refers to the specific database within the Microsoft SQL Server environment where the data is stored.
-
Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API.
Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Microsoft SQL Server datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "sqlserver_database",
"schema": "sqlserver_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "sqlserver",
"host": "sqlserver_host",
"port": "sqlserver_port",
"username": "sqlserver_username",
"password": "sqlserver_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "sqlserver_database",
"schema": "sqlserver_enrichment_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "sqlserver",
"host": "sqlserver_host",
"port": "sqlserver_port",
"username": "sqlserver_username",
"password": "sqlserver_password",
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
MySQL
Adding and configuring MySQL connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add MySQL as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their MySQL environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. MySQL is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name (this name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select MySQL from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the MySQL connector from the dropdown list and add connection details such as Secret Management, host, port, username, password, and database.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your MySQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a "connection" to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your MySQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Team: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the MySQL instance is hosted. It is the endpoint used to connect to the MySQL environment.
-
Database: Refers to the specific database within the MySQL environment where the data is stored.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API.
Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a MySQL datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mysql_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "mysql",
"host": "mysql_host",
"port": "mysql_port",
"username": "mysql_username",
"password": "mysql_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
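The payload below is a minimal sketch modeled on the MySQL source datastore example above; it assumes the same connection fields apply, with enrich_only set to true. Adjust the placeholder values for your environment.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mysql_enrichment_database",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "mysql",
"host": "mysql_host",
"port": "mysql_port",
"username": "mysql_username",
"password": "mysql_password"
}
}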
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Oracle
Adding and configuring an Oracle connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Oracle as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Oracle environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Oracle, for example, is a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Oracle datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (the specified name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector | Select Oracle from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Oracle connector from the dropdown list and add connection details such as Secret Management, host, port, username, sid, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get “Hostname” from your Oracle account and add it to this field. |
2. | Port | Specify the “Port” number. |
3. | Protocol | Specifies the connection protocol used for communicating with the database. Choose between TCP or TCPS. |
4. | Connect By | You can choose between SID or Service Name to establish a connection with the Oracle database, depending on how your database instance is configured. |
5. | User | Enter the “User ID” to connect. |
6. | Password | Enter the “password” to connect to the database. |
7. | Schema | Define the schema within the database that should be used. |
8. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initiate Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Warning
Qualytics does not support the Oracle connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Oracle as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Oracle as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
- Host: This is the server address where the Oracle instance is hosted. It is the endpoint used to connect to the Oracle environment.
- Database: Refers to the specific database within the Oracle environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating an Oracle datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "oracle_database",
"schema": "oracle_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "oracle",
"host": "oracle_host",
"port": "oracle_port",
"username": "oracle_username",
"password": "oracle_password",
"parameters": {
"sid": "orcl"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
PostgreSQL
Adding and configuring a PostgreSQL connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add PostgreSQL as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their PostgreSQL environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. PostgreSQL is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore (the specified name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select PostgreSQL from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the PostgreSQL connector from the dropdown list and add connection details such as Secrets Management, host, port, username, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your PostgreSQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your PostgreSQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the user account. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Team: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the PostgreSQL instance is hosted. It is the endpoint used to connect to the PostgreSQL environment.
-
Database: Refers to the specific database within the PostgreSQL environment where the data is stored.
-
Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a PostgreSQL datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "postgresql_database",
"schema": "postgresql_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "postgresql",
"host": "postgresql_host",
"port": "postgresql_port",
"username": "postgresql_username",
"password": "postgresql_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "postgresql_database",
"schema": "postgresql_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "postgresql",
"host": "postgresql_host",
"port": "postgresql_port",
"username": "postgresql_username",
"password": "postgresql_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Presto
Steps to set up Presto
Fill the form with the credentials of your data source.
Once the form is completed, test the connection to verify that Qualytics is able to connect to your data source. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
Warning
Qualytics does not support the Presto connector as an enrichment datastore, but you can point to a different connector.
- To configure an Enrichment Datastore at a later time, please refer to this section.
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page.
Once the form is completed, test the connection. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Host (required)
- The address of the server to connect to. This address can be a DNS or IP address.
Port (required)
- The port to connect to on serverName.
- The default is 8080. Note: If you're using the default, you don't have to specify the port.
Catalog (required)
- The catalog name to be connected.
Schema (required)
- The schema name to be connected. The default is default.
User (required)
- The user to connect to Presto.
Password (required)
- The password to connect to Presto.
SSL TrustStore (required)
- A keystore file that contains certificates from other parties that you expect to communicate with, or from Certificate Authorities that you trust to identify other parties.
Configuring Presto for Hive Table Access Control:
Locate the Hive Connector Configuration File:
- The configuration file for the Hive connector in Presto is typically named hive.properties and is located in the etc/catalog directory of your Presto installation.
Modify the Configuration:
- Open the hive.properties file in a text editor.
- Add or modify the line hive.allow-drop-table=true to allow dropping tables. If you set it to false, it will disallow dropping tables. A minimal example of the relevant line is shown below.
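For reference, the relevant entry in etc/catalog/hive.properties would look like the following excerpt; the rest of your connector configuration remains unchanged:
hive.allow-drop-table=true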
Restart Presto:
- After making changes to the configuration, you'll need to restart Presto for the changes to take effect.
- Or, you can use the restart command to do it in one step, as shown in the example below:
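A minimal sketch, assuming a standard Presto installation managed with the bundled launcher script (adjust the path to match your installation):
bin/launcher stop
bin/launcher start
Or, in one step:
bin/launcher restart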
Info
The hive.allow-drop-table configuration is just one of the many configurations available. If you want to control more granular permissions, such as read/write access, you might need to look into using a combination of Hive's native permissions and the configurations available in Presto.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "presto_database",
"schema": "presto_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "presto",
"host": "presto_host",
"port": "presto_port",
"username": "presto_username",
"password": "presto_password",
"parameters":{
"ssl_truststore":"truststore.jks"
},
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Redshift
Adding and configuring Redshift connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Redshift as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Redshift environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Redshift is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore. (e.g., The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Redshift from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Redshift connector from the dropdown list and add connection details such as Secret Management, port, host, password, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Redshift account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the Redshift user account. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you can add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Redshift account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the Redshift user account. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Team: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the Redshift instance is hosted. It is the endpoint used to connect to the Redshift environment.
- Database: Refers to the specific database within the Redshift environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Redshift datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "redshift_database",
"schema": "redshift_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "redshift",
"host": "redshift_host",
"port": "redshift_port",
"username": "redshift_username",
"password": "redshift_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "redshift_database",
"schema": "redshift_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "redshift",
"host": "redshift_host",
"port": "redshift_port",
"username": "redshift_username",
"password": "redshift_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Snowflake
Adding and configuring a Snowflake connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Snowflake as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Snowflake environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Snowflake Setup Guide
The Snowflake Setup Guide provides step-by-step instructions for configuring warehouses and roles, ensuring efficient data management and access control. It explains how to create a warehouse with minimal requirements and how to set a default warehouse for a user. It also explains how to create custom read-only and read-write roles and grant the necessary privileges for data access and modification.
This guide is designed to help you optimize your Snowflake environment for performance and security, whether setting it up for the first time or refining your configuration.
Warehouse & Role Configuration
This section provides instructions for configuring Snowflake warehouses and roles. It includes creating a warehouse with minimum requirements, assigning a default warehouse for a user, creating custom read-only and read-write roles, and granting privileges to these roles for data access and modification.
Create a Warehouse
Use the following command to create a warehouse with minimum requirements:
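The exact statement is not reproduced here; the following is a minimal sketch that uses the qualytics_wh name referenced in the grants below, with assumed sizing and auto-suspend settings that you can adjust to your needs.

```sql
-- Minimal warehouse for Qualytics workloads (size and suspend settings are assumptions)
CREATE WAREHOUSE IF NOT EXISTS qualytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE
       INITIALLY_SUSPENDED = TRUE;
```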
Set a specific warehouse as the default for a user:
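For example, assuming the qualytics_wh warehouse created above and a placeholder <user_name> for the user Qualytics connects as:

```sql
-- Make qualytics_wh the default warehouse for the connecting user
ALTER USER <user_name> SET DEFAULT_WAREHOUSE = qualytics_wh;
```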
Source Datastore Privileges and Permissions
Create a new role called qualytics_read_role and grant it privileges:
CREATE ROLE qualytics_read_role;
GRANT USAGE ON WAREHOUSE qualytics_wh TO ROLE qualytics_read_role;
GRANT USAGE ON DATABASE <database_name> TO ROLE qualytics_read_role;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON TABLE <database_name>.<schema_name>.<table_name> TO ROLE qualytics_read_role;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT ROLE qualytics_read_role TO USER <user_name>;
Enrichment Datastore Privileges and Permissions
Create a new role called qualytics_readwrite_role and grant it privileges:
CREATE ROLE qualytics_readwrite_role;
GRANT USAGE ON WAREHOUSE qualytics_wh TO ROLE qualytics_readwrite_role;
GRANT USAGE, MODIFY ON DATABASE <database_name> TO ROLE qualytics_readwrite_role;
GRANT USAGE, MODIFY ON SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT CREATE TABLE ON SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT ROLE qualytics_readwrite_role TO USER <user_name>;
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Snowflake is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. (e.g., The specified name will appear on the datastore cards.) |
2️. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3️. | Connector (Required) | Select Snowflake from the dropdown list. |
Option I: Create a Source Datastore with a New Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Snowflake connector from the dropdown list and add the connection details.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Account (Required) | Define the account identifier to be used for accessing Snowflake. |
2️. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3️. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Snowflake. Keypair Authentication: This method uses a public/private key pair; the public key is registered with the Snowflake user and the private key (optionally protected by a passphrase) is used to authenticate the connection. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore Connection
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Account (Required) | Define the account identifier to be used for accessing Snowflake. |
2️. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3️. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Snowflake. Keypair Authentication: This method uses a public/private key pair; the public key is registered with the Snowflake user and the private key (optionally protected by a passphrase) is used to authenticate the connection. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 4: Click on the Test Connection button to verify the enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Datastore
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files for metadata. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore. Example - All users are assigned to the Public team, which means that this enrichment datastore is accessible to all the users.
- Host: This is the host domain of the Snowflake instance.
- Database: Refers to the specific database within the Snowflake environment. This database is a logical grouping of schemas. Each database belongs to a single Snowflake account.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Snowflake datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "snowflake_database",
"schema": "snowflake_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "snowflake",
"host": "snowflake_host",
"username": "snowflake_username",
"password": "snowflake_password",
"passphrase": "key_passphrase",
"parameters": {
"role": "snowflake_read_role",
"warehouse": "qualytics_wh",
"authentication_type": "KEYPAIR"
}
}
}
Note
If the authentication_type parameter is removed, BASIC authentication will be used by default.
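For reference, a connection block using basic authentication simply omits the passphrase and the authentication_type parameter. The fragment below mirrors the payload above and is illustrative only:

```json
"connection": {
    "name": "your_connection_name",
    "type": "snowflake",
    "host": "snowflake_host",
    "username": "snowflake_username",
    "password": "snowflake_password",
    "parameters": {
        "role": "snowflake_read_role",
        "warehouse": "qualytics_wh"
    }
}
```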
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "snowflake_database",
"schema": "snowflake_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "snowflake",
"host": "snowflake_host",
"username": "snowflake_username",
"password": "snowflake_password",
"parameters": {
"role": "snowflake_readwrite_role",
"warehouse": "qualytics_wh"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Synapse
Steps to setup Synapse
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you already have an Enrichment Datastore set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Name
required
- The datastore name to be created in the Qualytics App.
Host
required
- Host URL to be connected.
- Hostname in the form
Port
required
- Port number to access the Synapse database.
- The default port is 1433.
Database
required
- The database name to be connected or which the account user has access to.
Schema
required
- The schema name to be connected or which the account user has access to.
User
required
- The user that has access to the Synapse application.
Password
required
- The password for the user that has access to the Synapse application.
Information on how to connect with Synapse
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "synapse_database",
"schema": "synapse_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "synapse",
"host": "synapse_host",
"port": "synapse_port",
"username": "synapse_username",
"password": "synapse_password"
}
}
Creating an Enrichment Datastore
Endpoint (Post): /api/datastores (post)
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "synapse_database",
"schema": "synapse_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "synapse",
"host": "synapse_host",
"port": "synapse_port",
"username": "synapse_username",
"password": "synapse_password"
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Teradata
Adding and configuring a Teradata connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Teradata as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Teradata environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Teradata is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Teradata datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Teradata from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Teradata connector from the dropdown list and add connection details such as Secret Management, host, port, username, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your Teradata account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | Database | Specify the database name. |
6. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata, in tables. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Teradata connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Teradata as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Note
Qualytics does not support Teradata as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the Teradata instance is hosted. It is the endpoint used to connect to the Teradata environment.
- Database: Refers to the specific database within the Teradata environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
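The sample payload is not reproduced here; the sketch below mirrors the other JDBC connectors in this guide, and the "teradata" connection type plus the exact field set are assumptions you should verify against your environment:

```json
{
    "name": "your_datastore_name",
    "teams": ["Public"],
    "database": "teradata_database",
    "enrich_only": false,
    "trigger_catalog": true,
    "connection": {
        "name": "your_connection_name",
        "type": "teradata",
        "host": "teradata_host",
        "port": "teradata_port",
        "username": "teradata_username",
        "password": "teradata_password"
    }
}
```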
Link an Enrichment Datastore to a Source Datastore
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
TimescaleDB
Adding and configuring a TimescaleDB connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding TimescaleDB as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their TimescaleDB environment is properly connected with Qualytics, unlocking the platform’s potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. TimescaleDB is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the TimescaleDB datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select TimescaleDB from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the TimescaleDB connector from the dropdown list and add connection details such as Secret Management, host, port, username, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your TimescaleDB account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | Database | Specify the database name. |
6. | Schema | Define the schema within the database that should be used. |
7. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata, in tables. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the TimescaleDB connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support TimescaleDB as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support TimescaleDB as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Bank Enrichment as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the TimescaleDB instance is hosted. It is the endpoint used to connect to the PostgreSQL environment.
- Database: Refers to the specific database within the TimescaleDB environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating a TimescaleDB datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "timescale_database",
"schema": "timescale_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "timescale",
"host": "timescale_host",
"port": "timescale_port",
"username": "timescale_username",
"password": "timescale_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Trino
Steps to setup Trino
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you already have an Enrichment Datastore set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Host
required
- The address of the server to connect to. This address can be a DNS or IP address.
Port
required
- The port to connect to on serverName.
- The default is 8080. Note: if you're using the default, you don't have to specify the port.
Catalog
required
- The catalog name to be connected.
Schema
required
- The schema name to be connected.
- The default is default.
User
required
- The user used to connect to Hive.
Password
required
- The password used to connect to Hive.
SSL TrustStore
required
- A keystore file that contains certificates from other parties that you expect to communicate with, or from Certificate Authorities that you trust to identify other parties.
Configuring Trino for Hive Table Access Control:
Locate the Hive Connector Configuration File:
- The configuration file for the Hive connector in Trino is typically named hive.properties and is located in the etc/catalog directory of your Trino installation.
Modify the Configuration:
- Open the hive.properties file in a text editor.
- Add or modify the line hive.allow-drop-table=true to allow dropping tables. If you set it to false, it will disallow dropping tables.
Restart Trino:
- After making changes to the configuration, you'll need to restart Trino for the changes to take effect.
- Or, you can use the restart command to do it in one step:
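A minimal sketch, assuming a standard tarball installation where the launcher script lives under bin/ of the Trino install directory:

```bash
# Restart Trino in one step so the hive.properties change takes effect
bin/launcher restart
```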
Info
The hive.allow-drop-table configuration is just one of the many configurations available. If you want to control more granular permissions, such as read/write access, you might need to look into using a combination of Hive's native permissions and the configurations available in Trino.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "trino_database",
"schema": "trino_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "trino",
"host": "trino_host",
"port": "trino_port",
"username": "trino_username",
"password": "trino_password",
"parameters":{
"ssl_truststore":"truststore.jks"
}
}
}
Creating an Enrichment Datastore
Endpoint (Post): /api/datastores (post)
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "trino_database",
"schema": "trino_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "trino",
"host": "trino_host",
"port": "trino_port",
"username": "trino_username",
"password": "trino_password",
"parameters":{
"ssl_truststore":"truststore.jks"
}
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
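A minimal sketch of calling this endpoint with Python's requests library is shown below. The base URL, the bearer-token authentication header, and the numeric IDs are placeholders and assumptions about your deployment; substitute the values used by your Qualytics instance.

```python
import requests

# Placeholders: replace with your Qualytics instance URL, API token, and the
# IDs of the source datastore and the enrichment datastore to link.
BASE_URL = "https://your-instance.qualytics.io/api"
API_TOKEN = "your_api_token"
DATASTORE_ID = 123
ENRICHMENT_ID = 456

response = requests.patch(
    f"{BASE_URL}/datastores/{DATASTORE_ID}/enrichment/{ENRICHMENT_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
print("Linked:", response.status_code)
```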
Dremio
Adding and configuring a Dremio connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Dremio as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Dremio environment is properly connected with Qualytics, unlocking the platform’s potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Dremio is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Dremio datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Dremio from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Dremio connector from the dropdown list and add connection properties such as Secrets Management, host, port, username, and password, along with datastore properties like catalog, database, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4 | Secret URL | Enter the URL where the secret is stored in Vault. |
5 | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6 | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
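To make the JSONPath examples above concrete, the sketch below shows the kind of response shapes they typically resolve against. The exact JSON returned by your Vault instance depends on the auth method and secrets engine you use, so treat these field values as illustrative placeholders only.

```python
# Illustrative Vault response shapes (placeholders only).
login_response = {
    "auth": {
        "client_token": "hvs.EXAMPLETOKEN"   # resolved by $.auth.client_token
    }
}

secret_response = {
    "data": {                                # resolved by $.data
        "username": "dremio_username",
        "password": "dremio_password",
    }
}

# Plain-Python equivalents of the JSONPath expressions from the table:
token = login_response["auth"]["client_token"]
credentials = secret_response["data"]
print(token, credentials["username"])
```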
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Host | Get the hostname from your Dremio account and add it to this field. |
2 | Port | Specify the Port number. |
3 | Project ID | Enter the Project ID associated with Dremio. |
4 | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
5 | Authentication | You can choose between Basic authentication or Access Token for validating and securing the connection to your Dremio instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Dremio. Access Token: This method authenticates using a personal access token generated in Dremio. |
6 | Schema | Define the schema within the database that should be used. |
7 | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
8 | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
Click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Dremio connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELD | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection
REF. | FIELDS | ACTION |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Name | Give a name for the enrichment datastore. |
3 | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4 | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Dremio as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using DB2 as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Dremio as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using DB2 as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- Host: This is the server address where the enrichment datastore instance is hosted. It is the endpoint used to connect to that environment.
- Database: Refers to the specific database within the enrichment datastore environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Dremio datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "dremio_database",
"schema": "dremio_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "dremio",
"host": "dremio_host",
"port": 443,
"project_id": "dremio_id",
"ssl": true,
"authentication": {
"type": "access_token",
"personal_access_token": "your_personal_access_token"},
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
DFS Datastores
DFS Datastore Overview
The DFS (Distributed File System) Datastore feature in Qualytics is designed to handle data stored in distributed file systems.
This includes file systems like Hadoop Distributed File System (HDFS) or similar distributed storage solutions.
Supported Distributed File Systems
Qualytics supports DFS Datastores catering to distributed file systems such as Amazon S3, Azure Blob Storage, Azure Datalake Storage, and Google Cloud Storage, which are covered in the following sections.
Connection Details
Users provide connection details for DFS Datastores, allowing Qualytics to establish a connection to the distributed file system.
Catalog Operation
The Catalog operation involves walking the directory tree, reading files with supported filename extensions, and creating containers based on file metadata.
Data Quality and Profiling
DFS Datastores support the initiation of Profile Operations, allowing users to understand the structure and characteristics of the data stored in the distributed file system.
Containers Overview
For a more detailed understanding of how Qualytics manages and interacts with containers in DFS Datastores, please refer to the Containers section in our comprehensive user guide.
This section covers topics such as container deletion, field deletion, and the initial profile of a Datastore's containers.
Multi-Token Filename Globbing and Container Formation
Filenames with similar structures in the same folder are automatically included in a single globbed container during the Catalog operation.
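As a conceptual illustration of that grouping (not Qualytics' actual cataloging code), the sketch below derives a single glob pattern from a set of hypothetical filenames in a folder that differ only by a date-like token:

```python
import re

# Hypothetical filenames that differ only in the date token.
filenames = ["orders_20240227.csv", "orders_20240228.csv", "orders_20240229.csv"]

def to_pattern(name: str) -> str:
    # Collapse runs of digits into a wildcard so the varying token disappears.
    return re.sub(r"\d+", "*", name)

patterns = {to_pattern(name) for name in filenames}
print(patterns)  # {'orders_*.csv'} -> all three files fall into one globbed container
```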
Use Folders for Precise File Grouping
Organizing files within distinct folders is a straightforward and effective strategy in Distributed File Systems (DFS).
When all files in a folder share a common schema, it simplifies the process of grouping and managing them.
This approach ensures precise file grouping without relying on complex glob patterns.
How to Use Folders for Shared Schema
1. Create a Folder:
Begin by creating a new folder in your distributed filesystem.
- Suppose you have order data files with filenames like `orders_20240229.csv`, `orders-20240228.csv`, `orders-20240227.csv`.
- Create a folder named `Orders` to group these files.
Qualytics Pattern: Qualytics will automatically create the container `orders_*.csv` based on the filenames.
2. Place Related Files in the Folder:
Move or upload files that share a common schema into the created folder.
- Move the order data files into the `Orders` folder.
3. Repeat for Each Schema:
Create separate folders for different schemas, and organize files accordingly.
- Suppose you have customer data files with filenames like `customers_us.csv`, `customers_eu.csv`.
- Create a folder named `Customers` to group these files.
Qualytics Pattern: Qualytics will automatically create the pattern `customers_*.csv` based on the filenames.
4. Naming Conventions:
Consider adopting clear and consistent naming conventions for folders to enhance organization.
- Use descriptive names for folders, such as `Orders` and `Customers`, to make it easier to identify the contents.
Flowchart: Using Folders for Shared Schema
graph TD
A[Start] -->|Create a Folder| B(Create Folder)
B -->|Place Related Files| C(Move or Upload Files)
C -->|Repeat for Each Schema| D(Create Separate Folders)
D -->|Naming Conventions| E(Consider Clear Naming)
E --> F[End]
Use Filename Conventions for POSIX Globs
This option leverages filename conventions that align with POSIX globs, allowing our system to automatically organize files for you.
The system intelligently analyzes filename patterns, making the process seamless and efficient.
How to Use Filename Conventions for POSIX Globs
1. Follow Clear Filename Conventions:
Adopt clear and consistent filename conventions that lend themselves to POSIX globs.
- Suppose you have log files with filenames like `app_log_20240229.txt`, `app_log_20240228.txt`, `app_log_20240227.txt`.
- Use a consistent naming convention like `app_log_*.txt`, where `*` serves as a placeholder for varying elements.
- The `*` in the convention acts as a wildcard, representing any variation in the filename. In this example, it matches the date part (`20240229`, `20240228`, etc.).
2. Upload or Move Files:
Upload or move files with filenames following the adopted conventions to your distributed filesystem.
- Move log files with the specified naming convention to your DFS.
3. System Analysis:
Our system will automatically detect and analyze the filename conventions, creating appropriate glob patterns.
- With filenames like `app_log_20240229.txt` and `app_log_20240228.txt`, the system will create the pattern `app_log_*.txt` (see the sketch below).
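The sketch below uses Python's standard fnmatch module to show which files the derived POSIX-style pattern matches. It is an illustration of the convention, not Qualytics code, and the filenames are hypothetical.

```python
import fnmatch

log_files = [
    "app_log_20240229.txt",
    "app_log_20240228.txt",
    "app_log_20240227.txt",
    "error_log_20240229.txt",  # different prefix, so it falls outside the pattern
]

# The * wildcard matches the varying date token, so all app_log_ files group together.
matched = fnmatch.filter(log_files, "app_log_*.txt")
print(matched)  # ['app_log_20240229.txt', 'app_log_20240228.txt', 'app_log_20240227.txt']
```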
Flowchart: Using Filename Conventions for POSIX Globs
graph TD
A[Start] -->|Follow Clear Conventions| B(Adopt Consistent Conventions)
B -->|Upload or Move Files| C(Move Files to DFS)
C -->|System Analysis| D(Automatic Pattern Creation)
D --> E[End]
Why Not Manually Create Your Own Globs?
While our system offers powerful features to automate file organization, we strongly discourage manually creating globs.
This option may lead to errors, inconsistencies, and hinder the efficiency of our system.
We recommend leveraging our automated tools for a seamless and error-free experience.
Complex and Error-Prone:
Manually creating globs can be complex, prone to typos, and susceptible to errors in pattern formation.
- Suppose you want to group log files with the pattern `app_log_*.txt`. A manual attempt might result in mistakes like `app_log_202*.txt` or `app_log_*.tx`.
Inconsistencies Across Files:
Manual glob creation may lead to inconsistencies across different files, making it challenging to establish a uniform file organization.
- Trying to manually create globs for order data files with inconsistent filename formats (`orders_20240229.csv`, `orders-20240228.csv`) can result in inconsistent patterns.
Explore Deeper Knowledge
If you want to go deeper into the knowledge or if you are curious and want to learn more about DFS filename globbing, you can explore our comprehensive guide here: How DFS Filename Globbing Works.
Amazon S3
Adding and configuring an Amazon S3 connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Amazon S3 as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Amazon S3 environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Amazon S3 Setup Guide
This section provides a simple walkthrough for setting up Amazon S3, including retrieving URIs. It also explains how to retrieve the Access Key and Secret Key to configure datastore permissions.
By following the Amazon S3 setup process, you will ensure secure and efficient access to your stored data, allowing seamless datastore integration and proper access management in Qualytics.
Retrieve the URI
The S3 URI is the unique resource identifier within the context of the S3 protocol. It follows this naming convention: `s3://bucket-name/key-name`
To retrieve the URI of an S3 object via the AWS Console, follow these steps:
- Navigate to the AWS S3 console and click on your bucket's name (use the search input to find the object if necessary).
- Click on the checkbox next to the object's name
- Click on the Copy S3 URI button
Retrieve the Access Key and Secret Key
The access keys are long-term credentials for an IAM user or the AWS account root user. You can use these keys to sign programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK).
To retrieve the Access Key and Secret Access Key, follow these steps:
- Open the IAM console.
- From the navigation menu, click on the Users.
- Select your IAM user name.
- Click on the User Actions, and then click on the Manage Access Keys.
- Click on the Create Access Key.
- Your keys will look something like this:
  - Access key ID example: `AKIAIOSFODNN7EXAMPLE`
  - Secret access key example: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
- Click on the Download Credentials, and store the keys in a secure location.
Warning
Your Secret Access Key will be visible only once at the time of creation. Please ensure you copy and securely store it for future use.
Datastore Privileges
If you are using a private bucket, authentication is required for the connection.
Source Datastore Permissions (Read-Only)
To create a policy, follow these steps:
- Open the IAM console.
- Navigate to Policies in the IAM dashboard and select Create Policy.
- Go to the JSON tab and paste the provided JSON into the Policy editor.
Tip
Ensure you replace `<bucket/path>` with your specific resource.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:Get*"
],
"Resource": [
"arn:aws:s3:::<bucket>/*",
"arn:aws:s3:::<bucket>"
]
}
]
}
Warning
Currently, object-level permissions alone are insufficient to authenticate the connection. Please ensure you also include bucket-level permissions as demonstrated in the example above.
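To sanity-check the read-only policy and keys before configuring the datastore, a minimal sketch using the boto3 SDK (an assumption; any S3 client works) is shown below. The bucket name, object key, and credentials are placeholders.

```python
import boto3

# Placeholders: use the access key pair created above and your own bucket/key names.
s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

# s3:ListBucket -- enumerate a few objects in the bucket.
listing = s3.list_objects_v2(Bucket="your-bucket", MaxKeys=5)
for obj in listing.get("Contents", []):
    print(obj["Key"])

# s3:Get* -- read one object's metadata.
head = s3.head_object(Bucket="your-bucket", Key="path/to/file.csv")
print(head["ContentLength"])
```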
Enrichment Datastore Permissions (Read-Write)
To create a policy, follow these steps:
- Open the IAM console.
- Navigate to Policies in the IAM dashboard and select Create Policy.
- Go to the JSON tab and paste the provided JSON into the Policy editor.
Tip
Ensure you replace `<bucket/path>` with your specific resource.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:Get*",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:PutObject",
"s3:DeleteObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Resource": [
"arn:aws:s3:::<bucket>/*",
"arn:aws:s3:::<bucket>"
]
}
]
}
Warning
Currently, object-level permissions alone are insufficient to authenticate the connection. Please ensure you also include bucket-level permissions as demonstrated in the example above.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Amazon S3 is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards.). |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Amazon S3 from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Amazon S3 connector from the dropdown list and add connection details such as URI, access key, secret key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Amazon S3. |
2️. | Access Key (Required) | Input the access key provided for secure access. |
3️. | Secret Key (Required) | Input the secret key associated with the access key for secure authentication. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams and initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Amazon S3 from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) for the Amazon S3. |
2. | Access Key (Required) | Input the access key provided for secure access. |
3. | Secret Key (Required) | Input the secret key associated with the access key for secure authentication. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `s3://bucket-name` for Amazon S3).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Amazon S3 datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
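A hedged sketch of this request in Python is shown below. The payload mirrors the other DFS connector examples in this guide; the exact field names accepted for the Amazon S3 connector (notably the "type" value and the access/secret key field names), the base URL, and the bearer-token header are assumptions to verify against your Qualytics API reference. Setting "enrich_only" to True would instead create an enrichment datastore.

```python
import requests

BASE_URL = "https://your-instance.qualytics.io/api"   # placeholder
API_TOKEN = "your_api_token"                           # placeholder

# Payload modeled on the Azure Blob Storage example in this guide; field names
# for the S3 connector are assumptions and should be confirmed.
payload = {
    "name": "your_datastore_name",
    "teams": ["Public"],
    "trigger_catalog": True,
    "root_path": "/s3_root_path",
    "enrich_only": False,   # True to create an enrichment datastore instead
    "connection": {
        "name": "your_connection_name",
        "type": "s3",
        "uri": "s3://bucket-name",
        "access_key": "AKIAIOSFODNN7EXAMPLE",
        "secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
}

response = requests.post(
    f"{BASE_URL}/datastores",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
print(response.json())
```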
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Azure Blob Storage
Adding and configuring an Azure Blob Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Azure Blob Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Azure Blob Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Azure Blob Storage Setup Guide
This setup guide details the process for retrieving the Account Name and Access Key of your Azure Blob Storage account, essential for seamless configuration in Qualytics.
Azure Blob Storage URI
The Uniform Resource Identifier (URI) for Azure Blob Storage is structured to uniquely identify resources within your storage account. The format of the URI is as follows: wasb[s]://<container-name>@<storage-account-name>.blob.core.windows.net/<path>
- <container-name>: The name of the container within your Azure Blob Storage account.
- <storage-account-name>: The name of your Azure Blob Storage account.
- <path>: A forward slash-delimited (/) representation of the directory structure within the container.
Retrieve the Account Name and Access Key
To configure Azure Blob Storage Datastore in Qualytics, you need the account name and access key. Follow these steps to retrieve them:
- To get the `account_name` and `access_key`, you need to access your storage account in Azure.
- Click on the Access Keys tab and copy the values.
Tip
Refer to the Azure Blob Storage documentation for more information on how to retrieve the account name and access key of your storage account.
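If you want to confirm the account name and access key before configuring the datastore, the sketch below uses the azure-storage-blob SDK (an assumption; any Blob Storage client works) to list containers with those credentials. All values are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: replace with the Account Name and Access Key retrieved above.
ACCOUNT_NAME = "your_account_name"
ACCESS_KEY = "your_access_key"

client = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCESS_KEY,
)

# Listing containers is a quick way to verify the credentials are valid.
for container in client.list_containers():
    print(container.name)
```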
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Azure Blob Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF | FIELD | ACTION |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. Example: The specified name will appear on the datastore cards. |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Azure Blob Storage from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Azure Blob Storage connector from the dropdown list and add connection details such as URI, account name, access key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF | FIELD | ACTION |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Blob Storage. |
2️. | Account Name (Required) | Input the account name to access the Azure Blob Storage. |
3️. | Access Key (Required) | Input the access key provided for secure access. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF | FIELD | ACTION |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore. |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Azure Blob Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF | FIELD | ACTION |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Blob Storage. |
2️. | Account Name (Required) | Input the account name to access the Azure Blob Storage. |
3️. | Access Key (Required) | Input the access key provided for secure access. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF | FIELD | ACTION |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3️. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `wasbs://storage-url` for Azure Blob Storage).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Azure Blob Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/azure_root_path",
"enrich_only": false,
"connection": {
"name": "your_connection_name",
"type": "wasb",
"uri": "wasb://<container>@<account_name>.blob.core.windows.net",
"access_key": "azure_account_nme",
"secret_key": "azure_access_key"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/azure_root_path",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "wasb",
"uri": "wasb://<container>@<account_name>.blob.core.windows.net",
"access_key": "azure_account_nme",
"secret_key": "azure_access_key"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Azure Datalake Storage
Adding and configuring an Azure Datalake Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Azure Datalake Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Azure Datalake Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Azure Datalake Storage Setup Guide
This setup guide details the process for retrieving the Account Name and Access Key of your Azure Datalake Storage account, essential for seamless configuration in Qualytics.
Azure Datalake Storage URI
The Uniform Resource Identifier (URI) for Azure Datalake Storage is structured to uniquely identify resources within your storage account. The format of the URI is as follows: abfs[s]://<file_system>@<account-name>.dfs.core.windows.net/<path>
- abfs[s]: The abfs or abfss protocol is used as the scheme identifier.
- <file_system>: The parent location that holds the files and folders. This is similar to containers in the Azure Storage Blobs service.
- <account-name>: The name assigned to your storage account during creation.
- <path>: A forward slash-delimited (/) representation of the directory structure.
Retrieve the Account Name and Access Key
To configure Azure Datalake Storage Datastore in Qualytics, you need the account name and access key. Follow these steps to retrieve them:
- To get the `account_name` and `access_key`, you need to access your storage account in Azure.
- Click on the Access Keys tab and copy the values.
Tip
Refer to the Azure Datalake Storage documentation for more information on how to retrieve the account name and access key of your storage account.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Azure Datalake Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3. | Connector (Required) | Select Azure Datalake Storage from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Azure Datalake Storage connector from the dropdown list and add connection details such as URI, account name, access key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Datalake Storage. |
2. | Account Name (Required) | Input the account name to access the Azure Datalake Storage. |
3. | Access Key (Required) | Input the access key provided for secure access. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3. | Name (Required) | Give a name for the enrichment datastore. |
4. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5. | Connector (Required) | Select a datastore connector as Azure Datalake Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Datalake Storage. |
2. | Account Name (Required) | Input the account name to access the Azure Datalake Storage. |
3. | Access Key (Required) | Input the access key provided for secure access. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `abfss://storage-url` for Azure Datalake Storage).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Azure Datalake Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your\_datastore\_name",
"teams": \["Public"\],
"trigger_catalog": true,
"root_path": "/azure\_root\_path",
"enrich_only": false,
"connection": {
"name": "your\_connection\_name",
"type": "abfs",
"uri": "abfs://<container>@<account_name>.dfs.core.windows.net",
"access_key": "azure\_account\_nme",
"secret_key": "azure\_access\_key"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your\_datastore\_name",
"teams": \["Public"\],
"trigger_catalog": true,
"root_path": "/azure\_root\_path",
"enrich_only": true,
"connection": {
"name": "your\_connection\_name",
"type": "abfs",
"uri": "abfs://<container>@<account_name>.dfs.core.windows.net",
"access_key": "azure\_account\_nme",
"secret_key": "azure\_access\_key"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Google Cloud Storage
Adding and configuring Google Cloud Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Google Cloud Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Google Cloud Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Google Cloud Storage Setup Guide
This guide will walk you through the steps to set up Google Cloud Storage, including how to retrieve the necessary URIs, access keys, and secret keys, which are essential for integrating this datastore into Qualytics.
Retrieve Google Cloud Storage URI
To retrieve the Google Cloud Storage URI, follow the given steps:
- Go to the Cloud Storage Console.
- Navigate to the location of the object (file) that holds the source data.
- At the top of the Cloud Storage console, locate and note down the path to the object.
- Create the URI using the following format: `gs://bucket/file`
  - `bucket` is the name of the Cloud Storage bucket.
  - `file` is the name of the object (file) containing the data.
Retrieve the Access Key and Secret Key
You need these keys when integrating Google Cloud Storage with other applications or services, such as when adding it as a datastore in Qualytics. The keys allow you to reuse existing code to access Google Cloud Storage without needing to implement a different authentication mechanism.
To retrieve the access key and secret key in the Google Cloud Storage Console account, follow the given steps:
Step 1: Log in to the Google Cloud Console, navigate to the Google Cloud Storage settings, and this will redirect you to the Settings page.
Step 2: Click on the Interoperability tab.
Step 3: Scroll down the Interoperability page and under Access keys for your user account, click the CREATE A KEY button to generate a new Access Key and Secret Key.
Step 4: Use the generated Access Key and Secret Key values when adding your Google Cloud Storage account to Qualytics as a datastore.
For example, once you generate the keys, they might look like this:
- Access Key: GOOG1234ABCDEFGH5678
- Secret Key: abcd1234efgh5678ijklmnopqrstuvwx
Warning
Make sure to store these keys securely, as they provide access to your Google Cloud Storage resources.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Google Cloud Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. The specified name will appear on the datastore cards. |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Google Cloud Storage from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Google Cloud Storage connector from the dropdown list and add connection details such as URI, service account key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Google Cloud Storage. |
2️. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing the Google Cloud Storage. |
3️. | Root Path (Required) | Specify the root path where the data is stored. |
4️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
5️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure and add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Google Cloud Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) for the Google Cloud Storage. |
2️. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing the Google Cloud Storage. |
3️. | Root Path (Required) | Specify the root path where the data is stored. |
4️. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3️. | Enrichment Datastore (Required) | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, Marked as Public means that this datastore is accessible to all users.
- URI: The Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., gs://bucket/file for Google Cloud Storage).
- Root Path: The root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Google Cloud Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
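The payload below is a sketch that follows the structure shown for other DFS connectors in this guide (such as Azure Datalake Storage); the connection type value ("gcs") and the credential field names are assumptions and may differ in your Qualytics API version, so confirm them against the API reference before use.
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/gcs_root_path",
"enrich_only": false,
"connection": {
"name": "your_connection_name",
"type": "gcs",
"uri": "gs://<bucket_name>",
"access_key": "gcs_access_key",
"secret_key": "gcs_secret_key"
}
}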
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
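As above, this is a sketch based on the payload structure used for other DFS connectors in this guide; the connection type value ("gcs") and the credential field names are assumptions and should be verified against the Qualytics API reference. Note that enrich_only is set to true for an enrichment datastore.
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/gcs_root_path",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "gcs",
"uri": "gs://<bucket_name>",
"access_key": "gcs_access_key",
"secret_key": "gcs_secret_key"
}
}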
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
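For illustration, linking a source datastore with ID 42 to an enrichment datastore with ID 17 (both hypothetical IDs) would send a PATCH request to:
/api/datastores/42/enrichment/17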
Qualytics File System (QFS)
Overview
- A QFS, or Qualytics File System, is a datastore option that Qualytics manages & controls.
- This is convenient for end-users who do not have or do not wish to configure their own datastores but would like to load data onto a datastore managed by Qualytics.
- With a QFS datastore, end-users are able to upload Qualytics-readable files (Excel, CSV, JSON) directly through the Qualytics user interface.
Steps to setup QFS
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section
Note
It is important to associate an Enrichment Datastore with your new Datastore
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore
Configuring an Enrichment Datastore
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore
Fields
Name (required)
- The datastore name to be created in the Qualytics App
Ended: DFS Datastores
Connections Overview
Connections facilitate the management of datastores by allowing you to share common connection parameters across multiple datastores.
Setup a Connection
When you create your first datastore, a Connection is automatically generated using the parameters you provide. This Connection can then be used for other datastores with similar configuration needs.
Example:
- Click on Add Source Datastore.
- Enter the necessary connection parameters (e.g., hostname, database name, user credentials).
- After saving, a Connection is created and linked to your datastore, ready for reuse.
Reuse a Connection
For subsequent datastores that require the same connection parameters, you can reuse existing connections.
Example:
- Open the Create Datastore form.
- Select Use an existing connection.
- Pick the desired Connection from a dropdown list.
- Provide any additional details required for the new datastore (e.g., database name, root path).
Manage Connections
For managing the connections please see Manage Connections.
Conclusion
Using Connections optimizes datastore management by enabling the reuse of connection parameters, making the process more streamlined and organized.
Ended: Add Datastores
Source Datastores ↵
Right Click Options
Once you add a source datastore, whether JDBC or DFS, Qualytics provides right-click options on the following:
- Added source datastore
- Tables or files within the source datastore
- Fields within the tables
- Checks within the source datastore
- Anomalies within the source datastore
Let’s get started 🚀
Right Click Source Datastore
Log in to your Qualytics account and right-click on the source datastore whether JDBC or DFS. A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Opens the selected source datastore in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected source datastore to your clipboard. |
3 | Copy ID | Copy the unique ID of the selected source datastore. |
4 | Copy Name | Copy the name of the selected source datastore to your clipboard. |
Alternatively, you can access these right-click options by performing the direct right-click operation on a source datastore from the list.
Right Click Tables & Files
Right-click on a specific table or file within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected table from the datastore in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected table to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected table. |
4 | Copy Name | Copy the name of the selected table to your clipboard. |
Alternatively, you can access these right-click options by opening the dedicated page of the source datastore, navigating to its Tables or files section, and performing the right-click operation on any table or file from the list.
Right Click Fields
Right-click on a specific field within a table or file.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected field in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected field to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected field. |
4 | Copy Name | Copy the name of the selected field to your clipboard. |
Alternatively, you can access these right-click options by opening the dedicated page of the table, navigating to its Fields section, and performing the right-click operation on any field from the list.
Right Click Checks
Right-click on a specific check, whether All, Active, Draft, or Archived, within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected check in a new browser tab, where you can view its full details. |
2 | Copy Link | Copy the unique URL of the selected check to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected check. |
4 | Copy Name | Copy the name of the selected check to your clipboard. |
Alternatively, you can access these right-click options by navigating to Checks from the Explore section.
Right Click Anomalies
Right-click on a specific anomaly, whether All, Active, Acknowledged, or Archived, within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected anomaly in a new browser tab, where you can view its full details. |
2 | Copy Link | Copy the unique URL of the selected anomaly to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected anomaly. |
4 | Copy Name | Copy the name of the selected anomaly to your clipboard. |
Alternatively, you can access these right-click options by navigating to Anomalies from the Explore section.
Link Enrichment Datastore
An enrichment datastore is a database used to enhance your existing data by adding additional, relevant information. This helps you to provide more comprehensive insight into data and improve data accuracy.
You have the option to link an enrichment datastore to your existing source datastore. However, some datastores cannot be linked as enrichment datastores. For example, Oracle, Athena, Dremio, and Timescale cannot be used for this purpose.
Let's get started 🚀
Step 1: Select a source datastore from the side menu for which you would like to configure Link Enrichment.
Step 2: Click on the Settings icon from the top right window and select the Enrichment option from the dropdown menu.
A modal window- Link Enrichment Datastore will appear, providing you with two options to link an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Link New Enrichment
If the toggle Add new connection is turned on, then this will prompt you to link a new enrichment datastore from scratch without using existing connection details.
Note
Connection details can vary from datastore to datastore. For illustration, we have demonstrated linking BigQuery as a new enrichment datastore.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Name | Give a name for the enrichment datastore. |
3 | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4 | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Step 4: Click on the Save button.
Step 5: After clicking on the Save button, a modal window will appear with the message Your Datastore has been successfully updated.
Option II: Link Existing Connection
If the Use an existing enrichment datastore option is selected from the dropdown menu, you will be prompted to link the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: View and check the connection details of the enrichment and click on the Save button.
Step 4: After clicking on the Save button, a modal window will appear with the message Your Datastore has been successfully updated.
When an Enrichment Datastore is linked, you can see a green light showing that the connection between the Datastore and the Enrichment Datastore is stable, or a red light if it is unstable.
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id}
Assign Tag
Assigning tags to your Datastore helps you identify and categorize it easily. Tags serve as labels that categorize and identify various data sets, enhancing efficiency and organization. By highlighting checks and anomalies, tags make it easier to monitor data quality. They also allow you to list file patterns and assign quality scores, enabling quick identification and resolution of issues.
In this documentation, we will explore the steps to assign a tag to the datastore.
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to assign a tag.
Step 2: Click on Assign Tag to this Datastore located at the bottom-left corner of the interface.
Step 3: A drop-up menu will appear, providing you with a list of tags. Assign an appropriate tag to your datastore to simplify sorting, accessing, and managing data.
You can also create a new tag by clicking on the call-to-action (➕) button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Step 4: Once you have assigned a tag, the tag will be instantly labeled on your source Datastore, and all related records will be updated.
For demonstration, we have assigned the High tag to the Snowflake source datastore Covid-19 Data, so it will automatically be applied to all related tables and checks within the datastore.
Catalog Operation
A Catalog Operation imports named data collections like tables, views, and files into a Source Datastore. It identifies incremental fields for incremental scans, and offers options to recreate or delete these containers, streamlining data management and enhancing data discovery.
Let's get started 🚀
Key Components
Incremental Identifier
An incremental identifier is essential for supporting incremental scan operations, as it allows the system to detect changes since the last operation.
Partition Identifier
For large data containers or partitions, a partition identifier is necessary to process data efficiently. In DFS datastores, the default fields for both incremental and partition identifiers are set by the last-modified timestamp. If a partition identifier is missing, the system uses repeatable ordering candidates (order-by fields) to process containers, although this method is less efficient for handling large datasets with many rows.
Info
Attribute Overrides: After the profile operation, the Qualytics engine might automatically update the containers to have partition fields and incremental fields. Those "attributes" can be manually overridden.
Note
Advanced users can override these auto-detected selections, and the overridden options will persist through subsequent Catalog Operations.
Initialization & Operation Options
Automatic Catalog Operation
While adding the datastore, tick the Initiate Cataloging checkbox to automatically perform catalog operation on the configured source datastore.
With the automatic cataloging option turned on, you will be redirected to the datastore details page once the datastore (whether JDBC or DFS) is successfully added. You will observe the cataloging operation running automatically with the following default options:
- Prune: Disabled ❌
- Recreate: Disabled ❌
- Include: Tables and views ✔️
Manual Catalog Operation
If automatic cataloging is disabled while adding the datastore, users can initiate the catalog operation manually by selecting preferred options. Manual catalog operation offers the users the flexibility to set up custom catalog configurations like syncing only tables or views.
Step 1: Select a source datastore from the side menu on which you would like to perform the catalog operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Catalog to initiate the catalog operation.
A modal window will display Operation Triggered and you will be notified once the catalog operation is completed.
Note
You will receive a notification when the catalog operation is completed.
Step 3: Close the Success modal window and you will observe in the UI that the Catalog operation has been completed and it has gathered the data structures, file patterns, and corresponding metadata from your configured datastore.
Users might encounter a warning if the schema of the datastore is empty or if the user specified for the connection does not have the necessary permissions to read the objects. This ensures that proper access controls are in place and that the data structure is correctly defined.
Custom Catalog Configuration
The catalog operation can be custom-configured with the following options:
- Prune: Remove any existing named collections that no longer appear in the datastore
- Recreate: Restore any previously removed named collection that does currently appear in the database
- Include: Include Tables and Views
Step 1: Click on the Run button from the datastore details page (top-right corner) and select Catalog from the dropdown list.
Step 2: When configuring the catalog operation settings, you have two options to tune:
- Prune: This option allows the removal of any named collections (tables, views, files, etc.) that no longer exist in the datastore. This ensures that outdated or obsolete collections are not included in future operations, keeping the datastore clean and relevant.
- Recreate: This option enables the recreation of any named collections that have been previously deleted in Qualytics. It is useful for restoring collections that may have been removed accidentally or need to be brought back for analysis.
Step 3: The user can choose whether to include only tables, only views, or both in the catalog operation. This flexibility allows for more targeted metadata analysis based on the specific needs of the data management task.
Run Instantly
Click on the “Run Now” button to perform the catalog operation immediately.
Schedule
Step 1: Click on the “Schedule” button to configure the available schedule options in the catalog operation.
Step 2: Set the scheduling preferences for the catalog operation.
1. Hourly: This option allows you to schedule the catalog operation to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the cataloging should start. Example: If set to "Every 1 hour(s) on minute 0," the catalog operation will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the catalog operation to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to "Every 1 day(s) at 00:00 UTC," the scan will run every day at midnight UTC.
3. Weekly: This option schedules the catalog operation to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the catalog operation to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the catalog operation to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the catalog operation will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for catalog operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * * schedules the catalog operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the "Custom Cron Schedule (UTC)" field before specifying the schedule name. This will allow for precise control over the timing of the catalog operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the “Schedule Name” to identify the scheduled operation at the running time.
Step 4: Click on the “Schedule” button to activate your catalog operation schedule.
Once the catalog operation is triggered, your view will be automatically switched to the Activity tab, allowing you to explore post-operation details on your ongoing/completed catalog operation.
Operations Insights
When the catalog operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Catalog Operation to learn about the operation results.
Top Panel
1. Runs (Default View): Provides insights into the operations that have been performed.
2. Search: Search any operation (including catalog) by entering the operation ID
3. Sort by: Organize the list of operations based on the Created Date or the Duration.
4. Filter: Narrow down the list of operations based on:
- Operation Type
- Operation Status
- Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations
Operation Detail
Running
This status indicates that the catalog operation is still running at the moment and is yet to be completed. A catalog operation having a running status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Abort | Click on the Abort button to stop the catalog operation |
Aborted
This status indicates that the catalog operation was manually stopped before it could be completed. A catalog operation having an aborted status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Resume | Click on the Resume button to continue a previously aborted catalog operation from where it left off |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Warning
This status signals that the catalog operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A catalog operation having a warning status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Catalog Operation |
Success
This status confirms that the catalog operation was completed successfully without any issues. A catalog operation having a success status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Post-Operation Details
For JDBC Source Datastores
After the catalog operation is completed on a JDBC source datastore, users can view the following information:
- Container Names: These are the names of the data collections (e.g., tables, views) identified during the catalog operation.
- Fields for Each Container: Each container will display its fields or columns, which were detected during the catalog operation.
- Incremental Identifiers and Partition Fields: These settings are automatically configured based on the catalog operation. Incremental identifiers help in recognizing changes since the last scan, and partition fields aid in efficient data processing.
Tree view > Container node > Gear icon > Settings option
For DFS Source Datastores
After the catalog operation is completed on a DFS source datastore, users can view the following information:
- Container Names: Similar to JDBC, these are the data collections identified during the catalog operation.
- Fields for Each Container: Each container will list its fields or metadata detected during the catalog operation.
- Directory Tree Traversal: The catalog operation traverses the directory tree, treating each file with a supported extension as a single-partition container. It reveals metadata such as the relative path, filename, and extension.
- Incremental Identifier and Partition Field: By default, both the incremental identifier and partition field are set to the last-modified timestamp. This ensures efficient incremental scans and data partitioning.
- "Globbed" Containers: Files in the same folder with the same extensions and similar naming formats are grouped into a single container, where each file is treated as a partition. This helps in managing and querying large datasets effectively.
API Payload Examples
This section provides API payload examples for initiating and checking the running status of a catalog operation. Replace the placeholder values with data specific to your setup.
Running a Catalog operation
To run a catalog operation, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Post): /api/operations/run (post)
{
"type": "catalog",
"datastore_id": "datastore-id",
"prune": false,
"recreate": false,
"include": [
"table",
"view"
]
}
Retrieving Catalog Operation Status
To retrieve the catalog operation status, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Get): /api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "catalog",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 54321,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "_data_prefix",
"favorite": false
},
"schedule": null,
"include": [
"table",
"view"
],
"prune": false,
"recreate": false
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Profile Operation
The Profile Operation is a comprehensive analysis conducted on every record within all available containers in a datastore. This process is aimed at understanding and improving data quality by generating metadata for each field within the collections of data (like tables or files).
By gathering detailed statistical data and interacting with the Qualytics Inference Engine, the operation not only identifies and evaluates data quality but also suggests and refines checks to ensure ongoing data integrity. Executing Profile Operations periodically helps maintain up-to-date and accurate data quality checks based on the latest data.
This guide explains how to configure the profile operation with available functionalities such as tables, tags, and schedule options.
Let's get started 🚀
How Profiling Works
Fields Identification
The initial step involves recognizing and identifying all the fields within each data container. This step is crucial as it lays the foundation for subsequent analysis and profiling.
Statistical Data Gathering
After identifying the fields, the Profile Operation collects statistical data for each field based on its declared or inferred data type. This data includes essential metrics such as minimum and maximum values, mean, standard deviation, and other relevant statistics. These metrics provide valuable insights into the characteristics and distribution of the data, helping to understand its quality and consistency.
Metadata Generation
The gathered statistical data is then submitted to the Qualytics Inference Engine. The engine utilizes this data to generate metadata that forms the basis for creating appropriate data quality checks. This metadata is essential for setting up robust quality control mechanisms within the data management system.
Data Quality Checks
The inferred data quality checks are rigorously tested against the actual source data. This testing phase is critical to fine-tuning the checks to the desired sensitivity levels, ensuring they are neither too strict (causing false positives) nor too lenient (missing errors). By calibrating these checks accurately, the system can maintain high data integrity and reliability.
Navigation to Profile Operation
Step 1: Select a source datastore from the side menu on which you would like to perform the profile operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Profile to initiate the profile operation.
Configuration
Step 1: Click on the Run button to initiate the profile operation.
Note
You can run a Profile Operation anytime to automatically update the inferred data quality checks based on new data in the Datastore. It is recommended to schedule profile operations periodically to keep inferred rules up to date. More details are discussed in the Schedule section below.
Step 2: Select tables (in your JDBC datastore) or file patterns (in your DFS datastore) and tags you would like to be profiled.
1. All Tables/File Patterns
This option includes all tables or files currently available in the datastore for profiling. Selecting this will profile every table within the source datastore without the need for further selection.
2. Specific
This option allows users to manually select individual tables or files for profiling. It provides the flexibility to focus on particular tables of interest, which can be useful if the user is only interested in a subset of the available data.
3. Tag
This option automatically profiles tables associated with selected tags. Tags are used to categorize tables, and by selecting a specific tag, all tables associated with that tag will be profiled. This option helps in managing and profiling grouped data efficiently.
Step 3: After making the relevant selections, click on the Next button to configure the Operation Settings.
Step 4: Configure the following two Read Settings:
- Starting Threshold
- Record Limit
Starting Threshold
This setting allows users to specify a minimum incremental identifier value to set a starting point for the profile operation. It helps in filtering data from a specific point in time or a particular batch value.
- Greater Than Time: Users can select a timestamp in UTC to start profiling data from a specific time onwards. This is useful for focusing on recent data or data changes since a particular time.
- Greater Than Batch: Users can enter a batch value to start profiling from a specific batch. This option is helpful for scenarios where data is processed in batches, allowing the user to profile data from a specific batch number onwards.
Note
The starting threshold options, i.e., Greater Than Time and Greater Than Batch, are applicable only to tables or files with an incremental timestamp strategy.
Record Limit
Define the maximum number of records to be profiled: This slider allows users to set a limit on the number of records to be profiled per table. The range can be adjusted from 1 million to all available records. This setting helps in controlling the scope of the profiling operation, particularly for large datasets, by capping the number of records to analyze.
Step 5: After making the relevant selections, click on the Next button to configure the Inference Settings.
Step 6: Configure the following two Inference Settings:
- Inference Threshold
- Inference State
Inference Threshold
The Inference Threshold allows you to customize the data quality checks that are automatically created and updated when your data is analyzed. This means you can adjust the data quality checks based on how complex the data rules are, giving you more control over how your data is checked and monitored.
Default Configuration
By default, the Inference Threshold is set to 2, which provides a comprehensive range of checks designed to ensure data integrity across different scenarios. Users have the flexibility to adjust this threshold based on their specific needs, allowing for either basic or advanced checks as required.
Levels of Check Inference
The Inference Threshold ranges from 0 to 5, with each level including progressively more complex and comprehensive checks. Below is an explanation of each level:
Note
Each level includes all the checks from the previous levels and adds new checks specific to that level. For example, at Level 1, there are five basic checks. At Level 2, you get those five checks plus additional ones for Level 2. By the time you reach Level 5, it covers all the checks from Levels 1 to 4 and adds its own new checks for complete review.
Level 0: No Inference
At this level, no checks are automatically inferred. This is suitable when users want complete control over which checks are applied, or if no checks are needed. Ideal for scenarios where profiling should not infer any constraints, and all checks will be manually defined.
Level 1: Basic Data Integrity and Simple Value Threshold Checks
This level includes fundamental rules for basic data integrity and simple validations. It ensures that basic constraints like completeness, non-negative numbers, and valid date ranges are applied. Included Checks:
- Completeness Checks: Ensure data fields are complete if previously complete.
- Categorical Range Checks: Validate if values fall within a predefined set of categories.
- Non-Negative Numbers: Ensure numeric values are non-negative.
- Non-Future Date/Time: Ensure datetime values are not set in the future.
Use Case: Suitable for datasets where basic integrity checks are sufficient.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 1, five checks are created.
Inferred Checks | Reference |
---|---|
Not Null (record) | See more. |
Any Not Null (record) | See more. |
Expected Values (record) | See more. |
Not Negative | See more. |
Not Future | See more. |
Level 2: Value Range and Pattern Checks
Builds upon Level 1 by adding more specific checks related to value ranges and patterns. This level is more detailed and begins to enforce rules related to the nature of the data itself. Included Checks:
- Date Range Checks: Ensure dates fall within a specified range.
- Numeric Range Checks: Validate that numeric values are within acceptable ranges.
- String Pattern Checks: Ensure strings match specific patterns (e.g., email formats).
- Approximate Uniqueness: Validate uniqueness of values if they are approximately unique.
Use Case: Ideal for datasets where patterns and ranges are important for ensuring data quality.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 2, four checks are created.
Checks | Reference |
---|---|
Between Times | See more. |
Between | See more. |
Matches Pattern | See more. |
Unique | See more. |
Level 3: Time Series and Comparative Relationship Checks
This level includes all checks from Level 2 and adds sophisticated checks for time series and comparative relationships between datasets. Included Checks:
- Date Granularity Checks: Ensure the granularity of date values is consistent (e.g., day, month, year).
- Consistent Relationships: Validate that relationships between overlapping datasets are consistent.
Use Case: Suitable for scenarios where data quality depends on time-series data or when comparing data across different datasets.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 3, eight checks are created.
Inferred checks | Reference |
---|---|
Time Distribution Size | See more. |
After Date Time | See more. |
Before Date Time | See more. |
Greater Than | See more. |
Greater Than a Field | See more. |
Less Than | See more. |
Less Than a Field | See more. |
Equal To Field | See more. |
Level 4: Linear Regression and Cross-Datastore Relationship Checks
This level includes all checks from Level 3 and adds even more advanced checks, including linear regression analysis and validation of relationships across different data stores. Included Checks:
- Linear Regression Checks: Validate data using regression models to identify trends and outliers.
- Cross-Datastore Relationships: Ensure that data relationships are maintained across different data sources.
Use Case: Best for complex datasets where advanced analytical checks are necessary.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 4, four checks are created.
Inferred Checks | Reference |
---|---|
Predicted By | See more. |
Exists In | See more. |
Not Exists In | See more. |
Is Replica Of | See more. |
Level 5: Shape Checks
The most comprehensive level, includes all previous checks plus checks that validate the shape of certain distribution patterns that can be identified in your data. Included Checks:
- Shape Checks: Checks that define an expectation for some percentage of your data less than 100%. The property “coverage” holds the percentage of your data for which the expressed check should be true.
Use Case: Ideal for scenarios where each incremental set of scanned data should exhibit the same distributions of values as the training set. For example, a transactions table is configured for a weekly incremental scan after each week’s data is loaded. A shape check could define that 80% of all transactions are expected to be performed using “cash” or “credit”.
This table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 5, three checks are created.
Inferred Checks | Reference |
---|---|
Expected Values (Shape) | See more. |
Matches Pattern (Shape) | See more. |
Not Null (Shape) | See more. |
Warning
If the checks inferred during a profile operation do not detect any anomalies, and the check inference level decreases in the next profile operation, the checks that did not generate anomalies will be archived or discarded. However, if the checks detect any anomalies, they will be retained to continue monitoring the data and addressing potential issues.
Inference State
Check the box labeled "Infer As Draft" to ensure that all inferred checks will be generated in a draft state. This allows for greater flexibility as you can review and refine these checks before they are finalized.
Run Instantly
Click on the Run Now button to perform the profile operation immediately.
Schedule
Step 1: Click on the Schedule button to configure the available schedule options in the profile operation.
Step 2: Set the scheduling preferences for the profile operation.
1. Hourly: This option allows you to schedule the profile operation to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the profiling should start. Example: If set to "Every 1 hour(s) on minute 0," the profile operation will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the profile operation to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to "Every 1 day(s) at 00:00 UTC," the scan will run every day at midnight UTC.
3. Weekly: This option schedules the profile operation to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the profile operation to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the profile operation to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the profile operation will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for profile operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * *
schedules the profile operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the Custom Cron Schedule (UTC) field before specifying the schedule name. This will allow for precise control over the timing of the profile operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the Schedule Name to identify the scheduled operation at the running time.
Step 4: Click on the Schedule button to activate your profile operation schedule.
Note
You will receive a notification when the profile operation is completed.
Operation Insights
When the profile operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Profile Operation to learn about the operation results.
Top Panel
- Runs (Default View): Provides insights into the operations that have been performed.
- Schedule: Provides insights into the scheduled operations.
- Search: Search any operation (including profile) by entering the operation ID.
- Sort by: Organize the list of operations based on the Created Date or the Duration.
- Filter: Narrow down the list of operations based on:
  - Operation Type
  - Operation Status
  - Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations
Operation Detail
Running
This status indicates that the profile operation is still running at the moment and is yet to be completed. A profile operation having a running status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Abort | The "Abort" button enables you to stop the ongoing profile operation |
12. | Summary | The "Summary" section provides a real-time overview of the profile operation's progress, including key metrics. |
Aborted
This status indicates that the profile operation was manually stopped before it could be completed. A profile operation having an aborted status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Aborted By | The author who aborted the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Resume | Provides an option to continue the profile operation from where it left off |
12. | Rerun | Allows you to start a new profile operation using the same settings as the aborted operation |
13. | Delete | Removes the record of the aborted profile operation from the system, permanently deleting results |
14. | Summary | The "Summary" section provides an overview of the profile operation's progress up to the point it was aborted, including key metrics. |
Warning
This status signals that the profile operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A profile operation having a completed with warning status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Result | Provides immediate insights into the profile operation conducted |
11. | Rerun | Allows you to start a new profile operation using the same settings as the operation that completed with warnings |
12. | Delete | Removes the record of the profile operation, permanently deleting all results |
13. | Summary | The "Summary" section provides an overview of the profile operation's progress, including key metrics. |
14. | Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Profile Operation. |
Success
This status confirms that the profile operation was completed successfully without any issues. A profile operation having a success status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, allowing adjustments based on data complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Rerun | Allows you to start a new profile operation using the same settings as the completed operation |
12. | Delete | Removes the record of the profile operation from the system, permanently deleting all results; this action cannot be undone |
13. | Summary | The "Summary" section provides an overview of the profile operation's progress, including key metrics. |
Full View of Metrics in Operation Summary
Users can now hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Profiled field to display the full value.
Post Operation Details
Step 1: Click on any of the successful Profile Operations from the list and hit the Results button.
Step 2: The Profile Results modal displays a list of both profiled and non-profiled containers. You can filter the view to show only non-profiled containers by toggling on the button, which will display the complete list of unprofiled containers.
The Profile Results modal also provides two analysis options for you:
- Details for a Specific Container (Container's Profile)
- Details for a Specific Field of a Container (Field Profile)
Unwrap any of the containers from the Profile Results modal and click on the arrow icon.
Details for a Specific Container (Container's Profile)
Based on your selection of container from the profile operation results, you will be automatically redirected to the container details on the source datastore details page.
The following details (metrics) will be visible for analyzing the specific container you selected:
- Quality Score (79): Represents an overall quality assessment of the container on a scale of 0 to 100. A score of 79 suggests that the data quality is relatively good but may need further improvement.
- Sampling (100%): Indicates that 100% of the data in this container was sampled for analysis, meaning the entire dataset was reviewed.
- Completeness (100%): Indicates that all entries are complete, with no missing or null values, signifying data integrity.
- Active Checks (2): Shows that 2 data quality checks are actively running on this container. These checks typically monitor aspects such as format, uniqueness, or consistency.
- Active Anomalies (0): Indicates that no active anomalies or issues have been detected, meaning no irregularities were found during the checks.
Details for a Specific Field of a Container (Field Profile)
Unwrap the container to view the underlying fields. The following details (metrics) will be visible for analyzing a specific field of the container:
No | Profile | Description |
---|---|---|
1 | Type Inferred | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the 'tailedness' of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
- Histogram
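As a quick, invented illustration of these metrics (not platform output): for a numeric field containing the values 1 through 9, Distinct Values = 9, Min = 1, Max = 9, Mean = 5, Median = 5, Q1 = 3 (the central point between the minimum and the median), Q3 = 7 (the central point between the median and the maximum), Sum = 45, and Skewness = 0 because the distribution is symmetric.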
API Payload Examples
This section provides payload examples for initiating and checking the running status of a profile operation. Replace the placeholder values with data specific to your setup.
Running a Profile operation
To run a profile operation, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Post): /api/operations/run (post)
Option I: Running a profile operation of all containers
-
container_names: [ ]: This setting indicates that profiling will encompass all containers.
-
max_records_analyzed_per_partition: null: This setting implies that all records within all containers will be profiled.
-
inference_threshold: 5: This setting indicates that the engine will automatically infer quality checks of level 5 for you.
{
"type":"profile",
"datastore_id": datastore-id,
"container_names":[],
"max_records_analyzed_per_partition":null,
"inference_threshold":5
}
Option II: Running a profile operation of specific containers
-
container_names: ["table_name_1", "table_name_2"]: This setting indicates that profiling will only cover the tables named table_name_1 and table_name_2.
-
max_records_analyzed_per_partition: 1000000: This setting means that up to 1 million rows per container will be profiled.
-
inference_threshold: 0: This setting indicates that the engine will not automatically infer quality checks for you.
{
"type":"profile",
"datastore_id":datastore-id,
"container_names":[
"table_name_1",
"table_name_2"
],
"max_records_analyzed_per_partition":1000000,
"inference_threshold":0
}
Scheduling a Profile operation
Below is a sample payload for scheduling a profile operation. Please substitute the placeholder values with the appropriate data relevant to your setup.
Endpoint (Post): /api/operations/schedule (post)
INFO: This payload is to run a scheduled profile operation every day at 00:00
Scheduling profile operation of all containers
{
"type":"profile",
"name":"My scheduled Profile operation",
"datastore_id":"datastore-id",
"container_names":[],
"max_records_analyzed_per_partition":null,
"infer_constraints":5,
"crontab":"00 00 /1 *"
}
Retrieving Profile Operation Status
To retrieve the profile operation status, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Get): /api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "profile",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Sample-Store",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"inference_threshold": 5,
"max_records_analyzed_per_partition": -1,
"max_count_testing_sample": 100000,
"histogram_max_distinct_values": 100,
"greater_than_time": null,
"greater_than_batch": null,
"percent_testing_threshold": 0.4,
"high_correlation_threshold": 0.5,
"status": {
"total_containers": 2,
"containers_analyzed": 2,
"partitions_analyzed": 2,
"records_processed": 1126,
"fields_profiled": 9,
"checks_synchronized": 26
},
"containers": [
{
"id": 123,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
{
"id": 456,
"name": "Container2",
"container_type": "table",
"table_type": "table"
}
],
"container_profiles": [
{
"id": 789,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"parent_profile_id": null,
"container": {
"id": 456,
"name": "Container2",
"container_type": "table",
"table_type": "table"
},
"records_count": 550,
"records_processed": 550,
"checks_synchronized": 11,
"field_profiles_count": 4,
"result": "success",
"message": null
},
{
"id": 790,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"parent_profile_id": null,
"container": {
"id": 123,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"records_count": 576,
"records_processed": 576,
"checks_synchronized": 15,
"field_profiles_count": 5,
"result": "success",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Scan Operation
The Scan Operation in Qualytics is performed on a datastore to enforce data quality checks for various data collections such as tables, views, and files. This operation has several key functions:
-
Record Anomalies: Identifies a single record (row) as anomalous and provides specific details regarding why it is considered anomalous. The simplest form of a record anomaly is a row that lacks an expected value for a field.
-
Shape Anomalies: Identifies structural issues within a dataset at the column or schema level. It highlights broader patterns or distributions that deviate from expected norms. If a dataset is expected to have certain fields and one or more fields are missing or contain inconsistent patterns, this would be flagged as a shape anomaly.
-
Anomaly Data Recording: All identified anomalies, along with related analytical data, are recorded in the associated Enrichment Datastore for further examination.
Additionally, the Scan Operation offers flexible options, including the ability to:
- Perform checks on incremental loads versus full loads.
- Limit the number of records scanned.
- Run scans on a selected list of tables or files.
- Schedule scans for future execution.
Let's get started! 🚀
Navigation to Scan Operation
Step 1: Select a source datastore from the side menu on which you would like to perform the scan operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Scan to initiate the scan operation.
Note
A scan operation can be commenced once the catalog and profile operations are completed.
Configuration
Step 1: Click on the Run button to initiate the scan operation.
Step 2: Select tables (in your JDBC datastore) or file patterns (in your DFS datastore) and tags you would like to be scanned.
1. All Tables/File Patterns
This option includes all tables or file patterns currently available for scanning in the datastore. It means that every table or file pattern recognized in your datastore will be subjected to the defined data quality checks. Use this when you want to perform a comprehensive scan covering all the available data without any exclusions.
2. Specific Tables/File Patterns
This option allows you to manually select the individual table(s) or file pattern(s) in your datastore to scan. Upon selecting this option, all the tables or file patterns associated with your datastore will be automatically populated allowing you to select the datasets you want to scan.
You can also search for the tables/file patterns you want to scan directly using the search bar. Use this option when you need to target particular datasets or when you want to exclude certain files from the scan for focused analysis or testing purposes.
3. Tag
This option enables you to automatically scan file patterns associated with the selected tags. Tags can be predefined or created to categorize and manage file patterns effectively.
Step 3: Click on the Next button to Configure Read Settings.
Step 4: Configure Read Settings, Starting Threshold (Optional), and the Record Limit.
1. Select the Read Strategy for your scan operation.
-
Incremental: This strategy is used to scan only the new or updated records since the last scan operation. On the initial run, a full scan is conducted unless a specific starting threshold is set. For subsequent scans, only the records that have changed since the last scan are processed. If tables or views do not have a defined incremental key, a full scan will be performed. Ideal for regular scans where only changes need to be tracked, saving time and computational resources.
-
Full: This strategy performs a comprehensive scan of all records within the specified data collections, regardless of any previous changes or scans. Every scan operation will include all records, ensuring a complete check each time. Suitable for periodic comprehensive checks or when incremental scanning is not feasible due to the nature of the data.
Warning
If any selected tables do not have an incremental identifier, a full scan will be performed for those tables.
Info
When running an Incremental Scan for the first time, Qualytics automatically performs a full scan, saving the incremental field for subsequent runs.
-
This ensures that the system establishes a baseline and captures all relevant data.
-
Once the initial full scan is completed, the system intelligently uses the saved incremental field to execute future Incremental Scans efficiently, focusing only on the new or updated data since the last scan.
-
This approach optimizes the scanning process while maintaining data quality and consistency.
2. Define the Starting Threshold (Optional), i.e., specify a minimum incremental identifier value to set a starting point for the scan.
-
Greater Than Time: This option applies only to tables with an incremental timestamp strategy. Users can specify a timestamp to scan records that were modified after this time.
-
Greater Than Batch: This option applies to tables with an incremental batch strategy. Users can set a batch value, ensuring that only records with a batch identifier greater than the specified value are scanned.
3. Define the Record Limit, i.e., the maximum number of records to be scanned per table after any initial filtering (see the payload sketch below for how these read settings map to the API).
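The sketch below shows how these read settings might appear in a run payload. The incremental and max_records_analyzed_per_partition fields appear in the scan payload documented later in this guide; greater_than_time and greater_than_batch are shown here with the names used in the operation status response, so treat their inclusion in a run request as an assumption and verify it against your API version.
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": ["table_name_1"],
"incremental": true,
"greater_than_time": "2023-01-01T00:00:00Z",
"max_records_analyzed_per_partition": 1000000
}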
Step 5: Click on the Next button to Configure the Scan Settings.
Step 6: Configure the Scan Settings.
1. Check Categories: Users can choose one or more check categories when initiating a scan. This allows for flexible selection based on the desired scope of the operation:
- Metadata: Includes checks that define the expected properties of the table, such as volume. These belong to the Volumetric rule type.
- Data Integrity: Includes checks that specify the expected values for the data stored in the table. These belong to all rule types except Volumetric.
2. Anomaly Options: Enable the option to automatically archive duplicate anomalies detected in previous scans that overlap with the current scan. This feature helps improve data management by minimizing redundancy and ensuring a more organized anomaly record.
- Archive Duplicate Anomalies: Automatically archive duplicate anomalies from previous scans that overlap with the current scan to enhance data management efficiency.
Step 7: Click on the Next button to Configure the Enrichment Settings.
Step 8: Configure the Enrichment Settings.
1. Remediation Strategy: This strategy dictates how your source tables are replicated in your enrichment datastore:
- None: This option does not replicate source tables. It only writes anomalies and associated source data to the enrichment datastore. This is useful when the primary goal is to track anomalies without duplicating the entire dataset.
- Append: This option replicates source tables using an append-first strategy. It adds new records to the enrichment datastore, maintaining a history of all data changes over time. This approach is beneficial for auditing and historical analysis.
- Overwrite: This option replicates source tables using an overwrite strategy, replacing existing data in the enrichment datastore with the latest data from the source. This method ensures the enrichment datastore always contains the most current data, which is useful for real-time analysis and reporting.
2. Source Record Limit: Sets a maximum limit on the number of records written to the enrichment datastore for each detected anomaly. This helps manage storage and processing requirements effectively.
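In the API payload examples later in this guide, the remediation strategy is passed as the remediation field with the values none, append, or overwrite, and the source record limit surfaces in the operation status response as enrichment_source_record_limit; whether the latter can be set directly on a run request is an assumption to verify against your API version. A minimal sketch with placeholder values:
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [],
"remediation": "append",
"enrichment_source_record_limit": 10
}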
Run Instantly
Click on the Run Now button to perform the scan operation immediately.
Schedule
Step 1: Click on the Schedule button to configure the available schedule options for your scan operation.
Step 2: Set the scheduling preferences for the scan operation.
1. Hourly: This option allows you to schedule the scan to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the scan should start. Example: If set to Every 1 hour(s) on minute 0, the scan will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the scan to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to Every 1 day(s) at 00:00 UTC, the scan will run every day at midnight UTC.
3. Weekly: This option schedules the scan to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the scan to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the scan to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the scan will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for scan operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * *
schedules the scan operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the "Custom Cron Schedule (UTC)" field before specifying the schedule name. This will allow for precise control over the timing of the scan operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the Schedule Name to identify the scheduled operation at run time.
Step 4: Click on the Schedule button to schedule your scan operation.
Note
You will receive a notification when the scan operation is completed.
Advanced Options
The advanced use cases described below require options that are not yet exposed in our user interface but are possible through interaction with the Qualytics API.
Runtime Variable Assignment
It is possible to reference a variable in a check definition (declared in double curly braces) and then assign that variable a value when a Scan operation is initiated. Variables are supported within any Spark SQL expression and are most commonly used in a check filter.
If a Scan is meant to assert a check with a variable, a value for that variable must be supplied as part of the Scan operation's check_variables
property.
For example, a check might include a filter such as transaction_date == {{ checked_date }}, which will be asserted against any records where transaction_date is equal to the value supplied when the Scan operation is initiated. In this case, that value would be assigned by passing the following payload when calling /api/operations/run:
{
"type": "scan",
"datastore_id": 42,
"container_names": ["my_container"],
"incremental": true,
"remediation": "none",
"max_records_analyzed_per_partition": 0,
"check_variables": {
"checked_date": "2023-10-15"
},
"high_count_rollup_threshold": 10
}
Operations Insights
When the scan operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Scan Operation to learn about the scan results.
Top Panel
1. Runs (Default View): Provides insights into the operations that have been performed
2. Schedule: Provides insights into the scheduled operations.
3. Search: Search any operation (including scan) by entering the operation ID
4. Sort by: Organize the list of operations based on the Created Date or the Duration.
5. Filter: Narrow down the list of operations based on:
- Operation Type
- Operation Status
- Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations.
Operation Detail
Running
This status indicates that the scan operation is still running at the moment and is yet to be completed. A scan operation having a running status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started. |
3 | Progress Bar | The progress of the operation. |
4 | Triggered By | The author who triggered the operation. |
5 | Schedule | Indicates whether the operation was scheduled or not. |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation. |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation. |
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the running operation. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering. |
10 | Check Categories | Indicates which categories should be included in the scan (e.g., Metadata, Data Integrity). |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation. |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly. |
13 | Results | View the details of the ongoing scan operation. This includes information on which tables are currently being scanned, the anomalies identified so far (if any), and other related data collected during the active scan. |
14 | Abort | The Abort button enables you to stop the ongoing scan operation. |
15 | Summary | The summary section provides an overview of the scan operation in progress, including key metrics. |
Aborted
This status indicates that the scan operation was manually stopped before it could be completed. A scan operation having an aborted status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Aborted By | The author who aborted the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected before the operation was aborted |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Check Categories | Indicates which categories should be included in the scan (Metadata, Data Integrity) |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
13 | Results | View the details of the scan operation that was aborted, including tables scanned and anomalies identified |
14 | Resume | Provides an option to continue the scan operation from where it left off |
15 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the aborted scan |
16 | Delete | Removes the record of the aborted scan operation from the system, permanently deleting scan results and anomalies |
17 | Summary | The summary section provides an overview of the scan operation up to the point it was aborted, including key metrics. |
Warning
This status signals that the scan operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A scan operation having a completed with warning status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Triggered By | The author who triggered the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected before the operation completed with warnings. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Check Categories | Indicates which categories should be included in the scan (Metadata, Data Integrity) |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
13 | Result | View the details of the scan operation that was completed with warning, including tables scanned and anomalies identified |
14 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the warning scan |
15 | Delete | Removes the record of the warning operation from the system, permanently deleting scan results and anomalies |
16 | Summary | The summary section provides an overview of the scan operation, highlighting any warnings encountered, including key metrics. |
17 | Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Scan Operation. |
Success
This status confirms that the scan operation was completed successfully without any issues. A scan operation having a success status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Triggered By | The author who triggered the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the successful completion of the operation. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
11 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
12 | Results | View the details of the completed scan operation. This includes information on which tables were scanned, the anomalies identified (if any), and other relevant data collected throughout the successful completion of the scan. |
13 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the success scan |
14 | Delete | Removes the record of the scan operation from the system, permanently deleting scan results and anomalies |
15 | Summary | The summary section provides an overview of the scan operation upon successful completion, including key metrics. |
Full View of Metrics in Operation Summary
Users can now hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Scanned field to display the full value.
Post Operation Details
Step 1: Click on any of the successful Scan Operations from the list and hit the Results button.
Step 2: The Scan Results modal demonstrates the highlighted anomalies (if any) identified in your datastore with the following properties:
Ref. | Scan Properties | Description |
---|---|---|
1. | Table/File | The table or file where the anomaly is found. |
2. | Field | The field(s) where the anomaly is present. |
3. | Location | Fully qualified location of the anomaly. |
4. | Rule | Inferred and authored checks that failed assertions. |
5. | Description | Human-readable, auto-generated description of the anomaly. |
6. | Status | The status of the anomaly: Active, Acknowledged, Resolved, or Invalid. |
7. | Type | The type of anomaly (e.g., Record or Shape) |
8. | Date time | The date and time when the anomaly was found. |
API Payload Examples
This section provides payload examples for running, scheduling, and checking the status of scan operations. Replace the placeholder values with data specific to your setup.
Running a Scan operation
To run a scan operation, use the API payload example below and replace the placeholder values with your specific values.
Endpoint (Post):
/api/operations/run (post)
Option I: Running a scan operation of all containers
- container_names: []: This setting indicates that the scan will cover all containers.
- max_records_analyzed_per_partition: null: This setting indicates that all records of all containers will be scanned.
- remediation: append: This setting replicates source containers using an append-first strategy.
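The payload below is a reconstruction of these settings, modeled on the scan payload structure shown earlier in the Runtime Variable Assignment example; verify the exact field names against your deployment.
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [],
"max_records_analyzed_per_partition": null,
"remediation": "append"
}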
Option II: Running a scan operation of specific containers
- container_names: ["table_name_1", "table_name_2"]: This setting indicates that the scan will only cover the tables table_name_1 and table_name_2.
- max_records_analyzed_per_partition: 1000000: This setting indicates that a maximum of 1 million records per partition will be scanned.
- remediation: overwrite: This setting replicates source containers using an overwrite strategy.
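A corresponding reconstruction for scanning specific containers, again modeled on the scan payload shown in the Runtime Variable Assignment example:
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [
"table_name_1",
"table_name_2"
],
"max_records_analyzed_per_partition": 1000000,
"remediation": "overwrite"
}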
Scheduling scan operation of all containers
To schedule a scan operation, use the API payload example below and replace the placeholder values with your specific values.
Endpoint (Post):
/api/operations/schedule (post)
This payload is to run a scheduled scan operation every day at 00:00
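The payload below is a sketch of such a schedule, modeled on the profile scheduling payload shown earlier in this guide (name, datastore_id, container_names, crontab) with the type set to scan; verify the remaining fields against your API version.
{
"type": "scan",
"name": "My scheduled Scan operation",
"datastore_id": "datastore-id",
"container_names": [],
"max_records_analyzed_per_partition": null,
"remediation": "none",
"crontab": "00 00 */1 * *"
}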
Retrieving Scan Operation Information
Endpoint (Get)
/api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "scan",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"incremental": false,
"remediation": "none",
"max_records_analyzed_per_partition": -1,
"greater_than_time": null,
"greater_than_batch": null,
"high_count_rollup_threshold": 10,
"enrichment_source_record_limit": 10,
"status": {
"total_containers": 2,
"containers_analyzed": 2,
"partitions_scanned": 2,
"records_processed": 28,
"anomalies_identified": 2
},
"containers": [
{
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
{
"id": 235,
"name": "Container2",
"container_type": "table",
"table_type": "table"
}
],
"container_scans": [
{
"id": 456,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 235,
"name": "Container2",
"container_type": "table",
"table_type": "table"
},
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"records_processed": 8,
"anomaly_count": 1,
"result": "success",
"message": null
},
{
"id": 457,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"records_processed": 20,
"anomaly_count": 1,
"result": "success",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
External Scan Operation
An external scan is ideal for ad hoc scenarios, where you may receive a file intended to be replicated to a source datastore. Before loading, you can perform an external scan to ensure the file aligns with existing data standards. The schema of the file must match the target table or file pattern that has already been profiled within Qualytics, allowing you to reuse the quality checks to identify any issues before data integration.
Let’s get started 🚀
Navigation to External Scan Operation
Step 1: Select a source datastore from the side menu to perform the external scan operation.
Step 2: After selecting your preferred source datastore, you will be taken to the details page. From there, click on "Tables" and select the table you want to perform the external scan operation on.
Note
This example is based on a JDBC table, but the same steps apply to DFS as well. For DFS source datastores, you will need to click on "File Patterns" and select a File Pattern to run the external scan.
For demonstration purposes, we have selected the “CUSTOMER” table.
External Scan Configuration
Step 1: Click on the “Run” button and select the “External Scan” option.
Step 2: After selecting the "External Scan" option, a modal window will appear with an input for uploading your external file. After uploading the file, click the “Run” button to start the operation.
Note
An External Scan operation supports the following file formats: CSV, XLSX, and XLS.
Step 3: After clicking the "Run" button, the external scan operation will begin, and you will receive a confirmation message if the operation is successfully triggered.
Supported File Formats
External scan operation accepts CSV, XLSX, and XLS files. CSV is a simple text format, while XLSX and XLS are Excel formats that support more complex data structures. This versatility enables seamless integration of data from various sources.
An External Scan Operation can be configured with the following file formats:
File Extension | .csv | .xls | .xlsx |
---|---|---|---|
File Format | Comma-separated values | Microsoft Excel 97-2003 Workbook | Microsoft Excel 2007+ Workbook |
Header Row | Required for optimal reading. It should contain column names. | Recommended, but not strictly required. | Recommended, but not strictly required. |
Empty Cells | Represented as empty strings. | Allowed. | Allowed. |
Data Types | Typically inferred by Spark. | May require explicit specification for complex types. | May require explicit specification for complex types. |
Nested Data | Not directly supported. Consider flattening or using alternative file formats. | Not directly supported. Consider flattening or using alternative file formats. | Not directly supported. Consider flattening or using alternative file formats. |
Additional Considerations | Ensure consistent delimiter usage (usually commas). Avoid special characters or line breaks within fields. Enclose text fields containing commas or delimiters in double quotes. | Use a plain XLS format without macros or formatting. Consider converting to CSV for simpler handling. | Use a plain XLSX format without macros or formatting. Consider converting to CSV for simpler handling. |
Scenario
A company maintains a large sales database containing information about various transactions, customers, and products. They have received a new sales data file that will be integrated into the existing database. Before loading the data, the organization wants to ensure there are no issues with the file.
An External Scan is initiated to perform checks on the incoming file, validating that it aligns with the quality standards of the sales table.
Specific Checks:
Check | Description |
---|---|
Expected Schema | Verify that all columns have the same data type as the selected profile structure. |
Exists in | Verify that all transactions have valid customer and product references. |
Between Times | Ensure that transaction dates fall within an expected range. |
Satisfies Expression | Validate that the calculated revenue aligns with the unit price and quantity sold. The formula is: Revenue = Quantity × Unit Price. |
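For example, if a record reports a Quantity of 5 and a Unit_Price of 20.00, the Satisfies Expression check expects a revenue of 5 × 20.00 = 100.00; any stored revenue that differs from that product would be flagged.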
Potential Anomalies:
This overview highlights common issues such as data type mismatches, missing references, out-of-range dates, and inconsistent revenue calculations. Each anomaly affects data integrity and requires corrective action.
Anomaly | Description |
---|---|
Data type issue | The external resource does not follow the data type schema. |
Missing References | Transactions without valid customer or product references. |
Out-of-Range Dates | Transactions with dates outside the expected range. |
Inconsistent Revenue | Mismatch between calculated revenue and unit price times quantity. |
Benefits of External Scan:
Benefit | Description |
---|---|
Quality Assurance | Identify and rectify data inconsistencies before downstream processes. |
Data Integrity | Ensure that all records adhere to defined schema and constraints. |
Anomaly Detection | Uncover potential issues that might impact business analytics and reporting. |
CSV Table (Sales Data):
This dataset includes transaction records with details such as Transaction_ID, Customer_ID, Product_ID, Transaction_Date, Quantity, and Unit_Price. It provides essential information for tracking and analyzing sales activities.
Transaction_ID | Customer_ID | Product_ID | Transaction_Date | Quantity | Unit_Price |
---|---|---|---|---|---|
1 | 101 | 201 | 2023-01-15 | 5 | 20.00 |
2 | 102 | 202 | 2023-02-20 | 3 | 15.50 |
3 | 103 | 201 | 2023-03-10 | 2 | 25.00 |
4 | 104 | 203 | 2023-04-05 | 1 | 30.00 |
... | ... | ... | ... | ... | ... |
graph TB
subgraph Init
A[Start] --> B[Load Sales Data]
end
subgraph Checks
B --> C1[Expected schema]
B --> C2[Exists in]
B --> C3[Between times]
B --> C4[Satisfies expression]
C1 -->|Invalid| E1[Expected schema anomaly]
C2 -->|Invalid| E2[Exists in anomaly]
C3 -->|Invalid| E3[Between times anomaly]
C4 -->|Invalid| E4[Satisfies expression anomaly]
end
subgraph End
E1 --> J[Finish]
E2 --> J[Finish]
E3 --> J[Finish]
E4 --> J[Finish]
end
API Payload Examples
Running an External Scan operation
This section provides a sample payload for running an external scan operation. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post)
/api/containers/{container-id}/scan
(post)
Retrieving an External Scan Operation Status
Endpoint (Get)
/api/operations/{id}
(get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "external_scan",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": null,
"result": "running",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"incremental": false,
"remediation": "none",
"max_records_analyzed_per_partition": -1,
"greater_than_time": null,
"greater_than_batch": null,
"high_count_rollup_threshold": 10,
"enrichment_source_record_limit": 10,
"status": {
"total_containers": 1,
"containers_analyzed": 0,
"partitions_scanned": 0,
"records_processed": 0,
"anomalies_identified": 0
},
"containers": [
{
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
}
],
"container_scans": [
{
"id": 456,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"start_time": null,
"end_time": null,
"records_processed": 0,
"anomaly_count": 0,
"result": "running",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Datastore Settings
Qualytics allows you to manage your datastore efficiently by editing source datastore information, linking an enrichment datastore for enhanced insights, establishing new connections to expand data sources, choosing connectors to integrate diverse data, adjusting the quality score to ensure data accuracy, and deleting the store. This ensures flexibility and control over your data management processes within the platform.
Let's get started 🚀
Navigation to Settings
Step 1: Select a source datastore from the side menu for which you would like to manage the settings.
Step 2: Click on the Settings icon from the top right window. A drop-down menu will appear with the following options:
- Edit
- Enrichment
- Score
- Delete
Edit Datastore
The Edit Datastore setting allows users to modify the connection details of the datastore. This includes updating the host, port, SID, username, password, schema, and any associated teams.
Note
Connection details can vary based on the type of datastore being edited. For example, details for BigQuery will differ from Snowflake or Athena.
Step 1: Click on the Edit option
Step 2: After selecting the Edit option, a modal window will appear, displaying the connection details. This window allows you to modify any specific connection details.
Step 3: After editing the connection details, click on the Save button.
Link Enrichment Datastore
An enrichment datastore is a database used to enhance your existing data by adding additional, relevant information. This helps you to provide more comprehensive insight into data and improve data accuracy.
You have the option to link an enrichment datastore to your existing source datastore. However, some datastores cannot be linked as enrichment datastores. For example, Oracle, Athena, and Timescale cannot be used for this purpose.
Step 1: Click on the Enrichment from the dropdown list.
A modal window-Link Enrichment Datastore will appear, providing you with two options to link an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Link New Enrichment
If the toggle for Add new connection is turned on, you will be prompted to link a new enrichment datastore from scratch, without using existing connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Connection details can vary from datastore to datastore. For illustration, we have demonstrated linking BigQuery as a new enrichment datastore.
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Step 4: Click on the Save button.
Step 5: After clicking on the Save button, a modal window will appear confirming that your datastore has been successfully updated.
Option II: Link Existing Connection
If the Use an existing enrichment datastore option is selected from the dropdown menu, you will be prompted to link the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: View and check the connection details of the enrichment and click on the Save button.
Step 4: After clicking on the Save button, a modal window will appear confirming that your datastore has been successfully updated.
Quality Score Settings
Quality Scores are quantified measures of data quality calculated at the field and container levels, recorded as time series to enable tracking of changes over time. Scores range from 0 to 100, with higher values indicating superior quality. These scores integrate eight distinct factors, providing a granular analysis of the attributes that impact overall data quality.
Each field receives a total quality score based on eight key factors, each evaluated on a 0-100 scale. The overall score is a composite reflecting the relative importance and configured weights of these factors:
-
Completeness: Measures the average completeness of a field across all profiles.
-
Coverage: Assesses the adequacy of data quality checks for the field.
-
Conformity: Checks alignment with standards defined by quality checks.
-
Consistency: Ensures uniformity in type and scale across all data representations.
-
Precision: Evaluates the resolution of field values against defined quality checks.
-
Timeliness: Gauges data availability according to schedule inheriting the container's timeliness.
-
Volumetrics: Analyzes consistency in data size and shape over time inheriting the container's volumetrics.
-
Accuracy: Determines the fidelity of field values to their real-world counterparts.
The Quality Score Settings allow users to tailor the impact of each quality factor on the total score by adjusting their weights allowing the scoring system to align with your organization’s data governance priorities.
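For illustration only (the platform's exact scoring formula is not described here), a weighted composite behaves like a weighted average: if Completeness scores 90 with a weight of 2 and Conformity scores 70 with a weight of 1, their combined contribution is (90 × 2 + 70 × 1) / (2 + 1) ≈ 83.3, so increasing a factor's weight pulls the total score further toward that factor.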
Step 1: Click on the Score option in the settings icon.
Step 2: A modal window "Quality Score Settings" will appear.
Step 3: The Decay Period slider sets the time frame over which the system evaluates historical data to determine the quality score. The decay period for considering past data events defaults to 180 days but can be customized to fit your operational needs ensuring the scores reflect the most relevant data quality insights.
Step 4: Adjust the Factor Weights using the sliding bar. The factor weights determine the importance of different data quality aspects.
Step 5: Click on the Save button to save the quality score settings.
Delete Datastore
The Delete Datastore action permanently removes a datastore and all associated profiles, checks, and anomalies. This action cannot be undone and requires confirmation by typing the datastore name before proceeding.
Step 1: Click on the Delete option in the settings icon.
Step 2: A modal window Delete Datastore will appear.
Step 3: Enter the Name of the datastore in the given field (confirmation check) and then click on the I’M SURE, DELETE THIS DATASTORE button to delete the datastore.
Mark Datastore as Favorite
Marking a datastore as a favorite allows you to quickly access important data sources. This feature helps you prioritize and manage the datastores you frequently use, making data management more efficient.
Step 1: Click on the bookmark icon to mark the Datastores as a favorite.
After clicking on the bookmark icon, your datastore is marked as a favorite and a success message will appear stating "The datastore has been favorited".
Step 2: To unmark a datastore, simply click on the bookmark icon of the marked datastore. This will remove it from your favorites.
Ended: Source Datastores
Enrichment Datastores ↵
Enrichment Datastore Overview
An Enrichment Datastore is a user-managed storage location where the Qualytics platform records and accesses metadata through a set of system-defined tables. It is purpose-built to capture metadata generated by the platform's profiling and scanning operations.
Let’s get started 🚀
Key Points
-
Metadata Storage: The Enrichment Datastore acts as a dedicated mechanism for writing and retaining metadata that the platform generates. This includes information about anomalies, quality checks, field profiling, and additional details that enrich the source data.
-
Feature Enablement: By using the Enrichment Datastore, the platform unlocks certain features such as the previewing of source records. For instance, when an anomaly is detected, the platform typically previews a limited set of affected records. For a comprehensive view and persistent access, the Enrichment Datastore captures and maintains a complete snapshot of the source records associated with the anomalies.
-
User-Managed Location: While the Qualytics platform handles the generation and processing of metadata, the actual storage is user-managed. This means that the user maintains control over the Enrichment Datastore, deciding where and how this data is stored, adhering to their governance and compliance requirements.
-
Insight and Reporting: Beyond storing metadata, the Enrichment Datastore allows users to derive actionable insights and develop custom reports for a variety of use cases, from compliance tracking to data quality improvement initiatives.
Navigation
Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Table Types
The Enrichment Datastore contains several types of tables, each serving a specific purpose in the data enrichment and remediation process. These tables are categorized into:
- Enrichment Tables
- Remediation Tables
- Metadata Tables
Enrichment Tables
When anomalies are detected, the platform writes metadata into four primary enrichment tables:
- <enrichment_prefix>_check_metrics
- <enrichment_prefix>_failed_checks
- <enrichment_prefix>_source_records
- <enrichment_prefix>_scan_operations
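For example, if the enrichment prefix configured when linking the datastore were export (a placeholder value), these tables would be named export_check_metrics, export_failed_checks, export_source_records, and export_scan_operations.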
_CHECK_METRICS Table
Captures and logs detailed metrics for every data quality check performed within the Qualytics Platform, providing insights into asserted and anomalous records across datasets.
Columns
Name | Data Type | Description |
---|---|---|
OPERATION_ID | NUMBER | Unique Identifier for the check metric. |
CONTAINER_ID | NUMBER | Identifier for the container associated with the check metric. |
SOURCE_DATASTORE | STRING | Datastore where the source data resides. |
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
QUALITY_CHECK_ID | NUMBER | Unique identifier for the quality check performed. |
ASSERTED_RECORDS_COUNT | NUMBER | Count of records expected or asserted in the source. |
ANOMALOUS_RECORDS_COUNT | NUMBER | Count of records identified as anomalous. |
_QUALYTICS_SOURCE_PARTITION | STRING | Partition information specific to Qualytics metrics. |
_FAILED_CHECKS Table
Acts as an associative entity that consolidates information on failed checks, associating anomalies with their respective quality checks.
Columns
Name | Data Type | Description |
---|---|---|
QUALITY_CHECK_ID | NUMBER | Unique identifier for the quality check. |
ANOMALY_UUID | STRING | UUID for the anomaly detected. |
QUALITY_CHECK_MESSAGE | STRING | Message describing the quality check outcome. |
SUGGESTED_REMEDIATION_FIELD | STRING | Field suggesting remediation. |
SUGGESTED_REMEDIATION_VALUE | STRING | Suggested value for remediation. |
SUGGESTED_REMEDIATION_SCORE | FLOAT | Score indicating confidence in remediation. |
QUALITY_CHECK_RULE_TYPE | STRING | Type of rule applied for quality check. |
QUALITY_CHECK_TAGS | STRING | Tags associated with the quality check. |
QUALITY_CHECK_PARAMETERS | STRING | Parameters used for the quality check. |
QUALITY_CHECK_DESCRIPTION | STRING | Description of the quality check. |
OPERATION_ID | NUMBER | Identifier for the operation detecting anomaly. |
DETECTED_TIME | TIMESTAMP | Timestamp when the anomaly was detected. |
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
SOURCE_DATASTORE | STRING | Datastore where the source data resides. |
Info
This table is not characterized by unique ANOMALY_UUID or QUALITY_CHECK_ID values alone. Instead, the combination of ANOMALY_UUID and QUALITY_CHECK_ID serves as a composite key, uniquely identifying each record in the table.
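Because uniqueness comes from the (ANOMALY_UUID, QUALITY_CHECK_ID) pair, queries should aggregate with that composite key in mind. A minimal sketch, assuming a hypothetical enrichment prefix of _DATASTORE_PREFIX, that counts how many distinct quality checks each anomaly failed:
SELECT
    ANOMALY_UUID,
    COUNT(DISTINCT QUALITY_CHECK_ID) AS failed_check_count
FROM "_DATASTORE_PREFIX_FAILED_CHECKS"
GROUP BY ANOMALY_UUID
ORDER BY failed_check_count DESC;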
_SOURCE_RECORDS Table
Stores source records in JSON format, primarily to enable the preview source record feature in the Qualytics App.
Columns
Name | Data Type | Description |
---|---|---|
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
ANOMALY_UUID | STRING | UUID for the associated anomaly. |
CONTEXT | STRING | Contextual information for the anomaly. |
RECORD | STRING | JSON representation of the source record. |
_SCAN_OPERATIONS Table
Captures and stores the results of every scan operation conducted on the Qualytics Platform.
Columns
Name | Data Type | Description |
---|---|---|
OPERATION_ID | NUMBER | Unique identifier for the scan operation. |
DATASTORE_ID | NUMBER | Identifier for the source datastore associated with the operation. |
CONTAINER_ID | NUMBER | Identifier for the container associated with the operation. |
CONTAINER_SCAN_ID | NUMBER | Identifier for the container scan associated with the operation. |
PARTITION_NAME | STRING | Name of the source partition on which the scan operation is performed. |
INCREMENTAL | BOOLEAN | Boolean flag indicating whether the scan operation is incremental. |
RECORDS_PROCESSED | NUMBER | Total number of records processed during the scan operation. |
ENRICHMENT_SOURCE_RECORD_LIMIT | NUMBER | Maximum number of records written to the enrichment for each anomaly detected. |
MAX_RECORDS_ANALYZED | NUMBER | Maximum number of records analyzed in the scan operation. |
ANOMALY_COUNT | NUMBER | Total number of anomalies identified in the scan operation. |
START_TIME | TIMESTAMP | Timestamp marking the start of the scan operation. |
END_TIME | TIMESTAMP | Timestamp marking the end of the scan operation. |
RESULT | STRING | Textual representation of the scan operation's status. |
MESSAGE | STRING | Detailed message regarding the process of the scan operation. |
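For example, the sketch below lists the most recent scan operations with their record and anomaly counts. The table name assumes a hypothetical _DATASTORE_PREFIX, and the LIMIT clause may need adjusting for your warehouse's dialect.
SELECT
    OPERATION_ID,
    PARTITION_NAME,
    INCREMENTAL,
    RECORDS_PROCESSED,
    ANOMALY_COUNT,
    START_TIME,
    END_TIME,
    RESULT
FROM "_DATASTORE_PREFIX_SCAN_OPERATIONS"
ORDER BY START_TIME DESC
LIMIT 20;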
Remediation Tables
When anomalies are detected in a container, the platform has the capability to create remediation tables in the Enrichment Datastore. These tables are detailed snapshots of the affected container, capturing the state of the data at the time of anomaly detection. They also include additional columns for metadata and remediation purposes. However, the creation of these tables depends upon the chosen remediation strategy during the scan operation.
Currently, there are three types of remediation strategies:
- None: No remediation tables will be created, regardless of anomaly detection.
- Append: Replicate source containers using an append-first strategy.
- Overwrite: Replicate source containers using an overwrite strategy.
Note
The naming convention for the remediation tables follows the pattern <enrichment_prefix>_remediation_<container_id>, where <enrichment_prefix> is user-defined during the Enrichment Datastore configuration and <container_id> corresponds to the ID of the original source container.
Illustrative Table
_{ENRICHMENT_CONTAINER_PREFIX}_REMEDIATION_{CONTAINER_ID}
This remediation table is an illustrative snapshot of the "Orders" container for reference purposes.
Name | Data Type | Description |
---|---|---|
_QUALYTICS_SOURCE_PARTITION | STRING | The partition from the source data container. |
ANOMALY_UUID | STRING | Unique identifier of the anomaly. |
ORDERKEY | NUMBER | Unique identifier of the order. |
CUSTKEY | NUMBER | The customer key related to the order. |
ORDERSTATUS | CHAR | The status of the order (e.g., 'F' for 'finished'). |
TOTALPRICE | FLOAT | The total price of the order. |
ORDERDATE | DATE | The date when the order was placed. |
ORDERPRIORITY | STRING | Priority of the order (e.g., 'urgent'). |
CLERK | STRING | The clerk who took the order. |
SHIPPRIORITY | INTEGER | The priority given to the order for shipping. |
COMMENT | STRING | Comments related to the order. |
Note
In addition to capturing the original container fields, the platform includes two metadata columns designed to assist in the analysis and remediation process.
- _QUALYTICS_SOURCE_PARTITION
- ANOMALY_UUID
Understanding Remediation Tables vs. Source Record Tables
When managing data anomalies in containers, it's important to understand the structures of Remediation Tables and Source Record Tables in the Enrichment Datastore.
Remediation Tables
Purpose: Remediation tables are designed to capture detailed snapshots of the affected containers at the time of anomaly detection. They serve as a primary tool for remediation actions.
Creation: These tables are generated based on the remediation strategy selected during the scan operation:
- None: No tables are created.
- Append: Tables are created with new data appended.
- Overwrite: Tables are created and existing data is overwritten.
Structure: The structure includes all columns from the source container, along with additional columns for metadata and remediation purposes. The naming convention for these tables is <enrichment_prefix>_remediation_<container_id>, where <enrichment_prefix> is defined during the Enrichment Datastore configuration.
Source Record Tables
Purpose: The Source Record Table is mainly used within the Qualytics App to display anomalies directly to users by showing the source records.
Structure: Unlike remediation tables, the Source Record Table stores each record in a JSON format within a single column named RECORD, along with other metadata columns like SOURCE_CONTAINER, SOURCE_PARTITION, ANOMALY_UUID, and CONTEXT.
Key Differences
-
Format: Remediation tables are structured with separate columns for each data field, making them easier to use for consulting and remediation processes.
Source Record Tables store data in a JSON format within a single column, which can be less convenient for direct data operations.
-
Usage: Remediation tables are optimal for performing corrective actions and are designed to integrate easily with data workflows.
Source Record Tables are best suited for reviewing specific anomalies within the Qualytics App due to their format and presentation.
Recommendation
For users intending to perform querying or need detailed snapshots for audit purposes, Remediation Tables are recommended.
For those who need to quickly review anomalies directly within the Qualytics App, Source Record Tables are more suitable due to their straightforward presentation of data.
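To make the difference concrete, the sketch below contrasts the two structures using a hypothetical enrichment prefix of _DATASTORE_PREFIX, a remediation table for container id 123 (the illustrative Orders snapshot above), and a placeholder anomaly UUID. The remediation table is queried column-by-column, while the source record table requires parsing the JSON held in RECORD (PARSE_JSON is Snowflake syntax; use your warehouse's JSON functions as appropriate).
-- Remediation table: anomalous rows are regular columns
SELECT ANOMALY_UUID, ORDERKEY, TOTALPRICE
FROM "_DATASTORE_PREFIX_REMEDIATION_123"
WHERE ANOMALY_UUID = '<anomaly-uuid>';

-- Source record table: the same rows live inside a JSON string
SELECT ANOMALY_UUID,
       PARSE_JSON(RECORD):ORDERKEY::string   AS ORDERKEY,
       PARSE_JSON(RECORD):TOTALPRICE::string AS TOTALPRICE
FROM "_DATASTORE_PREFIX_SOURCE_RECORDS"
WHERE ANOMALY_UUID = '<anomaly-uuid>';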
Metadata Tables
The Qualytics platform enables users to manually export metadata into the enrichment datastore, providing a structured approach to data analysis and management. These metadata tables are structured to reflect the evolving characteristics of data entities, primarily focusing on aspects that are subject to changes.
Currently, the following assets are available for exporting:
- _<enrichment_prefix>_export_anomalies
- _<enrichment_prefix>_export_checks
- _<enrichment_prefix>_export_field_profiles
Note
The strategy used for managing these metadata tables employs a create or replace approach: the export process will create a new table if one does not exist, or replace it entirely if it does, so any previous data will be overwritten.
For more detailed information on exporting metadata, please refer to the export documentation.
_EXPORT_ANOMALIES Table
Contains metadata from anomalies in a distinct normalized format. This table is specifically designed to capture the mutable states of anomalies, emphasizing their status changes.
Columns
Name | Data Type | Description |
---|---|---|
ID | NUMBER | Unique identifier for the anomaly. |
CREATED | TIMESTAMP | Timestamp of anomaly creation. |
UUID | UUID | Universal Unique Identifier of the anomaly. |
TYPE | STRING | Type of the anomaly (e.g., 'shape'). |
STATUS | STRING | Current status of the anomaly (e.g., 'Active'). |
GLOBAL_TAGS | STRING | Tags associated globally with the anomaly. |
CONTAINER_ID | NUMBER | Identifier for the associated container. |
SOURCE_CONTAINER | STRING | Name of the source container. |
DATASTORE_ID | NUMBER | Identifier for the associated datastore. |
SOURCE_DATASTORE | STRING | Name of the source datastore. |
GENERATED_AT | TIMESTAMP | Timestamp when the export was generated. |
_EXPORT_CHECKS Table
Contains metadata from quality checks.
Columns
Name | Data Type | Description |
---|---|---|
ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
CREATED | STRING | Created timestamp of the check. |
DELETED_AT | STRING | Deleted timestamp of the check. |
DESCRIPTION | STRING | Description of the check. |
FIELDS | STRING | Fields involved in the check, separated by commas. |
FILTER | STRING | Criteria used to filter data when asserting the check. |
GENERATED_AT | STRING | Indicates when the export was generated. |
GLOBAL_TAGS | STRING | Represents the global tags of the check, separated by commas. |
HAS_PASSED | BOOLEAN | Boolean indicator of whether the check has passed its last assertion. |
ID | NUMBER | Unique identifier for the check. |
INFERRED | BOOLEAN | Indicates whether the check was inferred by the platform. |
IS_NEW | BOOLEAN | Flags if the check is new. |
LAST_ASSERTED | STRING | Timestamp of the last assertion performed on the check. |
LAST_EDITOR | STRING | Represents the last editor of the check. |
LAST_UPDATED | STRING | Represents the last updated timestamp of the check. |
NUM_CONTAINER_SCANS | NUMBER | Number of container scans associated with the check. |
PROPERTIES | STRING | Specific properties for the check in a JSON format. |
RULE_TYPE | STRING | Type of rule applied in the check. |
WEIGHT | FLOAT | Represents the weight of the check. |
DATASTORE_ID | NUMBER | Identifier of the datastore used in the check. |
CONTAINER_ID | NUMBER | Identifier of the container used in the check. |
TEMPLATE_ID | NUMBER | Identifier of the template associated with the check. |
IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template or not. |
SOURCE_CONTAINER | STRING | Name of the container used in the check. |
SOURCE_DATASTORE | STRING | Name of the datastore used in the check. |
_EXPORT_CHECK_TEMPLATES Table
Contains metadata from check templates.
Columns
Name | Data Type | Description |
---|---|---|
ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
CREATED | STRING | Created timestamp of the check. |
DELETED_AT | STRING | Deleted timestamp of the check. |
DESCRIPTION | STRING | Description of the check. |
FIELDS | STRING | Fields involved in the check, separated by commas. |
FILTER | STRING | Criteria used to filter data when asserting the check. |
GENERATED_AT | STRING | Indicates when the export was generated. |
GLOBAL_TAGS | STRING | Represents the global tags of the check, separated by commas. |
ID | NUMBER | Unique identifier for the check. |
IS_NEW | BOOLEAN | Flags if the check is new. |
IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template or not. |
LAST_EDITOR | STRING | Represents the last editor of the check. |
LAST_UPDATED | STRING | Represents the last updated timestamp of the check. |
PROPERTIES | STRING | Specific properties for the check in a JSON format. |
RULE_TYPE | STRING | Type of rule applied in the check. |
TEMPLATE_CHECKS_COUNT | NUMBER | Count of checks associated with the template. |
TEMPLATE_LOCKED | BOOLEAN | Indicates whether the check template is locked or not. |
WEIGHT | FLOAT | Represents the weight of the check. |
_EXPORT_FIELD_PROFILES Table
Contains metadata from field profiles.
Columns
Name | Data Type | Description |
---|---|---|
APPROXIMATE_DISTINCT_VALUES | FLOAT | Estimated number of distinct values in the field. |
COMPLETENESS | FLOAT | Ratio of non-null entries to total entries in the field. |
CONTAINER_ID | NUMBER | Identifier for the container holding the field. |
SOURCE_CONTAINER | STRING | Name of the container holding the field. |
CONTAINER_STORE_TYPE | STRING | Storage type of the container. |
CREATED | STRING | Date when the field profile was created. |
DATASTORE_ID | NUMBER | Identifier for the datastore containing the field. |
SOURCE_DATASTORE | STRING | Name of the datastore containing the field. |
DATASTORE_TYPE | STRING | Type of datastore. |
ENTROPY | FLOAT | Measure of randomness in the information being processed. |
FIELD_GLOBAL_TAGS | STRING | Global tags associated with the field. |
FIELD_ID | NUMBER | Unique identifier for the field. |
FIELD_NAME | STRING | Name of the field being profiled. |
FIELD_PROFILE_ID | NUMBER | Identifier for the field profile record. |
FIELD_QUALITY_SCORE | FLOAT | Score representing the quality of the field. |
FIELD_TYPE | STRING | Data type of the field. |
FIELD_WEIGHT | NUMBER | Weight assigned to the field for quality scoring. |
GENERATED_AT | STRING | Date when the field profile was generated. |
HISTOGRAM_BUCKETS | STRING | Distribution of data within the field represented as buckets. |
IS_NOT_NORMAL | BOOLEAN | Indicator of whether the field data distribution is not normal. |
KLL | STRING | Sketch summary of the field data distribution. |
KURTOSIS | FLOAT | Measure of the tailedness of the probability distribution. |
MAX | FLOAT | Maximum value found in the field. |
MAX_LENGTH | FLOAT | Maximum length of string entries in the field. |
MEAN | FLOAT | Average value of the field's data. |
MEDIAN | FLOAT | Middle value in the field's data distribution. |
MIN | FLOAT | Minimum value found in the field. |
MIN_LENGTH | FLOAT | Minimum length of string entries in the field. |
NAME | STRING | Descriptive name of the field. |
Q1 | FLOAT | First quartile in the field's data distribution. |
Q3 | FLOAT | Third quartile in the field's data distribution. |
SKEWNESS | FLOAT | Measure of the asymmetry of the probability distribution. |
STD_DEV | FLOAT | Standard deviation of the field's data. |
SUM | FLOAT | Sum of all numerical values in the field. |
TYPE_DECLARED | BOOLEAN | Indicator of whether the field type is explicitly declared. |
UNIQUE_DISTINCT_RATIO | FLOAT | Ratio of unique distinct values to the total distinct values. |
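As a usage sketch, the query below surfaces fields with low completeness from the exported field profiles. The table name assumes a hypothetical enrichment prefix of _DATASTORE_PREFIX, and the 0.9 threshold is illustrative.
SELECT
    SOURCE_DATASTORE,
    SOURCE_CONTAINER,
    FIELD_NAME,
    COMPLETENESS,
    APPROXIMATE_DISTINCT_VALUES
FROM "_DATASTORE_PREFIX_EXPORT_FIELD_PROFILES"
WHERE COMPLETENESS < 0.9
ORDER BY COMPLETENESS ASC;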
Diagram
The diagram below provides a visual representation of the associations between various tables in the Enrichment Datastore. It illustrates how tables can be joined to track and analyze data across different processes.
Handling JSON and string splitting
The example queries below show how to work with exported check metadata: JSON-formatted columns such as ADDITIONAL_METADATA and PROPERTIES are parsed into individual keys, and comma-separated columns such as FIELDS and GLOBAL_TAGS are split into separate fields. The three variants use, in order, Snowflake, PostgreSQL, and MySQL syntax; adjust the table name to include your enrichment prefix.
SELECT
PARSE_JSON(ADDITIONAL_METADATA):metadata_1::string AS Metadata1_Key1,
PARSE_JSON(ADDITIONAL_METADATA):metadata_2::string AS Metadata2_Key1,
PARSE_JSON(ADDITIONAL_METADATA):metadata_3::string AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
SPLIT_PART(FIELDS, ',', 1) AS Field1,
SPLIT_PART(FIELDS, ',', 2) AS Field2,
-- Add more lines as needed up to FieldN
FILTER,
GENERATED_AT,
SPLIT_PART(GLOBAL_TAGS, ',', 1) AS Tag1,
SPLIT_PART(GLOBAL_TAGS, ',', 2) AS Tag2,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
PARSE_JSON(PROPERTIES):allow_other_fields::string AS Property_AllowOtherFields,
PARSE_JSON(PROPERTIES):assertion::string AS Property_Assertion,
PARSE_JSON(PROPERTIES):comparison::string AS Property_Comparison,
PARSE_JSON(PROPERTIES):datetime_::string AS Property_Datetime,
-- Add more lines as needed up to Property
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
(ADDITIONAL_METADATA::json ->> 'metadata_1') AS Metadata1_Key1,
(ADDITIONAL_METADATA::json ->> 'metadata_2') AS Metadata2_Key1,
(ADDITIONAL_METADATA::json ->> 'metadata_3') AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
(string_to_array(FIELDS, ','))[1] AS Field1,
(string_to_array(FIELDS, ','))[2] AS Field2,
-- Add more lines as needed up to FieldN
FILTER,
GENERATED_AT,
(string_to_array(GLOBAL_TAGS, ','))[1] AS Tag1,
(string_to_array(GLOBAL_TAGS, ','))[2] AS Tag2,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
(PROPERTIES::json ->> 'allow_other_fields') AS Property_AllowOtherFields,
(PROPERTIES::json ->> 'assertion') AS Property_Assertion,
(PROPERTIES::json ->> 'comparison') AS Property_Comparison,
(PROPERTIES::json ->> 'datetime_') AS Property_Datetime,
-- Add more lines as needed up to PropertyN
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
(ADDITIONAL_METADATA->>'$.metadata_1') AS Metadata1_Key1,
(ADDITIONAL_METADATA->>'$.metadata_2') AS Metadata2_Key1,
(ADDITIONAL_METADATA->>'$.metadata_3') AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
SUBSTRING_INDEX(FIELDS, ',', 1) AS Field1,
-- Add more lines as needed up to FieldN
SUBSTRING_INDEX(GLOBAL_TAGS, ',', 1) AS Tag1,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
(PROPERTIES->>'$.allow_other_fields') AS Property_AllowOtherFields,
(PROPERTIES->>'$.assertion') AS Property_Assertion,
(PROPERTIES->>'$.comparison') AS Property_Comparison,
(PROPERTIES->>'$.datetime_') AS Property_Datetime,
-- Add more lines as needed up to PropertyN
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
Usage Notes
- Both metadata tables and remediation tables are designed to be ephemeral and should therefore be treated as temporary datasets. Users are advised to move this data to a more permanent dataset for long-term storage and reporting.
- The anomaly UUID in the remediation tables acts as a link to the detailed data in the _failed_checks enrichment table (see the example query after this list). This connection not only shows the number of failed checks but also provides insight into each one, such as the nature of the issue, the type of rule violated, and associated check tags. Additionally, when available, suggested remediation actions, including suggested field modifications and values, are presented alongside a score indicating the suggested action's potential effectiveness. This information helps users better understand the specifics of each anomaly related to the remediation tables.
- The Qualytics platform is configured to capture and write a maximum of 10 rows of data per anomaly by default for both the _source_records enrichment table and the remediation tables. To adjust this limit, users can utilize the enrichment_source_record_limit parameter within the Scan Operation settings. This parameter accepts a minimum value of 10 but allows the specification of a higher limit, up to an unrestricted number of rows per anomaly. It is important to note that if an anomaly is associated with fewer than 10 records, the platform will only write the actual number of records where the anomaly was detected.
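A minimal sketch of that linkage, assuming a hypothetical enrichment prefix of _DATASTORE_PREFIX and a remediation table for container id 123: joining on ANOMALY_UUID brings each remediated row together with its failed-check details and any suggested remediation.
SELECT
    r.ANOMALY_UUID,
    fc.QUALITY_CHECK_ID,
    fc.QUALITY_CHECK_RULE_TYPE,
    fc.QUALITY_CHECK_TAGS,
    fc.SUGGESTED_REMEDIATION_FIELD,
    fc.SUGGESTED_REMEDIATION_VALUE,
    fc.SUGGESTED_REMEDIATION_SCORE
FROM "_DATASTORE_PREFIX_REMEDIATION_123" r
JOIN "_DATASTORE_PREFIX_FAILED_CHECKS" fc
  ON fc.ANOMALY_UUID = r.ANOMALY_UUID;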
API Payload Examples
Retrieving Enrichment Datastore Tables
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/listing
[
{
"name":"_datastore_prefix_scan_operations",
"label":"scan_operations",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name":"_datastore_prefix_source_records",
"label":"source_records",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name":"_datastore_prefix_failed_checks",
"label":"failed_checks",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name": "_datastore_prefix_remediation_container_id",
"label": "table_name",
"datastore": {
"id": 123,
"name": "My Datastore",
"store_type": "jdbc",
"type": "postgresql",
"enrich_only": false,
"enrich_container_prefix": "_datastore_prefix",
"favorite": false
}
}
]
Retrieving Enrichment Datastore Source Records
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_source-record-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_source-record-table-prefix}
{
"source_records": "[{\"source_container\":\"table_name\",\"source_partition\":\"partition_name\",\"anomaly_uuid\":\"f11d4e7c-e757-4bf1-8cd6-d156d5bc4fa5\",\"context\":null,\"record\":\"{\\\"P_NAME\\\":\\\"\\\\\\\"strategize intuitive systems\\\\\\\"\\\",\\\"P_TYPE\\\":\\\"\\\\\\\"Radiographer, therapeutic\\\\\\\"\\\",\\\"P_RETAILPRICE\\\":\\\"-24.69\\\",\\\"LAST_MODIFIED_TIMESTAMP\\\":\\\"2023-09-29 11:17:19.048\\\",\\\"P_MFGR\\\":null,\\\"P_COMMENT\\\":\\\"\\\\\\\"Other take so.\\\\\\\"\\\",\\\"P_PARTKEY\\\":\\\"845004850\\\",\\\"P_SIZE\\\":\\\"4\\\",\\\"P_CONTAINER\\\":\\\"\\\\\\\"MED BOX\\\\\\\"\\\",\\\"P_BRAND\\\":\\\"\\\\\\\"PLC\\\\\\\"\\\"}\"}]"
}
Retrieving Enrichment Datastore Remediation
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_remediation-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_remediation-table-prefix}
{
"source_records": "[{\"source_container\":\"table_name\",\"source_partition\":\"partition_name\",\"anomaly_uuid\":\"f11d4e7c-e757-4bf1-8cd6-d156d5bc4fa5\",\"context\":null,\"record\":\"{\\\"P_NAME\\\":\\\"\\\\\\\"strategize intuitive systems\\\\\\\"\\\",\\\"P_TYPE\\\":\\\"\\\\\\\"Radiographer, therapeutic\\\\\\\"\\\",\\\"P_RETAILPRICE\\\":\\\"-24.69\\\",\\\"LAST_MODIFIED_TIMESTAMP\\\":\\\"2023-09-29 11:17:19.048\\\",\\\"P_MFGR\\\":null,\\\"P_COMMENT\\\":\\\"\\\\\\\"Other take so.\\\\\\\"\\\",\\\"P_PARTKEY\\\":\\\"845004850\\\",\\\"P_SIZE\\\":\\\"4\\\",\\\"P_CONTAINER\\\":\\\"\\\\\\\"MED BOX\\\\\\\"\\\",\\\"P_BRAND\\\":\\\"\\\\\\\"PLC\\\\\\\"\\\"}\"}]"
}
Retrieving Enrichment Datastore Failed Checks
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_failed-checks-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_failed-checks-table-prefix}
{
"source_records": "[{\"quality_check_id\":155481,\"anomaly_uuid\":\"1a937875-6bce-4bfe-8701-075ba66be364\",\"quality_check_message\":\"{\\\"SNPSHT_TIMESTAMP\\\":\\\"2023-09-03 10:26:15.0\\\"}\",\"suggested_remediation_field\":null,\"suggested_remediation_value\":null,\"suggested_remediation_score\":null,\"quality_check_rule_type\":\"greaterThanField\",\"quality_check_tags\":\"Time-Sensitive\",\"quality_check_parameters\":\"{\\\"field_name\\\":\\\"SNPSHT_DT\\\",\\\"inclusive\\\":false}\",\"quality_check_description\":\"Must have a value greater than the value of SNPSHT_DT\",\"operation_id\":28162,\"detected_time\":\"2024-03-29T15:08:07.585Z\",\"source_container\":\"ACTION_TEST_CLIENT_V3\",\"source_partition\":\"ACTION_TEST_CLIENT_V3\",\"source_datastore\":\"DB2 Dataset\"}]"
}
Retrieving Enrichment Datastore Scan Operations
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_scan-operations-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=operation_id='{operation-id}'&path={_scan-operations-table-prefix}
{
"source_records": "[{\"operation_id\":22871,\"datastore_id\":850,\"container_id\":7239,\"container_scan_id\":43837,\"partition_name\":\"ACTION_TEST_CLIENT_V3\",\"incremental\":true,\"records_processed\":0,\"enrichment_source_record_limit\":10,\"max_records_analyzed\":-1,\"anomaly_count\":0,\"start_time\":\"2023-12-04T20:35:54.194Z\",\"end_time\":\"2023-12-04T20:35:54.692Z\",\"result\":\"success\",\"message\":null}]"
}
Retrieving Enrichment Datastore Exported Metadata
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_export-metadata-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=container_id='{container-id}'&path={_export-metadata-table-prefix}
{
"source_records": "[{\"container_id\":13511,\"created\":\"2024-06-10T17:07:20.751438Z\",\"datastore_id\":1198,\"generated_at\":\"2024-06-11 18:42:31+0000\",\"global_tags\":\"\",\"id\":224818,\"source_container\":\"PARTSUPP-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"status\":\"Active\",\"type\":\"shape\",\"uuid\":\"f2d4fae3-982b-45a1-b289-5854b7af4b03\"}]"
}
{
"source_records": "[{\"additional_metadata\":null,\"container_id\":13515,\"coverage\":1.0,\"created\":\"2024-06-10T16:27:05.600041Z\",\"datastore_id\":1198,\"deleted_at\":null,\"description\":\"Must have a numeric value above >= 0\",\"fields\":\"L_QUANTITY\",\"filter\":null,\"generated_at\":\"2024-06-11 18:42:38+0000\",\"global_tags\":\"\",\"has_passed\":false,\"id\":196810,\"inferred\":true,\"is_new\":false,\"is_template\":false,\"last_asserted\":\"2024-06-11T18:04:24.480899Z\",\"last_editor\":null,\"last_updated\":\"2024-06-10T17:07:43.248644Z\",\"num_container_scans\":4,\"properties\":null,\"rule_type\":\"notNegative\",\"source_container\":\"LINEITEM-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"template_id\":null,\"weight\":7.0}]"
}
{
"source_records": "[{\"approximate_distinct_values\":106944.0,\"completeness\":0.7493389459,\"container_container_type\":\"file\",\"container_id\":13509,\"created\":\"2024-06-10T16:23:48.457907Z\",\"datastore_id\":1198,\"datastore_type\":\"gcs\",\"entropy\":null,\"field_global_tags\":\"\",\"field_id\":145476,\"field_name\":\"C_ACCTBAL\",\"field_profile_id\":882170,\"field_quality_score\":\"{\\\"total\\\": 81.70052209952111, \\\"completeness\\\": 74.93389459101233, \\\"coverage\\\": 66.66666666666666, \\\"conformity\\\": null, \\\"consistency\\\": 100.0, \\\"precision\\\": 100.0, \\\"timeliness\\\": null, \\\"volumetrics\\\": null, \\\"accuracy\\\": 100.0}\",\"field_type\":\"Fractional\",\"field_weight\":1,\"generated_at\":\"2024-06-11 18:42:32+0000\",\"histogram_buckets\":null,\"is_not_normal\":true,\"kll\":null,\"kurtosis\":-1.204241522,\"max\":9999.99,\"max_length\":null,\"mean\":4488.8079264033,\"median\":4468.34,\"min\":-999.99,\"min_length\":null,\"name\":\"C_ACCTBAL\",\"q1\":1738.87,\"q3\":7241.17,\"skewness\":0.0051837205,\"source_container\":\"CUSTOMER-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"std_dev\":3177.3005493585,\"sum\":5.0501333575999904E8,\"type_declared\":false,\"unique_distinct_ratio\":null}]"
}
Data Preview
Data Preview in Qualytics makes it simple to explore data tables and fields within a selected enrichment dataset. It supports filtering, field selection, and record downloads for deeper analysis, ensuring streamlined and efficient data management.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Step 2: You will see a list of available enrichment datastores. Click on the specific datastore you want to preview its details and data.
For demonstration purposes, we have selected the Netsuite Financials Enrich enrichment datastore.
Step 3: After clicking on your selected enrichment datastore, you will be able to preview its enrichment, metadata, remediation, all data tables, and unlinked data tables.
Data Preview Tab
Data Preview Tab provides a clear visualization of enrichment dataset tables and fields like _FAILED_CHECKS, _SOURCE_RECORDS, and _SCAN_OPERATIONS. Users can explore remediation data, metadata, and unlinked objects, refine data with filters, select specific fields, and download records for further analysis. This tab ensures efficient data review and management for enhanced insights.
All
By selecting All, users can access a comprehensive list of data tables associated with the selected enrichment datastore. This includes all relevant tables categorized under Enrichment, Remediation, Metadata, and Unlinked sections, enabling users to efficiently explore and manage the data. Click on a specific table or dataset within the All section to access its detailed information.
After clicking on a specific table or dataset, a detailed view opens, displaying fields such as _FAILED_CHECKS, _SOURCE_RECORDS, _SCAN_OPERATIONS, remediation tables (e.g., _ENRICHMENT_CONTAINER_PREFIX_REMEDIATION_CONTAINER_ID), metadata from export containers, and unlinked objects (orphaned data) for review and action.
Enrichment
By selecting Enrichment, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Enrichment section to access its detailed information.
After clicking on a specific table or dataset, a detailed view opens, displaying fields of the selected table or dataset.
Remediation
By selecting Remediation, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Remediation section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Metadata
By selecting Metadata, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Metadata section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Unlinked
By selecting Unlinked, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Unlinked section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Filter Clause and Refresh
The Data Preview tab includes filter functionality that enables users to focus on specific fields by applying filter clauses. This refines the displayed rows based on specific criteria, enhancing data analysis and providing more targeted insights. A Refresh button is also available to update the data view with the latest data.
Filter Clause
Use the Filter Clause to narrow down the displayed rows by applying specific filter clauses, allowing for focused and precise data analysis.
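For example, a filter clause is written as a SQL-style predicate against the columns of the previewed table. A hypothetical clause on the failed checks table might look like the following (adjust column names and values to the table you are previewing):
quality_check_rule_type = 'greaterThanField' AND detected_time >= '2024-03-01'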
Refresh
Click Refresh button to update the data view with the latest information, ensuring accuracy and relevance.
Select Specific Fields
Select specific fields to display, allowing you to focus on the most relevant data for analysis. Click on the Select Fields to Show dropdown and choose the fields you want to review by checking or unchecking options.
Download Records
Download Records feature in Qualytics allows users to easily export all source records from the selected enrichment dataset. This functionality is essential for performing deeper analysis outside the platform or for sharing data with external tools and teams.
Add Enrichment Datastore
Adding an enrichment datastore in Qualytics lets you connect and configure data sources for enhanced data management. You can create a new connection or reuse an existing one, with options to securely manage credentials.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Step 2: Click on the Add Enrichment Datastore button located at the top-right corner of the interface.
Step 3: A modal window, Add Enrichment Datastore, will appear, providing you with the options to add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the enrichment datastore |
2. | Toggle Button | Toggle ON to create a new enrichment datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select connector from the dropdown list. |
Option I: Add Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using existing connection details.
Step 1: Select the connector from the dropdown list and add connection details such as Secrets Management, temp dataset ID, service account key, project ID, and dataset ID.
For demonstration purposes we have selected the Snowflake connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will appear, requesting credential details before adding the enrichment datastore.
Note
Different connectors have different sets of fields and options appearing when selected.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Account (Required) | Define the account identifier to be used for accessing the Snowflake. |
2. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication (Required) | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication uses a username and password combination, where the user's credentials are directly used to access Snowflake. |
5. | Database | Specify the database name to be accessed. |
6. | Schema | Define the schema within the database that should be used. |
7. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
Step 3: After adding the details, click on the Save button.
A modal window appears, displaying a success message indicating that your enrichment datastore has been successfully updated.
Step 4: Close the success dialog. Here, you can view a list of all the enrichment datastores you have added to the system. For demonstration purposes, we have created an enrichment datastore named Snowflake_demo, which is visible in the list.
Option II: Use an Existing Enrichment Datastore
If the toggle for Add New connection is turned off, then this will prompt you to add and configure the enrichment datastore using existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, and Teams.
Step 2: Click on the Save button.
A modal window appears, displaying a success message indicating that your enrichment datastore has been successfully updated.
Step 3: Close the success dialog. Here, you can view a list of all the enrichment datastores you have added to the system. For demonstration purposes, we have created an enrichment datastore named Snowflake_demo, which is visible in the list.
Ended: Enrichment Datastores
Containers ↵
Containers Overview
Containers are fundamental entities representing structured data sets. These containers could manifest as tables in JDBC datastores or as files within DFS datastores. They play a pivotal role in data organization, profiling, and quality checks within the Qualytics application.
Let’s get started 🚀
Container Types
There are two main types of containers in Qualytics:
JDBC Container
JDBC containers are virtual representations of database objects, making it easier to work with data stored in relational databases. These containers include tables, which organize data into rows and columns like a spreadsheet, views that provide customized displays of data from one or more tables, and other database objects such as indexes or stored procedures. Acting as a bridge between applications and databases, JDBC enables seamless interaction with these containers, allowing efficient data management and retrieval.
DFS Container
DFS containers are used to represent files stored in distributed file systems, such as Hadoop or cloud storage. These files can include formats like CSV, JSON, or Parquet, which are commonly used for storing and organizing data. DFS containers make it easier for applications to work with these files by providing a structured way to access and process data in large-scale storage systems.
Containers Attributes
Totals
-
Quality Score: This represents the overall health of the data based on various checks. A higher score indicates better data quality and fewer issues detected.
-
Sampling: Displays the percentage of data sampled during profiling. A 100% sampling rate means the entire dataset was analyzed for the quality report.
-
Completeness: Indicates the percentage of records that are fully populated without missing or incomplete data. Lower percentages may suggest that some fields have missing values.
-
Records Profiled: Shows how many of the dataset's records were analyzed during profiling, indicating the coverage of the quality report.
-
Fields Profiled: This shows the number of fields or attributes within the dataset that have undergone data profiling, which helps identify potential data issues in specific columns.
-
Active Checks: Represents the number of ongoing checks applied to the dataset. These checks monitor data quality, consistency, and correctness.
-
Active Anomalies: Displays the total number of anomalies found during the data profiling process. Anomalies can indicate inconsistencies, outliers, or potential data quality issues that need resolution.
Observability
1. Volumetric Measurement
Volumetric measurement allows users to track the size of data stored within the table over time. This helps in monitoring how the data grows or changes, making it easier to detect sudden spikes that may impact system performance. Users can visualize data volume trends and manage the table's efficiency. This helps in optimizing storage, adjusting resource allocation, and improving query performance based on the size and growth of the computed table.
2. Anomalies Measurement
The Anomalies section helps users track any unusual data patterns or issues within the computed tables. It shows a visual representation of when anomalies occurred over a specific time period, making it easy to spot unusual activity. This allows users to quickly identify when something might have gone wrong and take action to fix it, ensuring the data stays accurate and reliable.
Actions on Container
Users can perform various operations on containers to manage datasets effectively. The actions are divided into three main sections: Settings, Add, and Run. Each section contains specific options to perform different tasks.
Settings
Settings button allows users to configure the container. By clicking on the Settings button, users can access the following options:
No. | Options | Description |
---|---|---|
1. | Settings | Configure incremental strategy, partitioning fields, and exclude specific fields from analysis. |
2. | Edit | Edit allows you to modify a computed file or table, including its name, select expressions, and filter clauses. |
3. | Score | Score allows you to adjust the decay period and factor weights for metrics like completeness, accuracy, and consistency. |
4. | Observability | Enables or disables volumetrics tracking for daily data volumes. |
5. | Export | Export quality checks, field profiles, and anomalies to an enrichment datastore for further action or analysis. |
6. | Delete | Delete the selected container from the system. |
Add
Add button allows users to add checks or computed fields. By clicking on the Add button, users can access the following options:
No. | Options | Description |
---|---|---|
1. | Checks | Checks allows you to add new checks or validation rules for the container. |
2. | Computed Field | Allows you to add a computed field. |
Run
Run button provides options to execute operations on datasets, such as profiling, scanning, and external scans. By clicking on the Run button, users can access the following options:
No. | Options | Descriptions |
---|---|---|
1. | Profile | Profile allows you to run a profile operation to analyze the data structure, gather metadata, set thresholds, and define record limits for comprehensive dataset profiling. |
2. | Scan | Scan allows you to perform data quality checks, configure scan strategies, and detect anomalies in the dataset. |
3. | External Scan | External Scan allows you to upload a file and validate its data against predefined checks in the selected table. |
Field Profiles
After profiling a container, individual field profiles offer granular insights:
Totals
1. Quality Score: This provides a comprehensive assessment of the overall health of the data, factoring in multiple checks for accuracy, consistency, and completeness. A higher score, closer to 100, indicates optimal data quality with minimal issues or errors detected. A lower score may highlight areas that require attention and improvement.
2. Sampling: This shows the percentage of data that was evaluated during profiling. A sampling rate of 100% indicates that the entire dataset was analyzed, ensuring a complete and accurate representation of the data’s quality across all records, rather than just a partial sample.
3. Completeness: This metric measures how fully the data is populated without missing or null values. A higher completeness percentage means that most fields contain the necessary information, while a lower percentage indicates data gaps that could negatively impact downstream processes or analysis.
4. Active Checks: This refers to the number of ongoing quality checks being applied to the dataset. These checks monitor aspects such as format consistency, uniqueness, and logical correctness. Active checks help maintain data integrity and provide real-time alerts about potential issues that may arise.
5. Active Anomalies: This tracks the number of anomalies or irregularities detected in the data. These could include outliers, duplicates, or inconsistencies that deviate from expected patterns. A count of zero indicates no anomalies, while a higher count suggests that further investigation is needed to resolve potential data quality issues.
Profile
This provides detailed insights into the characteristics of the field, including its type, distinct values, and length. You can use this information to evaluate the data's uniqueness, length consistency, and complexity.
No | Profile | Description |
---|---|---|
1 | Declared Type | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the ‘tailedness’ of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
Data Preview
Data Preview in Qualytics makes it easy for users to view and understand their container data. It provides a clear snapshot of the data's structure and contents, showing up to 100 rows from the source. With options to filter specific data, refresh for the latest updates, and download records, it helps users focus on the most relevant information, troubleshoot issues, and analyze data effectively. The simple grid view ensures a smooth and efficient way to explore and work with your data.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that contains the data you want to preview.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Note
Before accessing the Data Preview tab, ensure the container is profiled. If not, run a profile operation on the container. Profiling collects important information about the table structure, like column types and field names. Without profiling, no data will be shown in the Data Preview section.
Step 3: You will view the full list of tables or files belonging to the selected source datastore. Select the specific table or file whose data you want to preview.
Alternatively, you can access the tables or files by clicking the drop-down arrow on the selected datasource. This will display the full list of tables or files associated with the selected source datastore. From there, select the specific table or file whose data you want to preview.
Step 4: After selecting the specific table or file, click on the Data Preview tab.
You will see a tabular view of the data, displaying the field names (columns) and their corresponding data values, allowing you to review the data's structure, types, and sample records.
UI Caching
Upon initial access to the Data Preview section, the data may not be cached yet, which can cause longer loading times. How long it takes to load depends on the type of datastore being used (like DFS or JDBC) and whether the data warehouse is serverless. The next time you access the same data, it will load faster because it will be cached, meaning the data is stored temporarily for quicker access.
Filter Clause and Refresh
The Data Preview tab includes filter functionality that enables users to focus on specific fields by applying filter clauses. This refines the displayed rows based on specific criteria, enhancing data analysis and providing more targeted insights. A Refresh button is also available to update the data view with the latest data.
Filter Clause
Use the Filter Clause to narrow down the displayed rows by applying specific filter clauses, allowing for focused and precise data analysis.
Refresh
Click Refresh button to update the data view with the latest information, ensuring accuracy and relevance.
Select Specific Fields
Select specific fields to display, allowing you to focus on the most relevant data for analysis. Click on the Select Fields to Show dropdown and choose the fields you want to review by checking or unchecking options.
Download Records
Download Records feature in Qualytics allows users to easily export all source records from the selected table or file. This functionality is essential for performing deeper analysis outside the platform or for sharing data with external tools and teams.
Use Cases
Debugging Checks
One of the primary use cases of the Data Preview tab is for debugging checks. Users can efficiently inspect the first 100 rows of container data to identify any anomalies, inconsistencies, or errors, facilitating the debugging process and improving data quality.
Data Analysis
The Data Preview tab also serves as a valuable tool for data analysis tasks. Users can explore the dataset, apply filters to focus on specific subsets of data, and gain insights into patterns, trends, and correlations within the container data.
Examples
Example 1: Debugging Data Import
Suppose a user encounters issues with importing data into a container. By utilizing the Data Preview tab, the user can quickly examine the first 100 rows of imported data, identify any formatting errors or missing values, and troubleshoot the data import process effectively.
Example 2: Filtering Data by Date Range
In another scenario, a user needs to analyze sales data within a specific date range. The user can leverage the filter support feature of the Data Preview tab to apply date range filters, displaying only the sales records that fall within the specified timeframe. This allows for targeted analysis and informed decision-making.
Computed Tables & Files
Computed Tables and Computed Files are powerful virtual tables within the Qualytics platform, each serving distinct purposes in data manipulation. Computed Tables are created using SQL queries on JDBC source datastores, enabling advanced operations like joins and where clauses. Computed Files, derived from Spark SQL transformations on DFS source datastores, allow for efficient data manipulation and transformation directly within the DFS environment.
This guide explains how to add Computed Tables and Computed Files and discusses the differences between them.
Let's get started 🚀
Computed Tables
Use Computed Tables when you want to perform the following operation on your selected source datastore:
- Data Preparation and Transformation: Clean, shape, and restructure raw data from JDBC source datastores.
- Complex Calculations and Aggregations: Perform calculations not easily supported by standard containers.
- Data Subsetting: Extract specific data subsets based on filters using SQL's WHERE clause.
- Joining Data Across source datastores: Combine data from multiple JDBC source datastores using SQL joins.
Add Computed Tables
Step 1: Sign in to your Qualytics account and select a JDBC-type source datastore from the side menu to which you would like to add a computed table.
Step 2: After selecting your preferred source datastore, you will be redirected to the source datastore's store operation page. From this page, click on the Add button and select the Computed Table option from the dropdown menu.
Step 3: A modal window will appear prompting you to enter the name for your computed table and a valid SQL query that supports your selected source datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Enter a name for your computed table. The name should be descriptive and meaningful to help you easily identify the table later (e.g., Customer_Order_Statistics). |
2. | Query | Write the valid SQL queries that support your selected source datastore. The query helps to perform joins and aggregations on your selected source datastore. |
Step 4: Click on the Add button to add the computed table to your selected source datastore.
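As an illustration of the kind of SQL a computed table can hold, the sketch below joins two hypothetical tables (customers and orders) and aggregates order statistics per customer. It is not tied to any particular source schema, so adapt the table and column names to your own datastore.
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id)  AS order_count,
    SUM(o.total_price) AS total_spend
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;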
Computed Files
Use Computed Files when you want to perform the following operation on your selected source datastore:
- Data Preparation and Transformation: Efficiently clean and restructure raw data stored in a DFS.
- Column-Level Transformations: Utilize Spark SQL functions to manipulate and clean individual columns.
- Filtering Data: Extract specific data subsets within a DFS container using Spark SQL's WHERE clause.
Add Computed Files
Step 1: Sign in to your Qualytics account and select a DFS-type source datastore from the side menu to which you would like to add a computed file.
Step 2: After selecting your preferred source datastore, you will be redirected to the source datastore's store operation page. From this page, click on the Add button and select the Computed Files option from the dropdown menu.
Step 3: A modal window will appear prompting you to enter the name for your computed file, select a source file pattern, select the expression, and define the filter clause (optional).
REF. | FIELDS | ACTION |
---|---|---|
1. | Name (Required) | Enter a name for your computed table. The name should be descriptive and meaningful to help you easily identify the table later (e.g., add a meaningful name like Customer_Order_Statistics). |
2. | Source File Pattern (Required) | Select a source file pattern from the dropdown menu to match files that have a similar naming convention. |
3. | Select Expression (Required) | Select the expression to define the data you want to include in the computed file. |
4. | Filter Clause (Optional) | Add a WHERE clause to filter the data that meets certain conditions. |
After configuring the computed file details, it will look like this:
Step 4: Click on the Add button to add the computed file with your selected source datastore.
After clicking on the Add button, a flash message will be displayed indicating that the operation was successful.
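For reference, a computed file's select expression and filter clause are written in Spark SQL. A hypothetical example for a file of orders might cast a string column to a numeric type and keep only recent rows; the column names here are illustrative.
-- Select expression: reshape and cast columns
CAST(TOTALPRICE AS DOUBLE) AS total_price, UPPER(ORDERSTATUS) AS order_status

-- Filter clause (WHERE): keep only recent orders
ORDERDATE >= '2024-01-01'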
Computed Table Vs. Computed Files
Feature | Computed Table (JDBC) | Computed File (DFS) |
---|---|---|
Source Data | JDBC source datastores | DFS source datastores |
Query Language | SQL (database-specific functions) | Spark SQL |
Supported Operations | Joins, where clauses, and database functions | Column transforms, where clauses (no joins), Spark SQL functions |
Note
Computed tables and files function like regular tables. You can profile them, create checks, and detect anomalies.
- Updating a computed table's query will trigger a profiling operation.
- Updating a computed file's select or where clause will trigger a profiling operation.
- When you create a computed table or file, a basic profile of up to 1000 records is automatically generated.
Manage Tables & Files
Managing JDBC “tables” and DFS “files” in a connected source datastore allows you to perform actions such as adding validation checks, running scans, monitoring data changes, exporting, or deleting them. For JDBC tables, you can also handle metadata, configure partitions, and manage incremental data for optimized processing. However, for DFS datastores, the default incremental field is the file’s last modified timestamp, and users cannot configure incremental or partition fields manually.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that you want to manage.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Step 3: You will view the full list of tables or files belonging to the selected source datastore.
Settings
Settings allow you to edit how data is processed and analyzed for a specific table in your connected source datastore. This includes selecting fields for incremental and partitioning strategies, grouping data, excluding certain fields from scans, and adjusting general behaviors.
Step 1: Click on the vertical ellipse next to the table of your choice and select Settings from the dropdown list.
A modal window will appear for “Table Settings”.
Step 2: Modify the table setting based on:
-
Identifiers
-
Group Criteria
-
Excluding
-
General
Identifiers
An Identifier is a field that can be used to help load the desired data from a Table in support of analysis. For more details about identifiers, you can refer to the documentation on Identifiers.
Incremental Strategy
This is crucial for tracking changes at the row level within tables. This approach is essential for efficient data processing, as it is specifically used to track which records have already been scanned. This allows for scan operations to focus exclusively on new records that have not been previously scanned, thereby optimizing the scanning process and ensuring that only the most recent and relevant data is analyzed.
Note
If you have connected a DFS datastore, no manual setup is needed for the incremental strategy; the system automatically tracks and processes the latest data changes.
For information about incremental strategy, you can refer to the Incremental Strategy section in the Identifiers documentation.
Incremental Field
Incremental Field lets you select a field that tracks changes in your data. This ensures only new or updated records are scanned, improving efficiency and reducing unnecessary processing.
Partition Field
Partition Field is used to divide the data in a table into distinct segments, or dataframes. These partitions allow for parallel analysis, improving efficiency and performance. By splitting the data, each partition can be processed independently. This approach helps optimize large-scale data operations.
For information about Partition Field, you can refer to the Partition Field section in the Identifiers documentation.
Group Criteria
Group Criteria allow you to organize data into specific groups for more precise analysis. By grouping fields, you can gain better insights and enhance the accuracy of your profiling.
For information about Group Criteria, you can refer to the documentation on Grouping.
Excluding
Excluding allows you to choose specific fields from a table that you want to exclude from data checks. This helps focus on the fields that matter most for validation while ignoring others that are not relevant to the current analysis.
For information about Excluding, you can refer to the documentation on Excluding Settings.
General
You can control the default behavior of the specific table by checking or unchecking the option to infer the data type for each field. When checked, the system will automatically determine and cast the data types as needed for accurate data processing.
For information about General, you can refer to the documentation on General Settings.
Step 3: Once you have configured the table settings, click on the Save button.
After clicking on the Save button, your table is successfully updated and a success flash message will appear stating "Table has been successfully updated".
Add Checks
Add Check allows you to create rules to validate the data within a particular table. You can choose the type of rule, link it directly to the selected table, and add descriptions or tags. This ensures that the table's data remains accurate and compliant with the required standards.
Step 1: Click on the vertical ellipsis next to the table name and select Add Checks.
A modal window will appear to add checks against the selected table.
To understand how to add checks, you can follow the remaining steps from the documentation Checks Template.
Run
Execute various operations like profiling or scanning your table or file. It helps validate data quality and ensures that the table meets the defined checks and rules, providing insights into any anomalies or data issues that need attention.
Step 1: Click on the vertical ellipsis next to the table name and select Run.
Under Run, choose the type of operation you want to perform:
-
Profile: To collect metadata and profile the table's contents.
-
Scan: To validate the data against defined rules and checks.
To understand how a profile operation is performed, you can follow the remaining steps from the documentation Profile Operation.
To understand how a scan operation is performed, you can follow the remaining steps from the documentation Scan Operation.
Observability Settings
Observability helps you track and monitor data performance in your connected source datastore’s tables and files. It provides insights into data volume, detects anomalies, and ensures smooth data processing by identifying potential issues early. This makes it easier to manage and maintain data quality over time.
Step 1: Select the table in your JDBC datastore that you would like to monitor, then click on Observability.
A modal window “Observability Settings” will appear. Here you can view the details of the table and datastore where actions have been applied.
Step 2: Check the "Allow Tracking" option to enable daily volumetric measurement tracking for the data asset.
You can enable or disable volumetric tracking by checking or unchecking the "Allow Tracking" option. This feature monitors and records the daily volume of data in the selected table. When enabled, it helps track how much data is being processed and stored over time, providing valuable insights into the data asset's growth and usage.
Step 3: Click on the Save button.
After clicking on the Save button, a success flash message will appear stating "Profile has been successfully updated".
Export
Export feature lets you capture changes in your tables. You can export metadata for Quality Checks, Field Profiles, and Anomalies from selected tables to an enrichment datastore. This helps you analyze data trends, find issues, and make better decisions based on the table data.
Step 1: Select the tables in your JDBC datastore that you would like to export, then click on Export.
A modal window will appear with the Export Metadata setting.
For the next steps, detailed information on the export metadata is available in the Export Metadata section of the documentation.
Delete
Delete allows you to remove a table from the connected source datastore. While the table and its associated data will be deleted, it is not permanent, as the table can be recreated if you run a catalog with the "recreate" option.
Note
Deleting a table is a reversible action if a catalog with the "recreate" option is run later.
Step 1: Select the tables in your connected source datastore that you would like to delete, then click on Delete.
Step 2: A confirmation modal window will appear, click on the Delete button to remove the table from the system.
Step 3: After clicking on the delete button, your table is successfully deleted and a success flash message will appear saying "Profile has been successfully deleted"
Mark Tables & Files as Favorite
Marking tables and files as favorites allows you to quickly access important items. This feature helps you prioritize and manage the tables and files you use frequently, making data management more efficient.
Step 1: Locate the table or file you want to mark as a favorite and click on the bookmark icon next to it.
After clicking on the bookmark icon, the table or file is successfully marked as a favorite and a success flash message will appear stating "The Table has been favorited".
Step 2: To unmark a table or file, simply click on the bookmark icon of the marked item. This will remove it from your favorites.
Computed Fields
Computed Fields allow you to enhance data analysis by applying dynamic transformations directly to your data. These fields let you create new data points, perform calculations, and customize data views based on your specific needs, ensuring your data is both accurate and actionable.
Let's get started 🚀
Add Computed Fields
Step 1: Log in to Your Qualytics Account, navigate to the side menu, and select the source datastore where you want to create a computed field.
Step 2: Select the Container within the chosen datastore where you want to create the computed field. This container holds the data to which the new computed field will be applied, enabling you to enhance your data analysis within that specific datastore.
For demonstration purposes, we have selected the `Bank Dataset-Staging` source datastore and the `bank_transactions_.csv` container within it to create a computed field.
Step 3: After selecting the container, click on the Add button and select Computed Field from the dropdown menu to create a new computed field.
A modal window will appear, allowing you to enter the details for your computed field.
Step 4: Enter the Name for the computed field and select Transformation Type from the dropdown menu.
REF. | FIELDS | ACTION |
---|---|---|
1. | Field Name (Required) | Add a unique name for your computed field. |
2. | Transformation Type (Required) | Select the type of transformation you want to apply from the available options. |
Info
Transformations are changes made to data, like converting formats, doing calculations, or cleaning up fields. In Qualytics, you can use transformations to meet specific needs, such as cleaning entity names, converting formatted numbers, or applying custom expressions. With various transformation types available, Qualytics enables you to customize your data directly within the platform, ensuring it’s accurate and ready for analysis.
Transformation Types | Purpose | Reference |
---|---|---|
Cleaned Entity Name | Removes business signifiers (such as 'Inc.' or 'Corp') from an entity name. | See here |
Convert Formatted Numeric | Removes formatting (such as parentheses for denoting negatives or commas as delimiters) from values that represent numeric data, converting them into a numerically typed field. | See here |
Custom Expression | Allows you to create a new field by applying any valid Spark SQL expression to one or more existing fields. | See here |
Step 5: After selecting the appropriate Transformation Type, click the Save button.
Step 6: After clicking on the Save button, your computed field is created and a success flash message will display saying "A computed field has been created successfully."
You can find your computed field by clicking on the dropdown arrow next to the container you selected when creating the computed field.
Computed Fields Details
Totals
1. Quality Score: This provides a comprehensive assessment of the overall health of the data, factoring in multiple checks for accuracy, consistency, and completeness. A higher score, closer to 100, indicates optimal data quality with minimal issues or errors detected. A lower score may highlight areas that require attention and improvement.
2. Sampling: This shows the percentage of data that was evaluated during profiling. A sampling rate of 100% indicates that the entire dataset was analyzed, ensuring a complete and accurate representation of the data’s quality across all records, rather than just a partial sample.
3. Completeness: This metric measures how fully the data is populated without missing or null values. A higher completeness percentage means that most fields contain the necessary information, while a lower percentage indicates data gaps that could negatively impact downstream processes or analysis.
4. Active Checks: This refers to the number of ongoing quality checks being applied to the dataset. These checks monitor aspects such as format consistency, uniqueness, and logical correctness. Active checks help maintain data integrity and provide real-time alerts about potential issues that may arise.
5. Active Anomalies: This tracks the number of anomalies or irregularities detected in the data. These could include outliers, duplicates, or inconsistencies that deviate from expected patterns. A count of zero indicates no anomalies, while a higher count suggests that further investigation is needed to resolve potential data quality issues.
Profile
This provides detailed insights into the characteristics of the field, including its type, distinct values, and length. You can use this information to evaluate the data's uniqueness, length consistency, and complexity.
No | Profile | Description |
---|---|---|
1 | Declared Type | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the ‘tailedness’ of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
You can hover over the (i) button to view the native field properties, which provide detailed information such as the field's type (numeric), size, decimal digits, and whether it allows null values.
Manage Tags in field details
Tags can now be directly managed in the field profile within the Explore section. Simply access the Field Details panel to create, add, or remove tags, enabling more efficient and organized data management.
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the Profiles tab and select fields.
Step 3: Click on the specific field for which you want to manage tags.
A Field Details modal window will appear. Click on the + button to assign tags to the selected field.
Step 4: You can also create the new tag by clicking on the ➕ button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Filter and Sort Fields
Filter and Sort options allow you to organize your fields by various criteria, such as Name, Checks, Completeness, Created Date, and Tags. You can also apply filters to refine your list of fields based on Type and Tags.
Sort
You can sort your fields by Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1 | Anomalies | Sorts fields based on the number of detected anomalies. |
2 | Checks | Sorts fields by the number of active validation checks applied. |
3 | Completeness | Sorts fields based on their data completeness percentage. |
4 | Created Date | Sorts fields by the date they were created, showing the newest or oldest fields first. |
5 | Name | Sorts fields alphabetically by their names. |
6 | Quality Score | Sorts fields based on their quality score, indicating the reliability of the data in the field. |
7 | Type | Sorts fields based on their data type (e.g., string, boolean, etc.). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your fields based on values like Type and Tag to easily organize and prioritize them according to your needs.
No | Filter | Description |
---|---|---|
1 | Type | Filters fields based on the data type (e.g., string, boolean, date, etc.). |
2 | Tag | Select this to filter the fields based on specific tags, such as Healthcare, Compliance, or Sensitive. |
Types of Transformations
Cleaned Entity Name
This transformation removes common business signifiers from entity names, making your data cleaner and more uniform.
Options for Cleaned Entity Name
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Drop from Suffix | Removes specified terms from the end of the entity name. |
2. | Drop from Prefix | Removes specified terms from the beginning of the entity name. |
3. | Drop from Interior | Removes specified terms from the interior of the entity name. |
4. | Additional Terms to Drop (Custom) | Allows you to specify additional terms that should be dropped from the entity name. |
5. | Terms to Ignore (Custom) | Designate terms that should be ignored during the cleaning process. |
Example for Cleaned Entity Name
Example | Input | Transformation | Output |
---|---|---|---|
1 | "TechCorp, Inc." | Drop from Suffix: "Inc." | "TechCorp" |
2 | "Global Services Ltd." | Drop from Prefix: "Global" | "Services Ltd." |
3 | "Central LTD & Finance Co." | Drop from Interior: "LTD" | "Central & Finance Co." |
4 | "Eat & Drink LLC" | Additional Terms to Drop: "LLC", "&" | "Eat Drink" |
5 | "ProNet Solutions Ltd." | Terms to Ignore: "Ltd." | "ProNet Solutions" |
Convert Formatted Numeric
This transformation converts formatted numeric values into a plain numeric format, stripping out any characters like commas or parentheses that are not numerically significant.
Example for Convert Formatted Numeric
Example | Input | Transformation | Output |
---|---|---|---|
1 | "$1,234.56" | Remove non-numeric characters: ",", "$" | "1234.56" |
2 | "(2020)" | Remove non-numeric characters: "(", ")" | "-2020" |
3 | "100%" | Remove non-numeric characters: "%" | "100" |
Custom Expression
Enables the creation of a field based on a custom computation using Spark SQL. This is useful for applying complex logic or transformations that are not covered by other types.
Using Custom Expression:
You can combine multiple fields, apply conditional logic, or use any valid Spark SQL functions to derive your new computed field.
Example: To create a field that sums two existing fields, you could use the expression `field1 + field2`.
Example for Custom Expression
Example | Input Fields | Custom Expression | Output |
---|---|---|---|
1 | `field1 = 10`, `field2 = 20` | `field1 + field2` | 30 |
2 | `salary = 50000`, `bonus = 5000` | `salary + bonus` | 55000 |
3 | `hours = 8`, `rate = 15.50` | `hours * rate` | 124 |
4 | `status = 'active'`, `score = 85` | `CASE WHEN status = 'active' THEN score ELSE 0 END` | 85 |
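As a slightly richer illustration, a single custom expression can combine arithmetic with conditional logic. The field names below (`status`, `salary`) are hypothetical:

```sql
-- Computed field: pays a 10% bonus for active employees; others keep base salary.
CASE
  WHEN status = 'active' THEN round(salary * 1.10, 2)
  ELSE salary
END
```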
Export Metadata
Qualytics’ metadata export feature lets you capture the changing states of your data. You can export metadata for Quality Checks, Field Profiles, and Anomalies from selected profiles into an enrichment datastore so that you can perform deeper analysis, identify trends, detect issues, and make informed decisions based on your data.
To keep things organized, the exported files use specific naming patterns:
- Anomalies: Saved as `_<enrichment_prefix>_anomalies_export`.
- Quality Checks: Saved as `_<enrichment_prefix>_checks_export`.
- Field Profiles: Saved as `_<enrichment_prefix>_field_profiles_export`.
Note
Ensure that an enrichment datastore is already set up and properly configured to accommodate the exported data. This setup is essential for exporting anomalies, quality checks, and field profiles successfully.
Let’s get started 🚀
Step 1: Select a source datastore from the side menu from which you would like to export the metadata.
For demonstration purposes, we have selected the “COVID-19 Data” Snowflake source datastore.
Step 2: After selecting your preferred datastore, a bottom-up menu will appear on the right side of the interface. Click on “Export Metadata” alongside the Enrichment Datastore.
Step 3: After clicking “Export Metadata,” a modal window will appear providing you the options to select the assets you want to export to your Enrichment Datastore—whether it's anomalies, quality checks, or field profiles.
For demonstration purposes, we have opted to export all three assets: Anomalies, Quality Checks, and Field Profiles.
Step 4: Once you have selected the assets, click the “Next” button to continue.
Step 5: Select the profiles you wish to export. You can choose specific profiles or export them all at once. After selection, click on the Export button.
For demonstration purposes, we have checked the “All” option.
Step 6: After clicking “Export,” the process starts, and a message will confirm that the metadata will be available in your Enrichment Datastore shortly.
Review Exported Metadata
Step 1: Once the metadata has been exported, navigate to the “Enrichment Datastores” located on the left menu.
Step 2: In the “Enrichment Datastores” section, select the datastore where you exported the metadata. The exported metadata will now be visible in the selected datastore.
Step 3: Click on the exported files to view the metadata. For demonstration purposes, we have selected the “export_field_profiles” file to review the metadata.
The exported metadata is displayed in a table format, showing key details about the field profiles from the datastore. It typically includes columns that indicate the uniqueness of data, the completeness of the fields, and the data structure. You can use this metadata to check data quality, prepare for analysis, ensure compliance, and manage your data.
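If your enrichment datastore is SQL-accessible, you could also inspect the exported metadata with an ad-hoc query. The sketch below assumes an enrichment prefix of `export` and illustrative column names (`field_name`, `completeness`, `distinct_values`); check the actual exported schema before relying on it:

```sql
-- Illustrative only: the table name follows the _<enrichment_prefix>_field_profiles_export
-- pattern described earlier, and the column names are assumptions.
SELECT field_name,
       completeness,
       distinct_values
FROM _export_field_profiles_export
ORDER BY completeness ASC   -- surface the least complete fields first
LIMIT 20;
```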
Settings ↵
Identifiers
An Identifier is a field that can be used to help load the desired data from a Table in support of analysis. There are two types of identifiers that can be declared for a Table:
-
Incremental Field: Track records in the table that have already been scanned in order to support Scan operations that only analyze new (not previously scanned) data.
-
Partition Field: Divide the data in the table into distinct dataframes that can be analyzed in parallel.
Managing an Identifier
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that you want to manage.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Step 3: You will view the full list of tables or files belonging to the selected source datastore.
Step 4: Click on the vertical ellipsis next to the table of your choice and select Settings from the dropdown list.
A modal window will appear for “Table Settings”, where you can manage identifiers for the selected table.
Incremental Strategy
The Incremental Strategy configuration in Qualytics is crucial for tracking changes at the row level within tables.
This approach is essential for efficient data processing, as it is specifically used to track which records have already been scanned.
This allows for scan operations to focus exclusively on new records that have not been previously scanned, thereby optimizing the scanning process and ensuring that only the most recent and relevant data is analyzed.
No | Strategy Option | Description |
---|---|---|
1 | None | No incremental strategy; every scan processes the full data set. |
2 | Last Modified | - Available types are Date or Timestamp. - Uses a "last modified column" to track changes in the data set. - This column typically contains a timestamp or date value indicating when a record was last modified. - The system compares the "last modified column" to a previous timestamp or date, updating only the records modified since that time. |
3 | Batch Value | - Available types are Integral or Fractional. - Uses a "batch value column" to track changes in the data set. - This column typically contains an incremental value that increases as new data is added. - The system compares the current "batch value" with the previous one, updating only records with a higher "batch value". - Useful when data comes from a system without a modification timestamp. |
4 | Postgres Commit Timestamp Tracking | - Utilizes commit timestamps for change tracking. |
Availability based on technologies:
Option | Availability |
---|---|
Last Modified | All |
Batch Value | All |
Postgres Commit Timestamp Tracking | PostgreSQL |
Info
- All options are useful for an incremental strategy; the best choice depends on the availability of the data and how it is modeled.
- All three options allow you to track and process only the data that has changed since the last time the system was run, reducing the amount of data that needs to be read and processed and increasing the efficiency of your system.
Incremental Strategy with DFS (Distributed File System)
For DFS in Qualytics, the incremental strategy leverages the last modified timestamps from the file metadata.
This automated process means that DFS users do not need to manually configure their incremental strategy, as the system efficiently identifies and processes the most recent changes in the data.
Example
Objective: Identify and process new or modified records in the ORDERS table since the last scan using an Incremental Strategy.
Sample Data
O_ORDERKEY | O_PAYMENT_DETAILS | LAST_MODIFIED |
---|---|---|
1 | {"date": "2023-09-25", "amount": 250.50, "credit_card": "5105105105105100"} | 2023-09-25 10:00:00 |
2 | {"date": "2023-09-25", "amount": 150.75, "credit_card": "4111-1111-1111-1111"} | 2023-09-25 10:30:00 |
3 | {"date": "2023-09-25", "amount": 200.00, "credit_card": "1234-5678-9012-3456"} | 2023-09-25 11:00:00 |
4 | {"date": "2023-09-25", "amount": 175.00, "credit_card": "5555-5555-5555-4444"} | 2023-09-26 09:00:00 |
5 | {"date": "2023-09-25", "amount": 300.00, "credit_card": "2222-2222-2222-2222"} | 2023-09-26 09:30:00 |
Incremental Strategy Explanation
In this example, an Incremental Strategy would focus on processing records that have a LAST_MODIFIED timestamp after a certain cutoff point. For instance, if the last scan was performed on 2023-09-25 at 11:00:00, then only records with O_ORDERKEY 4 and 5 would be considered for the current scan, as they have been modified after the last scan time.
graph TD
A[Start] --> B[Retrieve Orders Since Last Scan]
B --> C{Record Modified After Last Scan?}
C -->|Yes| D[Process Record]
C -->|No| E[Skip Record]
D --> F[Move to Next Record/End]
E --> F
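Conceptually, the next scan reduces to a simple filter on the incremental field. A minimal Spark SQL sketch using the sample data above (the cutoff is the example's last scan time):

```sql
-- Only rows modified after the last recorded scan time are read.
SELECT O_ORDERKEY, O_PAYMENT_DETAILS, LAST_MODIFIED
FROM ORDERS
WHERE LAST_MODIFIED > TIMESTAMP '2023-09-25 11:00:00';  -- returns order keys 4 and 5
```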
Partition Field
The Partition Field is a fundamental feature for organizing and managing large datasets. It is specifically designed to divide the data within a table into separate, distinct dataframes.
This segmentation is a key strategy for handling and analyzing data more effectively. By creating these individual dataframes, Qualytics allows for parallel processing, which significantly accelerates the analysis.
Each partition can be analyzed independently, enabling simultaneous examination of different segments of the dataset.
This not only increases the efficiency of data processing but also ensures a more streamlined and scalable approach to handling large volumes of data, making it an indispensable tool in data analysis and management.
The ideal Partition Identifier is an Incremental Identifier of type datetime, such as a last-modified field; however, alternatives are automatically identified and set during a Catalog operation.
Info
- Partition Field Selection: When selecting a partition field for a table during catalog operation, we will attempt to select a field with no nulls where possible.
- User-Specified Partition Fields: Users are permitted to specify partition fields manually. While we ensure that the user selects a field of a supported data type, we do not currently enforce non-nullability or completeness. Care should be given to select partition fields with no or a low percentage of nulls in order to avoid unbalanced partitioning.
Warning
If no appropriate partition identifier can be selected, then repeatable ordering candidates (order by fields) are used for less efficient processing of containers with a very large number of rows.
Example
Objective: Efficiently process and analyze the ORDERS table by partitioning the data based on the O_ORDERDATE field, allowing parallel processing of different date segments.
Sample Data
O_ORDERKEY | O_CUSTKEY | O_ORDERSTATUS | O_TOTALPRICE | O_ORDERDATE |
---|---|---|---|---|
1 | 123 | 'O' | 173665.47 | 2023-09-01 |
2 | 456 | 'O' | 46929.18 | 2023-09-01 |
3 | 789 | 'F' | 193846.25 | 2023-09-02 |
4 | 101 | 'O' | 32151.78 | 2023-09-02 |
5 | 202 | 'F' | 144659.20 | 2023-09-03 |
Partition Field Explanation
In this example, the O_ORDERDATE field is used to partition the ORDERS table. Each partition represents a distinct date, allowing for the parallel processing of orders based on their order date. This strategy enhances the efficiency of data analysis by distributing the workload across different partitions.
graph TD
A[Start] --> B[Retrieve Orders Data]
B --> C{Partition by O_ORDERDATE}
C --> D[Distribute Partitions for Parallel Processing]
C --> E[Identify Date Segments]
D --> F[Analyze Each Partition Independently]
E --> F
F --> G[Combine Results/End]
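As a minimal sketch of what analyzing a single partition looks like with the sample data above (one worker would handle each O_ORDERDATE segment in parallel):

```sql
-- One partition: all orders placed on 2023-09-01 (order keys 1 and 2).
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE
FROM ORDERS
WHERE O_ORDERDATE = DATE '2023-09-01';
```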
Grouping Overview
Grouping is a fundamental aspect of data analysis, allowing users to organize data into meaningful categories for in-depth examination. With the ability to set grouping on Containers, users can define how data within a container should be grouped, facilitating more focused and efficient analysis.
Usage
The `grouping` parameter accepts a list of lists of field names. Each inner list holds the field names in the order that they will be applied as grouping criteria. This flexibility allows users to customize the grouping behavior based on their specific analytical requirements.
Example
Consider the following examples of `grouping` configurations:
- `["store_id"]`: Groups data within the container by the `store_id` field.
- `["store_id", "month"]`: Groups data first by `store_id`, then by `month`.
- `["store_id", "state"]`: Groups data first by `store_id`, then by `state`.
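To apply all three of the groupings above in a single configuration, the parameter would be supplied as a list of lists, for example `[["store_id"], ["store_id", "month"], ["store_id", "state"]]` (an illustrative combination of the examples shown).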
By specifying different combinations of fields in the `grouping` parameter, users can tailor the grouping behavior to suit their analytical needs.
Impact on Data Profiles
Grouping has implications for various aspects of data profiling and analysis within Qualytics.
Field Profiles
Field Profiles are now produced with filters determined by the `grouping` specified on the Profile Operation. This means that the profiles generated will reflect the characteristics of data within each group defined by the grouping criteria.
Inferred Quality Checks
The inferred checks produced by the analytics engine will also hold the filter defined by the `grouping`. This ensures that data access controls and constraints are applied consistently across different groupings of data within the container.
Inferred Quality Check Filters
Quality Check filters, represented as Spark SQL where clauses, are set based on the `grouping` specified on the Profile Operation. This ensures that quality checks are applied appropriately to the data within each group, allowing for comprehensive data validation and quality assurance.
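For example, with a grouping of `["store_id", "month"]`, the filter attached to a check scoped to one group might be a where clause along these lines (values are illustrative):

```sql
-- Spark SQL where clause scoping an inferred check to a single group
store_id = 'S001' AND month = '2023-09'
```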
Conclusion
The introduction of Grouping for Containers in Qualytics enhances data organization and analysis capabilities, allowing users to define custom grouping criteria and analyze data at a granular level. By leveraging `grouping`, users can gain deeper insights into their data and streamline the analytical process, ultimately driving more informed decision-making and improving overall data quality and reliability.
General Configurations Overview
Excluded Fields
This configuration allows you to selectively exclude specific fields from containers. These excluded fields will be omitted from check creation during profiling operations while also being hidden in data previews, without requiring a profile run.
This can be helpful when dealing with sensitive data, irrelevant information, or large datasets where you want to focus on specific fields.
Benefits of Excluding Fields
Targeted Analysis
Focus your analysis on the fields that matter most by removing distractions from excluded fields.
Data Privacy
Protect sensitive information by excluding fields that contain personal data or confidential information.
Important Considerations
Excluding fields removes them from profile creation and data previews until you re-include them and re-profile the container.
Infer data type
The "infer data type" option in containers allows the system to automatically determine the appropriate data types (e.g., fractional, integer, date) for columns within your data containers. This setting is configurable for both JDBC and DFS containers.
Behavior in JDBC Datastores
- Default: Disabled
- Reason: JDBC datastores provide inherent schema information from the database tables. Qualytics leverages this existing schema for accurate data typing.
- Override: You can optionally enable this setting if encountering issues with automatic type detection from the source database.
Behavior in DFS Datastores
- Default:
- Enabled for CSV files
- Disabled for other file formats (Parquet, Delta, Avro, ORC, etc.)
- Reason:
- CSV files lack a defined schema. Data type inference helps ensure correct data interpretation.
- File formats like Parquet, Delta, Avro, and ORC have embedded schemas, making inference unnecessary.
- Override: You can adjust the default behavior based on your specific data sources and requirements.
Rule for the "Infer Data Type"
Schema-Based Data Sources
If the data source has a defined schema (JDBC, Delta, Parquet, Avro, ORC), the flag is set to "False".
Schema-less Data Sources
If the data source lacks a defined schema (CSV), the flag is set to "True".
Override file pattern for DFS datastores
Override the file pattern to include files that share the same schema but don't match the automatically generated pattern from the initial cataloging.
In some cases, you may have multiple files that share the same schema but don't match the automatically generated file pattern during the initial cataloging process. To address this, Qualytics has the ability to override file patterns in the UI. This allows you to specify a custom pattern that encompasses all files with the shared schema, ensuring they are properly included in profiling and analysis.
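For example, if the initial cataloging produced a pattern matching only files from a single year, you might override it with a broader glob such as `bank_transactions_*.csv` (a hypothetical pattern) so that every file sharing that schema is included.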
Explore Deeper Knowledge
If you want to go deeper into the knowledge or if you are curious and want to learn more about DFS filename globbing, you can explore our comprehensive guide here: How DFS Filename Globbing Works.
Important Considerations
Subsequent catalog operations without pruning (`Disabled`) will use the new pattern.
Ended: Settings
Ended: Containers
Weight ↵
Weight Mechanism
The Weight Mechanism for checks is designed to evaluate and prioritize checks based on three key factors: Rule Type Weighting, Anomaly Weighting, and Tag Weighting.
Let’s get started 🚀
1. Rule Type Weighting
Each quality check rule type has a specific weight based on its importance. The rule types are divided into three categories:
High Importance (Weight: 3)
These rules are assigned the highest weight of 3 to reflect their crucial role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | Entity Resolution | 3 |
2 | Expected Schema | 3 |
3 | Matches Pattern | 3 |
4 | Predicted By | 3 |
5 | Satisfies Expression | 3 |
6 | Contains Social Security Number | 3 |
7 | Time Distribution Size | 3 |
8 | User Defined Function | 3 |
9 | Is Replica Of | 3 |
10 | Metric | 3 |
11 | Aggregation Comparison | 3 |
12 | Is Address | 3 |
Medium Importance (Weight: 2)
These rules are assigned a medium weight of 2 to reflect their role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | Any Not Null | 2 |
2 | Between | 2 |
3 | Between Times | 2 |
4 | Contains Credit Card | 2 |
5 | Contains Email | 2 |
6 | Equal To | 2 |
7 | Equal To Field | 2 |
8 | Exists In | 2 |
9 | Not Exists In | 2 |
10 | Expected Values | 2 |
11 | Greater Than Field | 2 |
12 | Less Than Field | 2 |
13 | Not Future | 2 |
14 | Required Values | 2 |
15 | Unique | 2 |
16 | Contains URL | 2 |
17 | Min Partition Size | 2 |
18 | Is Credit Card | 2 |
19 | Volumetric | 2 |
Low Importance (Weight: 1)
These rules are assigned the lowest weight of 1 to reflect their role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | After Date Time | 1 |
2 | Before DateTime | 1 |
3 | Distinct Count | 1 |
4 | Field Count | 1 |
5 | Is Type | 1 |
6 | Max Length | 1 |
7 | Max Value | 1 |
8 | Min Length | 1 |
9 | Min Value | 1 |
10 | Not Exists In | 1 |
11 | Not Negative | 1 |
12 | Not Null | 1 |
13 | Positive | 1 |
14 | Sum | 1 |
2. Anomaly Weighting
Anomalies can impact the importance of a check by adjusting its weight. The adjustment is based on whether the check has anomalies and whether it is authored or inferred:
-
Authored Check with Anomalies: - The check's weight increases by 12 points.
-
Authored Check without Anomalies: - The check's weight increases by 9 points.
-
Inferred Check with Anomalies: - The check's weight increases by 6 points.
-
Inferred Check without Anomalies: - The check's weight remains 0 points.
3. Tag Weighting
Tags can further modify the weight of a check. When tags with weight modifiers are applied, their weights are added to the check’s total weight.
- Tag with Weight Modifier: Each tag that has a specific weight modifier will contribute to the overall weight of the check. For example, if Tag B has a weight of 2, it will add 2 points to the total weight of the check.
Example of Weight Calculation
Let's break down an example calculation for a check of type Authored, using the Is Credit Card rule (Medium Importance), with no anomalies, and Tag B applied:
Step-by-Step Calculation
- Step 1: Rule Type Weight – The Is Credit Card rule has a weight of 2 (Medium Importance).
- Step 2: Anomaly Weight – An Authored Check without anomalies adds 9 points.
- Step 3: Tag Weight – Tag B adds 2 points.
Total Weight = 2 (rule type) + 9 (no anomalies) + 2 (Tag B) = 13 points
Additional Notes
If the table itself has a Tag A with a weight of 10, the check will inherit that tag. In this case, the total weight will include both tag weights.
Total Weight = 2 (rule type) + 9 (no anomalies) + 2 (Tag B) + 10 (Tag A) = 23 points
Quick Calculation Formula
To make the calculation easier, here are the quick formulas for different types of checks:
-
For Authored Checks with Anomalies:
[Rule Type Weight] + 12 (Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Authored Checks without Anomalies:
[Rule Type Weight] + 9 (No Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Inferred Checks with Anomalies:
[Rule Type Weight] + 6 (Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Inferred Checks without Anomalies:
[Rule Type Weight] + 0 (No Anomaly) + Check’s Tag Weight + Table’s Tag Weight
Example Calculation (Extended)
Let's extend the example with the inclusion of both Tag A and Tag B:
-
For Authored Checks with Anomalies:
[Rule Type Weight] + 12 + 10 (Tag A) + 2 (Tag B)
-
For Authored Checks without Anomalies:
[Rule Type Weight] + 9 + 10 (Tag A) + 2 (Tag B)
-
For Inferred Checks with Anomalies:
[Rule Type Weight] + 6 + 10 (Tag A) + 2 (Tag B)
-
For Inferred Checks without Anomalies:
[Rule Type Weight] + 0 + 10 (Tag A) + 2 (Tag B)
Ended: Weight
Data Quality Checks ↵
Checks Overview
Checks in Qualytics are rules applied to data that ensure quality by validating accuracy, consistency, and integrity. Each check includes a data quality rule, along with filters, tags, tolerances, and notifications, allowing efficient management of data across tables and fields.
Let’s get started 🚀
Checks Types
In Qualytics, you will come across two types of checks:
Inferred Checks
Qualytics automatically generates inferred checks during a Profile operation. These checks typically cover 80-90% of the rules needed by users. They are created and maintained through profiling, which involves statistical analysis and machine learning methods.
For more details on Inferred Checks, please refer to the Inferred Checks documentation.
Authored Checks
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
For more details on Authored Checks, please refer to the Authored Checks documentation.
View & Manage Checks
The Checks tab in Qualytics provides users with an interface to view and manage various checks associated with their data. These checks are accessible through two different methods, as discussed below.
Method 1: Datastore-Specific Checks
Step 1: Log in to your Qualytics account and select the Datastore from the left menu.
Step 2: Click on the "Checks" from the Navigation Tab.
You will see a list of all the checks that have been applied to the selected datastore.
You can switch between different types of checks to view them categorically (such as All, Active, Draft, and Archived).
Method 2: Explore Section
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Checks" from the Navigation Tab.
You'll see a list of all the checks that have been applied to various tables and fields across different source datastores.
Check Templates
Check Templates empower users to efficiently create, manage, and apply standardized checks across various datastores, acting as blueprints that ensure consistency and data integrity across different datasets and processes.
Check templates streamline the validation process by enabling check management independently of specific data assets such as datastores, containers, or fields. These templates reduce manual intervention, minimize errors, and provide a reusable framework that can be applied across multiple datasets, ensuring all relevant data adheres to defined criteria. This not only saves time but also enhances the reliability of data quality checks within an organization.
For more details about check templates, please refer to the Check Templates documentation.
Apply Check Template for Quality Checks
You can apply check templates to make quality checks easier and more consistent. Using a set template lets you quickly verify that your data meets specific standards, reducing mistakes and improving data quality. Applying these templates simplifies the process, making it more efficient to find and fix errors, and ensuring your quality checks are applied across different projects or systems without starting from scratch.
For more details how to apply checks template for quality check, please refer to the Apply Checks Template for Quality Check documentation.
Export Check Templates
You can export check templates to easily share or reuse your quality check settings across different systems or projects. This saves time by eliminating the need to recreate the same checks repeatedly and ensures that your quality standards are consistently applied. Exporting templates helps maintain accuracy and efficiency in managing data quality across various environments.
For more details about export checks template, please refer to the Export Check Templates documentation.
Manage Checks in Datastore
Managing your checks within a datastore is important to maintain data integrity and ensure quality. You can categorize, create, update, archive, restore, delete, and clone checks, making it easier to apply validation rules across the datastores. The system allows for checks to be set as active, draft, or archived based on their current state of use. You can also define reusable templates for quality checks to streamline the creation of multiple checks with similar criteria. With options for important and favorite, users have full flexibility to manage data quality efficiently.
For more details how to manage checks in datastore, please refer to the Manage Checks in Datastore documentation.
Check Rule Types
In Qualytics, a variety of check rule types are provided to maintain data quality and integrity. These rules define specific criteria that data must meet, and checks apply these rules during the validation process.
For more details about check rule types, please refer to the Rule Types Overview documentation.
Rule Type | Description |
---|---|
After Date Time | Asserts that the field is a timestamp later than a specific date and time. |
Any Not Null | Asserts that one of the fields must not be null. |
Before DateTime | Asserts that the field is a timestamp earlier than a specific date and time. |
Between | Asserts that values are equal to or between two numbers. |
Between Times | Asserts that values are equal to or between two dates or times. |
Contains Credit Card | Asserts that the values contain a credit card number. |
Contains Email | Asserts that the values contain email addresses. |
Contains Social Security Number | Asserts that the values contain social security numbers. |
Contains Url | Asserts that the values contain valid URLs. |
Distinct Count | Asserts on the approximate count distinct of the given column. |
Entity Resolution | Asserts that every distinct entity is appropriately represented once and only once |
Equal To Field | Asserts that this field is equal to another field. |
Exists in | Asserts that the rows of a compared table/field of a specific Datastore exist in the selected table/field. |
Expected Schema | Asserts that all selected fields are present and that all declared data types match expectations. |
Expected Values | Asserts that values are contained within a list of expected values. |
Field Count | Asserts that there must be exactly a specified number of fields. |
Greater Than | Asserts that the field is a number greater than (or equal to) a value. |
Greater Than Field | Asserts that this field is greater than another field. |
Is Address | Asserts that the values contain the specified required elements of an address. |
Is Credit Card | Asserts that the values are credit card numbers. |
Is Replica Of | Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s). |
Is Type | Asserts that the data is of a specific type. |
Less Than | Asserts that the field is a number less than (or equal to) a value. |
Less Than Field | Asserts that this field is less than another field. |
Matches Pattern | Asserts that a field must match a pattern. |
Max Length | Asserts that a string has a maximum length. |
Max Value | Asserts that a field has a maximum value. |
Metric | Records the value of the selected field during each scan operation and asserts that the value is within a specified range (inclusive). |
Min Length | Asserts that a string has a minimum length. |
Min Partition Size | Asserts the minimum number of records that should be loaded from each file or table partition. |
Min Value | Asserts that a field has a minimum value. |
Not Exists In | Asserts that values assigned to this field do not exist as values in another field. |
Not Future | Asserts that the field's value is not in the future. |
Not Negative | Asserts that this is a non-negative number. |
Not Null | Asserts that the field's value is not explicitly set to nothing. |
Positive | Asserts that this is a positive number. |
Predicted By | Asserts that the actual value of a field falls within an expected predicted range. |
Required Values | Asserts that all of the defined values must be present at least once within a field. |
Satisfies Expression | Evaluates the given expression (any valid Spark SQL) for each record. |
Sum | Asserts that the sum of a field is a specific amount. |
Time Distribution Size | Asserts that the count of records for each interval of a timestamp is between two numbers. |
Unique | Asserts that the field's value is unique. |
User Defined Function | Asserts that the given user-defined function (as Scala script) evaluates to true over the field's value. |
Volumetrics | Asserts that the data volume (rows or bytes) remains within dynamically inferred thresholds based on historical trends (daily, weekly, monthly). |
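As an illustration of the expression-based rule types above, a Satisfies Expression check evaluates a Spark SQL predicate for each record. A minimal sketch with hypothetical field names:

```sql
-- Assert that a discount never exceeds half of the list price
discount_amount <= list_price * 0.5
```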
Manage Checks in Datastore
Managing your checks within a datastore is important to maintain data integrity and ensure quality. You can categorize, create, update, archive, restore, delete, and clone checks, making it easier to apply validation rules across the datastores. The system allows for checks to be set as active, draft, or archived based on their current state of use. You can also define reusable templates for quality checks to streamline the creation of multiple checks with similar criteria. With options for important and favorite, users have full flexibility to manage data quality efficiently.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to manage your checks.
Step 2: Click on the "Checks" from the Navigation Tab.
Categorize Checks
You can categorize your checks based on their status, such as Active, Draft, Archived (Invalid and Discarded), or All, according to your preference. This categorization offers a clear view of the data quality validation process, helping you manage checks efficiently and maintain data integrity.
All
By selecting All Checks, you can view a comprehensive list of all the checks in the datastore, including both active and draft checks, allowing you to focus on the checks that are currently being managed or are in progress. However, archived checks are not displayed in this view.
Active
By selecting Active, you can view checks that are currently applied and being enforced on the data. These operational checks are used to validate data quality in real time, allowing you to monitor all active checks and their performance.
You can also categorize the active checks based on their importance and favorites to streamline your data quality monitoring.
1. Important: Shows only checks that are marked as important. These checks are prioritized based on their significance, typically assigned a weight of 7 or higher.
Note
Important checks are prioritized based on a weight of 7 or higher.
2. Favorite: Displays checks that have been marked as favorites. This allows you to quickly access checks that you use or monitor frequently.
3. All: Displays a comprehensive view of all active checks, including important, favorite and any checks that do not fall under these specific categories.
Draft Checks
By selecting Draft, you can view checks that have been created but have not yet been applied to the data. These checks are in the drafting stage, allowing for adjustments and reviews before activation. Draft checks provide flexibility to experiment with different validation rules without affecting the actual data.
You can also categorize the draft checks based on their importance and favorites to prioritize and organize them effectively during the review and adjustment process.
1. Important: Shows only checks that are marked as important. These checks are prioritized based on their significance, typically assigned a weight of 7 or higher.
2. Favorite: Displays checks that have been marked as favorites. This allows you to quickly access checks that you use or monitor frequently.
3. All: Displays a comprehensive view of all draft checks, including important, favorite and any checks that do not fall under these specific categories.
Archived Checks
By selecting Archived, you can view checks that have been marked as discarded or invalid from use but are still stored for future reference or restoration. Although these checks are no longer active, they can be restored if needed.
You can also categorize the archived checks based on their status as Discarded, Invalid, or view All archived checks to manage and review them effectively.
1. Discarded: Shows checks that have been marked as no longer useful or relevant and have been discarded from use.
2. Invalid: Displays checks that are deemed invalid due to errors or misconfigurations, requiring review or deletion.
3. All: Provides a view of all archived checks within this category, including discarded and invalid checks.
Check Details
Check Details provides important information about each check in the system. It shows when a check was last run, how often it has been used, when it was last updated, who made changes to it, and when it was created. This section helps users understand the status and history of the check, making it easier to manage and track its use over time.
Step 1: Locate the check you want to review, then hover over the info icon to view the Check Details.
A popup will appear with additional details about the check.
Last Asserted
Last Asserted At shows the most recent time the check was run, indicating when the last validation occurred. For example, the check was last asserted on Oct 17, 2023, at 2:37 AM (GMT+5:30).
Scans
Scans show how many times the check has been used in different operations. It helps you track how often the check has been applied. For example, the check was used in 30 operations.
Updated At
Updated At shows the most recent time the check was modified or updated. It helps you see when any changes were made to the check’s configuration or settings. For example, the check was last updated on Sep 9, 2024, at 3:18 PM (GMT+5:30).
Last Editor
Last Editor indicates who most recently made changes to the check. It helps track who is responsible for the latest updates or modifications. This is useful for accountability and collaboration within teams.
Created At
Created At shows when the check was first made. It helps you know how long the check has been in use. This is useful for tracking its history. For example, the check was created on Oct 17, 2023, at 2:19 PM (GMT+5:30).
Status Management of Checks
Set Check as Draft
You can move an active check into a draft state, allowing you to work on the check, make adjustments, and refine the validation rules without affecting live data. This is useful when you need to temporarily deactivate a check for review and updates. There are two ways to move an active check to draft: you can draft a specific check or draft multiple checks in bulk.
Method I: Draft Specific Check
Step 1: Click on the active check that you want to move to the draft state.
For demonstration purposes, we have selected the "Between" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Draft" from the drop-down menu.
Step 3: After clicking on "Draft", the check will be successfully moved to the draft state, and a success flash message will appear stating, "Selected checks have been successfully updated."
Method II. Draft Checks in Bulk
You can move multiple checks into the draft state in one action, allowing you to pause or make adjustments to several checks without affecting your active validation process.
Step 1: Hover over the active checks and click on the checkbox to select multiple checks.
Step 2: Click on the vertical ellipses (⋮) and select "Draft" from the dropdown menu to move active checks to the draft state.
A confirmation modal window titled Bulk Update Checks to Draft will appear, indicating the number of checks being moved to draft.
Step 3: Click the "Update" button to move the selected active checks to draft.
After clicking the "Update" button, your selected checks will be moved to draft, and a success message will appear stating, "Selected checks have been successfully updated."
Activate Draft Check
You can activate a draft check once you have worked on it, made adjustments, and refined the validation rules. Activating the draft check makes it live and ensures that the defined criteria are enforced on the data. There are two ways to activate draft checks: you can activate a specific check or activate multiple checks in bulk.
Method I. Activate Specific Check
Step 1: Navigate to the Draft check section, and click on the drafted check that you want to activate, whether you have made changes or wish to activate it as is.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details. If you want to make any changes to the check details, you can edit them.
Step 2: Click on the down arrow icon with the Update button. A dropdown menu will appear, click on the Activate button.
Step 3: After clicking on the activate button, your check is successfully moved to the active checks, and a success flash message will appear stating "Check successfully updated".
Method II. Activate Draft Checks in Bulk
Step 1. Hover over the draft checks and click on the checkbox to select multiple checks in bulk.
When multiple checks are selected, an action toolbar appears, displaying the total number of checks chosen along with a vertical ellipsis for additional bulk action options.
Step 2. Click on the vertical ellipsis (⋮) and choose "Activate" from the dropdown menu to activate the selected checks.
Step 3. A confirmation modal window “Bulk Activate Check” will appear, click on the “Activate” button to activate the draft checks.
After clicking on the activate button, your draft checks will be activated and a success flash message will appear stating “Selected checks have been successfully updated”.
Set Check as Archived
You can move an active or draft check into the archive when it is no longer relevant but may still be needed for historical purposes or future use. Archiving helps keep your checks organized without permanently deleting them. There are two ways to archive checks: you can archive individual checks or archive multiple checks in bulk.
Method I: Archive Specific Check
You can archive a specific check using two ways: either by directly clicking the archive button on the check or by opening the check and selecting the archive option from the action menu.
1. Archive Directly
Step 1: Locate the check (whether Active or Draft) which you want to archive and click on the Archive icon (represented by a box with a downward arrow) located on the right side of the check.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window titled "Archive Check" will appear, providing you with the following archive options:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 3: Once you've made your selection, click the Archive button to proceed.
Step 4: After clicking on the Archive button, your check is moved to the archive and a flash message will appear saying "Check has been successfully archived".
2. Archive from Action Menu
Step 1: Click on the check from the list of available (whether Active or Draft) checks that you want to archive.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on the "Archive" from the drop-down menu.
Step 3: A modal window titled “Archive Check” will appear, providing you with the following archive options:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button, your check is moved to the archive and a flash message will appear saying "Check has been successfully archived".
Method II: Archive Checks in Bulk
You can archive multiple checks in a single step, deactivating and storing them for future reference or restoration while keeping your active checks uncluttered.
Step 1: Hover over the checks (whether Active or Draft) and click on the checkbox to select multiple checks.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Archive" from the dropdown menu to archive the selected checks.
A modal window will appear, providing you with the following archive options:
1. Delete all anomalies associated with the checks: Toggle this option "On" if you want to delete any anomalies related to the selected checks when archiving them.
2. Archive Options: You are presented with two options to categorize why the checks are being archived:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 3: Once you've made your selections, click the "Archive" button to confirm and archive the checks.
Step 4: After clicking the "Archive" button, your selected checks (whether Active or Draft) will be archived successfully and a success flash message will appear stating, "Checks have been successfully archived."
Restore Archived Checks
If a check has been archived, you can restore it to an active or draft state. This allows you to reuse checks that were previously archived without having to recreate them from scratch.
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
Step 2: Click on the archived check which you want to restore as an active or draft check.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details.
Step 3: If you want to make any changes to the check, you can edit it. Otherwise, click on the Restore button to restore it as an active check.
To restore the check as a draft, click on the arrow icon next to the Restore button. A dropdown menu will appear—select Restore as Draft from the options.
After clicking the Restore button, the check will be successfully restored as either an active or draft check, depending on your selection. A success message will appear confirming, "Check successfully updated."
Edit Check
You can edit an existing check to modify its properties, such as the rule type, coverage, filter clause, or description. Updating a check ensures that it stays aligned with evolving data requirements and maintains data quality as conditions change. There are two methods for editing checks: you can either edit specific checks or edit multiple checks in bulk.
Note
When editing multiple checks in bulk, only the filter clause and tags can be modified.
Method I. Edit Specific Check
Step 1: Click on the check you want to edit, whether it is an active or draft check.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences.
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the fields, filter clause, coverage, description, tags, or metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Method II. Edit Checks in Bulk
You can easily apply changes to multiple checks at once, saving time by editing several checks simultaneously without having to modify each one individually.
Step 1: Hover over the checks (whether Active or Draft) and click on the checkbox to select multiple checks.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and select "Edit" from the dropdown menu to make changes to the selected checks.
Step 3: A modal window titled "Bulk Edit Checks" will appear. Here you can only modify the "filter clause" and "tags" of the selected checks.
Step 4: Toggle on the options (Filter Clause or Tags) that you want to modify for the selected checks, and make the necessary changes.
Note
This action will overwrite the existing data for the selected checks.
Step 5: Once you have made the changes, click on the "Save" button.
After clicking the "Save" button, your selected checks will be updated with the new changes. A success message will appear stating, "Selected checks have been successfully updated."
Delete Checks
You can delete a check permanently, removing it from the system, and this is an irreversible action. Once you delete it, the check cannot be restored. By deleting the check, you ensure it will no longer appear in active or archived lists, making the system more streamlined and organized. There are two methods for deleting checks: you can either delete individual checks or delete multiple checks in bulk.
Note
You can only delete archived checks. If you want to delete an active or draft check, you must first move it to the archive, and then you can delete it.
Warning
Deleting a check is a one-time action. It cannot be restored after deletion.
Method I. Delete Specific Check
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
Step 2: Locate the check that you want to delete and click on the Delete icon located on the right side of the check.
For demonstration purposes, we have selected the "Time Distribution Size" check.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the check from the system.
Step 4: After clicking on the delete button, your check is successfully deleted and a success flash message will appear saying "Check has been successfully deleted"
Method II. Delete Check in Bulk
You can permanently delete multiple checks from the system in one action. This process is irreversible, so it should be used when you are certain that the checks are no longer needed.
Note
For archived checks, the only available bulk action is Bulk Delete. There is no option to bulk restore archived checks to a draft or active state.
Step 1: Hover over the archived checks and click on the checkbox to select checks in bulk.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Delete" from the dropdown menu to delete the selected checks.
Step 3: A confirmation modal window will appear, click on the "Delete" button to permanently delete the selected checks.
After clicking on the "Delete" button, your selected checks will be permanently deleted, and a success flash message will appear stating, "Checks have been successfully deleted."
Mark Check as Favorite
Marking a check as a favorite allows you to quickly access and prioritize the checks that are most important to your data validation process. This helps streamline workflows by keeping frequently used or critical checks easily accessible, ensuring you can monitor and manage them efficiently. By marking a check as a favorite, it will appear in the "Favorite" category for faster retrieval and management.
Step 1: Locate the check which you want to mark as a favorite and click on the bookmark icon located on the right side of the check.
After clicking on the bookmark icon, your check is successfully marked as a favorite and a success flash message will appear stating "Check has been favorited".
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites.
Clone Check
You can clone both active and draft checks to create a duplicate copy of an existing check. This is useful when you want to create a new check based on the structure of an existing one, allowing you to make adjustments without affecting the original check.
Step 1: Click on the check (whether Active or Draft) that you want to clone.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Clone" from the drop-down menu.
Step 3: After clicking the Clone button, a modal window will appear. This window allows you to adjust the cloned check's details.
1. If you toggle on the "Associate with a Check Template" option, the cloned check will be linked to a specific template.
Choose a Template from the dropdown menu that you want to associate with the cloned check. The check will inherit properties from the selected template.
- Locked: The check will automatically sync with any future updates made to the template, but you won't be able to modify the check's properties directly.
- Unlocked: You can modify the check, but future updates to the template will no longer affect this check.
2. If you toggle off the "Associate with a Check Template" option, the cloned check will not be linked to any template, which allows you full control to modify the properties independently.
Select the appropriate Rule type for the check from the dropdown menu.
Step 4: Once you have selected the template or rule type, fill in the remaining check details as required.
Step 5: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 6: Once you have a successful validation, click the "Save" button. The system will save any modifications you've made and create a clone of the check based on your changes.
After clicking on the "Save" button, your check is successfully created and a success flash message will appear stating "Check successfully created".
Create a Quality Check template
You can add checks as a Template, which allows you to create a reusable framework for quality checks. By using templates, you standardize the validation process, enabling the creation of multiple checks with similar rules and criteria across different datastores. This ensures consistency and efficiency in managing data quality checks.
Step 1: Locate the check (whether Active or Draft) that you want to save as a template and click on it.
For demonstration purposes, we have selected the "Not Exists In" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Template" from the drop-down menu.
After clicking the "Template" button, the check will be saved and created as a template in the library, and a success flash message will appear stating, "The quality check template has been created successfully." This allows you to reuse the template for future checks, streamlining the validation process.
Authored Check
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu.
Step 2: Click on the "Checks" from the Navigation Tab.
Step 3: In the top-right corner, click on the "Add" button, then select "Check" from the dropdown menu.
A modal window titled “Authored Check Details” will appear, providing you the options to add the authored check details.
Step 4: Enter the following details to add the authored check:
1. Associate with a Check Template:
- If you toggle ON the "Associate with a Check Template" option, the check will be linked to a specific template.
- If you toggle OFF the "Associate with a Check Template" option, the check will not be linked to any template, which allows you full control to modify the properties independently.
2. Rule Type (Required): Select a Rule from the dropdown menu, such as checking for non-null values, matching patterns, comparing numerical values, or verifying date-time constraints. Each rule type defines the specific validation logic to be applied.
For demonstration purposes, we have selected the After Date Time rule type.
For more details about the available rule types, refer to the "Rule Types Overview" documentation.
Note
Different rule types have different sets of fields and options appearing when selected.
3. File (Required): Select a file from the dropdown menu on which the check will be performed.
4. Field (Required): Select a field from the dropdown menu on which the check will be performed.
5. Filter Clause: Specify a valid Spark SQL WHERE expression to filter the data on which the check will be applied.
The filter clause defines the conditions under which the check will be applied. It is the condition you would normally write after a WHERE keyword, specifying which rows or data points should be included in the check (see the sketch after this list).
6. Date (Required): Enter the reference date for the rule. For the After Date Time rule, records in the selected field must have a timestamp later than this specified date.
7. Coverage: Adjust the Coverage setting to specify the percentage of records that must comply with the check.
Note
The Coverage setting applies to most rule types and allows you to specify the percentage of records that must meet the validation criteria.
8. Description (Required): Enter a detailed description of the check template, including its purpose, applicable data, and relevant information to ensure clarity for users. If you're unsure of what to include, click on the "💡" lightbulb icon to apply a suggested description based on the rule type.
Example: The Date of Birth must be a timestamp later than the specified date.
This description specifies that the Date of Birth field must have a timestamp later than the reference date entered for the rule.
9. Tag: Assign relevant tags to your check to facilitate easier searching and filtering based on categories like "data quality," "financial reports," or "critical checks."
10. Additional Metadata: Add key-value pairs as additional metadata to enrich your check. Click the plus icon (+) next to this section to open the metadata input form, where you can add key-value pairs.
Enter the desired key-value pairs (e.g., DataSourceType: SQL Database and PriorityLevel: High). After entering the necessary metadata, click "Confirm" to save the custom metadata.
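As a minimal sketch of the Filter Clause from item 5 above (the field names O_ORDERSTATUS and O_ORDERDATE are hypothetical and only for illustration), only the bare condition is entered, with no WHERE keyword and no trailing semicolon:

```sql
-- Hypothetical filter clause for an authored check:
-- restricts the check to open orders placed on or after 1995-01-01
O_ORDERSTATUS = 'O' AND O_ORDERDATE >= '1995-01-01'
```

Only records matching this condition are evaluated by the check; all other rows are ignored during the scan.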
Step 5: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message will appear saying "Validation Successful".
If the validation fails, a red message will appear saying "Failed Validation". This typically occurs when the check logic or parameters do not match the data properly.
Step 6: Once you have a successful validation, click the "Save" button.
After clicking on the "Save" button, your check is successfully created and a success flash message will appear saying "Check successfully created".
Author a Check via API
Users are able to author and interact with checks through the API by passing JSON payloads. Please refer to the API documentation for details: acme.qualytics.io/api/docs
Inferred Check
Qualytics automatically generates and maintains inferred checks by Profiling the Datastore, performing statistical analysis followed by various machine learning methods.
Info
Inferred checks will be automatically updated with the next Profiling run. Manually updating an inferred check will take it out of the automatic update workflow.
Inference Engine
- After metadata is generated by a Profile Operation, the Inference Engine is initiated to kick off Inductive and Unsupervised learning methods.
- Available data is partitioned into a training set and a testing set.
- The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints.
- Those inferred constraints are then filtered by testing them against the held-out testing set, and only those that assert true above a certain threshold are converted and exposed to users as Inferred Checks.
Rule Types
Rule Types Overview
In Qualytics, a variety of rule types are provided to maintain data quality and integrity. These rules define specific criteria that data must meet, and checks apply these rules during the validation process.
Here’s an overview of the rule types and their purposes:
Check Rule Types
Rule Type | Description |
---|---|
After Date Time | Asserts that the field is a timestamp later than a specific date and time. |
Any Not Null | Asserts that one of the fields must not be null. |
Before DateTime | Asserts that the field is a timestamp earlier than a specific date and time. |
Between | Asserts that values are equal to or between two numbers. |
Between Times | Asserts that values are equal to or between two dates or times. |
Contains Credit Card | Asserts that the values contain a credit card number. |
Contains Email | Asserts that the values contain email addresses. |
Contains Social Security Number | Asserts that the values contain social security numbers. |
Contains Url | Asserts that the values contain valid URLs. |
Distinct Count | Asserts on the approximate count distinct of the given column. |
Entity Resolution | Asserts that every distinct entity is appropriately represented once and only once. |
Equal To Field | Asserts that this field is equal to another field. |
Exists In | Asserts that the rows of a compared table/field of a specific Datastore exist in the selected table/field. |
Expected Schema | Asserts that all selected fields are present and that all declared data types match expectations. |
Expected Values | Asserts that values are contained within a list of expected values. |
Field Count | Asserts that there must be exactly a specified number of fields. |
Greater Than | Asserts that the field is a number greater than (or equal to) a value. |
Greater Than Field | Asserts that this field is greater than another field. |
Is Address | Asserts that the values contain the specified required elements of an address. |
Is Credit Card | Asserts that the values are credit card numbers. |
Is Replica Of | Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s). |
Is Type | Asserts that the data is of a specific type. |
Less Than | Asserts that the field is a number less than (or equal to) a value. |
Less Than Field | Asserts that this field is less than another field. |
Matches Pattern | Asserts that a field must match a pattern. |
Max Length | Asserts that a string has a maximum length. |
Max Value | Asserts that a field has a maximum value. |
Metric | Records the value of the selected field during each scan operation and asserts that the value is within a specified range (inclusive). |
Min Length | Asserts that a string has a minimum length. |
Min Partition Size | Asserts the minimum number of records that should be loaded from each file or table partition. |
Min Value | Asserts that a field has a minimum value. |
Not Exists In | Asserts that values assigned to this field do not exist as values in another field. |
Not Future | Asserts that the field's value is not in the future. |
Not Negative | Asserts that this is a non-negative number. |
Not Null | Asserts that the field's value is not explicitly set to nothing. |
Positive | Asserts that this is a positive number. |
Predicted By | Asserts that the actual value of a field falls within an expected predicted range. |
Required Values | Asserts that all of the defined values must be present at least once within a field. |
Satisfies Expression | Evaluates the given expression (any valid Spark SQL ) for each record. |
Sum | Asserts that the sum of a field is a specific amount. |
Time Distribution Size | Asserts that the count of records for each interval of a timestamp is between two numbers. |
Unique | Asserts that the field's value is unique. |
User Defined Function | Asserts that the given user-defined function (as Scala script) evaluates to true over the field's value. |
Volumetric Check | Asserts that the volume of the data asset has not changed by more than an inclusive percentage amount for the prescribed moving daily average. |
After Date Time
Definition
Asserts that the field is a timestamp later than a specific date and time.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
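To make the usage patterns above concrete, here is a minimal sketch of filter expressions, assuming an ORDERS-style table with O_ORDERSTATUS, O_TOTALPRICE, and O_ORDERDATE fields. Conditions are written as bare Spark SQL expressions; full SELECT ... WHERE statements are not accepted as filters:

```sql
-- Direct condition: a bare boolean expression
O_ORDERSTATUS = 'F'

-- Combining conditions with logical operators
O_ORDERSTATUS = 'F' AND O_TOTALPRICE > 1000

-- Utilizing Spark SQL functions
YEAR(O_ORDERDATE) = 1992

-- Incorrect: a full statement with SELECT/WHERE is not a valid filter
-- SELECT * FROM ORDERS WHERE O_ORDERSTATUS = 'F'
```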
Specific Properties
Specify a particular date and time to act as the threshold for the rule.
Name | Description |
---|---|
Date | The timestamp used as the lower boundary. Values in the selected field should be after this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERDATE entries in the ORDERS table are later than 10:30 AM on December 31st, 1991.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 1991-12-31 10:30:00 |
2 | 1992-01-02 09:15:00 |
3 | 1991-12-14 10:25:00 |
{
"description": "Ensure that all O_ORDERDATE entries in the ORDERS table are later than 10:30 AM on December 31st, 1991.",
"coverage": 1,
"properties": {
"datetime": "1991-12-31 10:30:00"
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "afterDateTime",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_ORDERDATE
values are not after 1991-12-31 10:30:00.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Is O_ORDERDATE > '1991-12-31 10:30:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
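As an illustration only (in the TPC-H style used elsewhere in this guide, not the platform's internal implementation), a query that surfaces the records this rule would flag might look like:

```sql
-- Rows violating "O_ORDERDATE must be later than 1991-12-31 10:30:00"
select o_orderkey, o_orderdate
from orders
where o_orderdate <= timestamp '1991-12-31 10:30:00';
```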
Potential Violation Messages
Record Anomaly
The O_ORDERDATE
value of 1991-12-14 10:30:00
is not later than 1991-12-31 10:30:00
Shape Anomaly
In O_ORDERDATE
, 66.667% of 3 filtered records (2) are not later than 1991-12-31 10:30:00
Aggregation Comparison
Definition
Verifies that the specified comparison operator evaluates true when applied to two aggregation expressions.
In-Depth Overview
The Aggregation Comparison
is a rule that allows for the dynamic analysis of aggregations across different datasets. It empowers users to establish data integrity by ensuring that aggregate values meet expected comparisons, whether they are totals, averages, counts, or any other aggregated metric.
By setting a comparison between aggregates from potentially different tables or even source datastores, this rule confirms that relationships between data points adhere to business logic or historical data patterns. This is particularly useful when trying to validate interrelated financial reports, summary metrics, or when monitoring the consistency of data ingestion over time.
Field Scope
Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Facilitates the comparison between a target
aggregate metric and a reference
aggregate metric across different datasets.
Name | Description |
---|---|
Target Aggregation | Specifies the aggregation expression to evaluate |
Comparison | Select the comparison operator (e.g., greater than, less than, etc.) |
Datastore | Identifies the source datastore for the reference aggregation |
Table/File | Specifies the table or file for the reference aggregation |
Reference Aggregation | Defines the reference aggregation expression to compare against |
Reference Filter | Applies a filter to the reference aggregation if necessary |
Details
It's important to understand that each aggregation must result in a single row. Also, similar to Spark expressions, the aggregation expressions must be written in a valid format for DataFrames.
Examples
Simple Aggregations
Combining with SparkSQL Functions
Complex Aggregations
Aggregation Expressions
Here are some common aggregate functions used in SparkSQL:
- `SUM`: Calculates the sum of all values in a column.
- `AVG`: Calculates the average of all values in a column.
- `MAX`: Returns the maximum value in a column.
- `MIN`: Returns the minimum value in a column.
- `COUNT`: Counts the number of rows in a column.
For a detailed list of valid SparkSQL aggregation functions, refer to the Apache Spark SQL documentation.
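A minimal sketch of aggregation expressions at increasing levels of complexity, using the TPC-H column names from the example below; each expression must reduce to a single row:

```sql
-- Simple aggregation
SUM(O_TOTALPRICE)

-- Combining with Spark SQL functions
ROUND(SUM(O_TOTALPRICE), 2)

-- Complex aggregation with arithmetic inside the aggregate
ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
```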
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the aggregated sum of total_price
from the ORDERS
table matches the aggregated and rounded sum of calculated_price
from the LINEITEM
table.
Info
The calculated_price
in this example is represented by the sum of each product's extended price, adjusted for discount and tax.
Sample Data
Aggregated data from ORDERS (Target)
TOTAL_PRICE |
---|
5000000 |
Aggregated data from LINEITEM (Reference)
CALCULATED_PRICE |
---|
4999800 |
Inputs
- Target Aggregation: ROUND(SUM(O_TOTALPRICE))
- Comparison: eq (Equal To), lt (Less Than), lte (Less Than or Equal to), gte (Greater Than or Equal To), gt (Greater Than)
- Reference Aggregation: ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
{
"description": "Ensure that the aggregated sum of total_price from the ORDERS table matches the aggregated and sum of l_totalprice from the LINEITEM table",
"coverage": 1,
"properties": {
"comparison": "eq",
"expression": f"SUM(O_TOTALPRICE)",
"ref_container_id": ref_container_id,
"ref_datastore_id": ref_datastore_id,
"ref_expression": f"SUM(L_TOTALPRICE)",
"ref_filter": "1=1",
},
"tags": [],
"fields": ["O_TOTALPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "aggregationComparison",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the aggregated TOTAL_PRICE
from the ORDERS
table is 5000000, while the aggregated and rounded CALCULATED_PRICE
from the LINEITEM
table is 4999800. The difference between these totals indicates a potential anomaly, suggesting issues in data calculation or recording methods.
graph TD
A[Start] --> B[Retrieve Aggregated Values]
B --> C{Do Aggregated Totals Match?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query related to the rule using TPC-H tables.
with orders_agg as (
select
round(sum(o_totalprice)) as total_order_price
from
orders
),
lineitem_agg as (
select
round(sum(l_extendedprice * (1 - l_discount) * (1 + l_tax))) as calculated_price
from
lineitem
),
comparison as (
select
o.total_order_price,
l.calculated_price
from
orders_agg o
cross join lineitem_agg l
)
select * from comparison
where comparison.total_order_price != comparison.calculated_price;
Potential Violation Messages
Shape Anomaly
ROUND(SUM(O_TOTALPRICE))
is not equal to ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
.
Any Not Null
Definition
Asserts that at least one of the selected fields must hold a value.
Field Scope
Multiple: The rule evaluates multiple specified fields.
Accepted Types
Type |
---|
Date |
Timestamp |
Integral |
Fractional |
String |
Boolean |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that for every record in the ORDERS table, at least one of the fields (O_COMMENT, O_ORDERSTATUS) isn't null.
Sample Data
O_ORDERKEY | O_COMMENT | O_ORDERSTATUS |
---|---|---|
1 | NULL | NULL |
2 | Good product | NULL |
3 | NULL | Shipped |
{
"description": "Ensure that for every record in the ORDERS table, at least one of the fields (O_COMMENT, O_ORDERSTATUS) isn't null",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["O_ORDERSTATUS","O_COMMENT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "anyNotNull",
"container_id": {container_id},
"template_id": {template_id},
"filter": "_PARITY = 'odd'"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
1 does not satisfy the rule because both O_COMMENT
and O_ORDERSTATUS
do not hold a value.
graph TD
A[Start] --> B[Retrieve O_COMMENT and O_ORDERSTATUS]
B --> C{Is Either Field Not Null?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
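For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows where neither O_COMMENT nor O_ORDERSTATUS holds a value
select o_orderkey, o_comment, o_orderstatus
from orders
where o_comment is null and o_orderstatus is null;
```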
Potential Violation Messages
Record Anomaly
There is no value set for any of O_COMMENT, O_ORDERSTATUS
Shape Anomaly
In O_COMMENT, O_ORDERSTATUS
, 33.333% of 3 filtered records (1) have no value set for any of O_COMMENT, O_ORDERSTATUS
Before Date Time
Definition
Asserts that the field is a timestamp earlier than a specific date and time.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify a particular date and time to act as the threshold for the rule.
Name | Description |
---|---|
Date | The timestamp used as the upper boundary. Values in the selected field should be before this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_SHIPDATE entries in the LINEITEM table are earlier than 3:00 PM on December 1st, 1998.
Sample Data
L_ORDERKEY | L_SHIPDATE |
---|---|
1 | 1998-12-01 15:30:00 |
2 | 1998-11-02 12:45:00 |
3 | 1998-08-01 10:20:00 |
{
"description": "Make sure datetime values are earlier than 3:00 PM, Dec 01, 1998",
"coverage": 1,
"properties": {
"datetime":"1998-12-01T15:00:00Z"
},
"tags": [],
"fields": ["L_SHIPDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "beforeDateTime",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with L_ORDERKEY
1 does not satisfy the rule because its L_SHIPDATE
value is not before 1998-12-01 15:00:00.
graph TD
A[Start] --> B[Retrieve L_SHIPDATE]
B --> C{Is L_SHIPDATE < '1998-12-01 15:00:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
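For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows violating "L_SHIPDATE must be earlier than 1998-12-01 15:00:00"
select l_orderkey, l_shipdate
from lineitem
where l_shipdate >= timestamp '1998-12-01 15:00:00';
```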
Potential Violation Messages
Record Anomaly
The L_SHIPDATE
value of 1998-12-01 15:30:00
is not earlier than 1998-12-01 15:00:00.
Shape Anomaly
In L_SHIPDATE
, 33.33% of 3 filtered records (1) are not earlier than 1998-12-01 15:00:00.
Between
Definition
Asserts that values are equal to or between two numbers.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Integral |
Fractional |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify both minimum and maximum boundaries, and determine if these boundaries should be inclusive.
Name | Explanation |
---|---|
Max | The upper boundary of the range. |
Inclusive (Max) | If true, the upper boundary is considered a valid value within the range. Otherwise, it's exclusive. |
Min | The lower boundary of the range. |
Inclusive (Min) | If true, the lower boundary is considered a valid value within the range. Otherwise, it's exclusive. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are between 5 and 20 (inclusive).
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | 4 |
2 | 15 |
3 | 21 |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are between 5 and 20 (inclusive)",
"coverage": 1,
"properties": {
"min":5
"inclusive_min":true,
"max":20,
"inclusive_max":true,
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "between",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are not between 5 and 20 inclusive.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is 5 <= L_QUANTITY <= 20?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
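For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows violating "L_QUANTITY must be between 5 and 20 (inclusive)"
select l_orderkey, l_quantity
from lineitem
where l_quantity < 5 or l_quantity > 20;
```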
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of 4 is not between 5.000 and 20.000.
Shape Anomaly
In L_QUANTITY
, 66.67% of 3 filtered records (2) are not between 5.000 and 20.000.
Between Times
Definition
Asserts that values are equal to or between two dates or times.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the range of dates or times that values in the selected field should fall between.
Name | Description |
---|---|
Min | The timestamp used as the lower boundary. Values in the selected field should be after this timestamp. |
Max | The timestamp used as the upper boundary. Values in the selected field should be before this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERDATE entries in the ORDERS table are between 10:30 AM on January 1st, 1991 and 3:00 PM on December 31st, 1991.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 1990-12-31 10:30:00 |
2 | 1991-06-02 09:15:00 |
3 | 1992-01-01 01:25:00 |
{
"description": "Ensure that all O_ORDERDATE entries in the ORDERS table are between 10:30 AM on January 1st, 1991 and 3:00 PM on December 31st, 1991",
"coverage": 1,
"properties": {
"min_time":"1991-01-01T10:30:00Z",
"max_time":"1991-12-31T15:00:00Z"
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "betweenTimes",
"container_id": {container_id},
"template_id": {template_id},
"filter": "_PARITY = 'odd'"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_ORDERDATE
values are not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Is '1991-01-01 10:30:00' <= O_ORDERDATE <= '1991-12-31 15:00:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
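For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows outside the 1991-01-01 10:30:00 to 1991-12-31 15:00:00 window
select o_orderkey, o_orderdate
from orders
where o_orderdate < timestamp '1991-01-01 10:30:00'
   or o_orderdate > timestamp '1991-12-31 15:00:00';
```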
Potential Violation Messages
Record Anomaly
The value for O_ORDERDATE
of 1990-12-31 10:30:00
is not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
Shape Anomaly
In O_ORDERDATE
, 66.667% of 3 filtered records (2) are not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
Contains Credit Card
Definition
Asserts that the values contain a credit card number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
String |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that every O_PAYMENT_DETAILS in the ORDERS table contains a credit card number to confirm the payment method used for each order.
Sample Data
O_ORDERKEY | O_PAYMENT_DETAILS |
---|---|
1 | {"date": "2023-09-25", "amount": 250.50, "credit_card": "5105105105105100"} |
2 | {"date": "2023-09-25", "amount": 150.75, "credit_card": "ABC12345XYZ"} |
3 | {"date": "2023-09-25", "amount": 200.00, "credit_card": "4111-1111-1111-1111"} |
{
"description": "Ensure that every O_PAYMENT_DETAILS in the ORDERS table contains a credit card number to confirm the payment method used for each order",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CCN_JSON"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsCreditCard",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 violates the rule as the O_PAYMENT_DETAILS
does not contain a credit card number, indicating an incomplete order record.
graph TD
A[Start] --> B[Retrieve O_PAYMENT_DETAILS]
B --> C{Contains Credit Card Number?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
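For illustration only, a crude Spark SQL heuristic in the spirit of this rule (the platform's actual detection logic is more robust than this simple pattern) might flag payment details lacking a card-like digit sequence:

```sql
-- Flag O_PAYMENT_DETAILS values with no 13-19 digit card-like number
select o_orderkey, o_payment_details
from orders
where not (o_payment_details rlike '[0-9][0-9 -]{11,17}[0-9]');
```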
Potential Violation Messages
Record Anomaly
The O_PAYMENT_DETAILS
value of {"date": "2023-09-25", "amount": 150.75, "credit_card": "ABC12345XYZ"}
does not contain a credit card number.
Shape Anomaly
In O_PAYMENT_DETAILS
, 33.33% of 3 order records (1) do not contain a credit card number.
Contains Email
Definition
Asserts that the values contain email addresses.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
String |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid email addresses.
Sample Data
C_CUSTKEY | C_CONTACT_DETAILS |
---|---|
1 | {"name": "John Doe", "email": "john.doe@example.com"} |
2 | {"name": "Amy Lu", "email": "amy.lu@"} |
3 | {"name": "Jane Smith", "email": "jane.smith@domain.org"} |
{
"description": "Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid email addresses",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_EMAIL_JSON"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsEmail",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CONTACT_DETAILS
value does not follow a typical email format.
graph TD
A[Start] --> B[Retrieve C_CONTACT_DETAILS]
B --> C{Does C_CONTACT_DETAILS contain an email address?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The C_CONTACT_DETAILS
value of {"name": "Amy Lu", "email": "amy.lu@"}
does not contain an email address.
Shape Anomaly
In C_CONTACT_DETAILS
, 33.333% of 3 filtered records (1) do not contain email addresses.
Contains Social Security Number
Definition
Asserts that the values contain a social security number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
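For example, an illustrative sketch using the sample CUSTOMER columns shown below (assumed names, not taken from the original guide):

```sql
-- Correct: combine conditions with AND / OR
C_CUSTKEY BETWEEN 1 AND 1000 AND C_CONTACT_DETAILS IS NOT NULL

-- Incorrect: a full SELECT statement is not a valid filter expression
SELECT * FROM CUSTOMER WHERE C_CUSTKEY BETWEEN 1 AND 1000
```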
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers.
Sample Data
C_CUSTKEY | C_CONTACT_DETAILS |
---|---|
1 | {"name": "John Doe", "ssn": "234567890"} |
2 | {"name": "Amy Lu", "ssn": "666-12-3456"} |
3 | {"name": "Jane Smith", "ssn": "429-14-2216"} |
{
"description": "Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CONTACT_DETAILS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsSocialSecurityNumber",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CONTACT_DETAILS
value contains an SSN beginning with 666, which is not a valid Social Security Number prefix.
graph TD
A[Start] --> B[Retrieve C_CONTACT_DETAILS]
B --> C{Does C_CONTACT_DETAILS contain a valid SSN format?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The C_CONTACT_DETAILS
value of {"name": "Amy Lu", "ssn": "666-12-3456"}
does not contain a social security number.
Shape Anomaly
In C_CONTACT_DETAILS
, 33.333% of 3 filtered records (1) do not contain social security numbers.
Contains URL
Definition
Asserts that the values contain valid URLs.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
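An illustrative sketch using the sample SUPPLIER columns below (assumed names, not from the original guide):

```sql
-- Correct: Spark SQL functions may be used within the expression
S_DETAILS IS NOT NULL AND length(S_DETAILS) > 0

-- Incorrect: WHERE clauses are not supported
WHERE S_DETAILS IS NOT NULL
```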
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all S_DETAILS entries in the SUPPLIER table contain valid URLs.
Sample Data
S_SUPPKEY | S_DETAILS |
---|---|
1 | {"name": "Tech Parts", "website": "www.techparts.com"} |
2 | {"name": "Hardwarepro", "website": "https://www.hardwarepro.com"} |
3 | {"name": "Smith's Tools", "website": "ftp:server:8080"} |
{
"description": "Ensure that all S_DETAILS entries in the SUPPLIER table contain valid URLs",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["S_DETAILS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsUrl",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with S_SUPPKEY
1 and 3 do not satisfy the rule because their S_DETAILS
values do not contain a valid URL pattern.
graph TD
A[Start] --> B[Retrieve S_DETAILS]
B --> C{Does S_DETAILS contain a valid URL?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The S_DETAILS
value of {"name": "Tech Parts", "website": "www.techparts.com"}
does not contain a valid URL.
Shape Anomaly
In S_DETAILS
, 66.667% of 3 filtered records (2) do not contain a valid URL.
Distinct Count
Definition
Asserts on the approximate count distinct of the given column.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
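For instance (an illustrative sketch only, using column names from the sample ORDERS data below):

```sql
-- Correct: direct condition on columns of the container
O_ORDERKEY > 0 AND O_ORDERSTATUS IS NOT NULL

-- Incorrect: WHERE (or any full SQL statement) is not accepted
WHERE O_ORDERSTATUS IS NOT NULL
```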
Specific Properties
Specify the distinct count expectation for the values in the field.
Name | Description |
---|---|
Value | The exact count of distinct values expected in the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that there are exactly 3 distinct O_ORDERSTATUS in the ORDERS table: 'O' (Open), 'F' (Finished), and 'P' (In Progress).
Sample Data
O_ORDERKEY | O_ORDERSTATUS |
---|---|
1 | O |
2 | F |
... | ... |
20 | X |
21 | O |
{
"description": "Ensure that there are exactly 3 distinct O_ORDERSTATUS in the ORDERS table: 'O' (Open), 'F' (Finished), and 'P' (In Progress)",
"coverage": 1,
"properties": {
"value":3
},
"tags": [],
"fields": ["O_ORDERSTATUS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "distinctCount",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the rule is violated because the O_ORDERSTATUS
contains 4 distinct values ('O', 'F', 'P', and 'X') rather than the expected 3: 'O' (Open), 'F' (Finished), and 'P' (In Progress).
graph TD
A[Start] --> B[Retrieve all O_ORDERSTATUS entries and count distinct values]
B --> C{Is distinct count of O_ORDERSTATUS != 3?}
C -->|Yes| D[Mark as Anomalous]
C -->|No| E[End]
Potential Violation Messages
Shape Anomaly
In O_ORDERSTATUS
, the distinct count of the records is not 3.
Entity Resolution
Definition
Asserts that every distinct entity is appropriately represented once and only once.
In-Depth Overview
This check performs automated entity name clustering to identify entities with similar names that likely represent
the same entity. It then assigns each cluster a unique entity identifier and asserts that every row with the same
entity identifier shares the same value for the designated distinction field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
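A brief sketch using the BUSINESS_NAME and BUSINESS_ID columns from the example below (illustrative only, not from the original guide):

```sql
-- Correct: ignore rows with blank names before clustering
BUSINESS_NAME IS NOT NULL AND trim(BUSINESS_NAME) <> ''

-- Incorrect: referencing another container in a subquery is not supported
BUSINESS_ID IN (SELECT id FROM other_container)
```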
Specific Properties
Name | Description |
---|---|
Distinction Field | The field that must hold a distinct value for every distinct entity |
Pair Substrings | Considers entities a match if one entity is part of the other |
Pair Homophones | Considers entities a match if they sound alike, even if spelled differently |
Spelling Similarity | The minimum similarity required for clustering two entity names |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: If you have a businesses table with an id field and a name field, this check can be configured to resolve name and use id as the distinction field. During each scan, similar names will be grouped and assigned the same entity identifier, and any rows that share the same entity identifier but have different values for id will be identified as anomalies.
Sample Data
BUSINESS_ID | BUSINESS_NAME |
---|---|
1 | ACME Boxing |
2 | Frank's Flowers |
3 | ACME Boxes |
{
"description": "Ensure a `businesses` table with an `BUSINESS_ID` field and a `BUSINESS_NAME` field shares the same `entity identifier`",
"coverage": 1,
"properties": {
"distinct_field_name":"BUSINESS_ID",
"pair_substrings":true,
"pair_homophones":true,
"spelling_similarity_threshold":0.6
},
"tags": [],
"fields": ["BUSINESS_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "entityResolution",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with BUSINESS_ID
1 and 3 do not satisfy the rule because their BUSINESS_NAME
values will be marked as similar, yet they do not share the same BUSINESS_ID.
graph TD
A[Start] --> B[Retrieve Original Data]
B --> C{Which entities are similar?}
C --> D[Assign each record an entity identifier]
D --> E[Cluster records by entity identifier]
E --> F{Do records with same<br/>entity identifier share the<br/>same distinction field value?}
F -->|Yes| I[End]
F -->|No| H[Mark as Anomalous]
H --> I
Equal To
Definition
Asserts that all of the selected fields equal a value.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
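For example (a sketch only; the column names come from the sample LINEITEM data below):

```sql
-- Correct: combine conditions with AND / OR
L_ORDERKEY > 0 AND L_QUANTITY IS NOT NULL

-- Incorrect: statement keywords such as WHERE are rejected
WHERE L_QUANTITY IS NOT NULL
```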
Specific Properties
Specify the value that the selected field should be equal to.
Name | Description |
---|---|
Value | Specifies the value a field should be equal to. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is equal to a value of 10.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 10 |
2 | 2 | 5 |
3 | 3 | 10 |
4 | 4 | 8 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is equal to a value of 10",
"coverage": 1,
"properties": {
"value":"10",
"inclusive":true
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "equalTo",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 4 do not satisfy the rule because their L_QUANTITY
values are not equal to the specified value of 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY = 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
Not all of the fields are equal to the value of 10
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are not equal to the value of 10
Equal To Field
Definition
Asserts that a field is equal to another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
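For example, an illustrative filter limiting the comparison to 1998 shipments (column names assumed from the sample ORDERS data below):

```sql
-- Correct: Spark SQL functions such as year() may be used
year(O_SHIPDATE) = 1998

-- Incorrect: WHERE clauses are not supported
WHERE year(O_SHIPDATE) = 1998
```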
Specific Properties
Specify the field to compare for equality with the selected field.
Name | Description |
---|---|
Field to compare | The field name whose values should match those of the selected field. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
String
String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.
Ignore Whitespace
When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.
Illustration
In this example, Value A and Value B are compared with the string comparison option Ignore Whitespace set to True.
Value A | Value B | Are equal? | Has whitespace? |
---|---|---|---|
Leonidas | Leonidas | True | No |
Beth | Beth | True | Yes |
Ana | Anna | False | Yes |
Joe | Joel | False | No |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Scenario: An e-commerce platform sells digital products. The shipping date (when the digital product link is sent) should always be the same as the delivery date (when the customer acknowledges receiving the product).
Objective: Ensure that the O_SHIPDATE in the ORDERS table matches its delivery date O_DELIVERYDATE.
Sample Data
O_ORDERKEY | O_SHIPDATE | O_DELIVERYDATE |
---|---|---|
1 | 1998-01-04 | 1998-01-04 |
2 | 1998-01-14 | 1998-01-15 |
3 | 1998-01-12 | 1998-01-12 |
{
"description": "Ensure that the O_SHIPDATE in the ORDERS table matches its delivery date O_DELIVERYDATE",
"coverage": 1,
"properties": {"field_name":"O_DELIVERYDATE", "inclusive":false},
"tags": [],
"fields": ["O_SHIPDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "equalToField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because its O_SHIPDATE
of 1998-01-14 does not match the O_DELIVERYDATE
of 1998-01-15.
graph TD
A[Start] --> B[Retrieve O_SHIPDATE and O_DELIVERYDATE]
B --> C{Is O_SHIPDATE = O_DELIVERYDATE?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_SHIPDATE
value of 1998-01-14 is not equal to the value of O_DELIVERYDATE which is 1998-01-15.
Shape Anomaly
In O_SHIPDATE
, 33.333% of 3 filtered records (1) are not equal to O_DELIVERYDATE.
Exists In
Definition
Asserts that values assigned to a field exist as values in another field.
In-Depth Overview
The ExistsIn
rule allows you to cross-validate data between different sources, whether it’s object storage systems or databases.
Traditionally, databases might utilize foreign key constraints (if available) to enforce data integrity between related tables. The ExistsIn
rule extends this concept in two powerful ways:
- Cross-System Integrity: it allows for integrity checks to span across different databases or even entirely separate systems. This is particularly advantageous in scenarios where data sources are fragmented across diverse platforms.
- Flexible Data Formats: Beyond just databases, this rule can validate values against various data formats, such as ensuring values in a file align with those in a table.
These enhancements enable businesses to maintain data integrity even in complex, multi-system environments.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
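An illustrative sketch with the sample NATION columns below (assumed names, not from the original guide):

```sql
-- Correct: simple direct condition
N_NATIONNAME IS NOT NULL

-- Incorrect: subqueries that reference other containers are not supported
N_NATIONNAME IN (SELECT COUNTRY_NAME FROM country_lookup)
```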
Specific Properties
Define the datastore, table/file, and field where the rule should look for matching values.
Name | Description |
---|---|
Datastore | The source datastore where the profile of the reference field is located. |
Table/file | The profile (e.g. table, view or file) containing the reference field. |
Field | The field name whose values should match those of the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all NATION_NAME entries in the NATION table match entries under the COUNTRY_NAME column in an external lookup file listing official country names.
Sample Data
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Algeria |
2 | Argentina |
3 | Atlantida |
{
"description": "Ensure that all NATION_NAME entries in the NATION table match entries under the COUNTRY_NAME column in an external lookup file listing official country names",
"coverage": 1,
"properties": {
"field_name":"COUNTRY_NAME",
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["NATION_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "existsIn",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Lookup File Sample
COUNTRY_NAME |
---|
Algeria |
Argentina |
Brazil |
Canada |
... |
Zimbabwe |
Anomaly Explanation
In the sample data above, the entry with N_NATIONKEY
3 does not satisfy the rule because the N_NATIONNAME
"Atlantida" does not match any COUNTRY_NAME
in the official country names lookup file.
graph TD
A[Start] --> B[Retrieve COUNTRY_NAME]
B --> C[Retrieve N_NATIONNAME]
C --> D{Does N_NATIONNAME exist in COUNTRY_NAME?}
D -->|Yes| E[Move to Next Record/End]
D -->|No| F[Mark as Anomalous]
F --> E
Potential Violation Messages
Record Anomaly
The N_NATIONNAME
value of 'Atlantida'
does not exist in COUNTRY_NAME
.
Shape Anomaly
In N_NATIONNAME
, 33.333% of 3 filtered records (1) do not match any COUNTRY_NAME
.
Expected Schema
Definition
Asserts that all of the selected fields are present in the datastore.
Behavior
The expected schema is the first check to be tested during a scan operation. If it fails, the scan operation will result as Failure
with the following message:
<container-name>
: Aborted because schema check anomalies were identified.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Specify the fields that must be present in the schema, and determine if a schema change caused by additional fields should fail or pass the assertion.
Name | Description |
---|---|
Fields | List of fields that must be present in the schema. |
Allow other fields | If true, new fields are allowed to be present in the schema. Otherwise, the assertion is stricter. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that expected fields such as L_ORDERKEY, L_PARTKEY, and L_SUPPKEY are always present in the LINEITEM table.
Sample Data
Valid
FIELD_NAME | FIELD_TYPE |
---|---|
L_ORDERKEY | NUMBER |
L_PARTKEY | NUMBER |
L_SUPPKEY | NUMBER |
L_LINENUMBER | NUMBER |
L_QUANTITY | NUMBER |
L_EXTENDEDPRICE | NUMBER |
... | ... |
Invalid
L_SUPPKEY is missing from the schema
FIELD_NAME | FIELD_TYPE |
---|---|
L_ORDERKEY | NUMBER |
L_PARTKEY | NUMBER |
L_LINENUMBER | NUMBER |
L_QUANTITY | NUMBER |
L_EXTENDEDPRICE | NUMBER |
... | ... |
{
"description": "Ensure that expected fields such as L_ORDERKEY, L_PARTKEY, and L_SUPPKEY are always present in the LINEITEM table",
"coverage": 1,
"properties": {
"allow_other_fields":false,
"list":["L_ORDERKEY","L_PARTKEY","L_SUPPKEY"]
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "expectedSchema",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
Among the presented sample schemas, the second one is missing one of the expected fields (L_SUPPKEY). Only the first schema contains all of the expected fields.
graph TD
A[Start] --> B{Check for Field Presence}
B -.->|Field is missing| C[Mark as Shape Anomaly]
B -.->|All fields present| D[End]
Potential Violation Messages
Shape Anomaly
The required fields (L_SUPPKEY
) are not present.
Expected Values
Definition
Asserts that values are contained within a list of expected values.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
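For instance (a sketch only; columns taken from the sample ORDERS data below):

```sql
-- Correct: scope the check to a subset of orders
O_ORDERKEY > 0

-- Incorrect: traditional SQL statements are rejected
SELECT O_ORDERSTATUS FROM ORDERS
```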
Specific Properties
Specify the list of expected values for the data in the field.
Name | Description |
---|---|
List | A predefined set of values against which the data is validated. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERSTATUS entries in the ORDERS table only contain expected order statuses: "O", "F", and "P".
Sample Data
O_ORDERKEY | O_ORDERSTATUS |
---|---|
1 | F |
2 | O |
3 | P |
4 | X |
{
"description": "Ensure that all O_ORDERSTATUS entries in the ORDERS table only contain expected order statuses: "O", "F", and "P"",
"coverage": 1,
"properties": {
"list":["O","F","P"]
},
"tags": [],
"fields": ["O_ORDERSTATUS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "expectedValues",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
4 does not satisfy the rule because the O_ORDERSTATUS
"X" is not on the list of expected order statuses ("O", "F", "P").
graph TD
A[Start] --> B[Retrieve O_ORDERSTATUS]
B --> C{Is O_ORDERSTATUS in 'O', 'F', 'P'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_ORDERSTATUS
value of 'X'
does not appear in the list of expected values
Shape Anomaly
In O_ORDERSTATUS
, 25.000% of 4 filtered records (1) do not appear in the list of expected values
Field Count
Definition
Asserts that there must be exactly a specified number of fields.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Specify the exact number of fields expected in the profile.
Name | Description |
---|---|
Number of Fields | The exact number of fields that should be present in the profile. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the ORDERS profile contains exactly 9 fields.
Sample Profile
Valid
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_SHIPPRIORITY | STRING |
O_COMMENT | STRING |
Invalid
count (8) less than expected (9)
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_COMMENT | STRING |
count (10) greater than expected (9)
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_COMMENT | STRING |
EXTRA_FIELD | UNKNOWN |
{
"description": "Ensure that the ORDERS profile contains exactly 9 fields",
"coverage": 1,
"properties": {
"value": 9
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "fieldCount",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
Among the presented sample profiles, the second one is missing a field, while the third one contains an extra field. Only the first profile has the correct number of fields, which is 9.
graph TD
A[Start] --> B[Retrieve Profile Fields]
B --> C{Does the profile have exactly 9 fields?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Shape Anomaly
In ORDERS
, the field count is not 9
.
Greater Than
Definition
Asserts that the field is a number greater than (or equal to) a value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
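For example (illustrative only, using the sample LINEITEM columns below):

```sql
-- Correct: only evaluate rows that actually carry a quantity
L_QUANTITY IS NOT NULL

-- Correct: conditions can be combined
L_ORDERKEY > 0 AND L_QUANTITY IS NOT NULL
```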
Specific Properties
Allows specifying a numeric value that acts as the threshold.
Name | Description |
---|---|
Value | The number to use as the base comparison. |
Inclusive | If true, the comparison will also allow values equal to the threshold. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are greater than 10.
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | 9 |
2 | 15 |
3 | 5 |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are greater than 10",
"coverage": 1,
"properties": {
"inclusive": true,
"value": 10
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "greaterThan",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are not greater than 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY > 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 5
is not greater than the value of 10.
Shape Anomaly
In L_QUANTITY
, 66.667% of 3 filtered records (2) are not greater than 10.
Greater Than Field
Definition
Asserts that the field is greater than another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
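As a sketch (columns from the sample ORDERS data below; the scalar-subquery form against {{ _qualytics_self }} is an assumption based on the scan-time variable note above, not a documented example):

```sql
-- Correct: direct condition
O_DISCOUNT >= 0

-- Assumed correct: subquery against the current dataframe via the reserved variable
O_TOTALPRICE > (SELECT avg(O_TOTALPRICE) FROM {{ _qualytics_self }})
```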
Specific Properties
Allows specifying another field against which the value comparison will be performed.
Name | Description |
---|---|
Field to compare | Specifies the name of the field against which the value will be compared. |
Inclusive | If true, the comparison will also allow values equal to the value of the other field. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_TOTALPRICE entries in the ORDERS table are greater than their respective O_DISCOUNT.
Sample Data
O_ORDERKEY | O_TOTALPRICE | O_DISCOUNT |
---|---|---|
1 | 100 | 105 |
2 | 500 | 10 |
3 | 120 | 121 |
{
"description": "Ensure that all O_TOTALPRICE entries in the ORDERS table are greater than their respective O_DISCOUNT",
"coverage": 1,
"properties": {
"field_name": "O_DISCOUNT",
"inclusive": true
},
"tags": [],
"fields": ["O_TOTALPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "greaterThanField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_TOTALPRICE
values are not greater than their respective O_DISCOUNT
values.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE and O_DISCOUNT]
B --> C{Is O_TOTALPRICE > O_DISCOUNT?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_TOTALPRICE
value of 100
is not greater than the value of O_DISCOUNT
.
Shape Anomaly
In O_TOTALPRICE
, 66.667% of 3 filtered records (2) are not greater than O_DISCOUNT
.
Is Address
Definition
Asserts that the values contain the specified required elements of an address.
In-Depth Overview
This check leverages machine learning powered by the libpostal library to support multilingual street address parsing/normalization that can handle addresses all over the world. The underlying statistical NLP model was trained using data from OpenAddress and OpenStreetMap, a total of about 1.2 billion records of data from over 230 countries, in 100+ languages. The international address parser uses Conditional Random Fields, which can infer a globally optimal tag sequence instead of making local decisions at each word, and it achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before).
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
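To make the distinction above concrete, here is a minimal, hypothetical PySpark sketch (the DataFrame and column names are illustrative, borrowed from the ORDERS examples in this guide). It only demonstrates that a filter is a boolean Spark SQL expression rather than a full query; it is not how Qualytics applies filters internally.

```python
# Minimal sketch, assuming a hypothetical DataFrame with ORDERS-like columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, 150.0, "F"), (2, 90.0, "O")],
    ["O_ORDERKEY", "O_TOTALPRICE", "O_ORDERSTATUS"],
)

# Correct usage: a boolean expression evaluated against each row.
filtered = df.filter("O_TOTALPRICE > 100 AND O_ORDERSTATUS = 'F'")
filtered.show()

# Incorrect usage: a full SELECT ... WHERE statement is not a filter expression
# and would fail to parse.
# df.filter("SELECT * FROM orders WHERE O_TOTALPRICE > 100")
```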
Specific Properties
Name | Description |
---|---|
Required Labels | The labels that must be identifiable in the value of each record |
Info
The address parser can technically use any string labels that are defined in the training data, but these are the ones currently supported:
- road: Street name(s)
- city: Any human settlement including cities, towns, villages, hamlets, localities, etc
- state: First-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to "state" as well (convention used in OSM, GeoPlanet, etc.)
- country: Sovereign nations and their dependent territories, anything with an ISO-3166 code
- postcode: Postal codes used for mail sorting
This check allows the user to define any combination of these labels as required elements of the value held in each record. Any value that does not contain every required element will be identified as anomalous, as illustrated in the sketch below.
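For intuition only, the sketch below uses the open-source libpostal Python bindings (the postal package) to parse one of the sample addresses and test whether a set of required labels is present. Qualytics performs this parsing internally; the package call, address, and label set here are purely illustrative.

```python
# Illustrative use of the libpostal Python bindings; not part of the Qualytics platform itself.
from postal.parser import parse_address

required_labels = {"road", "city", "state", "postcode"}

address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = parse_address(address)               # a list of (token, label) tuples
found_labels = {label for _, label in parsed}

# The value is anomalous if any required label is missing from the parsed address.
is_anomalous = not required_labels.issubset(found_labels)
print(found_labels, "anomalous:", is_anomalous)
```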
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all values in O_MAILING_ADDRESS include the labels "road", "city", "state", and "postcode"
Sample Data
O_ORDERKEY | O_MAILING_ADDRESS |
---|---|
1 | One-hundred twenty E 96th St, new york NY 14925 |
2 | Quatre vingt douze R. de l'Église, 75196 cedex 04 |
3 | 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA |
{
"description": "Ensure that all values in O_MAILING_ADDRESS include the labels "road", "city", "state", and "postcode"",
"coverage": 1,
"properties": {
"required_labels": ["road","city","state","country","postcode"]
},
"tags": [],
"fields": ["O_MAILING_ADDRESS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isAddress",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because the O_MAILING_ADDRESS
value includes only a road and postcode which violates the business logic that city and state also be present.
graph TD
A[Start] --> B[Retrieve O_MAILING_ADDRESS]
B --> C[Infer address labels using ML]
C --> D{Are all required labels present?}
D -->|Yes| E[Move to Next Record/End]
D -->|No| F[Mark as Anomalous]
F --> E
Potential Violation Messages
Record Anomaly
The O_MAILING_ADDRESS
value of Quatre vingt douze R. de l'Église, 75196 cedex 04
does not adhere to the required format.
Shape Anomaly
In O_MAILING_ADDRESS
, 33.33% of 3 filtered records (1) do not adhere to the required format.
Is Credit Card
Definition
Asserts that the values are credit card numbers.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all C_CREDIT_CARD entries in the CUSTOMER table are valid credit card numbers.
Sample Data
C_CUSTKEY | C_CREDIT_CARD |
---|---|
1 | 5105105105105100 |
2 | ABC12345XYZ |
3 | 4111111111111111 |
{
"description": "Ensure that all C_CREDIT_CARD entries in the CUSTOMER table are valid credit card numbers",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CREDIT_CARD"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isCreditCard",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CREDIT_CARD
value is not a valid credit card number.
graph TD
A[Start] --> B[Retrieve C_CREDIT_CARD]
B --> C{Is C_CREDIT_CARD valid?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
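The guide does not spell out the exact validation applied by this check, but card numbers are conventionally validated with a Luhn checksum plus a basic length test. The sketch below shows that conventional logic purely as an illustration of what "valid" typically means; it is not necessarily Qualytics' implementation.

```python
# Illustrative Luhn checksum; an assumption about conventional card validation,
# not necessarily the exact logic used by the platform.
def looks_like_credit_card(value: str) -> bool:
    digits = value.strip().replace(" ", "").replace("-", "")
    if not digits.isdigit() or not (13 <= len(digits) <= 19):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(looks_like_credit_card("4111111111111111"))  # True
print(looks_like_credit_card("ABC12345XYZ"))       # False
```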
Potential Violation Messages
Record Anomaly
The C_CREDIT_CARD
value of ABC12345XYZ
is not a valid credit card number.
Shape Anomaly
In C_CREDIT_CARD
, 33.33% of 3 filtered records (1) are not valid credit card numbers.
Is Replica Of
Definition
Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s).
In-Depth Overview
The IsReplicaOf
rule ensures that data integrity is maintained when data is replicated from one source to another. This involves checking not only the data values themselves but also ensuring that the structure and relationships are preserved.
In a distributed data ecosystem, replication often occurs to maintain high availability, create backups, or feed data into analytical systems. However, discrepancies might arise due to various reasons such as network glitches, software bugs, or human errors. The IsReplicaOf
rule serves as a safeguard against these issues by:
- Preserving Data Structure: Ensuring that the structure of the replicated data matches the original.
- Checking Data Values: Ensuring that every piece of data in the source exists in the replica.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the datastore and table/file where the replica of the targeted fields is located for comparison.
Name | Description |
---|---|
Row Identifiers | The list of fields defining the compound key to identify rows in the comparison analysis. |
Datastore | The source datastore where the replica of the targeted field(s) is located. |
Table/file | The table, view or file in the source datastore that should serve as the replica. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Row Identifiers
This optional input enables row-level comparison analysis by defining a list of fields as row identifiers. It allows a more detailed comparison between tables/files, where each row's compound key is used to identify its presence or absence in the reference table/file relative to the target table/file. Qualytics can report whether a row exists and distinguish which field values differ in each row present in the reference table/file, helping to determine whether it is a replica.
Info
Anomalies produced by an IsReplicaOf quality check that makes use of Row Identifiers have their source records presented in a different visualization.
See more at: Comparison Source Records
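Conceptually, Row Identifiers turn the replica comparison into a keyed join between the target and the reference. The hypothetical PySpark sketch below (table and column names borrowed from the NATION example further down) shows that kind of keyed comparison; it is an approximation, not Qualytics' implementation.

```python
# Hypothetical sketch of a keyed replica comparison; names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("replica-sketch").getOrCreate()
nation = spark.createDataFrame(
    [(1, "Australia"), (2, "United States"), (3, "Uruguay")],
    ["N_NATIONKEY", "N_NATIONNAME"],
)
nation_backup = spark.createDataFrame(
    [(1, "Australia"), (2, "USA"), (3, "Uruguay")],
    ["N_NATIONKEY", "N_NATIONNAME"],
)

# Join on the row identifier (the compound key) and flag rows that are missing
# from the reference or whose non-key values differ.
differences = (
    nation.alias("t")
    .join(nation_backup.alias("r"), on="N_NATIONKEY", how="left")
    .where(
        F.col("r.N_NATIONNAME").isNull()
        | (F.col("t.N_NATIONNAME") != F.col("r.N_NATIONNAME"))
    )
)
differences.show()
```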
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
String
String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.
Ignore Whitespace
When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.
Illustration
In this example, Value A and Value B are compared with the string comparison option Ignore Whitespace set to True.
Value A | Value B | Are equal? | Has whitespace? |
---|---|---|---|
Leonidas | Leonidas | True | No |
Beth | Beth | True | Yes |
Ana | Anna | False | Yes |
Joe | Joel | False | No |
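Putting the comparator options together, here is a small illustrative sketch of the tolerance logic the tables above describe (absolute, percentage, duration, and whitespace-insensitive string equality). It mirrors the examples above rather than the platform's internal code.

```python
# Illustrative comparator logic matching the tables above; not Qualytics' internal code.
import re

def equal_absolute(a: float, b: float, threshold: float) -> bool:
    return abs(a - b) <= threshold

def equal_percentage(a: float, b: float, threshold_pct: float) -> bool:
    # Percentage change relative to Value A, as in the formula above.
    return abs((b - a) / a) * 100 <= threshold_pct

def equal_duration(a: float, b: float, value: float) -> bool:
    # a, b and value are all expressed in the chosen unit (millis, seconds or days).
    return abs(a - b) <= value

def equal_string(a: str, b: str, ignore_whitespace: bool = True) -> bool:
    if ignore_whitespace:
        a = re.sub(r"\s+", " ", a).strip()
        b = re.sub(r"\s+", " ", b).strip()
    return a == b

print(equal_absolute(100, 150, 50))           # True  (difference of 50 is within 50)
print(equal_percentage(180, 200, 10))         # False (11.11% change exceeds 10%)
print(equal_duration(5, 7, 1))                # False (2 days exceeds 1 day)
print(equal_string("Beth", " Beth ", True))   # True once whitespace is ignored
```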
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✗ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Scenario: Consider that the fields N_NATIONKEY and N_NATIONNAME in the NATION table are being replicated to a backup database for disaster recovery purposes. The data engineering team wants to ensure that both fields in the replica in the backup accurately reflect the original.
Objective: Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table are replicas in the NATION_BACKUP table.
Sample Data from NATION
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Australia |
2 | United States |
3 | Uruguay |
Replica Sample Data from NATION_BACKUP
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Australia |
2 | USA |
3 | Uruguay |
{
"description": "Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table are replicas in the NATION_BACKUP table",
"coverage": 1,
"properties": {
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["N_NATIONKEY", "N_NATIONNAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isReplicaOf",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
The datasets representing the fields N_NATIONKEY
and N_NATIONNAME
in the original and the replica are not completely identical, indicating a possible discrepancy in the replication process or an unintended change.
graph TD
A[Start] --> B[Retrieve Original Data]
B --> C[Retrieve Replica Data]
C --> D{Do datasets match for both fields?}
D -->|Yes| E[End]
D -->|No| F[Mark as Anomalous]
F --> E
-- An illustrative SQL query comparing the original to the replica for both fields.
select
    orig.n_nationkey as original_key,
    orig.n_nationname as original_name,
    replica.n_nationkey as replica_key,
    replica.n_nationname as replica_name
from nation as orig
left join nation_backup as replica on orig.n_nationkey = replica.n_nationkey
where
    replica.n_nationkey is null -- the row is missing from the replica
    or orig.n_nationname <> replica.n_nationname -- the replicated value differs
Potential Violation Messages
Shape Anomaly
There is 1 record that differs between NATION_BACKUP
(3 records) and NATION
(3 records) in <datastore_name>
Is Type
Definition
Asserts that the data is of a specific type.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the expected type for the data in the field.
Name | Description |
---|---|
Field Type | The type that values in the selected field should conform to. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are of Integral type.
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | "10" |
2 | "15.5" |
3 | "Ten" |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are of Integral type",
"coverage": 1,
"properties": {
"field_type":"Integral"
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isType",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_QUANTITY
values are not of Integral type.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY of Integral type?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
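As a rough illustration of what "of Integral type" means for string values (an approximation of the rule's intent, not the platform's parser), the sample values can be tested by attempting an integer conversion:

```python
# Illustrative check of whether a string value reads as an Integral;
# an approximation of the rule's intent only.
def is_integral(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

for v in ["10", "15.5", "Ten"]:
    print(v, is_integral(v))   # True, False, False
```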
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of Ten
is not a valid Integral.
Shape Anomaly
In L_QUANTITY
, 66.667% of 3 filtered records (2) are not a valid Integral.
Less Than
Definition
Asserts that the field is a number less than (or equal to) a value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying a numeric value that acts as the threshold.
Name | Description |
---|---|
Value | The number to use as the base comparison. |
Inclusive | If true, the comparison will also allow values equal to the threshold. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all L_PRICE entries in the LINEITEM table are less than 20.
Sample Data
L_ORDERKEY | L_PRICE |
---|---|
1 | 18 |
2 | 25 |
3 | 23 |
{
"description": "Ensure that all L_PRICE entries in the LINEITEM table are less than 20",
"coverage": 1,
"properties": {
"inclusive": true,
"value": 20
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "lessThan",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_PRICE
values are not less than 20.
Potential Violation Messages
Record Anomaly
The L_PRICE
value of 23
is not less than the value of 20.
Shape Anomaly
In L_PRICE
, 66.667% of 3 filtered records (2) are not less than 20.
Less Than Field
Definition
Asserts that the field is less than another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying another field against which the value comparison will be performed.
Name | Description |
---|---|
Field to compare | Specifies the name of the field against which the value will be compared. |
Inclusive | If true, the comparison will also allow values equal to the value of the other field. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all O_DISCOUNT entries in the ORDERS table are less than their respective O_TOTALPRICE.
Sample Data
O_ORDERKEY | O_TOTALPRICE | O_DISCOUNT |
---|---|---|
1 | 105 | 100 |
2 | 500 | 10 |
3 | 121 | 125 |
{
"description": "Ensure that all O_DISCOUNT entries in the ORDERS table are less than their respective O_TOTALPRICE",
"coverage": 1,
"properties": {
"field_name": "O_TOTALPRICE",
"inclusive":true
},
"tags": [],
"fields": ["O_DISCOUNT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "lessThanField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
3 does not satisfy the rule because its O_DISCOUNT
value is not less than its respective O_TOTALPRICE
value.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE and O_DISCOUNT]
B --> C{Is O_DISCOUNT < O_TOTALPRICE?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_DISCOUNT
value of 125
is not less than the value of O_TOTALPRICE
.
Shape Anomaly
In O_DISCOUNT
, 33.333% of 3 filtered records (1) is not less than O_TOTALPRICE
.
Matches Pattern
Definition
Asserts that a field must match a pattern.
In-Depth Overview
Patterns, typically expressed as regular expressions, allow for the enforcement of custom structural norms for data fields. For complex patterns, regular expressions offer a powerful tool to ensure conformity to the expected format.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying a pattern against which the field will be checked.
Name | Description |
---|---|
Pattern | Specifies the regular expression pattern the field must match. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all P_SERIAL entries in the PART table match the pattern for product serial numbers: TPCH-XXXX-####
, where XXXX
are uppercase alphabetic characters and ####
are numbers.
Sample Data
P_PARTKEY | P_SERIAL |
---|---|
1 | TPCH-ABCD-1234 |
2 | TPCH-1234-ABCD |
3 | TPCH-WXYZ-9876 |
{
"description": "Ensure that all P_SERIAL entries in the PART table match the pattern for product serial numbers: `TPCH-XXXX-####`, where `XXXX` are uppercase alphabetic characters and `####` are numbers",
"coverage": 1,
"properties": {
"pattern":"^tpch-[a-z]{4}-[0-9]{4}$"
},
"tags": [],
"fields": ["P_SERIAL"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "matchesPattern",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with P_PARTKEY
2 does not satisfy the rule because its P_SERIAL
does not match the required pattern.
graph TD
A[Start] --> B[Retrieve P_SERIAL]
B --> C{Does P_SERIAL match TPCH-XXXX-#### format?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
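As a quick, illustrative approximation of this example (the exact regex dialect used by the platform is not detailed here), the sample serials can be tested against the pattern ^TPCH-[A-Z]{4}-[0-9]{4}$:

```python
# Illustrative regex check for the TPCH-XXXX-#### serial format from the example above.
import re

pattern = re.compile(r"^TPCH-[A-Z]{4}-[0-9]{4}$")

for serial in ["TPCH-ABCD-1234", "TPCH-1234-ABCD", "TPCH-WXYZ-9876"]:
    print(serial, bool(pattern.match(serial)))
# TPCH-ABCD-1234 True, TPCH-1234-ABCD False, TPCH-WXYZ-9876 True
```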
Potential Violation Messages
Record Anomaly
The P_SERIAL
value of TPCH-1234-ABCD
does not match the pattern TPCH-XXXX-####
.
Shape Anomaly
In P_SERIAL
, 33.333% of 3 filtered records (1) do not match the pattern TPCH-XXXX-####
.
Max Length
Definition
Asserts that a string has a maximum length.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the maximum acceptable length of the string.
Name | Description |
---|---|
Length | Specifies the maximum number of characters a string in the field should have. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that P_DESCRIPTION entries in the PART table do not exceed 50 characters in length.
Sample Data
P_PARTKEY | P_DESCRIPTION |
---|---|
1 | Standard industrial widget |
2 | A product description that clearly goes way beyond the specified fifty characters limit. |
3 | Basic office equipment |
{
"description": "Ensure that P_DESCRIPTION in the PART table do not exceed 50 characters in length",
"coverage": 1,
"properties": {
"value": 3
},
"tags": [],
"fields": ["C_BLOOD_GROUP"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxLength",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with P_PARTKEY
2 does not satisfy the rule because its P_DESCRIPTION
exceeds 50 characters in length.
graph TD
A[Start] --> B[Retrieve P_DESCRIPTION]
B --> C{Is P_DESCRIPTION length <= 50 characters?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The P_DESCRIPTION
length of A product description that clearly goes way beyond the specified fifty characters limit.
is greater than the max length of 50.
Shape Anomaly
In P_DESCRIPTION
, 33.333% of 3 filtered records (1) have a length greater than 50.
Max Partition Size
Definition
Asserts the maximum number of records that should be loaded from each file or table partition.
In-Depth Overview
Managing the volume of data in each partition is critical when dealing with partitioned datasets. This is especially pertinent when system limitations or data processing capabilities are considered, ensuring that no partition exceeds the system's ability to handle data efficiently.
The Max Partition Size rule is designed to set an upper limit on the number of records each partition can contain.
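The platform performs this per-partition accounting internally during loading; as a rough, hypothetical illustration of the kind of count the rule asserts, Spark's partition IDs can be used to tally records per partition:

```python
# Hypothetical sketch: count records per Spark partition and flag any partition
# that exceeds a maximum size. The data and threshold are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.appName("max-partition-size-sketch").getOrCreate()
df = spark.range(0, 25_000).repartition(2)   # stand-in for a partitioned LINEITEM load

max_partition_size = 10_000
partition_counts = (
    df.withColumn("partition", spark_partition_id())
      .groupBy("partition")
      .agg(count("*").alias("records"))
)
partition_counts.where(partition_counts.records > max_partition_size).show()
```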
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
Specific Properties
Specifies the maximum allowable record count for each data partition
Name | Description |
---|---|
Maximum partition size | The maximum number of records that can be loaded from each partition. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✗ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that no partition of the LINEITEM table contains more than 10,000 records to prevent data processing bottlenecks.
Sample Data for Partition P3
Row Number | L_ITEM |
---|---|
1 | Data |
2 | Data |
... | ... |
10,050 | Data |
{
"description": "Ensure that no partition of the LINEITEM table contains more than 10,000 records to prevent data processing bottlenecks",
"coverage": 1,
"properties": {
"value":10000
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxPartitionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
In the sample data above, the rule is violated because partition P3 contains 10,050 records, which exceeds the set maximum of 10,000 records.
graph TD
A[Start] --> B[Retrieve Number of Records for Each Partition]
B --> C{Does Partition have <= 10,000 records?}
C -->|Yes| D[Move to Next Partition/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Shape Anomaly
In LINEITEM
, more than 10,000 records were loaded.
Max Value
Definition
Asserts that a field has a maximum value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the maximum allowable value for the field.
Name | Description |
---|---|
Value | Specifies the maximum value a field should have. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
1 | 2 | 55 |
2 | 1 | 20 |
3 | 1 | 60 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50",
"coverage": 1,
"properties": {
"value": 50
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxValue",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values exceed the specified maximum value of 50.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY <= 50?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 55
is greater than the max value of 50
.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are greater than the max value of 50
.
Metric
Definition
Records the value of the selected field during each scan operation and asserts limits based upon an expected change or absolute range (inclusive).
In-Depth Overview
The Metric
rule is designed to monitor the values of a selected field over time. It is particularly useful in a time-series context where values are expected to evolve within certain bounds or limits. This rule allows for tracking absolute values or changes, ensuring they remain within predefined thresholds.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the evaluation method and allowable limits for field value comparisons over time.
Name | Description |
---|---|
Comparison | Specifies the type of comparison: Absolute Change, Absolute Value, or Percentage Change. |
Min Value | Indicates the minimum allowable increase in value. Use a negative value to represent an allowable decrease. |
Max Value | Indicates the maximum allowable increase in value. |
Details
Comparison Options
Absolute Change
The Absolute Change
comparison works by comparing the change in a numeric field's value to a pre-set limit (Min / Max). If the field's value changes by more than this specified limit since the last relevant scan, an anomaly is identified.
Illustration
Any record with a value change smaller than 30 or greater than 70 compared to the last scan should be flagged as anomalous
Thresholds: Min Change = 30, Max Change = 70
Scan | Previous Value | Current Value | Absolute Change | Anomaly Detected |
---|---|---|---|---|
#1 | - | 100 | - | No |
#2 | 100 | 150 | 50 | No |
#3 | 150 | 220 | 70 | No |
#4 | 220 | 300 | 80 | Yes |
Absolute Value
The Absolute Value
comparison works by checking a numeric field's value itself against a pre-set range between the Min and Max values. If the field's value falls outside this specified range at the time of the scan, an anomaly is identified.
Illustration
The value of the record in each scan should be within 100 and 300 to be considered normal
Thresholds: Min Value = 100, Max Value = 300
Scan | Current Value | Anomaly Detected |
---|---|---|
#1 | 150 | No |
#2 | 90 | Yes |
#3 | 250 | No |
#4 | 310 | Yes |
Percentage Change
The Percentage Change
comparison operates by tracking changes in a numeric field's value relative to its previous value. If the change exceeds the predefined percentage (%) limit since the last relevant scan, an anomaly is generated.
Illustration
An anomaly is identified if the record's value decreases by more than 20% or increases by more than 50% compared to the last scan.
Thresholds: Min Percentage Change = -20%, Max Percentage Change = 50%
Percentage Change Formula: ( (current_value - previous_value) / previous_value ) * 100
Scan | Previous Value | Current Value | Percentage Change | Anomaly Detected |
---|---|---|---|---|
1 | - | 100 | - | No |
2 | 100 | 150 | 50% | No |
3 | 150 | 120 | -20% | No |
4 | 120 | 65 | -45.83% | Yes |
5 | 65 | 110 | 69.23% | Yes |
Thresholds
At least one of the Min or Max values must be specified; including both is optional. These values determine the acceptable range or limit of change in the field's value.
Min Value
- Represents the minimum allowable increase in the field's value.
- A negative Min Value signifies an allowable decrease, determining the minimum value the field can drop to be considered valid.
Max Value
- Indicates the maximum allowable increase in the field’s value, setting an upper limit for the value's acceptable growth or change.
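To make the three comparison modes concrete, the sketch below evaluates a newly scanned value against the previous scan's value using the Min/Max limits described above; it is an illustration of the semantics in the tables, not the platform's scan logic.

```python
# Illustrative evaluation of the Metric comparison modes; limits are treated as inclusive.
def metric_anomalous(previous, current, comparison, min_value=None, max_value=None):
    if comparison == "Absolute Value":
        observed = current
    elif comparison == "Absolute Change":
        if previous is None:
            return False                      # nothing to compare on the first scan
        observed = current - previous
    elif comparison == "Percentage Change":
        if previous is None:
            return False
        observed = (current - previous) / previous * 100
    else:
        raise ValueError(f"unknown comparison: {comparison}")

    too_low = min_value is not None and observed < min_value
    too_high = max_value is not None and observed > max_value
    return too_low or too_high

# Matches the Percentage Change illustration: 120 -> 65 is -45.83%, below the -20% limit.
print(metric_anomalous(120, 65, "Percentage Change", min_value=-20, max_value=50))  # True
```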
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✗ |
Example
Objective: Ensure that the total price in the ORDERS table does not fluctuate beyond a predefined percentage limit between scans.
Thresholds: Min Percentage Change = -30%, Max Percentage Change = 30%
Sample Scan History
Scan | O_ORDERKEY | Previous O_TOTALPRICE | Current O_TOTALPRICE | Percentage Change | Anomaly Detected |
---|---|---|---|---|---|
#1 | 1 | - | 100 | - | No |
#2 | 1 | 100 | 110 | 10% | No |
#3 | 1 | 110 | 200 | 81.8% | Yes |
#4 | 1 | 200 | 105 | -47.5% | Yes |
{
"description": "Ensure that the total price in the ORDERS table does not fluctuate beyond a predefined percentage limit between scans",
"coverage": 1,
"properties": {
"comparison":"Percentage Change",
"min":-0.3,
"max":0.3
},
"tags": [],
"fields": ["O_TOTALPRICE "],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "metric",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample scan history above, anomalies are identified in scans #3 and #4. The O_TOTALPRICE
values in these scans fall outside the declared percentage change limits of -30% and 30%, indicating that something unusual might be happening and further investigation is needed.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE]
B --> C{Is Percentage Change in O_TOTALPRICE within -30% and 30%?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select *
from (
    select
        o_orderkey,
        o_totalprice,
        lag(o_totalprice) over (order by o_orderkey) as previous_o_totalprice
    from orders
) scans
where
    ((o_totalprice - previous_o_totalprice) / previous_o_totalprice) * 100 > 30
    or ((o_totalprice - previous_o_totalprice) / previous_o_totalprice) * 100 < -30;
Potential Violation Messages
Record Anomaly (Percentage Change)
The percentage change of O_TOTALPRICE
from '110' to '200' falls outside the declared limits
Record Anomaly (Absolute Change)
using hypothetical numbers
The absolute change of O_TOTALPRICE
from '150' to '300' falls outside the declared limits
Record Anomaly (Absolute Value)
using hypothetical numbers
The value for O_TOTALPRICE
of '50' is not between the declared limits
Min Length
Definition
Asserts that a string has a minimum length.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the minimum allowable length for the field.
Name | Description |
---|---|
Value |
Specifies the minimum length that the string field should have. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_COMMENT entries in the CUSTOMER table have a minimum length of 5 characters.
Sample Data
C_CUSTKEY | C_COMMENT |
---|---|
1 | Ok |
2 | Excellent customer service, very satisfied! |
3 | Nice staff |
{
"description": "Ensure that all C_COMMENT entries in the CUSTOMER table have a minimum length of 5 characters",
"coverage": 1,
"properties": {
"value": 5
},
"tags": [],
"fields": ["C_COMMENT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minLength",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
1 does not satisfy the rule because the length of its C_COMMENT
value is below the required minimum length of 5 characters.
graph TD
A[Start] --> B[Retrieve C_COMMENT]
B --> C{Is C_COMMENT length >= 5?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
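For reference, a query along these lines against the sample CUSTOMER data (table and column names are taken from the example above; this is not SQL generated by the platform) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_custkey,
    c_comment
from
    customer
where
    length(c_comment) < 5;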
Potential Violation Messages
Record Anomaly
The C_COMMENT
length of Ok
is less than the min length of 5.
Shape Anomaly
In C_COMMENT
, 33.333% of 3 filtered records (1) have a length less than 5.
Min Partition Size
Definition
Asserts the minimum number of records that should be loaded from each file or table partition.
In-Depth Overview
When working with large datasets that are often partitioned for better performance and scalability, ensuring a certain minimum number of records from each partition becomes crucial. This could be to ensure that each partition is well-represented in the analysis, to maintain data consistency or even to verify that data ingestion or migration processes are functioning properly.
The Min Partition Size rule allows users to set a threshold ensuring that each partition has loaded at least the specified minimum number of records.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Sets the required minimum record count for each data partition
Name | Description |
---|---|
Minimum partition size |
Specifies the minimum number of records that should be loaded from each partition |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that each partition of the LINEITEM table has at least 1000 records.
Sample Data for Partition P3
Row Number | L_ITEM |
---|---|
1 | Data |
2 | Data |
... | ... |
900 | Data |
{
"description": "Ensure that each partition of the LINEITEM table has at least 1000 records",
"coverage": 1,
"properties": {
"value": 1000
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minPartitionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
The sample data above does not satisfy the rule because partition P3 contains only 900 records, which is less than the required minimum of 1000 records.
graph TD
A[Start] --> B[Retrieve Number of Records for Each Partition]
B --> C{Does Partition have >= 1000 records?}
C -->|Yes| D[Move to Next Partition/End]
C -->|No| E[Mark as Anomalous]
E --> D
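For reference, a query in this spirit approximates the check in plain SQL; partition_key is a hypothetical column standing in for the physical partition identifier and is not part of the sample data above:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    partition_key,
    count(*) as record_count
from
    lineitem
group by
    partition_key
having
    count(*) < 1000;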
Potential Violation Messages
Shape Anomaly
In LINEITEM
, fewer than 1000 records were loaded.
Min Value
Definition
Asserts that a field has a minimum value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the minimum allowable value for the field.
Name | Description |
---|---|
Value |
Specifies the minimum value a field should have. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is not below a value of 10.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
1 | 2 | 5 |
2 | 1 | 20 |
3 | 1 | 8 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is not below a value of 10",
"coverage": 1,
"properties": {
"value": 10
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minValue",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are below the specified minimum value of 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY >= 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
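For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity < 10;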
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 5
is less than the min value of 10
.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are less than the min value of 10
.
Not Exists In
Definition
Asserts that values assigned to this field do not exist as values in another field.
In-Depth Overview
The Not Exists In
rule allows you to ensure data exclusivity between different sources, whether object storage systems or databases.
While databases might use unique constraints to maintain data distinctiveness between related tables, the Not Exists In
rule extends this capability in two significant ways:
- Cross-System Exclusivity: it enables checks to ensure data does not overlap across different databases or even entirely separate systems. This can be essential in scenarios where data should be partitioned or isolated across platforms.
- Flexible Data Formats: Not just limited to databases, this rule can validate values against various data formats, such as ensuring values in a file do not coincide with those in a table.
These functionalities enable businesses to maintain data exclusivity even in intricate, multi-system settings.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
Specific Properties
Define the datastore, table/file, and field where the rule should look for non-matching values.
Name | Description |
---|---|
Datastore |
The source datastore where the profile of the reference field is located. |
Table/file |
The profile (e.g. table, view or file) containing the reference field. |
Field |
The field name whose values should not match those of the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Scenario: A shipping company needs to ensure that all N_NATIONNAME entries in the NATION table aren't listed in an external unsupported regions file, which lists countries they don't ship to.
Sample Data
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Antarctica |
2 | Argentina |
3 | Atlantida |
Unsupported Regions File Sample
UNSUPPORTED_REGION |
---|
Antarctica |
Mars |
... |
{
"description": "A shipping company needs to ensure that all NATION_NAME entries in the NATION table aren't listed in an external unsupported regions file, which lists countries they don't ship to",
"coverage": 1,
"properties": {
"field_name":"UNSUPPORTED_REGION",
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["NATION_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notExistsIn",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with N_NATIONKEY
1 does not satisfy the rule because the N_NATIONNAME
"Antarctica" is listed as an UNSUPPORTED_REGION
in the unsupported regions file, indicating the company doesn't ship there.
graph TD
A[Start] --> B[Retrieve UNSUPPORTED_REGION]
B --> C[Retrieve N_NATIONNAME]
C --> D{Is N_NATIONNAME listed in UNSUPPORTED_REGION?}
D -->|No| E[Move to Next Record/End]
D -->|Yes| F[Mark as Anomalous]
F --> E
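For reference, a query along these lines would surface the flagged records, assuming the unsupported regions file has been loaded as a table named unsupported_regions (a hypothetical name used only for illustration):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    n.n_nationkey,
    n.n_nationname
from
    nation n
where
    exists (
        select 1
        from unsupported_regions u
        where u.unsupported_region = n.n_nationname
    );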
Potential Violation Messages
Record Anomaly
The N_NATIONNAME
value of 'Antarctica
' is an UNSUPPORTED_REGION
.
Shape Anomaly
In N_NATIONNAME
, 33.333% of 3 filtered records (1) do exist in UNSUPPORTED_REGION
.
Not Future
Definition
Asserts that the field's value is not in the future.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Field | |
---|---|
Date |
|
Timestamp |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the delivery dates (O_DELIVERYDATE) in the ORDERS table are not set in the future.
Sample Data
O_ORDERKEY | O_DELIVERYDATE |
---|---|
1 | 2023-09-20 |
2 | 2023-10-25 (Future Date) |
3 | 2023-10-10 |
{
"description": "Ensure that the delivery dates (O_DELIVERYDATE) in the ORDERS table are not set in the future",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["O_DELIVERYDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notFuture",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because its O_DELIVERYDATE
is set in the future.
graph TD
A[Start] --> B[Retrieve O_DELIVERYDATE]
B --> C{Is O_DELIVERYDATE <= Current Date?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
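For reference, a query along these lines against the sample ORDERS data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    o_orderkey,
    o_deliverydate
from
    orders
where
    o_deliverydate > current_date();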
Potential Violation Messages
Record Anomaly
The value for O_DELIVERYDATE
of 2023-10-25
is in the future.
Shape Anomaly
In O_DELIVERYDATE
, 33.333% of 3 filtered records (1) are future times.
Not Negative
Definition
Asserts that this is a non-negative number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a non-negative number.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
2 | 2 | -5 |
3 | 1 | 20 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a non-negative number",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notNegative",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with L_ORDERKEY
2 does not satisfy the rule because its L_QUANTITY
value is a negative number.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY >= 0?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
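For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity < 0;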
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of -5
is a negative number.
Shape Anomaly
In L_QUANTITY
, 33.333% of 3 filtered records (1) are negative numbers.
Not Null
Definition
Asserts that none of the selected fields' values are explicitly set to nothing.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Fields
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that every record in the CUSTOMER table has an assigned value for the C_NAME and C_ADDRESS fields.
Sample Data
C_CUSTKEY | C_NAME | C_ADDRESS |
---|---|---|
1 | Alice | 123 Oak St |
2 | Bob | NULL |
3 | Charlie | 789 Maple Ave |
4 | NULL | 456 Pine Rd |
{
"description": "Ensure that every record in the CUSTOMER table has an assigned value for the C_NAME and C_ADDRESS fields",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["C_ADDRESS","C_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notNull",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with C_CUSTKEY
2 and 4 do not satisfy the rule because they have NULL
values in the C_NAME
or C_ADDRESS
fields.
graph TD
A[Start] --> B[Retrieve C_NAME and C_ADDRESS]
B --> C{Are C_NAME and C_ADDRESS non-null?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
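For reference, a query along these lines against the sample CUSTOMER data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_custkey,
    c_name,
    c_address
from
    customer
where
    c_name is null
    or c_address is null;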
Potential Violation Messages
Record Anomaly
There is no assigned value for C_NAME
.
Shape Anomaly
In C_NAME
and C_ADDRESS
, 50.000% of 4 filtered records (2) are not assigned values.
Positive
Definition
Asserts that this is a positive number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a positive number.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
2 | 1 | 0 |
3 | 1 | -5 |
4 | 1 | 20 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a positive number",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "positive",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_QUANTITY
values are not positive numbers.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY Positive?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
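For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity <= 0;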
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of -5
is not a positive number.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are not positive numbers.
Predicted By
Definition
Asserts that the actual value of a field falls within an expected predicted range.
In-Depth Overview
The Predicted By
rule is used to verify whether the actual values of a specific field align with a set of expected values that are derived from a prediction expression. This expression could be a mathematical formula, statistical calculation, or any other valid predictive logic.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
|
Date |
|
Timestamp |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines if the actual value of a field falls within an expected predicted range.
Name | Description |
---|---|
Expression |
The prediction expression or formula for the field. |
Tolerance |
The allowed deviation from the predicted value. |
Note
The tolerance level must be defined to allow a permissible range of deviation from the predicted values.
Here’s a simple breakdown:
- An expression predicts what the value of a field should be.
- A tolerance value specifies how much deviation from the predicted value is acceptable.
- The actual value is then compared against the range defined by the predicted value ± tolerance.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the discount (L_DISCOUNT) in the LINEITEM table is calculated correctly based on the actual price (L_EXTENDEDPRICE). A correct discount should be approximately 8% of the extended price, within a tolerance of ±2.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_DISCOUNT |
---|---|---|---|
1 | 1 | 100 | 8 |
2 | 1 | 100 | 12 |
3 | 1 | 100 | 9 |
Inputs
- Expression: L_EXTENDEDPRICE * 0.08
- Tolerance: 2
{
"description": "Ensure that the discount (L_DISCOUNT) in the LINEITEM table is calculated correctly based on the actual price (L_EXTENDEDPRICE). A correct discount should be approximately 8% less than the actual price, within a tolerance of ±2",
"coverage": 1,
"properties": {
"expression": "L_EXTENDEDPRICE × 0.08",
"tolerance": 2
},
"tags": [],
"fields": ["L_DISCOUNT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "predictedBy",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
For the entry with L_ORDERKEY
2, the discount is 12, which is outside of the computed range. Based on an 8% expected discount with a tolerance of ±2, the discount should be between 6 and 10 (calculated from the actual price of 100). Therefore, this record is marked as anomalous.
graph TD
A[Start] --> B[Retrieve L_EXTENDEDPRICE and L_DISCOUNT]
B --> C{Is Discount within Predicted Range?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
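For reference, a query along these lines captures the predicted-range comparison for the sample data (an approximation of the rule using the expression and tolerance above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_extendedprice,
    l_discount,
    l_extendedprice * 0.08 as predicted_discount
from
    lineitem
where
    abs(l_discount - l_extendedprice * 0.08) > 2;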
Potential Violation Messages
Record Anomaly
The L_DISCOUNT
value of '12' is not within the predicted range defined by L_EXTENDEDPRICE * 0.08 +/- 2.0
Shape Anomaly
In L_DISCOUNT
, 33.333% of 3 filtered records (1) are not within the predicted range defined by L_EXTENDEDPRICE * 0.08 +/- 2.0
Required Values
Definition
Asserts that all of the defined values must be present at least once within a field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Ensures that a specific set of values is present within a field.
Name | Description |
---|---|
Values |
Specifies the list of values that must exist in the field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that orders have priorities labeled as '1-URGENT', '2-HIGH', '3-MEDIUM', '4-LOW', and '5-NOT URGENT'.
Sample Data
O_ORDERKEY | O_ORDERPRIORITY |
---|---|
1 | 1-URGENT |
2 | 2-HIGH |
3 | 3-MEDIUM |
4 | 3-MEDIUM |
{
"description": "Ensure that orders have priorities labeled as '1-URGENT', '2-HIGH', '3-MEDIUM', '4-LOW', and '5-NOT URGENT'",
"coverage": 1,
"properties": {
"list":["1-URGENT","2-HIGH","3-MEDIUM","4-LOW","5-NOT URGENT"]
},
"tags": [],
"fields": ["O_ORDERPRIORITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "requiredValues",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the rule is violated because the values '4-LOW' and '5-NOT URGENT' are not present in the O_ORDERPRIORITY
field of the ORDERS table.
graph TD
A[Start] --> B{Check if all specified values exist in the field}
B -->|Yes| C[End: No Anomalies]
B -->|No| D[Mark as Anomalous: Missing Values]
D --> C
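For reference, a query along these lines reports which of the required values are absent from the sample ORDERS data (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    r.required_value
from
    values ('1-URGENT'), ('2-HIGH'), ('3-MEDIUM'), ('4-LOW'), ('5-NOT URGENT') as r(required_value)
where
    r.required_value not in (select distinct o_orderpriority from orders);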
Potential Violation Messages
Shape Anomaly
In O_ORDERPRIORITY
, required values are missing in 40.000% of filtered records.
Satisfies Expression
Definition
Evaluates the given expression (any valid Spark SQL) for each record.
In-Depth Overview
The Satisfies Expression
rule allows for a wide range of custom validations on the dataset. By defining a Spark SQL expression, you can create customized conditions that the data should meet.
This rule will evaluate an expression against each record, marking those that do not satisfy the condition as anomalies. It provides the flexibility to create complex validation logic without being restricted to predefined rule structures.
Field Scope
Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Evaluates each record against a specified Spark SQL expression to ensure it meets custom validation conditions.
Name | Description |
---|---|
Expression |
Defines the Spark SQL expression that each record should meet. |
Info
Refers to the Filter Guide in the General Properties topic for examples of valid Spark SQL expressions.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example 1: Satisfies Expression Using a CASE
Statement
Let's assume you want to ensure that for orders with a priority of '1-URGENT' or '2-HIGH', the orderstatus
must be 'O' (for open), and for orders with a priority of '3-MEDIUM', the orderstatus
must be either 'O' or 'P' (for pending).
CASE
WHEN o_orderpriority IN ('1-URGENT', '2-HIGH') AND o_orderstatus != 'O' THEN FALSE
WHEN o_orderpriority = '3-MEDIUM' AND o_orderstatus NOT IN ('O', 'P') THEN FALSE
ELSE TRUE
END
Example 2: Satisfies Expression Using Scalar Subqueries
Objective: To ensure that the overall effect of discounts on item prices remains within acceptable limits, we validate whether the average discounted price of all items is greater than the maximum discount applied to any single item.
Background:
In pricing analysis, it’s important to monitor how discounts affect the final prices of products. By comparing the average price after discounts with the maximum discount applied, we can assess whether the discounts are having an overly significant impact or if they are within a reasonable range.
CASE
WHEN (SELECT AVG(l_extendedprice * (1 - l_discount)) FROM lineitem) >
(SELECT MAX(l_discount) FROM {{ _qualytics_self }})
THEN TRUE
ELSE FALSE
END AS is_discount_within_limits
Use Case
Objective: Ensure that the total tax applied to each item in the LINEITEM table is not more than 10% of the extended price.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_TAX |
---|---|---|---|
1 | 1 | 10000 | 900 |
2 | 1 | 15000 | 2000 |
3 | 1 | 20000 | 1800 |
4 | 1 | 10000 | 1500 |
Inputs
- Expression: L_TAX <= L_EXTENDEDPRICE * 0.10
{
"description": "Ensure that the total tax applied to each item in the LINEITEM table is not more than 10% of the extended price",
"coverage": 1,
"properties": {
"expression":"L_TAX <= L_EXTENDEDPRICE * 0.10"
},
"tags": [],
"fields": ["L_TAX", "L_EXTENDEDPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "satisfiesExpression",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 4 do not satisfy the rule because the L_TAX
values are more than 10% of their respective L_EXTENDEDPRICE
values.
graph TD
A[Start] --> B[Retrieve L_EXTENDEDPRICE and L_TAX]
B --> C{Is L_TAX <= L_EXTENDEDPRICE * 0.10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
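For reference, a query along these lines applies the same expression directly to the sample LINEITEM data (names taken from the example above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_extendedprice,
    l_tax
from
    lineitem
where
    not (l_tax <= l_extendedprice * 0.10);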
Potential Violation Messages
Record Anomaly
The record does not satisfy the expression: L_TAX <= L_EXTENDEDPRICE * 0.10
Shape Anomaly
50.000% of 4 filtered records (2) do not satisfy the expression: L_TAX <= L_EXTENDEDPRICE * 0.10
Sum
Definition
Asserts that the sum of a field is a specific amount.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Ensures that the total sum of values in a specified field matches a defined amount.
Name | Description |
---|---|
Sum |
Specifies the expected sum of the values in the field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the total discount value in the LINEITEM table does not exceed $2000.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_DISCOUNT | L_DISCOUNT_VALUE |
---|---|---|---|---|
1 | 1 | 10000 | 0.05 | 500 |
2 | 1 | 8000 | 0.10 | 800 |
3 | 1 | 7000 | 0.05 | 350 |
4 | 1 | 5000 | 0.10 | 500 |
{
"description": "Ensure that the total discount value in the LINEITEM table does not exceed $2000",
"coverage": 1,
"properties": {
"value": "2000"
},
"tags": [],
"fields": ["L_DISCOUNT_VALUE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "sum",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the total of the L_DISCOUNT_VALUE
column is (500 + 800 + 350 + 500 = 2150), which exceeds the specified maximum total discount value of $2000.
graph TD
A[Start] --> B[Retrieve L_DISCOUNT_VALUE]
B --> C{Sum of L_DISCOUNT_VALUE <= 2000?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
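For reference, a query along these lines checks the aggregate against the expected amount for the sample data (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    sum(l_discount_value) as total_discount_value
from
    lineitem
having
    sum(l_discount_value) <> 2000;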
Potential Violation Messages
Shape Anomaly
In L_DISCOUNT_VALUE
, the sum of the 4 records is not 2000.000
Time Distribution Size
Definition
Asserts that the count of records for each interval of a timestamp is between two numbers.
In-Depth Overview
The Time Distribution Size
rule helps in identifying irregularities in the distribution of records over time intervals such as hours, days, or months.
For instance, in a retail context, it could ensure that there’s a consistent number of orders each month to meet business targets. A sudden drop in orders might highlight operational issues or shifts in market demand that require immediate attention.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Timestamp |
|
Date |
Specific Properties
Name | Description |
---|---|
Interval |
Defines the time interval for segmentation. |
Min Count |
Specifies the minimum count of records in each segment. |
Max Count |
Specifies the maximum count of records in each segment. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the number of orders for each month is consistently between 5 and 10.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 2023-01-01 |
2 | 2023-01-15 |
3 | 2023-01-20 |
4 | 2023-01-25 |
5 | 2023-02-01 |
6 | 2023-02-05 |
7 | 2023-02-10 |
8 | 2023-02-15 |
9 | 2023-02-20 |
10 | 2023-02-25 |
11 | 2023-02-28 |
{
"description": "Ensure that the number of orders for each month is consistently between 5 and 10",
"coverage": 1,
"properties": {
"interval_name": "Monthly",
"min_size": 5,
"max_size": 10
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "timeDistributionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the January segment fails the rule because there are only 4 orders, which is below the specified minimum count of 5.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Segment data by month}
C --> D{Is count of records in each segment between 5 and 10?}
D -->|Yes| E[End]
D -->|No| F[Mark as Anomalous]
F --> E
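For reference, a query along these lines segments the sample ORDERS data by month and flags out-of-range segments (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    date_trunc('month', o_orderdate) as order_month,
    count(*) as order_count
from
    orders
group by
    date_trunc('month', o_orderdate)
having
    count(*) < 5 or count(*) > 10;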
Potential Violation Messages
Shape Anomaly
50.000% of the monthly segments of O_ORDERDATE
have record counts not between 5 and 10.
Unique
Definition
Asserts that every value held by a field appears only once. If multiple fields are specified, then every combination of values of the fields should appear only once.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that each combination of C_NAME and C_ADDRESS in the CUSTOMER table is unique.
Sample Data
C_CUSTKEY | C_NAME | C_ADDRESS |
---|---|---|
1 | Customer_A | 123 Main St |
2 | Customer_B | 456 Oak Ave |
3 | Customer_A | 123 Main St |
4 | Customer_C | 789 Elm St |
{
"description": "Ensure that each combination of C_NAME and C_ADDRESS in the CUSTOMER table is unique",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["C_NAME", "C_ADDRESS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "unique",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with C_CUSTKEY
1 and 3 have the same C_NAME
and C_ADDRESS
, which violates the rule because this combination of values should be unique.
graph TD
A[Start] --> B[Retrieve C_NAME and C_ADDRESS]
B --> C{Is the combination unique?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
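For reference, a query along these lines against the sample CUSTOMER data surfaces duplicated combinations (names taken from the example above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_name,
    c_address,
    count(*) as occurrences
from
    customer
group by
    c_name,
    c_address
having
    count(*) > 1;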
Potential Violation Messages
Shape Anomaly
In C_NAME
and C_ADDRESS
, 25.000% of 4 filtered records (1) are not unique.
User Defined Function
Definition
Asserts that the given user-defined function (provided as a Scala script) evaluates to true over the field's value.
In-Depth Overview
The User Defined Function
rule enables the application of a custom Scala function on a specified field, allowing for highly customizable and flexible validation based on user-defined logic.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Implements a user-defined Scala script.
Name | Description |
---|---|
Scala Script |
The custom Scala script to evaluate each record. |
Note
The Scala script must define a function that takes the field's value as its sole parameter and returns a Boolean indicating whether the record is valid.
The example in the section below illustrates the expected shape of such a function.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Validate that each record in the LINEITEM table has a well-structured JSON in the L_ATTRIBUTES column by ensuring the presence of essential keys: "color", "weight", and "dimensions".
Sample Data
L_ORDERKEY | L_LINENUMBER | L_ATTRIBUTES |
---|---|---|
1 | 1 | {"color": "red", "weight": 15, "dimensions": "10x20x15"} |
2 | 2 | {"color": "blue", "weight": 20} |
3 | 1 | {"color": "green", "dimensions": "5x5x5"} |
4 | 3 | {"weight": 10, "dimensions": "20x20x20"} |
Inputs
Scala Script
(lAttributes: String) => {
import play.api.libs.json._
try {
val json = Json.parse(lAttributes)
// Define the keys we expect to find in the JSON
val expectedKeys = List("color", "weight", "dimensions")
// Check if the expected keys are present in the JSON
expectedKeys.forall(key => (json \ key).toOption.isDefined)
} catch {
case e: Exception => false // Return false if parsing fails
}
}
{
"description": "Validate that each record in the LINEITEM table has a well-structured JSON in the L_ATTRIBUTES column by ensuring the presence of essential keys: "color", "weight", and "dimensions"",
"coverage": 1,
"properties": {"assertion":"(lAttributes: String) => {\n import play.api.libs.json._\n\n try {\n val json = Json.parse(lAttributes)\n \n // Define the keys we expect to find in the JSON\n val expectedKeys = List(\"color\", \"weight\", \"dimensions\")\n \n // Check if the expected keys are present in the JSON\n expectedKeys.forall(key => (json \\ key).toOption.isDefined)\n } catch {\n case e: Exception => false // Return false if parsing fails\n }\n }"},
"tags": [],
"fields": ["L_ATTRIBUTES"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "userDefinedFunction",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2, 3, and 4 do not satisfy the rule because they lack at least one of the essential keys ("color", "weight", "dimensions") in the L_ATTRIBUTES column.
graph TD
A[Start] --> B[Retrieve L_ATTRIBUTES]
B --> C{Does L_ATTRIBUTES contain all essential keys?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_ATTRIBUTES
value of {"color": "blue", "weight": 20}
does not evaluate true as a parameter to the given UDF.
Shape Anomaly
In L_ATTRIBUTES
, 75.000% of 4 filtered records (3) do not evaluate true as a parameter to the given UDF.
Volumetrics Check
The Volumetric Check has been introduced to help users monitor and maintain the stability of data volumes across various datasets. This feature ensures that the size or volume of your data (either in terms of rows or bytes) remains within an acceptable range based on historical trends. It is designed to detect significant fluctuations by comparing the current data volume against a moving daily average.
How It Works
The system automatically infers and maintains volumetric checks based upon observed daily, weekly, and monthly averages. These checks enable proactive management of data volume trends, ensuring that any unexpected deviations are identified as anomalies for review.
Automating Adaptive Volumetric Checks
The following Volumetric Checks are automatically inferred for data assets with automated volume measurements enabled:
- Daily: the expected daily volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous 7-day moving average.
- Weekly: the expected weekly volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous four weeks' weekly volume moving average.
- Monthly: the expected 4-week volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous sixteen weeks' 4-week volume moving average.
Scan Assertion and Anomaly Creation
Volumetric Checks are asserted during a Scan Operation just like all other check types and enrichment of volumetric check anomalies is fully supported. This enables full support for custom scheduling of volumetric checks and remediation workflows of volumetric anomalies.
Adaptive Thresholds and Human Adjustments
Each time a volume measurement is recorded for a data asset, the system will automatically infer and update any inferred Volumetric Checks for that asset.
By default, thresholds are set to 2 standard deviations from the moving average, but the system will adapt over time using inference weights to fine-tune these thresholds based on historical trends.
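As a rough illustration of the threshold arithmetic described above (the table name daily_volumes and its columns measured_at and row_count are hypothetical and do not reflect the platform's internals), a 7-day moving average with 2-standard-deviation bounds could be sketched in SQL as:
-- An illustrative SQL sketch of daily thresholds derived from a 7-day moving average
select
    measured_at,
    row_count,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as moving_avg,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding)
        - 2 * stddev(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as min_threshold,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding)
        + 2 * stddev(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as max_threshold
from
    daily_volumes;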
This feature is essential for maintaining data integrity and ensuring that any deviations from expected data volumes are quickly identified and addressed.
Ended: Rule Types
Check Templates
Check Templates empower users to efficiently create, manage, and apply standardized checks across various datastores, acting as blueprints that ensure consistency and data integrity across different datasets and processes.
Check templates streamline the validation process by enabling check management independently of specific data assets such as datastores, containers, or fields. These templates reduce manual intervention, minimize errors, and provide a reusable framework that can be applied across multiple datasets, ensuring all relevant data adheres to defined criteria. This not only saves time but also enhances the reliability of data quality checks within an organization.
Let's get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Step 2: Click on the “Add Check Template” button located in the top right corner.
A modal window titled “Check Template Details” will appear, providing you the options to add the check template details.
Step 3: Enter the following details to add the check template:
- Rule Type (Required)
- Filter Clause
- Description (Required)
- Tag
- Additional Metadata
- Template Locked
1. Rule Type (Required): Select a Rule type from the dropdown menu for data validation, such as checking for non-null values, matching patterns, comparing numerical values, or verifying date-time constraints. Each rule type defines the specific validation logic to be applied.
For more details about the available rule types, refer to the "Check Rule Types" section.
Note
Different rule types have different sets of fields and options appearing when selected.
2. Filter Clause: Specify a valid Spark SQL WHERE expression to filter the data on which the check will be applied.
The filter clause defines the conditions under which the check will be applied. It typically includes a WHERE statement that specifies which rows or data points should be included in the check.
Example: A filter clause might be used to apply the check only to rows where a certain column meets a specific condition, such as WHERE status = 'active'.
Adjust the Coverage setting to specify the percentage of records that must comply with the check.
Note
The Coverage setting applies to most rule types and allows you to specify the percentage of records that must meet the validation criteria.
3. Description (Required): Enter a detailed description of the check template, including its purpose, applicable data, and relevant information to ensure clarity for users. If you're unsure of what to include, click on the "💡" lightbulb icon to apply a suggested description based on the rule type.
Example: "The < field > must exist in bank_transactions_*.csv.Total_Transaction_Amount
(Bank Dataset - Staging)".
This description clarifies that the specified field must be present in a particular file (bank_transactions_*.csv
) and column (Total_Transaction_Amount
) within the Bank Dataset.
4. Tag: Assign relevant tags to your check template to facilitate easier searching and filtering based on categories like "data quality," "financial reports," or "critical checks."
5. Additional Metadata: Add key-value pairs as additional metadata to enrich your check. Click the plus icon (+) next to this section to open the metadata input form, where you can add key-value pairs.
Enter the desired key-value pairs (e.g., DataSourceType: SQL Database and PriorityLevel: High). After entering the necessary metadata, click "Confirm" to save the custom metadata.
6. Template Locked: Check or uncheck the "Template Locked" option to determine whether all checks created from this template will have their properties automatically synced to any changes made to the template.
For more information about the template state, jump to the "Template State" section below.
Step 4: Once you have entered all the required fields, click the “Save” button to finalize the template.
Warning
Once a template is saved, the selected rule type becomes locked and cannot be changed.
After clicking the "Save" button, your check template is created, and a success flash message will appear stating, "Check Template successfully created."
After saving the check template, you can now Apply a Check Template to create Quality Checks, which will enforce the validation rules defined in the template across your datastores. This ensures consistent data quality and compliance with the criteria you’ve established.
Template State
Any changes to a template may or may not impact its related checks, depending on whether the template state is locked or unlocked. Managing the template state allows you to control if updates automatically apply to all related checks or let them function independently.
Unlocked
- Quality Checks can evolve independently of the template. Subsequent updates to an unlocked Check Template do not affect its related quality checks
Locked
- Quality Checks from a locked Check Template will inherit changes made to the template. Subsequent updates to a locked Check Template do affect its related quality checks
Info
Tags are synced regardless of whether a Check Template is locked or unlocked, while Description and Additional Metadata are not synced. This behavior applies to all Check Templates.
graph TD
A[Start] -->|Is `Template Locked` enabled?| B{Yes/No}
B -->|No| E[The quality check can evolve independently]
B -->|Yes| C[They remain synchronized with the template]
C --> D[End]
E --> D[End]
Apply Check Template for Quality Checks
You can apply check templates to make quality checks easier and more consistent. Using a predefined template lets you quickly verify that your data meets specific standards, reducing mistakes and improving data quality. Applying these templates simplifies the process, making finding and fixing errors more efficient, and ensures your quality checks are applied across different projects or systems without starting from scratch.
Let’s get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Here you can view the list of all the customer data validation templates.
Step 2: Locate the template, click on the vertical ellipsis (three dots) next to it, and select “Add Check” from the dropdown menu to create a Quality Check based on this template
For demonstration purposes, we have selected the “After Date Time” template.
A modal window titled “Authored Check Template” will appear, displaying all the details of the Quality Check Template.
Step 3: Enter the following details:
1. Associate with a Check Template:
-
If you toggle ON the "Associate with a Check Template" option, the check will be linked to a specific template.
-
If you toggle OFF the "Associate with a Check Template" option, the check will not be linked to any template, which allows you full control to modify the properties independently.
Since we are applying a check template to create quality checks, it's important to keep the toggle on to ensure the template is applied as a quality check.
2. Template: Choose a Template from the dropdown menu that you want to associate with the quality check. The check will inherit properties from the selected template.
-
Locked: If the template is locked, it will automatically sync with any future updates made to the template. However, you won't be able to modify the check's properties directly, except for specific fields like Datastore, Table, and Fields, which can still be updated while maintaining synchronization with the template.
-
Unlocked: If the template is unlocked, you are free to modify the check's properties as needed. However, any future updates to the template will no longer affect this check, as it will no longer be synced with the template.
3. Datastore: Select the Datastore, Table and Field where you want to apply the check template. This ensures that the template is linked to the correct data source, allowing the quality checks to be performed on the specified datastore.
For demonstration purposes, we have selected the “MIMIC II” datastore, with the “ADMISSIONS” table and the “ADMITTIME” field.
Step 4: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message will appear saying "Validation Successful".
If the validation fails, a red message will appear saying "Failed Validation". This typically occurs when the check logic or parameters do not match the data properly.
Step 5: Once you have a successful validation, click the "Save" button.
Info
You can create as many Quality checks as you want for a specific template.
After clicking on the “Save” button your check is successfully created and a success flash message will appear saying “Check successfully created”.
Export Check Templates
You can export check templates to easily share or reuse your quality check settings across different systems or projects. This saves time by eliminating the need to recreate the same checks repeatedly and ensures that your quality standards are consistently applied. Exporting templates helps maintain accuracy and efficiency in managing data quality across various environments.
Let’s get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Step 2: Click on the “Export Check Template” button located in the top right corner.
Step 3: A modal window titled “Export Check Templates” will appear, where you have to select the enrichment store to which the check templates will be exported.
Step 4: Once you have selected the enrichment store, click on the “Export” button
After clicking “Export,” the process starts, and a message will confirm that the metadata will be available in your Enrichment Datastore shortly.
Review Exported Check Templates
Step 1: Once the checks have been exported, navigate to the “Enrichment Datastores” section located on the left menu.
Step 2: In the “Enrichment Datastores” section, select the datastore where you exported the check templates. The exported check templates will now be visible in the selected datastore.
When you export check templates, you can reuse them for other datastores, share them with teams, or save them as a backup. Once exported, the templates can be imported and customized to fit different datasets, making them versatile and easy to adapt.
You also have the option to download them as a CSV file, allowing you to share or store them for future use.
Ended: Data Quality Checks
Observability ↵
Observability
Observability gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric. The Volumetric check automatically monitors the number of rows in a table and flags unusual changes, while the Metric check focuses on specific fields, providing more detailed insights from scan operations. Together, these tools help users spot data anomalies quickly and keep their data accurate.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu that you want to monitor
Step 2: Click on the “Observability” from the Navigation tab.
Observability Categories
In Observability, data checks are divided into two categories: Volumetric and Metric. Volumetric checks track overall data volume, while Metric checks focus on specific data attributes. These two categories work together to offer comprehensive insights into data trends and anomalies.
1. Volumetric: Volumetric is a tool that automatically tracks changes in the amount of data within a table over time. It monitors row counts and compares them to expected ranges based on historical data. If the data volume increases or decreases unexpectedly, the check flags it for further review. This feature helps users easily identify unusual data patterns without manual monitoring.
2. Metric: Metric measures data based on predefined fields and thresholds, tracking changes in data values over time. It detects if the value of a specific field, like the average or absolute value, goes beyond the expected range. Using scheduled scans, it automatically records and analyzes these values, helping users quickly spot any anomalies. This check gives deeper insights into data behavior, ensuring data integrity and identifying irregular patterns easily.
Volumetric
Volumetric checks help monitor data volumes over time to keep data accurate and reliable. They automatically count rows in a table and spot any unusual changes, like problems with data loading. This makes it easier to catch issues early and keep everything running smoothly. Volumetric checks also let you track data over different time periods, like daily or weekly. The system sets limits based on past data, and if the row count goes above or below those limits, an anomaly alert is triggered.
No | Field | Description |
---|---|---|
1 | Search | This feature helps users quickly find specific identifiers or names in the data. |
2 | Report Date | Report Date lets users pick a specific date to view data trends for that day. |
3 | Time Frame | The time frame option lets users choose a period (week, month, quarter, or year) to view data trends. |
4 | Sort By | Sort By option helps users organize data by criteria like Volumetrics Count, Name, or Last Scanned for quick access. |
5 | Filter | The filter lets users easily refine results by choosing specific tags or tables to view. |
6 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
7 | Table | Displays the table for which the volumetric check is being performed (e.g., customer_view, nation). Each table has its own Volumetric Check. |
8 | Check (# ID) | Each check is assigned a unique identifier, followed by the time period it applies to (e.g., 1 Day for the customer table). This ID helps in tracking the specific check in the system. |
9 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
10 | Anomaly Detection | The Volumetric Check detects anomalies when row counts exceed set min or max thresholds, triggering an alert for sudden changes. |
11 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
12 | Group By | Users can also Group By specific intervals, such as day, week, or month, to observe trends over different periods. |
13 | Measurement Period | Defines the time period over which the volumetric check is evaluated. It can be customized to 1 day, week, or other timeframes. |
14 | Min Values | These indicate the minimum thresholds for the row count of the table being checked (e.g., 150,139 Rows) |
15 | Max Values | These indicate the maximum thresholds for the row count of the table being checked. |
16 | Last Asserted | This shows the date the last check was asserted, which is the last time the system evaluated the Volumetric Check (e.g., Oct 02, 2024). |
17 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
18 | Graph Visualization | The graph provides a visual representation of the row count trends. It shows fluctuations in data volume over the selected period. This visual allows users to quickly identify any irregularities or anomalies. |
Observability Heatmap
The heatmap provides a visual overview of data anomalies by day, using color codes for quick understanding:
- Blue square: Blue squares represent days with no anomalies, meaning data stayed within the expected range.
- Orange square: Orange squares indicate days where data exceeded the minimum or maximum threshold range but didn’t qualify as a critical anomaly.
- Red square: Red squares highlight days with anomalies, signaling significant deviations from expected values that need further investigation.
By hovering over each square, you can view additional details for that specific day, including the date, last row count, and anomaly count, allowing you to easily pinpoint and analyze data issues over time.
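For intuition only, the snippet below shows one way a day could be bucketed into the blue, orange, or red states described above, given that day's row count, the current threshold band, and whether an anomaly was recorded. The function and its inputs are assumptions for illustration, not the platform's implementation.

```python
def heatmap_color(row_count, min_rows, max_rows, anomaly_count=0):
    """Illustrative bucketing of a single day for the heatmap described above."""
    if anomaly_count > 0:
        return "red"     # anomalies recorded: significant deviation to investigate
    if row_count < min_rows or row_count > max_rows:
        return "orange"  # outside the threshold band, but not flagged as an anomaly
    return "blue"        # within the expected range

print(heatmap_color(150_000, 149_000, 152_000))                    # blue
print(heatmap_color(155_000, 149_000, 152_000))                    # orange
print(heatmap_color(120_000, 149_000, 152_000, anomaly_count=1))   # red
```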
Edit Check
Editing a Volumetric Check lets users customize settings like measurement period, row count limits, and description. This helps improve data monitoring and anomaly detection, ensuring the check fits specific needs. Users can also add tags for better organization.
Note
When editing checks, only the properties and metadata can be modified.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Fields | Description |
---|---|---|
1 | Measurement Period Days | Edit the Measurement Period Days to change how often the check runs (e.g., 1 day, 2 days, etc.). |
2 | Min Value and Max Value | Edit the Min Value and Max Value to set new row count limits. If the row count exceeds these limits, an alert will be triggered. |
3 | Description | Edit the Description to better explain what the check does. |
4 | Tags | Edit the Tags to organize and easily find the check later. |
5 | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit thresholds to set specific row count limits for your data checks. By defining minimum and maximum values, you ensure alerts are triggered when data goes beyond the expected range. This helps you monitor unusual changes in data volume. It gives you better control over tracking your data's behavior.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Mark Check as Favorite
Marking a Volumetric Check as a favorite allows you to easily access important checks quickly. This feature helps you prioritize and manage the checks you frequently use, making data monitoring more efficient.
Step 1: Click on the bookmark icon to mark the Volumetric Check as a favorite.
After Clicking on the bookmark icon your check is successfully marked as a favorite and a success flash message will appear stating “Check has been favorited”.
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites.
Metric
Metric checks track changes in data over time to ensure accuracy and reliability. They check specific fields against set limits to identify when values, like averages, go beyond expected ranges. With scheduled scans, Metrics automatically log and analyze these data points, making it easy for users to spot any issues. This functionality enhances users' understanding of data patterns, ensuring high quality and dependability. With Metrics, managing and monitoring data becomes straightforward and efficient.
No | Field | Description |
---|---|---|
1 | Search | The search bar helps users find specific metrics or data by entering an identifier or description. |
2 | Sort By | Sort By allows users to organize data by Weight, Anomalies, or Created Date for easier analysis and prioritization. |
3 | Filter | Filter lets users refine data by Tags or Tables. Use Apply to filter or Clear to reset. |
4 | Metric(ID) | Represents the tracked data metric with a unique ID. |
5 | Description | A brief label or note about the metric; in this case, it is labeled as test. |
6 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
7 | Anomalies | Anomalies show unexpected changes or issues in the data that need attention. |
8 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
9 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
10 | Field | This refers to the specific field being measured, here the max_value, which tracks the highest value observed for the metric. |
11 | Min | This indicates the minimum value for the metric, which is set to 1. If not defined, no lower limit is applied. |
12 | Max | This field shows the maximum threshold for the metric, set at 8. Exceeding this may indicate an issue or anomaly. |
13 | Created Date | This field shows when the metric was first set up, in this case, June 18, 2024. |
14 | Last Asserted | Last Asserted field shows the last time the metric was checked, in this case July 25, 2024. |
15 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
16 | Group By | This option lets users group data by periods like Day, Week, or Month. In this example, it's set to Day. |
Comparisons
When you add a metric check, you can choose from three comparison options:
- Absolute Change
- Absolute Value
- Percentage Change.
These options help define how the system will evaluate your data during scan operations on the datastore.
Once a scan is run, the system analyzes the data based on the selected comparison type. For example, Absolute Change will look for significant differences between scans, Absolute Value checks if the data falls within a predefined range, and Percentage Change identifies shifts in data as a percentage.
Based on the chosen comparison type, the system flags any deviations from the defined thresholds. These deviations are then visually represented on a chart, displaying how the metric has fluctuated over time between scans. If the data crosses the upper or lower limits during any scan, the system will highlight this in the chart for further analysis.
1. Absolute Change: The Absolute Change comparison checks how much a numeric field's value has changed between scans. If the change exceeds a set limit (Min/Max), it flags this as an anomaly.
2. Absolute Value: The Absolute Value comparison checks whether a numeric field's value falls within a defined range (between Min and Max) during each scan. If the value goes beyond this range, it identifies it as an anomaly.
3. Percentage Change: The Percentage Change comparison monitors how much a numeric field's value has shifted in percentage terms. If the change surpasses the set percentage threshold between scans, it triggers an anomaly.
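The sketch below illustrates how the three comparison types could be evaluated for a numeric field between two consecutive scans. The function name, signature, and threshold semantics are assumptions for this example; the actual evaluation happens inside the scan operation.

```python
def is_metric_anomalous(comparison, current, previous=None, min_value=None, max_value=None):
    """Illustrative evaluation of the three comparison types described above.
    Returns True when the measurement should be flagged as anomalous."""
    if comparison == "absolute_value":
        # the measured value must stay within [min_value, max_value]
        return ((min_value is not None and current < min_value)
                or (max_value is not None and current > max_value))
    if comparison == "absolute_change":
        # the change between scans must stay within the allowed band
        delta = abs(current - previous)
        return ((min_value is not None and delta < min_value)
                or (max_value is not None and delta > max_value))
    if comparison == "percentage_change":
        # the percentage shift between scans must stay under the threshold
        pct_change = abs(current - previous) / abs(previous) * 100
        return max_value is not None and pct_change > max_value
    raise ValueError(f"unknown comparison type: {comparison}")

print(is_metric_anomalous("absolute_value", current=9, min_value=1, max_value=8))          # True
print(is_metric_anomalous("percentage_change", current=120, previous=100, max_value=15))   # True
```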
Edit Check
Edit Check allows users to modify the parameters of an existing metric check. It enables adjusting values, thresholds, or comparison logic to ensure that the metric accurately reflects current monitoring needs.
Note
When editing checks, only the properties and metadata can be modified.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Field | Description |
---|---|---|
1 | Min Value and Max Value | Edit the Min Value and Max Value to set new limits for the metric. If the measured value exceeds these limits, an alert will be triggered. |
2 | Description | Edit the Description to better explain what the check does. |
3 | Tags | Edit the Tags to organize and easily find the check later. |
4 | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit Threshold allows you to change the upper and lower limits of a metric. This ensures the metric tracks data within the desired range and only triggers alerts when those limits are exceeded.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Mark Check as Favorite
Marking a Metric Check as a favorite allows you to easily access important checks quickly. This feature helps you prioritize and manage the checks you frequently use, making data monitoring more efficient.
Step 1: Click on the bookmark icon to mark the Metric Check as a favorite.
After Clicking on the bookmark icon your check is successfully marked as a favorite and a success flash message will appear stating “Check has been favorited”
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites
Ended: Observability
Anomalies ↵
Anomalies
An anomaly in Qualytics is a data set (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are detected when the data does not satisfy the applied validation criteria, which could include both inferred and authored checks.
Let’s get started 🚀
Anomaly Detection Process
The anomaly detection process in Qualytics ensures data quality and reliability through a series of systematic steps as discussed below.
1. Create a Datastore and Connection
By setting up a datastore and establishing a connection to your data source (database or file system), you create a robust foundation for effective data management and analysis in Qualytics. This setup enables you to access, manipulate, and utilize your data efficiently, paving the way for advanced data quality checks, profiling, scanning, anomaly surveillance, and other analytics tasks.
Note
For more information, please refer to the documentation on Configuring Datastores
2. Catalog Operation
The Catalog operation involves systematically collecting data structures along with their corresponding metadata. This process also includes a thorough analysis of the existing metadata within the datastore. This ensures a solid foundation for the subsequent Profile and Scan operations.
Note
For more information, please refer to the documentation Catalog Operation
3. Profile Operation
The Profile operation enables training of the collected data structures and their associated metadata values. This is crucial for gathering comprehensive aggregating statistics on the selected data, providing deeper insights, and preparing the data for quality assessment.
Note
For more information, please refer to the documentation Profile Operation
4. Create Authored Checks
Authored Checks are manually created data quality checks in Qualytics, defined by users either through the user interface (UI) or via API. Each check encapsulates a specific data quality rule, along with additional context such as associated notifications, tags, filters, and tolerances.
Authored checks can range from simple, template-based checks to more complex rules implemented through SQL or user-defined functions (UDFs) in Scala. By allowing users to define precise criteria for data quality, authored checks enable detailed monitoring and validation of data within the datastore, ensuring that it meets the specified standards and requirements.
Note
For more information, please refer to the documentation Checks
5. Scan Operation
The Scan operation asserts rigorous quality checks to identify any anomalies within the data. This step ensures data integrity and reliability by recording the analyzed data in your configured enrichment datastore, facilitating continuous data quality improvement.
Note
For more information, please refer to the documentation Scan Operation
6. Anomaly Analysis
An Anomaly is a data record or column that fails a data quality check during a Scan Operation. These anomalies are identified through both Inferred and Authored Checks and are grouped together to highlight data quality issues. This process ensures that any deviations from expected data quality standards are promptly identified and addressed.
Note
For more information, please refer to the documentation Anomalies.
Types of Anomalies
Anomaly detection is categorized into two main types: Record Anomalies and Shape Anomalies. Both types play a crucial role in maintaining data integrity by identifying deviations at different levels of the dataset.
Record Anomaly
A record anomaly identifies a single record (row) as anomalous and provides specific details regarding why it is considered anomalous. The simplest form of a record anomaly is a row that lacks an expected value for a field.
Example: Consider a data quality check that requires the salary field to be greater than 40,000. Based on this rule, any record that does not meet this condition will be identified and labeled as a record anomaly by Qualytics. In the sample table illustrated below, the row with id 6 will be flagged as a record anomaly because the salary of 30,000 is less than the required 40,000. This precise identification allows for targeted investigation and correction of specific data issues.
ID | Name | Age | Salary |
---|---|---|---|
1 | John Doe | 28 | 50000 |
2 | Jane Smith | 35 | 75000 |
3 | Bob Johnson | 22 | 45000 |
4 | Alice Brown | 30 | 60000 |
5 | John Wick | 29 | 70000 |
6 | David White | 45 | 30000 |
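Outside the platform, the same rule could be expressed as a simple filter over the example rows above. The snippet below is only an illustration of the logic, using the field names from the sample table.

```python
# Sample rows mirroring the example table above
records = [
    {"id": 1, "name": "John Doe",    "age": 28, "salary": 50_000},
    {"id": 2, "name": "Jane Smith",  "age": 35, "salary": 75_000},
    {"id": 3, "name": "Bob Johnson", "age": 22, "salary": 45_000},
    {"id": 4, "name": "Alice Brown", "age": 30, "salary": 60_000},
    {"id": 5, "name": "John Wick",   "age": 29, "salary": 70_000},
    {"id": 6, "name": "David White", "age": 45, "salary": 30_000},
]

# Flag every row that fails the "salary must be greater than 40,000" check
record_anomalies = [row for row in records if not row["salary"] > 40_000]
print([row["id"] for row in record_anomalies])   # [6]
```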
Shape Anomaly
A shape anomaly identifies anomalous structure within the analyzed data. The simplest shape anomaly is a dataset that doesn't match the expected schema because it lacks one or more fields. Some shape anomalies only apply to a subset of the data being analyzed and can therefore produce a count of the number of rows that reflect the anomalous concern. Where that is possible, the shape anomaly's anomalous_record_count is populated.
Note
Sometimes, shape anomalies only affect a subset of the dataset. This means that only certain rows exhibit the structural issue, rather than the entire dataset.
Note
When a shape anomaly affects only a portion of the dataset, Qualytics can count the number of rows that have the structural problem. This count is stored in the anomalous_record_count field, providing a clear measure of how widespread the issue is within the dataset.
Example: Imagine a dataset that is supposed to have columns for id, name, age, and salary. If some rows are missing the salary column, this would be flagged as a shape anomaly. If this issue only affects 50 out of 1,000 rows, the anomalous_record_count would be 50, indicating that 50 rows have a structural issue.
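As a rough illustration of that count, the helper below checks semi-structured records against an expected set of fields and counts the rows that do not conform. The function is an assumption for illustration only.

```python
def count_anomalous_records(records, expected_fields):
    """Illustrative: count rows that lack one or more expected fields."""
    return sum(1 for row in records if not expected_fields.issubset(row.keys()))

rows = [
    {"id": 1, "name": "A", "age": 30, "salary": 50_000},
    {"id": 2, "name": "B", "age": 41},                    # missing `salary`
    {"id": 3, "name": "C", "age": 22, "salary": 45_000},
]
print(count_anomalous_records(rows, {"id", "name", "age", "salary"}))   # 1
```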
Anomaly Status
Anomaly status is a crucial feature for managing and addressing data quality issues. It provides a structured way to track and resolve anomalies detected in your data, ensuring that data integrity is maintained. Here's a breakdown of the different anomaly statuses:
Active: This status indicates that the data anomaly has been detected and is currently unresolved, requiring attention to address the underlying issue.
Acknowledged: This status indicates that the anomaly has been verified as a legitimate data quality concern but has not yet been resolved.
Resolved: This status indicates that the anomaly was a legitimate data quality concern that has been addressed and fixed.
Invalid: This status indicates that the anomaly is not a legitimate data quality concern, possibly due to it being a false positive or an error in the detection process.
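Informally, these statuses can be pictured as a small set of allowed transitions. The mapping below is a sketch drawn from this guide (archived anomalies can be restored to Acknowledged, and an acknowledged anomaly never reverts to Active); it is not an API or a complete model, and later sections also describe a Duplicate archive reason.

```python
# Informal sketch of the anomaly status lifecycle described above (assumption,
# not a platform API): which target statuses are reachable from each status.
ALLOWED_TRANSITIONS = {
    "Active":       {"Acknowledged", "Resolved", "Invalid"},
    "Acknowledged": {"Resolved", "Invalid"},
    "Resolved":     {"Acknowledged"},   # via Restore
    "Invalid":      {"Acknowledged"},   # via Restore
}

def can_transition(current_status, target_status):
    return target_status in ALLOWED_TRANSITIONS.get(current_status, set())

print(can_transition("Active", "Acknowledged"))   # True
print(can_transition("Acknowledged", "Active"))   # False: never reverts to Active
```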
Anomaly Details
The anomaly identified during the scan operation illustrates the following details:
1. Anomaly ID: A numerical identifier (e.g. #75566) used for quick search and easy identification of anomalies within the Qualytics interface.
2. Status: Reflects the status of the Anomaly. It can be active, acknowledged, resolved, or invalid.
3. Share: Copy a shareable link to a specific anomaly. This can be shared with other users or stakeholders for collaboration.
4. Anomaly UID: A longer, standardized, and globally unique identifier, displayed as a string of hexadecimal characters. This can be copied to the clipboard.
5. Type: This reflects the type to which the anomaly belongs (e.g. Record or Shape).
6. Weight: A metric that indicates the severity or importance of the anomaly. Higher weights indicate more critical issues.
7. Detected: Reflects the timestamp when the anomaly was first detected.
8. Scan: Click on this to view the outcome of a data quality scan. It includes the scan status, the time taken, the user who triggered it, the schedule status, and a detailed list of anomalies detected across various tables.
In addition to the above details, the users can also explore the following additional details of the Anomaly:
1. Name: This indicates the name of the source datastore where the anomaly was detected.
Tip
Clicking on the expand icon opens a detailed view and navigates to the dataset’s page, providing more information about the source datastore where the anomaly was found.
2. Table Name: This specifies the particular table within the dataset that contains the anomaly. It helps in pinpointing the exact location of the data quality issue.
Tip
Clicking on the expand icon navigates to the table’s page, providing more in-depth information about the table structure and contents.
3. Location: Full path or location within the data hierarchy where the table resides. It gives a complete reference to the exact position of the data in the database or data warehouse.
Source Records
The Source Records section displays all the data and fields related to the detected anomaly from the dataset. An Enrichment Datastore is used to store the analyzed results, including any anomalies and additional metadata, in files, so it is recommended to add or link an enrichment datastore to your connected source datastore.
If the Anomaly Type is Shape, you will find the highlighted column(s) having anomalies in the source record.
If the Anomaly Type is Record, you will find the highlighted row(s) in the source record indicating failed checks. In the snippet below, it can be observed that 7 checks have failed for the row.
Note
In anomaly detection, source records are displayed as part of the Anomaly Details. For a Record anomaly, the specific record is highlighted. For a Shape anomaly, 10 samples from the underlying anomalous records are highlighted.
Comparison Source Records
Anomalies identified by the Is Replica Of data quality rule type, configured with Row Identifiers, are displayed with a detailed source record comparison. This visualization highlights differences between rows, making it easier to identify specific discrepancies.
Structure of the Comparison Source Records:
1. Left and Right Fields:
Each field in the table is split into two columns: the left column represents the target table/file, while the right column represents the reference table/file.
If the value in the right column (reference) differs from the value in the left column (target), the cell is highlighted to indicate an anomalous value.
2. _qualytics_diff Column: This column provides the status of each row, which can be one of the following:
Added: The row is missing on the left side (target) but found on the right side (reference).
Removed: The row is present on the left side (target) but missing on the right side (reference).
Changed: The row is present on both sides, but there is at least one field value that differs between the target and reference.
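To make the classification concrete, here is an illustrative comparison of target and reference rows keyed by row identifiers. The helper and its names are assumptions for this sketch, not the Is Replica Of implementation.

```python
def comparison_diff(target_rows, reference_rows, id_fields):
    """Illustrative Added / Removed / Changed classification keyed by row identifiers."""
    def key(row):
        return tuple(row[field] for field in id_fields)

    target = {key(r): r for r in target_rows}
    reference = {key(r): r for r in reference_rows}

    diff = {}
    for k in reference.keys() - target.keys():
        diff[k] = "Added"        # missing on the left (target), found on the right (reference)
    for k in target.keys() - reference.keys():
        diff[k] = "Removed"      # present on the left (target), missing on the right (reference)
    for k in target.keys() & reference.keys():
        if target[k] != reference[k]:
            diff[k] = "Changed"  # present on both sides with at least one differing value
    return diff

target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
reference = [{"id": 1, "amount": 100}, {"id": 2, "amount": 300}, {"id": 3, "amount": 75}]
print(comparison_diff(target, reference, id_fields=["id"]))
# {(3,): 'Added', (2,): 'Changed'}
```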
Suggested Remediation
Suggested Remediation is a feature that offers users recommended values to correct data anomalies identified during quality checks. In the snippet below, the FIRST_CAREUNIT field has failed the check, and Qualytics suggests CSRU as the remedial value.
Failed Check Description
This allows users to view detailed explanations of each failed check by hovering over the information icon, helping users better understand the nature of the violation.
Download Source Records
Download and export all source records (up to 250MB in a compressed .csv) for further analysis or external use.
Assign Tags
Assigning tags to an anomaly serves the purpose of labeling and grouping anomalies and driving downstream workflows.
Step 1: Click on the Assign tags to this Anomaly or + button.
Step 2: A dropdown menu will appear with existing tags. Scroll through the list and click on the tag you wish to assign.
Anomaly Actions
You can update the status of anomalies by acknowledging them to confirm they are real issues, or by moving them to archive if they are false positives. This helps refine the system’s detection process and keeps your worklist focused on relevant issues.
Acknowledge Anomalies
You can acknowledge anomalies to confirm that they represent real data issues that need attention. By acknowledging an anomaly, you provide feedback to the system, validating the detection and helping it improve future checks.
For more details on how to acknowledge anomalies, please refer to the documentation on Acknowledge Anomalies.
Archive Anomalies
You can archive the anomalies when they are determined to be false positives or not significant. Archiving helps the system adjust its checks to prevent similar issues from being flagged in the future, ensuring more accurate anomaly detection and keeping your worklist streamlined.
For instance, if an anomaly is marked as invalid, the tolerances of the checks that identified the anomaly will be updated to prevent similar false positives in the future. This continuous feedback loop enhances the accuracy and relevance of data quality checks over time.
For more details on how to archive anomalies, please refer to the documentation on Archive Anomalies
API Payload Examples
Retrieving Anomaly by UUID or Id
Endpoint (Get)
/api/anomalies/{id} (get)
Example Result Response
{
"id": 0,
"created": "2024-06-10T21:29:42.695Z",
"uuid": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"type": "shape",
"is_new": true,
"archived": true,
"status": "Active",
"source_enriched": true,
"datastore": {
"id": 0,
"name": "string",
"store_type": "jdbc",
"type": "athena",
"enrich_only": true,
"enrich_container_prefix": "string",
"favorite": true
},
"container": {
"id": 0,
"name": "string",
"container_type": "table",
"table_type": "table"
},
"partition": {
"name": "string",
"location": "string"
},
"weight": 0,
"global_tags": [
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.696Z",
"status": "syncing"
}
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.696Z",
"status": "syncing"
}
}
],
"anomalous_records_count": 0,
"comments": [
{
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"message": "string",
"user": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"user_id": "string",
"email": "string",
"name": "string",
"picture": "string",
"role": "Member",
"deleted_at": "2024-06-10T21:29:42.696Z",
"teams": [
{
"id": 0,
"name": "string",
"permission": "Read"
}
]
}
}
],
"failed_checks": [
{
"quality_check": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"fields": [
{
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "Unknown",
"completeness": 0,
"weight": 0,
"global_tags": [
{
"type": "global",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.697Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.697Z",
"status": "syncing"
}
}
],
"latest_profile_id": 0,
"quality_score": {
"total": 0,
"completeness": 0,
"coverage": 0,
"conformity": 0,
"consistency": 0,
"precision": 0,
"timeliness": 0,
"volumetrics": 0,
"accuracy": 0
}
}
],
"description": "string",
"rule_type": "afterDateTime",
"coverage": 0,
"inferred": true,
"template_locked": true,
"is_new": true,
"num_container_scans": 0,
"filter": "string",
"properties": {
"allow_other_fields": true,
"assertion": "string",
"comparison": "string",
"datetime": "2024-06-10T21:29:42.697Z",
"expression": "string",
"field_name": "string",
"field_type": "Unknown",
"id_field_names": [
"string"
],
"inclusive": true,
"inclusive_max": true,
"inclusive_min": true,
"interval_name": "Yearly",
"last_value": 0,
"list": [
"string"
],
"max": 0,
"max_size": 0,
"max_time": "2024-06-10T21:29:42.697Z",
"min": 0,
"min_size": 0,
"min_time": "2024-06-10T21:29:42.697Z",
"pattern": "string",
"ref_container_id": 0,
"ref_datastore_id": 0,
"tolerance": 0,
"value": 0,
"ref_expression": "string",
"ref_filter": "string",
"required_labels": [
"road"
],
"numeric_comparator": {
"epsilon": 0,
"as_absolute": true
},
"duration_comparator": {
"millis": 0
},
"string_comparator": {
"ignore_whitespace": false
},
"distinct_field_name": "string",
"pair_substrings": true,
"pair_homophones": true,
"spelling_similarity_threshold": 0
},
"template_checks_count": 0,
"anomaly_count": 0,
"type": "global",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.697Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.697Z",
"status": "syncing"
}
}
],
"latest_profile_id": 0,
"quality_score": {
"total": 0,
"completeness": 0,
"coverage": 0,
"conformity": 0,
"consistency": 0,
"precision": 0,
"timeliness": 0,
"volumetrics": 0,
"accuracy": 0
}
}
],
"description": "string",
"rule_type": "afterDateTime",
"coverage": 0,
"inferred": true,
"template_locked": true,
"is_new": true,
"num_container_scans": 0,
"filter": "string",
"properties": {
"allow_other_fields": true,
"assertion": "string",
"comparison": "string",
"datetime": "2024-06-10T21:29:42.697Z",
"expression": "string",
"field_name": "string",
"field_type": "Unknown",
"id_field_names": [
"string"
],
"inclusive": true,
"inclusive_max": true,
"inclusive_min": true,
"interval_name": "Yearly",
"last_value": 0,
"list": [
"string"
],
"max": 0,
"max_size": 0,
"max_time": "2024-06-10T21:29:42.697Z",
"min": 0,
"min_size": 0,
"min_time": "2024-06-10T21:29
}
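For reference, a minimal Python sketch of calling this endpoint is shown below. The base URL, environment variable names, and authentication scheme are assumptions for illustration and should be adapted to your deployment and API token setup.

```python
import os
import requests

# Assumed configuration for this sketch; adjust to your deployment.
BASE_URL = os.environ.get("QUALYTICS_API_URL", "https://your-instance.qualytics.io/api")
TOKEN = os.environ["QUALYTICS_API_TOKEN"]   # hypothetical variable holding an API token

def get_anomaly(anomaly_id):
    """Fetch a single anomaly by id or UUID via GET /api/anomalies/{id}."""
    response = requests.get(
        f"{BASE_URL}/anomalies/{anomaly_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

anomaly = get_anomaly(75566)   # e.g. the numeric id shown in the UI as #75566
print(anomaly["status"], anomaly["type"])
```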
Retrieving Anomaly Source Records
Endpoint (Get)
/api/anomalies/{id}/source-record (get)
Example Result Response
Manage Anomalies
You can manage anomalies to ensure your data remains accurate and any quality issues are addressed efficiently. Anomalies, which occur due to errors or inconsistencies in data, can be categorized as All, Active, Acknowledged, or Archived, helping you track their status and take the appropriate actions. You can acknowledge anomalies that have been reviewed, archive those that no longer need attention, and delete any that are irrelevant or logged by mistake.
Bulk actions further simplify the process, allowing you to manage multiple anomalies at once, saving time and effort. This guide will walk you through the steps of acknowledging, archiving, restoring, editing, and deleting anomalies, whether individually or in bulk.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to manage your anomalies.
Step 2: Click on the “Anomalies” from the Navigation Tab.
Categories of Anomalies
Anomalies can be classified into different categories based on their status and actions taken. These categories include All, Active, Acknowledged, and Archived anomalies. Managing anomalies effectively helps in maintaining data integrity and ensuring quick response to issues.
All
By selecting All Anomalies, you can view the complete list of anomalies, regardless of their status. This option helps you get a comprehensive overview of all issues that have been detected, whether they are currently active, acknowledged, or archived.
Active
By selecting Active Anomalies, you can focus on anomalies that are currently unresolved or require immediate attention. These are the anomalies that are still in play and have not yet been acknowledged, archived, or resolved.
Acknowledged
By selecting Acknowledged Anomalies, you can see all anomalies that have been reviewed and marked as acknowledged. This status indicates that the anomalies have been noted, though they may still require further action.
Archived
By selecting Archived Anomalies, you can view anomalies that have been resolved or moved out of active consideration. Archiving anomalies allows you to keep a record of past issues without cluttering the active list.
You can also categorize the archived anomalies based on their status as Resolved, Duplicate and Invalid, to manage and review them effectively.
1. Resolved: This indicates that the anomaly was a legitimate data quality concern and has been addressed.
2. Duplicate: This indicates that the anomaly is a duplicate of an existing record and has already been addressed.
3. Invalid: This indicates that the anomaly is not a legitimate data quality concern and does not require further action.
4. All: Displays all archived anomalies, including those marked as Resolved, Duplicate, and Invalid, giving a comprehensive view of all past issues.
Acknowledge Anomalies
By acknowledging anomalies, you indicate that they have been reviewed or recognized. This can be done either individually or in bulk, depending on your workflow. Acknowledging anomalies helps you keep track of issues that have been addressed, even if further action is still required.
Warning
Once an anomaly is acknowledged, it remains acknowledged and never reverts to the active state.
Method I. Acknowledge Specific Anomaly
You can acknowledge individual anomalies either directly or through the action menu, giving you precise control over each anomaly's status.
1. Acknowledge Directly
Step 1: Locate the active anomaly you want to acknowledge, and click on the acknowledge icon (represented by an eye icon) located on the right side of the anomaly.
A modal window titled “Acknowledge Anomaly” will appear, confirming that this action acknowledges the anomaly as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 2: Click on the Acknowledge button to move the anomaly to the acknowledged state.
Step 3: After clicking on the Acknowledge button, your anomaly is successfully moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged.”
2. Acknowledge via Action Menu
Step 1: Click on the active anomaly from the list of available anomalies that you want to acknowledge.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Acknowledge” from the drop-down menu.
A modal window titled “Acknowledge Anomaly” will appear, confirming that this action acknowledges the anomaly as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 3: Click on the Acknowledge button to move the anomaly to the acknowledged state.
Step 4: After clicking on the Acknowledge button, your anomaly is successfully moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged.”
Method II. Acknowledge Anomalies in Bulk
By acknowledging anomalies in bulk, you can quickly mark multiple anomalies as reviewed at once, saving time and ensuring that all relevant issues are addressed simultaneously.
Step 1: Hover over the active anomalies and click on the checkbox to select multiple anomalies.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Acknowledge" from the dropdown menu to acknowledge the selected anomalies.
A modal window titled “Acknowledge Anomalies” will appear, confirming that this action acknowledges the anomalies as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 3: Click on the Acknowledge button to acknowledge the anomalies.
Step 4: After clicking on the Acknowledge button, your anomalies are successfully moved to the acknowledged state and a flash message will appear saying “The Anomalies have been successfully acknowledged.”
Archive Anomalies
By archiving anomalies, you move them to an inactive state, while still keeping them available for future reference or analysis. Archiving helps keep your active anomaly list clean without permanently deleting the records.
Method I. Archive Specific Anomalies
You can archive individual anomalies either directly or through the action menu.
1. Archive Directly
Step 1: Locate the anomaly (whether Active or Acknowledged) you want to archive and click on the Archive icon (represented by a box with a downward arrow) located on the right side of the anomaly.
Step 2: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 3: Once you've made your selection, click the Archive button to proceed.
Step 4: After clicking on the Archive button your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”
2. Archive via Action Menu
Step 1: Click on the anomaly from the list of available (whether Active or Acknowledged) anomalies that you want to archive.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on the “Archive” from the drop-down menu.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”
Method II. Archive Anomalies in Bulk
To handle multiple anomalies efficiently, you can archive them in bulk, allowing you to quickly move large volumes of anomalies into the archived state.
Step 1: Hover over the anomaly (whether Active or Acknowledged) and click on the checkbox to select multiple anomalies.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Archive" from the dropdown menu to archive the selected anomalies.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click on the Archive button to proceed.
Step 5: After clicking on the Archive button, your anomalies are moved to the archive and a flash message will appear saying “Anomalies have been successfully archived.”
Restore Archive Anomalies
By restoring archived anomalies, you can bring them back into the acknowledged state for further investigation or review. These anomalies will not return to the active state once they have been acknowledged.
Step 1: Click on the anomaly that you want to restore from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Restore” from the drop-down menu.
Step 3: After clicking on the “Restore” button, the selected anomaly is restored to the acknowledged state.
Edit Anomalies
By editing anomalies, you can only update their tags, allowing you to categorize and organize anomalies more effectively without altering their core details.
Note
When editing multiple anomalies in bulk, only the tags can be modified.
Step 1: Hover over the anomaly (whether Active or Acknowledged) and click on the checkbox.
You can edit multiple anomalies at once by selecting the checkbox next to each anomaly.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Edit" from the dropdown menu to edit the selected anomalies.
A modal window titled “Bulk Edit Anomalies” will appear. Here you can only modify the “tags” of the selected anomalies.
Step 3: Turn on the toggle and assign tags to the selected anomalies.
Step 4: Once you have assigned the tags, click on the “Save” button.
After clicking the Save button, the selected anomalies will be updated with the assigned tags.
Delete Anomalies
Deleting anomalies allows you to permanently remove records that are no longer relevant or were logged in error. This can be done individually or for multiple anomalies at once, ensuring that your anomaly records remain clean and up to date.
Note
You can only delete archived anomalies, not active or acknowledged anomalies. If you want to delete an active or acknowledged anomaly, you must first move it to the archive, and then you can delete it.
Warning
Deleting an anomaly is a one-time action. It cannot be restored after deletion.
Method I. Delete Specific Anomaly
You can delete individual anomalies using one of two methods:
1. Delete Directly
Step 1: Click on Archived from the navigation bar in the Anomalies section to view all archived anomalies.
Step 2: Locate the anomaly, that you want to delete and click on the Delete icon located on the right side of the anomaly.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking the Delete button, the anomaly is deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
2. Delete via Action Menu
Step 1: Click on the archived anomaly that you want to delete from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Delete” from the drop-down menu.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking the Delete button, the anomaly is deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
Method II. Delete Anomalies in Bulk
For more efficient management, you can delete multiple anomalies at once using the bulk delete option, allowing for faster cleanup of unwanted records.
Step 1: Hover over the archived anomalies and click on the checkbox to select anomalies in bulk.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of anomalies chosen along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Delete" from the dropdown menu to delete the selected anomalies.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the selected anomalies from the system.
Step 4: After clicking the Delete button, the anomalies are deleted and a success flash message will appear saying “Anomalies has been successfully deleted”.
Ended: Anomalies
Explore ↵
Explore
Explore page in Qualytics is where you can easily view and manage all your data. It provides easy access to important features through tabs like Insights, Activity, Profiles, Observability, Checks, and Anomalies. Each tab shows a different part of your data, such as its quality, activities, structure, checks, and issues. You can sort and filter the data by datastore and time frame, making it easier to track performance, spot problems, and take action. The Explore section helps you manage and understand your data all in one place.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: After clicking the Explore button, you will see the following tabs: Insights, Activity, Profiles, Observability, Checks, and Anomalies.
Insights
Insights tab provides a quick and clear overview of your data's health and performance. It shows key details like Quality Scores, active checks, profiles, scans, and anomalies in a simple and effective way. This makes it easy to monitor and track data quality, respond to issues, and take action quickly. Additionally, users can monitor specific source datastores and check for a particular report date and time frame.
For more details on Insights, please refer to the Insights documentation.
Activity
Activity tab provides a comprehensive view of all operations, helping users monitor and analyze the performance and workflows across various source datastores. Activity is categorized into Runs and Schedule operations, offering distinct insights into executed and scheduled activities.
For more details on Activity, please refer to the Activity documentation.
Profiles
Profiles tab helps you explore and manage your containers and fields. With features like filtering, sorting, tagging, and detailed profiling, it provides a clear understanding of data quality and structure. This simplifies navigation and enhances data management for quick, informed decisions.
For more details on Profiles, please refer to the Profiles documentation.
Observability
Observability tab gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric. The Volumetric check automatically monitors the number of rows in a table and flags unusual changes, while the Metric check focuses on specific fields, providing more detailed insights from scan operations. Together, these tools help users spot data anomalies quickly and keep their data accurate.
For more details on Observability, please refer to the Observability documentation.
Checks
Checks tab provides a quick overview of the various checks applied across different tables and fields in multiple source datastores. In Qualytics, checks act as rules applied to data tables and fields to ensure accuracy and maintain data integrity. You can filter and sort the checks based on your preferences, making it easy to see which checks are active, in draft, or archived. This section is designed to simplify the review of applied checks across datasets without focusing on data quality or performance.
For more details on Checks, please refer to the Checks documentation.
Anomalies
Anomalies tab provides a quick overview of all detected anomalies across your source datastores. In Qualytics, an anomaly refers to a dataset (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are identified when the data does not satisfy the applied validation criteria. You can filter and sort anomalies based on your preferences, making it easy to see which anomalies are active, acknowledged, or archived. This section is designed to help you quickly identify and address any issues.
For more details on Anomalies, please refer to the Anomalies documentation.
Insights
Insights in Qualytics provides a quick and clear overview of your data's health and performance. It shows key details like Quality Scores, active checks, profiles, scans, and anomalies in a simple and effective way. This makes it easy to monitor and track data quality, respond to issues, and take action quickly. Additionally, users can monitor specific source datastores and check for a particular report date and time frame.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
You will be navigated to the Insights tab to view a presentation of your data, pulled from the connected source datastore.
Filtering Controls
Filtering Controls allow you to refine the data displayed on the Insights page. You can customize the data view based on Source Datastores, Tags, Report Date, and Timeframe, ensuring you focus on the specific information that matters to you.
No | Filter | Description |
---|---|---|
1. | Select Source Datastores | Select specific source datastores to focus on their data. |
2. | Tags | Filter data by specific tags to categorize and refine results. |
3. | Report Date | Set the report date to view data from a particular day. |
4. | Timeframe | Choose a timeframe (week, month, quarter, or year) to view data for a specific period. |
Understanding Timeframes and Timeslices
When analyzing data on the Insights page, two key concepts help you uncover trends: timeframes and timeslices. These work together to give you both a broad view and a detailed breakdown of your data.
Timeframes
Timeframe is the total range of time you select to view your data. For example, you can choose to see data:
- Weekly: Summarize data for an entire week.
- Monthly: Group data by months.
- Quarterly: Cover three months at a time.
- Yearly: Show data for the entire year.
How Metrics Behave Over a Timeframe
- Quality Score and other similar metrics display an average for the selected timeframe.
Example: If you select weekly, the Quality Score shown will be the average score for the entire week.
- Historical Graphs (like Profiles or Scans) show cumulative totals over time.
Example: If you view a graph for a monthly timeframe, the graph shows how data grows or changes month by month.
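To make the difference concrete, here is a minimal sketch in Python. The daily numbers are made up for illustration and are not pulled from the Qualytics platform; the sketch only shows how an averaged metric and a cumulative graph behave over the same weekly timeframe.

```python
from statistics import mean

# Hypothetical daily values for one weekly timeframe (illustrative only,
# not data pulled from the Qualytics platform).
daily_quality_scores = [91.2, 90.8, 92.5, 93.0, 92.1, 91.7, 92.4]
daily_records_profiled = [10_000, 12_500, 9_800, 11_200, 10_700, 9_900, 12_100]

# Quality Score behaves like an average over the selected timeframe.
weekly_quality_score = mean(daily_quality_scores)

# Historical graphs such as Profiles or Scans behave like cumulative totals.
cumulative_records = []
running_total = 0
for count in daily_records_profiled:
    running_total += count
    cumulative_records.append(running_total)

print(f"Weekly Quality Score (average): {weekly_quality_score:.1f}%")
print(f"Records profiled, cumulative by day: {cumulative_records}")
```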
Timeslices
Timeslice breaks your selected timeframe into smaller parts. It helps you see more detailed trends within the overall timeframe.
For example:
- A weekly timeframe shows each day of the week.
- A monthly timeframe breaks into weekly segments.
- A quarterly timeframe highlights months within that quarter.
- A yearly timeframe divides into quarters and months.
How Timeslices Work
- When you choose a timeframe, the graph automatically breaks it into timeslices.
- Each bar or point on the graph represents one timeslice.
Example:
- If you choose a Weekly timeframe, each bar in the graph will represent one day of the week.
- If you choose a Monthly timeframe, each bar will represent one week in that month.
Metrics Within a Timeslice
Metrics like Quality Score, Profiles, or Scans are displayed for each timeslice, allowing you to identify trends and patterns over smaller intervals.
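As a rough illustration, the sketch below (plain Python with hypothetical values, not the platform's implementation) breaks a weekly timeframe into daily timeslices, one entry per bar on the graph.

```python
from datetime import date, timedelta

def weekly_timeslices(start: date, daily_values: list[float]) -> dict[str, float]:
    """Break a weekly timeframe into daily timeslices.

    Each entry corresponds to one bar on the graph: the day (the timeslice)
    mapped to the metric value observed for that day.
    """
    return {
        (start + timedelta(days=i)).isoformat(): value
        for i, value in enumerate(daily_values)
    }

# Illustrative Quality Scores for a weekly timeframe (made-up values).
slices = weekly_timeslices(date(2024, 10, 7), [91.2, 90.8, 92.5, 93.0, 92.1, 91.7, 92.4])
for day, score in slices.items():
    print(day, score)  # one line (bar) per timeslice
```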
Quality Score
Quality Score gives a clear view of your data's overall quality. It shows important measures like Completeness, Conformity, Consistency, Precision, Timeliness, Volumetrics, and Accuracy, each represented by a percentage. This helps you quickly understand the health of your data, making it easier to identify areas that need improvement.
Overview
Overview provides a quick view of your data. It shows the total amount of data being managed, along with the number of Source Datastores and Containers. This helps you easily track the size and growth of your data.
Records and Fields Data
This section shows important information about the records and fields in the connected source datastores:
- Records Profiled: This represents the total number of records that were included in the profiling process.
- Records Scanned: This refers to the number of records that were checked during a scan operation. The scan performs data quality checks on collections like tables, views, and files.
- Fields Profiled: This shows how many field profiles were updated as a result of the profiling operation.
Checks
Checks offer a quick view of active checks, categorizing them based on their results.
1. Passed Checks: Displays the real-time number of passed checks that were successfully completed during the scan or profile operation, indicating that the data met the set quality criteria.
2. Failed Checks: This shows the real-time number of checks that did not pass during the scan or profile operation, indicating data that did not meet the quality criteria.
3. Not Asserted Checks: This shows the real-time number of checks that haven't been processed or validated yet, meaning their status is still pending and they have not been confirmed as either passed or failed.
The count for each category can be viewed by hovering over the relevant check, providing real-time ratios of checks. Users can also click on these checks to navigate directly to the corresponding checks’ dedicated page in the Explore section.
Anomalies
Anomalies section provides a clear overview of identified anomalies in the system. The anomalies are categorized for better clarity and management.
Anomalies Identified shows the total issues found, divided into active, acknowledged, and resolved, helping users quickly manage and fix problems.
1. Active Anomalies: Shows the number of unresolved anomalies that require immediate attention. These anomalies are still present and have not been acknowledged, archived, or resolved in the system.
2. Acknowledged Anomalies: These are anomalies that have been reviewed and recognized by users but are not yet resolved. Acknowledging anomalies helps keep track of issues that have been addressed, even if further actions are still needed.
3. Resolved Anomalies: Represent anomalies that were valid data quality issues and have been successfully addressed. These anomalies have been resolved, indicating the data now meets the required quality standards.
The count for each category can be viewed by hovering over the relevant anomalies, providing real-time ratios of anomalies. Users can also click on these anomalies to navigate directly to the corresponding anomalies’ dedicated page in the Explore section.
Rule Type Distribution
Rule Type Distribution highlights the top rule types applied to the source datastore, each represented by a different color. The visualization allows users to quickly see which rules are most commonly applied.
By clicking the caret down 🔽 button, users can choose either the top 5 or top 10 rule types to view in the insights, based on their analysis needs.
Profiles
Profiles section provides a clear view of data profiling activities over time, showing how often profiling is performed and the amount of data (records) analyzed.
Profile Runs shows how many times data profiling has been done over a certain period. Each run processes a specific source datastore or table, helping users see how often profiling happens. The graph gives a clear view of the changes in profile runs over time, making it easier to track profiling activity.
Click on the caret down 🔽 button to choose between viewing Records Profiled or Fields Profiled, depending on your preference.
Records Profiled
Records Profiled shows the total number of records processed during the profile runs. It provides insight into the amount of data that has been analyzed during those runs. The bars in the graph show the comparison of the number of records profiled over the selected days.
Fields Profiled
Fields Profiled shows the number of fields processed during the profile runs. It shows how many individual fields within datasets have been analyzed during those runs. The bars in the graph provide a comparison of the fields profiled over the selected days.
Scans
Scans section provides a clear overview of all scanning activities within a selected period. It helps users keep track of how many scans were performed and how many anomalies were detected during those scans. This section makes it easier to understand the scanning process and manage data by offering insight into how often scans occur.
Scan Runs show how often data scans are performed over a certain period. These scans check the quality of data across tables, views, and files, helping users monitor their data regularly and identify any issues. The process can be customized to scan tables or limit the number of records checked, ensuring that data stays accurate and up to standard.
Click on the caret down 🔽 button to choose between viewing Anomalies Identified or Records Scanned, depending on your preference.
Anomalies Identified
Anomalies Identified shows the total number of anomalies detected during the scan runs. The bars in the graph allow users to compare the number of anomalies found across different days, helping them spot trends or irregularities in the data.
Records Scanned
Records Scanned shows the total number of records that were scanned during the scan runs. It gives users insight into how much data has been processed and allows them to compare the scanned records over the selected period.
Data Volume
Data Volume allows users to track the size of data stored within all source datastores present in the Qualytics platform over time. This helps in monitoring how the source datastore grows or changes, making it easier to detect irregularities or unexpected increases that could affect system performance. Users can visualize data size trends and manage the source datastore's efficiency, optimizing storage, adjusting resources, and enhancing data processing based on its size and growth.
Export
Export button allows you to quickly download the data from the Insights page. You can export data according to the selected Source Datastores, Tags, Report Date, and Timeframe. This makes it easy to save the data for offline use or share it with others.
After exporting, the data appears in a structured format, making it easy to save for offline use or to share with others.
Activity
Activity in Qualytics provides a comprehensive view of all operations, helping users monitor and analyze the performance and workflows across various source datastores. Activity is categorized into Runs and Schedule operations, offering distinct insights into executed and scheduled activities.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Activity" from the Navigation Tab.
You will be navigated to the Activity tab, where you'll see a list of operations (catalog, profile, scan, and external scan) across different source datastores.
Activity Categories
Activity is divided into two categories: Runs and Schedule. Runs provides insights into operations that have already been performed, while Schedule provides insights into operations that are scheduled to run.
Runs
Runs provide a complete record of all executed operations across various source datastores. This section enables users to monitor and review activities such as catalog, profile, scan, and external scan. Each run displays key details like the operation type, status, execution time, duration, and triggering method, offering a clear overview of system performance and data processing workflows.
No. | Field | Description |
---|---|---|
1. | Select Source Datastore | Select specific source datastores to focus on their operations. |
2. | Search | This feature helps users quickly find specific identifiers. |
3. | Sort By | Sort By option helps users organize the list of performed operations by criteria like Duration and Created Date for quick access. |
4. | Filter | The filter lets users easily refine the list of performed operations by choosing a specific Type (Scan, Catalog, Profile, or External Scan) or Status (Success, Failure, Running, or Aborted) to view. |
5. | Activity Heatmap | Activity Heatmap shows daily activity levels, with color intensity indicating operation counts. Hovering over a square reveals details for that day. |
6. | Operation List | Shows a list of performed operations (catalog, profile, scan, and external scan) across various source datastores. |
Activity Heatmap
Activity heatmap represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Note
You can click on any of the squares from the Activity Heatmap to filter operations.
By hovering over each square, you can view additional details for that specific day, such as the exact date, and the total number of operations executed.
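Conceptually, the heatmap is just a count of operations per day. The sketch below illustrates that idea with a handful of made-up dates; it is not how Qualytics computes the heatmap internally.

```python
from collections import Counter
from datetime import date

# Illustrative operation log: each entry is the day an operation ran
# (made-up dates; in practice these would come from your activity records).
operation_dates = [
    date(2024, 10, 1), date(2024, 10, 1), date(2024, 10, 2),
    date(2024, 10, 2), date(2024, 10, 2), date(2024, 10, 4),
]

# Count operations per day: each heatmap square corresponds to one day,
# and its color intensity reflects this count.
daily_counts = Counter(d.isoformat() for d in operation_dates)
print(daily_counts)  # e.g. Counter({'2024-10-02': 3, '2024-10-01': 2, '2024-10-04': 1})
```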
Operation Details
Step 1: Click on any successfully performed operation from the list to view its details.
For demonstration purposes, we have selected the profile operation.
Step 2: After clicking, a dropdown will appear, displaying the details of the selected operation.
Step 3: Users can hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Profiled field to display the full value.
Users can also view both profiled and non-profiled file patterns:
Step 4: Click on the Result button.
The Profile Results modal displays a list of both profiled and non-profiled containers. You can filter the view to show only non-profiled containers by turning on the toggle, which will display the complete list of unprofiled containers.
Schedule
Schedule provides a complete record of all scheduled operations across various source datastores. This section enables users to monitor and review scheduled operations such as catalog, profile, and scan operations. Each scheduled operation includes key details like operation type, scheduled time, and triggering method, giving users a clear overview of system performance and data workflows.
No. | Field | Description |
---|---|---|
1 | Selected Source Datastores | Select specific source datastores to focus on their operations. |
2 | Search | This feature helps users quickly find specific identifiers. |
3 | Sort By | Sort By option helps users organize the list of scheduled operations by criteria like Created Date and Operations for quick access. |
4 | Filter | The filter lets users easily refine the list of scheduled operations by choosing a specific type (Scan, Catalog, or Profile) to view. |
5. | Operation List | Shows a list of scheduled operations (catalog, profile, and scan) across various source datastores. |
Profiles
Profiles in Qualytics helps you explore and manage your containers and fields. With features like filtering, sorting, tagging, and detailed profiling, it provides a clear understanding of data quality and structure. This simplifies navigation and enhances data management for quick, informed decisions.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Profiles" from the Navigation Tab.
You will be navigated to the Profiles section. Here, you will see the data organized into two sections: Containers and Fields, allowing you to explore and analyze the datasets efficiently.
Containers
By selecting the Containers section, you can explore structured datasets that are organized as either JDBC or DFS containers. JDBC containers represent tables or views within relational databases, while DFS containers include files such as CSV, JSON, or Parquet, typically stored in distributed systems like Hadoop or cloud storage.
Container Details
Containers section provides key details about each container, including the last profiled and last scanned dates. Hovering over the info icon for a specific container reveals these details instantly.
Step 1: Locate the container you want to review, then hover over the info icon to view the container Details.
Step 2: A pop-up will appear with additional details about the container, such as the last profiled and last scanned dates.
Explore Tables and Fields
By clicking on a specific container, users can view its associated fields, including detailed profiling information. Additionally, clicking the arrow icon on the right side of a specific container allows users to navigate directly to its corresponding table for a more in-depth exploration.
Explore Fields
To explore the data within a container, you can view all its fields. This allows you to gain insights into the structure and quality of the data stored in the container.
Step 1: Click on the specific container whose fields you want to preview.
For demonstration purposes, we have selected the Netsuite Financials container.
Step 2: You will be directed to the fields of the selected container, where all the fields of the container will be displayed.
Explore Tables
To explore the data in more detail, you can view the corresponding table of a selected container. This provides a comprehensive look at the data stored within, allowing for deeper analysis and exploration.
Step 1: Click on the arrow icon on the right side of the container you want to preview.
Step 2: You will be directed to the corresponding table, providing a comprehensive view of the data stored in the container.
Filter and Sort
Filter and Sort options allow you to organize your containers by various criteria, such as Name, Last Profiled, Last Scanned, Quality Score, Records, and Type. You can also apply filters to refine your list of containers based on Type.
Sort
You can sort your containers by various criteria, such as Name, Last Profiled, Last Scanned, Quality Score, Records and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1. | Anomalies | Sorts containers based on the number of detected anomalies. |
2. | Checks | Sorts containers by the number of active validation checks applied. |
3. | Completeness | Sorts containers based on their data completeness percentage. |
4. | Created Date | Sorts containers by the date they were created, showing the newest or oldest fields first. |
5. | Fields | Sorts containers by the number of fields profiled. |
6. | Last Profiled | Sorts containers by their most recent profiling date. |
7. | Last Scanned | Sorts containers by their most recent scan date. |
8. | Name | Sorts containers alphabetically by their names. |
9. | Quality Score | Sorts containers based on their quality score, indicating the reliability of the data in the container. |
10. | Records | Sorts containers by the number of records profiled. |
11. | Type | Sorts containers based on their type (e.g., table, view, file). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your containers based on Type (Table, View, File, Computed Table, and Computed File) to easily organize and prioritize them according to your needs.
Mark as Favorite
Marking a container as a favorite allows you to quickly access and prioritize the containers that are most important to your work, ensuring faster navigation and improved efficiency.
Step 1: Locate the container which you want to mark as a favorite and click on the bookmark icon located on the left side of the container.
After clicking on the bookmark icon, your container is marked as a favorite and a success flash message will appear stating "The Table has been favorited".
To unmark, simply click on the bookmark icon of the marked container. This will remove it from your favorites.
Fields
By selecting the Fields section in the Qualytics platform, you can view all the fields across your data sources, including their quality scores, completeness, and metadata, for streamlined data management.
Fields Details
Field Details view in the Qualytics platform provides in-depth insights into a selected field. It displays key information, including the field’s declared type, number of distinct values, minimum and maximum length of observed values, entropy, and unique/distinct ratio. This detailed profiling allows you to understand the field's data structure, quality, and variability, enabling better data governance and decision-making.
Step 1: Click on the specific field whose details you want to preview.
A modal window will appear, providing detailed information about the selected field, such as its declared type, distinct values, length range, and more.
Manage Tags in Field Details
Tags can now be directly managed in the field profile within the Explore section. Simply access the Field Details panel to create, add, or remove tags, enabling more efficient and organized data management.
Step 1: Click on the specific field for which you want to manage tags.
A Field Details modal window will appear. Click on the + button to assign tags to the selected field.
Step 2: You can also create a new tag by clicking on the ➕ button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Filter and Sort
Filter and Sort options allow you to organize your fields by various criteria, such as Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type. You can also apply filters to refine your list of fields based on Profile and Type.
Sort
You can sort your fields by various criteria, such as Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1. | Anomalies | Sorts fields based on the number of detected anomalies. |
2. | Checks | Sorts fields by the number of active validation checks applied. |
3. | Completeness | Sorts fields based on their data completeness percentage. |
4. | Created Date | Sorts fields by the date they were created, showing the newest or oldest fields first. |
5. | Name | Sorts fields alphabetically by their names. |
6. | Quality Score | Sorts fields based on their quality score, indicating the reliability of the data in the field. |
7. | Type | Sorts fields based on their data type (e.g., string, boolean, etc.). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your fields based on Profiles and Type to easily organize and prioritize them according to your needs.
No. | Filter | Description |
---|---|---|
1. | Profile | Filters fields based on the Profiles (e.g., accounts, accounts.csv, etc.). |
2. | Type | Filters fields based on the data type (e.g., string, boolean, date, etc.). |
Observability
Observability in the Explore section offers a clear view of platform data, enabling users to monitor and analyze data behavior effectively. It tracks changes in data volumes and metrics, highlighting trends and anomalies. By organizing checks into Volumetric and Metric categories, it simplifies monitoring overall data volumes and specific field-based values. Visual tools like heatmaps and customizable checks make it easy to identify issues, set thresholds, and adjust monitoring parameters for efficient data management.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Observability" from the Navigation Tab.
Observability Categorized
In Observability, data checks are divided into two categories: Volumetric and Metric. Volumetric checks track overall data volume, while Metric checks measure data based on predefined fields and thresholds, tracking changes in data values over time. These two categories work together to offer comprehensive insights into data trends and anomalies.
1. Volumetric: Volumetric automatically tracks changes in the amount of data within a table over time. It monitors row counts and compares them to expected ranges based on historical data. If the data volume increases or decreases unexpectedly, the check flags it for further review. This feature helps users easily identify unusual data patterns without manual monitoring.
2. Metric: Metric measures data based on predefined fields and thresholds, tracking changes in data values over time. It detects if the value of a specific field, like the average or absolute value, goes beyond the expected range. Using scheduled scans, it automatically records and analyzes these values, helping users quickly spot any anomalies. This check gives deeper insights into data behavior, ensuring data integrity and identifying irregular patterns easily.
Volumetric
Volumetric checks help monitor data volumes over time to keep data accurate and reliable. They automatically count rows in a table and spot any unusual changes, like problems with data loading. This makes it easier to catch issues early and keep everything running smoothly. Volumetric checks also let you track data over different time periods, such as daily or weekly. The system sets limits based on past data, and if the row count goes above or below those limits, an anomaly alert is triggered (see the sketch after the table below).
No | Field | Description |
---|---|---|
1. | Select Source Datastore | Select specific source datastores to focus on their data. |
2. | Select Tag | Filter data by specific tags to categorize and refine results. |
3. | Search | This feature helps users quickly find specific identifiers or names in the data. |
4. | Report Date | Report Date lets users pick a specific date to view data trends for that day. |
5. | Time Frame | The time frame option lets users choose a period (week, month, quarter, or year) to view data trends. |
6. | Sort By | Sort By option helps users organize data by criteria like Anomalies, Created Date, Checks, Name, Type or Last Scanned for quick access. |
7 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
8. | Datastore | Displays the Datastore name for which the volumetric check is being performed. |
9. | Table | Displays the table for which the volumetric check is being performed (e.g., customer, nation). Each table has its own Volumetric Check. |
10. | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
11. | Anomaly Detection | The Volumetric Check detects anomalies when row counts exceed set min or max thresholds, triggering an alert for sudden changes. |
12. | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
13. | Volumetric Check (# ID) | Each check is assigned a unique identifier, followed by the time period it applies to (e.g., 1 Day for the customer table). This ID helps in tracking the specific check in the system. |
14. | Group By | Users can also Group By specific intervals, such as day, week, or month, to observe trends over different periods. |
15. | Measurement Period | Defines the time period over which the volumetric check is evaluated. It can be customized to 1 day, week, or other timeframes. |
16. | Min Values | These indicate the minimum thresholds for the row count of the table being checked (e.g., 150,139 Rows) |
17. | Max Values | These indicate the maximum thresholds for the row count of the table being checked. |
18. | Last Asserted | This shows the date the last check was asserted, which is the last time the system evaluated the Volumetric Check (e.g., Oct 02, 2024). |
19. | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
20. | Graph Visualization | The graph provides a visual representation of the row count trends. It shows fluctuations in data volume over the selected period. This visual allows users to quickly identify any irregularities or anomalies. |
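The anomaly detection behavior summarized above (row counts compared against min/max thresholds derived from past data) can be sketched as follows. This is a conceptual illustration only: the simple mean-plus-spread band used here is an assumption made for the example, not the platform's actual inference logic.

```python
from statistics import mean, pstdev

def volumetric_thresholds(historical_row_counts: list[int], k: float = 3.0) -> tuple[int, int]:
    """Derive illustrative min/max row-count limits from past measurements.

    A simple "mean +/- k standard deviations" band is used here purely for
    demonstration; it is not the platform's actual inference logic.
    """
    avg = mean(historical_row_counts)
    spread = pstdev(historical_row_counts)
    return int(avg - k * spread), int(avg + k * spread)

def is_volumetric_anomaly(row_count: int, min_rows: int, max_rows: int) -> bool:
    """Flag a measurement that falls outside the expected range."""
    return row_count < min_rows or row_count > max_rows

history = [150_139, 151_020, 149_876, 150_455, 150_902]  # made-up daily row counts
min_rows, max_rows = volumetric_thresholds(history)
print(is_volumetric_anomaly(175_000, min_rows, max_rows))  # True: sudden spike
print(is_volumetric_anomaly(150_300, min_rows, max_rows))  # False: within range
```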
Observability Heatmap
The heatmap provides a visual overview of data anomalies by day, using color codes for quick understanding:
- Blue square: Blue squares represent days with no anomalies, meaning data stayed within the expected range.
- Orange square: Orange squares indicate days where data exceeded the minimum or maximum threshold range but didn’t qualify as a critical anomaly.
- Red square: Red squares highlight days with anomalies, signaling significant deviations from expected values that need further investigation.
By hovering over each square, you can view additional details for that specific day, including the date, last row count, and anomaly count, allowing you to easily pinpoint and analyze data issues over time.
Edit Check
Editing a Volumetric Check lets users customize settings like measurement period, row count limits, and description. This helps improve data monitoring and anomaly detection, ensuring the check fits specific needs. Users can also add tags for better organization.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Fields | Description |
---|---|---|
1. | Measurement Periods Days | Edit the Measurement Period Days to change how often the check runs (e.g., 1 day, 2 days, etc). |
2. | Min Value and Max Value | Edit the Min Value and Max Value to set new row count limits. If the row count exceeds these limits, an alert will be triggered. |
3. | Description | Edit the Description to better explain what the check does. |
4. | Tags | Edit the Tags to organize and easily find the check later. |
5. | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 4: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit thresholds to set specific row count limits for your data checks. By defining minimum and maximum values, you ensure alerts are triggered when data goes beyond the expected range. This helps you monitor unusual changes in data volume. It gives you better control over tracking your data's behavior.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Metric
Metric checks track changes in data over time to ensure accuracy and reliability. They check specific fields against set limits to identify when values, like averages, go beyond expected ranges. With scheduled scans, Metric checks automatically log and analyze these data points, making it easy for users to spot any issues. This functionality enhances users' understanding of data patterns, ensuring high quality and dependability. With Metric checks, managing and monitoring data becomes straightforward and efficient.
No | Field | Description |
---|---|---|
1 | Select Source Datastore | Select specific source datastores to focus on their data. |
2 | Tag | Filter data by specific tags to categorize and refine results. |
3 | Search | The search bar helps users find specific metrics or data by entering an identifier or description. |
4 | Sort By | Sort By allows users to organize data by Weight, Anomalies, or Created Date for easier analysis and prioritization. |
5 | Metric(ID) | Represents the tracked data metric with a unique ID. |
6 | Datastore | Shows the Datastore name. |
7 | Table | Shows the table name. |
8 | Description | A brief label or note about the metric, in this case, it's labeled as test |
9 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
10 | Anomalies | Anomalies show unexpected changes or issues in the data that need attention. |
11 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
12 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
13 | Field | This refers to the specific field being measured, here the max_value, which tracks the highest value observed for the metric. |
14 | Min | This indicates the minimum value for the metric, which is set to 1. If not defined, no lower limit is applied. |
15 | Max | This field shows the maximum threshold for the metric, set at 8. Exceeding this may indicate an issue or anomaly. |
16 | Created Date | This field shows when the metric was first set up, in this case, June 18, 2024. |
17 | Last Asserted | Last Asserted field shows the last time the metric was checked, in this case July 25, 2024. |
18 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
19 | Group By | This option lets users group data by periods like Day, Week, or Month. In this example, it's set to Day. |
Comparisons
When you add a metric check, you can choose from three comparison options:
- Absolute Change
- Absolute Value
- Percentage Change.
These options help define how the system will evaluate your data during scan operations on the datastore.
Once a scan is run, the system analyzes the data based on the selected comparison type. For example, Absolute Change will look for significant differences between scans, Absolute Value checks if the data falls within a predefined range, and Percentage Change identifies shifts in data as a percentage.
Based on the chosen comparison type, the system flags any deviations from the defined thresholds. These deviations are then visually represented on a chart, displaying how the metric has fluctuated over time between scans. If the data crosses the upper or lower limits during any scan, the system will highlight this in the chart for further analysis.
1. Absolute Change: The Absolute Change comparison checks how much a numeric field's value has changed between scans. If the change exceeds a set limit (Min/Max), it flags this as an anomaly.
2. Absolute Value: The Absolute Value comparison checks whether a numeric field's value falls within a defined range (between Min and Max) during each scan. If the value goes beyond this range, it identifies it as an anomaly.
3. Percentage Change: The Percentage Change comparison monitors how much a numeric field's value has shifted in percentage terms. If the change surpasses the set percentage threshold between scans, it triggers an anomaly.
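In essence, each comparison type is a simple numeric test. The sketch below (illustrative Python with made-up values and thresholds, not the platform's implementation) shows the logic behind the three options.

```python
def absolute_change_anomaly(previous: float, current: float, max_change: float) -> bool:
    """Absolute Change: flag when the difference between scans exceeds the limit."""
    return abs(current - previous) > max_change

def absolute_value_anomaly(current: float, min_value: float, max_value: float) -> bool:
    """Absolute Value: flag when the value falls outside the [min, max] range."""
    return current < min_value or current > max_value

def percentage_change_anomaly(previous: float, current: float, max_pct: float) -> bool:
    """Percentage Change: flag when the shift between scans exceeds the threshold (%)."""
    if previous == 0:
        return current != 0  # simplistic guard for this illustration
    return abs(current - previous) / abs(previous) * 100 > max_pct

# Illustrative values for a numeric field tracked across two scans.
print(absolute_change_anomaly(previous=100, current=112, max_change=10))    # True
print(absolute_value_anomaly(current=7, min_value=1, max_value=8))          # False
print(percentage_change_anomaly(previous=100, current=112, max_pct=10.0))   # True
```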
Edit Check
Edit Check allows users to modify the parameters of an existing metric check. It enables adjusting values, thresholds, or comparison logic to ensure that the metric accurately reflects current monitoring needs.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Field | Description |
---|---|---|
1. | Min Value and Max Value | Edit the Min Value and Max Value to set new limits for the metric. If the metric value exceeds these limits, an alert will be triggered. |
2. | Description | Edit the Description to better explain what the check does. |
3. | Tags | Edit the Tags to organize and easily find the check later. |
4. | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
Step 4: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit Threshold allows you to change the upper and lower limits of a metric. This ensures the metric tracks data within the desired range and only triggers alerts when those limits are exceeded.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new limits for the metric.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Checks
Checks tab provides a quick overview of the various checks applied across different tables and fields in multiple source datastores. In Qualytics, checks act as rules applied to data tables and fields to ensure accuracy and maintain data integrity. You can filter and sort the checks based on your preferences, making it easy to see which checks are active, in draft, or archived. This section is designed to simplify the review of applied checks across datasets without focusing on data quality or performance.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Checks" from the Navigation Tab.
You will be navigated to the Checks tab, where you'll see a list of all the checks that have been applied to various tables and fields across different source datastores.
Check Categories
You can categorize your checks based on their status, such as Active, Draft, Archived (Invalid and Discarded), or All, according to your preference. This categorization offers a clear view of the data quality validation process, helping you manage checks efficiently and maintain data integrity.
All
By selecting All Checks, you can view a comprehensive list of all the checks in the datastores, including both active and draft checks, allowing you to focus on the checks that are currently being managed or are in progress. However, archived checks are not displayed in this view.
Active
By selecting Active, you can view checks that are currently applied and being enforced on the data. These operational checks are used to validate data quality in real-time, allowing you to monitor all active checks and their performance.
You can also categorize the active checks based on their importance, favorites, or specific metrics to streamline your data quality monitoring.
For more details on Active Checks, please refer to the Active Checks section in the documentation.
Draft Checks
By selecting Draft, you can view checks that have been created but have not yet been applied to the data. These checks are in the drafting stage, allowing for adjustments and reviews before activation. Draft checks provide flexibility to experiment with different validation rules without affecting the actual data.
You can also categorize the draft checks based on their importance, favorites, or specific metrics to prioritize and organize them effectively during the review and adjustment process.
For more details on Draft Checks, please refer to the Draft Checks section in the documentation.
Archived Checks
By selecting Archived, you can view checks that have been marked as discarded or invalid from use but are still stored for future reference or restoration. Although these checks are no longer active, they can be restored if needed.
You can also categorize the archived checks based on their status as Discarded, Invalid, or view All archived checks to manage and review them effectively.
For more details on Archived Checks, please refer to the Archived Checks section in the documentation.
Check Details
Check Details provides important information about each check in the system. It shows when a check was last run, how often it has been used, when it was last updated, who made changes to it, and when it was created. This section helps users understand the status and history of the check, making it easier to manage and track its use over time.
Step 1: Locate the check you want to review, then hover over the info icon to view the Check Details.
For more steps and further information, please refer to the Check Details section in the documentation.
Status Management of Checks
Set Check as Draft
You can move an active check into a draft state, allowing you to work on the check, make adjustments, and refine the validation rules without affecting live data. This is useful when you need to temporarily deactivate a check for review and updates.
Step 1: Click on the active check that you want to move to the draft state.
To understand how to draft checks, you can follow the remaining steps from the documentation Draft Specific Check.
Activate Draft Check
You can activate a draft check after you have worked on it, made adjustments, and refined its validation rules. Activating the draft check makes it live and ensures that the defined criteria are enforced on the data.
Step 1: Navigate to the Draft check section, and click on the drafted check that you want to activate, whether you have made changes or wish to activate it as is.
To understand how to activate draft checks, you can follow the remaining steps from the documentation Activate Draft Checks.
Set Check as Archived
You can move an active or draft check into the archive when it is no longer relevant but may still be needed for historical purposes or future use. Archiving helps keep your checks organized without permanently deleting them.
Step 1: Click on the check from the list of available (whether Active or Draft) checks that you want to archive.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to set check as archived, you can follow the remaining steps from the documentation Set Check As Archived.
Restore Archived Checks
If a check has been archived, you can restore it to an active or draft state. This allows you to reuse checks that were previously archived without having to recreate them from scratch.
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
To understand how to restore archived checks, you can follow the remaining steps from the documentation Restore Archived Checks.
Edit Check
You can edit an existing check to modify its properties, such as the rule type, coverage, filter clause, or description. Updating a check ensures that it stays aligned with evolving data requirements and maintains data quality as conditions change.
Step 1: Click on the check you want to edit, whether it is an active or draft check.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to edit checks, you can follow the remaining steps from the Edit Checks section in the documentation.
Mark Check as Favorite
Marking a check as a favorite allows you to quickly access and prioritize the checks that are most important to your data validation process. This helps streamline workflows by keeping frequently used or critical checks easily accessible, ensuring you can monitor and manage them efficiently. By marking a check as a favorite, it will appear in the "Favorite" category for faster retrieval and management.
Step 1: Locate the check which you want to mark as a favorite and click on the bookmark icon located on the right side of the check.
To understand how to mark check as favorite, you can follow the remaining steps from the Mark Check as Favorite section in the documentation.
Clone Check
You can clone both active and draft checks to create a duplicate copy of an existing check. This is useful when you want to create a new check based on the structure of an existing one, allowing you to make adjustments without affecting the original check.
Step 1: Click on the check (whether Active or Draft) that you want to clone.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to clone check, you can follow the remaining steps from the Clone Check section in the documentation.
Create a Quality Check template
You can add checks as a Template, which allows you to create a reusable framework for quality checks. By using templates, you standardize the validation process, enabling the creation of multiple checks with similar rules and criteria across different datastores. This ensures consistency and efficiency in managing data quality checks.
Step 1: Locate the check (whether Active or Draft) that you want to add as a template and click on it.
To understand how to create a quality check template, you can follow the remaining steps from the Quality Check Template section in the documentation.
Filter and Sort
Filter and Sort options allow you to organize your checks by various criteria, such as Weight, Anomalies, Coverage, Created Date, and Rules. You can also apply filters to refine your list of checks based on Selected Source Datastores, Check Type, Asserted State (Passed, Failed, Not Asserted), Tags, Files, and Fields.
Sort
You can sort your checks by Anomalies, Coverage, Created Date, Rules, and Weight to easily organize and prioritize them according to your needs.
No | Sort By Option | Description |
---|---|---|
1 | Anomalies | Sort checks based on the number of active anomalies. |
2 | Coverage | Sort checks by data coverage percentage. |
3 | Created Date | Sort checks according to the date they were created. |
4 | Rules | Sort checks based on specific rules applied to the checks. |
5 | Weight | Sort checks by their assigned weight or importance level. |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your checks based on values like Source Datastores, Check Type, Asserted State, Rule, Tags, File, Field, and Template.
No | Filter | Filter Value | Description |
---|---|---|---|
1 | Selected Source Datastores | N/A | Select specific source datastores to focus on their checks. |
2 | Select Tags | N/A | Filter checks by specific tags to categorize and refine results. |
3 | Check Type | All | Displays all types of checks, both inferred and authored. |
Inferred | Shows system-generated checks that automatically validate data based on detected patterns or logic. | ||
Authored | Displays user-created checks, allowing the user to focus on custom validations tailored to specific requirements. | ||
4 | Asserted State | All | Displays all checks, regardless of their asserted status. This provides a full overview of both passed, failed, and not asserted checks. |
Passed | Shows checks that have been asserted successfully, meaning no active anomalies were found during the validation process. | ||
Failed | Displays checks that have failed assertion, indicating active anomalies or issues that need attention. | ||
Not Asserted | Filters out checks that have not yet been asserted, either because they haven’t been processed or validated yet. | ||
5 | Rule | N/A | Select this to filter the checks based on specific rule type for data validation, such as checking non-null values, matching patterns, comparing numerical ranges, or verifying date-time constraints. |
6 | Template | N/A | This filter allows users to view and apply predefined check templates. |
Anomalies
Anomalies tab provides a quick overview of all detected anomalies across your source datastores. In Qualytics, an anomaly refers to a dataset (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are identified when the data does not satisfy the applied validation criteria. You can filter and sort anomalies based on your preferences, making it easy to see which anomalies are active, acknowledged, or archived. This section is designed to help you quickly identify and address any issues.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Anomalies" from the Navigation Tab.
You will be navigated to the Anomalies tab, where you'll see a list of all the detected anomalies across various tables and fields from different source datastores, based on the applied data quality checks.
Anomaly Categories
Anomalies can be classified into different categories based on their status and actions taken. These categories include All, Active, Acknowledged, and Archived anomalies. Managing anomalies effectively helps in maintaining data integrity and ensuring quick response to issues.
All
By selecting All Anomalies, you can view the complete list of anomalies, regardless of their status. This option helps you get a comprehensive overview of all issues that have been detected, whether they are currently active, acknowledged, or archived.
Active
By selecting Active Anomalies, you can focus on anomalies that are currently unresolved or require immediate attention. These are the anomalies that are still in play and have not yet been acknowledged, archived, or resolved.
Acknowledged
By selecting Acknowledged Anomalies, you can see all anomalies that have been reviewed and marked as acknowledged. This status indicates that the anomalies have been noted, though they may still require further action.
Archived
By selecting Archived Anomalies, you can view anomalies that have been resolved or moved out of active consideration. Archiving anomalies allows you to keep a record of past issues without cluttering the active list.
You can also categorize the archived anomalies based on their status as Resolved, Duplicate and Invalid, to review them effectively.
1. Resolved: This indicates that the anomaly was a legitimate data quality concern and has been addressed.
2. Duplicate: This indicates that the anomaly is a duplicate of an existing record and has already been addressed.
3. Invalid: This indicates that the anomaly is not a legitimate data quality concern and does not require further action.
4. All: Displays all archived anomalies, including those marked as Resolved, Duplicate, and Invalid, giving a comprehensive view of all past issues.
Anomaly Details
The Anomaly Details window provides information about anomalies identified during scan operations. It displays details such as the anomaly ID, status, type, detection time, and where it is in the data, such as the datastore and table. Additionally, it offers options to explore datasets, share details, and collaborate, making it easier to resolve data issues.
Step 1: Click on the anomaly whose details you want to view from the list of available anomalies (whether Active, Acknowledged, or Archived).
A modal window titled “Anomaly Details” will appear, displaying all the details of the selected anomaly.
For more details on Anomaly Details, please refer to the Anomaly Details section in the documentation.
Acknowledge Anomalies
By acknowledging anomalies, you indicate that they have been reviewed or recognized. Acknowledging anomalies helps you keep track of issues that have been addressed, even if further action is still required.
Warning
Once an anomaly is acknowledged, it remains acknowledged and never reverts to the active state.
Step 1: Click on the active anomaly from the list of available anomalies that you want to acknowledge.
Step 2: A modal window will appear displaying the anomaly details. Click on the acknowledge (👁) icon located in the upper-right corner of the modal window.
Step 3: After clicking on the Acknowledge icon, your anomaly is moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged”.
Archive Anomalies
By archiving anomalies, you move them to an inactive state, while still keeping them available for future reference or analysis. Archiving helps keep your active anomaly list clean without permanently deleting the records.
Step 1: Click on the anomaly that you want to archive from the list of available anomalies (whether Active or Acknowledged).
Step 2: A modal window will appear displaying the anomaly details. Click on the archive (🗑) icon located in the upper-right corner of the modal window.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button, your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”.
Restore Archived Anomalies
By restoring archived anomalies, you can bring them back into the acknowledged state for further investigation or review. These anomalies will not return to the active state once they have been acknowledged.
Step 1: Click on the anomaly that you want to restore from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Restore” from the drop-down menu.
Step 3: After clicking on the “Restore” button, the selected anomaly is restored to the acknowledged state.
Assign Tags
Assigning tags to an anomaly serves the purpose of labeling and grouping anomalies and driving downstream workflows.
Step 1: Click on “Assign tags to this Anomaly” or the + button.
Step 2: A dropdown menu will appear with existing tags. Scroll through the list and click on the tag you wish to assign.
Delete Anomalies
Deleting an anomaly allows you to permanently remove a record that is no longer relevant or was logged in error. This action is done individually, ensuring that your anomaly records remain clean and up to date.
Note
You can only delete archived anomalies, not active or acknowledged anomalies. If you want to delete an active or acknowledged anomaly, you must first move it to the archive, and then you can delete it.
You can delete individual anomalies using one of two methods:
1. Delete Directly
Step 1: Click on Archived from the navigation bar in the Anomalies section to view all archived anomalies.
Step 2: Locate the anomaly that you want to delete and click on the Delete icon located on the right side of the anomaly.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking on the delete button, your anomaly is successfully deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
2. Delete via Action Menu
Step 1: Click on the archived anomaly that you want to delete from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Delete” from the drop-down menu.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking on the delete button, your anomaly is successfully deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
Filter and Sort
Filter and Sort options allow you to organize your anomalies by various criteria, such as Weight, Anomalous Record, and Created Date. You can also apply filters to refine your list of anomalies based on Selected Source Datastores, Selected Tags, Timeframe, Type, and Rule.
Sort
You can sort your anomalies by Anomalous Record, Created Date, and Weight to easily organize and prioritize them according to your needs.
No | Sort By Option | Description |
---|---|---|
1 | Anomalous Record | Sorts anomalies based on the number of anomalous records identified. |
2 | Created Date | Sorts anomalies according to the date they were detected. |
3 | Weight | Sorts anomalies by their assigned weight or importance level. |
Whichever sorting option you select, you can arrange the data in either ascending or descending order by clicking the caret button next to the selected sorting criterion.
Filter
You can filter your anomalies based on values like Source Datastores, Timeframe, Type, Rule, and Tags.
No. | Filter | Description |
---|---|---|
1 | Selected Source Datastore | Select specific source datastores to focus on their anomalies. |
2 | Select Tags | Filter anomalies by specific tags to categorize and prioritize issues effectively. |
3 | Timeframe | Filters anomalies detected within a specific time range (e.g., anomalies detected in the last week or year). |
4 | Type | Filter anomalies based on anomaly type (Record or Shape). |
5 | Rule | Filter anomalies based on specific rules applied to the anomaly. |
Ended: Explore
Notifications ↵
Notifications: Overview
Notifications in Qualytics offer a powerful system for delivering crucial alerts and updates across various communication channels. By setting up notification rules with specific triggers and channels, users can ensure timely awareness of critical events. This functionality enhances productivity, optimizes incident response, and promotes effective data management within the Qualytics platform.
Let’s get started 🚀
Multiple Notification Channels
Qualytics emphasizes the configuration of multiple notification channels, which are crucial for ensuring that important alerts and updates reach users effectively through various platforms.
In-App Notifications
In-app notifications in Qualytics are real-time alerts that keep users informed about various events related to their data operations and quality checks. These notifications are displayed within the Qualytics interface and cover a range of activities, including operation completions and anomaly detections.
External Platforms
Qualytics allows users to choose external platforms where they want to receive notifications, enhancing integration with their existing workflows. You can:
- Add Email Notification
- Add HTTP Notification
- Add Microsoft Teams Notification
- Add PagerDuty Notification
- Add Slack Notification
- Add Webhook Notification
This versatility ensures that notifications are delivered through the most convenient and effective channels for each team, allowing them to stay informed and respond to data quality issues in real time.
Note
If you do not select any notification channel, by default, you will receive notifications via in-app notifications. However, if you choose a notification channel, such as Email, you will receive notifications through both the selected channel and in-app notifications.
Tip
Qualytics provides you with multiple options for receiving notifications. You can select one or more notification channels to receive your notifications.
Reducing Notification Fatigue
Users can customize their notification settings to define which alerts they wish to receive and how they want to be notified. This flexibility ensures that users are only informed about relevant events, reducing notification fatigue.
For example, assigning tags when adding a notification generates alerts only for source datastores that have the tags defined in the notification rule. If the tag “critical” is defined while adding the notification rule, then notifications will be generated only for source datastores that have the “critical” tag.
User Feedback Integration
The system allows for feedback on detected anomalies, which helps refine the anomaly detection process and reduce alert fatigue. This ensures that users are only notified about the most relevant issues. Notifications are enhanced by incorporating user feedback on actions taken.
Navigation to Notifications
Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Add Notification Rule
In Qualytics, notification rules send in-app messages by default and can also notify you via external applications like Email, Slack, Microsoft Teams, and many more. This helps you stay updated on important events, like when an operation completes, even if you're not using the app. You can customize these alerts by adding datastore tags and choosing your preferred notification channels, ensuring you get timely updates.
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window “Add Notification Rule” will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages and, if configured, via external notification channels such as email, Slack, Microsoft Teams, and others. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification channel: Select the notification channel where you want your alerts to be sent. This ensures you get notified in the way you prefer.
Channels | Description | References |
---|---|---|
Emails | Send notifications directly to your specified email addresses. | See more. |
HTTP Action | Triggers an HTTP action to notify a specific endpoint or service. | See more. |
Microsoft Teams | Sends notifications to a specified Microsoft Teams channel. | See more. |
PagerDuty | Integrates with Pager Duty to alert you through your PagerDuty setup. | See more. |
Slack | Sends notifications to a specific Slack channel. | See more. |
Webhook | Sends notifications via webhooks to custom endpoints you configure. | See more. |
Note
If you do not select any notification channel, you will receive notifications by default via in-app notifications. However, if you choose a notification channel, such as Email, you will receive notifications through both the selected channel and in-app notifications.
Tip
Qualytics provides you with multiple options for receiving notifications. You can select one or more notification channels to get notified.
Step 3: Once you have selected your preferred notification channels, then click on the “Save” button.
After clicking on the “Save” button, a confirmation message will be displayed saying “Notification successfully created”.
Available Variables
Qualytics provides a set of internal variables that you can use to customize your notification messages. These variables dynamically insert specific information related to the triggered event, ensuring that your notifications are both relevant and informative. Below is a list of variables categorized by the type of operation:
Event | Variable | Description |
---|---|---|
When an Operation Completes | {{ rule_name }} | The name of the rule associated with the operation. |
{{ target_link }} | A link related to the operation. | |
{{ datastore_name }} | The name of the datastore involved. | |
{{ operation_message }} | A custom message related to the operation. | |
{{ operation_type }} | The type of operation performed. | |
{{ operation_result }} | The result of the operation. | |
When an Anomaly is Detected | {{ rule_name }} | The name of the rule associated with the detected anomaly. |
{{ target_link }} | A link to the relevant target or source. | |
{{ datastore_name }} | The name of the datastore where the anomaly was detected. | |
{{ anomaly_message }} | A custom message related to the anomaly. | |
{{ anomaly_type }} | The type of anomaly detected. | |
{{ check_description }} | A description of the check that detected the anomaly. | |
When Anomalies Are Detected in a Table or File | {{ rule_name }} | The name of the rule associated with the anomaly detection. |
{{ target_link }} | A link to the relevant target or source. | |
{{ datastore_name }} | The name of the datastore where the anomaly was detected. | |
{{ anomaly_count }} | The number of anomalies detected. | |
{{ scan_target_name }} | The name of the scan target (table or file). | |
{{ anomaly_message }} | A custom message related to the detected anomalies. | |
{{ check_description }} | A description of the check that detected the anomaly. |
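These placeholders are replaced with values from the triggering event when the notification is delivered. As a rough illustration only (not the platform's internal implementation), the sketch below shows how {{ variable }} placeholders can be interpolated from an event payload; the payload dictionary and its values here are hypothetical.

```python
import re

# Hypothetical event payload for illustration; the variable names mirror the
# table above, but the values are made up.
event = {
    "operation_type": "Scan Operation",
    "rule_name": "Max Partition Size",
    "datastore_name": "CustomerDataStore",
    "operation_result": "Success",
}

template = (
    "Operation {{ operation_type }} completed for the rule {{ rule_name }} "
    "on datastore {{ datastore_name }}. The result of the operation is "
    "{{ operation_result }}."
)

def render(message: str, context: dict) -> str:
    """Replace each {{ variable }} with its value; unknown variables are left untouched."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        message,
    )

print(render(template, event))
# Operation Scan Operation completed for the rule Max Partition Size on
# datastore CustomerDataStore. The result of the operation is Success.
```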
Notification Examples
I. When an Operation Completes
Message Template:
Operation {{ operation_type }} completed for the rule {{ rule_name }} on datastore {{ datastore_name }}. The result of the operation is {{ operation_result }}. You can review the details here: {{ target_link }}. Additional information: {{ operation_message }}.
Custom Message Example:
Operation Scan Operation completed for the rule Max Partition Size on datastore CustomerDataStore. The result of the operation is Success. You can review the details here: https://<your-instance>.qualytics.io/datastores/<datastore-id>/activity?operation_id=<operation-id>. Additional information: The scan confirmed that all partitions are within the allowed maximum number of records, ensuring data consistency and performance efficiency.
II. When an Anomaly is Detected
Message Template:
Anomaly detected by rule {{ rule_name }} in the datastore {{ datastore_name }}. The anomaly type is {{ anomaly_type }}. For more details, see here: {{ target_link }}. Additional info: {{ anomaly_message }} and check description: {{ check_description }}.
Custom Message Example:
Anomaly detected by rule Required Values in the datastore SalesDataStore. The anomaly type is record. For more details, see here: https://<your-instance>.qualytics.io/anomaly-details. Additional info: A record was found in the CUSTOMER table that lacks a required value for the field 'Customer_Status' and check description: This rule asserts that all defined values, such as 'Customer_Status', must be present at least once within a field to ensure data completeness and integrity.
III. When Anomalies Are Detected in a Table or File
Message Template:
Alert! {{ anomaly_count }} anomalies were detected in {{ scan_target_name }} within the datastore {{ datastore_name }}. The rule {{ rule_name }} triggered this detection. Anomaly details: {{ anomaly_message }}. You can review the details here: {{ target_link }}. Description of the check: {{ check_description }}.
Custom Message Example:
Alert! 1 anomaly was detected in the LINEITEM table within the datastore SalesDataStore. The rule MaxValue_Rule_L_QUANTITY triggered this detection. Anomaly details: The quantity of items (L_QUANTITY) exceeded the maximum allowed value of 50. You can review the details here: https://<your-instance>.qualytics.io/anomaly-details. Description of the check: This check asserts that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50, ensuring data accuracy and preventing potential overflows.
Manage Notifications
In Qualytics, managing notifications gives you full control over how and when you receive alerts. You can customize notifications to fit your needs, mute or unmute them to manage distractions and delete notifications to keep your system organized. By effectively managing your notifications, you ensure that critical updates reach you at the right time, while unnecessary alerts are minimized, allowing you to stay focused on what matters most.
Let's get started 🚀
Navigation
Qualytics gives you full control over your notifications, allowing you to edit notification rules, mute notifications, and delete unwanted notification rules. You can navigate to the notifications management settings via two different methods as discussed below.
Method I: User Interface
Step 1: Log in to your Qualytics account and click the bell "🔔" icon located in the top right navigation bar.
Here, you can view all the notifications that have been triggered based on your notification rules. You can also edit or mute the notification rule directly from this interface.
You can switch between the different types of notifications to have a categorical view (such as All, Operations, or Anomalies).
Method II: Global Settings
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here, you can view a list of all the notification rules you’ve added to the system. From this interface, you can easily manage (edit, delete, or mute) your notification rules as needed.
Edit Notifications Rule
You can edit the notification rules by adjusting trigger conditions, customizing the notification message, adding datastore tags, and adding or removing notification channels. These management options make sure your notifications fit your needs and preferences. There are two methods from which you can edit your notification rules.
Method I: From User Interface
You can directly navigate to the notification generated and appearing on the user interface and edit the rule that triggered such notification.
Note
If you edit the notification rule directly from the User Interface, the changes will apply to the entire rule, ensuring that all future notifications triggered from such an updated rule follow the latest modification.
Step 1: Log in to your Qualytics account and click the bell "🔔" icon located in the top right navigation bar.
Here, you can view all notifications that have been triggered based on the notification rules.
Step 2: Hover over any of the notifications from the list and an action menu will appear. Click on the vertical ellipsis within the menu to view the notification management options.
Step 3: Click on the Edit button to start editing and managing the trigger settings that produced the existing notification.
Step 4: A modal window will appear, allowing you to edit the notification rule. Here, you can:
- Adjust trigger conditions
- Modify notification messages
- Add or remove datastore tags
- Update Notification Channels
Step 5: Once you have edited the notification rule or made necessary changes, click on the Save button.
After clicking on the Save button, a success flash message will appear saying “Notification successfully updated”.
Method II: From Global Settings
You can also edit notification rules from the global settings. After making changes, such as adjusting the trigger conditions or adding tags, the notifications will now trigger based on the updated rule instead of the previous settings.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification rule you want to update and select the "Edit" option from the drop-down menu.
Step 3: A modal window will appear, allowing you to edit the notification rule. Here, you can:
- Adjust trigger conditions
- Modify notification messages
- Add or remove datastore tags
- Update notification channels
Step 4: Once you have edited the notification rule or made necessary changes, click on the Save button.
After clicking on the Save button, a success flash message will be displayed saying “Notification successfully updated”.
Mute Notifications
You can temporarily turn off alerts for specific events without deleting them by muting notifications. This is helpful when you want to avoid distractions but still have the option to reactivate the notifications later. By muting, you can control which alerts you receive, focusing only on what’s most important.
You can mute your notifications through two different methods, as discussed below.
Method I: From User Interface
You can mute notifications directly from the user interface, which allows you to silence specific alerts while keeping them available for later. This helps reduce distractions without losing track of important information. If needed, you can unmute them at any time.
Note
Muting a notification from the User Interface will mute the entire notification rule, stopping all future notifications from that rule.
Step 1: Log in to your Qualytics account and click the bell “🔔” icon located in the top right navigation bar.
Here, you can view all notifications that have been triggered based on the notification rule.
Step 2: Hover over any of the notifications in the list and an action menu will appear. Click on the vertical ellipsis within the menu to view the notification management options.
Step 3: Click on the Mute button from the dropdown list to mute the notification.
After clicking the Mute button, your notification will be muted, and a success flash message will appear saying, Notification successfully muted.
Method II: From Global Settings
You can also mute notifications from the global settings. This helps you focus by reducing distractions and ensuring that only the most important alerts come through on your notification channels when you need them.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification rule that you want to mute, then select Mute from the drop-down menu.
Step 3: After clicking the Mute button, all the notifications related to that rule will be muted, and a success flash message will be displayed: Notification successfully muted.
Unmute Notifications
Step 1: To unmute a notification, click on the vertical ellipsis next to the muted notification with the crossed-out bell icon, then select Unmute from the drop-down menu.
Step 2: After clicking on the Unmute button, your notification will be unmuted and a success flash message will be displayed saying Notification successfully unmuted.
Delete Notifications Rule
If you need to tidy up your notifications, you can delete a notification rule to remove it from the system permanently. Once deleted, the rule and its associated notifications will no longer appear in the user interface.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification you want to delete, then select "Delete" from the drop-down menu.
Step 3: A modal window Delete Notification will appear. Click the Delete button to confirm and remove the notification.
After clicking the Delete button, your notification will be removed, and a success flash message will appear stating, Notification successfully deleted.
Channels ↵
Email Notification
Adding email notifications allows users to receive timely updates or alerts directly in their inbox. By setting up notifications with specific triggers and channels, you can ensure that you are promptly informed about critical events, such as operation completions or detected anomalies. This proactive approach allows you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Let’s get started 🚀
Navigation to Notifications
Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Add Email Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages and, if configured, via external notification channels such as email, Slack, Microsoft Teams, and others. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select Email as your notification channel and enter the Email Address where you want the notification to be sent.
Test Email Notification
Step 1: Click the Test Notification button to send a test email to the provided address. If the email is successfully sent, you will receive a confirmation message indicating “Notification successfully sent”.
Step 2: The test email will be sent to the address you have provided. This verifies that the address provided is correct.
Save Email Notification
Step 1: Once you have entered all the values and selected Email as your notification channel, click on the Save button.
After clicking the Save button, a success message will be displayed saying Notification Successfully Created.
Post Results
Once you’ve saved your notification rules settings, all real-time notifications will be sent to the email address you specified, ensuring you receive them directly in your inbox.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the email address you provided. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
HTTP Action
Integrating HTTP Action notifications allows users to receive timely updates or alerts directly to a specified server endpoint. By setting up HTTP Action notifications with specific trigger conditions, you can ensure that you are instantly informed about critical events, such as operation completions or anomalies detected. This approach enables you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the “Notification Rules” button on the left side panel of the interface.
Add HTTP Action Notification
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window, “Add Notification Rule” will appear, providing options to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and the HTTP action. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select “HTTP Action” as your notification channel and enter the details where you want the notification to be sent.
Step 3: Enter the following detail where you want the notification to be sent.
1. Action URL: Enter the “Action URL” in this field, which specifies the server endpoint for the HTTP request, defining where data will be sent or retrieved. It must be correctly formatted and accessible, including the protocol (http or https), domain, and path.
2. HTTP Verbs: HTTP verbs specify the actions performed on server resources. Common verbs include:
-
POST: Use POST to send data to the server to create something new. For example, it's used for submitting forms or uploading files. The server processes this data and creates a new resource.
-
PUT: Updates or creates a resource, replacing it entirely if it already exists. For example, updating a user’s profile information or creating a new record with specific details.
-
GET: Retrieves data from the server without making any modifications. For example, requesting a webpage or fetching user details from a database.
3. Username: Enter the username needed for authentication.
4. Auth Type: This field specifies how to authenticate requests. Choose the method that fits your needs:
-

Basic: Uses a username and password sent with each request. Example: “Authorization: Basic <base64-encoded username:password>”.

-

Bearer: Uses a token included in the request header to access resources. Example: “Authorization: Bearer <token>”.

-

Digest: Provides a more secure authentication method by using a hashed combination of the username, password, and request details. Example: Authorization: Digest username="<username>", realm="<realm>", nonce="<nonce>", uri="<uri>", response="<response>".
5. Secret: Enter the password or token used for authentication. This is paired with the Username and Auth Type to securely access the server. Keep the secret confidential to ensure security.
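If you operate the service behind the Action URL yourself, the following is a minimal sketch of an endpoint that could receive these requests. It is only a sketch under stated assumptions: the route path and port are hypothetical, the handler simply logs whatever JSON body arrives (the exact request payload is not documented here), and it checks the standard Basic scheme described above against the Username and Secret configured in the rule.

```python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

EXPECTED_USER = "qualytics"    # must match the Username field in the rule
EXPECTED_SECRET = "change-me"  # must match the Secret field in the rule

def basic_auth_ok(header: str) -> bool:
    """Validate an 'Authorization: Basic <base64(user:secret)>' header."""
    if not header.startswith("Basic "):
        return False
    try:
        user, _, secret = base64.b64decode(header[6:]).decode().partition(":")
    except Exception:
        return False
    return user == EXPECTED_USER and secret == EXPECTED_SECRET

# Hypothetical route path; use whatever path you configure as the Action URL.
@app.route("/qualytics-notifications", methods=["POST"])
def receive_notification():
    if not basic_auth_ok(request.headers.get("Authorization", "")):
        return jsonify({"error": "unauthorized"}), 401
    payload = request.get_json(silent=True) or {}
    app.logger.info("Received notification payload: %s", payload)
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8080)
```

With this running, the Action URL would point at the /qualytics-notifications path on your host, with POST as the HTTP verb, Basic as the Auth Type, and the matching Username and Secret.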
Test HTTP Action Notification
Step 1: Click the "Test Notification" button to verify the correctness of the Action URL. If the URL is correct, a confirmation message saying "Notification successfully sent" will appear, confirming that the HTTP action is set up and functioning properly.
If you enter an incorrect Action URL, you will receive a failure message. For example, if you enter an incorrect URL endpoint like “test-message”, you will see a failure message indicating "failure: HTTP action returned 404: {"error": "Error: Endpoint not found with that path and method"}." This message shows that the specified endpoint could not be found.
Save HTTP Action Notification
Once you have provided all the necessary values, set the trigger conditions for the notification, and verified the correctness of the Action URL, click the "Save" button.
After clicking the “Save” button, a success message will be displayed saying "Notification Successfully Created".
Microsoft Teams Notification
Integrating Microsoft Teams notifications allows users to receive timely updates or alerts directly in their Teams channel. By setting up Microsoft Teams notifications with specific trigger conditions, you can ensure that you are instantly informed about critical events, such as operation completions or anomalies detected. This approach allows you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the “Notification Rules” button on the left side panel of the interface.
Add Microsoft Teams Notification
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window “Add Notification Rule” will appear providing you with options to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and the Microsoft Teams channel. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select “Microsoft Teams” as your notification channel and enter the “Webhook URL” where you want the notification to be sent.
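If you want to sanity-check the webhook URL outside Qualytics before saving the rule, the snippet below assumes a standard Microsoft Teams incoming webhook, which accepts a simple JSON payload with a "text" field; the URL shown is a placeholder.

```python
import requests

# Placeholder: paste the incoming webhook URL configured for your Teams channel.
webhook_url = "https://<your-tenant>.webhook.office.com/webhookb2/..."

# Standard incoming webhooks accept a minimal JSON payload with a "text" field.
response = requests.post(webhook_url, json={"text": "Webhook connectivity check"})
response.raise_for_status()
print("Teams accepted the message with status", response.status_code)
```

The in-product Test Notification button described below performs an equivalent check from within Qualytics.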
Test Microsoft Teams Notification
Step 1. Click the "Test Notification" button to send a test message to the provided “Webhook URL”. If the message is successfully sent, you will receive a confirmation notification indicating "Notification successfully sent".
Step 2: The test notification will be sent to the Webhook URL address you have provided. This verifies that the address provided is correct.
Save Microsoft Teams Notification
Step 1: Once you have entered all the values and selected “Microsoft Teams” as your notification channel, click on the “Save” button.
After clicking the Save button, a message will appear on the screen saying "Notification Successfully Created".
Post Results
Once you’ve saved your notification rules settings, all real-time notifications will be sent to the Webhook URL address you specified, ensuring you receive them directly in your Microsoft Teams channel.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the Microsoft Teams channel as configured. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
PagerDuty Notification
Integrating PagerDuty with Qualytics ensures that your team gets instant alerts for critical data events and system issues. With this connection, you can automatically receive real-time notifications about anomalies, operation completions and other important events directly in your PagerDuty account. By categorizing alerts based on severity, it ensures the right people are notified at the right time, speeding up decision-making and resolving incidents efficiently. This helps your team respond quickly to issues, reducing downtime and keeping data operations on track.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Notification Rules button on the left side panel of the interface.
Add PagerDuty Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and PagerDuty. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select PagerDuty as your notification channel and enter the Integration Key where you want the notification to be sent.
Info
For detailed instructions on creating or configuring the PagerDuty integration, please refer to the PagerDuty documentation available here.
7. Severity: Select the appropriate PagerDuty severity level to categorize incidents based on their urgency and impact. The available severity levels are:
-
Info: For informational messages that don't require immediate action but provide helpful context.
-
Warning: For potential issues that may need attention but aren't immediately critical.
-
Error: For significant problems that require prompt resolution to prevent disruption.
-
Critical: For urgent issues that demand immediate attention due to their severe impact on system operations.
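For context on how these levels are used downstream, they correspond to the severity field of a PagerDuty Events API v2 event. Qualytics delivers events on your behalf using the Integration Key you provide; the sketch below only illustrates the general shape of such an event, and the routing key, summary, and source values are placeholders.

```python
import requests

# Illustrative PagerDuty Events API v2 "trigger" event; values are placeholders.
event = {
    "routing_key": "<your-integration-key>",  # the Integration Key entered in the rule
    "event_action": "trigger",
    "payload": {
        "summary": "Anomaly detected in SalesDataStore",  # example alert text
        "source": "qualytics",
        "severity": "error",  # one of: info, warning, error, critical
    },
}

# The Events API v2 enqueue endpoint accepts the event as a JSON body.
response = requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
response.raise_for_status()
print(response.json())
```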
Test PagerDuty Notification
Click on the Test notification button to check if the integration key is functioning correctly. Once the test notification is sent, you will see a success message, "Notification successfully sent."
This confirms that the integration is properly configured and that the PagerDuty account will receive notifications as expected.
Save PagerDuty Notification
Step 1: Once you have entered all the values and selected PagerDuty as your notification channel, click on the Save button.
After clicking the Save button, a success flash message will be displayed saying "Notification Successfully Created".
Post Results
Once you’ve saved the PagerDuty notification rule, all real-time notifications will be sent to the integration key you specified.
For example, when an operation is completed or an anomaly is detected in a table or file, a notification will be sent to the integration key you provided. This keeps you informed of critical events in real-time, allowing you to take quick action to maintain smooth operations.
Slack Notification
To set up Slack notifications, start by naming your notification and selecting the triggers, such as operation completion or anomaly detection. Next, add relevant tags and configure the Slack Webhook URL to connect directly to your Slack channel.
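If you want to confirm the Slack Webhook URL independently before adding it to a rule, a standard Slack incoming webhook accepts a JSON body with a "text" field and replies with the literal text "ok"; the URL below is a dummy placeholder.

```python
import requests

# Placeholder: paste your Slack incoming webhook URL here.
webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

response = requests.post(webhook_url, json={"text": "Webhook connectivity check"})
response.raise_for_status()
print(response.text)  # a healthy incoming webhook responds with "ok"
```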
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Add Slack Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title for your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages, and a Slack notification is sent to the configured channel. For example, when a catalog operation is completed, a Slack notification is sent, allowing the team to proceed with the profile operation on the datastore efficiently.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight does not apply to this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete suggests relevant variables in a dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select Slack as your notification channel and enter the Webhook URL where you want the notification to be sent.
Info
Check the official Slack documentation here for instructions on how to create or configure the Slack webhook URL.
Test Slack Notification
Step 1: Click the "Test Notification" button to send a test message to the provided Webhook URL. If the message is successfully sent, you will receive a confirmation notification indicating "Notification successfully sent".
Step 2: The test message will be sent to the Webhook URL you have provided. This verifies that the URL is correct.
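If you want to verify the webhook URL independently of Qualytics, the sketch below posts a simple message directly to a Slack incoming webhook. The URL shown is a placeholder; replace it with the incoming webhook URL generated in Slack.

```python
# A minimal sketch that posts a test message to a Slack incoming webhook,
# useful for confirming the webhook URL works before saving it in Qualytics.
# The URL below is a placeholder; use your own incoming webhook URL.
import requests

webhook_url = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # placeholder
message = {"text": "Test: the Qualytics Slack notification webhook is reachable."}

resp = requests.post(webhook_url, json=message, timeout=30)
resp.raise_for_status()
print(resp.text)  # Slack replies with the plain text "ok" on success
```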
Save Slack Notification
Step 1: Once you have entered all the values and selected Slack as the notification channel, click on the Save button.
After clicking the Save button, a message will appear on the screen saying "Notification Successfully Created".
Post Results
Once you’ve saved your notification rule settings, all real-time notifications will be sent to the Webhook URL you specified, ensuring you receive them directly in your Slack channel.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the Slack webhook you provided. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
Webhook Notifications
Qualytics allows you to connect external apps for notifications using webhooks, making it easy to stay updated in real time. When you set up a webhook, it sends an instant alert to the connected app whenever a specific event or condition occurs. This means you are notified about important events as they happen and can respond right away. By using webhook notifications, you can keep your system running smoothly, keep everyone informed, and manage your operations more efficiently.
Let’s get started 🚀
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Add Webhook Notification
By adding a webhook for any external application where you want to send your notifications, you'll receive real-time alerts whenever specific conditions or events occur, such as when an operation is completed or an anomaly is detected within a table or file.
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title for your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages, and a webhook is triggered to send notifications to external systems or applications. For example, when a catalog operation is completed, a webhook notification is sent, allowing the team to proceed with the profile operation on the datastore efficiently.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight does not apply to this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete suggests relevant variables in a dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
Step 6: Select "Webhook" as the notification channel and enter the desired "Webhook URL" of the target system where you want to receive notifications.
Info
Please refer to the official documentation of the target system for detailed instructions on how to create or configure the webhook URL.
Test Webhook Notification
This feature lets you send a test message to your configured webhook URL to verify that it is correctly set up and receiving notifications. It helps ensure that your integration is working before real events trigger alerts.
Step 1: Click on the "Test Notification" button to send a test notification to the webhook URL you provided. If the webhook URL is correct, you will receive a confirmation message saying "Notification successfully sent." This indicates that the webhook is functioning correctly.
The test sends a sample payload as an HTTP POST request to the configured URL endpoint. This confirms that the webhook is correctly configured and that notifications will be delivered to the intended endpoint when real events occur.
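If you need a quick endpoint to receive these test requests while setting things up, the sketch below runs a minimal HTTP server that accepts the POST and prints whatever body it receives. No payload structure is assumed here; inspect the printed output to see exactly what Qualytics sends to your endpoint.

```python
# A minimal sketch of an HTTP endpoint that accepts webhook POSTs and prints
# the request body, so you can inspect exactly what Qualytics sends. Expose
# this server (for example through a public tunnel) and use its address as
# the Webhook URL in the notification rule.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            print(json.dumps(json.loads(body), indent=2))  # pretty-print JSON payloads
        except ValueError:
            print(body.decode(errors="replace"))           # fall back to raw text
        self.send_response(200)                            # acknowledge receipt
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```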
Save Webhook Notification
Once you have provided all the necessary values, set the trigger conditions for the notification, and tested the notification, click the Save button.
After clicking the “Save” button, a success message will be displayed saying "Notification Successfully Created".
Post Results
After setting up a notification in Qualytics to be sent via webhook, the system watches for the specific events or conditions you've defined. When one of these events happens, the webhook is triggered, and Qualytics sends an HTTP POST request to the external application's URL. This request contains detailed information about the event, enabling the external application to take action.
As a result, the external application can update dashboards, trigger alerts, or integrate the data into workflows in real-time. This ensures you can quickly respond to important events or data changes in Qualytics.
Ended: Channels
Ended: Notifications
Tags ↵
Tags
Tags allow users to categorize and organize data assets effectively and provide the ability to assign weights for prioritization. They drive notifications and downstream workflows, enabling users to stay informed and take appropriate actions. Tags can be configured and associated with specific properties, allowing for targeted actions and efficient management of entities across multiple datastores.
Tags can be applied to Datastores, Profiles, Fields, Checks, and Anomalies, streamlining data management and improving workflow efficiency. Overall, tags enhance organization, prioritization, and decision-making.
Let’s get started 🚀
Navigation to Tags
Step 1: Log in to your Qualytics account and click on Tags in the left side panel of the interface.
You will be navigated to the Tags section, where you can view all the tags available in the system.
Add Tag
Step 1: Click on the Add Tag button from the top right corner.
Step 2: A modal window will appear, providing the options to create the tag. Enter the required values to get started.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1. | Preview | This shows how the tag will appear to users. | Preview |
2. | Name | Assign a name to your tag. | Sensitive |
3. | Color | A color picker feature is provided, allowing you to select a color using its hex code. | #E74C3C |
4. | Description | Explain the nature of your tag. | Maintain data that is highly confidential and requires strict access controls. |
5. | Category | Choose an existing category or create a new one to group related tags for easier organization. | Demo2 |
6. | Weight Modifier | Adjust the tag's weight for prioritization, where a higher value represents greater significance. The range is between -10 and 10. | 10 |
Step 3: Click on the Save button to save your tag.
View Created Tags
Once you have created a tag, you can view it in the tags list.
Filter and Sort
Qualytics allows you to sort and filter your tags so that you can easily organize and find the most relevant tags according to your criteria, improving data management and workflow efficiency.
Sort
You can sort your tags by Color, Created Date, Name, and Weight to easily organize and prioritize them according to your needs.
Filter
You can filter your tags by type and category, which allows you to categorize and manage them more effectively.
Filter by Type
Filter by Type allows you to view and manage tags based on their origin. You can filter between Global tags created within the platform and External tags imported from integrated systems like Atlan or Alation.
-
External Tags: External tags are metadata labels imported from an integrated data catalog system, such as Atlan or Alation, into Qualytics. These tags are synchronized automatically via API integrations and cannot be created or edited manually within Qualytics. They help ensure consistency in data tagging across different platforms by using the same tags already established in the data catalog. Example: If Atlan has a tag named Customer, once integrated, this tag will automatically be synchronized and added to Qualytics as an external tag.
-
Global Tags: Global tags are metadata labels that are created and managed directly within Qualytics. These tags are not influenced by external integrations and are used internally within the Qualytics platform to organize and categorize data according to the users' requirements. Example: A tag created within Qualytics to mark datasets that need internal review. This tag is fully managed within the Qualytics platform and remains unaffected by external data catalog systems unless the Overwrite Tags option is enabled in the Integration configuration.
Filter by Category
Filter by Category allows you to organize and manage tags based on predefined groups or categories. By applying this filter, you can quickly locate tags that belong to a specific category, improving searchability and making it easier to manage large volumes of data.
Manage Tags
You can easily manage your tags by keeping them updated with current information and removing outdated or unnecessary tags. This ensures that your data remains organized and relevant, enhancing overall efficiency and workflow. By efficiently managing tags, you improve data handling and ensure high data standards across the platform.
Edit Tags
This allows you to keep your tags updated with current information and relevance.
Step 1: Click the vertical ellipsis (⋮) next to the tag that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the tag's name, color, description, category and weight as needed.
Step 3: Click the Save button to apply your changes.
Delete Tags
This allows you to remove outdated or unnecessary tags to maintain a clean and efficient tag system.
Step 1: Click the vertical ellipsis (⋮) next to the tag that you want to delete, then click on Delete from the dropdown menu.
Step 2: After clicking the Delete button, your tag will be removed from the system, and a success message will appear saying Tag successfully deleted.
Applying a Tag
Once a Tag is created, it's ready to be associated with a Datastore, Profile, Check, Notification, and ultimately an Anomaly.
Tag Inheritance
-
When a Tag is applied to a data asset, all the descendants of that data asset also receive the Tag.
- For example, if a Tag named Critical is applied to a Datastore, then all the Tables, Fields, and Checks under that Datastore also receive the Tag.
Note
Anomalies will inherit the tags if a scan has been run.
-
Likewise, if the Critical Tag is subsequently removed from one of the Tables in that Datastore, then all the Fields and Checks belonging to that Table will have the Critical Tag removed as well.
-
When a new data asset is created, it inherits the Tags from the owning data asset. For example, if a user creates a new Computed Table, it inherits all the Tags that are applied to the Datastore in which it is created.
Tagging Anomalies
-
Anomalies also inherit Tags at the time they are created. They inherit all the Tags of all the associated failed checks.
-
Thus, Anomalies do not inherit subsequent tag changes from those checks; they inherit tags from checks only once, at creation time.
-
Tags can be directly applied to or removed from Anomalies at any time after creation.
Ended: Tags
Settings ↵
Connections
The Connections Management section allows you to manage global configurations for various connections to different data sources. This provides you with a centralized interface for managing all the data connections, ensuring efficient data integration and enrichment processes. You can easily navigate and manage your connections by utilizing the search, sort, edit, and delete features.
Let's get started 🚀
Navigation to Connection
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Connection tab.
Manage Connection
You can effectively manage your connections by editing, deleting, and adding datastores to maintain accuracy and efficiency.
Warning
Before deleting a connection, ensure that all associated datastores and enrichment datastores have been removed.
Edit Connection
You can edit connections to update details like name, account, role, warehouse, and authentication to improve performance. This keeps connection settings up-to-date and suited to your data needs.
Note
You can only edit the connection name and connection details, but you are not able to edit the connector itself.
Step 1: Click the vertical ellipsis (⋮) next to the connection that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the connection details as needed.
Note
Connection details vary from connection to connection, which means that each connection may have its unique configuration settings.
Step 3: Once you have updated the values, click on the Save button to apply your changes.
Step 4: After clicking the Save button, your connection will be updated, and a success message will display saying Connection successfully updated.
Delete Connection
This allows you to remove outdated or unnecessary connections to maintain a clean and efficient network configuration.
Step 1: Click the vertical ellipsis (⋮) next to the connection that you want to delete, then click on Delete from the dropdown menu.
Step 2: A modal window Delete Connection will appear.
Warning
Associated Source Datastores and Enrichment Datastores must be removed before deleting the connection.
Step 3: Enter the Name of the Connection in the given field (confirmation check) and then click on the I’M SURE, DELETE THIS CONNECTION button to delete the connection.
Add Datastore
You can add new or existing datastores and enrichment datastores directly from the connection, making it easy to manage and access your data while ensuring all sources are connected and available.
Step 1: Click the vertical ellipsis (⋮) next to the connection where you want to add a datastore, then click on Add Datastore from the dropdown menu.
A modal window labeled Add Datastore will appear, giving you options to connect a datastore. For more information on adding a datastore, please refer to the Configuring Datastores section.
Once you have successfully added a datastore to the connection, a success message will appear saying, Your datastore has been successfully added.
View Connection
Once you have added a new datastore and enrichment datastore, you can view them in the connections list.
Sort Connection
You can sort your connections by Name and Created Date to easily find and manage them.
Filter Connection
You can filter connections by selecting specific data source types from the dropdown menu, making it easier to locate and manage the desired connections.
Integrations ↵
Overview
With Qualytics integrations, data analysts can rely on the data quality insights produced by the platform's data profiling and scanning. These insights can be visualized in your preferred data catalog tool.
Key features include:
- Leveraging data catalog tags
- Pushing alerts based on anomaly identification
- Sharing valuable data quality metrics
Supported data catalog integrations:
- Atlan
- Alation
Once an integration is set up, the synchronization process can occur in two ways:
-
Manual Sync: Manual sync occurs when the user clicks the Sync button located on the Settings page in the Integrations tab. Clicking it triggers a full sync of all matching assets.
Info
Tags are only synchronized through a manual sync.
-
Event Driven: Once the Event Driven option is enabled, triggering occurs automatically based on the listed action events executed by users. If any of the following actions occur, a sync is triggered.
Event | Description |
---|---|
Run an Operation (Profile or Scan) | Sync all target containers for the operation |
Archive an Anomaly (including bulk) | Sync the container in which the anomaly was identified |
Archive a Check (including bulk) | Sync the container to which the check belongs |
Atlan
Integrating Atlan with Qualytics allows for easy push and pull of metadata between the two platforms. Specifically, Qualytics "pushes" its metadata to the data catalog and "pulls" metadata from the data catalog. Once connected, Qualytics automatically updates when key events happen in Atlan, such as metadata changes, anomaly updates, or archiving checks. This helps maintain data quality and consistency. During the sync process, Qualytics can either replace existing tags in Atlan or skip assets that have duplicate tags to avoid conflicts. Setting it up is simple—you just need to provide an API token to allow smooth communication between the systems.
Let’s get started 🚀
Atlan Setup
Create an Atlan persona and policy
Before starting the integration process, it is recommended that you set up an Atlan persona, which grants access to the necessary data and metadata. While you can create this persona at the same time as your API token, it's easier to create it first so you can link the persona directly to the token later.
Before using Atlan with your data source, authorize the API token with access to the needed data and metadata. You do this by setting up policies within the persona for the Atlan connection that matches your Qualytics data source. Remember, you will need to do this for each data source you want to integrate.
Step 1: Navigate to Governance, then select “Personas”.
Step 2: Click on “+ New Persona Button”.
Step 3: Enter a Name and Description for a new persona, then click the “Create” button.
Step 4: Here your new Atlan persona has been created.
Step 5: After creating a new Atlan persona you have to create policies to authorize the personal access token. Click on "Add Policies" to create a new policy or to add one if there isn't any available.
Step 6: Click on "New Policy" and select "Metadata Policy" from the dropdown menu.
Step 7: Enter a "name", and choose the "connection".
Step 8: Customize the permissions and assets that Qualytics will access.
Step 9: Once the policy is created, you’ll see it listed in the Policies section.
Create Atlan Personal Access Token
After you’ve created the persona, the next step is to create a personal access token.
Step 1: Navigate to the API Tokens section in the Admin Center.
Step 2: Click on "Generate API Token" button.
Step 3: Enter a name and description, and select the persona you created earlier.
Step 4: Click the "Save" button and make sure to store the token in a secure location.
Add Atlan Integration
Integrating Atlan with Qualytics enhances your data management capabilities, allowing seamless synchronization between the two platforms. This guide will walk you through the steps to add the Atlan integration efficiently. By following these steps, you can configure essential settings, provide necessary credentials, and customize synchronization options to meet your organization’s needs.
Step 1: Log in to your Qualytics account and click the "Settings" button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the "Integration" tab.
Step 3: Click on the “Add Integration” button.
Step 4: Fill out the configuration form selecting the "Atlan" integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Provide a name for the integration. |
2. | Type (Required) | Choose the type of integration from the dropdown menu. Currently, 'Atlan' is selected |
3. | URL (Required) | The complete address for the Atlan instance, for example: https://your-company.atlan.com. |
4. | Token (Required) | Provide the authentication token needed to connect to Atlan. |
5. | Domains | Select specific domains to filter assets for synchronization. - Acts as a filtering mechanism to sync specific assets - Uses domain information from the data catalog (e.g. Sales ). Only assets under the selected domains will synchronize. |
6. | Event Driven | If enabled, the integration sync will be activated by operations, archiving anomalies, and checks. |
7. | Overwrite Tags | If enabled, Atlan tags will have precedence over Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
Step 5: Click on the Save button to set up the Atlan integration.
Step 6: Once the Atlan integration is set up with Qualytics, it will appear in Qualytics as a new integration.
Synchronization
The Atlan synchronization supports both push and pull operations. This includes pulling metadata from Atlan to Qualytics and pushing Qualytics metadata to Atlan. During the syncing process, the integration pulls tags assigned to data assets in Atlan and assigns them to Qualytics assets as an external tag.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click on the "Sync" button next to the relevant integration card.
Step 2: After clicking the "Sync" button, you will have the following options:
- Pull Atlan Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the "Start" button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Atlan begins. This process pulls metadata from Atlan and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Review the logs to verify which assets were successfully mapped from Atlan to Qualytics.
Step 6: Once synchronization is complete, the mapped assets from "Atlan" will display an external tag.
Step 7: When Qualytics detects anomalies, alerts are sent to the assets in Atlan, displaying the number of active anomalies and including a link to view the corresponding details
Metadata
The Quality Score Total, the Qualytics 8 metrics (completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy), and the count of checks and anomalies per asset identified by Qualytics are pushed to Atlan.
Alation
Integrating Alation with Qualytics allows you to pull metadata from Alation to Qualytics and push Qualytics metadata to Alation. Once integrated, Qualytics stays updated with key changes in Alation, like metadata updates and anomaly alerts, which helps ensure data quality and consistency. Qualytics updates only active checks, and metadata updates in Qualytics occur if the Event-Driven option is enabled or can be triggered manually using the "Sync" button. During sync, Qualytics can replace existing tags in Alation or skip duplicate tags to avoid conflicts. The setup is simple: just provide a refresh token for communication between the systems.
Let’s get started 🚀
Alation Setup
Create Refresh Token
Before setting up Alation Integration in Qualytics, you have to generate a Refresh token. This allows Qualytics to access Alation's API and keep data in sync between the two platforms.
Step 1: Navigate to the "Profile Settings".
Step 2: Select the "Authentication" tab.
Step 3: Click on the "Create Refresh Token" button.
Step 4: Enter a name for the token.
Step 5: After entering the name for the token, click on "Create Refresh Token".
Step 6: Your "refresh" token has been generated successfully. Please Copy and save it securely.
Step 7: Here you can view the token that is successfully added to the access tokens list.
Add Alation Integration
Step 1: Log in to your Qualytics account and click the "Settings" button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the "Integration" tab.
Step 3: Click on the "Add Integration" button.
Step 4: Complete the configuration form by choosing the Alation integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Provide a name for the integration. |
2. | Type (Required) | Choose the type of integration from the dropdown menu. Currently, 'Alation' is selected |
3. | URL (Required) | Enter the full address of the Alation instance, for example, https://instance.alationcloud.com. |
4. | Refresh Token (Required) | Enter the refresh token required to access the Alation API. |
5. | User ID (Required) | Provide the user ID associated with the generated token. |
6. | Domains | Select specific domains to filter assets for synchronization. - Acts as a filtering mechanism to sync specific assets - Uses domain information from the data catalog (e.g. Sales ). Only assets under the selected domains will synchronize. |
7. | Event Driven | If enabled, operations, archiving anomalies, and checks will activate the integration sync. |
8. | Overwrite Tags | If enabled, Alation tags will override Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
Step 5: Click on the Save button to integrate Alation with Qualytics.
Step 6: Here you can view the new integration appearing in Qualytics.
Synchronization
The Alation synchronization supports both push and pull operations. This includes pulling metadata from Alation to Qualytics and pushing Qualytics metadata to Alation. During the syncing process, the integration pulls tags assigned to data assets in Alation and assigns them to Qualytics assets as an external tag.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click the "Sync" button next to the relevant integration card.
Step 2: After clicking the Sync button, you will have the following options:
- Pull Alation Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the "Start" button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Alation begins. This process pulls metadata from Alation and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Once synchronization is complete, the mapped assets from Alation will display an external tag.
Alerts
When Qualytics detects anomalies, alerts are sent to the assets in Alation, showing the number of active anomalies and providing a link to view them.
Metadata
The Quality Score Total, the "Qualytics 8" metrics (completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy), and counts of checks and anomalies per asset identified by Qualytics are pushed to Alation. This enables users to analyze assets based on data profiling and scanning metrics. A link to the asset in Qualytics is also provided.
Data Health
On the Alation tables page, there's a tab called “Data Health” where Qualytics displays insights from data quality checks in a table format, showing the current status based on the number of anomalies per check.
Column | Description |
---|---|
Rule | The type of data quality check rule |
Object Name | The Table Name |
Status | The check status can be either "Alert" if there are active anomalies or "No Issues" if no active anomalies exist for the check. |
Value | The current number of active anomalies |
Description | The data quality check description |
Last Updated | The last synced timestamp |
External Tag Propagation
External tag propagation in Qualytics synchronizes metadata labels automatically from an integrated data catalog, such as Atlan or Alation. This process helps maintain consistent data tagging across various platforms by using pre-existing tags from the data catalog.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the Integration tab.
Step 3: Click on the Add Integration button.
A modal window Add Integration will appear, providing you with the options to add integration.
Step 4: Fill out the configuration form by selecting the integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Provide a name for the integration. |
2. | Type | Choose the type of integration from the dropdown menu. Currently, 'Atlan' is selected |
3. | URL | The complete address for the Atlan instance, for example: https://your-company.atlan.com. |
4. | Token | Provide the authentication token needed to connect to Atlan. |
5. | Event Driven | If enabled, the integration sync will be activated by operations, archiving anomalies, and checks. |
6. | Overwrite Tags | If enabled, Atlan tags will have precedence over Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
For demonstration purposes we have selected Atlan integration type.
Step 5: Click on the Save button to set up the Atlan integration.
Step 6: Once the Atlan integration is set up with Qualytics, it will appear in Qualytics as a new integration.
Synchronization
Synchronization supports both push and pull operations. This includes pulling metadata from one platform to Qualytics and pushing Qualytics metadata to the other platform. During the syncing process, the integration pulls tags assigned to data assets in the source platform and assigns them to Qualytics assets as an external tag.
For demonstration purposes we have selected Atlan synchronization.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click on the Sync button next to the relevant integration card.
Step 2: After clicking the Sync button, you will have the following options:
- Pull Atlan Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the Start button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Atlan begins. This process pulls metadata from Atlan and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Review the logs to verify which assets were successfully mapped from Atlan to Qualytics.
Step 6: Once synchronization is complete, the mapped assets from Atlan will display an external tag.
Ended: Integrations
Security
You can easily manage user and team access by assigning roles and permissions within the system. This includes setting up specific access levels and roles for different users and teams. By doing so, you ensure that data and resources are accessed securely and appropriately, with only authorized individuals and groups having the necessary permissions to view or modify them. This helps maintain the integrity and security of your system.
Note
Only users with the Admin role have the authority to manage global platform settings, such as user permissions and team access controls.
Let’s get started 🚀
Navigation to Security
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Security tab.
Add Team
You can create a new team for efficient and secure data management. Teams make it easier to control who has access to what, help people work together better, keep things secure with consistent rules, and simplify managing and expanding user groups. You can assign permissions to the team, such as read and write access, by selecting the datastore and enrichment datastore to which you want them to have access. This makes data management easier.
In Qualytics, every user is assigned one of two roles: Admin or Member.
-
Admin: Admin users have full access to the system and can manage datastores, teams, and users. This means they can access everything in the application, as well as manage user accounts and team permissions.
-
Member: Members are normal users with access explicitly granted to them, usually inherited from the teams they are assigned to.
Step 1: Click on the Add Team button located in the top right corner.
Step 2: A modal window will appear, providing the options for creating the team. Enter the required values to get started.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1. | Name | Enter the name of the team | Data Insights Team |
2. | Description | Provide a brief description of the team. | Analyzes data to provide actionable insights, supporting data-driven decisions |
3. | Permission | Select the permission level for the team: Write (manage and edit data), Read (view and report), or None (no access) | Read/Write |
4. | Users | Add users to the team | John, Michael |
5. | Source Datastores | Grant access to specific source datastores (single or multiple) for the team | Athena |
6. | Enrichment Datastores | Add and grant access to additional enrichment datastores (single or multiple) for the team | Bank Enrichment |
Step 3: Click on the Save button to save your team.
After clicking on the Save button, your team is created, and a success message will appear saying, Team successfully created.
Directory Sync
Directory Sync, also known as User and Group Provisioning, automates the synchronization of users and groups between your identity provider (IDP) and the Qualytics platform. This ensures that your user data is consistent across all systems, improving security and reducing the need for manual updates.
Directory Sync Overview
Directory Sync automates the management of users and groups by synchronizing information between an identity provider (IDP) and your application. This ensures that access permissions, user attributes, and group memberships are consistently managed across platforms, eliminating the need for manual updates.
How Directory Sync Works with SCIM
SCIM is an open standard protocol designed to simplify the exchange of user identity information. When integrated with Directory Sync, SCIM automates the creation, updating, and de-provisioning of users and groups. SCIM communicates securely between the IDP and your platform’s API using OAuth tokens to ensure only authorized actions are performed.
General Setup Requirements
To set up Directory Sync, the following are required:
- Administrative access to both the identity provider and Qualytics platform
- A SCIM-enabled identity provider or custom integration
- The OAuth client set up in your IDP
- SCIM URL and OAuth Bearer Token generated from the Qualytics platform
Getting Started
Prerequisites for Setting Up Directory Sync
Before setting up Directory Sync, ensure you have the following:
- A SCIM-supported identity provider
- Administrative privileges for both your IDP and Qualytics
- A SCIM URL and OAuth Bearer Token, which will be generated from your Qualytics instance
Quick Start Guide
- Set up an OAuth client in your IDP.
- Configure the SCIM endpoints with the SCIM URL and OAuth Bearer Token.
- Assign users and groups to provision in the IDP.
- Monitor the synchronization to ensure proper operation.
What is SCIM?
SCIM is a standardized protocol used to automate the exchange of user identity information between IDPs and service providers. Its goal is to simplify the process of user provisioning and management.
SCIM improves efficiency by automating user lifecycle management (creation, updating, and de-provisioning) and ensures that data remains consistent across platforms. It also enhances security by minimizing manual errors and ensuring proper access control.
SCIM includes endpoints that are configured within your IDP and your platform. It uses OAuth tokens for secure communication between the IDP and the Qualytics API, ensuring that only authorized users can manage identity data.
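For illustration, the sketch below calls the standard SCIM 2.0 Users endpoint exposed by a Qualytics instance with a Bearer token. The base URL is a placeholder for your own instance, and the token is one generated from the Tokens tab; this is only a quick way to confirm that the endpoint and token work, not a step the IDP integrations require you to perform manually.

```python
# A minimal sketch of a SCIM 2.0 request against a Qualytics instance, assuming
# the standard /Users endpoint defined by RFC 7644. The base URL is a
# placeholder for your own instance; the token is generated in Qualytics.
import requests

SCIM_BASE_URL = "https://your-instance.qualytics.io/api/scim/v2"  # placeholder
TOKEN = "YOUR_BEARER_TOKEN"

resp = requests.get(
    f"{SCIM_BASE_URL}/Users",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/scim+json",
    },
    timeout=30,
)
resp.raise_for_status()
for user in resp.json().get("Resources", []):  # SCIM list responses use "Resources"
    print(user.get("userName"), "active" if user.get("active") else "inactive")
```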
Benefits of Using SCIM for User and Group Provisioning
By leveraging SCIM (System for Cross-domain Identity Management), Directory Sync simplifies user management with:
- Automated user provisioning and de-provisioning
- Reduced manual intervention, improving efficiency and security
- Real-time updates of user data, ensuring accuracy and compliance
- Support for scaling user management across organizations of any size
Supported Providers
Our API supports SCIM 2.0 (System for Cross-domain Identity Management) as defined in RFC 7643 and RFC 7644. It is designed to ensure seamless integration with any SCIM-compliant identity management system, supporting standardized user provisioning, de-provisioning, and lifecycle management. Additionally, we have verified support with the following providers:
- Microsoft Entra (Azure Active Directory)
- Okta
- OneLogin
- JumpCloud
Unsupported Providers
We do not support Google Workspace, as it does not offer SCIM support. Organizations using Google Workspace must use alternate methods for user provisioning.
Providers
1. Microsoft Entra
Creating an App Registration
Step 1: Log in to the Microsoft Azure Portal, and select “Microsoft Entra ID” from the main menu.
Step 2: Click on “Enterprise Applications” from the left navigation menu.
Step 3: If your application is already created, choose it from the list and move to the section Configuring SCIM Endpoints. If you haven't created your application yet, click on the New Application button.
Step 4: Click on the “Create your own application” button to create your application.
Step 5: Give your application a name (e.g., "Qualytics OAuth Client" or "Qualytics SCIM Client").
Step 6: After entering the name for your application, click the Create button to finalize the creation of your app.
Configuring SCIM Endpoints
Step 1: Click on Provisioning from the left-hand menu.
Step 2: A new window will appear, click on the Get Started button.
Step 3: In the Provisioning Mode dropdown, select “Automatic” and enter the following details in the Admin Credentials section:
-
Provisioning Mode: Select Automatic.
-
Tenant URL:
https://fhlbny.qualytics.io/api/scim/v2
-
Secret Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 4: Click on the Test Connection button to test the connection to see if the credentials are correct.
Step 5: Expand the Mappings section and enable group and user attribute mappings for your app. The default mappings should work.
Step 6: Expand the Settings section and make the following changes:
- Select Sync only assigned users and groups from the Scope dropdown.
- Confirm the Provisioning Status is set to On.
Step 7: Click on the Save button to save the credentials. You have now successfully configured the Microsoft Entra ID SCIM API integration.
Assigning Users and Groups for Provisioning
Step 1: Click on the Users and groups from the left navigation menu and then click Add user/group.
Step 2: Click on the None Selected under the Users and Groups.
Step 3: From the right side of the screen, select the users and groups you want to assign to the app.
Step 4: Once you have selected the groups and users for your app, click the “Select” button.
Step 5: Click on the Assign button to assign the users and groups to the application.
Warning
When you assign a group to an application, only users directly in the group will have access. The assignment does not cascade to nested groups.
2. Okta
Setting up the OAuth Client in Okta
Step 1: Log in to your Okta account using your administrator credentials. From the left-hand navigation menu, click Applications, then select Browse App Catalog.
Step 2: In the search bar, type SCIM 2.0 Test App (OAuth Bearer Token), and select the app called SCIM 2.0 Test App (OAuth Bearer Token) from the search results.
Step 3: On the app’s details page, click Add Integration.
Step 4: Enter a name for your application (e.g., "Qualytics SCIM Client").
Step 5: Click on the Next button.
Configuring SCIM Endpoints
Step 1: In the newly created app, go to the Provisioning tab and click Configure API Integration.
Step 2: Check the box labeled Enable API Integration, and enter the following details:
-
SCIM 2.0 Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
OAuth Bearer Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 3: Click Test API Credentials to verify the connection. Once the credentials are validated, click Save.
Step 4: A new settings page will appear. Under the To App section, enable the following settings:
- Create Users
- Update User Attributes
- Deactivate Users
After enabling these settings, your Okta SCIM API integration is successfully configured.
Assigning users for provisioning
Step 1: Click the Assignments tab and select Assign to People from the dropdown Assign.
Step 2: Select the users you want to assign to the app and click the Assign button.
Step 3: After you click the Assign button, you'll see a new popup window with various fields. Confirm the field values and click the Save and Go Back buttons.
Assigning groups for provisioning
Step 1: Navigate to the Push Groups tab and select Find group by name from the Push Groups dropdown.
Step 2: Search for the group you want to assign to the app.
Step 3: After selecting the group, click on the Save button.
3. OneLogin
Setting up the OAuth Client in OneLogin
Step 1: Log in to your OneLogin account using your administrator credentials. From the top navigation menu, click Applications, then select Add App.
Step 2: In the search bar, type SCIM and select the app called SCIM Provisioner with SAML (SCIM V2 Enterprise) from the list of apps.
Step 3: Enter a name for your app, then click Save. You have successfully created the SCIM app in OneLogin.
Configuring SCIM Endpoints
Step 1: In your created application, navigate to the Configuration tab on the left and enter the following information:
-
API Status: Enable the API status for the integration to work properly.
-
SCIM Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
SCIM Bearer Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 2: Click on the Save button to store the credentials.
Step 3: Navigate to the Provisioning tab, and check the box labeled Enable Provisioning.
Step 4: Click on Save to apply the changes.
Step 5: Navigate to the Parameters tab and select the row for Groups.
Step 6: A popup window will appear. Check the box Include in User Provisioning, then click the Save button.
Assigning Users for Provisioning
Step 1: To assign users to your app, go to Users from the top navigation menu, and select the user you want to assign to the app.
Step 2: From the User page, click the Applications tab on the left, and click the + (plus) sign.
Step 3: A popup window will show a list of apps. Select the app you created earlier and click Continue.
Step 4: A new modal window will appear, click on the Save to confirm the assignment.
Step 5: If you see the status Pending in the table, click that text. A modal window will appear, where you can click Approve to confirm the assignment.
Assigning Groups for Provisioning
Step 1: To push groups to your app, go to the top navigation menu, click Users, select Roles from the dropdown, and click New Role to create the role.
Step 2: Enter a name for the role and select the app you created earlier.
Step 3: Click on the “Save” button.
Step 4: Click the Users tab for the role and search for the user you want to assign to the role.
Step 5: Click the Add To Role button to assign the user, then click Save to confirm the assignment.
Step 6: A modal window will appear, click on the “Save” button to confirm the assignment.
Step 7: Go back to your app and click the Rule tab on the left and click the Add Rule button.
Give the rule a name. Under Actions, select Set Groups in your-app-name from the dropdown, then select each role with values that match your-app-name.
Step 8: Click on the Save button.
Step 9: Click on the Users tab on the left, you may see Pending under the provisions state. Click on it to approve the assignment.
Step 10: A modal window will appear, click on the Approve to finalize the assignment.
4. JumpCloud
Configuring SCIM Endpoints
JumpCloud supports SCIM provisioning within an existing SAML application. Follow these steps to configure SCIM provisioning:
Step 1: Log in to JumpCloud and either choose an existing SAML application or create a new one. From the left navigation menu, click SSO and select your Custom SAML App.
Step 2: Click on the tab Identity Management within your SAML application.
Step 3: Under SCIM Version, choose SCIM 2.0 and enter the following information:
-
Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
Token Key: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
-
Test User Email
Step 4: Click Test Connection to ensure the credentials are correct, then click Activate to enable SCIM provisioning.
Step 5: Click Save to store your settings. Once saved, SCIM provisioning is successfully configured for your JumpCloud SAML application.
Assigning Users for Provisioning
Step 1: Click the tab User Groups within your SAML application. You can see all the available groups, select the groups you want to sync, and click Save.
If no existing groups are available, click User Groups from the left navigation menu and click on the plus (+) icon to create a new group.
Step 2: Select the Users tab and choose the users you want to assign to the group.
Step 3: Select the Applications tab and choose the app you want to assign the group to.
Manage Users
You can easily manage users by assigning roles, teams, and deactivating users who are not active. This ensures that access control is streamlined, security is maintained, and only active users have access to resources.
The Security section, visible only to Admins, allows for granting and revoking permissions for Member users.
Access controls in Qualytics are assigned at the datastore level. A non-administrator user (Member) can have one of three levels of access to any datastore connected to Qualytics:
-
Write: Allows the user to perform operations on and manage the datastore’s metadata.
-
Read: Allows the user to view and report on the datastore.
-
None: The datastore is not visible or accessible to the user.
Note
Permissions are assigned to Teams rather than directly to users. Users inherit the permissions of the teams to which they are assigned.
All users are part of the default Public team, which provides access to all Public Datastores. Admins can create and manage additional teams, assigning both users and datastores to them. When a datastore is assigned to a team, the team is granted either Read or Write access, and all team members inherit this permission.
View Users
Whenever new users are added to the system, they will appear in the Users list. Click the Users tab to view the list of users.
Edit Users
You can edit user details to update their role and team assignments, ensuring their access and team information are current and accurate.
Step 1: Click the vertical ellipsis (⋮) next to the user name that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the user details as needed, including:
- Updating their role
- Assigning them additional teams
Note
All users are in the Public team by default, and this cannot be changed. If users should have no default access to any datastore, then no datastores should be assigned to the Public team.
Step 3: Once you have made the necessary changes, then click on the Save button.
After clicking the Save button, your changes will be updated, and a success message will display saying User successfully updated.
Deactivate Users
You can deactivate users to revoke their access to the system while retaining their account information for future reactivation if needed.
Step 1: Click the vertical ellipsis (⋮) next to the user name that you want to deactivate, then click on Deactivate from the dropdown menu.
Step 2: A modal window Deactivate User will appear.
Step 3: Enter deactivate in the given field (confirmation check) and then click on the I’M SURE, DEACTIVATE THIS USER button to deactivate the user.
Sort Users
You can sort users by various criteria, such as Created date, Name, Role, and Teams, to easily manage and organize user information.
Filter Users
You can filter the users by their roles and team, to quickly find and manage particular groups of users.
Manage Teams
You can manage teams by editing their permissions, adding or removing users, and adjusting access to source and enrichment datastores. If a team is no longer needed, you can delete it from the system. This ensures that team configurations are always up-to-date and relevant, enhancing overall data management and security.
View Team
Whenever new teams are added to the system, they will appear in the Teams list. Click the Teams tab to view the list of teams.
Edit Team
You can edit a team to update its permissions, name, manage users within the team, and adjust access to source and enrichment datastores, ensuring the team's configuration is current and effective.
Note
The name and users of the Public team cannot be edited.
Step 1: Click on the vertical ellipsis (⋮) next to the team name that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the team details as needed, including updating their permissions, users, source, and enrichment datastores.
Step 3: Once you have made the necessary changes, click on the Save button.
After clicking the Save button, your team is updated and a success message will be displayed saying Team successfully updated.
Delete Team
You can delete a team from the system when it is no longer needed, removing its access and permissions to streamline management and maintain security.
Step 1: Click the vertical ellipsis (⋮) next to the team name that you want to delete, then click on Delete from the dropdown menu.
A modal window Delete Team will appear.
Step 2: Click on the Delete button to delete the team from the system.
Sort Team
You can sort teams by various criteria, such as name or creation date, to easily organize and manage team information.
Tokens
A token is a secure way to access the Qualytics API instead of using a password. Each user gets a unique Personal API Token (PAT) for authentication. These tokens are created only once, so you need to copy and store them safely because you'll use them to log in and interact with the platform in the future.
Let’s get Started 🚀
Navigation to Tokens
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Tokens tab.
Generate Token
Generating a token provides a secure method for authenticating and interacting with your platform, ensuring that only authorized users and applications can access your resources. Personal Access Tokens (PATs) are particularly useful for automated tools and scripts, allowing them to perform tasks without needing manual intervention. By using PATs, you can leverage our Qualytics CLI to streamline data management and operations, making your workflows more efficient and secure.
Step 1: Click on the Generate Token button located in the top right corner.
A modal window will appear providing the options for generating the token.
Step 2: Enter the following values:
- Name: Enter a name for the token (e.g., DataAccessToken)
- Expiration: Set the expiration period for the token (e.g., 30 days)
Step 3: Once you have entered the values, click on the Generate button.
Step 4: After clicking on the Generate button, your token is successfully generated.
Warning
Make sure to copy your secret key as you won't be able to see it again. Keep your secret keys confidential and avoid sharing them with anyone. Use a password manager or an encrypted vault to store your secret keys.
Revoke Token
You can revoke your token to prevent unauthorized access or actions, especially if the token has been compromised, is no longer needed, or to enhance security by limiting the duration of access.
Step 1: Click the vertical ellipsis (⋮) next to the user token that you want to revoke, then click on Revoke from the dropdown menu.
Step 2: After clicking the Revoke button, your user token will be successfully revoked. A success message will display saying User token successfully revoked. Following revocation, the token's status color will change from green to orange.
Restore Token
You can restore a token to reactivate its access, allowing authorized use again. This is useful if the token was mistakenly revoked or if access needs to be temporarily re-enabled without generating a new token.
Step 1: Click the vertical ellipsis (⋮) next to the revoked token that you want to restore, then click on the Restore button from the dropdown menu.
Step 2: After clicking on the Restore button, your secret token will be restored and a confirmation message will display saying User token successfully restored.
Delete Token
You can delete a token to permanently remove its access, ensuring it cannot be used again. This is important for maintaining security when a token is no longer needed, has been compromised, or to clean up unused tokens in your system.
Note
You can only delete revoked tokens, not active tokens. If you want to delete an active token, you must first revoke it before you can delete it.
Step 1: Click the vertical ellipsis (⋮) next to the revoked token that you want to delete, then click on the Delete button from the dropdown menu.
After clicking the delete button, a confirmation modal window Delete Token will appear.
Step 2: Click on the Delete button to delete the token.
After clicking on the Delete button, your token will be deleted and a confirmation message will display saying User token successfully deleted.
Health
System Health provides a real-time overview of your system's resources, essential for monitoring performance and diagnosing potential issues. It provides key indicators and status updates to help you maintain system health and quickly address potential issues.
Navigation to Health
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: You will be directed to the Settings page; then click on the Health section.
Summary Section
The Summary section displays the current platform version, along with the database status and RabbitMQ state.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1 | Current Platform Version | Shows the current version of your platform's core software. | 20240808-3019c60 |
2 | Database | Verifies your database connection. An "OK" status means it’s connected. | Status:OK |
3 | RabbitMQ | Confirms RabbitMQ (a message broker software) is running correctly with an "OK" state. | State:OK |
Health Indicator
The health status indicator reflects the overall health of system resources. For example, in the image below, a green checkmark indicates that our system resources are healthy.
Note
Status indicators are simple: a green checkmark indicates "Healthy," and a red exclamation mark means "Critical."
Analytics Engine
The Analytics Engine section provides advanced information about the analytics engine's configuration and current state for technical users and developers.
REF | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1 | Build Date | This shows the date and time when the Analytics Engine was built. | Aug 8 2024,7:39 AM (GMT+5:30) |
2 | Implementation Version | The version of the analytics engine implementation being used. | 2.0.0 |
3 | Max Executors | Maximum number of executors allocated for processing tasks. | 10 |
4 | Max Memory Per Executor | This shows the maximum amount of memory allocated to each executor. | 25000 MB |
5 | Driver Free Memory | The amount of free memory available for the driver, which manages the Spark application. | 968 MB |
6 | Spark Version | The version of Apache Spark that the Analytics Engine uses for processing. | 3.5.1 |
7 | Core Per Executor | This shows the number of CPU cores assigned to each executor. | 3 |
8 | Max Dataframe Size | The maximum size of dataframes that can be processed. | 50000 MB |
9 | Thread Pool State | Indicates the current state of the thread pool used for executing tasks. | [Running, parallelism = 3, size = 0, active = 0, running = 0, steals = 0, tasks = 0, submissions = 0] supporting 0 running operation with 0 queued requests |
Manage Health Summary
You can perform essential tasks such as copying the health summary, refreshing it, and restarting the analytics engine. These functionalities help maintain an up-to-date overview of system performance and ensure accurate analytics.
Copy Health Summary
The Copy Health Summary feature lets you duplicate all data from the Health Section for easy sharing or saving.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Copy Health Summary from the drop-down menu.
Step 2: After clicking on Copy Health Summary, a success message will display saying Copied.
Refresh Health Summary
The Refresh Health Summary option updates the Health Section with the latest data. This ensures that you see the most current performance metrics and system status.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Refresh Health Summary to update the latest data.
Restart Analytics Engine
The Restart Analytics Engine option restarts the analytics processing system. This helps resolve issues and ensures that analytics data is accurately processed.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Restart Analytics Engine from the drop-down menu.
Step 2: A modal window will pop up. Click the Restart button in this window to restart the analytics engine. Restarting the engine helps resolve any issues and ensures that your analytics data is up-to-date and accurately processed.
Step 3: After clicking on the Restart button, a success message will display saying Successfully triggered Analytics Engine restart.
Ended: Settings
Qualytics CLI ↵
Qualytics CLI
Qualytics CLI is a command-line tool designed to interact with the Qualytics API. With this tool, users can manage configurations, export and import checks, run operations and more.
You can check the latest version in Qualytics CLI.
Installation and Upgrading
You can install the Qualytics CLI via pip:
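A minimal example, assuming the CLI is published on PyPI as `qualytics-cli`:

```bash
pip install qualytics-cli
```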
You can upgrade the Qualytics CLI via pip:
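For example, assuming the same `qualytics-cli` package name:

```bash
pip install qualytics-cli --upgrade
```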
Usage
Help
To view available commands and their usage:
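For example (the standard `--help` flag lists the available commands; the exact output varies by version):

```bash
qualytics --help
```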
Initializing Configuration
To set up your Qualytics URL and token:
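A sketch using the options documented below; the `init` subcommand name is inferred from this section's title, and the URL and token values are placeholders:

```bash
qualytics init \
  --url "https://your-instance.qualytics.io" \
  --token "YOUR_PERSONAL_ACCESS_TOKEN"
```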
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--url` | TEXT | The URL of your Qualytics instance | None | Yes |
`--token` | TEXT | The personal access token for accessing Qualytics | None | Yes |
Display Configuration
To view the currently saved configuration:
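A sketch; the `show-config` subcommand name is an assumption, so confirm it with `qualytics --help`:

```bash
qualytics show-config
```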
Export Checks
To export checks to a file:
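A sketch using the options documented below (the `checks export` subcommand name is inferred from the `checks export-templates` command shown later; all values are placeholders):

```bash
qualytics checks export \
  --datastore 123 \
  --containers "1,2,3" \
  --tags "tag1,tag2" \
  --output "$HOME/.qualytics/data_checks.json"
```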
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--datastore` | INTEGER | Datastore ID | None | Yes |
`--containers` | List of INTEGER | Container IDs | None | No |
`--tags` | List of TEXT | Tag names | None | No |
`--output` | TEXT | Output file path | `$HOME/.qualytics/data_checks.json` | No |
Export Check Templates
To export check templates:
qualytics checks export-templates
--enrichment_datastore_id 123
[--check_templates "1, 2, 3" or "[1,2,3]"]
[--status `true` or `false`]
[--rules "afterDateTime, aggregationComparison" or "[afterDateTime, aggregationComparison]"]
[--tags "tag1, tag2, tag3" or "[tag1, tag2, tag3]"]
[--output "/home/user/.qualytics/data_checks_template.json"]
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--enrichment_datastore_id` | INTEGER | The ID of the enrichment datastore where check templates will be exported. | None | Yes |
`--check_templates` | TEXT | IDs of specific check templates to export (comma-separated or array-like). | None | No |
`--status` | BOOL | Check template status: send `true` if it is locked or `false` if it is unlocked. | None | No |
`--rules` | TEXT | Comma-separated list of check template rule types or array-like format. Example: "afterDateTime, aggregationComparison" or "[afterDateTime, aggregationComparison]". | None | No |
`--tags` | TEXT | Comma-separated list of tag names or array-like format. Example: "tag1, tag2, tag3" or "[tag1, tag2, tag3]". | None | No |
`--output` | TEXT | Output file path (example: `/home/user/.qualytics/data_checks_template.json`). | None | No |
Import Checks
To import checks from a file:
import qualytics.qualytics as qualytics
TARGET_DATASTORE_ID = 1172
qualytics.checks_import(
datastore=TARGET_DATASTORE_ID,
input_file="/home/user/.qualytics/data_checks.json"
)
Quality check id: 195646 for container: CUSTOMER created successfully
Quality check id: 195647 for container: CUSTOMER created successfully
Quality check id: 195648 for container: CUSTOMER created successfully
Quality check id: 195649 for container: CUSTOMER created successfully
Quality check id: 195650 for container: CUSTOMER created successfully
Quality check id: 195651 for container: CUSTOMER created successfully
Quality check id: 195652 for container: CUSTOMER created successfully
Quality check id: 195653 for container: CUSTOMER created successfully
Quality check id: 195654 for container: CUSTOMER created successfully
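The example above uses the Python module directly. A CLI sketch using the options documented below (the `checks import` subcommand name is an assumption; values are placeholders):

```bash
qualytics checks import \
  --datastore 1172 \
  --input "$HOME/.qualytics/data_checks.json"
```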
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--datastore` | TEXT | Datastore IDs to import checks into (comma-separated or array-like). | None | Yes |
`--input` | TEXT | Input file path | `$HOME/.qualytics/data_checks.json` | No |
Note: Errors during import will be logged in `$HOME/.qualytics/errors.log`.
Run a Catalog Operation on a Datastore
Allows you to trigger a catalog operation on any current datastore (datastore permission granted by an Admin is required).
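A sketch using the options documented below; the `run catalog` subcommand follows the pattern of the `run profile` and `run scan` commands shown later, and the values are placeholders:

```bash
qualytics run catalog \
  --datastore "1,2,3" \
  --include "table,view" \
  --prune \
  --recreate \
  --background
```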
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--include` | TEXT | Comma-separated list of include types or array-like format. Example: "table,view" or "[table,view]" | No |
`--prune` | BOOL | Prune the operation. Omit this flag if you want prune to be false | No |
`--recreate` | BOOL | Recreate the operation. Omit this flag if you want recreate to be false | No |
`--background` | BOOL | Starts the catalog operation but does not wait for it to finish | No |
Run a Profile Operation on a Datastore
Allows you to trigger a profile operation on any current datastore (datastore permission granted by an Admin is required).
qualytics run profile
--datastore "DATSTORE_ID_LIST"
--container_names "CONTAINER_NAMES_LIST"
--container_tags "CONTAINER_TAGS_LIST"
--infer_constraints
--max_records_analyzed_per_partition "MAX_RECORDS_ANALYZED_PER_PARTITION"
--max_count_testing_sample "MAX_COUNT_TESTING_SAMPLE"
--percent_testing_threshold "PERCENT_TESTING_THRESHOLD"
--high_correlation_threshold "HIGH_CORRELATION_THRESHOLD"
--greater_than_time "GREATER_THAN_TIME"
--greater_than_batch "GREATER_THAN_BATCH"
--histogram_max_distinct_values "HISTOGRAM_MAX_DISTINCT_VALUES"
--background
import qualytics.qualytics as qualytics
DATASTORE_ID = "844"
CONTAINER_NAMES = "CUSTOMER, NATION"
qualytics.profile_operation(
datastores=DATASTORE_ID,
container_names=CONTAINER_NAMES,
container_tags=None,
infer_constraints=True,
max_records_analyzed_per_partition=None,
max_count_testing_sample=None,
percent_testing_threshold=None,
high_correlation_threshold=None,
greater_than_time=None,
greater_than_batch=None,
histogram_max_distinct_values=None,
background=False
)
Successfully Started Profile 29466 for datastore: 844
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Successfully Finished Profile operation 29466 for datastore: 844
Processing... ---------------------------------------- 100% 0:00:46
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--container_names` | TEXT | Comma-separated list of container names or array-like format. Example: "container1,container2" or "[container1,container2]" | No |
`--container_tags` | TEXT | Comma-separated list of container tags or array-like format. Example: "tag1,tag2" or "[tag1,tag2]" | No |
`--infer_constraints` | BOOL | Infer quality checks during profiling. Omit this flag if you want infer_constraints to be false | No |
`--max_records_analyzed_per_partition` | INT | Maximum number of records analyzed per partition | No |
`--max_count_testing_sample` | INT | The number of records accumulated during profiling for validation of inferred checks. Capped at 100,000 | No |
`--percent_testing_threshold` | FLOAT | Percent of testing threshold | No |
`--high_correlation_threshold` | FLOAT | Correlation threshold value | No |
`--greater_than_time` | DATETIME | Only include rows where the incremental field's value is greater than this time. Use one of these formats: %Y-%m-%dT%H:%M:%S or %Y-%m-%d %H:%M:%S | No |
`--greater_than_batch` | FLOAT | Only include rows where the incremental field's value is greater than this number | No |
`--histogram_max_distinct_values` | INT | Maximum number of distinct values in the histogram | No |
`--background` | BOOL | Starts the profile operation but does not wait for it to finish | No |
Run a Scan Operation on a Datastore
Allows you to trigger a scan operation on a datastore (datastore permission granted by an Admin is required).
qualytics run scan
--datastore "DATSTORE_ID_LIST"
--container_names "CONTAINER_NAMES_LIST"
--container_tags "CONTAINER_TAGS_LIST"
--incremental
--remediation
--max_records_analyzed_per_partition "MAX_RECORDS_ANALYZED_PER_PARTITION"
--enrichment_source_record_limit "ENRICHMENT_SOURCE_RECORD_LIMIT"
--greater_than_date "GREATER_THAN_DATE"
--greater_than_batch "GREATER_THAN_BATCH"
--background
import qualytics.qualytics as qualytics
DATASTORE_ID = 1172
CONTAINER_NAMES = "CUSTOMER, NATION"
qualytics.scan_operation(
datastores=str(DATASTORE_ID),
container_names=None,
container_tags=None,
incremental=False,
remediation="none",
enrichment_source_record_limit=10,
greater_than_batch=None,
greater_than_time=None,
max_records_analyzed_per_partition=10000,
background=False
)
Successfully Started Scan 29467 for datastore: 1172
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Successfully Finished Scan operation 29467 for datastore: 1172
Processing... ---------------------------------------- 100% 0:03:04
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--container_names` | TEXT | Comma-separated list of container names or array-like format. Example: "container1,container2" or "[container1,container2]" | No |
`--container_tags` | TEXT | Comma-separated list of container tags or array-like format. Example: "tag1,tag2" or "[tag1,tag2]" | No |
`--incremental` | BOOL | Process only records that are new or updated since the last incremental scan | No |
`--remediation` | TEXT | Replication strategy for source tables in the enrichment datastore. Either 'append', 'overwrite', or 'none' | No |
`--max_records_analyzed_per_partition` | INT | Maximum number of records analyzed per partition. Value must be greater than or equal to 0 | No |
`--enrichment_source_record_limit` | INT | Limit of enrichment source records. Value must be greater than or equal to -1 | No |
`--greater_than_date` | DATETIME | Only include rows where the incremental field's value is greater than this time. Use one of these formats: %Y-%m-%dT%H:%M:%S or %Y-%m-%d %H:%M:%S | No |
`--greater_than_batch` | FLOAT | Only include rows where the incremental field's value is greater than this number | No |
`--background` | BOOL | Starts the scan operation but does not wait for it to finish | No |
Note: Errors during any of the three operations will be logged in `$HOME/.qualytics/operation-error.log`.
Check Operation Status
To check the status of operations:
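A sketch using the option documented below; the subcommand name is illustrative only, so confirm the exact command with `qualytics --help`:

```bash
# "operation check_status" is an assumed subcommand name
qualytics operation check_status --ids "1,2,3"
```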
Options:
Option | Type | Description | Required |
---|---|---|---|
`--ids` | TEXT | Comma-separated list of Operation IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
Ended: Qualytics CLI
FAQ ↵
Quality Scores
Quality Scores are quantified measures of data quality calculated at the field and container levels, recorded as time-series to enable tracking of changes over time. Scores range from 0 to 100, with higher values indicating superior quality. These scores integrate eight distinct factors, providing a granular analysis of the attributes that impact overall data quality.
Quality Scoring a Field
Each field receives a total quality score based on eight key factors, each evaluated on a 0-100 scale. The overall score is a composite reflecting the relative importance and configured weights of these factors:
- Completeness: Measures the average completeness of a field across all profiles.
- Coverage: Assesses the adequacy of data quality checks for the field.
- Conformity: Checks alignment with standards defined by quality checks.
- Consistency: Ensures uniformity in type and scale across all data representations.
- Precision: Evaluates the resolution of field values against defined quality checks.
- Timeliness: Gauges data availability according to schedule, inheriting the container's timeliness.
- Volumetrics: Analyzes consistency in data size and shape over time, inheriting the container's volumetrics.
- Accuracy: Determines the fidelity of field values to their real-world counterparts.
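As an illustrative sketch only (the actual aggregation and weights are configured within the platform), the composite score can be thought of as a weighted average of the eight factor scores:

$$
\text{score} = \frac{\sum_{i=1}^{8} w_i \, f_i}{\sum_{i=1}^{8} w_i}, \qquad f_i \in [0, 100]
$$

where $f_i$ is the score for factor $i$ and $w_i$ is its configured weight.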
Quality Scoring a Container
A container is any structured data entity, such as a table or dataframe, that comprises multiple fields. Containers are scored using the same eight factors, with each factor's score derived from a weighted average across its fields. Additional container-specific metrics also influence the total quality score:
- Shape anomaly adjustments
- Volumetric checks
- Scanning frequency
- Profiling frequency
- Timeliness assessments through Freshness SLA
- Impact of Freshness SLA violations
Customizing Quality Score Weights and Decay Time
You can tailor the impact of each quality factor on the total score by adjusting their weights, allowing the scoring system to align with your organization’s data governance priorities. Additionally, the decay period for considering past data events defaults to 180 days but can be customized to fit your operational needs, ensuring the scores reflect the most relevant data quality insights.
Factor Impacting Rule Types
Specific check rule types are considered for factor score calculations at the field level for the following factors.
Conformity Rule Types
RuleType.matchesPattern
RuleType.minLength
RuleType.maxLength
RuleType.isReplicaOf
RuleType.isType
RuleType.entityResolution
RuleType.expectedSchema
RuleType.fieldCount
RuleType.isCreditCard
RuleType.isAddress
RuleType.containsCreditCard
RuleType.containsUrl
RuleType.containsEmail
RuleType.containsSocialSecurityNumber
RuleType.minPartitionSize
RuleType.maxPartitionSize
Precision Rule Types
RuleType.afterDateTime
RuleType.beforeDateTime
RuleType.between
RuleType.betweenTimes
RuleType.equalTo
RuleType.equalToField
RuleType.greaterThan
RuleType.greaterThanField
RuleType.lessThan
RuleType.lessThanField
RuleType.maxValue
RuleType.minValue
RuleType.notFuture
RuleType.notNegative
RuleType.positive
RuleType.predictedBy
RuleType.sum
Volumetric Rule Types
Printing This Guide
This guide changes often, so we can't recommend printing it or saving it offline. However, we recognize that there are circumstances beyond some users' control that make doing so more convenient. Thus, we are providing this link to our userguide as a single page appropriate for saving as a PDF. If you find this useful, please let us know.
Ended: FAQ
Misc ↵
SSO (Single Sign-On)
SSO for PaaS Deployments
Qualytics platform harnesses the power of Auth0's Single Sign-On (SSO) technology to create a frictionless authentication journey for our PaaS users. Once users have successfully logged in to Qualytics, they can conveniently access all linked external applications and services without the need for additional sign-ins. Depending on the application and its compatibility with federated SSO protocols such as SAML, OIDC, or any proprietary authentication methods, Qualytics, with the help of Auth0, establishes a secure connection for user authentication. In essence, SSO allows one central domain to authenticate and then share the session across various other domains. The method of sharing may vary between SSO protocols, but the principle remains constant.
Through Auth0's Integration Network (OIN), Qualytics extends SSO access to an extensive range of supported cloud-based applications. These integrations can utilize OpenID Connect (OIDC), SAML, SWA, or proprietary APIs for SSO. Maintenance of SSO protocols and provisioning APIs is reliably managed by Auth0.
In addition to this, Qualytics also leverages Auth0's capabilities to provide SSO integrations for on-premises web-based applications. You have the option to integrate these applications via SWA or SAML toolkits. In addition, Auth0 supports user provisioning and deprovisioning with applications that publicly offer their provisioning APIs.
Further enhancing our SSO integrations, Qualytics provides seamless access to mobile applications. Whether they are web applications optimized for mobile devices, native iOS apps, or Android apps, users can access web app integrations in the OIN using SSO from any mobile device. These mobile web apps can employ industry-standard OIDC, SAML, or Auth0 SWA technologies. To illustrate, Qualytics, in conjunction with Auth0, can integrate with native applications such as Box Mobile using SAML for registration and OAuth for continuous use.
Auth0 supports the following enterprise providers out of the box: - OAuth2 - Active Directory/LDAP
SSO for On-Premise Deployments
In addition to the option of leveraging our robust Auth0 support for federated authentication, customer-managed deployments can choose to integrate directly with their IdP (Identity Provider, such as Active Directory, ForgeRock, etc.) using OpenID Connect (OIDC). Once configured for direct federated authentication using OIDC, the customer's own user login requirements fully govern the authentication process, supporting a fully air-gapped deployment of Qualytics with no egress required for operations.
Deployment options
Overview
The following two deployment options are supported for the Qualytics platform:
- Platform as a Service Deployment: to a single-tenant virtual private cloud (VPC) provisioned by Qualytics on infrastructure that Qualytics manages
- On-Premises Deployment: to a CNCF compliant kubernetes control plane on Customer managed infrastructure
Platform as a Service (PaaS) Deployment:
Depending on Customer’s cloud infrastructure, this option uses one of the following:
- EKS (Elastic Kubernetes Service)
- AKS (Azure Kubernetes Service)
- GKE (Google Kubernetes Engine)
- Oracle OKE (Oracle's Container Engine for Kubernetes)
The Qualytics platform is deployed to a single-tenant virtual private cloud provisioned by Qualytics, with the provider and in the region of the Customer’s choosing. This VPC is not shared (single-tenant) and contains a single Customer Qualytics deployment. This model requires that the provisioned VPC have the ability to access the Customer’s datastore(s). In the case of publicly routable datastores such as Snowflake or S3, no extra configuration is required. In the case of private datastore(s) with no public IP address or route, the hosted VPC will require private routing using PrivateLink, Transit Gateway peering, point-to-point VPN, or similar support to enable network access to that private datastore.
Considerations: This is Qualytics’ preferred model of deployment. In this model, Qualytics is fully responsible for the provisioning and operation of the Qualytics platform. The Customer is only responsible for granting the Qualytics platform the necessary access.
On-Premises Deployment:
This option supports deployments to any Kubernetes control plane that meets the following system requirements:
- A Kubernetes version that is officially supported (still receiving patches), running on any CNCF compliant control plane
- A minimum 16 cores and 80 gigabytes of memory available for workload allocation
- A Customer-resolvable fully-qualified domain name assigned for the HTTPS ingress to the Qualytics UI
- (optional) Grant Qualytics an admin-level ServiceAccount to the cluster for pushing automated updates
This option requires that the kubernetes nodes supporting Qualytics’ analytics engine have the ability to access Customer’s datastore(s). Because Customer hosts the Qualytics deployment, Customer is solely responsible for ensuring the necessary network configuration and support.
Considerations: This option supports organizations that, due to regulatory or other restrictions, cannot permit READ access to their datastore(s) from a third-party hosted product. This model requires the Customer to manage and operate the appropriate infrastructure and ensure it is granted all necessary access to the targeted datastore(s). For deployments to supported commercial Kubernetes control planes (EKS, AKS, GKE, OKE) and at the Customer’s discretion, Qualytics will provision the deployment and transfer ownership of the applicable infrastructure to the Customer. Otherwise, the Customer shall be responsible for both the provisioning of a cluster meeting the requisite system requirements and the deployment of the Qualytics platform via the Qualytics provided Helm chart.
Installing helm for Qualytics single-tenant instance
Welcome to the Installation Guide for setting up Helm for your Qualytics Single-Tenant Instance.
Qualytics is a closed source container-native platform for assessing, monitoring, and ameliorating data quality for the Enterprise.
Learn more about our product and capabilities here.
Important Note for Deployment Type
Before proceeding with the installation of Helm for Qualytics Single-Tenant Instance, please note the following:
-
This installation guide is specifically designed for on-premises customers who manage their own infrastructure.
-
If you are a Qualytics Software as a Service (SaaS) customer, you do not need to perform this installation. The Helm setup is managed by Qualytics for SaaS deployments.
If you are unsure about your deployment type or have any questions, please reach out to your Qualytics account manager for clarification.
What is in this chart?
This chart will deploy a single-tenant instance of the Qualytics platform to a CNCF compliant Kubernetes control plane.
How should I use this chart?
Work with your account manager at Qualytics to securely obtain the appropriate values for your licensed deployment. If you don't yet have an account manager, please write us here to say hello!
At minimum, you will need credentials for our Docker Private Registry and a set of Auth0 secrets that will be used in the following steps.
1. Create a CNCF compliant cluster
Qualytics fully supports kubernetes clusters hosted in AWS, GCP, and Azure as well as any CNCF compliant control plane.
Node Requirements
Node(s) with the following labels must be made available:
appNodes=true
sparkNodes=true
Nodes with the `sparkNodes=true` label will be used for Spark jobs, and nodes with the `appNodes=true` label will be used for all other needs.
It is possible to provide a single node with both labels if that node provides sufficient resources to operate the entire cluster according to the specified chart values.
However, it is highly recommended to set up autoscaling for Apache Spark operations by providing a group of nodes with the `sparkNodes=true` label that will grow on demand.
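For example, on an existing cluster the labels can be applied with kubectl (the node names below are placeholders):

```bash
# label the node(s) reserved for application workloads
kubectl label nodes app-node-1 appNodes=true

# label the node group used for Spark workloads
kubectl label nodes spark-node-1 spark-node-2 sparkNodes=true
```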
 | Application Nodes | Spark Nodes |
---|---|---|
Label | `appNodes=true` | `sparkNodes=true` |
Scaling | Fixed (1 node on-demand pricing) | Autoscaling (1 - 21 nodes spot pricing) |
EKS | t3.2xlarge | r5d.2xlarge |
GKE | n2-standard-8 | c2d-highmem-8 |
AKS | Standard_D8_v5 | Standard_E8s_v5 |
Docker Registry Secrets
Execute the command below using the credentials supplied by your account manager as replacements for "your-name" and "your-pword". The secret created will provide access to Qualytics private registry and the required images that are available there.
kubectl create secret docker-registry regcred --docker-server=artifactory.qualytics.io/docker --docker-username=<your-name> --docker-password=<your-pword>
Important
If you are unable to directly connect your cluster to our image repository for technical or compliance reasons, then you can instead import our images into your preferred registry using these same credentials. You'll need to update the image URLs in the values.yaml file in the next step to point to your repository instead of ours.
2. Update values.yaml with appropriate values
Update values.yaml
according to your requirements. At minimum, the "secrets" section at the top should be updated with the Auth0 settings supplied by your Qualytics account manager.
auth0_audience: changeme-api
auth0_organization: org_changeme
auth0_spa_client_id: spa_client_id
auth0_client_id: m2m_client_id
auth0_client_secret: m2m_client_secret
auth0_user_client_id: m2m_user_client_id
auth0_user_client_secret: m2m_user_client_secret
Contact your Qualytics account manager for assistance.
3. Deploy Qualytics to your cluster
The following command will first ensure that all chart dependencies are available and then proceed with an installation of the Qualytics platform.
helm repo add qualytics https://qualytics.github.io/qualytics-helm-public
helm upgrade --install qualytics qualytics/qualytics --namespace qualytics --create-namespace -f values.yaml
As part of the install process, an nginx ingress will be configured with an inbound IP address. Make note of this IP address as it is needed for the fourth and final step!
4. Register your deployment's web application
Send your account manager the IP address for your cluster ingress gathered from step 3. Qualytics will assign a DNS record to it under *.qualytics.io
so that your end users can securely access the deployed web application from a URL such as https://acme.qualytics.io
Upgrade Qualytics Helm chart
Do you have the Qualytics Helm chart repository locally?
Make sure you have the Qualytics Helm chart repository in your local Helm repositories. Run the following command to add them:
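This is the same repository used during installation:

```bash
helm repo add qualytics https://qualytics.github.io/qualytics-helm-public
```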
Update Qualytics Helm Chart:
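Refresh your local index of the chart repositories so the latest chart versions are visible:

```bash
helm repo update
```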
Target Helm chart version?
The target Helm chart version must be higher than the current Helm chart version.
To see all available Helm chart versions of the specific product, run this command:
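For example, for the `qualytics/qualytics` chart referenced above:

```bash
helm search repo qualytics/qualytics --versions
```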
Upgrade Qualytics Helm Chart:
helm upgrade --install qualytics qualytics/qualytics --namespace qualytics --create-namespace -f values.yaml
Monitor Update Progress:
Monitor the progress of the update by running the following command:
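For example, assuming the `qualytics` namespace used during installation:

```bash
kubectl get pods --namespace qualytics --watch
```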
Watch the status of the pods in real-time. Ensure that the pods are successfully updated without any issues.
Verify Update
Once the update is complete, verify the deployment by checking the pods' status:
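For example:

```bash
kubectl get pods --namespace qualytics
```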
Ensure that all pods are running, indicating a successful update.
Can I run a fully "air-gapped" deployment?
Yes. The only egress requirement for a standard self-hosted Qualytics deployment is to https://auth.qualytics.io
which provides Auth0 powered federated authentication.
This is recommended for ease of installation and support, but not a strict requirement. If you have need of a fully private deployment with no access to the public internet, you can instead configure an OpenID Connect (OIDC) integration with your enterprise identity provider (IdP).
Simply contact your Qualytics account manager for more details.
Qualytics Scheduled Operations
Users may want to create their own scheduled operations in Qualytics for various reasons, such as automating routine tasks, data exports, or running specific operations at regular intervals. This guide will walk you through the process of creating a scheduled task.
On Linux machine
Prerequisites
Before proceeding, ensure that you have the following:
- Access to the terminal on your machine.
- The `curl` command-line tool installed.
- The desired Qualytics instance details, including the instance URL and authentication token.
Steps to Create a Scheduled Operation
1. Open the Crontab Editor
Run the following command in your terminal to open the crontab editor:
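For example:

```bash
# opens your user crontab in the default editor
crontab -e
```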
2. Add the Cron Job Entry
In the crontab editor, add the following line to execute the curl command at your specified schedule:
<cronjob-expression> /usr/bin/curl --request POST --url 'https://<your-instance>.qualytics.io/api/export/<operation>?datastore=<datastore-id>&containers=<container-id-one>&containers=<container-id-two>' --header 'Authorization: Bearer <your-token>' >> <path-to-show-logs> 2>&1
3. Example:
For example, to run the command every 5 minutes:
*/5 * * * * /usr/bin/curl --request POST --url 'https://your-instance.qualytics.io/api/export/anomalies?datastore=123&containers=14&containers=16' --header 'Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...' >> /path/to/show/logs.txt 2>&1
4. Verify or List Cron Jobs:
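To confirm the entry was saved, list your active cron jobs:

```bash
# lists the cron jobs configured for the current user
crontab -l
```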
Customize the placeholders based on your specific details and requirements. Save the crontab file to activate the scheduled operation.
On Windows machine
Prerequisites
Before proceeding, ensure that you have the following:
- Access to the PowerShell on your machine.
- The desired Qualytics instance details, including the instance URL and authentication token.
Steps to Create a Scheduled Operation
1. Open your text editor of your preference and add the script entry
In the text editor, add the following line to execute the Invoke-RestMethod
command:
Invoke-RestMethod -Method 'Post' -Uri 'https://<your-instance>/api/export/anomalies?datastore=<datastore-id>&containers=<container-id-one>&containers=<container-id-two>' -Headers @{'Authorization' = 'Bearer <your-token>'; 'Content-Type' = 'application/json'}
2. Example:
For example, to run the command every 5 minutes:
Invoke-RestMethod -Method 'Post' -Uri 'https://your-instance.qualytics.io/api/export/anomalies?datastore=123&containers=44&containers=22' -Headers @{'Authorization' = 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'; 'Content-Type' = 'application/json'}
Customize the placeholders based on your specific details and requirements. Save the script with the desired name and the `.ps1` extension.
3. Add the script to the Task Scheduler:
- Open Task Scheduler:
    - Press `Win + S` to open the Windows search bar.
    - Type "Task Scheduler" and select it from the search results.
- Create a Basic Task:
    - In the Task Scheduler window, click on `Create Basic Task...` on the right-hand side.
- Provide a Name and Description:
    - Enter a name and description for your task. Click `Next` to proceed.
- Choose Trigger:
    - Select when you want the task to start. Options include `Daily`, `Weekly`, or `At log on`.
    - Choose the one that fits your schedule. Click `Next`.
- Set the Start Date and Time:
    - If you selected a trigger that requires a specific start date and time, set it accordingly. Click `Next`.
- Choose Action:
    - Select `Start a program` as the action and click `Next`.
- Specify the Program/Script:
    - In the `Program/script` field, provide the path to the PowerShell executable (`powershell.exe`), typically located at `C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe`. Alternatively, you can just type `powershell.exe`.
    - In the `Add arguments (optional)` field, provide the path to your PowerShell script. For example: `-File "C:\Path\To\Your\GeneratedScript.ps1"`.
    - Click `Next`.
- Review Settings:
    - Review your task settings. If everything looks correct, click `Finish`.
- Finish:
    - You should now see your task listed in the Task Scheduler Library.
Installing Qualytics CLI:
Prerequisites
Before installing the Qualytics CLI, ensure you have the following prerequisites:
- Python: Make sure Python is installed on your machine. You can download Python from python.org.
- pip (Python Package Installer): Verify that pip is installed. It usually comes with Python installations.
- Qualytics Account: Obtain your Qualytics API access token.
Installation
Open a Terminal and Install Qualytics CLI:
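Assuming the CLI is published on PyPI as `qualytics-cli`:

```bash
pip install qualytics-cli
```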
Verify Installation:
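For example, by printing the CLI help (output varies by version):

```bash
qualytics --help
```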
Initialization:
To use the Qualytics CLI, initialize it with your Qualytics instance details and API token:
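A sketch with placeholder values (see the Initializing Configuration section above for the option details):

```bash
qualytics init \
  --url "https://your-instance.qualytics.io" \
  --token "YOUR_PERSONAL_ACCESS_TOKEN"
```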
Replace placeholders with your Qualytics instance URL and API token.
Automated Setup Using Qualytics CLI:
For Linux and Windows Users
Use the Qualytics CLI to schedule a task automatically.
qualytics schedule export-metadata --crontab "<cronjob-expression>" --datastore <datastore-id> --containers <container-ids> --options <metadata-options>
Replace placeholders as needed.
Behaviour on Linux:
The CLI will create the files inside your `home/user/.qualytics` folder.
The scheduled operation commands are located in `home/user/.qualytics/schedule-operation.txt`.
Log files for the options you selected are also created there, containing the output of each cronjob run.
A cronjob entry is created for you automatically; you can run `crontab -l` to list all cronjobs.
Behaviour on Windows:
The CLI will create the files inside your `home/user/.qualytics` folder.
The script files are located in `home/user/.qualytics` with the pattern `task_scheduler_script_<option-you-selected>_<datastore-number>.ps1`; you then just need to follow the steps above to register them in the Task Scheduler.
Explanation of Placeholders:
- `<cronjob-expression>`: Replace this with your desired cron expression. For example, `*/5 * * * *` means "every 5 minutes." You can check `crontab.guru` for more examples.
- `<your-instance>`: Replace with the actual Qualytics instance URL.
- `<operation>`: Replace with the specific operation (e.g., "anomalies", "checks" or "field-profiles").
- `<datastore-id>`: Replace with the ID of the target datastore.
- `<container-id-one>` and `<container-id-two>`: Replace with the IDs of the containers. You can add more containers as needed.
- `<container-ids>`: Comma-separated list of container IDs or array-like format. Example: "1, 2, 3" or "[1,2,3]".
- `<options>`: Comma-separated list of options to export, or `all` for everything. Example: anomalies, checks, field-profiles or all.
- `<your-token>`: Replace with the access token obtained from Qualytics (`Settings` -> `Security` -> `API Keys`).
- `<path-to-show-logs>`: Replace with the file path where you want to store the logs.
DFS multi-token filename globbing
Overview
Our data quality product offers a sophisticated feature that facilitates the organization and categorization of files on a distributed filesystem. This feature, known as Multi-Token Filename Globbing, enables the system to recursively scan files and intelligently group them based on shared filename conventions. It achieves this through a combination of filename pattern analysis and globbing techniques.
Process
- Delimiter Identification: The first step involves identifying a common delimiter in filenames, such as an underscore (_) or dash (-). This delimiter is used to split the filenames into tokens.
- Tokenization and Grouping: Once the filenames are tokenized, the system groups them based on shared tokens. This is achieved through a method called applyMultiTokenGlobbing.
- Glob Pattern Formation: The core of this feature lies in forming glob patterns that represent groups of files sharing a schema. These patterns are created using the tokens derived from the filenames.
Methodology
- Initial Token Grouping: The method begins by grouping filenames based on each token. It considers the number of tokens and processes each token index separately.
-
Left or Right Side Grouping Decision: The system decides whether to group tokens starting from the left side or the right side of the filename, based on the distribution of tokens.
-
Pattern Creation Logic:
-
For filenames with a single token, the system avoids globbing and keeps the filenames as they are.
- For multi-token filenames, the method constructs a container name (glob pattern) by iterating through each token.
-
At each token, the method decides whether to include the token as-is or replace it with a wildcard (*). This decision is based on several factors, such as:
- The uniqueness of the token in the context of other filenames.
- The nature of the token (e.g., all letters).
- The comparison of token counts in adjacent indexes.
-
Special Cases Handling: The method includes logic to handle special cases, such as all-letter tokens, tokens at the beginning or end of a filename, and unique tokens.
- Glob Pattern Optimization: Finally, the system optimizes the glob patterns, ensuring that each pattern uniquely represents a group of files with a shared schema. This is done by comparing new patterns with existing ones and updating them based on the latest file modifications.
Detailed Methodology: Multi-Token Filename Globbing
Step-by-Step Process
Delimiter Identification and Tokenization
The system identifies a common delimiter in the filenames, typically an underscore (_) or dash (-), and splits the filenames into tokens.
Token Grouping and Indexing
- Each token in a filename is indexed (0, 1, 2, ...).
- Filenames are grouped based on the value of tokens at each index.
Determining Grouping Strategy
- The system decides whether to group tokens from the left (start of filename) or right (end of filename) based on the distribution and variation of tokens at each index.
Pattern Creation Logic
- Single-Token Filenames: No globbing is applied to filenames with only one token.
- Multi-Token Filenames: The method constructs glob patterns by analyzing each token. It considers factors like token uniqueness, commonality, and special cases like all-letter tokens.
Uniqueness vs. Commonality:
- Unique tokens (unique in their position across all filenames) are replaced with a wildcard "*".
- Common tokens across many files are kept as they are in the pattern.
Special Considerations for All-Letter Tokens:
- Tokens composed entirely of letters are often grouped together, unless they are unique identifiers.
- Tokens at the start or end of a filename are treated with contextual logic, considering their potential roles (like identifiers or file types).
Adjacent Token Group Sizes:
The method compares the group sizes of adjacent tokens to determine if a token leads to a tighter grouping, influencing whether it's kept as literal or replaced with a wildcard.
Constructing Container Names (Glob Patterns)
-
For each token index, the method constructs a container name, deciding whether to include the token as-is or replace it with "*".
-
This decision is influenced by factors like the uniqueness of the token, the nature of the token (all letters or not), and the comparison of token counts in adjacent indexes.
Optimization and Finalization
- The system optimizes the glob patterns to ensure each pattern uniquely represents a group of files with a shared schema.
- It compares new patterns with existing ones and updates them based on the latest file modifications.
Example Scenarios
- Filename: `"project_data_2023_v1.csv"`
    - Potential Pattern: `"project_data_*_*.csv"` (if "2023" and "v1" vary across files).
- Filename: `"user_123_profile_2023-06-01.json"`
    - Potential Pattern: `"user_*_profile_*.json"` (if "123" and dates vary, and "user" and "profile" are consistent).
- Filename: `"log2023-06_error.txt"`
    - Potential Pattern: `"*_error.txt"` (if dates vary but "error" is a constant token).
Limitations
Context
While the Multi-Token Filename Globbing feature is a powerful tool for organizing files in distributed filesystems, including object storage systems like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage, it's important to understand the limitations of using glob patterns with wildcards in these environments.
Wildcard Mechanics in Directory Listings
Wildcard Character (*):
In glob patterns, the asterisk (*) is used as a wildcard that matches any character, any number of times. This flexibility is powerful for grouping a wide range of file patterns but has limitations in precision.
Behavior in Object Storage Systems:
- Systems like AWS S3, GCS, and Azure Blob interpret the wildcard in a glob pattern to match any sequence of characters in a filename.
- This means a pattern with a wildcard can encompass a broad range of filenames, potentially grouping files that were not intended to be grouped together.
Specific Limitation Example
Consider the following scenario to illustrate this limitation:
Intended File Grouping Patterns:
- Pattern A: `project_data_*.txt`
- Pattern B: `project_data_*_*.txt`
Example Filenames:
- `project_data_1234.txt`
- `project_data_1234_suffix.txt`
Limitation in Practice:
- In this case, Pattern A (`project_data_*.txt`) is intended to match files like `project_data_1234.txt`. However, due to the nature of the wildcard, this pattern will also inadvertently match `project_data_1234_suffix.txt`.
- The wildcard in Pattern A extends to any length of characters following `project_data_`, making it impossible to exclusively group files that strictly follow the `project_data_1234.txt` format without including those with additional suffixes like `project_data_1234_suffix.txt`.
Addressing the Limitations:
Understanding the inherent limitations of glob patterns, particularly when dealing with wildcards in object storage systems, is crucial for effective file management.
When users encounter scenarios where filenames within a folder are incompatible due to these limitations, several practical options are available.
Ensure appropriate file grouping:
Separation into Distinct Folders:
One effective strategy is to organize files with conflicting name formats into separate folders.
By doing so, the resultant glob patterns within each folder will be distinct and won’t overlap, ensuring precise file grouping.
Leveraging Folder-Globbing Feature:
For added flexibility, users can also utilize our folder-globbing feature.
This feature simplifies the grouping process by aggregating all files in the same folder, regardless of their filename patterns. This approach is particularly useful in scenarios where filename-based grouping is less critical or when dealing with a wide variety of filename formats within the same directory.
Customized Filename Conventions:
Users are encouraged to adopt filename conventions that align better with the capabilities and limitations of glob patterns. By designing filenames with clear, distinct segments and predictable structures, users can more effectively leverage the globbing feature for accurate file categorization.
Conclusion
The Multi-Token Filename Globbing feature stands out as a powerful and efficient tool for organizing and categorizing files within a distributed filesystem.
By astutely analyzing filename patterns and forming optimized glob patterns, this feature significantly streamlines the process of managing files that share common schemas, thereby elevating the overall data quality and accessibility within the system.
Ended: Misc
Ended: Getting Started
Release Notes
2024.12.11
Feature Enhancements
- Add `Max Parallelization` Field on Datastore Connection
- Users can now configure the maximum parallelization level for certain datastores, providing greater control over operation performance.
General Fixes
- General Fixes and Improvements.
2024.11.29
Feature Enhancements
- Activity List
- Removed the `Warning` status for a cleaner and more concise status display.
- Added an alert icon to indicate if an operation completed with warnings, improving visibility into operation outcomes.
General Fixes
- Better handling of Oracle Date and Numeric columns during Catalog operations for improved partition field selection.
- General Fixes and Improvements.
2024.11.21
Feature Enhancements
- Improved Operations Container Dialogs
- Added container status details based on profile and scan results, providing better visibility of container-level operations.
- Introduced a loading tracker component for containers, enhancing feedback during operation processing.
- Made the entire modal reactive to operation updates, enabling real-time tracking of operation progress within the modal.
- Removed "containers requested" and "containers analyzed" dialogs for a cleaner interface.
General Fixes
- Resolved an issue where the table name was not rendering correctly in notifications when using the `{{ customer_name }}` variable.
- General Fixes and Improvements.
2024.11.12
Feature Enhancements
-
Enhance Data Catalog Integration
- Introduced a new domain input field that allows users to select specific domains, enabling more granular control over assets synchronization.
-
Scan Results Enhancements
- Added partition label to the scan results modal for improved partition identification.
- Removed unnecessary metadata partitions created solely for volumetric checks, reducing clutter in scan results.
-
Activity Tab
- Display of Unprocessed Containers in the Operation List
- Unprocessed containers are now visible in the operation list within the operation summary.
- A total count label was added to indicate if the number of analyzed containers exceeds the total requested.
- The search icon now highlights in a different color if not all containers were analyzed, making it easier to identify incomplete operations.
- Reorder the Datastore Column in the Activity Tab
- Users can now reorder columns in the Activity tab for easier navigation and data organization.
- Profile Operations
- Users can now view added, updated, and total inferred checks within Profile operations.
- Triggered by Column
- Updated the term "Triggered by API" to "Triggered by System" for clarity.
General Fixes
- General Fixes and Improvements.
2024.11.01
Feature Enhancements
-
Observability Enhancements
- An observability heatmap was added to the volumetric card in the Observability tab.
- The heatmap allows users to monitor volumetric status and check for new anomalies.
- Improved observability chart for clearer insights.
- Users can now view the count of volumetric anomalies produced over time, along with the last recorded measurements for each period.
- Introduced new color indicators to help distinguish volumetric measures outside thresholds that didn’t produce anomalies from those that did.
-
Editable Tags in Field Details
- Users with write permissions can now manage tags directly in the Field Details within the Explore context.
-
Distinct Count Rule Update
- The Distinct Count rule now excludes the Coverage field for more accurate assessments.
-
Support for Pasting into Expected Values
- Users can now paste values from spreadsheets directly into Expected Values, saving time on data entry.
General Fixes
- General Fixes and Improvements.
2024.10.23
Feature Enhancements
-
Dremio Connector
- We’ve expanded our connectivity options by supporting a new connection with Dremio.
-
Full View of Abbreviated Metrics in Operation Summary
- Users can now hover over abbreviated metrics to see the full value for better clarity.
-
Redirect to Conflicting Check
- Added a redirect link to the conflicting check from the error message, improving navigation when addressing errors.
-
Enhanced Visibility and Engagement for Tags and Notifications Setup
- Introduced a Call to Action to encourage users to manage Tags and Notifications for better engagement.
-
Favorite Containers
- Users can now favorite individual containers.
- The option to favorite datastores and containers is now available in both card and list views.
General Fixes
- General Fixes and Improvements.
2024.10.16
Feature Enhancements
-
Improved Anomaly Modal
- Introduced an information icon in each failed check to display the check's description.
- Anomaly links now persist filters for sort order and displayed fields.
- Added integration details to fields in a source record.
-
Secrets Management
- Added support for Secrets Manager in connection properties, enabling integration with Vault and other secrets management systems.
-
Alation Data Dictionary
- Enhanced the dictionary to display friendly names in anomaly screens for improved usability.
- Added integration information to the datastore, container, and fields in the tree view footer.
-
Tag Category
- Introduced support for tag categories to improve tag management, with sorting and filtering options based on the category field.
-
Call to Action for Volumetric Measurements
- A call to action was added in the overview tab within the container context, and the observability page per container was added to enable volumetric measurements.
-
Error Display for Check Operations
- Bulk operations like Edit, Activate, Update, and Template Edit now display error messages clearly when validation fails.
-
Check Validation
- Improved check validation logic to enhance bulk check validation speed and prevent timeouts.
-
Tag Filtering for Fields
- Users can now filter fields by tags in the field list under the datastore context.
-
Field Remarks in Native Field Properties
- Added support for displaying field remarks alongside other native field properties.
-
Customer Support Link
- Users can now access the Qualytics Helpdesk via the Discover menu in the main header.
General Fixes
- General Fixes and Improvements.
2024.10.04
Feature Enhancements
-
Insights Page Redesign
- Introduced a new Overview card displaying key metrics such as Data Under Management, Source Datastores, and Containers.
- Added a doughnut chart visualization for checks and anomalies, providing a clearer view of data health.
- Expanded available metrics to include profile runs and scan runs.
- Users can now easily navigate to Checks and Anomalies based on their current states and statuses.
- Implemented data volume visualizations to give users better insight into data trends.
- Introduced a legend option that allows users to compare specific metrics against the primary one.
- Enhanced the check distribution visualization across the platform within the overview tabs.
-
Check Filter
- Now users can filter Not Asserted checks.
-
Team Management
- Now admin users can modify the Read and Write permissions of the Public Team.
-
Reapplying Clone Field
- Improved check cloning by attempting to reapply the field from the original (source) check when a new container is selected. If the selected container has a field matching the name and type from the original check, the cloned field is reapplied automatically.
General Fixes
-
Allow saving checks with attached templates as drafts
- Adjusted the behavior to allow checks attached to a template to be saved as drafts. The Save as draft feature now remains functional when a template is attached.
-
Unexpected Incremental Identifier Behavior
- Addressed an issue where the Incremental Modifier was set to null when a user modified the query of a computed table.
-
General Fixes and Improvements
2024.09.25
Feature Enhancements
-
Observability
- Time-series charts are presented to monitor data volume and related anomalies for each data asset.
- Custom thresholds were added to adjust minimum and maximum volume expectations.
- The Metrics tab has been moved to the Observability tab.
- The Observability tab has replaced the Freshness page.
-
Check Category Options for Scan Operations
- Users can select one or multiple check categories when running a scan operation.
-
Anomaly Trigger Rule Type Filter
- Added a filter by check rule types to anomaly triggers. A help component was added to the tags selector to improve clarity.
-
Auto-Archive Anomalies
- A new Duplicate status has been introduced for anomalies.
- Users can now use Incremental Identifier ranges to auto-archive anomalies with the new Duplicate status.
- An option has been added to scan operations to automatically archive anomalies identified as duplicates if the containers analyzed have incremental identifiers configured.
-
A dedicated tab for filtering duplicate anomalies has been added for better visibility.
-
Tree View and Breadcrumb Context Menu
- A context menu has been added, allowing users to copy essential information and open links in new tabs.
- Users can access the context menu by right-clicking on the assets.
-
Incremental Identifier Support
- Users can manage incremental identifiers for computed tables and computed files.
-
Native Field Properties
- Users can now see native field properties in the field profile, displayed through an info icon next to the Type Inferred section.
-
Qualytics CLI Update
- Users can now import check templates.
- A status filter has been added to check exports. Users can filter by Active, Draft, or Archived (which will include Invalid and Discarded statuses).
General Fixes
- The Oracle connector now handles invalid schemas when creating connections.
- Fixed an issue where anomaly counts for scan operations did not account for archived statuses.
- Improved error message when a user creates a schedule name longer than 50 characters.
- General Fixes and Improvements.
Breaking Changes
- Freshness and SLA references have been removed from user notifications and notification rules; users should migrate to Observability using volumetric checks.
2024.09.14
Feature Enhancements
-
Volumetric Measurement
- We are excited to introduce support for volumetric measurements of views, computed tables and computed files.
-
Enhanced Source Record CSV Download
- Users can now download as CSV all source records that have been written to the enrichment datastores.
-
Tags and Notifications Moved to Left-Side Navigation
- Users can now quickly switch between Tags, Notifications, and Data Assets through the left-side navigation.
- Access to the Settings page is restricted to admin users.
-
Last Asserted Information in Checks
- The Created Date information has been replaced with Last Asserted to improve visibility.
- Users can hover over an info icon to view the Created Date.
-
Auto-Generated Description in Check Template Dialog
- Descriptions are now automatically generated in the Template Dialog based on the rule type, ensuring consistency with the check form.
-
Exposed Properties in Profile and Scan Operations
- Profile and scan operations now expose properties when listed:
- Record Limit
- Infer As Draft
- Starting Threshold
- Enrichment Record Limit
General Fixes
- Fixed a bug where the container list would not update when a user created a computed container.
- Fixed an issue where deactivated users were not filtered on the Settings page under the Security tab.
- Improved error messages when operations fail.
- Fixed a bug where the Last Editor field was empty after a user was deactivated by an admin.
- General Fixes and Improvements.
2024.09.10
Feature Enhancements
-
Add Source Datastore Modal
- Enhanced text messages and labels for better clarity and user experience.
-
Add Datastore
- Users can now add a datastore directly from the Settings page under the Connections tab, simplifying connection management.
General Fixes
- General Fixes and Improvements
2024.09.06
Feature Enhancements
- Introducing Bulk Activation on Draft Checks
- Users can now activate and validate multiple draft checks at once, streamlining the workflow and reducing manual effort.
General Fixes
-
Improved error message for BigQuery temporary dataset configuration exceptions.
-
Added a retry operation for Snowflake when no active warehouse is selected in the current session.
-
General Fixes and Improvements
Breaking Changes
- API fields (type and container_type) are now mandatory in request payloads where they were previously optional (see the illustrative request below):
- POST /global-tags: type is now required.
- PUT /global-tags/{name}: type is now required.
- POST /containers: container_type is now required.
- PUT /containers/{id}: container_type is now required.
- POST /operations/schedule: type is now required.
- PUT /operations/schedule/{id}: type is now required.
- POST /operations/run: type is now required.
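As a rough illustration of the new requirement, the sketch below sends a tag-creation request with the now-mandatory type field included. The base URL, token, payload values, and the extra fields shown (name, color) are placeholder assumptions for illustration only; consult the Qualytics API reference for the authoritative schema.

```python
import requests

# Hypothetical base URL and token -- replace with your deployment's values.
BASE_URL = "https://your-instance.qualytics.io/api"
TOKEN = "YOUR_API_TOKEN"

# After this release, `type` must be present in the payload.
# `name`, `color`, and the value of `type` are illustrative placeholders.
payload = {
    "name": "critical",
    "color": "#FF0000",
    "type": "global",  # now required -- omitting it returns a validation error
}

response = requests.post(
    f"{BASE_URL}/global-tags",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
```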
2024.09.03
Feature Enhancements
- Introducing Catalog Scheduling
- Users can now schedule a Catalog operation like Profile and Scan Operations, allowing automated metadata extraction.
General Fixes
- General Fixes and Improvements
2024.08.31
Feature Enhancements
-
New Draft Status for Checks
- Introduced a new 'draft' status for checks to enhance lifecycle management, allowing checks to be prepared and reviewed without impacting scan operations.
- Validation is only applied to active checks, ensuring draft checks remain flexible for adjustments without triggering automatic validations.
-
Introduce Draft Check Inference in Profile Operations
- Added a new option to infer checks as drafts, offering more flexibility during data profiling.
-
Improve Archive Capabilities for Checks and Anomalies
- Enhanced the archive capabilities for both checks and anomalies, allowing recovery of archived items.
- Introduced a hard delete option that allows permanent removal of archived items, providing greater control over their management.
- The Anomaly statuses 'Resolved' and 'Invalid' are now treated as archived states, aligning with the consistent approach used for checks.
-
Introduce a new Volumetric Check
- Introduced the Volumetric Check to monitor and maintain data volume stability within a specified range. This check ensures that the volume of data assets does not fluctuate beyond acceptable limits based on a moving daily average.
- Automatically inferred and maintained by the system for daily, weekly, and monthly averages, enabling proactive management of data volume trends (a conceptual sketch of the idea follows below).
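The platform infers and maintains these checks automatically; the snippet below is only a minimal conceptual sketch of the underlying idea, flagging a daily record count that drifts too far from a moving daily average. The window size and allowed deviation are arbitrary assumptions, not Qualytics' actual parameters.

```python
from statistics import mean

def volume_out_of_range(daily_counts, today_count, window=7, allowed_deviation=0.2):
    """Return True if today's volume deviates from the moving average of the
    last `window` days by more than `allowed_deviation` (as a fraction).

    Purely illustrative -- not the platform's inference logic.
    """
    moving_avg = mean(daily_counts[-window:])
    lower = moving_avg * (1 - allowed_deviation)
    upper = moving_avg * (1 + allowed_deviation)
    return not (lower <= today_count <= upper)

# Example: a sudden drop well below the recent average is flagged.
history = [1000, 1020, 990, 1010, 1005, 998, 1012]
print(volume_out_of_range(history, today_count=500))  # True
```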
-
Incremental Identifier Warning in Scan Dialog
- Enhanced the dialog to notify users when they attempt an incremental scan on containers lacking an incremental identifier, ensuring transparency and preventing unexpected full scans.
General Fixes
-
- Improved enrichment writes by queuing all writes (up to a queue threshold) for the entire scan operation. This dramatically reduces the number of write operations performed.
-
- Added explicit casting to work around weak typing support in the CSV parser.
-
General Fixes and Improvements
2024.08.19
Feature Enhancements
-
Enhance Auto-Refresh Mechanism on Tree View
- The datastore and container tree footers are now automatically refreshed after specific actions, eliminating the need for manual page refreshes.
-
Support Oracle Client-Side Encryption
- Connections with Oracle now feature end-to-end encryption. Database connection encryption adds an extra layer of protection, especially for transmissions over long-distance, insecure channels.
General Fixes
-
UI Label on Explore Page
- Fixed an issue where the labels on the Explore page did not change based on the selected time frame.
-
Inferred Field Type Enhancements
- Behavior updated to infer field types at data load time rather than implicitly casting them to the latest profiled type. This change supports more consistent expected schema verification for delimited file types and resolves issues when comparing inferred fields to non-inferred fields in some rule types.
-
Boolean Type Inference
- Behavior updated to align boolean inference with Spark Catalyst so that profiled types are more robustly handled during Spark-based comparisons.
-
General Fixes and Improvements
2024.08.10
Feature Enhancements
- Introducing Profile Inference Threshold
- This feature allows users to adjust which check types will be automatically created and updated during data profiling, enabling them to manage data quality expectations based on the complexity of inferred data quality rules.
- Anomaly Source Records Retrieval Retry Option
- Enabled users to manually retry fetching anomaly source records when the initial request fails.
General Fixes
- General Fixes and Improvements
2024.07.31
Feature Enhancements
-
Introducing Field Count to the Datastore Overview
- This enhancement allows users to easily view the total number of fields present in a datastore across all containers.
-
Search Template
- Added a check filter to the templates page.
- Added a template filter to the checks page in the datastore context and explore.
-
Driver Free Memory
- Added driver free memory information on the Health Page.
-
Anomalous Record Count to the Anomaly Sidebar Card
- Added the anomalous record count information to the anomaly sidebar card located under the Scan Results dialog.
General Fixes
-
Enhanced write performance on scan operations with enrichment and relaxed hard timeouts.
-
Updated Azure Blob Storage connector to use TLS encrypted access by default.
-
- Fixed an issue where the Overview tab was not refreshing asset details automatically.
-
General Fixes and Improvements
2024.07.26
Feature Enhancements
-
Introducing Event Bus for Extended Auto-Sync with Data Catalog Integrations
- We are excited to expand our auto-sync capabilities with data catalog integrations by implementing an event bus pattern.
- Added functionality to delete any DQ values that do not meet important checks.
- Included support for a WARNING status in the Alation Data Health tab for checks that have not been asserted yet.
-
Add Autocomplete to the Notification Form
- Improved the notification message form by implementing autocomplete. Users can now easily include internal variables when crafting custom messages, streamlining the message creation process.
-
Redesign the Analytics Engine Functions
- The functions are now accessible through a menu, which displays the icon and full functionality.
- Added a modal to alert users before proceeding with the restart. The modal informs users that the system will be unavailable for a period during the restart process.
-
Improve Qualytics metadata presentation in Alation
- Previously, multiple custom fields were used to persist data quality metrics measured by Qualytics. This process has been simplified by consolidating the metrics into a single rich text custom field formatted in HTML, making it easier for users to analyze the data.
General Fixes
-
Normalize Enrichment Internal Containers
- To improve user recognition and differentiate between our internal tables and those in source systems, we now preserve the original case of table names.
-
Validation Error on Field Search Result
- Resolved the logic for cascade deletion of dependencies on containers that have been soft deleted, ensuring proper handling of related data.
-
Members Cannot Add Datastore on the Onboarding Screen
- Updated permissions so that members can no longer add Datastores during the onboarding process. Only Admins now have this capability.
-
General Fixes and Improvements
2024.07.19
Feature Enhancements
- Global Search
- We are thrilled to introduce the “Global Search” feature into Qualytics! This enhancement is designed to streamline the search across the most crucial assets: Datastores, Containers, and Fields. It provides quick and precise search results, significantly improving navigation and user interaction.
- Navigation Update: To integrate the new global search bar seamlessly, we have relocated the main menu icons to the left side of the interface. This adjustment ensures a smoother user experience.
- Teradata Connector
- We’ve expanded our connectivity options by supporting a new connection with Teradata. This enhancement allows users to connect and interact with Teradata databases directly from Qualytics, facilitating more diverse data management capabilities.
- Snowflake Key-pair Authentication
- In our ongoing efforts to enhance security, we have implemented support for Snowflake Key-pair authentication. This new feature provides an additional layer of security for our users accessing Snowflake, ensuring that data transactions are safe and reliable.
General Fixes
- General Fixes and Improvements
2024.07.15
Feature Enhancements
-
Alation Data Catalog Integration
- We're excited to introduce integration with Alation, enabling users to synchronize and manage assets across both Qualytics and Alation.
- Metadata Customization:
- Trust Check Flags: We now support warning flags at both the container and field levels, ensuring users are aware of deprecated items.
- Data Health: Qualytics now pushes important checks to Alation's Data Health tab, providing a comprehensive view of data health at the container level.
- Custom Fields: Quality scores and related metadata are pushed under a new section in the Overview page of Alation. This includes quality scores, quality score factors, URLs, anomaly counts, and check counts.
-
Support for Never Expiration Option for Tokens
- Users now have the option to create tokens that never expire, providing more flexibility and control over token management.
General Fixes
- General Fixes and Improvements
2024.07.05
Feature Enhancements
- Enhanced Operations Listing Performance
- Optimized the performance of operations listings and streamlined the display of container-related information dialogs. These enhancements include improved handling of operations responses and the addition of pagination for enhanced usability
General Fixes
-
Fix Computed Field Icon Visibility
- Resolved an issue where the computed field icon was not being displayed in the table header.
-
General Fixes and Improvements
2024.06.29
Feature Enhancements
-
Computed Field Support
- Introduced computed fields allowing users to dynamically create new virtual fields within a container by applying transformations to existing data.
- Computed fields offer three transformation options to cater to various data manipulation needs. Each transformation type is designed to address specific data characteristics:
- Cleaned Entity Name: Automates the removal of business signifiers such as 'Inc.' or 'Corp.' from entity names, simplifying entity recognition.
- Convert Formatted Numeric: Strips formatting like parentheses (for negatives) and commas (as thousand separators) from numeric data, converting it into a clean, numerically typed format.
- Custom Expression: Allows users to apply any valid Spark SQL expression to combine or transform fields, enabling highly customized data manipulations (an illustrative sketch follows this list).
- Users can define specific checks on computed fields to automatically detect anomalies during scan operations.
- Computed fields are also visible in the data preview tab, providing immediate insight into the results of the defined transformations.
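For instance, a Custom Expression might concatenate two existing fields or strip numeric formatting. The sketch below uses PySpark only to show what such Spark SQL expressions look like; the column names are hypothetical, and computed fields are actually defined through the Qualytics UI rather than user code.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("computed-field-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Ada", "Lovelace", "(1,234.50)")],
    ["first_name", "last_name", "amount_text"],  # hypothetical columns
)

# A Custom Expression combining two fields into a new virtual field.
df = df.withColumn("full_name", F.expr("concat(first_name, ' ', last_name)"))

# An expression similar in spirit to "Convert Formatted Numeric":
# strip commas and parentheses, then cast to a numeric type
# (negative-sign handling is omitted in this sketch).
df = df.withColumn(
    "amount",
    F.expr("cast(regexp_replace(amount_text, '[,()]', '') as decimal(12,2))"),
)

df.show()
```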
-
Autogenerated Descriptions for Authored Checks
- Implemented an auto-generation feature for check descriptions to streamline the check authoring process. This feature automatically suggests descriptions based on the selected rule type, reducing manual input and simplifying the setup of checks.
-
Event-Driven Catalog Integrations and Sync Enhancements
- Enhanced the Atlan integration and synchronization functionalities to include event-driven support, automatically syncing assets during Profile and Scan operations. This update also refines the Sync and Integration dialogs, offering clearer control options and flexibility.
-
Sorting by Anomalous Record Count
- Added a new sorting filter in the Anomalies tabs that allows users to sort anomalies by record count, improving the manageability and analysis of detected anomalies.
-
Refined Tag Sorting Hierarchy:
- Updated the tag sorting logic to consistently apply a secondary alphabetical sort by name. This ensures that tags will additionally be organized by name within any primary sorting category.
General Fixes
-
Profile Operation Support for Empty Containers
- Resolved an issue where profiling operations failed to record fields in empty containers. Now, fields are generated even if no data rows are present.
-
Persistent Filters on the Explore Page
- Fixed a bug that caused Explore to disable when switching tabs on the Explore page. Filters now remain active and consistent, enhancing user navigation and interaction.
-
Visibility of Scan Results Button
- Corrected the visibility issue of the 'results' button in the scan operation list at the container level. The button now correctly appears whenever at least one anomaly is detected, ensuring users have immediate access to detailed anomaly results.
-
General Fixes and Improvements
2024.06.18
Feature Enhancements
-
Improvement to Anomaly Dialog
- Enhanced the anomaly dialog to include a direct link to the operation that generated the anomaly. Users can now easily navigate from an anomaly to view other anomalies generated by the same operation directly from the Activity tab.
-
Sorting by Duration in Activity Tab
- Introduced the ability to sort by the duration of operations in the Activity tab by ascending or descending order.
-
Last Editor Information for Scheduled Operations
- Added visibility of which users have created or last updated scheduled operations, enhancing traceability in scheduling management.
-
Display Total Anomalous Records for Anomalies
- Added the total count of anomalous records in the anomalies listing view.
General Fixes
-
Performance Fixes on Computed Table Creation and Check Validation
- Optimized the processes for creating computed tables and validating checks. Users previously experiencing slow performance or timeouts during these operations will now find the processes significantly faster and more reliable.
-
General Fixes and Improvements
2024.06.14
Feature Enhancements
-
Improvements to Atlan Integration
-
When syncing Qualytics with Atlan, badges now display the "Quality Score Total," increasing visibility and emphasizing key data quality indicators on Atlan assets.
-
Improved performance of the synchronization operation.
-
Implemented the propagation of external tags to checks, now automatically aligned with the container synchronization process, enabling better accuracy and relevance of data tagging.
-
-
Refactor Metric Check Creation
- Enhanced the encapsulated Metric Check creation flow to improve user experience and efficiency. Users can now seamlessly create computed tables and schedule operations simultaneously with the metric check creation.
-
Support Update of Weight Modifier for External Tags
-
Add Validation on Updated Connections
- Added support for testing the connection if there's at least one datastore attached to the connection, ensuring more reliable and accurate connection updates.
-
Standardize Inner Tabs under the Settings Page
-
Tags and Notifications Improvements: The layout has been revamped for better consistency and clarity. General headers have been removed, and now each item features specific headers to enhance readability.
-
Security Tab Improvements: The redesign features chip tabs for improved navigation and consistency. Filters have been updated to ensure they meet application standards.
-
Tokens Tab Accessibility: Moved the action button to the top of the page to make it more accessible.
-
Refine Connector Icons Display: Improved the display of connector icons for Datastores and Enrichments in the Connections Tab.
-
-
Streamlined Container Profiling and Scanning
- In the container context, the profile and scan modals have been updated to automatically display the datastore and container, eliminating the need for a selection step and streamlining the process.
-
Swap Order During Check Creation
-
Rule Type Positioning: The Rule Type now appears before the container selection, making the form more intuitive.
-
Edit Mode Header: In edit mode, the Rule Type is prominently displayed in the modal header, immediately under the check ID.
-
General Fixes
-
Address Minor Issues in the Datastore Activity Page
-
Operation ID Auto-Search: Restored the auto-search feature by operation ID for URL access, enhancing navigation, especially for Catalog Operations.
-
Tree View Auto-Refresh: Implemented an auto-refresh feature for the tree view, which activates after any operation in the CTA flow (Catalog, Profile, Scan).
-
-
Fix "Greater Than Field" Quality Check
- Corrected the inclusive property of the greater than field quality check.
-
Fix Exporting Field Profiles for Non-Admin User with Write Permission
- Resolved issues for non-admin users with write permissions to allow proper exporting of field profile metadata to enrichment.
-
Fix "Is Replica Of" Quality Check validation on Field Names with Special Characters
- Improved validation logic to handle field names with special characters.
-
General Fixes and Improvements
2024.06.07
Feature Enhancements
-
Atlan Integration Improvements
- Enhanced the Atlan assets fetch and external tags syncing.
- Added support for external tag propagation to checks and anomalies.
- Merged Global and External tags section for streamlined tag management.
-
Restart Button for Analytics Engine
- Introduced a new "Restart" button under the Settings - Health section, allowing admins to manually restart the Analytics Engine if it is offline or unresponsive.
-
Interactive Tooltip Component
- Added a new interactive tooltip component that remains visible upon hovering, enhancing user interaction across various modules of the application.
- Refactored existing tooltip usage to integrate this new component for a more consistent user experience.
-
Defaulting to Last-Used Enrichment Datastore for Check Template Exports
- Improved user experience by persisting the last selected enrichment datastore as the default option when exporting a check template.
General Fixes
-
Shared Links Fixes
- Fixed issues with shared operation result links, ensuring that dialogs for scan/profile results and anomalies now open correctly.
- Addressed display inaccuracies in the "Field Profiles Updated" metrics.
-
General Fixes and Improvements
2024.06.04
Feature Enhancements
-
Atlan Data Catalog Integration
- We're excited to introduce integration with Atlan, enabling users to synchronize and manage assets across both Qualytics and Atlan:
- Tag Sync: Sync tags assigned to data assets in Atlan with the corresponding assets in Qualytics, enabling tag-based quality score reporting, notifications, and bulk data quality operations using Atlan-managed tags.
- Metadata Sync: Automatically synchronize Atlan with Qualytics metadata, including asset URL, total score, and factor scores such as completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy.
-
Entity Resolution Check
- We've removed the previous limitation on the maximum number of distinct entity names that could be resolved with the Entity Resolution rule type. This release includes various performance enhancements that support an unlimited number of entity names.
-
Enhancements to Catalog Operation Results
- We've improved the catalog operation results by now including detailed information on whether tables, views, or both were involved in each catalog operation.
-
Enhancements to 'Equal to Field' Rule Type
- The 'Equal to Field' rule now supports string values, allowing for direct comparisons between text-based data fields.
-
Enhancements to Enrichment
- Qualytics now includes a property for anomalousRecordCount on shape anomaly, which previously was neither populated nor persisted. This aims to accurately capture and record the total number of anomalous records identified in ShapeAnomaly, regardless of the max_source_records threshold.
-
Dynamic Meta Titles
- Pages such as Datastore Details, Container Details, and Field Details now feature dynamic meta titles that accurately describe the page content and are visible in browser tabs providing better searchability.
General Fixes
-
Fix Trends of Quality Scores on the Insights Page
- Addressed issues with displaying trends on the Insights page. Trends now accurately reflect changes and comparisons to the previous report period, providing more reliable and insightful analytics.
-
Resolved a bug in Entity Resolution where the distinction constraint was only applied to entity names that differed.
-
General Fixes and Improvements
2024.05.22
Feature Enhancements
-
Datastore Connection Updates:
- Users can now update the connection on a datastore if the new one has the same type as the current one.
-
Enrichment Datastore Redirection:
- Enhanced the user interface to facilitate easier redirection to enrichment datastores, streamlining the process and improving user experience.
-
Label Enhancements for Data Completeness:
- Updated labels to better distinguish between completeness percentages and Factor Scores. The label for completeness percentage has been changed to provide clear context when viewed alongside Factor Scores.
General Fixes
-
Rule Type Anomaly Corrections:
- Fixed an issue where the violation messages for record anomalies incorrectly included "None" for some rule types. This update ensures accurate messaging across all scenarios.
-
Shape Anomaly Logic Adjustment:
- Revised the logic for Shape Anomalies to prevent the combination of failed checks for high-count record checks on the same field. This change ensures that displayed sample rows have definitively failed the specific checks shown, enhancing the accuracy of anomaly reporting.
-
Entity Resolution Anomalies:
- Addressed an inconsistency where some Entity Resolution Checks did not return source records. Ongoing investigations and fixes have improved the reliability of finding source records for entity resolution checks across DFS and JDBC datastores.
-
General Fixes and Improvements
2024.05.16
Feature Enhancements
-
Entity Resolution Check
- Introduced rule "Entity Resolution" to determine if multiple records reference the same real-world entity. This feature uses customizable fields and similarity settings to ensure accurate and tailored comparisons.
-
Support for Rerunning Operations
- Added an option to rerun operations from the operations listing, allowing users to reuse the configuration from previously executed operations.
General Fixes
-
Export Operations
- Fixed metadata export operations silently failing on writing to the enrichment datastores.
-
Computed File/Table Creation
- Resolved an issue that prevented the creation of computed files/tables with the same name as previously deleted ones, even though it is a valid action.
-
General Fixes and Improvements
2024.05.13
General Fixes
-
Enhanced Quality Score Factors Computation
- Addressed issues in the quality score calculation and its associated factors, ensuring accuracy.
-
General Fixes and Improvements
2024.05.11
Feature Enhancements
-
Introducing Quality Score Factors
- This new feature allows users to control the quality score factor weights at the datastore and container levels.
- Quality Score Detail Expansion: Users can now click on the quality score number to expand its details, revealing the contribution of each factor to the overall score. This enhancement aids in understanding what drives the quality score.
- Insights Page Overhaul: The Insights page has been restructured to better showcase the quality score breakdown. This redesign aims to make the page more informative and focused on quality score metrics.
- Customization of Factor Weights: Users can now customize the weights of different factors at the Datastore and Container levels. This feature is essential for adapting the quality score to meet specific user needs, such as disregarding the Timeliness factor for dimensional tables where it might be irrelevant (an illustrative weighting sketch follows this list).
- Enhanced Inferred Checks: Introduced a new property in the Check Listing schema and a feature in the Check modal that displays validity metrics, which help quantify the accuracy of inferred checks. A timezone handling issue in the last_updated property of the Check model has also been addressed.
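To illustrate why adjustable weights matter, the sketch below computes a simple weighted average over hypothetical factor scores; setting a factor's weight to zero (for example, Timeliness on a dimensional table) removes its influence. This is purely illustrative arithmetic and is not Qualytics' actual scoring formula.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Illustrative weighted average; not Qualytics' actual formula."""
    total_weight = sum(weights.get(factor, 1) for factor in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[f] * weights.get(f, 1) for f in scores) / total_weight

scores = {"completeness": 95, "conformity": 88, "timeliness": 40}   # hypothetical scores
weights = {"completeness": 1, "conformity": 1, "timeliness": 0}     # Timeliness disregarded

print(round(weighted_score(scores, weights), 1))  # 91.5
```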
-
Quality Score UI Enhancements
- Enhancements have been made to the user interface to provide a clearer and more detailed view of the quality score metrics, including Completeness, Coverage, Conformity, Consistency, Precision, Timeliness, Volumetrics, and Accuracy. These changes aim to provide deeper insight into the components that contribute to the overall quality score.
General Fixes
-
Fixes to JDBC Incremental Support
- Updated the conditional logic in the catalog operation for update tables to ensure the incremental identifier is preserved if already established.
-
General Fixes and Improvements
2024.05.02
Feature Enhancements
-
Datastore Connections:
- Users can now create connections that can be shared across different datastores. This introduces a more flexible approach to managing connections, allowing users to streamline their workflow and reduce duplication of effort. With shared connections, users can easily reuse common elements such as hostname and credentials across various datastores, enhancing efficiency and simplifying management.
-
File Container Header Configuration:
- Adds support for setting the hasHeader boolean property on File Containers, enabling users to specify whether their flat file data sources include a header row. This enhances compatibility and flexibility when working with different file formats.
-
Improved Error Handling in Delete Dialogs:
- Error handling within delete dialogs has been revamped across the application. Error messages will now be displayed directly within the dialog itself, providing clearer feedback and preventing misleading success messages in case of deletion issues.
General Fixes
-
Locked Template Field Editing:
- Resolves an issue where selecting a new container in the check form would reset check properties, causing problems for locked templates. The fix ensures that checks derived from templates retain their properties, allowing users to modify the field_to_compare field as needed.
-
General Fixes and Improvements
2024.04.25
Feature Enhancements
-
Profile Results Modal:
- Introducing a detailed Results Modal for each profile operation. Users can now view comprehensive statistics about the produced container profiles and their partitions, enhancing their ability to analyze data effectively.
-
Checks Synchronized Count:
- The operations list now includes the count of synchronized checks for datastore and explore operations. This addition streamlines the identification of operations, improving user experience.
General Fixes
- General Fixes and Improvements
2024.04.23
Feature Enhancements
-
Introduction of Comparators for Quality Checks:
- Launched new Comparator properties across several rule types, enhancing the flexibility in defining quality checks. Comparators allow users to set margins of error, accommodating slight variations in data validation:
- Numeric Comparators: Enables numeric comparisons with a specified margin, which can be set as either a fixed absolute value or a percentage, accommodating datasets where minor numerical differences are acceptable.
- Duration Comparators: Supports time-based comparisons with flexibility in duration differences, essential for handling time-based data with variable precision.
- String Comparators: Facilitates string comparisons by allowing for variations in spacing, ideal for textual data where minor inconsistencies may occur.
- Applicable to rule types such as Equal To, Equal To Field, Greater Than, Greater Than Field, Less Than, Less Than Field, and Is Replica Of (see the illustrative sketch below).
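As a rough sketch of how a numeric comparator margin behaves, the function below treats two values as equal when they differ by no more than a fixed absolute margin or a percentage of the expected value. The function, names, and thresholds are illustrative assumptions, not the platform's implementation.

```python
def within_margin(actual, expected, absolute=None, percentage=None):
    """Illustrative numeric comparator: equality within a margin of error."""
    diff = abs(actual - expected)
    if absolute is not None:
        return diff <= absolute
    if percentage is not None:
        return diff <= abs(expected) * (percentage / 100.0)
    return diff == 0

print(within_margin(100.4, 100.0, absolute=0.5))    # True  (within 0.5)
print(within_margin(103.0, 100.0, percentage=2.0))  # False (outside 2%)
```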
-
Introduced Row Comparison in the isReplicaOf Rule:
- Improved the rule to support row comparison by id, enabling more precise anomaly detection by allowing users to specify row identifiers for unique row comparison. Key updates include:
- Revamp of the source record presentation to highlight differences between the left and right containers at the cell level, enhancing visibility into anomalies.
- New input for specifying unique row identifiers, transitioning from symmetric difference to row comparison when set.
- The original behavior of symmetric comparison remains unchanged if no row identifiers are provided (see the conceptual sketch below).
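The two comparison modes can be sketched in plain Python: without identifiers, rows are compared as whole sets (symmetric difference); with an identifier, rows are matched by id so cell-level differences can be pinpointed. The sample data and key name are hypothetical, and the snippet is only a conceptual illustration.

```python
left = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
right = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

def as_tuples(rows):
    """Represent each row as a hashable tuple so rows can be compared as sets."""
    return {tuple(sorted(r.items())) for r in rows}

# Symmetric difference: rows present on one side but not the other, compared as a whole.
print(as_tuples(left) ^ as_tuples(right))  # both versions of row id=2 show up as differences

# Row comparison by id: match rows on the identifier, then compare cell by cell.
right_by_id = {r["id"]: r for r in right}
for row in left:
    other = right_by_id.get(row["id"])
    if other and other != row:
        changed = {k: (row[k], other[k]) for k in row if row[k] != other[k]}
        print(row["id"], changed)  # 2 {'amount': (20, 25)}
```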
-
New equalTo Rule Type for Direct Value Comparisons
- Introduced the equalTo rule type, enabling precise assertions that selected fields match a specified value. This new rule not only simplifies the creation of checks for constant values across datasets but also supports the use of comparators, allowing for more flexible and nuanced data validation.
-
Redirect Links for Requested Containers in Operation Details:
- Introduced redirect links in the "Containers Requested" section of operation results. This enhancement provides direct links to the requested containers (such as tables or files), facilitating quicker navigation and streamlined access to relevant operational data.
-
Enhanced Description Input with Expandable Option:
- Implemented an expandable option for the Description input in the Check Form & Template Form. This enhancement allows users to more comfortably manage lengthy text entries, improving the usability of the form by accommodating extensive descriptions without compromising the interface's usability.
General Fixes
-
Addressed Data Preview Timeout Issues:
- Tackled the timeout problems in the data preview feature, ensuring that data retrieval processes complete successfully within the new extended timeout limits.
-
General Fixes and Improvements
2024.04.12
Feature Enhancements
- File Pattern Overrides:
- We have added support in the UI to override a file pattern. A file pattern overridden by a user now replaces the one the system generated during the initial catalog operation. To get a new system-generated file pattern in the UI, users need to run a new catalog operation without pruning.
- Batch Edit in the Check Templates Library:
- We now support batch edits for check templates in the Library. This enhancement includes support for filters and tags.
- Improved Presentation of Incremental, Remediation, and Infer Constraints:
- We have improved the presentation of Incremental, Remediation, and Infer Constraints in the operation listing for catalog, profile, and scan operations. The Incremental, Remediation, and Infer Constraints icons have been added to the list of items, and the visualization of these items has been enhanced.
- Default Placeholders for Computed File in UI:
- We are now automatically populating the form dialog with fields from the selected container. This improvement simplifies the process for users, especially in scenarios where they wish to select or cast specific fields directly from the source container.
General Fixes
-
Tree View Default Ordering:
- We have updated the tree view default ordering. Datastore names are now grouped and presented in alphabetical order.
-
General Fixes and Improvements
2024.04.06
Breaking Changes
-
Remediation Naming Convention Update:
- Updated the naming convention for remediation to {enrich_container_prefix}_remediation_{container_id}, standardizing remediation identifiers.
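- For example, assuming a hypothetical enrichment container prefix of qualytics_enrich and a container id of 42, the remediation table would be named qualytics_enrich_remediation_42.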
-
Add file extension for DFS Enrichment:
- Introduced the .delta extension to files in the enrichment process on DFS, aligning with data handling standards.
Feature Enhancements
-
Revamp Enrichment Datastore Main Page:
- Tree View & Data Navigation: Enhanced the enrichment page with an updated tree view that now lists source datastores linked to enrichment datastores, improving navigability. A newly introduced page for enrichment datastore enables:
- Data preview across enrichment, remediation, and metadata tables with the ability to apply "WHERE" filters for targeted insights.
- Direct downloading of preview data as CSV.
- UI Performance Optimization: Implemented UI caching to boost performance, reducing unnecessary network requests and smoothly preserving user-inputted filters and recent data views.
- Tree View & Data Navigation: Enhanced the enrichment page with an updated tree view that now lists source datastores linked to enrichment datastores, improving navigability. A newly introduced page for enrichment datastore enables:
-
User Sorting by Role:
- Introduced a sorting feature in the Settings > Users tab, allowing users to be sorted by their roles in ascending or descending order, facilitating easier user management.
-
Expanded Entity Interaction Options:
- Enhanced entity lists and breadcrumbs with new direct action capabilities. Users can now right-click on an item to access useful functions: copy the entity's ID or name, open the entity's link in a new tab, and copy the entity's link. This enhancement simplifies data management by making essential actions more accessible.
General Fixes
-
Record Quality Scores Overlap Correction:
- Resolved a problem where multiple violations could be open for the same container simultaneously, which should not occur. This fix ensures violations for containers are uniquely recorded, eliminating parallel open violations.
-
Anomaly Details Text Overflow:
- Corrected text overflow issues in the anomaly details' violation box, ensuring all content is properly contained and readable.
-
Enhanced "Not Found" Warnings with Quick Filters:
- Improved user guidance for Checks and Anomalies list filters by adding hints for "not found" items, suggesting users check the "all" group for unfiltered search results, clarifying navigation and search results.
-
General Fixes and Improvements
2024.03.29
Feature Enhancements
-
Data Preview
- Introducing the "Data Preview" tab, providing users with a streamlined preview of container data within the platform. This feature aims to enhance the user experience for tasks such as debugging checks, offering a grid view showcasing up to 100 rows from the container's source.
- Data Preview Tab: Implemented a new tab for viewing container data, limited to displaying a maximum of 100 rows for improved performance.
- Filter Support: Added functionality to apply filter clauses to the data preview, enabling users to refine displayed rows based on specific criteria.
- UI Caching: Implemented a caching layer within the UI to enhance performance and reduce unnecessary network requests, storing the latest refreshed data along with applied filters.
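- For example, a filter clause such as country = 'US' AND amount > 1000 (with hypothetical column names) would limit the previewed rows to those matching the condition.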
-
Enhanced Syntax Highlight Inputs
- Improved the syntax highlight inputs for seamless inline editing, minimizing the friction of entering expressions. This feature includes a dual-mode capability, allowing users to type directly within the input field or utilize an expanded dialog for more complex entries, significantly improving user experience.
-
Volumetric Measurements
- Container volumetrics are now measured periodically for a more robust approach. This update measures only containers without a volume measure in the last 24 hours and schedules multiple runs of the job daily.
-
Sort Tags by Color
- Users can now sort tags by color, visually grouping similar colors for easier navigation and management.
-
Download Source Records
- Added a "Download Source Records" feature to the Anomaly view in the UI, allowing users to export data held in the enrichment store for that anomaly in CSV format.
-
Check Templates Navigation
- Implemented a breadcrumb trail for the Check Template page to improve user navigation.
General Fixes
-
Fix Scheduling Issues
- Resolved scheduling issues affecting specific sets of containers, particularly impacting scheduled profile and scan operations. Users must manually add new profiles after catalog operations or computed file/table creation for inclusion in existing scheduled operations.
-
Fix Notifications Loading Issue on Large Screens
- Fixed an issue where the infinity loading feature for the user notification list was not functioning properly on large screens. The fix ensures correct triggering of infinity loading regardless of screen size, allowing all notifications to be accessed properly.
-
General Fixes and Improvements
2024.03.15
Feature Enhancements
- Enhanced Observability
- Automated daily volumetric measurements for all tables and file patterns
- Time-series capture and visualizations for volume, freshness, and identified anomalies
-
Overview Tab:
- Introduced a new "Overview" tab with information related to monitoring at the datastore and container level.
- This dashboard interface is designed for monitoring and managing Qualytics-related data for datastores and containers.
- Users can see:
- Totals: Quality Score, Tables, Records, Checks and Anomalies
- Total of Quality Checks grouped by Rule type
- Data Volume Over Time: A line graph that shows the total amount of data associated with the project over time.
- Anomalies Over Time: A line graph that shows the number of anomalies detected in the project over time.
-
Datastore Field List Update:
- The datastore field profiles list has been updated to match the existing list views design.
- All card-listed pages now display information in a column format, conditionally using scrolling for smaller and larger screens.
- Field details are now shown in a modal with Profiling and Histogram information.
-
Heatmap Simplification:
- Simplified the heatmap to consider only operation counts.
-
Datastore Metrics:
- Improved distinction between 0 and null values in the datastore metrics (total records, total fields, etc).
-
Explore Page Update:
- Added new metrics to the Explore page.
- We are now adding data volume over time (records and size).
- Improved distinction between 0 and null values in metrics (total records, total fields, etc).
General Fixes
-
UI Wording and Display for Cataloged vs Profiled Fields:
- Addressed user confusion surrounding the display and wording used to differentiate between fields that have been cataloged versus those that have been profiled.
- Updated the messaging within the tree view and other relevant UI components to accurately reflect the state of fields post-catalog operation.
- Implemented a clear distinction between non-profiled and profiled fields in the field count indicators.
- Conducted a thorough review of the CTAs and descriptive text surrounding the Catalog, Profile, and Scan operations to improve clarity and user understanding.
-
General Fixes and Improvements
2024.03.07
General Fixes
-
Corrected MatchesPattern Checks Inference:
- Fixed an issue where the inference engine generated MatchesPattern checks that erroneously asserted false on more than 10% of training data. This resolution ensures all inferred checks now meet the 99% coverage criterion, aligning accurately with their training datasets.
-
Fixed Multi-Field Check Parsing Error in DFS:
- Addressed a bug in DFS environments that caused parsing errors for checks asserting against multiple fields, such as AnyNotNull and NotNull, when selected fields contained white spaces. This resolution ensures that checks involving multiple fields with spaces are now accurately parsed and executed.
-
Volumetric Measurements Tracking Fix:
- Addressed a bug that prevented the recording of volumetric measurements for containers without a last modified time. This fix corrects the problem by treating last_modification_time as nullable, ensuring that containers are now accurately tracked for volumetric measurements regardless of their modification date status.
-
General Fixes and Improvements
2024.03.05
Feature Enhancements
- Check Validation Improvement:
- Enhanced the validation process for the "Is Replica Of" check. Previously, the system did not validate the field name and type, potentially leading to undetected issues until a Scan Operation was executed. Now, the validation process includes checking the field name and type, providing users with immediate feedback on any issues.
General Fixes
-
Matches Pattern Data Quality Check Handling White Space:
- Resolved a bug in the Matches Pattern data quality check that caused white space to be ignored during training. With this fix, the system now accounts for white space during training, ensuring accurate pattern inference even with data containing significant white space. If 1% or more of the training data contains blanks, the system will derive a pattern that includes blanks as a valid value, improving data quality assessment. A hypothetical blank-tolerant pattern is sketched below.
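As a hypothetical illustration of a blank-tolerant pattern, the snippet below shows a regular expression that accepts either a five-digit code or a blank/whitespace-only value; the patterns Qualytics actually infers depend on the training data.

```python
import re

# Hypothetical inferred pattern: a five-digit code OR a blank value.
pattern = re.compile(r"^(\d{5}|\s*)$")

for value in ["90210", "", "   ", "9021A"]:
    print(repr(value), bool(pattern.match(value)))
# '90210' True, '' True, '   ' True, '9021A' False
```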
-
General Fixes and Improvements
2024.02.28
Feature Enhancements
- User Token Management:
- Transitioned from Generic Tokens to a more robust User Token system accessible under Settings for all users. This enhancement includes features to list, create, revoke, and delete tokens, offering granular control of API access. User activities through the API are now attributable, aligning actions with user accounts for improved accountability and traceability.
General Fixes
-
Datetime Validation in API Requests:
- Strict validation of datetime entries in API requests has been implemented to require the Zulu datetime format. This update addresses and resolves issues where incomplete datetime entries could disrupt Scan operations, enhancing API reliability. A format example follows below.
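For reference, a Zulu-format (UTC, ISO 8601) datetime looks like 2024-02-28T14:30:00Z; the minimal snippet below produces one in Python.

```python
from datetime import datetime, timezone

# Produce an ISO 8601 datetime in Zulu (UTC) form, e.g. "2024-02-28T14:30:00Z".
zulu = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(zulu)
```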
-
Context-Aware Redirection Post-Operation:
- Enhanced the operation modal redirect functionality to be context-sensitive, ensuring that users are directed to the appropriate activity tab after an operation, whether at the container or datastore level. This enhancement ensures a logical and intuitive post-operation navigation experience.
-
Template Details Page Responsiveness:
- Addressed layout issues on the Template Details page caused by long descriptions. Adjustments ensure that the description section now accommodates larger text volumes without disrupting the page layout, maintaining a clean and accessible interface.
-
General Fixes and Improvements
2024.02.23
Feature Enhancements
-
Introduction of Operations Management at the Table/File Level:
- The Activity tab has been added at the table/file level, extending its previous implementation at the source datastore level. This update provides users with the ability to view detailed information on operations for individual tables/files, including scan metrics, and histories of operation runs and schedules. It enhances the user's ability to monitor and analyze operations at a granular level.
-
Enhanced Breadcrumb Navigation UX:
- Breadcrumb navigation has been improved for better user interaction. Users can now click on the breadcrumb representing their current context, enabling more intuitive navigation. In addition, selecting the Source Datastore breadcrumb takes users directly to the Activity tab, streamlining the flow of user interactions.
General Fixes
-
Improved Accuracy in Profile and Scan Metrics:
- Enhanced the accuracy of metrics for profiled and scanned operations by excluding failed containers from the count. Now, metrics accurately reflect only those containers that have been successfully processed.
-
Streamlined input display for Aggregation Comparison rule in Check/Template forms:
- Removed the "Coverage" input for the "Aggregation Comparison" rule in Check/Template Forms, as the rule does not support coverage customization. This simplification helps avoid confusion during rule configuration.
-
Increased Backend Process Timeouts:
- In response to frequent timeout issues, the backend process timeouts have been adjusted. This change aims to reduce interruptions and improve service reliability by ensuring that processes have sufficient time to complete.
-
General Fixes and Improvements
2024.02.19
Feature Enhancements
-
Support for exporting Check Templates to the Enrichment Datastore:
- Added the ability to export Check Library metadata to the enrichment datastore. This feature helps users export their Check Library, making it easier to share and analyze check templates.
-
File Upload Size Limit Handling:
- Implemented a user-friendly error message for file uploads that exceed the 20MB limit. This enhancement aims to improve user experience by providing clear feedback when the file size limit is breached, replacing the generic error message previously displayed.
General Fixes
-
Resolved Parsing Errors in Expected Values Rule:
- Fixed an issue where single quotes in the list of expected values caused parsing errors in the Analytics Engine, preventing the Expected Values rule from asserting correctly. This correction ensures values, including those with quotes or special characters, are now accurately parsed and asserted.
-
General Fixes and Improvements
2024.02.17
General Fixes
-
Corrected Typing for Expected Values Check:
- Resolved an issue with the expectedValues rule, where numeric comparisons were inaccurately processed due to a misalignment between the API and the analytics engine. This fix ensures numeric values are correctly typed and compared, enhancing the reliability of validations.
-
Fixed Anomaly Filtering in Scan Results dialog:
- Addressed a flaw where scan results did not consistently filter anomalies based on the operation ID. The fix guarantees that anomalies are only displayed once the operation ID parameter is accurately defined in the URL, ensuring more precise and relevant scan outcome presentations.
-
Check Validation Sampling Behavior Adjustment:
- Fixed intermittent validation issues encountered in specific source datastore types (DB2, Microsoft SQL Server). The problem, where validation could unpredictably fail or succeed based on container size, was corrected by fine-tuning the sampling method for these technologies, leading to consistent validation performance.
-
General Fixes and Improvements
2024.02.15
Feature Enhancements
-
UX Improvements for Profile and Scan Operation Dialogs:
- Implemented significant UX enhancements to Profile & Scan Operation Dialogs for improved clarity and user flow. Key improvements include:
- Visibility of incremental fields and their current starting positions in Scan Operation dialogs.
- Logical reordering of Profile and Scan Operation steps to align with user workflows, including prioritizing container selection and clarifying the distinction between "Starting Threshold" and "Limit" settings.
- Simplified operation initiation, allowing users to start operations directly before the final scheduling step, streamlining the process for immediate execution.
- Implemented significant UX enhancements to Profile & Scan Operation Dialogs for improved clarity and user flow. Key improvements include:
-
Naming for Scheduled Operations:
- Added a name field to scheduled operations, enabling users to assign descriptive names or aliases. This feature aids in distinguishing and managing multiple scheduled operations more effectively.
-
Container Name Filters for Operations:
- Provided filtering options for operations and scheduled operations by container name, improving the ability to quickly locate and manage specific operations.
-
Improved Design for Field Identifiers in Tooltips:
- The design of field identifiers within tooltips has been refined for greater clarity. Enhancements focus on displaying Grouping Fields, Excluded Fields, Incremental Fields, and Partition Fields, aiming to offer users a more intuitive experience.
General Fixes
-
External Scan Rollup Threshold Correction:
- Fixed an issue in external scans where the rollup threshold was not applied as intended. This correction ensures that anomalies exceeding the threshold are now accurately consolidated into a single shape anomaly, rather than being reported as multiple individual record anomalies.
-
Repetitive Release Notification and Live Update Fixes:
- Resolved a recurring issue with release notifications continually prompting users to refresh despite acknowledgment. Additionally, restored the live update notifications' functionality, ensuring users are correctly alerted to new features while actively using the system, with a suggestion to perform a hard refresh to access the latest version.
-
Corrected Field Input Logic in Check & Template Forms:
- Addressed a logic error that incorrectly disabled field inputs for certain rules in check and template forms. This correction re-enables the necessary field input, removing a significant barrier that previously prevented users from creating checks affected by this issue.
-
Addressed Absence of Feedback for No-Match Field Filters on Explore Page:
- Rectified the absence of feedback when field filters on the Explore Page yield no results, ensuring users receive a clear message indicating no items match the specified filter criteria.
-
General Fixes and Improvements
2024.02.10
Feature Enhancements
-
Immediate Execution Option for Scheduled Operations:
- Introduced a "Run Now" feature for scheduled operations, enabling users to execute operations immediately without waiting for the scheduled time. This addition provides flexibility in operation management, ensuring immediate execution as needed without altering the original schedule.
-
Simplified Customization of Notification Messages:
- Removed the "use custom message" toggle from the notification form, making the message input field always editable. This change simplifies the user interface and improves usability by allowing direct editing of notification messages.
- Enhanced default messages for each notification trigger type have also been implemented to improve clarity.
-
Performance Improvement in User Notifications Management:
- Implemented infinite scrolling pagination for the user notifications side panel. This update addresses performance issues with loading large numbers of notifications, ensuring a smoother and more responsive experience for users with extensive notification histories.
-
Enhanced Archive Template Confirmation:
- Updated the archive dialog for templates to include information on the number of checks associated with archiving the template. This enhancement ensures users are aware of the impact of checks linked to the template, promoting informed decision-making.
-
Improved Interaction with Computed Tables:
- Refined the Containers list UX to allow navigation to container details immediately after the creation of a computed table, addressing delays caused by background profiling. This improvement ensures users can access computed table details without waiting for the profile operation to complete, drawing inspiration from Tree View functionality for a more seamless experience.
General Fixes
- General Fixes and Improvements
2024.02.02
Feature Enhancements
- Excluded Fields Inclusion in Drop-downs:
- Refined container settings to incorporate previously excluded fields in the dropdown list, enhancing user flexibility. In addition, a warning message has been added to notify users if a profile operation is required when deselecting excluded fields that were previously selected.
General Fixes
-
Linkable Scan Results for Direct Access:
- Made Scan Results dialogs accessible via direct URL links, addressing previous issues with broken anomaly notification links. This enhancement provides users with a straightforward path to detailed scan outcomes.
-
Property Display Refinement for Various Field Types:
- Corrected illogical property displays for specific field types like Date/Timestamp. The system now intelligently displays only properties relevant to the selected data type, eliminating inappropriate options. This update also includes renaming 'Declared Type' to 'Inferred Type' and adjusting the logic for accurate representation.
-
Timezone Consistency in Insights and Activity Pages:
- Implemented improvements in timezone handling across Insights and Activity pages. These changes ensure that date aggregations are accurately aligned with the user's local time, eliminating previous inconsistencies compared to the Operations list results.
-
Fixed Breadcrumb Display in the Datastore for Members with Restricted Permissions:
- Enhanced the datastore interface to address issues faced by members with limited permissions. This update also fixes misleading breadcrumb displays and ensures that correct datastore enhancement information is visible.
-
Resolved State Issue in Bulk Check Archive:
- Addressed a bug in the bulk selection process for archiving checks. The fix corrects an issue where the system recognized individual selections instead of the intended group selection due to an overlooked edge case.
-
Improved Operation Modal State Management:
- Tackled state management inconsistencies in Operation Modals. Fixes include resetting the remediation strategy to its default and ensuring 'include' options do not carry over previous states erroneously.
-
Eliminating Infinite Load for Non-Admin Enrichment Editing:
- Solved a persistent loading issue in the Enrichment form for non-admin users. Updates ensure a smoother, error-free interaction for these users, improving accessibility and functionality.
-
General Fixes and Improvements
2024.01.30
Feature Enhancements
-
Enhanced External Scan Operations:
- Improved data handling in External Scans by applying type casting to uploaded data using Spark. This update is particularly significant for date-time fields, which now expect and conform to ISO 8601 standards.
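For date-time fields in files submitted to an External Scan, ISO 8601 values can be produced with standard tooling. The snippet below is a minimal Python sketch (the timestamp value is illustrative), not part of the platform itself:

```python
from datetime import datetime, timezone

# Illustrative only: serialize a timestamp to ISO 8601 (e.g. "2024-01-30T14:05:00+00:00")
# before including it in a file submitted to an External Scan.
event_time = datetime(2024, 1, 30, 14, 5, tzinfo=timezone.utc)
iso_value = event_time.isoformat()            # '2024-01-30T14:05:00+00:00'

# Parsing an ISO 8601 string back into a datetime works the same way:
parsed = datetime.fromisoformat(iso_value)
assert parsed == event_time
```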
-
Optimized DFS File Reading:
- Streamlined file reading in DFS by storing and utilizing the 'file_format' identified during the Catalog operation. This change eliminates the need for repeated format inspection on each read, significantly reducing overhead, especially for partitioned file types.
General Fixes
-
Resolved DFS Reading Issues with Special Character Headers:
- Fixed a DFS reading issue where columns with headers containing special characters (like pipes |) adversely affected field profiling, including inaccuracies in histogram generation.
-
General Fixes and Improvements
2024.01.26
Feature Enhancements
-
Incremental Scan Starting Threshold:
- Introduced a "Starting Threshold" option for incremental Scans. This feature allows users to manually set a starting value for the incremental field in large tables, bypassing the need to scan the entire dataset initially. It's handy for first-time scans of massive databases, facilitating more efficient and targeted data scanning.
-
Add Support for Archiving Anomalies:
- Implemented the capability of archiving anomalies. Users can now remove anomalies from view without permanently deleting them, providing greater control and flexibility in anomaly management.
-
External Scan Operation for Ad hoc Processes:
- Introduced 'External Scan Operation' as a new feature enabling ad hoc data validation for all containers. This operation allows users to validate ad hoc data, such as Excel or CSV files, against a container's existing checks and enrichment configuration. The provided file's structure must align with the container's schema, ensuring a seamless validation process.
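As a rough, hedged illustration of the alignment requirement (not the platform's own validation logic), a file's header can be compared to the container's expected columns before submitting it; the column names below are hypothetical:

```python
import csv

# Hypothetical expected schema for the target container (illustrative field names).
expected_columns = ["order_id", "customer_id", "order_date", "total_amount"]

def columns_match(csv_path: str) -> bool:
    """Return True if the CSV header exactly matches the expected column list."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return header == expected_columns

# Example usage:
# if not columns_match("orders_adhoc.csv"):
#     raise ValueError("File structure does not align with the container's schema")
```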
General Fixes
-
Preventing Unrelated Entity Selection in Check Form:
- Fixed an issue in the Check Form where users could inadvertently select unrelated entities. Selecting datastores, containers, and fields is restricted during any ongoing data loading, preventing mismatched entity selections.
-
- Performance enhancements for BigQuery and Snowflake, removing the need for count operations during full table analysis
-
General Fixes and Improvements
2024.01.23
Feature Enhancements
-
Introduction of 'Expected Schema' Rule for Advanced Schema Validation:
- Introduced the 'Expected Schema' rule, replacing the 'Required Fields' rule. This new rule asserts that all selected fields are present and their data types match predefined expectations, offering more comprehensive schema validation. It also includes an option to validate additional fields added to the schema, allowing users to specify whether the presence of new fields should cause the check to fail.
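A minimal sketch of the assertion the rule expresses, assuming a simple mapping of field names to declared types; the field names, types, and option flag below are illustrative rather than the platform's internal representation:

```python
# Illustrative only: the Expected Schema rule asserts that every expected field is
# present with the expected type, and (optionally) that no unexpected fields appear.
expected = {"id": "integer", "email": "string", "created_at": "timestamp"}
observed = {"id": "integer", "email": "string", "created_at": "timestamp", "notes": "string"}

missing_or_mismatched = {
    name: dtype for name, dtype in expected.items()
    if observed.get(name) != dtype
}
extra_fields = set(observed) - set(expected)

allow_additional_fields = True   # toggle corresponding to the "additional fields" option
passed = not missing_or_mismatched and (allow_additional_fields or not extra_fields)
print(passed)   # True with the values above
```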
-
Refined Tree Navigation Experience:
- Updated the tree navigation to prevent automatic expansion of nodes upon selection and eliminated the auto-reset behavior when re-selecting an active node. These changes provide a smoother and more user-friendly navigation experience, especially in tables/files with numerous fields.
-
Locked/Unlocked Status Filter in Library Page:
- Added a new filter feature to the Library page, enabling users to categorize and view check templates based on their Locked or Unlocked status. This enhancement simplifies the management and selection of templates.
-
Improved Messaging for Locked Template Properties in Check Form:
- Enhanced the Check Form UX by adding informative messages explaining why certain inputs are disabled when a check is associated with a locked template. This update enhances user understanding and interaction with the form.
General Fixes
-
Corrected Insights Metrics for Check Templates:
- Fixed an issue where check templates were incorrectly counted as checks in related metrics and counts on the Insights page. Templates are now appropriately filtered out, ensuring accurate representation of check-related data.
-
Enabled Template Creation with Calculated Rules:
- Resolved a limitation that prevented the creation of templates using calculated rules like 'Satisfies Expression' and 'Aggregation Comparison'. This fix expands the capabilities and flexibility of template creation.
-
General Fixes and Improvements
2024.01.11
Feature Enhancements
-
Introduction of Check Templates:
- Implemented Check Templates to offer a balance between flexibility and consistency in quality check management. Checks can now be associated with templates in either a 'locked' or 'unlocked' state, allowing for synchronized properties or independent customization, respectively. This feature streamlines check management and enables efficient tracking and review of anomalies across all checks associated with a template.
-
isType Rule Implementation:
- Replaced the previous dataType rule with the new isType rule for improved accuracy and understanding. The isType rule is now specifically tailored to assert only against string fields, enhancing its applicability and effectiveness.
-
Enhanced Container Details Page with Identifier Icons:
- Updated the Container Details page to display icons for key container identifiers, including Partition Field, Grouping Fields, and Exclude Fields. This enhancement provides a more intuitive and informative user interface, facilitating easier identification and understanding of container characteristics.
General Fixes
-
Notification System Reliability Improvement:
- Fixed intermittent failures in the notifications system. Users will now receive reliable notifications for identified anomalies, ensuring timely awareness and response to data irregularities.
-
Safeguard Against Overlapping Scheduled Operations:
- Implemented a mechanism to prevent the overloading of deployments due to overlapping scheduled operations. If a scheduled operation doesn’t complete before its next scheduled run, the subsequent run will be skipped, thereby avoiding potential strain on system resources.
-
Correction of Group-by Field Display in Containers:
- Resolved an issue where selected grouping fields were not appearing in the list fields of a container. This fix ensures that user-specified fields for group-by operations are correctly displayed, maintaining the integrity of data organization and analysis.
-
General Fixes and Improvements
2024.01.04
Feature Enhancements
- Enhanced Warnings for Schema Inconsistencies in Files Profiled
- Improved the warning message for cases where the user profiles files with different schemas under a single glob pattern. This update ensures users receive clear, helpful information when files within a glob have inconsistent structures.
General Fixes
-
Containers with 'Group By' settings Leading to Erroneous Profile Operation
- Fixed an issue affecting profile operations which included containers with 'Group By' settings. Previously, running a profile without inferring checks resulted in all fields being erroneously removed from the field list.
-
General Fixes and Improvements
2023.12.20
General Fixes
-
Resolved Datastore Creation Issue with Databricks:
- Fixed an issue encountered when creating source datastores using Databricks with catalog names other than the default `hive_metastore`. This fix ensures a smoother and more flexible datastore creation process in Databricks environments.
-
Conflict Resolution for 'anomaly_uuid' Field in Source Container:
- Corrected a problem where source containers with a field named `anomaly_uuid` were unable to run scan operations. This fix eliminates the conflict with internal system columns, allowing for uninterrupted operation of these containers.
-
General Fixes and Improvements
2023.12.14
Feature Enhancements
-
Auto-Detection of Partitioned Files:
- Improved file handling to automatically detect partitioned files like `*.delta` without the need for an explicit extension. This update resolves the issue of previously unrecognized delta tables.
-
Anomaly Weight Threshold for Notifications:
- Enhanced the notification system to support a minimum anomaly weight threshold for the trigger type "An anomaly is detected". Notifications will now be triggered only for anomalies that meet or exceed the defined weight threshold.
-
Team Assignment in Datastore Forms:
- Updated the Datastore Forms to enable users to manage teams. This enhancement provides Admins with the flexibility to assign or adjust teams right at the point of datastore setup, moving away from the default assignment to the Public team.
General Fixes
-
Corrected Health Page Duplication:
- Addressed an issue on the Health Page where "Max Executors" information was being displayed twice. This duplication has been removed for clearer and more accurate reporting.
-
General Fixes and Improvements
2023.12.12
Feature Enhancements
- Incremental Catalog Results Posting:
- Enhanced the catalog operation to post results incrementally for each container catalogued. Previously, results were only available after the entire operation was completed. With this enhancement, results from successfully catalogued containers are now preserved and posted incrementally, ensuring containers identified are not lost even if the operation does not complete successfully.
General Fixes
-
Aggregation Comparison Rule Filter:
- Resolved an issue where filters were not being applied to the Aggregation Comparison Check, affecting both the reference and target filters.
-
Case-Insensitive File Extension Support:
- Addressed a limitation in handling file extensions, ensuring that uppercase formats like .TXT and .CSV are now correctly recognized and processed. This update enhances the system's ability to handle files consistently, irrespective of extension case.
-
SLA Violation Notification Adjustment:
- Modified the SLA violation notifications to trigger only once per violation, preventing a flood of repetitive alerts and improving the overall user experience.
-
Source Record Not Available for Max Length Rule:
- Addressed a bug where the Max Length Rule was not producing source records in cases involving null values. The rule has been updated to correctly handle null values, ensuring accurate anomaly marking and data enrichment.
-
General Fixes and Improvements
2023.12.08
Breaking Changes
-
Renaming of Enrichment Datastore Tables
Due to a lack of consistency and to avoid conflicts between different categories of Enrichment tables, the table name patterns were changed:
- The Enrichment table previously named `<enrichment_prefix>_anomalies` has been renamed to `<enrichment_prefix>_failed_checks` to reflect its content and granularity.
- The terms `remediation` and `export` were added to distinguish Enrichment Remediation and Export tables from others, resulting in:
  - `<enrichment_prefix>_remediation_<container_name>` for Remediation tables.
  - `<enrichment_prefix>_export_<asset>` for Export tables.
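The new patterns can be illustrated with a small sketch; the prefix, container name, and asset name below are placeholders, not values produced by the platform:

```python
# Placeholders for illustration; substitute your own enrichment prefix and names.
prefix = "qualytics"
container_name = "orders"
asset = "checks"

failed_checks_table = f"{prefix}_failed_checks"                  # formerly <prefix>_anomalies
remediation_table   = f"{prefix}_remediation_{container_name}"   # Remediation tables
export_table        = f"{prefix}_export_{asset}"                 # Export tables

print(failed_checks_table, remediation_table, export_table)
```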
Feature Enhancements
- Refactor Notifications Panel:
- Introduced a new side panel for Notifications, categorizing alerts by type (Operations, Anomalies, SLA) for improved organization.
- Added notification tags, receivers, and an action menu enabling users to mute or edit notifications directly from the panel
- Enhanced UI for better readability and interaction, providing an overall improved user experience.
- Added Anomalies as an available Enrichment Export asset:
- Anomalies are now supported as a type of asset for export to an enrichment datastore, enhancing data export capabilities.
- Added file count metric to profile operation summary:
- Displayed file count (number of partitions) in addition to existing file patterns count metric in profile operations for DFS datastores.
- Improved Globbing Logic:
- Optimized support for multiple subgroups when globbing files from DFS datastores during profile operations, enhancing efficiency.
General Fixes
- General Fixes and Improvements
2023.12.05
Feature Enhancements
- Navigation Improvements in Explore Profiles Page:
- Upgraded the Explore Profiles Page by adding direct link icons for more precise navigation. Users can now use these links on container and field cards/lists for a direct redirection to detailed views.
General Fixes
- General Fixes and Improvements
2023.12.01
Feature Enhancements
-
List View Layout Support:
- Introduced list view layouts for Datastores, Profiles, Checks, and Anomalies, providing users with an alternative way to display and navigate through their data.
-
Bulk Acknowledgement Performance:
- Improved the performance of bulk acknowledging in-app notifications, streamlining the user experience and enhancing the application's responsiveness.
General Fixes
-
Checks and Anomalies Dialog Navigation:
- Resolved an issue with arrow key navigation in Checks and Anomalies dialogs where unintended slider movement occurred when using keyboard navigation. This fix ensures that arrow keys will only trigger slider navigation when the dialog is the main focus.
-
Profiled Container Count Inconsistency
- Ensured that containers that fail to load data during profiling are not mistakenly counted as successfully profiled, improving the accuracy of the profiling process.
-
Histogram Field Selection Update:
- Fixed a bug where histograms were not updating correctly when navigating to a new field. Histograms now properly reflect the data of the newly selected field.
-
General Fixes and Improvements
2023.11.28
Feature Enhancements
-
Operations with Tag Selectors:
- Users can now configure operations (including schedules) with multiple tags, enabling dynamic profile evaluation based on tags at the operation's trigger time.
-
Asserted State Filter for Checks:
- Introduced a new check list filter, allowing users to filter checks by those that have passed or identified active anomalies.
-
Bulk Delete for Profiles:
- Enhanced the system to allow bulk deletion of multiple profiles, streamlining the management process where previously only individual deletions were possible.
-
Resizable Columns in Source Records Table:
- Columns in the anomaly dialog source records can now be manually resized, improving visibility and preventing content truncation.
-
Automated Partition Field Setting for BigQuery:
- For BigQuery tables constrained by a required partition filter, the profile partition field setting is now automatically populated during the Catalog operation.
General Fixes
-
Sharable Link Authentication Flow:
- Fixed an issue where direct links did not work if the user was not signed in. Now, users are redirected to the intended page post-authentication.
-
Clarified Violation Messages for 'isUnique' Check:
- Updated the violation message for the 'isUnique' check to clearly describe the anomaly, reducing misinterpretation.
-
Access Restriction and Loading Fix for Health Page:
- Corrected the health page visibility so only admin users can view it, and improved loading behavior for Qualytics services.
-
Availability of Requested Tables During Operations:
- The dialog displaying requested tables/files is now accessible immediately after an operation starts, enhancing transparency for both Profile and Scan operations.
-
General Fixes and Improvements
2023.11.14
Feature Enhancements
- Qualytics App Color Palette and Design Update:
- Implemented a comprehensive design update across the Qualytics App, introducing a new color palette for a refreshed and modern look. This update includes a significant change to the anomalies color, transitioning from red to orange for a more distinct visual cue. Additionally, the font-family has been updated to enhance readability and provide a more cohesive aesthetic experience across the application.
- System Health Readout:
- A new `Health` tab has been added to the Admin menu, offering a comprehensive view of each deployment's operational status. This feature encompasses critical details such as the status of app services, current app version, and analytics engine information, enabling better control over system health.
- Enhanced Check with Metadata Input:
- The Check form now includes a new input field for custom metadata. This enhancement allows users to add key-value pairs for tailored metadata, significantly increasing the flexibility and customization of the Check definition.
- Responsiveness Improvement in Cards Layout:
- The Cards layout has been refined to improve responsiveness and compactness. This adjustment addresses previous UI inconsistencies and ensures a consistent visual experience across different devices, enhancing overall usability and aesthetic appeal.
- Source Record Enrichment for 'isUnique' Checks:
- The `isUnique` check has been enhanced to support source record enrichment. This significant update allows users to view specific records that fail to meet the 'isUnique' condition. This feature adds a layer of transparency and detail to data validation processes, enabling users to easily identify and address data uniqueness issues.
- New Enrichment Data:
- Scan operations now record operation metadata in a new enrichment table with the suffix `scan_operations`, including an entry for each table/file scanned with the number of records processed and anomalies identified, as well as start/stop time and other relevant details.
- Insights Enhancement with Check Pass/Fail Metrics:
- Insights now features the checks section with new metrics indicating the total number of checks passed and failed. This enhancement also offers a visual representation through a chart, detailing the passed and failed checks over a specified reporting period.
General Fixes
- `isAddress` now supports defining multiple checks against the same field with different required label permutations
- General Fixes and Improvements
2023.11.08
Feature Enhancements
-
Is Address Check:
- Introduced a new check for address conformity that ensures the presence of required components such as road, city, and state, enhancing data quality controls for address fields. This check leverages machine learning to support multilingual street address parsing/normalization trained on over 1.2 billion records of data from over 230 countries, in 100+ languages. It achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before).
-
Revamped Heatmap Flow in Activity Tab:
- Improved the user interaction with the heatmap by filtering the operation list upon selecting a date. A new feature has been added to operation details allowing users to view comprehensive information about the profiles scanned, with the ability to drill down to partitions and anomalies.
-
Link to Schedule in Operation List:
- Enhanced the operation list with a new "Schedule" column, providing direct links to the schedules triggering the operations, thus improving traceability and scheduling visibility.
-
Insights Tag Filtering Improvement:
- Enhanced the tag filtering capability on the Insights page to now include table/file-level analysis. This ensures a more granular and accurate reflection of data when using tags to filter insights.
-
Support for Incremental Scanning of Partitioned Files:
- Optimized the incremental scanning process by tracking changes at the record level rather than the last modified timestamp of the folder. This enhancement prevents the unnecessary scanning of all records and focuses on newly added data.
General Fixes
- General Fixes and Improvements
2023.11.02
Feature Enhancements
-
Auto Selection of All Fields in Check Form:
- Improved the user experience in the Check Form by introducing a "select all" option for fields. Users can now auto-select all fields when applying rules that expect a multi-select input, streamlining the process, especially for profiles with a large number of fields.
-
Enhanced Profile Operations with User-Defined Starting Points for Profiling:
- Users can now specify a value for the incremental identifier to determine the set of records that will be analyzed.
- Two new options have been added:
- Greater Than Time: Targets profiles with incremental timestamp strategies, allowing the inclusion of rows where the incremental field's value surpasses a specified time threshold.
- Greater Than Batch: Tailored for profiles employing an incremental batch strategy, focusing the analysis on rows where the incremental field’s value is beyond a certain numeric threshold.
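Conceptually, both options apply a lower bound on the incremental field; the sketch below shows the equivalent filtering logic with illustrative field names and thresholds:

```python
from datetime import datetime

# Illustrative rows with an incremental timestamp field and an incremental batch field.
rows = [
    {"updated_at": datetime(2023, 10, 1), "batch_id": 41},
    {"updated_at": datetime(2023, 11, 5), "batch_id": 57},
]

# "Greater Than Time": include rows whose incremental timestamp exceeds a time threshold.
time_threshold = datetime(2023, 11, 1)
time_scoped = [r for r in rows if r["updated_at"] > time_threshold]

# "Greater Than Batch": include rows whose incremental batch value exceeds a numeric threshold.
batch_threshold = 50
batch_scoped = [r for r in rows if r["batch_id"] > batch_threshold]

print(len(time_scoped), len(batch_scoped))   # 1 1
```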
-
Configurable Enrichment Source Record Limit in Scan Operations:
- Users can now configure the `enrichment_source_record_limit` to dictate the number of anomalous records retained for analysis, adapting to various use case necessities beyond the default sample limit of 10 per anomaly. This improvement allows for a more tailored and comprehensive analysis based on user requirements.
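How the limit is supplied depends on how the scan is triggered; the sketch below shows a hypothetical API payload carrying the setting. The endpoint path and the other payload fields are assumptions for illustration only, so consult the API reference for the actual request shape:

```python
import json
import urllib.request

# Hypothetical scan-operation payload; only `enrichment_source_record_limit` is the
# setting described above — the endpoint and remaining field names are assumptions.
payload = {
    "datastore_id": 123,
    "container_names": ["orders"],
    "enrichment_source_record_limit": 50,   # retain up to 50 anomalous records per anomaly
}

req = urllib.request.Request(
    "https://your-qualytics-instance/api/scan-operations",   # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer <token>"},
    method="POST",
)
# urllib.request.urlopen(req)   # uncomment to actually send the request
```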
-
Introduction of Passed Status in Check Card:
- A new indicative icon has been added to the Check Card to assure users of a "passed" status based on the last scan. This icon will be displayed only when there are no active anomalies.
-
Inclusion of Last Asserted Time in Check Card:
- Enhanced the Check Card by including the last asserted time, offering users more detailed and up-to-date information regarding the checks.
-
Enhanced Anomaly Search with UUID Support:
- Improved the anomaly search functionality by enabling users to search anomalies using the UUID of the anomaly, making the search process more flexible and comprehensive.
General Fixes
- General Fixes and Improvements
2023.10.27
Feature Enhancements
-
Check Creation through Field Details Page:
- Users can now initiate check creation directly from the Field Details page, streamlining the check creation process and improving usability.
-
Tree View Enhancements:
- Introduced a favorite group feature where favorite datastores are displayed in a specific section, making them quicker and easier to access.
- Added search functionalities at both Profile and Field levels to improve the navigation experience.
- Nodes now follow the default sorting of pages, creating consistency across various views.
- Enhanced the descriptions in tree view nodes for non-catalogued datastores and non-profiled profiles, providing a clearer explanation for the absence of sub-items.
-
Bulk Actions for Freshness & SLAs:
- Users can now perform bulk actions in Freshness & SLAs, enabling or disabling freshness tracking and setting or unsetting SLAs for profiles efficiently.
-
Archived Check Details Visualization:
- Enhanced the anomaly modal to allow users to view the details of archived checks in a read-only mode, improving the visibility and accessibility of archived checks’ information.
-
User Pictures as Avatars:
- User pictures have been incorporated across the application as avatars, enhancing the visual representation in user listings, teams, and anomaly comments.
-
Slide Navigation in Card Dialogs:
- Introduced a slide navigation feature in the Anomalies and Checks dialogs, enhancing user navigation. Users can now effortlessly navigate between items using navigational arrows, eliminating the need to close the dialog to view next or previous items.
General Fixes
- General Fixes and Improvements
2023.10.23
Feature Enhancements
-
Enhanced Data Asset Navigation:
- Tree View Implementation: Easily navigate through your data assets with our new organized tree view structure
- Context-Specific Actions: Access settings and actions that matter most depending on your current level of interaction.
- Simplified User Experience: This update is designed to streamline and simplify your data asset navigation and management.
-
Aggregation Comparison Check:
- New Rule Added: Ensure valid comparisons by checking the legitimacy of operators between two aggregation expressions.
- Improved Monitoring: Conduct in-depth comparisons, such as verifying if total row counts match across different source assets.
-
Efficient Synchronization for Schema Changes:
- Seamless Integration: Our system now adeptly synchronizes schema changes in source datastores with Qualytics profiles.
- Avoid Potential Errors: We reduced the risk of creating checks with fields that have been removed or altered in the source datastore.
-
Clarity in Quality Check Editors:
- Distinct Update Sources: Easily identify if an update was made manually by a user or automatically through the API.
-
Dynamic Quality Score Updates:
- Live Anomaly Status Integration: Quality Scores now reflect real-time changes based on anomaly status updates.
General Fixes
- Various bug fixes and system improvements for a smoother experience.
2023.10.13
Feature Enhancements
-
Export Metadata Enhancements:
- Added a "weight" property to the quality check asset
-
New AWS Athena Connector:
- Introduced support for a new connector, AWS Athena, expanding the options and flexibility for users managing data connections.
-
Operations List:
- Introduced a multi-select filter to the operation list, enabling users to efficiently view operations based on their status such as running, success, failure, and warning, thereby streamlining navigation and issue tracking.
General Fixes
- Logging Adjustments:
- Enhanced logging for catalog operations, ensuring that logs are visible and accessible even for catalogs with a warning status, facilitating improved tracking and resolution of issues.
- General Fixes and Improvements
2023.10.09
Feature Enhancements
-
Check Categorization:
- Introduced new check categories on the checks page to streamline UX and prioritize viewing:
- Important: Designed around a check's weight value, this category will by default comprise authored checks and inferred checks with active anomalies.
- Favorite: Featuring all user-favorited checks
- Metrics: Incorporating all metric checks
- All: Displaying all checks, whether inferred, authored, or anomalous
- The default view is set to "Important" (if available) to highlight critical checks and avoid overwhelming users
-
Anomalies Page Update:
- Revamped the Anomalies page with a simplified status filter, adopting a design in alignment with the checks page:
- Quick Status Filter: Facilitates an effortless switch between anomaly statuses.
- The "Active" tab is presented as the default, providing immediate visibility into ongoing anomalies.
-
Notification Testing:
- Enhanced the Notification Form with a "Test Notification" button, enabling users to validate notification settings before saving
-
Metadata Export to Enrichment Stores:
- Enabled users to export metadata from their datastore directly into enrichment datastores, with initial options for quality checks and field profiles.
- Users can specify which profiles to include in the export operation, ensuring relevant data transfer.
General Fixes
- General Fixes and Improvements
2023.10.04
Feature Enhancements
-
Anomalies Details User Experience:
- Implemented a "skeleton loading" feature in the Anomaly Details dialog, enhancing user feedback during data loading.
-
Enhanced Check Dialog:
- Added "Last Updated" date to the Check Dialog to provide users with additional insights regarding check modifications.
-
API Engine Control:
- Exposed a new endpoint allowing users to gracefully restart the analytics engine through the API.
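A hedged example of invoking such an endpoint is shown below; the URL path and authentication header are placeholders rather than documented values:

```python
import urllib.request

# Placeholder endpoint and token: the actual route for gracefully restarting the
# analytics engine is defined by the Qualytics API, not by this sketch.
req = urllib.request.Request(
    "https://your-qualytics-instance/api/engine/restart",   # hypothetical path
    headers={"Authorization": "Bearer <token>"},
    method="POST",
)
# urllib.request.urlopen(req)   # uncomment to issue the restart request
```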
General Fixes
- Timezone Handling on MacOS:
- Resolved an issue affecting timezone retrieval due to MacOS privacy updates, ensuring accurate timezone handling.
- Notifications and Alerts:
- Pager Duty Integration: Resolved issues preventing message sending and improved UI for easier configuration.
- HTTP Action Notification: Fixed Anomaly meta-data serialization issues affecting successful delivery in some circumstances.
- Scan Duration Accuracy:
- Adjusted scan duration calculations to accurately represent the actual processing time, excluding time between a failed scan and a successful retry.
- Spark Partitioning:
- Certain datastores may fail to properly coerce types into Spark-compatible partition column values if that column itself contains anomalous values. When this occurs, an attempt will be made to load the data without a partition column and a warning will be generated for the user.
- General Fixes and Improvements
2023.09.29
Feature Enhancements
-
Operations & Schedules UI Update:
- Redesigned the UI for the operations and schedules lists for a more intuitive UX and to provide additional information.
- Introduced pagination, filtering, and sorting for the schedules list.
- Added a "Next Trigger" column to the schedules list to inform users of upcoming schedule triggers.
- Improved Profile List Modal:
- Enhanced the profile list modal accessible from operations and schedules.
- Users can now search by both ID and profile name.
-
Check Navigation Enhancements:
- Enhanced navigation between Standard and Metric Cards by introducing direct links that allow users to access metric charts seamlessly from check forms.
- The checks page navigation state is now reflected in the URL, enhancing UX and enabling precise redirect capabilities.
-
Computed Table Enhancements:
- Upon the creation or update of a computed table, a minimalistic profile operation is now automatically triggered. This basic profile limits sampling to 1,000 records and does not infer quality checks.
- This enhancement streamlines the process when working with computed tables. Users can now directly create checks after computed table creation without manually initiating a profile operation, as the system auto-fetches required field data types.
-
Analytics Engine Enhancements:
- This release replaces our previous consistency model with a more robust one relying upon AMQP brokered durable messaging. The change dramatically improves Qualytics' internal fault tolerance with accompanying performance enhancements for common operations.
General Fixes
- Insights Filter Consistency:
- Fixed an inconsistency issue with the datastore filter that was affecting a couple of charts in Insights
- General Fixes and Improvements
2023.09.21
Feature Enhancements
-
Anomalies Modal Redesign:
- Streamlined the presentation of Failed Checks by removing the Anomalous Fields grouping. The new layout focuses on a list of Failed Checks, each tagged with the associated field(s) name, if applicable. This eliminates redundancy and simplifies the UI, making it easier to compare failed checks directly against the highlighted anomalous fields in the Source Record.
- Added the ability to filter Failed Checks by anomalous fields.
- Introduced direct links to datastores and profiles for enhanced navigation.
- Updated the tag input component for better UX.
- Removed the 'Hide Anomalous' option and replaced it with an 'Only Anomalous' option for more focused analysis.
- Included a feature to display the number of failed checks a field has across the modal.
- Implemented a menu allowing users to copy Violation messages easily.
-
Bulk Operation for Profiles:
- Extended the profile selection functionality to allow initiating bulk operations like profiling and scanning directly from the selection interface.
General Fixes
- DFS Incremental Scans:
- Addressed an issue that caused incremental scans to fail when no new files were detected on globs. Scans will now proceed without failure or warning in such cases.
- Improve performance of the Containers endpoint
- General Fixes and Improvements
2023.09.16
Feature Enhancements
-
Insights Timeframe and Grouping:
- Trend tooltips have been refined to change responsively based on the selected timeframe and grouping, ensuring that users receive the most relevant information at a glance.
-
Enhanced PDF export for Insights:
- Incorporated the selected timeframe and grouping settings into the exported PDF, ensuring that users experience consistent detail and clarity both within the application and in the exported document.
- Added a "generated at" timestamp to the PDF exports, providing traceability and context to when the data was captured, further enhancing the comprehensiveness of exported insights.
-
Source Record Display Improvements:
- The internal columns' background color has been calibrated to offer a seamless appearance in both light and dark themes.
General Fixes
-
Time Series Chart Rendering:
- Addressed an issue where the time series chart would not display data points despite having valid measurements. The core of the problem was pinpointed to how the system handled `0` values, especially when set as min and/or max thresholds.
- Resolved inconsistencies in how undefined min/max thresholds were displayed across different comparison types. While a UI indicator was previously displayed for some comparison types, it was missing for "Absolute Change" and "Absolute Value".
-
General Fixes and Improvements
2023.09.14
Feature Enhancements
-
Insights Improvements:
- Performance has been significantly optimized for smoother interactions.
- Introduced timeframe filters, allowing users to view insights data by week, month, quarter, or year.
- Introduced grouping capabilities, enabling users to segment visualizations within a timeframe, such as by days or weeks.
-
Metric Checks Enhancements:
- Introduced a new Metric Checks tab in both the datastore and explore perspectives.
- Added a Time Series Chart within the Metric Checks tab:
- Displays check measurements over time.
- Allows on-the-fly adjustments of min/max threshold values.
- Showcases enhanced check metadata including tags, active anomaly counts, and check weights.
-
Check Form Adjustments:
- Disabled the `Comparison Type` input for asserted checks
General Fixes
- Configuring Metric Checks through the Check Form:
- Resolved a bug where users were unable to clear optional inputs such as "min" or "max".
- General Fixes and Improvements
2023.09.08
Feature Enhancements
- Presto & Trino Connectors:
- We've enhanced our suite of JDBC connectors by introducing dedicated support for both Presto and Trino. Whether you're utilizing the well-established Presto or the emerging Trino, our platform ensures seamless compatibility to suit your data infrastructure needs.
General Fixes
- Incremental Scan:
- Resolved an issue where the scan operation would fail during the "Exists In Check" if there were no records to be processed.
- General Fixes and Improvements
2023.09.07
Feature Enhancements
-
Concurrent Operations:
- Introduced the ability to run multiple operations of the same type concurrently within a single datastore, even if one is yet to finish. This brings more flexibility and efficiency in executing operations
-
Autocomplete Widget:
- A hint for a shortcut has been added, allowing users to manually trigger the autocomplete widget and enhancing usability
-
Source Record Display Enhancements:
- Added a new 'Hide Anomalous' option, providing users with the choice to hide anomalous records for clearer viewing
- Transitioned from hover-based tooltips to click-activated ones for better UX
- For a consistent data presentation, internal columns will now always be displayed first
-
Check Form Improvements:
- Users now receive feedback directly within the form upon successful validation, replacing the previous toast notification method
- Additionally, for 504 validation timeouts, a more detailed and context-specific message is provided
General Fixes
- Addressed issues for 'Is Replica Of' failed checks in source record handling
- General Fixes and Improvements
2023.08.31
General Fixes
- Fixed an issue where the Source Record remediation was incorrectly displayed for all fields
- Adjusted the display of field Quality Scores and Suggestion Scores within the Source Record
- Fixed a bug in the Check Form where the field input wouldn’t display when cloning a check that hasn’t been part of a scan yet
- Resolved an issue where failed checks for shape anomalies were not receiving violation messages
2023.08.30
Feature Enhancements
-
Anomaly Dialog Updates:
- Optimized Source Data Columns Presentation: To facilitate faster identification of issues, anomalous fields are now presented first. This enhancement will prove particularly useful for data sources with a large number of columns.
- Enhanced Sorting Capabilities: Users can now sort the source record data by name, weight, and quality score, providing more flexible navigation and ease of use.
- Field Information at a Glance: A new menu box has been introduced to deliver quick insights about individual fields. Users can now view weight, quality score, and suggested remediation for each field directly from this menu box.
-
Syntax Highlighting Autocomplete Widget:
- Improved UX: The widget has been enhanced to better identify and display hint types, including distinctions between tables, keywords, views, and columns. This enhancement enriches the autocomplete experience.
General Fixes
- Check Dialog Accessibility:
- Addressed an issue where the check dialog was not opening as expected when accessed through a direct link from the profile page.
- General Fixes and Improvements
2023.08.23
Feature Enhancements
-
Profiles Page:
- Introduced two new sorting methods to provide users with more intuitive ways to explore their profiles: Sort by last profiled and Sort by last scanned.
- Updated the default sorting behavior. Profiles will now be ordered by name right from the start, rather than by their creation date.
-
Add New isNotReplicaOf Check:
- With this rule, users can assert that certain datasets are distinct and don't contain matching data, enhancing the precision and reliability of data comparisons and assertions.
-
Introduce new Metric Check
- We've added a new Metric check tailored specifically for handling timeseries data. This new check is set to replace the previous Absolute and Relative Change Checks.
- To offer a more comprehensive and customizable checking mechanism, the Metric check comes with a comparison input:
- Percentage Change: Asserts that the field hasn't deviated by more than a certain percentage (inclusive) since the last scan.
- Absolute Change: Ensures the field hasn't shifted by more than a predetermined fixed amount (inclusive) from the previous scan.
- Absolute Value: During each scan, this option records the field value and asserts that it remains within a specified range (inclusive).
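A minimal sketch of the three comparison types, assuming a previous and a current measurement and user-supplied thresholds; all values are illustrative:

```python
previous, current = 200.0, 212.0

# Percentage Change: the field must not deviate by more than X percent (inclusive).
pct_limit = 5.0
pct_change_ok = abs(current - previous) / abs(previous) * 100 <= pct_limit

# Absolute Change: the field must not shift by more than a fixed amount (inclusive).
abs_limit = 15.0
abs_change_ok = abs(current - previous) <= abs_limit

# Absolute Value: the recorded value must stay within a specified range (inclusive).
min_value, max_value = 150.0, 250.0
abs_value_ok = min_value <= current <= max_value

print(pct_change_ok, abs_change_ok, abs_value_ok)   # False True True
```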
General Fixes
-
Schema Validation:
- We've resolved an issue where the system was permitting the persistence of empty values under certain conditions for datastores and checks. This fix aims to prevent unintentional data inconsistencies, ensuring data integrity.
-
General Fixes and Improvements
2023.08.18
Feature Enhancements
-
Auditing:
- Introduced significant enhancements to the auditing capabilities of the platform, designed to provide better insights and control over changes. The new auditing features empower users to keep track of change sets across all entities, offering transparency and accountability like never before. A new activity endpoint has been introduced, providing a log of user interactions across the application.
-
Search Enhancements:
- Profiles and Anomalies lists can now be searched by both identifiers and descriptions using the same search input.
-
Catalog Operation Flow Update:
- Made a minor update to the datastore creation and catalog flow to enhance user flexibility and experience. Instead of automatically running a catalog operation post datastore creation, users now have a clearer, intuitive manual process. This change offers users the flexibility to set custom catalog configurations, like syncing only tables or views.
-
Operation Flow Error Handling:
- Enhanced user experience during failures in the Operation Flow. Along with the failure message, a "Try Again" link has been added. Clicking this link will revert to the configuration state, allowing users to make necessary edits without restarting the entire operation process.
-
Sorting Enhancements:
- Introduced new sorting options: "Completeness" and "Quality Score". These options are now available on the profiles & fields pages.
General Fixes
-
Datastore Connection Edit:
- Improved the Datastore connection edit experience, especially for platforms like BigQuery. Resolved an issue where file inputs were previously obligatory for minor edits. For instance, renaming a BigQuery Datastore no longer requires a file input, addressing this past inconvenience.
-
Pagination issues:
- Resolved an issue with paginated endpoints returning 500 instead of 422 on requests with invalid parameters.
2023.08.11
Feature Enhancements
- Insights Export: Added a new feature that allows users to export Insights directly to PDF, making it easier to share and review data insights.
- Check Form UX:
- Fields in the Check Form can now be updated if the check hasn't been used in a Scan operation, offering more flexibility to users.
- Enhanced visual cues in the form with boxed information to clarify the limitations certain properties have, depending on the state of the form.
- A new icon has been introduced to represent the number of scan operations that have utilized the check, providing users with a clearer overview.
- SLA Form UX:
- Revamped Date Time handling for enhanced time zone coverage, allowing for user-specified date time configurations based on their preferred time zone.
- Filter and Sorting:
- Added Datastore Type filter and sorting for source datastores
- Added Profile Completeness sorting, as well as profile type filtering and sorting
- Added Check search by identifier or description
General Fixes
- SparkSQL Expressions: Added support for field names with special characters in SparkSQL expressions using backticks (see the sketch after this list)
- Pagination Adjustment: The pagination limit has been fine-tuned to support a maximum of 100 items per page, improving readability and navigation.
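A minimal illustration of the backtick quoting mentioned above; the field names are invented for the example:

```python
# Illustrative only: a SparkSQL expression that references field names containing
# special characters (a pipe and a space) by quoting them with backticks.
expression = "`order|id` IS NOT NULL AND `total amount` >= 0"
print(expression)
```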
2023.08.03
Maintenance Release
- Updated enrichment sidebar details design.
- Tweaked SQL input dialog sizing.
- Fixed filter components width bug.
- Retain the start time of operation on restart.
- Fixed exclude fields to throw exceptions on errors.
- Improved performance when using DFS to load reference data.
2023.07.31
Maintenance Release
- Changed UX verbiage and iconography for Anomaly status updates.
- Fixed intermittent notification template failure.
- Fixed UI handling of certain rule types where unused properties were required.
- Improved error messages when containers are no longer accessible.
- Fixed Hadoop authentication conflicts with ABFS.
- Fixed an issue where a Profile operation run on an empty container threw a runtime exception.
2023.07.29
Feature Enhancements
- Added a NotExistsIn Check Type: Introducing a new rule type that asserts that values assigned to this field do not exist as values in another field.
- Check Authoring UI enhancements: Improved user interface with larger edit surfaces and parenthesis highlighting for better usability.
- Container Details UI enhancement: Improved presentation of container information in sidebars for easier accessibility and understanding.
- Added Check Authoring Validation: Users can now perform a dry run of the proposed check against representative data to ensure accuracy and effectiveness.
- Change in default linkage between Checks and Anomalies: Filters now default to "Active" status, providing more refined results and support for specific use cases.
2023.07.25
Feature Enhancements
- Satisfies Expression Enhancement: The Satisfies Expression feature has been upgraded to automatically bind fields referenced in the user-defined expressions, streamlining integration and improving usability.
Added Support
- Extended Support for ExistsIn Checks: The ExistsIn checks now offer support for computed tables, empowering users to perform comprehensive data validation on computed data.
General Fixes
-
Enhanced Check Referencing: Checks can now efficiently reference the full dataframe by using the alias "qualytics_self," simplifying referencing and providing better context within checks.
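For illustration only, an expression might reference the full dataframe through this alias; the field name is invented and the exact expression shapes supported depend on the check type:

```python
# Illustrative only: a SparkSQL-style expression that refers to the full dataframe
# through the "qualytics_self" alias described above. This merely shows the alias in use.
expression = "amount <= (SELECT max(amount) FROM qualytics_self)"
```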
-
Improved Shape Anomaly Descriptions: Shape anomaly descriptions now include totals alongside percentages, providing more comprehensive insights into data irregularities.
-
Fix for Computed Table Record Calculation: A fix has been implemented to ensure accurate calculation of the total number of records in computed tables, improving data accuracy and reporting.
-
Enhanced Sampling Source Records Anomaly Detection: For shape anomalies, sampling source records now explicitly exclude replacement, leading to more precise anomaly detection and preserving data integrity during analysis.
2023.07.23
Bug Fixes
- Fix for total record counts when profiling large tables
2023.07.21
Feature Enhancements
- Notification Form: Enhanced the user interface and experience by transforming the Channel and Tag inputs into a more friendly format.
- Checks & Anomalies: Updated the default Sort By criterion to be based on "Weight", enabling a more effective overview of checks and anomalies.
- Profile Details (Side Panel): Introduced a tooltip to display the actual value of the records metric, providing clearer and instant information.
- Freshness Page: Added a new navigation button that directly leads to the Profile Details page, making navigation more seamless.
- Profile Details: Introduced a settings option for the user to perform actions identical to those from the Profile Card, such as changing profile settings and configuring Checks and SLAs.
- SparkSQL Inputs: Implemented a new autocomplete feature to enhance user experience. Writing SQL queries is now more comfortable and less error-prone.
2023.07.19
General Fixes
- General Fixes and Improvements
2023.07.14
Feature Enhancements
- API enhancements
- Improved performance of our json validation through the adoption of Pydantic 2.0
- Upgraded our API specification to OpenAPI 3.1.0 compatible, this uses JSON Schema 2020-12.
- Upgraded to Spark 3.4
- Significant performance enhancements for long-running tasks and shuffles
- Added support for Kerberos authentication for Hive datastores
- Enhanced processing for large dataframes with JDBC sources
- Handle arbitrarily large tables and views by chunking into sequentially processed dataframes
- Improvements for Insights view when limited data is available
- Various user experience enhancements
Bug Fixes
- Date Picker fix for Authored Checks
- Allow tags with special characters to be edited
2023.07.03
Feature Enhancements
- Insights Made Default View on Data Explorer
- Gain valuable data insights more efficiently with the revamped Insights feature, now set as the default view on the Data Explorer.
- Reworked Freshness with Sorting and Grouping
- Easily analyze and track data freshness based on specific requirements thanks to the improved Freshness feature, now equipped with sorting and grouping functionalities.
- Enhanced Tables/Files Cards Design:
- Experience improved data analysis with the updated design of tables/files cards, including added average completeness information and reorganized identifiers.
Added Support
-
Support for Recording Sample Shape Anomalies to Remediation Tables
- Address potential data shape issues more effectively as the platform now supports recording a sample of shape anomalies to remediation tables.
-
New Metrics and Redirect to Anomalies for Profile/Scan Results
- Access additional metrics for profile/scan results and easily redirect to anomalies generated by a scan from Activity tab for efficient identification and resolution of data issues.
General Fixes
- Reduced Margin Between Form Input Fields:
- Enjoy a more compact and streamlined design with a reduced margin between form input fields for an improved user experience.
Bug Fixes
- Fixed Pagination Reset Issue During Check Updates
- Pagination will no longer reset when checks are updated, providing a smoother user experience, with reset now occurring only during filtering.
- Resolved Vertical Misalignment of Check and Anomaly Icons
- The issue causing vertical misalignment between Check and Anomaly icons on the Field Profile page has been fixed, resulting in a visually pleasing and intuitive user interface.
2023.06.24
Feature Enhancements
- Refactored Partition Reads on JDBC
- Refactored partitioned reads on JDBC to improve performance, resulting in faster and more efficient data retrieval.
Bug Fixes
-
Fixed Inputs on Change Checks
- Refined inputs on change checks to differentiate between Absolute and Relative measurements, ensuring precise detection and handling of data modifications based on numeric values (Absolute) and percentage (Relative) variations.
-
Resolved Enum Type Ordering Bug for Paginated Views
- Fixed bug causing inconsistent and incorrect sorting of enum values across all paginated views, ensuring consistent and accurate sorting of enum types.
General Fixes
- Added Success Effect
- Added effect when a datastore is configured successfully, enhancing the user experience by providing visual confirmation of a successful configuration process.
2023.06.20
Feature Enhancements
-
Reworked Tags View
- Improved the usability and visual appeal of the tags view. Added new properties like description and weight modifier to provide more detailed information and assign relative importance to tags. The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
-
Inherited Tags Support
- Implemented support for inherited tags in taggable entities. Now tags can be inherited from parent entities, streamlining the tagging process and ensuring consistency across related items. Inherited Tags will be applied to anomalies AFTER a Scan operation.
-
Added Total Data Under Management to Insights
- Introduced a new metric under Insights that displays the total data under management. This provides users with valuable insights into the overall data volume being managed within the system.
Added Support
-
Bulk Update Support
- Introduced bulk update functionality for tables, files, and fields. Users can now efficiently Tag multiple items simultaneously, saving time and reducing repetitive tasks.
-
Smart Partitioning of BigQuery
- Enabled smart partitioning in BigQuery using cluster keys. Optimized data organization within BigQuery for improved query performance and cost savings.
Bug Fixes
- Fixed Scheduling Operation Issues
- Addressed a bug causing scheduling operations to fail with invalid days in crontabs. Users can now rely on accurate scheduling for time-based tasks without encountering errors.
General Fixes
-
Improved Backend Performance
- Implemented various internal fixes to optimize backend performance. This results in faster response times, smoother operations, and an overall better user experience.
-
Enhanced Tag Input:
- Improved tag input functionality in the Check form dialog. Users can now input tags more efficiently with enhanced suggestions and auto-complete features, streamlining the tagging process.
-
Enhanced File Input Component
- Upgraded the file input component in the Datastore form dialog, providing a more intuitive and user-friendly interface for uploading files. Simplifies attaching files to data entries and improves overall usability.
2023.06.12
Feature Enhancements
- Explore is the new centralized view of Activities, Containers (Profiles, Tables, Computed Tables), Checks, Anomalies and Insights across ALL Datastores. This new view allows for filtering by Datastores & Tags, which will persist the filters across all of the submenu tabs. The goal is to help with Critical Data Elements and filter out irrelevant information.
- Enhanced Navigation Features
- The navigation tabs have been refined for increased user-friendliness.
- Enhanced the Profile View and added a toggle between card and list views.
- `Datastores` and `Enrichment Datastores` have been unified, with a tabular view introduced to distinguish between your Source Datastores and Enrichment Datastores.
- `Explore` has been added to the main navigation, and `Insights` has been conveniently relocated into the Explore submenu.
- Renamed `Tables/Files` to `Profiles` in the Datastore details page.
Added Support
-
We're thrilled to introduce two new checks, the `Absolute Change Limit` and the `Relative Change Limit`, tailored to augment data change monitoring. These checks enable users to set thresholds on their numeric data fields and monitor fluctuations from one scan to the next. If the changes breach the predefined limits, an anomaly is generated.
- The `Absolute Change Limit` check is designed to monitor changes in a field's value by a fixed amount. If the field's value changes by more than the specified limit since the last applicable scan, an anomaly is generated.
- The `Relative Change Limit` check works similarly but tracks changes in terms of percentages. If the change in a field's value exceeds the defined percentage limit since the last applicable scan, an anomaly is generated.
General Fixes
- General UI fixes with new navigational tabs
- Resolved an issue when creating a computed table
- Added the ability to delete operations along with their related results.
- Renamed "Rerun" button to "Retry" in the operation list
2023.06.02
General Fixes
- Added GCS connector with Keyfile support:
- The GCS connector now supports Keyfile authentication, allowing users to securely connect to Google Cloud Storage (a minimal keyfile sketch follows at the end of this list).
- Improved BigQuery connector by removing unnecessary inputs:
- Enhancements have been made to the BigQuery connector by streamlining the inputs, eliminating any unnecessary fields or options.
- This results in a more user-friendly and efficient experience.
- Renamed satisfiesEquation to satisfiesExpression:
- The function "satisfiesEquation" has been renamed to "satisfiesExpression" to better reflect its functionality.
- This change makes it easier for users to understand and use the function.
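To illustrate the kind of expression the renamed satisfiesExpression check evaluates, the sketch below filters rows that violate a Spark SQL expression using PySpark. The column names and expression are invented, and this is a local illustration rather than how Qualytics executes the check.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical rows: total should equal price * quantity.
df = spark.createDataFrame(
    [(10.0, 2, 20.0), (5.0, 3, 14.0)],
    ["price", "quantity", "total"],
)

# Rows where the expression is not satisfied are the candidates for anomalies.
violations = df.filter(~F.expr("total = price * quantity"))
violations.show()
```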
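For reference, the snippet below shows Keyfile authentication against Google Cloud Storage using the official Python client. The file path and bucket name are placeholders, and the connector itself is configured through the Qualytics Datastore form rather than code.

```python
from google.cloud import storage

# Authenticate with a service-account keyfile (path is a placeholder).
client = storage.Client.from_service_account_json("/path/to/keyfile.json")

# List a few objects to confirm the credentials work (bucket name is a placeholder).
for blob in client.list_blobs("my-bucket", max_results=5):
    print(blob.name)
```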
Added Support
- Added Check Description to Notification rule messages:
- Notification rule messages now include the Check Description.
- This allows users to add additional context and information about the specific rule triggering the notification and pass that information to downstream workflows.
- Added API support for tuning operations with a high correlation threshold for profiles and a high count rollup threshold for anomalies in scan:
- The API now supports tuning operations by allowing users to set a higher correlation threshold for profiles.
- It also enables users to set a higher count rollup threshold for anomalies in scan.
- This customization capability helps users fine-tune the behavior of the system according to their specific needs and preferences.
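A hedged sketch of what calling the API with these thresholds could look like is shown below. The endpoint path and JSON keys are hypothetical placeholders (consult the API reference for the real names), and the values are examples only.

```python
import requests

API = "https://your-instance.qualytics.io/api"     # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}   # token placeholder

# Hypothetical payloads; consult the API reference for the real field names.
profile_payload = {"type": "profile", "high_correlation_threshold": 0.95}
scan_payload = {"type": "scan", "high_count_rollup_threshold": 100}

for payload in (profile_payload, scan_payload):
    resp = requests.post(f"{API}/operations", headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
```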
2023.05.26
Usability
- Improved the navigation in the Activity tab's side panel for easier, more intuitive browsing, including the ability to comment directly on an anomaly.
- Added a redirect to the Activity tab when an operation is initiated for a smoother workflow.
Bug Fixes
- Resolved an issue where the date and time were not displaying correctly for the highest value in profiles.
- Fixed a problem with scheduled operations when the configured timing was corrupted.
- Addressed an issue where filtered checks were causing unexpected errors outside of the intended dataset.
2023.05.23
Feature Enhancements
- Scheduled operation editing
- Added the ability for users to edit a scheduled operation. This allows users to make changes to the schedule of an operation.
- Catalog includes filters
- Added catalog include filters to only process tables, views, or both in JDBC datastores. This allows users to control which object types are processed in the datastore.
- isReplicaOf check filters
- Added filter support to the isReplicaOf check. This allows users to control which tables are checked for replication.
- Side panel updates
- Updated side panel design and added an enrichment redirect option.
Added Support
- IBM DB2 datastore
- Added support for the IBM DB2 datastore. This allows users to connect to and process data from IBM DB2 databases.
- API support for tagging fields
- Added API support for tagging fields. This allows users to tag fields in the datastore with custom metadata.
Bug Fixes
- Freshness attempting to measure views
- Fixed an issue with freshness attempting to measure views.
- Enrichment to Redshift and string data types
- Fixed an issue with enrichment to Redshift and string data types. This issue caused enrichment to fail for tables that contained string data types.
2023.05.10
Feature Enhancements
- Container Settings
- Introducing the ability to Group fields for improved insights and profiling precision.
- Added functionality to Exclude fields from the container, allowing associated checks to be ignored during operations, leading to reduced processing time and power consumption.
- We now support identifiers on computed tables during profiling operations.
- Checks
- Improved usability by enabling quick cloning of checks within the same datastore.
- Users can now easily create a new check with minor edits to tables, fields, descriptions, and tags based on an existing check.
- Introducing the ability to write Check Descriptions to the Enrichment store, enabling better organization and management of check-related data downstream.
- Note: Updating the Enrichment store data requires a new Scan operation.
- Enhanced anomaly management by providing a convenient way to filter and view all anomalies generated by a specific check.
- Users can now access the Anomaly warning sign icon within the Check dialog, providing quick access to two options: View Anomalies and Archive Anomalies.
- Usability
- Introducing the ability to generate an API token from within the user interface.
- This can be done through the Settings > Security section, providing a convenient way to manage API authentication (a brief usage sketch follows this list).
- Added the ability to search tables/files and apply filters to running operations.
- This feature eliminates the need to rely solely on pagination, making it easier to select specific tables/files for operations.
- Included API and SparkSQL links in the documentation for easy access to additional resources and reference materials.
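Once a token has been generated under Settings > Security, it is sent as a bearer credential on API calls. The sketch below is a minimal example with a placeholder instance URL and endpoint, not an excerpt from the official API reference.

```python
import requests

# Placeholder instance URL and endpoint; the token comes from Settings > Security.
resp = requests.get(
    "https://your-instance.qualytics.io/api/datastores",
    headers={"Authorization": "Bearer <API_TOKEN>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```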
Added Support
- Hive datastore support has been added, allowing seamless integration with Hive data sources.
- Timescale datastore support has been added, enabling efficient handling of time-series data.
- Added support for HTTP(S) and SOCKS5 proxies, allowing users to configure proxy settings for data operations.
- Default encryption for RabbitMQ has been implemented, enhancing security for data transmission.
Bug Fixes
- Resolved a bug related to updating tag names, ensuring that tag name changes are properly applied.
- Fixed an overflow bug in freshness measurements for data size, resulting in accurate measurements and improved reliability.
General Fixes
- Updated default weighting for shape anomalies, enhancing the accuracy of anomaly detection and analysis.
- Increased datastore connection timeouts, improving stability and resilience when connecting to data sources.
- Implemented general bug fixes and made various improvements to enhance overall performance and user experience.
2023.04.19
We're pleased to announce the latest update, which includes enhancements to the UI for an overall better experience:
Feature Enhancements
- Added Volumetric measurements to Freshness Dashboard:
- Gain valuable insights into your data's scale and storage requirements with our new volumetric measurements. Sort by Row Count or Data Size to make informed decisions about your data resources.
- Added isReplicaOf check:
- The new isReplicaOf check allows you to easily compare data between two different tables or fields, helping you identify and resolve data inconsistencies across your datastores.
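As a rough picture of what a replica comparison asserts, the sketch below uses PySpark's exceptAll to surface rows that differ between a source table and its supposed replica. The data and column names are invented, and this is not the check's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Invented example data for a source table and its supposed replica.
source = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
replica = spark.createDataFrame([(1, "a"), (2, "x")], ["id", "value"])

# Rows present in the source but missing or different in the replica.
source.exceptAll(replica).show()
```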
Added Support
- Redesigned Checks and Anomalies listing:
- Enjoy a cleaner, more organized layout with more information that makes navigating and managing checks and anomalies even easier.
- Redesigned Anomaly Details view:
- The updated anomaly view provides a more thoughtful and organized layout.
- Improved Filter components:
- With a streamlined layout and organized categories, filtering your data is now more intuitive. Dropdown options now open to the right so that the Clear and Apply buttons remain visible.
- Updated Importance score to Weight & added SortBy support:
- Manage checks and anomalies more effectively with our updated 'Weight' feature (formerly 'Importance Score') and the new SortBy support function, allowing you to quickly identify high-priority issues.
General Fixes
- General Fixes and Performance Improvements
2023.04.07
Feature Enhancements
- We've just deployed an MVP version of the Freshness Dashboard! This feature lets you create, manage, and monitor all of the SLAs for each of your datastores and their child files/tables/containers, all in one place. It's like having a birds-eye view of how your datastores are doing in relation to their freshness.
- To access the Freshness Dashboard, just locate and click on the clock icon in the top navigation between Insights and Anomalies. By default, you'll see a rollup of all the datastores in a list view with their child files/tables/containers collapsed. Simply click on a datastore row to expand the list.
- We've also made some improvements to the UI, including more sorting and filtering options in Datastores, Files/Tables, Checks, and Anomalies. Plus, we've added the ability to search the description field in checks, making it easier to find what you're looking for.
- Last but not least, we've added a cool new feature to checks - the ability to archive ALL anomalies generated by a check. Simply click on the anomaly warning icon at the top of the check details box to bring up the archive anomalies dialog box.