User Guide
Description | User Guide for the Qualytics data quality platform |
Author(s) | Qualytics Team |
Repository | https://github.com/Qualytics/userguide |
Copyright | Copyright © 2024 Qualytics |
User Guide: Introduction to Qualytics
Qualytics is the Active Data Quality Platform that enables teams to manage data quality at scale through advanced automation. Qualytics analyzes your historic data for its shapes and patterns in order to infer contextual data quality rules that are then asserted against new data (often in incremental loads) to identify anomalies. When an anomaly is identified, Qualytics provides your team with everything needed to take corrective actions using their existing data tooling & preferred monitoring solutions.
Managing Data Quality
With Qualytics, your data teams can quickly address data issues in a proactive way by automating the discovery and maintenance of data quality measures you need.
Here's how it works:
- Analyzing Historical Data: Qualytics examines your historical data to understand its patterns and characteristics, allowing it to create rules that define good data quality.
- Finding Anomalies: These rules, combined with any rules you create yourself, are used to identify any abnormalities or inconsistencies in your historical data or new data (even when new data is added incrementally).
- Taking Corrective Actions: When an anomaly is detected, Qualytics helps your team take appropriate actions. Utilizing tags, it can send notifications through the platforms you use (such as Teams, Slack, or PagerDuty), trigger workflows in tools (like Airflow, Fivetran or Airbyte), provide additional information about the anomaly to your chosen datastore (compatible with SQL-based integrations like dbt), and even suggest the best course of action through its user interface and API.
- Continuous Monitoring and Improvement: Qualytics continuously monitors and scores your data quality. It keeps your quality checks up to date, taking into account any changes in your actual data and your business needs. This ongoing process helps improve your overall data quality and boosts trust and confidence in your organization's data.
By leveraging Qualytics, you can efficiently manage data quality, proactively address issues, and enhance trust in the data driving your organization.
Key Features
Qualytics offers a range of powerful features designed to enhance your data quality management:
- Automated Data Profiling: Qualytics leverages your existing data to automatically generate profiles for each of your data assets. These profiles provide valuable insights into your data and serve as the foundation for maintaining data quality.
- Rule Inference: Crafting and maintaining data quality rules at scale can be a daunting task. Qualytics simplifies this process by automatically inferring appropriate data quality rules based on your data profiles. This saves you time and effort while ensuring accurate anomaly detection.
- Anomaly Detection: Identifying anomalies within your data is crucial for maintaining data quality. Qualytics excels in detecting anomalies at rest and in flight throughout your data ecosystem. By highlighting outliers and irregularities, it helps you identify and address data quality issues effectively.
- Anomaly Remediation: Once anomalies are detected, Qualytics provides the necessary tools to take corrective actions. It enables you to seamlessly integrate with your preferred data tooling and initiate remediation workflows. This ensures that data outliers are addressed promptly and efficiently.
- Freshness Monitoring: Qualytics includes functionality for monitoring data freshness Service Level Agreements (SLAs). It allows you to define and track SLAs for the timeliness of data updates, ensuring that your data remains up-to-date and meets the required service level agreements.
- Insights Dashboard: Qualytics provides an intuitive executive dashboard called Insights. This dashboard gives you a holistic view of the health and quality of your data. You can easily visualize key data quality metrics, track progress, and gain actionable insights. With the executive dashboard, you can make informed decisions and drive data-driven strategies for your organization.
Seamless Integration and Deployment
Qualytics offers flexible integration options to fit your data infrastructure seamlessly:
- Deployment Options: Whether you prefer an on-premise, single-tenant cloud, or SaaS deployment, Qualytics adapts to your specific needs. It meets you where your data resides, ensuring a hassle-free integration process.
- Support for Modern & Legacy Data Stacks: Qualytics seamlessly integrates with a wide range of data platforms. From modern solutions like Snowflake and Amazon S3 to legacy systems like Oracle and MSSQL, Qualytics supports your data stack. This versatility ensures that data quality remains a priority across all your data sources.
Demo
Here is a short video demonstrating the platform with a quick walkthrough:
Embarking on Your Journey
This user guide will walk you through the key functionalities of Qualytics and provide step-by-step instructions to help you make the most of this powerful platform. Whether you are new to Qualytics or looking to deepen your understanding, this guide will be your companion in optimizing your data quality management.
Let's embark on this journey to empower your organization with accurate, reliable, and trustworthy data using Qualytics!
Getting Started ↵
Onboarding
Qualytics is a comprehensive data quality management solution, designed to help enterprises proactively manage their full data quality lifecycle at scale through automated profiling, contextual data quality checks, quality rule inference, anomaly detection, remediation, tailored notifications, and more.
This comprehensive document is designed to help enterprises get started with Qualytics, ensuring a smooth and efficient onboarding process.
Let’s get started 🚀
Onboarding Process
Qualytics onboarding begins with understanding your enterprise's requirements. Based on your data records, it offers a tailored approach to smoothly onboard you to the platform.
1. Screening & Criteria Gathering
Schedule a demo with us to help our team understand your enterprise data. During this session, the Qualytics team will create the plan to identify key success criteria and tailor the deployment to meet your specific needs, exploring relevant use cases.
2. User Invitations
Once the deployment setup is complete, Qualytics sends invitations to the provided email addresses. These invitations include instructions for accessing the platform and assigning admin or member roles based on your preferences. Admins have full access to configure and manage the platform, while members have access according to the permissions set by the admins.
Deployment Options
Qualytics offers flexible deployment options designed to fit your data infrastructure requirements seamlessly.
1. SaaS Deployment (Default)
The Software as a Service (SaaS) deployment is a fully managed service hosted by Qualytics. This option provides ease of use and minimal maintenance, allowing your team to focus on data quality management without worrying about infrastructure upkeep. SaaS deployment offers rapid scalability and seamless updates, ensuring you always have access to the latest features and improvements.
2. On-Premise Deployment
This option is ideal for organizations that prefer to keep their data within their own data centers. By deploying Qualytics on-premise, you maintain complete control over your data and its security, ensuring compliance with internal policies and regulations.
Tip
This deployment option is recommended for customers with sensitive data.
Frequently Asked Questions (FAQs)
Q 1: What type of support is provided during a POC?
A 1: A dedicated Customer Success Manager, with mandatory weekly check-ins.
Q 2: What are the deployment options for POC?
A 2: Qualytics offers deployment options for Proof of Concept (POC) primarily as a Software as a Service (SaaS) solution.
Q 3: What type of data should we use for a POC?
A 3: In most cases, potential customers use their actual data during a POC. This provides the best representation of a live instance of Qualytics. Some customers use cleaned data to remove PII or sample test data.
Q 4: Are there limitations to data size for POC?
A 4: There are no limitations to data size for a Proof of Concept (POC).
Q 5: What type of support is provided during the Onboarding process?
A 5: A dedicated Customer Success Manager, with mandatory weekly check-ins.
Q 6: What types of data stacks does Qualytics support?
A 6: Qualytics supports two types of data stacks: modern solutions and legacy systems.
- Modern Solutions: Qualytics supports modern data platforms to ensure robust data quality management across infrastructures. This includes solutions like Snowflake, Amazon S3, BigQuery, etc.
- Legacy Systems: Qualytics also integrates with legacy systems to maintain high standards of data quality across all data sources. This includes reliable and scalable relational database management systems such as MySQL, Microsoft SQL Server, etc.
To integrate these data stacks, refer to the quick start guide.
Q 7: What types of database technology can you connect in Qualytics?
A 7: Qualytics supports connecting to any Apache Spark-compatible datastore, including relational databases (RDBMS) and raw file formats such as CSV, XLSX, JSON, Avro, and Parquet.
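Because any Apache Spark-compatible source qualifies, a generic PySpark snippet gives a feel for what reading these raw file formats looks like. This is not Qualytics code; the bucket and file paths below are placeholders, and reading from object storage requires the appropriate connector libraries on the Spark classpath.

```python
# Generic PySpark reads of two of the file formats listed above.
# Paths are placeholders; this is illustrative, not Qualytics internals.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-file-example").getOrCreate()

csv_df = spark.read.option("header", True).csv("s3a://example-bucket/landing/orders.csv")
parquet_df = spark.read.parquet("s3a://example-bucket/landing/orders.parquet")

csv_df.printSchema()
parquet_df.printSchema()
```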
Q 8: What is an enrichment datastore?
A 8: An Enrichment Datastore is a user-managed storage location where the Qualytics platform records and accesses metadata through a set of system-defined tables. It is purpose-built to capture metadata generated by the platform's profiling and scanning operations.
Q 9: Can I download my metadata and data quality checks?
A 9: Yes, Qualytics's metadata export feature is specifically designed to capture the mutable states of various data entities. This functionality enables the export of Quality Checks, Field Profiles, and Anomalies metadata from selected profiles into a designated enrichment datastore.
Q 10: How is the Quality Score calculated?
A 10: Quality Scores are measures of data quality calculated at the field, container, and datastore level and recorded as a time series enabling you to track movement over time. A quality score ranges from 0-100, with higher scores indicating higher quality.
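The exact scoring model is internal to Qualytics, but as a rough mental model, field-level scores can be imagined rolling up into container and datastore scores via a weighted average, where weights reflect importance. The sketch below is purely illustrative; the scores and weights are made up and this is not the actual Qualytics formula.

```python
# Illustrative only: a simplified weighted roll-up of field scores (0-100)
# into a container score. Not the actual Qualytics scoring formula.
def weighted_score(scores_and_weights):
    """Aggregate (score, weight) pairs into a single 0-100 score."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scores_and_weights) / total_weight

# Hypothetical field-level scores for one container, weighted by importance.
field_scores = [(98.0, 1), (75.0, 3), (100.0, 1)]
print(f"Container quality score: {weighted_score(field_scores):.1f}")
```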
Q 11: What is a catalog operation?
A 11: The Catalog Operation is run on a datastore to import the named collections (e.g., tables, views and files) of data available within it. The operation will also attempt to automatically identify the best way to support:
- Incremental scanning
- Data partitioning
- Record identification
Q 12: What is a profiling operation?
A 12: A Profile Operation will analyze every available record in all available containers in a datastore. Full Profiles provide the benefit of generating metadata with 100% fidelity at the cost of maximum compute time.
Q 13: What is a scan operation?
A 13: The Scan Operation is executed on a datastore to assert the data quality checks defined for the named collections of data (e.g., tables, views and files) within it. The operation will:
- Produce a record anomaly for any record where anomalous values are detected.
- Produce a shape anomaly for anomalous values that span multiple records.
- Record the anomaly data along with related analysis in the associated Enrichment Datastore.
Quick Start Guide
This guide is designed to help you quickly get started with using Qualytics. From onboarding to the platform and signing in to configuring your first datastore and performing essential operations, this quick start guide will walk you through every step of the process to ensure you can effectively utilize the powerful features of Qualytics.
Let's get started 🚀
Onboarding
The Qualytics onboarding process ensures a smooth setup and deployment tailored to your needs. This streamlined process ensures that your environment is set up according to your specifications, facilitating a quick and efficient start with Qualytics.
1. Screening and Criteria Gathering
Qualytics conducts a screening to determine if sample data (e.g. customer records), should be included and gathers the primary customer success criteria for the new environment, such as exploring specific use cases.
2. Deployment URL Creation
Based on the domain name (DNS) information you provide, Qualytics creates and provides a customized deployment URL.
3. Cloud Provider and Region Selection
The Qualytics team will ask for your preferred cloud providers and deployment region to finalize the deployment and go-live process.
4. User Invitations
After deployment, Qualytics sends invitations to the provided email addresses, assigning admin or member roles based on your preferences.
Signing In
After onboarding to Qualytics, you will receive login credentials to access the Qualytics dashboard.
Method 1: Using Sign-in Credentials
This method outlines an approach for customers who have not yet integrated their Identity Provider (IdP), thereby not benefiting from Single Sign-On (SSO). Typically, this approach is used during a trial period or Proof of Concept (POC). Once the customer transitions to a paid plan, they generally move to an SSO configuration for enhanced security and convenience.
For instance, the step involving the invitation link (as explained in the onboarding process above) is predominantly associated with this method (Using Sign-in Credentials), which relies on standard email and password credentials.
This allows users to access the system without the need for an integrated IdP during the initial trial phase. This approach is intended to provide ease of access and evaluate the platform's capabilities before committing to full integration with their IdP for SSO.
Once the customer completes onboarding through the invitation link sent to their email, credentials are produced that can be used to sign in to your Qualytics account and access the dashboard.
Method 2: Qualytics SSO
With SSO (Single Sign-On), you can access Qualytics more quickly and conveniently without having to go through separate authentication processes for each session.
Most customers will have their own SSO integration. Typically, the login screen will display two buttons:
- Qualytics SSO: Intended for use by Qualytics employees to provide support to customers.
- Customer SSO: Used by the organization's users, leveraging their own SSO for seamless access.
Datastore
Adding a datastore in Qualytics builds a symbolic link to the location where your data is stored, such as a database or file system. During operations like cataloging, profiling, and scanning, Qualytics reads data from this source location you connect to the platform.
Additionally, Qualytics supports an "Enrichment Datastore," used solely for writing metadata. Even though Qualytics writes to this location, it is still managed by the user, ensuring full control over their data.
When the user is authenticated, the Qualytics onboarding screen appears, where you can click and add your datastore to the Qualytics platform.
Configuring Datastores
Qualytics allows you to configure the datastore according to where your data is stored. These datastores are categorized into two types based on their characteristics:
1. JDBC Datastores
JDBC datastores are relational databases that support connectivity through the JDBC API, providing universal data access and integration with relational databases.
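For orientation, the snippet below shows what generic JDBC connectivity looks like from Apache Spark, which is the kind of access a JDBC datastore provides. It is not Qualytics configuration; the URL, credentials, and table name are placeholders, and the relevant JDBC driver must be available on the Spark classpath.

```python
# A generic PySpark JDBC read, shown only to illustrate JDBC connectivity.
# Hostname, credentials, and table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/analytics")  # JDBC URL
    .option("dbtable", "public.orders")                                # table (container) to read
    .option("user", "readonly_user")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")                         # driver jar must be on the classpath
    .load()
)
df.printSchema()
```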
Here is a list of available JDBC datastores you can add and configure in Qualytics:
REF | ADD DATASTORE | DESCRIPTION |
---|---|---|
1. | BigQuery | A fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. |
2. | Databricks | A unified analytics platform that stores your data to accelerate innovation by unifying data science, engineering, and business. |
3. | DB2 | It is an IBM database known for its scalability, performance, and availability, primarily used for large enterprises. |
4. | Hive | A data warehouse infrastructure built on top of Hadoop for providing data query and analysis. |
5. | MariaDB | An open-source relational database management system, a fork of MySQL, renowned for its performance and reliability. |
6. | Microsoft SQL Server | A relational database management system developed by Microsoft, offering a wide range of data tools and services. |
7. | MySQL | An open-source relational database management system widely used for web applications and various other uses. |
8. | Oracle | A multi-model database management system widely used for running online transaction processing and data warehousing. |
9. | PostgreSQL | An advanced, open-source relational database known for its robustness, extensibility, and standards compliance. |
10. | Presto | A distributed SQL query engine for big data, allowing users to run interactive analytic queries against data sources. |
11. | Amazon RedShift | A fully managed data warehouse service in the cloud, designed to handle large-scale data sets and analytics. |
12. | Snowflake | A cloud-based data warehousing solution that provides data storage, processing, and analytics. |
13. | Synapse | An analytics service that brings together big data and data warehousing. |
14. | Timescale DB | A relational database for time-series data, built on PostgreSQL. |
15. | Trino | A distributed SQL query engine for big data, designed to query large data sets across multiple data sources. |
2. DFS (Distributed File Systems) Datastores
A distributed file system datastore manages files and directories across different servers, designed for scalability and high availability in Qualytics.
Here is a list of available DFS datastores you can configure in the Qualytics platform:
REF | ADD DATASTORE | DESCRIPTION |
---|---|---|
1. | Amazon S3 | A scalable object storage service from Amazon Web Services, used for storing and retrieving any amount of data. |
2. | Azure Blob Storage | A Microsoft Azure service for storing large amounts of unstructured data, such as text or binary data. |
3. | Azure DataLake Storage | A scalable data storage and analytics service from Microsoft Azure designed for big data analytics. |
4. | Google Cloud Storage | A scalable, fully managed object storage service for unstructured data in Google Cloud. |
5. | Qualytics File System (QFS) | A custom file system designed by Qualytics for optimized data storage and retrieval within the platform. |
Operations
When you configure and add your datastore in the Qualytics platform, you will be redirected to the data assets section where you can perform data operations to analyze metadata, gather statistics, and create profiles. These operations help identify data fitness and anomalies to improve data quality through feedback loops. The operations are categorized as follows.
1. Catalog Operation
The Catalog operation involves systematically collecting data structures along with their corresponding metadata. This process also includes a thorough analysis of the existing metadata within the datastore. This ensures a solid foundation for the subsequent Profile and Scan operations.
2. Profile Operation
The Profile operation enables training of the collected data structures and their associated metadata values. This is crucial for gathering comprehensive aggregating statistics on the selected data, providing deeper insights, and preparing the data for quality assessment.
3. Scan Operation
The Scan operation asserts rigorous quality checks to identify any anomalies within the data. This step ensures data integrity and reliability by recording the analyzed data in your configured enrichment datastore, facilitating continuous data quality improvement.
Checks & Rules
Checks and rules are essential components for maintaining data quality in Qualytics. A check encapsulates a data quality rule along with additional contexts such as tags, filters, and tolerances. Rules define the criteria that data must meet, and checks enforce these rules to ensure data integrity.
In Qualytics, you will come across two types of checks:
1. Inferred Checks
Qualytics automatically generates inferred checks during a Profile operation. These checks typically cover 80-90% of the rules needed by users. They are created and maintained through profiling, which involves statistical analysis and machine learning methods.
2. Authored Checks
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
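To give a sense of what an authored rule can express, the sketch below evaluates a hypothetical Spark SQL predicate against a small DataFrame and surfaces the rows that violate it. The rule text and column names are made up for illustration; in practice, checks are authored through the Qualytics UI or API rather than in a script like this.

```python
# A sketch of the kind of Spark SQL predicate an authored check might express.
# The rule and data are hypothetical; checks are authored in the Qualytics UI or API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("check-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, 120.0, "USD"), (2, -5.0, "USD"), (3, 80.0, None)],
    ["order_id", "amount", "currency"],
)

# Example rule: amount must be positive and currency must be populated.
rule = "amount > 0 AND currency IS NOT NULL"

anomalies = orders.filter(~F.expr(rule))  # rows that violate the rule
anomalies.show()
```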
Explore
The Explore dashboard helps manage and analyze your data effectively. It includes several sections, each offering specific functionalities:
1. Insights
The Insights section provides an overview of anomaly detection and comprehensive data monitoring options. You can fine-tune the view by source datastores, tags, and dates. You can also check the profile data, applied checks, quality scores, and records scanned for your connected source datastores.
2. Activity
The Activity section offers an in-depth look at activities across source datastores. It includes a heatmap to visualize the daily volume of operations, along with any detected anomalies.
3. Profiles
The Profiles section provides a unified view of all containers under one roof including:
- Tables
- Views
- Computed Tables
- Computed Files
- Fields
It also offers search, sort, and filter functionality to help you efficiently find what you need.
4. Checks
The Checks section provides an overview of all applied checks, including both inferred and authored checks, across all source datastores. This allows you to monitor and manage the rules ensuring your data quality.
5. Anomalies
The Anomalies section gives an overview of all detected anomalies across your source datastores. This helps in quickly identifying and addressing any issues.
Library
The library dashboard offers various options for managing check templates and editing applied checks in your configured source datastores. It includes the following functionalities:
1. Add Check Templates
Easily add new check templates to manage and apply standardized checks across different source datastores efficiently.
2. Export Check Templates
The export feature is an operation that writes metadata to a specified Enrichment datastore. In the case of “Check Templates”, Qualytics will write check template metadata to the selected Enrichment datastore from the dropdown list.
Settings
This section allows you to manage global configurations. You can configure various settings, as explained below:
1. Tags
Tags allow users to categorize and organize entities effectively, while also providing the ability to assign weights for prioritization. They can drive notifications and downstream workflows, and users can configure tags, associate notifications with tags, and associate tags with specific properties.
2. Notifications
You can set up notifications for when an operation is completed (e.g., catalog, profile, or scan) or when anomalies are identified.
3. Connection
Delete, edit, or add new datastore sources, ensuring efficient management of your configured datastores.
4. Integration
Configure necessary parameters to integrate external tools with the Qualytics dashboard.
5. Security
Delete, edit, or add new teams, and assign roles to users for better access control and management.
6. Tokens
Create tokens to enable secure and direct interaction with the Qualytics API.
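As a rough illustration of how a token is used, the snippet below sends an authenticated request to a deployment's API with Python's requests library. The endpoint path and the bearer-token header are assumptions made for the example; consult your deployment's API documentation (https://<your-deployment>.qualytics.io/api/docs) for the actual routes and authentication details.

```python
# A minimal sketch of calling the Qualytics API with a token created under
# Settings -> Tokens. The route and bearer-token auth shown here are assumptions;
# see your deployment's /api/docs for the real API.
import requests

BASE_URL = "https://acme.qualytics.io/api"   # your deployment's URL
TOKEN = "YOUR_API_TOKEN"

response = requests.get(
    f"{BASE_URL}/datastores",                # hypothetical example route
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```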
7. Health
This page provides an easy way to monitor the health of the Qualytics deployment while also providing the option to restart the Analytics engine.
Technical Quick Start Guide
Accessing your Qualytics Deployment
Each Qualytics deployment is a single-tenant, dedicated cloud instance, configured per requirements discussed with your organization. Therefore, your deployment will be accessible from a custom URL specific to your organization.
For example, ACME's Qualytics deployment might be published at https://acme.qualytics.io, and the corresponding API documentation would be available at https://acme.qualytics.io/api/docs. Note that your specific credentials and URL will be provided to you automatically by email.
Tip
Please check your spam folder if you don’t see the invite.
After you've obtained access to your deployment, you'll want to:
- Connect a Datastore
- Initiate a profiling on the source datastore by running a Profile Operation. This step will automatically infer a set of data quality checks from your data.
- Assert those checks to detect data anomalies
Connecting a Datastore
The first step of configuring a Qualytics instance is to Add Source Datastore. In order to add a Source Datastore via Qualytics, you need to select the specific Connector. This is necessary so that the appropriate form for collecting connection details can be rendered.
As you provide the required connection details, the UI verifies network connectivity and indicates whether the combination is accessible. This feature assists in diagnosing any network routing restrictions.
While configuring the connection, you'll also come across an option to automatically trigger an asynchronous Catalog operation upon successful Datastore creation.
Once the connection details are confirmed, the "Add Datastore" process moves to a second optional but strongly recommended step: the configuration of an Enrichment Datastore.
The Enrichment Datastore serves as a location to record enrichment data (anomalies and metadata for a Source Datastore). This is a crucial step as it significantly enhances Qualytics's ability to detect anomalies.
Warning
While it is optional, not setting an Enrichment Datastore may limit Qualytics's features.
In this step, you'll have two options:
- Configure a new Enrichment Datastore
- Use the dropdown list to select an existing Enrichment Datastore that was previously configured
The process of configuring a new Enrichment Datastore is similar to that of a Source Datastore, with one key difference: the connection details you provide must have the ability to write enrichment data.
Note
If you don't have a specific location to store these results, you can request the QFS (Qualytics File System) connector provided by Qualytics for this purpose.
During the Source Datastore and Enrichment Datastore configuration steps, you'll find an option to Test Connection. This initiates a synchronous operation that verifies whether the indicated Datastore can be appropriately accessed from the Compute Daemon:
- If the operation is successful, you can proceed with the configuration. Any issues during this Test Connection process will result in an error message being displayed on the current step of the form, be it the Source Datastore or Enrichment Datastore step.
Note
If any future operation fails to establish a connection with the Datastore, the UI will provide warnings to guide you in resolving the connectivity issues.
Generate a Profile
The majority of a data scientist's work revolves around upfront curation of data, which involves taking steps to determine which type of ML modeling might be beneficial for the given data. Our Data Compute Daemon begins this process much like a data scientist initiating a new modeling effort. It profiles customer data through systematic computational analysis, executed in a fully automated and scalable manner.
We track a wide range of metadata in addition to standard field metadata, which includes:
- `type`
- `min` / `max`
- `min_length` / `max_length`
- `completeness` / `sparsity`
- `histograms` (ratios of values)
- Our Compute Daemon also calculates more sophisticated statistical measures such as `skewness`, `kurtosis`, and `pearson correlation coefficients`.
While these numerical analysis techniques do not strictly fall under Machine Learning, their role in statistical analysis and preparatory work is indispensable. They facilitate ML by generating appropriate and statistically relevant data quality rules at scale.
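For readers who want to connect these terms to something concrete, the toy example below computes the same kinds of statistics with pandas and SciPy. It is only an illustration of the measures named above, not the Compute Daemon's implementation.

```python
# Toy illustration of the profiling statistics mentioned above,
# computed with pandas/SciPy (not the Qualytics Compute Daemon).
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "amount":   [10.0, 12.5, 11.0, 13.2, 250.0],   # contains one extreme value
    "quantity": [1, 2, 1, 2, 40],
})

print("min/max:      ", df["amount"].min(), "/", df["amount"].max())
print("completeness: ", df["amount"].notna().mean())        # fraction of non-null values
print("skewness:     ", stats.skew(df["amount"]))
print("kurtosis:     ", stats.kurtosis(df["amount"]))
print("pearson r:    ", stats.pearsonr(df["amount"], df["quantity"])[0])
```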
One crucial part of profiling is identifying column data types, which is typically a tedious task in large tables. A machine learning model trained to infer the data types and properties can help accelerate this task by automatically identifying key phrases and linking them to commonly associated attributes.
Upon completing the initial profiling and metadata generation, the Qualytics Inference Engine carries out ML via various learning methods such as inductive and unsupervised learning. The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints. The inferred constraints are then filtered by testing them against the held-out testing set & only those that assert true are converted to data quality Checks.
Two concrete examples of sophisticated rule types automatically inferred at this stage are:
- the application of a robust normality test: this is applied to each numeric field to discover whether certain types of anomaly checks are applicable & bases its quality check recommendations upon that learning.
- the generation of linear regression models: this is automatically generated to fit any highly correlated fields in the same table. If a good fit model is identified, it's recorded as a predicting model for those correlated fields and used to identify future anomalies.
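The sketch below approximates these two ideas on toy data with SciPy and scikit-learn: a normality test gates whether normality-based anomaly checks make sense, and a regression fit between correlated fields flags records with large residuals. The thresholds and models are illustrative assumptions, not the Inference Engine's actual logic.

```python
# Illustrative approximations of the two inferred rule types described above.
# Thresholds and models are assumptions, not the Qualytics Inference Engine.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 1. Robust-normality idea: only recommend normality-based checks if the field
#    actually looks normally distributed.
field = rng.normal(loc=100, scale=5, size=500)
_, p_value = stats.normaltest(field)
print("normality-based checks applicable?", p_value > 0.05)

# 2. Linear-regression idea: fit highly correlated fields and treat large
#    residuals on future records as potential anomalies.
x = rng.uniform(0, 100, size=500).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=2.0, size=500)
model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
threshold = 3 * residuals.std()
print("r^2:", round(model.score(x, y), 3))
print("rows flagged:", int((np.abs(residuals) > threshold).sum()))
```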
Now that you have a deeper understanding of how our profiling operation works, you're ready to take action. To initiate a Profile Operation, navigate to the details of the specific source datastore you've created. There, you'll find a step to start the Profile Operation.
Initiating and Reviewing a Scan for Anomalies
After the initial Profile Operation is complete, you can start a Scan Operation. By default, Qualytics initiates a Full Scan for the first operation. This comprehensive scan establishes a baseline for generating Quality Scores and facilitates the validation of all defined checks.
As the Scan Operation progresses, you can monitor its status in real-time. If you choose, you can set up in-app notifications to alert you when the operation is complete, whether you're currently signed in or you log back in later.
Upon completion of the Scan operation, you can review the following data points:
- `Start time` and `Finish Time` of the operation
- `Total counts` for all scanned Containers, including:
    - Records scanned
    - Anomalies detected
    - Total Records
Info
Any issues, such as the failure to scan any Container of the source datastore, will be indicated, along with suggestions on how to address the issue, such as assigning an identifier or setting a record limit.
Dashboard
Upon signing in to Qualytics, users are greeted with a thoughtfully designed dashboard that offers intuitive navigation and quick access to essential features and datasets, ensuring an efficient and comprehensive data quality management experience.
In this documentation, we will explore every component of the Qualytics dashboard.
Let’s get started 🚀
Global Search
The Global Search feature in Qualytics is designed to streamline the process of finding crucial assets such as Datastores, Containers, and Fields. This enhancement provides quick and precise search results, significantly improving navigation and user interaction. By entering keywords in the search bar located at the top of the dashboard, users can efficiently locate specific data elements, facilitating better data management and access. This functionality is especially useful for large datasets, ensuring users can swiftly find the information they need without navigating through multiple layers of the interface.
Tip
Press the shortcut key: Ctrl+K for quick access to Global Search
In-App Notifications
In-app notifications in Qualytics are real-time alerts that keep users informed about various events related to their data operations and quality checks. These notifications are displayed within the Qualytics interface and cover a range of activities, including operation completions and anomaly detections.
Discover
The Discover option in Qualytics features a dropdown menu that provides access to various resources and tools to help users navigate and utilize the platform effectively. The menu includes the following options:
Resources:
- User Guide: Opens the comprehensive user guide for Qualytics, which provides detailed instructions and information on how to use the platform effectively.
- SparkSQL: Directs users to resources or documentation related to using SparkSQL within the Qualytics platform, aiding in advanced data querying and analysis.
API:
- Docs: Opens the API documentation, offering detailed information on how to interact programmatically with the Qualytics platform. This is essential for developers looking to integrate Qualytics with other systems or automate tasks.
- Playground: Provides access to an interactive environment where users can test and experiment with API calls. This feature is particularly useful for developers who want to understand how the API works and try out different queries before implementing them in their applications.
Support:
- Qualytics Helpdesk: Qualytics Helpdesk provides users with access to a support environment where they can get assistance with any issues or questions related to the platform.
Theme
Qualytics offers both dark mode and light mode to enhance user experience and cater to different preferences and environments.
Light Mode:
- This is the default visual theme of Qualytics, featuring a light background with dark text.
- It provides a clean and bright interface, which is ideal for use in well-lit environments.
- To switch from dark mode to light mode, click the Light Mode button.
Dark Mode:
- Dark mode features a dark background with light text, reducing eye strain and glare, especially in low-light environments.
- It is designed to be easier on the eyes during prolonged usage and can help save battery life on devices.
- To activate dark mode, click the Dark Mode button.
View Mode
In Qualytics, users have the option to switch between two display modes: List View and Card View. These modes are available on the Source Datastore page, Enrichment Datastore page, and Library page, allowing users to choose their preferred method of displaying information.
- List View: List View arranges items in a linear, vertical list format. This mode focuses on providing detailed information in a compact and organized manner. To activate List View, click the "List View" button (represented by an icon with three horizontal lines) located at the top of the page.
- Card View: Card View displays items as individual cards arranged in a grid. Each card typically includes a summary of the most important information about the item. To switch to Card View, click the "Card View" button (represented by an icon with a grid of squares) located at the top of the page.
User Profile
The user profile section in Qualytics provides essential information and settings related to the user's account. Here's an explanation of each element:
- Name: Displays the user's email address used as the account identifier.
- Role: Indicates the user's role within the Qualytics platform (e.g., Admin), which defines their level of access and permissions.
- Teams: Shows the teams to which the user belongs (e.g., Public), helping organize users and manage permissions based on group membership.
- Preview Features: A toggle switch that enables or disables preview features. When turned on, it adds an AI Readiness Benchmark for the Quality Score specifically on the Explore page.
- Logout: A button that logs the user out of their Qualytics account, ending the current session and returning them to the login page.
- Version: Displays the current version of the Qualytics platform being used, which is helpful for troubleshooting and ensuring compatibility with other tools and features.
Navigation Menu (Left Sidebar)
The left sidebar of the dashboard contains the primary navigation menu, which allows users to quickly access various functionalities of the Qualytics platform. The menu items include:
Source Datastores (Default View)
Lists all the source datastores connected to Qualytics in the left sidebar. Also provides the option to:
- Add a new source datastore
- Search from existing source datastores
- Sort existing datastores based on the name, records, checks, etc.
- Filter source datastores
Enrichment Datastores
Lists all the enrichment datastores connected to Qualytics in the left sidebar. Also provides the option to:
- Add an enrichment datastore
- Search from existing enrichment datastores
- Sort existing datastores based on the name, records, checks, etc.
Explore
The Explore dashboard in Qualytics enables effective data management and analysis through several key sections:
- Insights: Offers an overview of anomaly detection and data monitoring, allowing users to filter by source datastores, tags, and dates. It displays profile data, applied checks, quality scores, records scanned, and more. Moreover, you can also export the insight reports into a PDF format.
- Activity: Provides a detailed view of operations (catalog, profile, and scan) across source datastores with a heatmap to visualize daily activities and detected anomalies.
- Profiles: Unifies all containers, including tables, views, computed tables, computed files, and fields, with search, sort, and filter functionalities.
- Observability: Observability gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric.
- Checks: Shows all applied checks, both inferred and authored, across source datastores to monitor and manage data quality rules.
- Anomalies: Lists all detected anomalies across source datastores for quick identification and resolution of issues.
Library
The library dashboard allows for managing check templates and editing applied checks in source datastores with two main functionalities:
- Add Check Templates: Easily add new templates to apply standardized checks across datastores.
- Export Check Templates: Export template metadata to a specified Enrichment datastore.
Tip
You can also search, sort, and filter checks across the source datastores
Tags
Tags help users organize and prioritize data assets by categorizing them. They can be applied to Datastores, Profiles, Fields, Checks, and Anomalies, improving data management and workflows.
Notification Rules
Qualytics allows users to set up notification rules with specific triggers and channels, ensuring timely alerts for critical events. Notifications can be delivered through multiple channels, including in-app, email, Slack, Microsoft Teams, and more, helping users stay informed and manage data quality issues in real time.
Global Settings
Manage global configurations with the following options:
- Connection: Manage datastore sources (add, edit, delete).
- Integration: Configure parameters for integrating external tools.
- Security: Manage teams, roles, and user access.
- Tokens: Create tokens for secure API interactions.
- Health: Monitor and restart the Qualytics deployment.
Anomaly
Something that deviates from the standard, normal, or expected. This can be in the form of a single data point, record, or a batch of data
Accuracy
The data represents the real-world values they are expected to model.
Catalog Operation
Used to read fundamental metadata from a Datastore required for the proper functioning of subsequent Operations such as Profile, Hash and Scan
Comparison
An evaluation to determine if the structure and content of the source and target Datastores match
Comparison Runs
An action to perform a comparison
Completeness
Required fields are fully populated.
Conformity
Alignment of the content to the required standards, schemas, and formats.
Connectors
Components that can be easily connected to and used to integrate with other applications and databases. Common uses include sending and receiving data.
Info
We can connect to any Apache Spark accessible datastore. If you have a datastore we don’t yet support, talk to us! We currently support: Files (CSV, JSON, XLSX, Parquet) on Object Storage (S3, Azure Blob, GCS); ETL/ELT Providers (Fivetran, Stitch, Airbyte, Matillion – and any of their connectors!); Data Warehouses (BigQuery, Snowflake, Redshift); Data Pipelining (Airflow, DBT, Prefect), Databases (MySQL, PostgreSQL, MSSQL, SQLite, etc.) and any other JDBC source
Consistency
The value is the same across all datastores within the organization.
Container (of a Datastore)
The uniquely named abstractions within a Datastore that hold data adhering to a known schema. The Containers within an RDBMS are tables, the containers in a filesystem are well-formatted files, etc.
Data-at-rest
Data that is stored in a database, warehouse, file system, data lake, or other datastore.
Data Drift
Changes in a data set’s properties or characteristics over time.
Data-in-flight
Data that is on the move, transporting from one location to another, such as through a message queue, API, or other pipeline
Data Lake
A centralized repository that allows you to store all your structured and unstructured data at any scale.
Data Quality
Ensuring data is free from errors, including duplicates, inaccuracies, inappropriate fields, irrelevant data, missing elements, non-conforming data, and poor data entry.
Data Quality Check
aka "Check" is an expression regarding the values of a Container that can be evaluated to determine whether the actual values are expected or not.
Datastore
Where data is persisted in a database, file system, or other connected retrieval systems. You can check more in Datastore Overview.
Data Warehouse
A system that aggregates data from different sources into a single, central, consistent datastore to support data analysis, data mining, artificial intelligence (AI), and machine learning.
Distinctness (of a Field)
The fraction of distinct values (appear at least once) to total values that appear in a Field
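A quick worked example of this definition (illustrative values only):

```python
# Distinctness as defined above: distinct values divided by total values.
values = ["a", "b", "a", "c", "b", "a"]
distinctness = len(set(values)) / len(values)
print(distinctness)  # 3 distinct values / 6 total = 0.5
```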
Enrichment Datastore
Additional properties that are added to a data set to enhance its meaning. Qualytics enrichment includes whether a record is anomalous, what caused it to be an anomaly, what characteristics it was expected to have, and flags that allow other systems to act upon the data.
Favorite
Users can mark instances of an abstraction (Field, Container, Datastore, Check, Anomaly, etc..) as a personalized favorite to ensure it ranks higher in default ordering and is prioritized in other personalized views & workflows.
Compute Daemon
An application that protects a system from contamination due to inputs, reducing the likelihood of contamination from an outside source. The Compute Daemon will quarantine data that is problematic, allowing the user to act upon quarantined items.
Incremental Identifier
A Field that can be used to group the records in the Table Container into distinct ordered Qualytics Partitions in support of incremental operations upon those partitions:
- a whole number - then all records with the same partition_id value are considered part of the same partition
- a float or timestamp - then all records between two defined values are considered part of the same partition (the defining values will be set by incremental scan/profile business logic)

Since Qualytics Partitions are required to support Incremental Operations, an Incremental Identifier is required for a Table Container to support incremental Operations.
Incremental Scan Operation
A Scan Operation where only new records (inserted since the last Scan Operation) are analyzed. The underlying Container must support determining which records are new for incremental scanning to be a valid option for it.
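Conceptually, an incremental scan only reads records past the previously processed high-water mark of the incremental identifier. The sketch below illustrates that idea with a hypothetical table and column; it is not how Qualytics composes its queries internally.

```python
# Sketch of the incremental idea: read only records newer than the last
# processed high-water mark. Table and column names are hypothetical.
last_high_water_mark = "2024-06-01 00:00:00"   # persisted after the previous scan

incremental_query = f"""
    SELECT *
    FROM sales.orders
    WHERE updated_at > '{last_high_water_mark}'   -- the incremental identifier
"""
print(incremental_query)
```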
Inference Engine
After Compute Daemon gathers all the metadata generated by a profiling operation, it feeds that metadata into our Inference Engine. The inference engine then initiates a "true machine learning" (specifically, this is referred to as Inductive Learning) process whereby the available customer data is partitioned into a training set and a testing set. The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints. Those inferred constraints are then filtered by testing them against the held out testing set & only those that assert true are converted to inferred data quality Checks.
Metadata
Data about other data, including descriptions and additional information.
Object Storage
A type of data storage used for handling large amounts of unstructured data managed as objects.
Operation
The asynchronous (often long-running) tasks that operate on Datastores are collectively referred to as "Operations." Examples include Catalog, Profile, Hash, and Scan.
Partition Identifier
A Field that can be used by Spark to group the records in a Dataframe into smaller sets that fit within our Spark worker’s memory. The ideal Partition Identifier is an Incremental Identifier of type datetime, since that can serve as both, but we identify alternatives should that not be available.
Pipeline
A workflow that processes and moves data between systems.
Precision
Your data is at the resolution that is expected: how tightly can you define your data?
Profile Operation
An operation that generates metadata describing the characteristics of your actual data values.
Profiling
The process of collecting statistics on the characteristics of a dataset involving examining, analyzing, and reviewing the data.
Proprietary Algorithms
A procedure utilizing a combination of processes, tools, or systems of interrelated connections that are the property of a business or individual in order to solve a problem.
Quality Score
A measure of data quality calculated at the Field, Container, and Datastore level. Quality Scores are recorded as time-series enabling you to track movement over time. You can read more in Quality Scoring.
Qualytics App
aka "App" this is the user interface for our Product delivered as a web application
Qualytics Deployment
A single instance of our product (the k8s cluster, postgres database, hub/app/compute daemon pods, etc…)
Qualytics Compute Daemon
aka "Compute Daemon" this is the layer of our Product that connects to Datastores and directly operates on users’ data.
Qualytics Implementation
A customer’s Deployment plus any associated integrations
Qualytics Surveillance Hub
aka "Hub" this is the layer of our Product that exposes an Application Programming Interface (API).
Qualytics Partition
The smallest grouping of records that can be incrementally processed. For DFS datastores, each file is a Qualytics Partition. For JDBC datastores, partitions are defined by each table’s incremental identifier values.
Record (of a Container)
A distinct set of values for all Fields defined for a Container (e.g. a row of a table)
Schema
The organization of data in a datastore. This could be the columns of a table, the header of a CSV file, the fields in a JSON file, or other structural constraints.
Schema Differences
Differences in the organization of information between two datastores that are supposed to hold the same content.
Source
The origin of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets extracted.
Tag
Users can assign Tags to Datastores, Profiles (Files, Tables, Containers), Checks, and Anomalies, add a Description, and assign a Weight. The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
Target
The destination of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets loaded.
Third-party data
Data acquired from a source outside of your company which may not be controlled by the same data quality processes. You may not have the same level of confidence in the data and it may not be as trustworthy as internally vetted datasets.
Timeliness
Timeliness can be calculated as the time between when information should be available and when it is actually available, focusing on whether data is available when it is expected.
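A minimal worked example of that calculation (the times are made up):

```python
# Timeliness as defined above: the gap between when data should be available
# and when it actually became available.
from datetime import datetime

expected_at = datetime(2024, 6, 1, 6, 0)   # data due at 06:00 per the SLA
arrived_at = datetime(2024, 6, 1, 7, 30)   # actual arrival time

print("delay:", arrived_at - expected_at)   # 1:30:00
print("met SLA?", arrived_at <= expected_at)
```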
Volumetrics
Data has the same size and shape across similar cycles. It includes statistics about the size of a data set including calculations or predictions on the rate of change over time.
Weight
The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
Add Datastores ↵
Datastores Overview
A `Datastore` can be any Apache Spark-compatible data source, such as:

- Traditional `RDBMS`.
- Raw files (`CSV`, `XLSX`, `JSON`, `Avro`, `Parquet`) on:
    - AWS S3.
    - Azure Blob Storage.
    - GCP Cloud Storage.

A `Datastore` is a medium holding structured data. Qualytics supports Spark-compatible Datastores via the conceptual layers depicted below.
Configuration
The first step of configuring a Qualytics instance is to add a source datastore:
- In the `main` menu, select the `Datastores` tab.
- Click on the `Add Source Datastore` button.
Info
A datastore can be any Apache Spark-compatible data source:
- traditional RDBMS,
- raw files (`CSV`, `XLSX`, `JSON`, `Avro`, `Parquet`, etc.) on:
    - `AWS S3`
    - `Azure Blob Storage`
    - `GCP Cloud Storage`
Credentials
Configuring a datastore will require you to enter configuration credentials dependent upon each datastore. Here is an example of a Snowflake datastore being added:
When a datastore is added, it’ll be populated in the home screen along with other datastores:
Clicking into a datastore will guide the user through the capabilities and operations of the platform.
When a user configures a datastore for the first time, they’ll see an empty Activity tab.
Heatmap view
Running a Catalog of the Datastore
The first operation of Catalog will automatically kick off. You can see this through the Activity tab.
- This operation typically takes a short amount of time to complete.
- After this is completed, they’ll need to run a Profile operation (under `Run` -> `Profile`) to generate metadata and infer data quality checks.
JDBC Datastores ↵
JDBC Datastore Overview
The JDBC Datastore in Qualytics facilitates seamless integration with relational databases using the Java Database Connectivity (JDBC) API. This allows users to connect, analyze, and profile data stored in various relational databases.
Supported Databases
Qualytics supports a range of relational databases, including but not limited to:
- BigQuery
- Databricks
- DB2
- Hive
- MariaDB
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Presto
- Amazon Redshift
- Snowflake
- Synapse
- Timescale DB
- Trino
- Athena
Connection Details
Users are required to provide specific connection details such as Host/Port or URI for the JDBC Datastore.
The connection details are used to establish a secure and reliable connection to the target database.
Catalog Operation
Upon successful verification, a Catalog operation can be initiated, providing metadata about the JDBC Datastore, including containers, field names, and record counts.
Field Types Inference
During the Catalog operation, Qualytics infers field types by weighted histogram analysis. This allows for automatic detection of data types within the JDBC Datastore, facilitating more accurate data profiling.
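As a rough intuition for what histogram-based type inference means, the toy sketch below lets sampled values "vote" for the types they parse as and tallies the result. This is a simplification made up for illustration; the actual Qualytics analysis is weighted and considerably more sophisticated.

```python
# Toy sketch of histogram-based type inference: sample values "vote" for the
# types they parse as. Illustrative only; not the Qualytics implementation.
from collections import Counter
from datetime import datetime

def candidate_types(value: str):
    types = []
    try:
        int(value)
        types.append("integer")
    except ValueError:
        pass
    try:
        float(value)
        types.append("fractional")
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        types.append("date")
    except ValueError:
        pass
    types.append("string")  # every value is at least a string
    return types

sample = ["42", "17", "3", "2024-01-05", "7"]
histogram = Counter(t for v in sample for t in candidate_types(v))
print(histogram)  # a real inference would prefer the most specific well-supported type
```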
Containers Overview
For a more detailed understanding of how Qualytics manages and interacts with containers in JDBC Datastores, please refer to the Containers section in our comprehensive user guide.
This section covers topics such as container deletion, field deletion, and the initial profile of a Datastore's containers.
Athena
Adding and configuring an Amazon Athena connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Athena as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Athena environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Athena is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Athena datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore. (The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Athena from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Athena connector from the dropdown list and add connection properties such as Secrets Management, host, port, username, and password, along with datastore properties like catalog, database, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
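The fields above describe a generic two-step Vault flow: authenticate to obtain a client token, then fetch the secret with that token. The sketch below approximates that flow with the requests and jsonpath-ng libraries; the URLs, payloads, and paths are placeholders, and this is not the platform's internal code.

```python
# Approximation of the retrieval flow the fields above describe. All values
# are placeholders; this is illustrative, not Qualytics internals.
import requests
from jsonpath_ng import parse  # pip install jsonpath-ng

login_url = "https://vault.example.com/v1/auth/approle/login"    # Login URL
credentials_payload = {"role_id": "YOUR_ROLE_ID", "secret_id": "YOUR_SECRET_ID"}  # Credentials Payload
token_jsonpath = "$.auth.client_token"                           # Token JSONPath
secret_url = "https://vault.example.com/v1/secret/data/athena"   # Secret URL
token_header_name = "X-Vault-Token"                              # Token Header Name
data_jsonpath = "$.data"                                         # Data JSONPath

login = requests.post(login_url, json=credentials_payload, timeout=30)
login.raise_for_status()
token = parse(token_jsonpath).find(login.json())[0].value

secret = requests.get(secret_url, headers={token_header_name: token}, timeout=30)
secret.raise_for_status()
data = parse(data_jsonpath).find(secret.json())[0].value
print(list(data))  # keys of the retrieved secret payload
```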
Step 2: The configuration form, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get the hostname from your Athena account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | S3 Output Location | Define the S3 bucket location where the output will be stored. This is specific to AWS Athena and specifies where query results are saved. |
6. | Catalog | Enter the catalog name. In AWS Athena, this refers to the data catalog that contains database and table metadata. |
7. | Database | Specify the database name. |
8. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Catalog, Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
Click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including anomalies and related metadata. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Athena connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Athena as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use Enrichment Datastore option is selected from the caret button, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Athena as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: The server address where the selected enrichment datastore is hosted. It is the endpoint used to connect to that environment.
- Database: Refers to the specific database within the selected enrichment datastore where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating an Athena datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "athena_catalog",
"schema": "athena_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"host": "athena_host",
"port": 443,
"username": "athena_user",
"password": "athena_password",
"parameters": { "output": "s3://<bucket_name>" },
"type": "athena"
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
BigQuery
Adding and configuring BigQuery connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding BigQuery as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their BigQuery environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let's get started 🚀
BigQuery Setup Guide
This guide explains how to create and use a temporary dataset with an expiration time in BigQuery. This dataset helps manage intermediate query results and temporary tables when using the Google BigQuery JDBC driver.
It is recommended for efficient data management, performance optimization, and automatic reduction of storage costs by deleting data when it is no longer needed.
Access the BigQuery Console
Step 1: Navigate to the BigQuery console within your Google Cloud Platform (GCP) account.
Step 2: Click on the vertical ellipsis; this opens a popup menu for creating a dataset. Click on Create dataset to set up a new dataset.
Step 3: Fill in the following fields to create a new dataset.
Info
- Dataset Location: Select the location that aligns with where your other datasets reside to minimize data transfer delays.
- Default Table Expiration: Set the expiration to 1 day to ensure any table created in this dataset is automatically deleted one day after its creation.
Step 4: Click the Create Dataset button to apply the configuration and create the dataset.
Step 5: Navigate to the created dataset and find the Dataset ID in the Dataset Info.
The Dataset Info section contains the Dataset ID and other information related to the created dataset. This generated Dataset ID is used to configure the BigQuery datastore.
BigQuery Roles and Permissions
This section explains the roles required for viewing, editing, and running jobs in BigQuery. To integrate BigQuery with Qualytics, you need specific roles and permissions.
Assigning these roles ensures Qualytics can perform data discovery, management, and analytics tasks efficiently while maintaining security and access control.
BigQuery Roles
- BigQuery Data Editor (roles/bigquery.dataEditor): Allows modification of data within BigQuery, including adding new tables and changing table schemas. It is suitable if you want to regularly update or insert data.
- BigQuery Data Viewer (roles/bigquery.dataViewer): Enables viewing datasets, tables, and their contents. It is essential if you need to read data structures and information.
- BigQuery Job User (roles/bigquery.jobUser): Allows creating and managing jobs in BigQuery, such as queries, data imports, and data exports. It is important if you want to run automated queries.
- BigQuery Read Session User (roles/bigquery.readSessionUser): Allows usage of the BigQuery Storage API for efficient retrieval of large data volumes. It provides capabilities to create and manage read sessions, facilitating large-scale data transfers.
Warning
If a temporary dataset already exists in BigQuery and users want to use it when creating a new datastore connection, the service account must have the bigquery.tables.create permission to perform the test connection and proceed to the datastore creation.
Datastore BigQuery Privileges
The following table outlines the privileges associated with BigQuery roles when configuring datastore connections in Qualytics:
Source Datastore Permissions (Read-Only)
Provides read access to view table data and metadata:
REF | READ-ONLY PERMISSIONS | DESCRIPTION |
---|---|---|
1. | roles/bigquery.dataViewer | Allows viewing of datasets, tables, and their data. |
2. | roles/bigquery.jobUser | Enables running of jobs such as queries and data loading. |
3. | roles/bigquery.readSessionUser | Facilitates the creation of read sessions for efficient data retrieval. |
Enrichment Datastore Permissions (Read-Write)
Grants read and write access for data editing and management:
REF | READ-WRITE PERMISSIONS | DESCRIPTION |
---|---|---|
1. | roles/bigquery.dataEditor | Provides editing permissions for table data and schemas. |
2. | roles/bigquery.dataViewer | Allows viewing of datasets, tables, and their data. |
3. | roles/bigquery.jobUser | Enables running of jobs such as queries and data loading. |
4. | roles/bigquery.readSessionUser | Facilitates the creation of read sessions for efficient data retrieval. |
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. BigQuery is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore. The specified name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select BigQuery from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the BigQuery connector from the dropdown list and add connection details such as temp dataset ID, service account key, project ID, and dataset ID.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Temp Dataset ID (Optional) | Enter a temporary Dataset ID for intermediate data storage during BigQuery operations. |
2. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing BigQuery. |
3. | Project ID (Required) | Enter the Project ID associated with BigQuery. |
4. | Dataset ID (Required) | Enter the Dataset ID (schema name) associated with BigQuery. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
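For reference, the Service Account Key file downloaded from Google Cloud is a JSON document shaped roughly as follows; the values are placeholders and the service account name shown is hypothetical.
{
  "type": "service_account",
  "project_id": "your_project_id",
  "private_key_id": "your_private_key_id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "qualytics-sa@your_project_id.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "token_uri": "https://oauth2.googleapis.com/token"
}
Upload this file as-is; the service account it identifies must hold the BigQuery roles listed earlier in this section.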
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Project ID, Dataset ID, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata for your selected datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Temp Dataset ID (Optional) | Enter a temporary Dataset ID for intermediate data storage during BigQuery operations. |
2. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing BigQuery. |
3. | Project ID (Required) | Enter the Project ID associated with BigQuery. |
4. | Dataset ID (Required) | Enter the Dataset ID (schema name) associated with BigQuery. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the "Use an existing enrichment datastore" option is selected from the dropdown menu, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the BigQuery instance is hosted. It is the endpoint used to connect to the BigQuery environment.
- Database: Refers to the specific database within the BigQuery environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a BigQuery datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
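The sketch below follows the payload shape used for the other connectors in this guide (name, teams, database, schema, enrich_only, trigger_catalog, connection). The BigQuery-specific connection fields shown inside parameters (project_id, dataset_id, temp_dataset_id, service_account_key) are illustrative assumptions only; confirm the exact field names against the Qualytics API reference.
{
  "name": "your_datastore_name",
  "teams": ["Public"],
  "database": "bigquery_project_id",
  "schema": "bigquery_dataset_id",
  "enrich_only": false,
  "trigger_catalog": true,
  "connection": {
    "name": "your_connection_name",
    "type": "bigquery",
    "parameters": {
      "project_id": "your_project_id",
      "dataset_id": "your_dataset_id",
      "temp_dataset_id": "your_temp_dataset_id",
      "service_account_key": "contents_of_service_account_key_json"
    }
  }
}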
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
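As with the source payload above, the following is a hedged sketch that mirrors the enrichment payloads of the other connectors in this guide (enrich_only set to true, no trigger_catalog); the BigQuery-specific connection fields are assumptions.
{
  "name": "your_enrichment_datastore_name",
  "teams": ["Public"],
  "database": "bigquery_project_id",
  "schema": "bigquery_enrichment_dataset_id",
  "enrich_only": true,
  "connection": {
    "name": "your_connection_name",
    "type": "bigquery",
    "parameters": {
      "project_id": "your_project_id",
      "dataset_id": "your_enrichment_dataset_id",
      "service_account_key": "contents_of_service_account_key_json"
    }
  }
}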
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Databricks
Adding and configuring Databricks connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Databricks as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Databricks environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let's get started 🚀
Databricks Setup Guide
This guide provides a comprehensive walkthrough for setting up Databricks. It highlights the distinction between SQL Warehouses and All-Purpose Compute, the functionality of node pools, and the enhancements they offer.
Additionally, it details the process for attaching compute resources to node pools and explains the minimum requirements for effective operation.
Understanding SQL Warehouses and All-Purpose Compute
SQL Warehouses (Serverless)
SQL Warehouses (Serverless) in Databricks utilize serverless SQL endpoints for running SQL queries.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Cost-effectiveness | Serverless SQL endpoints allow you to pay only for the queries you execute, without the need to provision or manage dedicated infrastructure, making it more cost-effective for ad-hoc or sporadic queries. |
2. | Scalability | Serverless architectures automatically scale resources based on demand, ensuring optimal performance for varying workloads. |
3. | Simplified Management | With serverless SQL endpoints, you don't need to manage clusters or infrastructure, reducing operational overhead. |
4. | Minimum Requirements | The minimum requirements for using SQL Warehouse with serverless typically include access to a Databricks workspace and appropriate permissions to create and run SQL queries. |
All-Purpose Compute
All-purpose compute in Databricks refers to clusters that are not optimized for specific tasks. While they offer flexibility, they may not provide the best performance or cost-effectiveness for certain workloads.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Slow Spin-up Time | All-purpose compute clusters may take longer to spin up compared to specialized clusters, resulting in delays before processing can begin. |
2. | Timeout Connections | Due to longer spin-up times, there's a risk of timeout connections, especially for applications or services that expect quick responses. |
Node Pool and Its Usage
A node pool in Databricks is a set of homogeneous virtual machines (VMs) within a cluster. It allows you to have a fixed set of instances dedicated to specific tasks, ensuring consistent performance and resource isolation.
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Resource Isolation | Node pools provide resource isolation, allowing different workloads or applications to run without impacting each other's performance. |
2. | Optimized Performance | By dedicating specific nodes to particular tasks, you can optimize performance for those workloads. |
3. | Cost-effectiveness | Node pools can be more cost-effective than using all-purpose compute for certain workloads, as you can scale resources according to the specific requirements of each task. |
Improving All-Purpose Compute with Node Pools
To improve the performance of all-purpose compute using node pools, you can follow these steps:
REF | ATTRIBUTE | DESCRIPTION |
---|---|---|
1. | Define Workload-Specific Node Pools | Identify the specific tasks or workloads that require optimized performance and create dedicated node pools for them. |
2. | Specify Minimum Requirements | Determine the minimum resources (such as CPU, memory, and disk) required for each workload and configure the node pools accordingly. |
3. | Monitor and Adjust | Continuously monitor the performance of your node pools and adjust resource allocations as needed to ensure optimal performance. |
Step 1: Configure details for Qualytics Node Pool.
Step 2: Attach Compute details with the Node Pool.
Retrieve the Connection Details
This section explains how to retrieve the connection details that you need to connect to Databricks.
Credentials to Connect with Qualytics
To configure Databricks, you need the following credentials:
REF | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add HTTP Path (web address) to fetch data from your Databricks account. |
3. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
4. | Database (Required) | Specify the database name to be accessed. |
5. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
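To illustrate what these values typically look like, the fragment below shows representative placeholder formats for a Databricks workspace hostname, a SQL Warehouse HTTP path, and a personal access token. These are examples of value formats only, not the exact API field names (the API payload later in this section passes the token as password and the HTTP path under parameters.path); a cluster's HTTP path follows a different pattern, visible under JDBC/ODBC in Advanced Options.
{
  "host": "dbc-a1b2c3d4-e5f6.cloud.databricks.com",
  "http_path": "/sql/1.0/warehouses/abc123def456",
  "catalog": "your_catalog",
  "database": "your_database",
  "personal_access_token": "dapiXXXXXXXXXXXXXXXXXXXXXXXX"
}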
Get Connection Details for the SQL Warehouse
Follow the given steps to get the connection details for the SQL warehouse:
- Click on the SQL Warehouses in the sidebar.
- Choose a warehouse to connect to.
- Navigate to the Connection Details tab.
- Copy the connection details.
Get Connection Details for the Cluster
Follow the given steps to get the connection details for the cluster:
- Click on the Compute in the sidebar.
- Choose a cluster to connect to.
- Navigate to the Advanced Options.
- Click on the JDBC/ODBC tab.
- Copy the connection details.
Get the Access Token
Step 1: In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.
Note
Refer to the Databricks Official Docs to generate the Access Token.
Step 2: In the Settings page, select the Developer option in the User section.
Step 3: In the Developer page, click on Manage in Access Tokens.
Step 4: In the Access Tokens page, click on the Generate new token button.
Step 5: You will see a modal to add a description and validation time (in days) for the token.
Step 6: After adding the contents, click on Generate, and it will show the token.
Warning
Before closing the modal window by clicking on the Done button, ensure the Personal Access Token is saved to a secure location.
Step 7: You can see the new token on the Access Tokens page.
You can also revoke a token on the Access Tokens page by clicking on the Revoke token button.
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Databricks is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name. This name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Databricks from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Databricks connector from the dropdown list and add connection details such as Secrets Management, host, HTTP path, database, and personal access token.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1 | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2 | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3 | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4 | Secret URL | Enter the URL where the secret is stored in Vault. |
5 | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6 | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Host (Required) | Get the hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add the HTTP Path (web address) to fetch data from your Databricks account. |
3. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
4. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
5. | Database (Optional) | Specify the database name to be accessed. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Catalog, Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELD | ACTIONS |
---|---|---|
1. | Host (Required) | Get the hostname from your Databricks account and add it to this field. |
2. | HTTP Path (Required) | Add the HTTP Path (web address) to fetch data from your Databricks account. |
3. | Personal Access Token (Required) | Generate a Personal Access Token from your Databricks account and add it for authentication. |
4. | Catalog (Required) | Add a Catalog to fetch data structures and metadata from Databricks. |
5. | Database (Optional) | Specify the database name |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the Databricks instance is hosted. It is the endpoint used to connect to the Databricks environment.
- Database: Refers to the specific database within the Databricks environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Databricks datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "databricks_database",
"schema": "databricks_catalog",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "databricks",
"host": "databricks_host",
"password": "databricks_token",
"parameters": {
"path": "databricks_http_path"
}
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "databricks_database",
"schema": "databricks_enrichment_catalog",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "databricks",
"host": "databricks_host",
"password": "databricks_token",
"parameters": {
"path": "databricks_http_path"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
DB2
Adding and configuring a DB2 connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add DB2 as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their DB2 environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. DB2 is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name. This name will appear on the datastore cards. |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select DB2 from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the DB2 connector from the dropdown list and add connection details such as Secrets Management, host, port, user, password, SSL connection, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your DB2 account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
6. | Database (Required) | Specify the database name. |
7. | Schema (Required) | Define the schema within the database that should be used. |
8. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform the catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your DB2 account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User to connect. |
4. | Password | Enter the password to connect to the database. |
5. | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
6. | Database | Specify the database name. |
7. | Schema | Define the schema within the database that should be used. |
8. | Teams | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is marked Public or Private. For example, a datastore marked Public is accessible to all users.
- Host: This is the server address where the DB2 instance is hosted. It is the endpoint used to connect to the DB2 environment.
- Database: Refers to the specific database within the DB2 environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a DB2 datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "db2_database",
"schema": "db2_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "db2",
"host": "db2_host",
"port": "db2_port",
"username": "db2_username",
"password": "db2_password",
"parameters": {
"ssl": true
}
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "db2_database",
"schema": "db2_enrichment_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "db2",
"host": "db2_host",
"port": "db2_port",
"username": "db2_username",
"password": "db2_password",
"parameters": {
"ssl": true
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Hive
Steps to setup Hive
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
By clicking on the Finish button, the Datastore will be created and the configuration of an Enrichment Datastore will be skipped.
- To configure an Enrichment Datastore later, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
Warning
Qualytics does not support the Hive connector as an enrichment datastore, but you can point to a different connector.
- To configure an Enrichment Datastore later, please refer to this section.
- If you have an Enrichment Datastore already set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page.
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
By clicking on the Finish button, the Datastore will be created and the Enrichment Datastore will be linked or created.
Fields
- Name (required): The datastore name to be created in the Qualytics App.
- Hostname (required): The address of the server to connect to. This address can be a DNS name or an IP address.
- Port (required): The port to connect to on the server. The default is 10000. Note: If you're using the default, you don't have to specify the port.
- Schema (required): The schema name to be connected.
- User (required): The user to connect to Hive.
- Password (required): The password to connect to Hive.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "hive_database",
"schema": "hive_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "hive",
"host": "hive_host",
"port": "hive_port",
"username": "hive_username",
"password": "hive_password",
"parameters": {
"zookeeper": false
}
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
MariaDB
Steps to setup MariaDB
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, test the connection. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Name (required)
- The datastore name to be created in the Qualytics App.
Host (required)
- The host to connect to the MariaDB server.
Port (required)
- The TCP/IP port number to use for the connection. The default is 3306.
Database (required)
- The database name of the MariaDB instance you want to connect to.
User (required)
- The MariaDB user name to use when connecting to the server.
Password (required)
- The password of the MariaDB account.
Information on how to connect with MariaDB
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mariadb_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "mariadb",
"host": "mariadb_host",
"port": "mariadb_port",
"username": "mariadb_username",
"password": "mariadb_password"
}
}
Creating an Enrichment Datastore
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
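The payload below is a minimal sketch modeled on the MariaDB source datastore example above; it assumes the same connection fields apply, with enrich_only set to true. Adjust the placeholder values for your environment.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mariadb_enrichment_database",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "mariadb",
"host": "mariadb_host",
"port": "mariadb_port",
"username": "mariadb_username",
"password": "mariadb_password"
}
}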
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Microsoft SQL Server
Adding and configuring a Microsoft SQL Server connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Microsoft SQL Server as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Microsoft SQL Server environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Microsoft SQL Server is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name (this name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Microsoft SQL Server from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Microsoft SQL Server connector from the dropdown list and add connection details such as Secret Management, host, port, username, password, and database.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
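To illustrate how the JSONPath fields map onto Vault responses, the snippets below show a simplified login response and secret payload with hypothetical values; the exact shape depends on your Vault auth method and secrets engine. With responses like these, $.auth.client_token extracts the client token and $.data extracts the stored credentials.
Login response (excerpt):
{
"auth": {
"client_token": "hvs.example-client-token"
}
}
Secret read response (excerpt):
{
"data": {
"username": "stored_username",
"password": "stored_password"
}
}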
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Microsoft SQL Server account and add it to this field. |
2. | Port (Optional) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure to add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Microsoft SQL Server account and add it to this field. |
2. | Port (Optional) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the Password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Teams: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the Microsoft SQL Server instance is hosted. It is the endpoint used to connect to the Microsoft SQL Server environment.
-
Database: Refers to the specific database within the Microsoft SQL Server environment where the data is stored.
-
Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API.
Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Microsoft SQL Server datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "sqlserver_database",
"schema": "sqlserver_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "sqlserver",
"host": "sqlserver_host",
"port": "sqlserver_port",
"username": "sqlserver_username",
"password": "sqlserver_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "sqlserver_database",
"schema": "sqlserver_enrichment_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "sqlserver",
"host": "sqlserver_host",
"port": "sqlserver_port",
"username": "sqlserver_username",
"password": "sqlserver_password",
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
MySQL
Adding and configuring MySQL connection within Qualytics empowers the platform to build a symbolic link with your database to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add MySQL as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their MySQL environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. MySQL is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the datastore name (this name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select MySQL from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the MySQL connector from the dropdown list and add connection details such as Secret Management, host, port, username, password, and database.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your MySQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a "connection" to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your MySQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Team: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the MySQL instance is hosted. It is the endpoint used to connect to the MySQL environment.
-
Database: Refers to the specific database within the MySQL environment where the data is stored.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API.
Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a MySQL datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mysql_database",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "mysql",
"host": "mysql_host",
"port": "mysql_port",
"username": "mysql_username",
"password": "mysql_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
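The payload below is a minimal sketch modeled on the MySQL source datastore example above; it assumes the same connection fields apply, with enrich_only set to true. Adjust the placeholder values for your environment.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "mysql_enrichment_database",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "mysql",
"host": "mysql_host",
"port": "mysql_port",
"username": "mysql_username",
"password": "mysql_password"
}
}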
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Oracle
Adding and configuring an Oracle connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Oracle as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Oracle environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Oracle, for example, is a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Oracle datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (the specified name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector | Select Oracle from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Oracle connector from the dropdown list and add connection details such as Secret Management, host, port, username, sid, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get “Hostname” from your Oracle account and add it to this field. |
2. | Port | Specify the “Port” number. |
3. | Protocol | Specifies the connection protocol used for communicating with the database. Choose between TCP or TCPS. |
4. | Connect By | You can choose between SID or Service Name to establish a connection with the Oracle database, depending on how your database instance is configured. |
5. | User | Enter the “User ID” to connect. |
6. | Password | Enter the “password” to connect to the database. |
7. | Schema | Define the schema within the database that should be used. |
8. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
9. | Initiate Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Warning
Qualytics does not support the Oracle connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Oracle as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Oracle as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
- Host: This is the server address where the Oracle instance is hosted. It is the endpoint used to connect to the Oracle environment.
- Database: Refers to the specific database within the Oracle environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating an Oracle datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "oracle_database",
"schema": "oracle_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "oracle",
"host": "oracle_host",
"port": "oracle_port",
"username": "oracle_username",
"password": "oracle_password",
"parameters": {
"sid": "orcl"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
PostgreSQL
Adding and configuring a PostgreSQL connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add PostgreSQL as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their PostgreSQL environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. PostgreSQL is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore (the specified name will appear on the datastore cards). |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select PostgreSQL from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the PostgreSQL connector from the dropdown list and add connection details such as Secrets Management, host, port, username, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your PostgreSQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password to connect to the database. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your PostgreSQL account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the user account. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
-
Team: The team associated with managing the enrichment datastore and whether it is marked public or private. For example, Public means that this datastore is accessible to all users.
-
Host: This is the server address where the PostgreSQL instance is hosted. It is the endpoint used to connect to the PostgreSQL environment.
-
Database: Refers to the specific database within the PostgreSQL environment where the data is stored.
-
Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a PostgreSQL datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "postgresql_database",
"schema": "postgresql_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "postgresql",
"host": "postgresql_host",
"port": "postgresql_port",
"username": "postgresql_username",
"password": "postgresql_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "postgresql_database",
"schema": "postgresql_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "postgresql",
"host": "postgresql_host",
"port": "postgresql_port",
"username": "postgresql_username",
"password": "postgresql_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Presto
Steps to set up Presto
Fill the form with the credentials of your data source.
Once the form is completed, test the connection to verify that Qualytics is able to connect to your data source. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
Warning
Qualytics does not support the Presto connector as an enrichment datastore, but you can point to a different connector.
- To configure an Enrichment Datastore at a later time, please refer to this section.
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page.
Once the form is completed, test the connection. A success message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Host (required)
- The address of the server to connect to. This address can be a DNS or IP address.
Port (required)
- The port to connect to on serverName.
- The default is 8080. Note: If you're using the default, you don't have to specify the port.
Catalog (required)
- The catalog name to be connected.
Schema (required)
- The schema name to be connected. The default is default.
User (required)
- The user to connect to Presto.
Password (required)
- The password to connect to Presto.
SSL TrustStore (required)
- A keystore file that contains certificates from other parties that you expect to communicate with, or from Certificate Authorities that you trust to identify other parties.
Configuring Presto for Hive Table Access Control:
Locate the Hive Connector Configuration File:
- The configuration file for the Hive connector in Presto is typically named hive.properties and is located in the etc/catalog directory of your Presto installation.
Modify the Configuration:
- Open the hive.properties file in a text editor.
- Add or modify the line hive.allow-drop-table=true to allow dropping tables. If you set it to false, it will disallow dropping tables. A minimal example of the relevant line is shown below.
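For reference, the relevant entry in etc/catalog/hive.properties would look like the following excerpt; the rest of your connector configuration remains unchanged:
hive.allow-drop-table=true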
Restart Presto:
- After making changes to the configuration, you'll need to restart Presto for the changes to take effect.
- Or, you can use the restart command to do it in one step, as shown in the example below:
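A minimal sketch, assuming a standard Presto installation managed with the bundled launcher script (adjust the path to match your installation):
bin/launcher stop
bin/launcher start
Or, in one step:
bin/launcher restart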
Info
The hive.allow-drop-table configuration is just one of the many configurations available. If you want to control more granular permissions, such as read/write access, you might need to look into using a combination of Hive's native permissions and the configurations available in Presto.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "presto_database",
"schema": "presto_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "presto",
"host": "presto_host",
"port": "presto_port",
"username": "presto_username",
"password": "presto_password",
"parameters":{
"ssl_truststore":"truststore.jks"
},
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Redshift
Adding and configuring Redshift connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Redshift as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Redshift environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle
Let’s get started 🚀
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Redshift is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore. (e.g., The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3. | Connector (Required) | Select Redshift from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Redshift connector from the dropdown list and add connection details such as Secret Management, port, host, password, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Redshift account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the Redshift user account. |
5. | Database (Required) | Specify the database name. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add new connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you can add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host (Required) | Get Hostname from your Redshift account and add it to this field. |
2. | Port (Required) | Specify the Port number. |
3. | User (Required) | Enter the User to connect. |
4. | Password (Required) | Enter the password associated with the Redshift user account. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this datastore. |
Step 4: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Team: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the Redshift instance is hosted. It is the endpoint used to connect to the Redshift environment.
- Database: Refers to the specific database within the Redshift environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Redshift datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "redshift_database",
"schema": "redshift_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "redshift",
"host": "redshift_host",
"port": "redshift_port",
"username": "redshift_username",
"password": "redshift_password"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "redshift_database",
"schema": "redshift_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "redshift",
"host": "redshift_host",
"port": "redshift_port",
"username": "redshift_username",
"password": "redshift_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Snowflake
Adding and configuring a Snowflake connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Snowflake as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Snowflake environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Snowflake Setup Guide
The Snowflake Setup Guide provides step-by-step instructions for configuring warehouses and roles, ensuring efficient data management and access control. It explains how to create a warehouse with minimal requirements and how to set a default warehouse for a user. It also explains how to create custom read-only and read-write roles and grant the necessary privileges for data access and modification.
This guide is designed to help you optimize your Snowflake environment for performance and security, whether setting it up for the first time or refining your configuration.
Warehouse & Role Configuration
This section provides instructions for configuring Snowflake warehouses and roles. It includes creating a warehouse with minimum requirements, assigning a default warehouse for a user, creating custom read-only and read-write roles, and granting privileges to these roles for data access and modification.
Create a Warehouse
Use the following command to create a warehouse with minimum requirements:
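The exact statement is not reproduced here; the following is a minimal sketch that uses the qualytics_wh name referenced in the grants below, with assumed sizing and auto-suspend settings that you can adjust to your needs.

```sql
-- Minimal warehouse for Qualytics workloads (size and suspend settings are assumptions)
CREATE WAREHOUSE IF NOT EXISTS qualytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE
       INITIALLY_SUSPENDED = TRUE;
```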
Set a specific warehouse as the default for a user:
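For example, assuming the qualytics_wh warehouse created above and a placeholder <user_name> for the user Qualytics connects as:

```sql
-- Make qualytics_wh the default warehouse for the connecting user
ALTER USER <user_name> SET DEFAULT_WAREHOUSE = qualytics_wh;
```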
Source Datastore Privileges and Permissions
Create a new role called qualytics_read_role and grant it privileges:
CREATE ROLE qualytics_read_role;
GRANT USAGE ON WAREHOUSE qualytics_wh TO ROLE qualytics_read_role;
GRANT USAGE ON DATABASE <database_name> TO ROLE qualytics_read_role;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON TABLE <database_name>.<schema_name>.<table_name> TO ROLE qualytics_read_role;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE qualytics_read_role;
GRANT ROLE qualytics_read_role TO USER <user_name>;
Enrichment Datastore Privileges and Permissions
Create a new role called qualytics_readwrite_role and grant it privileges:
CREATE ROLE qualytics_readwrite_role;
GRANT USAGE ON WAREHOUSE qualytics_wh TO ROLE qualytics_readwrite_role;
GRANT USAGE, MODIFY ON DATABASE <database_name> TO ROLE qualytics_readwrite_role;
GRANT USAGE, MODIFY ON SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT CREATE TABLE ON SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<qualytics_schema> TO ROLE qualytics_readwrite_role;
GRANT ROLE qualytics_readwrite_role TO USER <user_name>;
Add a Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Snowflake is an example of a source datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the JDBC datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. (e.g., The specified name will appear on the datastore cards.) |
2️. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection. |
3️. | Connector (Required) | Select Snowflake from the dropdown list. |
Option I: Create a Source Datastore with a New Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Snowflake connector from the dropdown list and add the connection details.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Account (Required) | Define the account identifier to be used for accessing Snowflake. |
2️. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3️. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Snowflake. Keypair Authentication: This method uses a public/private key pair; the public key is registered with the Snowflake user and the private key (optionally protected by a passphrase) is used to authenticate the connection. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Info
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore Connection
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in tables. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 3: The configuration form will expand, requesting credential details for the selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Account (Required) | Define the account identifier to be used for accessing Snowflake. |
2️. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3️. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Snowflake. Keypair Authentication: This method uses a public/private key pair; the public key is registered with the Snowflake user and the private key (optionally protected by a passphrase) is used to authenticate the connection. |
5. | Database (Required) | Specify the database name to be accessed. |
6. | Schema (Required) | Define the schema within the database that should be used. |
7. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 4: Click on the Test Connection button to verify the enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the enrichment datastore has been successfully verified.
Step 5: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 6: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Datastore
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files for metadata. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore. Example - All users are assigned to the Public team, which means that this enrichment datastore is accessible to all the users.
- Host: This is the host domain of the Snowflake instance.
- Database: Refers to the specific database within the Snowflake environment. This database is a logical grouping of schemas. Each database belongs to a single Snowflake account.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Snowflake datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "snowflake_database",
"schema": "snowflake_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "snowflake",
"host": "snowflake_host",
"username": "snowflake_username",
"password": "snowflake_password",
"passphrase": "key_passphrase",
"parameters": {
"role": "snowflake_read_role",
"warehouse": "qualytics_wh",
"authentication_type": "KEYPAIR"
}
}
}
Note
If the authentication_type parameter is removed, BASIC authentication will be used by default.
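For reference, a connection block using basic authentication simply omits the passphrase and the authentication_type parameter. The fragment below mirrors the payload above and is illustrative only:

```json
"connection": {
    "name": "your_connection_name",
    "type": "snowflake",
    "host": "snowflake_host",
    "username": "snowflake_username",
    "password": "snowflake_password",
    "parameters": {
        "role": "snowflake_read_role",
        "warehouse": "qualytics_wh"
    }
}
```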
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "snowflake_database",
"schema": "snowflake_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "snowflake",
"host": "snowflake_host",
"username": "snowflake_username",
"password": "snowflake_password",
"parameters": {
"role": "snowflake_readwrite_role",
"warehouse": "qualytics_wh"
}
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Synapse
Steps to setup Synapse
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you already have an Enrichment Datastore set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Name
required
- The datastore name to be created in the Qualytics App.
Host
required
- Host URL to be connected.
- Hostname in the form
Port
required
- Port number to access the Synapse database.
- The default port is 1433.
Database
required
- The database name to be connected or which the account user has access to.
Schema
required
- The schema name to be connected or which the account user has access to.
User
required
- The user that has access to the Synapse application.
Password
required
- The password for the user that has access to the Synapse application.
Information on how to connect with Synapse
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "synapse_database",
"schema": "synapse_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "synapse",
"host": "synapse_host",
"port": "synapse_port",
"username": "synapse_username",
"password": "synapse_password"
}
}
Creating an Enrichment Datastore
Endpoint (Post): /api/datastores (post)
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "synapse_database",
"schema": "synapse_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "synapse",
"host": "synapse_host",
"port": "synapse_port",
"username": "synapse_username",
"password": "synapse_password"
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Teradata
Adding and configuring a Teradata connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Teradata as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Teradata environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Teradata is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Teradata datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Teradata from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Teradata connector from the dropdown list and add connection details such as Secret Management, host, port, username, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your Teradata account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | Database | Specify the database name. |
6. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
7. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata, in tables. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Teradata connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Teradata as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Note
Qualytics does not support Teradata as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using BigQuery as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the Teradata instance is hosted. It is the endpoint used to connect to the Teradata environment.
- Database: Refers to the specific database within the Teradata environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
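The sample payload is not reproduced here; the sketch below mirrors the other JDBC connectors in this guide, and the "teradata" connection type plus the exact field set are assumptions you should verify against your environment:

```json
{
    "name": "your_datastore_name",
    "teams": ["Public"],
    "database": "teradata_database",
    "enrich_only": false,
    "trigger_catalog": true,
    "connection": {
        "name": "your_connection_name",
        "type": "teradata",
        "host": "teradata_host",
        "port": "teradata_port",
        "username": "teradata_username",
        "password": "teradata_password"
    }
}
```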
Link an Enrichment Datastore to a Source Datastore
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
TimescaleDB
Adding and configuring a TimescaleDB connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding TimescaleDB as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their TimescaleDB environment is properly connected with Qualytics, unlocking the platform’s potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. TimescaleDB is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the TimescaleDB datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select TimescaleDB from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the TimescaleDB connector from the dropdown list and add connection details such as Secret Management, host, port, username, database, and schema.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Host | Get Hostname from your TimescaleDB account and add it to this field. |
2. | Port | Specify the Port number. |
3. | User | Enter the User ID to connect. |
4. | Password | Enter the password to connect to the database. |
5. | Database | Specify the database name. |
6. | Schema | Define the schema within the database that should be used. |
7. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
8. | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata, in tables. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the TimescaleDB connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add new connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support TimescaleDB as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Microsoft SQL Server as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the Use enrichment datastore option is selected from the caret button, you will be prompted to configure the datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support TimescaleDB as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using Bank Enrichment as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, a datastore marked as Public is accessible to all users.
- Host: This is the server address where the TimescaleDB instance is hosted. It is the endpoint used to connect to the PostgreSQL environment.
- Database: Refers to the specific database within the TimescaleDB environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
Creating a Source Datastore
This section provides a sample payload for creating a TimescaleDB datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "timescale_database",
"schema": "timescale_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "timescale",
"host": "timescale_host",
"port": "timescale_port",
"username": "timescale_username",
"password": "timescale_password"
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Trino
Steps to setup Trino
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section.
Note
It is important to associate an Enrichment Datastore with your new Datastore.
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore.
Configuring an Enrichment Datastore
- If you already have an Enrichment Datastore set up, you can link it by enabling the option to use an existing Enrichment Datastore and selecting it from the list.
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore.
Fields
Host
required
- The address of the server to connect to. This address can be a DNS or IP address.
Port
required
- The port to connect to on serverName.
- The default is 8080. Note: if you're using the default, you don't have to specify the port.
Catalog
required
- The catalog name to be connected.
Schema
required
- The schema name to be connected.
- The default is default.
User
required
- The user used to connect to Hive.
Password
required
- The password used to connect to Hive.
SSL TrustStore
required
- A keystore file that contains certificates from other parties that you expect to communicate with, or from Certificate Authorities that you trust to identify other parties.
Configuring Trino for Hive Table Access Control:
Locate the Hive Connector Configuration File:
- The configuration file for the Hive connector in Trino is typically named hive.properties and is located in the etc/catalog directory of your Trino installation.
Modify the Configuration:
- Open the hive.properties file in a text editor.
- Add or modify the line hive.allow-drop-table=true to allow dropping tables. If you set it to false, it will disallow dropping tables.
Restart Trino:
- After making changes to the configuration, you'll need to restart Trino for the changes to take effect.
- Or, you can use the restart command to do it in one step:
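A minimal sketch, assuming a standard tarball installation where the launcher script lives under bin/ of the Trino install directory:

```bash
# Restart Trino in one step so the hive.properties change takes effect
bin/launcher restart
```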
Info
The hive.allow-drop-table configuration is just one of the many configurations available. If you want to control more granular permissions, such as read/write access, you might need to look into using a combination of Hive's native permissions and the configurations available in Trino.
API Payload Examples
Creating a Datastore
This section provides a sample payload for creating a datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "trino_database",
"schema": "trino_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "trino",
"host": "trino_host",
"port": "trino_port",
"username": "trino_username",
"password": "trino_password",
"parameters":{
"ssl_truststore":"truststore.jks"
}
}
}
Creating an Enrichment Datastore
Endpoint (Post): /api/datastores (post)
This section provides a sample payload for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "trino_database",
"schema": "trino_schema",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "trino",
"host": "trino_host",
"port": "trino_port",
"username": "trino_username",
"password": "trino_password",
"parameters":{
"ssl_truststore":"truststore.jks"
}
}
}
Linking Datastore to an Enrichment Datastore through API
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
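A minimal sketch of calling this endpoint with Python's requests library is shown below. The base URL, the bearer-token authentication header, and the numeric IDs are placeholders and assumptions about your deployment; substitute the values used by your Qualytics instance.

```python
import requests

# Placeholders: replace with your Qualytics instance URL, API token, and the
# IDs of the source datastore and the enrichment datastore to link.
BASE_URL = "https://your-instance.qualytics.io/api"
API_TOKEN = "your_api_token"
DATASTORE_ID = 123
ENRICHMENT_ID = 456

response = requests.patch(
    f"{BASE_URL}/datastores/{DATASTORE_ID}/enrichment/{ENRICHMENT_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
print("Linked:", response.status_code)
```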
Dremio
Adding and configuring a Dremio connection within Qualytics empowers the platform to build a symbolic link with your schema to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on adding Dremio as a source datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Dremio environment is properly connected with Qualytics, unlocking the platform’s potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Add the Source Datastore
A source datastore is a storage location used to connect to and access data from external sources. Dremio is an example of such a datastore, specifically a type of JDBC datastore that supports connectivity through the JDBC API. Configuring the Dremio datastore allows the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards) |
2. | Toggle Button | Toggle ON to create a new source datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select Dremio from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Dremio connector from the dropdown list and add connection properties such as Secrets Management, host, port, username, and password, along with datastore properties like catalog, database, etc.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4 | Secret URL | Enter the URL where the secret is stored in Vault. |
5 | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6 | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
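To make the JSONPath examples above concrete, the sketch below shows the kind of response shapes they typically resolve against. The exact JSON returned by your Vault instance depends on the auth method and secrets engine you use, so treat these field values as illustrative placeholders only.

```python
# Illustrative Vault response shapes (placeholders only).
login_response = {
    "auth": {
        "client_token": "hvs.EXAMPLETOKEN"   # resolved by $.auth.client_token
    }
}

secret_response = {
    "data": {                                # resolved by $.data
        "username": "dremio_username",
        "password": "dremio_password",
    }
}

# Plain-Python equivalents of the JSONPath expressions from the table:
token = login_response["auth"]["client_token"]
credentials = secret_response["data"]
print(token, credentials["username"])
```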
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Host | Get the hostname from your Dremio account and add it to this field. |
2 | Port | Specify the Port number. |
3 | Project ID | Enter the Project ID associated with Dremio. |
4 | SSL Connection | Enable the SSL connection to ensure secure communication between Qualytics and the selected datastore. |
5 | Authentication | You can choose between Basic authentication or Access Token for validating and securing the connection to your Dremio instance. Basic Authentication: This method uses a username and password combination for authentication. It is a straightforward method where the user's credentials are directly used to access Dremio. Access Token: This method authenticates using a personal access token generated in Dremio. |
6 | Schema | Define the schema within the database that should be used. |
7 | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
8 | Initial Cataloging | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Add New connection is turned off, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Schema, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
Click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
After successfully testing and verifying your source datastore connection, you have the option to add an enrichment datastore (recommended). This datastore is used to store analyzed results, including any anomalies and additional metadata. This setup provides comprehensive visibility into your data quality, enabling you to manage and improve it effectively.
Warning
Qualytics does not support the Dremio connector as an enrichment datastore, but you can point to a different enrichment datastore.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Link Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELD | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection
REF. | FIELDS | ACTION |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Name | Give a name for the enrichment datastore. |
3 | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4 | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Qualytics does not support Dremio as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using DB2 as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
Step 3: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 4: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 5: Close the Success dialogue and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
Note
Qualytics does not support Dremio as an enrichment datastore. Instead, you can select a different enrichment datastore for this purpose. For demonstration purposes, we are using DB2 as the enrichment datastore. You can use any other JDBC or DFS datastore of your choice for the enrichment datastore configuration.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- Host: This is the server address where the enrichment datastore instance is hosted. It is the endpoint used to connect to that environment.
- Database: Refers to the specific database within the enrichment datastore environment where the data is stored.
- Schema: The schema used in the enrichment datastore. The schema is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single database.
Step 4: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating a Dremio datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post): /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"database": "dremio_database",
"schema": "dremio_schema",
"enrich_only": false,
"trigger_catalog": true,
"connection": {
"name": "your_connection_name",
"type": "dremio",
"host": "dremio_host",
"port": 443,
"project_id": "dremio_id",
"ssl": true,
"authentication": {
"type": "access_token",
"personal_access_token": "your_personal_access_token"},
}
}
Link an Enrichment Datastore to a Source Datastore
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
DFS Datastores
DFS Datastore Overview
The DFS (Distributed File System) Datastore feature in Qualytics is designed to handle data stored in distributed file systems.
This includes file systems like Hadoop Distributed File System (HDFS) or similar distributed storage solutions.
Supported Distributed File Systems
Qualytics supports DFS Datastores catering to distributed file systems such as Amazon S3, Azure Blob Storage, Azure Datalake Storage, and Google Cloud Storage, which are covered in the following sections.
Connection Details
Users provide connection details for DFS Datastores, allowing Qualytics to establish a connection to the distributed file system.
Catalog Operation
The Catalog operation involves walking the directory tree, reading files with supported filename extensions, and creating containers based on file metadata.
Data Quality and Profiling
DFS Datastores support the initiation of Profile Operations, allowing users to understand the structure and characteristics of the data stored in the distributed file system.
Containers Overview
For a more detailed understanding of how Qualytics manages and interacts with containers in DFS Datastores, please refer to the Containers section in our comprehensive user guide.
This section covers topics such as container deletion, field deletion, and the initial profile of a Datastore's containers.
Multi-Token Filename Globbing and Container Formation
Filenames with similar structures in the same folder are automatically included in a single globbed container during the Catalog operation.
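As a conceptual illustration of that grouping (not Qualytics' actual cataloging code), the sketch below derives a single glob pattern from a set of hypothetical filenames in a folder that differ only by a date-like token:

```python
import re

# Hypothetical filenames that differ only in the date token.
filenames = ["orders_20240227.csv", "orders_20240228.csv", "orders_20240229.csv"]

def to_pattern(name: str) -> str:
    # Collapse runs of digits into a wildcard so the varying token disappears.
    return re.sub(r"\d+", "*", name)

patterns = {to_pattern(name) for name in filenames}
print(patterns)  # {'orders_*.csv'} -> all three files fall into one globbed container
```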
Use Folders for Precise File Grouping
Organizing files within distinct folders is a straightforward and effective strategy in Distributed File Systems (DFS).
When all files in a folder share a common schema, it simplifies the process of grouping and managing them.
This approach ensures precise file grouping without relying on complex glob patterns.
How to Use Folders for Shared Schema
1. Create a Folder:
Begin by creating a new folder in your distributed filesystem.
- Suppose you have order data files with filenames like `orders_20240229.csv`, `orders-20240228.csv`, `orders-20240227.csv`.
- Create a folder named `Orders` to group these files.
Qualytics Pattern: Qualytics will automatically create the container `orders_*.csv` based on the filenames.
2. Place Related Files in the Folder:
Move or upload files that share a common schema into the created folder.
- Move the order data files into the `Orders` folder.
3. Repeat for Each Schema:
Create separate folders for different schemas, and organize files accordingly.
- Suppose you have customer data files with filenames like `customers_us.csv`, `customers_eu.csv`.
- Create a folder named `Customers` to group these files.
Qualytics Pattern: Qualytics will automatically create the pattern `customers_*.csv` based on the filenames.
4. Naming Conventions:
Consider adopting clear and consistent naming conventions for folders to enhance organization.
- Use descriptive names for folders, such as `Orders` and `Customers`, to make it easier to identify the contents.
Flowchart: Using Folders for Shared Schema
graph TD
A[Start] -->|Create a Folder| B(Create Folder)
B -->|Place Related Files| C(Move or Upload Files)
C -->|Repeat for Each Schema| D(Create Separate Folders)
D -->|Naming Conventions| E(Consider Clear Naming)
E --> F[End]
Use Filename Conventions for POSIX Globs
This option leverages filename conventions that align with POSIX globs, allowing our system to automatically organize files for you.
The system intelligently analyzes filename patterns, making the process seamless and efficient.
How to Use Filename Conventions for POSIX Globs
1. Follow Clear Filename Conventions:
Adopt clear and consistent filename conventions that lend themselves to POSIX globs.
- Suppose you have log files with filenames like `app_log_20240229.txt`, `app_log_20240228.txt`, `app_log_20240227.txt`.
- Use a consistent naming convention like `app_log_*.txt`, where `*` serves as a placeholder for varying elements.
- The `*` in the convention acts as a wildcard, representing any variation in the filename. In this example, it matches the date part (`20240229`, `20240228`, etc.).
2. Upload or Move Files:
Upload or move files with filenames following the adopted conventions to your distributed filesystem.
- Move log files with the specified naming convention to your DFS.
3. System Analysis:
Our system will automatically detect and analyze the filename conventions, creating appropriate glob patterns.
- With filenames like `app_log_20240229.txt` and `app_log_20240228.txt`, the system will create the pattern `app_log_*.txt` (see the sketch below).
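The sketch below uses Python's standard fnmatch module to show which files the derived POSIX-style pattern matches. It is an illustration of the convention, not Qualytics code, and the filenames are hypothetical.

```python
import fnmatch

log_files = [
    "app_log_20240229.txt",
    "app_log_20240228.txt",
    "app_log_20240227.txt",
    "error_log_20240229.txt",  # different prefix, so it falls outside the pattern
]

# The * wildcard matches the varying date token, so all app_log_ files group together.
matched = fnmatch.filter(log_files, "app_log_*.txt")
print(matched)  # ['app_log_20240229.txt', 'app_log_20240228.txt', 'app_log_20240227.txt']
```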
Flowchart: Using Filename Conventions for POSIX Globs
graph TD
A[Start] -->|Follow Clear Conventions| B(Adopt Consistent Conventions)
B -->|Upload or Move Files| C(Move Files to DFS)
C -->|System Analysis| D(Automatic Pattern Creation)
D --> E[End]
Why Not Manually Create Your Own Globs?
While our system offers powerful features to automate file organization, we strongly discourage manually creating globs.
This option may lead to errors, inconsistencies, and hinder the efficiency of our system.
We recommend leveraging our automated tools for a seamless and error-free experience.
Complex and Error-Prone:
Manually creating globs can be complex, prone to typos, and susceptible to errors in pattern formation.
- Suppose you want to group log files with the pattern `app_log_*.txt`. A manual attempt might result in mistakes like `app_log_202*.txt` or `app_log_*.tx`.
Inconsistencies Across Files:
Manual glob creation may lead to inconsistencies across different files, making it challenging to establish a uniform file organization.
- Trying to manually create globs for order data files with inconsistent filename formats (`orders_20240229.csv`, `orders-20240228.csv`) can result in inconsistent patterns.
Explore Deeper Knowledge
If you want to go deeper into the knowledge or if you are curious and want to learn more about DFS filename globbing, you can explore our comprehensive guide here: How DFS Filename Globbing Works.
Amazon S3
Adding and configuring an Amazon S3 connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Amazon S3 as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Amazon S3 environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Amazon S3 Setup Guide
This section provides a simple walkthrough for setting up Amazon S3, including retrieving URIs. It also explains how to retrieve the Access Key and Secret Key to configure datastore permissions.
By following the Amazon S3 setup process, you will ensure secure and efficient access to your stored data, allowing seamless datastore integration and proper access management in Qualytics.
Retrieve the URI
The S3 URI is the unique resource identifier within the context of the S3 protocol. It follows this naming convention: `s3://bucket-name/key-name`
To retrieve the URI of an S3 object via the AWS Console, follow these steps:
- Navigate to the AWS S3 console and click on your bucket's name (use the search input to find the object if necessary).
- Click on the checkbox next to the object's name
- Click on the Copy S3 URI button
Retrieve the Access Key and Secret Key
The access keys are long-term credentials for an IAM user or the AWS account root user. You can use these keys to sign programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK).
To retrieve the Access Key and Secret Access Key, follow these steps:
- Open the IAM console.
- From the navigation menu, click on the Users.
- Select your IAM user name.
- Click on the User Actions, and then click on the Manage Access Keys.
- Click on the Create Access Key.
- Your keys will look something like this:
  - Access key ID example: `AKIAIOSFODNN7EXAMPLE`
  - Secret access key example: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
- Click on the Download Credentials, and store the keys in a secure location.
Warning
Your Secret Access Key will be visible only once at the time of creation. Please ensure you copy and securely store it for future use.
Datastore Privileges
If you are using a private bucket, authentication is required for the connection.
Source Datastore Permissions (Read-Only)
To create a policy, follow these steps:
- Open the IAM console.
- Navigate to Policies in the IAM dashboard and select Create Policy.
- Go to the JSON tab and paste the provided JSON into the Policy editor.
Tip
Ensure you replace `<bucket/path>` with your specific resource.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:Get*"
],
"Resource": [
"arn:aws:s3:::<bucket>/*",
"arn:aws:s3:::<bucket>"
]
}
]
}
Warning
Currently, object-level permissions alone are insufficient to authenticate the connection. Please ensure you also include bucket-level permissions as demonstrated in the example above.
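To sanity-check the read-only policy and keys before configuring the datastore, a minimal sketch using the boto3 SDK (an assumption; any S3 client works) is shown below. The bucket name, object key, and credentials are placeholders.

```python
import boto3

# Placeholders: use the access key pair created above and your own bucket/key names.
s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

# s3:ListBucket -- enumerate a few objects in the bucket.
listing = s3.list_objects_v2(Bucket="your-bucket", MaxKeys=5)
for obj in listing.get("Contents", []):
    print(obj["Key"])

# s3:Get* -- read one object's metadata.
head = s3.head_object(Bucket="your-bucket", Key="path/to/file.csv")
print(head["ContentLength"])
```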
Enrichment Datastore Permissions (Read-Write)
To create a policy, follow these steps:
- Open the IAM console.
- Navigate to Policies in the IAM dashboard and select Create Policy.
- Go to the JSON tab and paste the provided JSON into the Policy editor.
Tip
Ensure you replace `<bucket/path>` with your specific resource.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:Get*",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:PutObject",
"s3:DeleteObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Resource": [
"arn:aws:s3:::<bucket>/*",
"arn:aws:s3:::<bucket>"
]
}
]
}
Warning
Currently, object-level permissions alone are insufficient to authenticate the connection. Please ensure you also include bucket-level permissions as demonstrated in the example above.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Amazon S3 is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards.). |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Amazon S3 from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Amazon S3 connector from the dropdown list and add connection details such as URI, access key, secret key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Amazon S3. |
2️. | Access Key (Required) | Input the access key provided for secure access. |
3️. | Secret Key (Required) | Input the secret key associated with the access key for secure authentication. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams and initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Amazon S3 from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) for the Amazon S3. |
2. | Access Key (Required) | Input the access key provided for secure access. |
3. | Secret Key (Required) | Input the secret key associated with the access key for secure authentication. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `s3://bucket-name` for Amazon S3).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Amazon S3 datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
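A hedged sketch of this request in Python is shown below. The payload mirrors the other DFS connector examples in this guide; the exact field names accepted for the Amazon S3 connector (notably the "type" value and the access/secret key field names), the base URL, and the bearer-token header are assumptions to verify against your Qualytics API reference. Setting "enrich_only" to True would instead create an enrichment datastore.

```python
import requests

BASE_URL = "https://your-instance.qualytics.io/api"   # placeholder
API_TOKEN = "your_api_token"                           # placeholder

# Payload modeled on the Azure Blob Storage example in this guide; field names
# for the S3 connector are assumptions and should be confirmed.
payload = {
    "name": "your_datastore_name",
    "teams": ["Public"],
    "trigger_catalog": True,
    "root_path": "/s3_root_path",
    "enrich_only": False,   # True to create an enrichment datastore instead
    "connection": {
        "name": "your_connection_name",
        "type": "s3",
        "uri": "s3://bucket-name",
        "access_key": "AKIAIOSFODNN7EXAMPLE",
        "secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
}

response = requests.post(
    f"{BASE_URL}/datastores",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
print(response.json())
```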
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Azure Blob Storage
Adding and configuring an Azure Blob Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Azure Blob Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Azure Blob Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Azure Blob Storage Setup Guide
This setup guide details the process for retrieving the Account Name and Access Key of your Azure Blob Storage account, essential for seamless configuration in Qualytics.
Azure Blob Storage URI
The Uniform Resource Identifier (URI) for Azure Blob Storage is structured to uniquely identify resources within your storage account. The format of the URI is as follows: wasb[s]://<container-name>@<storage-account-name>.blob.core.windows.net/<path>
- <container-name>: The name of the container within your Azure Blob Storage account.
- <storage-account-name>: The name of your Azure Blob Storage account.
- <path>: A forward slash-delimited (/) representation of the directory structure within the container.
Retrieve the Account Name and Access Key
To configure Azure Blob Storage Datastore in Qualytics, you need the account name and access key. Follow these steps to retrieve them:
- To get the `account_name` and `access_key`, you need to access your storage account in Azure.
- Click on the Access Keys tab and copy the values.
Tip
Refer to the Azure Blob Storage documentation for more information on how to retrieve the account name and access key of your storage account.
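If you want to confirm the account name and access key before configuring the datastore, the sketch below uses the azure-storage-blob SDK (an assumption; any Blob Storage client works) to list containers with those credentials. All values are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: replace with the Account Name and Access Key retrieved above.
ACCOUNT_NAME = "your_account_name"
ACCESS_KEY = "your_access_key"

client = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCESS_KEY,
)

# Listing containers is a quick way to verify the credentials are valid.
for container in client.list_containers():
    print(container.name)
```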
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Azure Blob Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF | FIELD | ACTION |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. Example: The specified name will appear on the datastore cards. |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Azure Blob Storage from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Azure Blob Storage connector from the dropdown list and add connection details such as URI, account name, access key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF | FIELD | ACTION |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Blob Storage. |
2️. | Account Name (Required) | Input the account name to access the Azure Blob Storage. |
3️. | Access Key (Required) | Input the access key provided for secure access. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). The enrichment datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF | FIELD | ACTION |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore. |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Azure Blob Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF | FIELD | ACTION |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Blob Storage. |
2️. | Account Name (Required) | Input the account name to access the Azure Blob Storage. |
3️. | Access Key (Required) | Input the access key provided for secure access. |
4️. | Root Path (Required) | Specify the root path where the data is stored. |
5️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF | FIELD | ACTION |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3️. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `wasbs://storage-url` for Azure Blob Storage).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Azure Blob Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/azure_root_path",
"enrich_only": false,
"connection": {
"name": "your_connection_name",
"type": "wasb",
"uri": "wasb://<container>@<account_name>.blob.core.windows.net",
"access_key": "azure_account_nme",
"secret_key": "azure_access_key"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/azure_root_path",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "wasb",
"uri": "wasb://<container>@<account_name>.blob.core.windows.net",
"access_key": "azure_account_nme",
"secret_key": "azure_access_key"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Azure Datalake Storage
Adding and configuring an Azure Datalake Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Azure Datalake Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Azure Datalake Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Azure Datalake Storage Setup Guide
This setup guide details the process for retrieving the Account Name and Access Key of your Azure Datalake Storage account, essential for seamless configuration in Qualytics.
Azure Datalake Storage URI
The Uniform Resource Identifier (URI) for Azure Datalake Storage is structured to uniquely identify resources within your storage account. The format of the URI is as follows: abfs[s]://<file_system>@<account-name>.dfs.core.windows.net/<path>
- abfs[s]: The abfs or abfss protocol is used as the scheme identifier.
- <file_system>: The parent location that holds the files and folders. This is similar to containers in the Azure Storage Blobs service.
- <account-name>: The name assigned to your storage account during creation.
- <path>: A forward slash-delimited (/) representation of the directory structure.
Retrieve the Account Name and Access Key
To configure Azure Datalake Storage Datastore in Qualytics, you need the account name and access key. Follow these steps to retrieve them:
- To get the `account_name` and `access_key`, you need to access your storage account in Azure.
- Click on the Access Keys tab and copy the values.
Tip
Refer to the Azure Datalake Storage documentation for more information on how to retrieve the account name and access key of your storage account.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Azure Datalake Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Specify the name of the datastore (e.g., The specified name will appear on the datastore cards.) |
2. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3. | Connector (Required) | Select Azure Datalake Storage from the dropdown list. |
Option I: Create a Source Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Azure Datalake Storage connector from the dropdown list and add connection details such as URI, account name, access key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Datalake Storage. |
2. | Account Name (Required) | Input the account name to access the Azure Datalake Storage. |
3. | Access Key (Required) | Input the access key provided for secure access. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
6. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to verify the existing connection details. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata in files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure an enrichment datastore.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3. | Name (Required) | Give a name for the enrichment datastore. |
4. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5. | Connector (Required) | Select a datastore connector as Azure Datalake Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF | FIELDS | ACTIONS |
---|---|---|
1. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Azure Datalake Storage. |
2. | Account Name (Required) | Input the account name to access the Azure Datalake Storage. |
3. | Access Key (Required) | Input the access key provided for secure access. |
4. | Root Path (Required) | Specify the root path where the data is stored. |
5. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore is based on the role of public or private. Example- Marked as Public means that this datastore is accessible to all the users.
- URI: Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., `abfss://storage-url` for Azure Datalake Storage).
- Root Path: Specify the root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your data has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Azure Datalake Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your\_datastore\_name",
"teams": \["Public"\],
"trigger_catalog": true,
"root_path": "/azure\_root\_path",
"enrich_only": false,
"connection": {
"name": "your\_connection\_name",
"type": "abfs",
"uri": "abfs://<container>@<account_name>.dfs.core.windows.net",
"access_key": "azure\_account\_nme",
"secret_key": "azure\_access\_key"
}
}
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
{
"name": "your\_datastore\_name",
"teams": \["Public"\],
"trigger_catalog": true,
"root_path": "/azure\_root\_path",
"enrich_only": true,
"connection": {
"name": "your\_connection\_name",
"type": "abfs",
"uri": "abfs://<container>@<account_name>.dfs.core.windows.net",
"access_key": "azure\_account\_nme",
"secret_key": "azure\_access\_key"
}
}
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
Google Cloud Storage
Adding and configuring Google Cloud Storage connection within Qualytics empowers the platform to build a symbolic link with your file system to perform operations like data discovery, visualization, reporting, cataloging, profiling, scanning, anomaly surveillance, and more.
This documentation provides a step-by-step guide on how to add Google Cloud Storage as both a source and enrichment datastore in Qualytics. It covers the entire process, from initial connection setup to testing and finalizing the configuration.
By following these instructions, enterprises can ensure their Google Cloud Storage environment is properly connected with Qualytics, unlocking the platform's potential to help you proactively manage your full data quality lifecycle.
Let’s get started 🚀
Google Cloud Storage Setup Guide
This guide will walk you through the steps to set up Google Cloud Storage, including how to retrieve the necessary URIs, access keys, and secret keys, which are essential for integrating this datastore into Qualytics.
Retrieve Google Cloud Storage URI
To retrieve the Google Cloud Storage URI, follow the given steps:
- Go to the Cloud Storage Console.
- Navigate to the location of the object (file) that holds the source data.
- At the top of the Cloud Storage console, locate and note down the path to the object.
- Create the URI using the following format: `gs://bucket/file`
  - `bucket` is the name of the Cloud Storage bucket.
  - `file` is the name of the object (file) containing the data.
Retrieve the Access Key and Secret Key
You need these keys when integrating Google Cloud Storage with other applications or services, such as when adding it as a datastore in Qualytics. The keys allow you to reuse existing code to access Google Cloud Storage without needing to implement a different authentication mechanism.
To retrieve the access key and secret key in the Google Cloud Storage Console account, follow the given steps:
Step 1: Log in to the Google Cloud Console, navigate to the Google Cloud Storage settings, and this will redirect you to the Settings page.
Step 2: Click on the Interoperability tab.
Step 3: Scroll down the Interoperability page and under Access keys for your user account, click the CREATE A KEY button to generate a new Access Key and Secret Key.
Step 4: Use the generated Access Key and Secret Key values when adding your Google Cloud Storage account to Qualytics as a datastore.
For example, once you generate the keys, they might look like this:
- Access Key: GOOG1234ABCDEFGH5678
- Secret Key: abcd1234efgh5678ijklmnopqrstuvwx
Warning
Make sure to store these keys securely, as they provide access to your Google Cloud Storage resources.
Add a Source Datastore
A source datastore is a storage location used to connect and access data from external sources. Google Cloud Storage is an example of a source datastore, specifically a type of Distributed File System (DFS) datastore that is designed to handle data stored in distributed file systems. Configuring a DFS datastore enables the Qualytics platform to access and perform operations on the data, thereby generating valuable insights.
Step 1: Log in to your Qualytics account and click on the Add Source Datastore button located at the top-right corner of the interface.
Step 2: A modal window- Add Datastore will appear, providing you with the options to connect a datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Name (Required) | Specify the name of the datastore. The specified name will appear on the datastore cards. |
2️. | Toggle Button | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new source datastore from scratch. |
3️. | Connector (Required) | Select Google Cloud Storage from the dropdown list. |
Option I: Create a Datastore with a new Connection
If the toggle for Use an existing connection is turned off, then this will prompt you to add and configure the source datastore from scratch without using existing connection details.
Step 1: Select the Google Cloud Storage connector from the dropdown list and add connection details such as URI, service account key, root path, and teams.
Step 2: The configuration form will expand, requesting credential details before establishing the connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) of the Google Cloud Storage. |
2️. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing the Google Cloud Storage. |
3️. | Root Path (Required) | Specify the root path where the data is stored. |
4️. | Teams (Required) | Select one or more teams from the dropdown to associate with this source datastore. |
5️. | Initiate Cataloging (Optional) | Tick the checkbox to automatically perform catalog operation on the configured source datastore to gather data structures and corresponding metadata. |
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Option II: Use an Existing Connection
If the toggle for Use an existing connection is turned on, then this will prompt you to configure the source datastore using the existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Root Path, Teams, and Initiate Cataloging.
Step 2: Click on the Test Connection button to check and verify the source data connection. If connection details are verified, a success message will be displayed.
Note
Clicking on the Finish button will create the source datastore and bypass the enrichment datastore configuration step.
Tip
It is recommended to click on the Next button, which will take you to the enrichment datastore configuration page.
Add Enrichment Datastore
Once you have successfully tested and verified your source datastore connection, you have the option to add the enrichment datastore (recommended). This datastore is used to store the analyzed results, including any anomalies and additional metadata files. This setup provides full visibility into your data quality, helping you manage and improve it effectively.
Step 1: Whether you have added a source datastore by creating a new datastore connection or using an existing connection, click on the Next button to start adding the Enrichment Datastore.
Step 2: A modal window- Add Enrichment Datastore will appear, providing you with the options to configure and add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore, or toggle OFF to link it to a brand new enrichment datastore. |
3️. | Name (Required) | Give a name for the enrichment datastore |
4️. | Toggle Button for using an existing connection | Toggle ON to reuse credentials from an existing connection, or toggle OFF to create a new enrichment from scratch. |
5️. | Connector (Required) | Select a datastore connector as Google Cloud Storage from the dropdown list. |
Option I: Create an Enrichment Datastore with a new Connection
If the toggles for Use an existing enrichment datastore and Use an existing connection are turned off, then this will prompt you to add and configure the enrichment datastore from scratch without using an existing enrichment datastore and its connection details.
Step 1: Add connection details for your selected enrichment datastore connector.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | URI (Required) | Enter the Uniform Resource Identifier (URI) for the Google Cloud Storage. |
2️. | Service Account Key (Required) | Upload a JSON file that contains the credentials required for accessing the Google Cloud Storage. |
3️. | Root Path (Required) | Specify the root path where the data is stored. |
4️. | Teams (Required) | Select one or more teams from the dropdown to associate with this enrichment datastore. |
Step 2: Click on the Test Connection button to verify the selected enrichment datastore connection. If the connection is verified, a flash message will indicate that the connection with the datastore has been successfully verified.
Step 3: Click on the Finish button to complete the configuration process.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Step 4: Close the Success dialog and the page will automatically redirect you to the Source Datastore Details page where you can perform data operations on your configured source datastore.
Option II: Use an Existing Connection
If the toggle for Use an existing enrichment datastore is turned on, you will be prompted to configure the enrichment datastore using existing connection details.
Step 1: Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1️. | Prefix (Required) | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2️. | Toggle Button for existing enrichment datastore | Toggle ON to link the source datastore to an existing enrichment datastore. |
3️. | Enrichment Datastore (Required) | Select an enrichment datastore from the dropdown list. |
Step 2: After selecting an existing enrichment datastore connection, you will view the following details related to the selected enrichment:
- Teams: The team associated with managing the enrichment datastore, based on whether it is public or private. For example, Marked as Public means that this datastore is accessible to all users.
- URI: The Uniform Resource Identifier (URI) points to the specific location of the source data and should be formatted accordingly (e.g., gs://bucket/file for Google Cloud Storage).
- Root Path: The root path where the data is stored. This path defines the base directory or folder from which all data operations will be performed.
Step 3: Click on the Finish button to complete the configuration process for the existing enrichment datastore.
When the configuration process is finished, a modal will display a success message indicating that your datastore has been successfully added.
Close the success message and you will be automatically redirected to the Source Datastore Details page where you can perform data operations on your configured source datastore.
API Payload Examples
This section provides detailed examples of API payloads to guide you through the process of creating and managing datastores using Qualytics API. Each example includes endpoint details, sample payloads, and instructions on how to replace placeholder values with actual data relevant to your setup.
Creating a Source Datastore
This section provides sample payloads for creating the Google Cloud Storage datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
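The payload below is a sketch that follows the structure shown for other DFS connectors in this guide (such as Azure Datalake Storage); the connection type value ("gcs") and the credential field names are assumptions and may differ in your Qualytics API version, so confirm them against the API reference before use.
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/gcs_root_path",
"enrich_only": false,
"connection": {
"name": "your_connection_name",
"type": "gcs",
"uri": "gs://<bucket_name>",
"access_key": "gcs_access_key",
"secret_key": "gcs_secret_key"
}
}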
Creating an Enrichment Datastore
This section provides sample payloads for creating an enrichment datastore. Replace the placeholder values with actual data relevant to your setup.
Endpoint: /api/datastores (post)
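As above, this is a sketch based on the payload structure used for other DFS connectors in this guide; the connection type value ("gcs") and the credential field names are assumptions and should be verified against the Qualytics API reference. Note that enrich_only is set to true for an enrichment datastore.
{
"name": "your_datastore_name",
"teams": ["Public"],
"trigger_catalog": true,
"root_path": "/gcs_root_path",
"enrich_only": true,
"connection": {
"name": "your_connection_name",
"type": "gcs",
"uri": "gs://<bucket_name>",
"access_key": "gcs_access_key",
"secret_key": "gcs_secret_key"
}
}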
Link an Enrichment Datastore to a Source Datastore
Use the provided endpoint to link an enrichment datastore to a source datastore:
Endpoint Details: /api/datastores/{datastore-id}/enrichment/{enrichment-id} (patch)
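For illustration, linking a source datastore with ID 42 to an enrichment datastore with ID 17 (both hypothetical IDs) would send a PATCH request to:
/api/datastores/42/enrichment/17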
Qualytics File System (QFS)
Overview
- A QFS, or Qualytics File System, is a datastore option that Qualytics manages & controls.
- This is convenient for end-users who do not have or do not wish to configure their own datastores but would like to load data onto a datastore managed by Qualytics.
- With a QFS datastore, end-users are able to upload Qualytics-readable files (Excel, CSV, JSON) directly through the Qualytics user interface.
Steps to setup QFS
Fill the form with the credentials of your data source.
Once the form is completed, it's necessary to test the connection to verify if Qualytics is able to connect to your source of data. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and skip the configuration of an Enrichment Datastore.
- To configure an Enrichment Datastore at a later time, please refer to this section
Note
It is important to associate an Enrichment Datastore with your new Datastore
- The Enrichment Datastore will allow Qualytics to record enrichment data, copies of the source anomalous data, and additional metadata for your Datastore
Configuring an Enrichment Datastore
- If you have an Enrichment Datastore already set up, you can link it by enabling Use an existing Enrichment Datastore and selecting it from the list
- If you don't have an Enrichment Datastore, you can create one on the same page:
Once the form is completed, it's necessary to test the connection. A successful message will be shown:
Warning
Clicking on the Finish button will create the Datastore and link or create the Enrichment Datastore
Fields
Name (required)
- The datastore name to be created in the Qualytics App
Ended: DFS Datastores
Connections Overview
Connections facilitate the management of datastores by allowing you to share common connection parameters across multiple datastores.
Setup a Connection
When you create your first datastore, a Connection is automatically generated using the parameters you provide. This Connection can then be used for other datastores with similar configuration needs.
Example:
- Click on Add Source Datastore.
- Enter the necessary connection parameters (e.g., hostname, database name, user credentials).
- After saving, a Connection is created and linked to your datastore, ready for reuse.
Reuse a Connection
For subsequent datastores that require the same connection parameters, you can reuse existing connections.
Example:
- Open the Create Datastore form.
- Select Use an existing connection.
- Pick the desired Connection from a dropdown list.
- Provide any additional details required for the new datastore (e.g., database name, root path).
Manage Connections
For managing the connections please see Manage Connections.
Conclusion
Using Connections optimizes datastore management by enabling the reuse of connection parameters, making the process more streamlined and organized.
Ended: Add Datastores
Source Datastores ↵
Right Click Options
Once you add a source datastore, whether JDBC or DFS, Qualytics provides right-click options on the following:
- Added source datastore
- Tables or files within the source datastore
- Fields within the tables
- Checks within the source datastore
- Anomalies within the source datastore
Let’s get started 🚀
Right Click Source Datastore
Log in to your Qualytics account and right-click on the source datastore whether JDBC or DFS. A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Opens the selected source datastore in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected source datastore to your clipboard. |
3 | Copy ID | Copy the unique ID of the selected source datastore. |
4 | Copy Name | Copy the name of the selected source datastore to your clipboard. |
Alternatively, you can access these right-click options by performing the direct right-click operation on a source datastore from the list.
Right Click Tables & Files
Right-click on a specific table or file within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected table from the datastore in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected table to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected table. |
4 | Copy Name | Copy the name of the selected table to your clipboard. |
Alternatively, you can access these right-click options by opening the dedicated page of the source datastore, navigating to its Tables or files section, and performing the right-click operation on any table or file from the list.
Right Click Fields
Right-click on a specific field within a table or file.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected field in a new browser tab, where you can view its quality score, sampling, completeness, active checks, active anomalies, etc. |
2 | Copy Link | Copy the unique URL of the selected field to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected field. |
4 | Copy Name | Copy the name of the selected field to your clipboard. |
Alternatively, you can access these right-click options by opening the dedicated page of the table, navigating to its Fields section, and performing the right-click operation on any field from the list.
Right Click Checks
Right-click on a specific check, whether All, Active, Draft, or Archived, within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected check in a new browser tab, where you can view its full details. |
2 | Copy Link | Copy the unique URL of the selected check to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected check. |
4 | Copy Name | Copy the name of the selected check to your clipboard. |
Alternatively, you can access these right-click options by navigating to Checks from the Explore section.
Right Click Anomalies
Right-click on a specific anomaly, whether All, Active, Acknowledged, or Archived, within a connected source datastore.
A dropdown list of options will appear:
- Open in New Tab.
- Copy Link.
- Copy ID.
- Copy Name.
No | Field | Description |
---|---|---|
1 | Open in new Tab | Open the selected anomaly in a new browser tab, where you can view its full details. |
2 | Copy Link | Copy the unique URL of the selected anomaly to your clipboard. |
3 | Copy ID | Copy the unique identifier (ID) of the selected anomaly. |
4 | Copy Name | Copy the name of the selected anomaly to your clipboard. |
Alternatively, you can access these right-click options by navigating to Anomalies from the Explore section.
Link Enrichment Datastore
An enrichment datastore is a database used to enhance your existing data by adding additional, relevant information. This helps you to provide more comprehensive insight into data and improve data accuracy.
You have the option to link an enrichment datastore to your existing source datastore. However, some datastores cannot be linked as enrichment datastores. For example, Oracle, Athena, Dremio, and Timescale cannot be used for this purpose.
Let's get started 🚀
Step 1: Select a source datastore from the side menu for which you would like to configure Link Enrichment.
Step 2: Click on the Settings icon from the top right window and select the Enrichment option from the dropdown menu.
A modal window- Link Enrichment Datastore will appear, providing you with two options to link an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Link New Enrichment
If the toggle Add new connection is turned on, then this will prompt you to link a new enrichment datastore from scratch without using existing connection details.
Note
Connection details can vary from datastore to datastore. For illustration, we have demonstrated linking BigQuery as a new enrichment datastore.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1 | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2 | Name | Give a name for the enrichment datastore. |
3 | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4 | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Step 4: Click on the Save button.
Step 5: After clicking on the Save button, a modal window will appear with the message Your Datastore has been successfully updated.
Option II: Link Existing Connection
If the Use an existing enrichment datastore option is selected from the dropdown menu, you will be prompted to link the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: View and check the connection details of the enrichment and click on the Save button.
Step 4: After clicking on the Save button, a modal window will appear with the message Your Datastore has been successfully updated.
When an Enrichment Datastore is linked, you can see a green light showing that the connection between the Datastore and the Enrichment Datastore is stable, or a red light if it is unstable.
Endpoint (Patch): /api/datastores/{datastore-id}/enrichment/{enrichment-id}
Assign Tag
Assigning tags to your Datastore helps you identify and categorize it easily. Tags serve as labels that categorize and identify various data sets, enhancing efficiency and organization. By highlighting checks and anomalies, tags make it easier to monitor data quality. They also allow you to list file patterns and assign quality scores, enabling quick identification and resolution of issues.
In this documentation, we will explore the steps to assign a tag to the datastore.
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to assign a tag.
Step 2: Click on Assign Tag to this Datastore located at the bottom-left corner of the interface.
Step 3: A drop-up menu will appear, providing you with a list of tags. Assign an appropriate tag to your datastore to simplify sorting, accessing, and managing data.
You can also create a new tag by clicking on the call-to-action (➕) button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Step 4: Once you have assigned a tag, the tag will be instantly labeled on your source Datastore, and all related records will be updated.
For demonstration, we have assigned the High tag to the Snowflake source datastore Covid-19 Data, so it will automatically be applied to all related tables and checks within the datastore.
Catalog Operation
A Catalog Operation imports named data collections like tables, views, and files into a Source Datastore. It identifies incremental fields for incremental scans, and offers options to recreate or delete these containers, streamlining data management and enhancing data discovery.
Let's get started 🚀
Key Components
Incremental Identifier
An incremental identifier is essential for supporting incremental scan operations, as it allows the system to detect changes since the last operation.
Partition Identifier
For large data containers or partitions, a partition identifier is necessary to process data efficiently. In DFS datastores, the default fields for both incremental and partition identifiers are set by the last-modified timestamp. If a partition identifier is missing, the system uses repeatable ordering candidates (order-by fields) to process containers, although this method is less efficient for handling large datasets with many rows.
Info
Attribute Overrides: After the profile operation, the Qualytics engine might automatically update the containers to have partition fields and incremental fields. Those "attributes" can be manually overridden.
Note
Advanced users can override these auto-detected selections, and the overridden options will persist through subsequent Catalog Operations.
Initialization & Operation Options
Automatic Catalog Operation
While adding the datastore, tick the Initiate Cataloging checkbox to automatically perform catalog operation on the configured source datastore.
With the automatic cataloging option turned on, you will be redirected to the datastore details page once the datastore (whether JDBC or DFS) is successfully added. You will observe the cataloging operation running automatically with the following default options:
- Prune: Disabled ❌
- Recreate: Disabled ❌
- Include: Tables and views ✔️
Manual Catalog Operation
If automatic cataloging is disabled while adding the datastore, users can initiate the catalog operation manually by selecting preferred options. Manual catalog operation offers the users the flexibility to set up custom catalog configurations like syncing only tables or views.
Step 1: Select a source datastore from the side menu on which you would like to perform the catalog operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Catalog to initiate the catalog operation.
A modal window will display Operation Triggered and you will be notified once the catalog operation is completed.
Note
You will receive a notification when the catalog operation is completed.
Step 3: Close the Success modal window and you will observe in the UI that the Catalog operation has been completed and it has gathered the data structures, file patterns, and corresponding metadata from your configured datastore.
Users might encounter a warning if the schema of the datastore is empty or if the user specified for the connection does not have the necessary permissions to read the objects. This ensures that proper access controls are in place and that the data structure is correctly defined.
Custom Catalog Configuration
The catalog operation can be custom-configured with the following options:
- Prune: Remove any existing named collections that no longer appear in the datastore
- Recreate: Restore any previously removed named collection that does currently appear in the database
- Include: Include Tables and Views
Step 1: Click on the Run button from the datastore details page (top-right corner) and select Catalog from the dropdown list.
Step 2: When configuring the catalog operation settings, you have two options to tune:
- Prune: This option allows the removal of any named collections (tables, views, files, etc.) that no longer exist in the datastore. This ensures that outdated or obsolete collections are not included in future operations, keeping the datastore clean and relevant.
- Recreate: This option enables the recreation of any named collections that have been previously deleted in Qualytics. It is useful for restoring collections that may have been removed accidentally or need to be brought back for analysis.
Step 3: The user can choose whether to include only tables, only views, or both in the catalog operation. This flexibility allows for more targeted metadata analysis based on the specific needs of the data management task.
Run Instantly
Click on the “Run Now” button to perform the catalog operation immediately.
Schedule
Step 1: Click on the “Schedule” button to configure the available schedule options in the catalog operation.
Step 2: Set the scheduling preferences for the catalog operation.
1. Hourly: This option allows you to schedule the catalog operation to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the cataloging should start. Example: If set to "Every 1 hour(s) on minute 0," the catalog operation will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the catalog operation to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to "Every 1 day(s) at 00:00 UTC," the scan will run every day at midnight UTC.
3. Weekly: This option schedules the catalog operation to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the catalog operation to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the catalog operation to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the catalog operation will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for catalog operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * * schedules the catalog operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the "Custom Cron Schedule (UTC)" field before specifying the schedule name. This will allow for precise control over the timing of the catalog operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the “Schedule Name” to identify the scheduled operation at the running time.
Step 4: Click on the “Schedule” button to activate your catalog operation schedule.
Once the catalog operation is triggered, your view will be automatically switched to the Activity tab, allowing you to explore post-operation details on your ongoing/completed catalog operation.
Operations Insights
When the catalog operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Catalog Operation to learn about the operation results.
Top Panel
1. Runs (Default View): Provides insights into the operations that have been performed.
2. Search: Search any operation (including catalog) by entering the operation ID
3. Sort by: Organize the list of operations based on the Created Date or the Duration.
4. Filter: Narrow down the list of operations based on:
- Operation Type
- Operation Status
- Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations
Operation Detail
Running
This status indicates that the catalog operation is still running at the moment and is yet to be completed. A catalog operation having a running status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Abort | Click on the Abort button to stop the catalog operation |
Aborted
This status indicates that the catalog operation was manually stopped before it could be completed. A catalog operation having an aborted status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Resume | Click on the Resume button to continue a previously aborted catalog operation from where it left off |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Warning
This status signals that the catalog operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A catalog operation having a warning status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Catalog Operation |
Success
This status confirms that the catalog operation was completed successfully without any issues. A catalog operation having a success status reflects the following details and actions:
Parameter | Interpretation |
---|---|
Operation ID | Unique identifier |
Operation Type | Type of operation performed (catalog, profile, or scan) |
Timestamp | Timestamp when the operation was started |
Progress Bar | The progress of the operation |
Triggered By | The author who triggered the operation |
Schedule | Whether the operation was scheduled or not |
Prune | Indicates whether Prune was enabled or disabled in the operation |
Recreate | Indicates whether Recreate was enabled or disabled in the operation |
Table | Indicates whether the Table was included in the operation or not |
Views | Indicates whether Views were included in the operation or not |
Rerun | Click on the Rerun button to initiate the catalog operation from the beginning, ignoring any previous attempts |
Delete | Click on the Delete button to remove the record of the catalog operation from the list |
Post-Operation Details
For JDBC Source Datastores
After the catalog operation is completed on a JDBC source datastore, users can view the following information:
- Container Names: These are the names of the data collections (e.g., tables, views) identified during the catalog operation.
- Fields for Each Container: Each container will display its fields or columns, which were detected during the catalog operation.
- Incremental Identifiers and Partition Fields: These settings are automatically configured based on the catalog operation. Incremental identifiers help in recognizing changes since the last scan, and partition fields aid in efficient data processing.
Tree view > Container node > Gear icon > Settings option
For DFS Source Datastores
After the catalog operation is completed on a DFS source datastore, users can view the following information:
- Container Names: Similar to JDBC, these are the data collections identified during the catalog operation.
- Fields for Each Container: Each container will list its fields or metadata detected during the catalog operation.
- Directory Tree Traversal: The catalog operation traverses the directory tree, treating each file with a supported extension as a single-partition container. It reveals metadata such as the relative path, filename, and extension.
- Incremental Identifier and Partition Field: By default, both the incremental identifier and partition field are set to the last-modified timestamp. This ensures efficient incremental scans and data partitioning.
- "Globbed" Containers: Files in the same folder with the same extensions and similar naming formats are grouped into a single container, where each file is treated as a partition. This helps in managing and querying large datasets effectively.
API Payload Examples
This section provides API payload examples for initiating and checking the running status of a catalog operation. Replace the placeholder values with data specific to your setup.
Running a Catalog operation
To run a catalog operation, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Post): /api/operations/run (post)
{
"type": "catalog",
"datastore_id": "datastore-id",
"prune": false,
"recreate": false,
"include": [
"table",
"view"
]
}
Retrieving Catalog Operation Status
To retrieve the catalog operation status, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Get): /api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "catalog",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 54321,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "_data_prefix",
"favorite": false
},
"schedule": null,
"include": [
"table",
"view"
],
"prune": false,
"recreate": false
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Profile Operation
The Profile Operation is a comprehensive analysis conducted on every record within all available containers in a datastore. This process is aimed at understanding and improving data quality by generating metadata for each field within the collections of data (like tables or files).
By gathering detailed statistical data and interacting with the Qualytics Inference Engine, the operation not only identifies and evaluates data quality but also suggests and refines checks to ensure ongoing data integrity. Executing Profile Operations periodically helps maintain up-to-date and accurate data quality checks based on the latest data.
This guide explains how to configure the profile operation with available functionalities such as tables, tags, and schedule options.
Let's get started 🚀
How Profiling Works
Fields Identification
The initial step involves recognizing and identifying all the fields within each data container. This step is crucial as it lays the foundation for subsequent analysis and profiling.
Statistical Data Gathering
After identifying the fields, the Profile Operation collects statistical data for each field based on its declared or inferred data type. This data includes essential metrics such as minimum and maximum values, mean, standard deviation, and other relevant statistics. These metrics provide valuable insights into the characteristics and distribution of the data, helping to understand its quality and consistency.
Metadata Generation
The gathered statistical data is then submitted to the Qualytics Inference Engine. The engine utilizes this data to generate metadata that forms the basis for creating appropriate data quality checks. This metadata is essential for setting up robust quality control mechanisms within the data management system.
Data Quality Checks
The inferred data quality checks are rigorously tested against the actual source data. This testing phase is critical to fine-tuning the checks to the desired sensitivity levels, ensuring they are neither too strict (causing false positives) nor too lenient (missing errors). By calibrating these checks accurately, the system can maintain high data integrity and reliability.
Navigation to Profile Operation
Step 1: Select a source datastore from the side menu on which you would like to perform the profile operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Profile to initiate the profile operation.
Configuration
Step 1: Click on the Run button to initiate the profile operation.
Note
You can run a Profile Operation anytime to automatically update the inferred data quality checks based on new data in the Datastore. It is recommended to schedule profile operations periodically to keep inferred rules up to date. More details are discussed in the Schedule section below.
Step 2: Select tables (in your JDBC datastore) or file patterns (in your DFS datastore) and tags you would like to be profiled.
1. All Tables/File Patterns
This option includes all tables or files currently available in the datastore for profiling. Selecting this will profile every table within the source datastore without the need for further selection.
2. Specific
This option allows users to manually select individual tables or files for profiling. It provides the flexibility to focus on particular tables of interest, which can be useful if the user is only interested in a subset of the available data.
3. Tag
This option automatically profiles tables associated with selected tags. Tags are used to categorize tables, and by selecting a specific tag, all tables associated with that tag will be profiled. This option helps in managing and profiling grouped data efficiently.
Step 3: After making the relevant selections, click on the Next button to configure the Operation Settings.
Step 4: Configure the following two Read Settings:
- Starting Threshold
- Record Limit
Starting Threshold
This setting allows users to specify a minimum incremental identifier value to set a starting point for the profile operation. It helps in filtering data from a specific point in time or a particular batch value.
- Greater Than Time: Users can select a timestamp in UTC to start profiling data from a specific time onwards. This is useful for focusing on recent data or data changes since a particular time.
- Greater Than Batch: Users can enter a batch value to start profiling from a specific batch. This option is helpful for scenarios where data is processed in batches, allowing the user to profile data from a specific batch number onwards.
Note
The starting threshold options, i.e., Greater Than Time and Greater Than Batch, are applicable only to tables or files with an incremental timestamp strategy.
Record Limit
Define the maximum number of records to be profiled: This slider allows users to set a limit on the number of records to be profiled per table. The range can be adjusted from 1 million to all available records. This setting helps in controlling the scope of the profiling operation, particularly for large datasets, by capping the number of records to analyze.
Step 5: After making the relevant selections, click on the Next button to configure the Inference Settings.
Step 6: Configure the following two Inference Settings:
- Inference Threshold
- Inference State
Inference Threshold
The Inference Threshold allows you to customize the data quality checks that are automatically created and updated when your data is analyzed. This means you can adjust the data quality checks based on how complex the data rules are, giving you more control over how your data is checked and monitored.
Default Configuration
By default, the Inference Threshold is set to 2, which provides a comprehensive range of checks designed to ensure data integrity across different scenarios. Users have the flexibility to adjust this threshold based on their specific needs, allowing for either basic or advanced checks as required.
Levels of Check Inference
The Inference Threshold ranges from 0 to 5, with each level including progressively more complex and comprehensive checks. Below is an explanation of each level:
Note
Each level includes all the checks from the previous levels and adds new checks specific to that level. For example, at Level 1, there are five basic checks. At Level 2, you get those five checks plus additional ones for Level 2. By the time you reach Level 5, it covers all the checks from Levels 1 to 4 and adds its own new checks for complete review.
Level 0: No Inference
At this level, no checks are automatically inferred. This is suitable when users want complete control over which checks are applied, or if no checks are needed. Ideal for scenarios where profiling should not infer any constraints, and all checks will be manually defined.
Level 1: Basic Data Integrity and Simple Value Threshold Checks
This level includes fundamental rules for basic data integrity and simple validations. It ensures that basic constraints like completeness, non-negative numbers, and valid date ranges are applied. Included Checks:
- Completeness Checks: Ensure data fields are complete if previously complete.
- Categorical Range Checks: Validate if values fall within a predefined set of categories.
- Non-Negative Numbers: Ensure numeric values are non-negative.
- Non-Future Date/Time: Ensure datetime values are not set in the future.
Use Case: Suitable for datasets where basic integrity checks are sufficient.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 1, five checks are created.
Inferred Checks | Reference |
---|---|
Not Null (record) | See more. |
Any Not Null (record) | See more. |
Expected Values (record) | See more. |
Not Negative | See more. |
Not Future | See more. |
Level 2: Value Range and Pattern Checks
Builds upon Level 1 by adding more specific checks related to value ranges and patterns. This level is more detailed and begins to enforce rules related to the nature of the data itself. Included Checks:
- Date Range Checks: Ensure dates fall within a specified range.
- Numeric Range Checks: Validate that numeric values are within acceptable ranges.
- String Pattern Checks: Ensure strings match specific patterns (e.g., email formats).
- Approximate Uniqueness: Validate uniqueness of values if they are approximately unique.
Use Case: Ideal for datasets where patterns and ranges are important for ensuring data quality.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 2, four checks are created.
Checks | Reference |
---|---|
Between Times | See more. |
Between | See more. |
Matches Pattern | See more. |
Unique | See more. |
Level 3: Time Series and Comparative Relationship Checks
This level includes all checks from Level 2 and adds sophisticated checks for time series and comparative relationships between datasets. Included Checks:
- Date Granularity Checks: Ensure the granularity of date values is consistent (e.g., day, month, year).
- Consistent Relationships: Validate that relationships between overlapping datasets are consistent.
Use Case: Suitable for scenarios where data quality depends on time-series data or when comparing data across different datasets.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 3, eight checks are created.
Inferred checks | Reference |
---|---|
Time Distribution Size | See more. |
After Date Time | See more. |
Before Date Time | See more. |
Greater Than | See more. |
Greater Than a Field | See more. |
Less Than | See more. |
Less Than a Field | See more. |
Equal To Field | See more. |
Level 4: Linear Regression and Cross-Datastore Relationship Checks
This level includes all checks from Level 3 and adds even more advanced checks, including linear regression analysis and validation of relationships across different data stores. Included Checks:
- Linear Regression Checks: Validate data using regression models to identify trends and outliers.
- Cross-Datastore Relationships: Ensure that data relationships are maintained across different data sources.
Use Case: Best for complex datasets where advanced analytical checks are necessary.
The following table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 4, four checks are created.
Inferred Checks | Reference |
---|---|
Predicted By | See more. |
Exists In | See more. |
Not Exists In | See more. |
Is Replica Of | See more. |
Level 5: Shape Checks
The most comprehensive level, includes all previous checks plus checks that validate the shape of certain distribution patterns that can be identified in your data. Included Checks:
- Shape Checks: Checks that define an expectation for some percentage of your data less than 100%. The property “coverage” holds the percentage of your data for which the expressed check should be true.
Use Case: Ideal for scenarios where each incremental set of scanned data should exhibit the same distributions of values as the training set. For example, a transactions table is configured for a weekly incremental scan after each week’s data is loaded. A shape check could define that 80% of all transactions are expected to be performed using “cash” or “credit”.
This table shows the inferred checks that the Analytics Engine can generate based on the user's data. At Level 5, three checks are created.
Inferred Checks | Reference |
---|---|
Expected Values (Shape) | See more. |
Matches Pattern (Shape) | See more. |
Not Null (Shape) | See more. |
Warning
If the checks inferred during a profile operation do not detect any anomalies, and the check inference level decreases in the next profile operation, the checks that did not generate anomalies will be archived or discarded. However, if the checks detect any anomalies, they will be retained to continue monitoring the data and addressing potential issues.
Inference State
Check the box labeled "Infer As Draft" to ensure that all inferred checks will be generated in a draft state. This allows for greater flexibility as you can review and refine these checks before they are finalized.
Run Instantly
Click on the Run Now button to perform the profile operation immediately.
Schedule
Step 1: Click on the Schedule button to configure the available schedule options in the profile operation.
Step 2: Set the scheduling preferences for the profile operation.
1. Hourly: This option allows you to schedule the profile operation to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the profiling should start. Example: If set to "Every 1 hour(s) on minute 0," the profile operation will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the profile operation to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to "Every 1 day(s) at 00:00 UTC," the scan will run every day at midnight UTC.
3. Weekly: This option schedules the profile operation to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the profile operation to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the profile operation to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the profile operation will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for profile operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * *
schedules the profile operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the Custom Cron Schedule (UTC) field before specifying the schedule name. This will allow for precise control over the timing of the profile operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the Schedule Name to identify the scheduled operation at the running time.
Step 4: Click on the Schedule button to activate your profile operation schedule.
Note
You will receive a notification when the profile operation is completed.
Operation Insights
When the profile operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Profile Operation to learn about the operation results.
Top Panel
- Runs (Default View): Provides insights into the operations that have been performed.
- Schedule: Provides insights into the scheduled operations.
- Search: Search any operation (including profile) by entering the operation ID.
- Sort by: Organize the list of operations based on the Created Date or the Duration.
- Filter: Narrow down the list of operations based on:
  - Operation Type
  - Operation Status
  - Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations
Operation Detail
Running
This status indicates that the profile operation is still running at the moment and is yet to be completed. A profile operation having a running status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Abort | The "Abort" button enables you to stop the ongoing profile operation |
12. | Summary | The "Summary" section provides a real-time overview of the profile operation's progress, including key metrics. |
Aborted
This status indicates that the profile operation was manually stopped before it could be completed. A profile operation having an aborted status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Aborted By | The author who aborted the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Resume | Provides an option to continue the profile operation from where it left off |
12. | Rerun | Allows you to start a new profile operation using the same settings as the aborted operation |
13. | Delete | Removes the record of the aborted profile operation from the system, permanently deleting results |
14. | Summary | The "Summary" section provides an overview of the profile operation's progress up to the point it was aborted, including key metrics. |
Warning
This status signals that the profile operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A profile operation having a completed with warning status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, adjustable based on complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Result | Provides immediate insights into the profile operation conducted |
11. | Rerun | Allows you to start a new profile operation using the same settings as the operation that completed with warnings |
12. | Delete | Removes the record of the profile operation, permanently deleting all results |
13. | Summary | The "Summary" section provides an overview of the profile operation's progress, including key metrics. |
14. | Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Profile Operation. |
Success
This status confirms that the profile operation was completed successfully without any issues. A profile operation having a success status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1. | Operation ID & Operation Type | Unique identifier and type of operation performed (catalog, profile, or scan) |
2. | Timestamp | Timestamp when the operation was started |
3. | Progress Bar | The progress of the operation |
4. | Triggered By | The author who triggered the operation |
5. | Schedule | Whether the operation was scheduled or not |
6. | Inference Threshold | Indicates how much control you have over automatic data quality checks, allowing adjustments based on data complexity |
7. | Checks Synchronized | Indicates the count of Checks Synchronized in the operation |
8. | Infer as Draft | Indicates whether Infer as Draft was enabled or disabled in the operation |
9. | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10. | Results | Provides immediate insights into the profile operation conducted |
11. | Rerun | Allows you to start a new profile operation using the same settings as the completed operation |
12. | Delete | Removes the record of the profile operation from the system, permanently deleting all results; this action cannot be undone |
13. | Summary | The "Summary" section provides an overview of the profile operation's progress, including key metrics. |
Full View of Metrics in Operation Summary
Users can now hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Profiled field to display the full value.
Post Operation Details
Step 1: Click on any of the successful Profile Operations from the list and hit the Results button.
Step 2: The Profile Results modal displays a list of both profiled and non-profiled containers. You can filter the view to show only non-profiled containers by toggling on the button, which will display the complete list of unprofiled containers.
The Profile Results modal also provides two analysis options for you:
- Details for a Specific Container (Container's Profile)
- Details for a Specific Field of a Container (Field Profile)
Unwrap any of the containers from the Profile Results modal and click on the arrow icon.
Details for a Specific Container (Container's Profile)
Based on your selection of container from the profile operation results, you will be automatically redirected to the container details on the source datastore details page.
The following details (metrics) will be visible for analyzing the specific container you selected:
- Quality Score (79): Represents an overall quality assessment of the container on a scale of 0 to 100. A score of 79 suggests that the data quality is relatively good but may need further improvement.
- Sampling (100%): Indicates that 100% of the data in this container was sampled for analysis, meaning the entire dataset was reviewed.
- Completeness (100%): Indicates that all entries are complete, with no missing or null values, signifying data integrity.
- Active Checks (2): Shows that 2 data quality checks are actively running on this container. These checks typically monitor aspects such as format, uniqueness, or consistency.
- Active Anomalies (0): Indicates that no active anomalies or issues have been detected, meaning no irregularities were found during the checks.
Details for a Specific Field of a Container (Field Profile)
Unwrap the container to view the underlying fields. The following details (metrics) will be visible for analyzing a specific field of the container:
No | Profile | Description |
---|---|---|
1 | Type Inferred | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the 'tailedness' of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
- Histogram
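As a quick, invented illustration of these metrics (not platform output): for a numeric field containing the values 1 through 9, Distinct Values = 9, Min = 1, Max = 9, Mean = 5, Median = 5, Q1 = 3 (the central point between the minimum and the median), Q3 = 7 (the central point between the median and the maximum), Sum = 45, and Skewness = 0 because the distribution is symmetric.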
API Payload Examples
This section provides payload examples for initiating and checking the running status of a profile operation. Replace the placeholder values with data specific to your setup.
Running a Profile operation
To run a profile operation, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Post): /api/operations/run (post)
Option I: Running a profile operation of all containers
-
container_names: [ ]: This setting indicates that profiling will encompass all containers.
-
max_records_analyzed_per_partition: null: This setting implies that all records within all containers will be profiled.
-
inference_threshold: 5: This setting indicates that the engine will automatically infer quality checks of level 5 for you.
{
"type":"profile",
"datastore_id": datastore-id,
"container_names":[],
"max_records_analyzed_per_partition":null,
"inference_threshold":5
}
Option II: Running a profile operation of specific containers
-
container_names: ["table_name_1", "table_name_2"]: This setting indicates that profiling will only cover the tables named table_name_1 and table_name_2.
-
max_records_analyzed_per_partition: 1000000: This setting means that up to 1 million rows per container will be profiled.
-
inference_threshold: 0: This setting indicates that the engine will not automatically infer quality checks for you.
{
"type":"profile",
"datastore_id":datastore-id,
"container_names":[
"table_name_1",
"table_name_2"
],
"max_records_analyzed_per_partition":1000000,
"inference_threshold":0
}
Scheduling a Profile operation
Below is a sample payload for scheduling a profile operation. Please substitute the placeholder values with the appropriate data relevant to your setup.
Endpoint (Post): /api/operations/schedule (post)
INFO: This payload is to run a scheduled profile operation every day at 00:00
Scheduling profile operation of all containers
{
"type":"profile",
"name":"My scheduled Profile operation",
"datastore_id":"datastore-id",
"container_names":[],
"max_records_analyzed_per_partition":null,
"infer_constraints":5,
"crontab":"00 00 /1 *"
}
Retrieving Profile Operation Status
To retrieve the profile operation status, use the API payload example below and replace the placeholder values with your specific values:
Endpoint (Get): /api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "profile",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Sample-Store",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"inference_threshold": 5,
"max_records_analyzed_per_partition": -1,
"max_count_testing_sample": 100000,
"histogram_max_distinct_values": 100,
"greater_than_time": null,
"greater_than_batch": null,
"percent_testing_threshold": 0.4,
"high_correlation_threshold": 0.5,
"status": {
"total_containers": 2,
"containers_analyzed": 2,
"partitions_analyzed": 2,
"records_processed": 1126,
"fields_profiled": 9,
"checks_synchronized": 26
},
"containers": [
{
"id": 123,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
{
"id": 456,
"name": "Container2",
"container_type": "table",
"table_type": "table"
}
],
"container_profiles": [
{
"id": 789,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"parent_profile_id": null,
"container": {
"id": 456,
"name": "Container2",
"container_type": "table",
"table_type": "table"
},
"records_count": 550,
"records_processed": 550,
"checks_synchronized": 11,
"field_profiles_count": 4,
"result": "success",
"message": null
},
{
"id": 790,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"parent_profile_id": null,
"container": {
"id": 123,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"records_count": 576,
"records_processed": 576,
"checks_synchronized": 15,
"field_profiles_count": 5,
"result": "success",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Scan Operation
The Scan Operation in Qualytics is performed on a datastore to enforce data quality checks for various data collections such as tables, views, and files. This operation has several key functions:
-
Record Anomalies: Identifies a single record (row) as anomalous and provides specific details regarding why it is considered anomalous. The simplest form of a record anomaly is a row that lacks an expected value for a field.
-
Shape Anomalies: Identifies structural issues within a dataset at the column or schema level. It highlights broader patterns or distributions that deviate from expected norms. If a dataset is expected to have certain fields and one or more fields are missing or contain inconsistent patterns, this would be flagged as a shape anomaly.
-
Anomaly Data Recording: All identified anomalies, along with related analytical data, are recorded in the associated Enrichment Datastore for further examination.
Additionally, the Scan Operation offers flexible options, including the ability to:
- Perform checks on incremental loads versus full loads.
- Limit the number of records scanned.
- Run scans on a selected list of tables or files.
- Schedule scans for future execution.
Let's get started! 🚀
Navigation to Scan Operation
Step 1: Select a source datastore from the side menu on which you would like to perform the scan operation.
Step 2: Clicking on your preferred datastore will navigate you to the datastore details page. Within the overview tab (default view), click on the Run button under Scan to initiate the scan operation.
Note
A scan operation can be commenced once the catalog and profile operations are completed.
Configuration
Step 1: Click on the Run button to initiate the scan operation.
Step 2: Select tables (in your JDBC datastore) or file patterns (in your DFS datastore) and tags you would like to be scanned.
1. All Tables/File Patterns
This option includes all tables or file patterns currently available for scanning in the datastore. It means that every table or file pattern recognized in your datastore will be subjected to the defined data quality checks. Use this when you want to perform a comprehensive scan covering all the available data without any exclusions.
2. Specific Tables/File Patterns
This option allows you to manually select the individual table(s) or file pattern(s) in your datastore to scan. Upon selecting this option, all the tables or file patterns associated with your datastore will be automatically populated allowing you to select the datasets you want to scan.
You can also search for the tables/file patterns you want to scan directly using the search bar. Use this option when you need to target particular datasets or when you want to exclude certain files from the scan for focused analysis or testing purposes.
3. Tag
This option enables you to automatically scan file patterns associated with the selected tags. Tags can be predefined or created to categorize and manage file patterns effectively.
Step 3: Click on the Next button to Configure Read Settings.
Step 4: Configure Read Settings, Starting Threshold (Optional), and the Record Limit.
1. Select the Read Strategy for your scan operation.
-
Incremental: This strategy is used to scan only the new or updated records since the last scan operation. On the initial run, a full scan is conducted unless a specific starting threshold is set. For subsequent scans, only the records that have changed since the last scan are processed. If tables or views do not have a defined incremental key, a full scan will be performed. Ideal for regular scans where only changes need to be tracked, saving time and computational resources.
-
Full: This strategy performs a comprehensive scan of all records within the specified data collections, regardless of any previous changes or scans. Every scan operation will include all records, ensuring a complete check each time. Suitable for periodic comprehensive checks or when incremental scanning is not feasible due to the nature of the data.
Warning
If any selected tables do not have an incremental identifier, a full scan will be performed for those tables.
Info
When running an Incremental Scan for the first time, Qualytics automatically performs a full scan, saving the incremental field for subsequent runs.
-
This ensures that the system establishes a baseline and captures all relevant data.
-
Once the initial full scan is completed, the system intelligently uses the saved incremental field to execute future Incremental Scans efficiently, focusing only on the new or updated data since the last scan.
-
This approach optimizes the scanning process while maintaining data quality and consistency.
2. Define the Starting Threshold (Optional), i.e., specify a minimum incremental identifier value to set a starting point for the scan.
-
Greater Than Time: This option applies only to tables with an incremental timestamp strategy. Users can specify a timestamp to scan records that were modified after this time.
-
Greater Than Batch: This option applies to tables with an incremental batch strategy. Users can set a batch value, ensuring that only records with a batch identifier greater than the specified value are scanned.
3. Define the Record Limit, i.e., the maximum number of records to be scanned per table after any initial filtering (see the payload sketch below for how these read settings map to the API).
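The sketch below shows how these read settings might appear in a run payload. The incremental and max_records_analyzed_per_partition fields appear in the scan payload documented later in this guide; greater_than_time and greater_than_batch are shown here with the names used in the operation status response, so treat their inclusion in a run request as an assumption and verify it against your API version.
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": ["table_name_1"],
"incremental": true,
"greater_than_time": "2023-01-01T00:00:00Z",
"max_records_analyzed_per_partition": 1000000
}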
Step 5: Click on the Next button to Configure the Scan Settings.
Step 6: Configure the Scan Settings.
1. Check Categories: Users can choose one or more check categories when initiating a scan. This allows for flexible selection based on the desired scope of the operation:
- Metadata: Includes checks that define the expected properties of the table, such as volume. These belong to the Volumetric rule type.
- Data Integrity: Includes checks that specify the expected values for the data stored in the table. These belong to all rule types except Volumetric.
2. Anomaly Options: Enable the option to automatically archive duplicate anomalies detected in previous scans that overlap with the current scan. This feature helps improve data management by minimizing redundancy and ensuring a more organized anomaly record.
- Archive Duplicate Anomalies: Automatically archive duplicate anomalies from previous scans that overlap with the current scan to enhance data management efficiency.
Step 7: Click on the Next button to Configure the Enrichment Settings.
Step 8: Configure the Enrichment Settings.
1. Remediation Strategy: This strategy dictates how your source tables are replicated in your enrichment datastore:
- None: This option does not replicate source tables. It only writes anomalies and associated source data to the enrichment datastore. This is useful when the primary goal is to track anomalies without duplicating the entire dataset.
- Append: This option replicates source tables using an append-first strategy. It adds new records to the enrichment datastore, maintaining a history of all data changes over time. This approach is beneficial for auditing and historical analysis.
- Overwrite: This option replicates source tables using an overwrite strategy, replacing existing data in the enrichment datastore with the latest data from the source. This method ensures the enrichment datastore always contains the most current data, which is useful for real-time analysis and reporting.
2. Source Record Limit: Sets a maximum limit on the number of records written to the enrichment datastore for each detected anomaly. This helps manage storage and processing requirements effectively.
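In the API payload examples later in this guide, the remediation strategy is passed as the remediation field with the values none, append, or overwrite, and the source record limit surfaces in the operation status response as enrichment_source_record_limit; whether the latter can be set directly on a run request is an assumption to verify against your API version. A minimal sketch with placeholder values:
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [],
"remediation": "append",
"enrichment_source_record_limit": 10
}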
Run Instantly
Click on the Run Now button to perform the scan operation immediately.
Schedule
Step 1: Click on the Schedule button to configure the available schedule options for your scan operation.
Step 2: Set the scheduling preferences for the scan operation.
1. Hourly: This option allows you to schedule the scan to run every hour at a specified minute. You can define the frequency in hours and the exact minute within the hour the scan should start. Example: If set to Every 1 hour(s) on minute 0, the scan will run every hour at the top of the hour (e.g., 1:00, 2:00, 3:00).
2. Daily: This option schedules the scan to run once every day at a specific time. You specify the number of days between scans and the exact time of day in UTC. Example: If set to Every 1 day(s) at 00:00 UTC, the scan will run every day at midnight UTC.
3. Weekly: This option schedules the scan to run on specific days of the week at a set time. You select the days of the week and the exact time of day in UTC for the scan to run. Example: If configured to run on "Sunday" and "Friday" at 00:00 UTC, the scan will execute at midnight UTC on these days.
4. Monthly: This option schedules the scan to run once a month on a specific day at a set time. You specify the day of the month and the time of day in UTC. If set to "On the 1st day of every 1 month(s), at 00:00 UTC," the scan will run on the first day of each month at midnight UTC.
5. Advanced: The advanced section for scheduling operations allows users to set up more complex and custom scheduling using Cron expressions. This option is particularly useful for defining specific times and intervals for scan operations with precision.
Cron expressions are a powerful and flexible way to schedule tasks. They use a syntax that specifies the exact timing of the task based on five fields:
- Minute (0 - 59)
- Hour (0 - 23)
- Day of the month (1 - 31)
- Month (1 - 12)
- Day of the week (0 - 6) (Sunday to Saturday)
Each field can be defined using specific values, ranges, or special characters to create the desired schedule.
Example: For instance, the Cron expression 0 0 * * *
schedules the scan operation to run at midnight (00:00) every day. Here’s a breakdown of this expression:
- 0 (Minute) - The task will run at the 0th minute.
- 0 (Hour) - The task will run at the 0th hour (midnight).
- * (Day of the month) - The task will run every day of the month.
- * (Month) - The task will run every month.
- * (Day of the week) - The task will run every day of the week.
Users can define other specific schedules by adjusting the Cron expression. For example:
- 0 12 * * 1-5 - Runs at 12:00 PM from Monday to Friday.
- 30 14 1 * * - Runs at 2:30 PM on the first day of every month.
- 0 22 * * 6 - Runs at 10:00 PM every Saturday.
To define a custom schedule, enter the appropriate Cron expression in the "Custom Cron Schedule (UTC)" field before specifying the schedule name. This will allow for precise control over the timing of the scan operation, ensuring it runs exactly when needed according to your specific requirements.
Step 3: Define the Schedule Name to identify the scheduled operation at run time.
Step 4: Click on the Schedule button to schedule your scan operation.
Note
You will receive a notification when the scan operation is completed.
Advanced Options
The advanced use cases described below require options that are not yet exposed in our user interface but are possible through interaction with the Qualytics API.
Runtime Variable Assignment
It is possible to reference a variable in a check definition (declared in double curly braces) and then assign that variable a value when a Scan operation is initiated. Variables are supported within any Spark SQL expression and are most commonly used in a check filter.
If a Scan is meant to assert a check with a variable, a value for that variable must be supplied as part of the Scan operation's check_variables
property.
For example, a check might include a filter such as transaction_date == {{ checked_date }}, which will be asserted against any records where transaction_date is equal to the value supplied when the Scan operation is initiated. In this case, that value would be assigned by passing the following payload when calling /api/operations/run:
{
"type": "scan",
"datastore_id": 42,
"container_names": ["my_container"],
"incremental": true,
"remediation": "none",
"max_records_analyzed_per_partition": 0,
"check_variables": {
"checked_date": "2023-10-15"
},
"high_count_rollup_threshold": 10
}
Operations Insights
When the scan operation is completed, you will receive a notification and can navigate to the Activity tab of the datastore on which you triggered the Scan Operation to learn about the scan results.
Top Panel
1. Runs (Default View): Provides insights into the operations that have been performed
2. Schedule: Provides insights into the scheduled operations.
3. Search: Search any operation (including scan) by entering the operation ID
4. Sort by: Organize the list of operations based on the Created Date or the Duration.
5. Filter: Narrow down the list of operations based on:
- Operation Type
- Operation Status
- Table
Activity Heatmap
The activity heatmap shown in the snippet below represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Tip
You can click on any of the squares from the Activity Heatmap to filter operations.
Operation Detail
Running
This status indicates that the scan operation is still running at the moment and is yet to be completed. A scan operation having a running status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started. |
3 | Progress Bar | The progress of the operation. |
4 | Triggered By | The author who triggered the operation. |
5 | Schedule | Indicates whether the operation was scheduled or not. |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation. |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation. |
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the running operation. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering. |
10 | Check Categories | Indicates which categories should be included in the scan (e.g., Metadata, Data Integrity). |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation. |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly. |
13 | Results | View the details of the ongoing scan operation. This includes information on which tables are currently being scanned, the anomalies identified so far (if any), and other related data collected during the active scan. |
14 | Abort | The Abort button enables you to stop the ongoing scan operation. |
15 | Summary | The summary section provides an overview of the scan operation in progress, including key metrics. |
Aborted
This status indicates that the scan operation was manually stopped before it could be completed. A scan operation having an aborted status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Aborted By | The author who aborted the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected before the operation was aborted |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Check Categories | Indicates which categories should be included in the scan (Metadata, Data Integrity) |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
13 | Results | View the details of the scan operation that was aborted, including tables scanned and anomalies identified |
14 | Resume | Provides an option to continue the scan operation from where it left off |
15 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the aborted scan |
16 | Delete | Removes the record of the aborted scan operation from the system, permanently deleting scan results and anomalies |
17 | Summary | The summary section provides an overview of the scan operation up to the point it was aborted, including key metrics. |
Warning
This status signals that the scan operation encountered some issues and displays the logs that facilitate improved tracking of the blockers and issue resolution. A scan operation having a completed with warning status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Triggered By | The author who triggered the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected before the operation completed with warnings. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Check Categories | Indicates which categories should be included in the scan (Metadata, Data Integrity) |
11 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
12 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
13 | Result | View the details of the scan operation that was completed with warning, including tables scanned and anomalies identified |
14 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the warning scan |
15 | Delete | Removes the record of the warning operation from the system, permanently deleting scan results and anomalies |
16 | Summary | The summary section provides an overview of the scan operation, highlighting any warnings encountered, including key metrics. |
17 | Logs | Logs include error messages, warnings, and other pertinent information that occurred during the execution of the Scan Operation. |
Success
This status confirms that the scan operation was completed successfully without any issues. A scan operation having a success status reflects the following details and actions:
No. | Parameter | Interpretation |
---|---|---|
1 | Operation ID and Type | Unique identifier and type of operation performed (catalog, profile, or scan). |
2 | Timestamp | Timestamp when the operation was started |
3 | Progress Bar | The progress of the operation |
4 | Triggered By | The author who triggered the operation |
5 | Schedule | Whether the operation was scheduled or not |
6 | Incremental Field | Indicates whether Incremental was enabled or disabled in the operation |
7 | Remediation | Indicates whether Remediation was enabled or disabled in the operation |
8 | Anomalies Identified | Provides a count of the number of anomalies detected during the successful completion of the operation. |
9 | Read Record Limit | Defines the maximum number of records to be scanned per table after initial filtering |
10 | Archive Duplicate Anomalies | Indicates whether Archive Duplicate Anomalies was enabled or disabled in the operation |
11 | Source Record Limit | Indicates the limit on records stored in the enrichment datastore for each detected anomaly |
12 | Results | View the details of the completed scan operation. This includes information on which tables were scanned, the anomalies identified (if any), and other relevant data collected throughout the successful completion of the scan. |
13 | Rerun | The "Rerun" button allows you to start a new scan operation using the same settings as the success scan |
14 | Delete | Removes the record of the scan operation from the system, permanently deleting scan results and anomalies |
15 | Summary | The summary section provides an overview of the scan operation upon successful completion, including key metrics. |
Full View of Metrics in Operation Summary
Users can now hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Scanned field to display the full value.
Post Operation Details
Step 1: Click on any of the successful Scan Operations from the list and hit the Results button.
Step 2: The Scan Results modal demonstrates the highlighted anomalies (if any) identified in your datastore with the following properties:
Ref. | Scan Properties | Description |
---|---|---|
1. | Table/File | The table or file where the anomaly is found. |
2. | Field | The field(s) where the anomaly is present. |
3. | Location | Fully qualified location of the anomaly. |
4. | Rule | Inferred and authored checks that failed assertions. |
5. | Description | Human-readable, auto-generated description of the anomaly. |
6. | Status | The status of the anomaly: Active, Acknowledged, Resolved, or Invalid. |
7. | Type | The type of anomaly (e.g., Record or Shape) |
8. | Date time | The date and time when the anomaly was found. |
API Payload Examples
This section provides payload examples for running, scheduling, and checking the status of scan operations. Replace the placeholder values with data specific to your setup.
Running a Scan operation
To run a scan operation, use the API payload example below and replace the placeholder values with your specific values.
Endpoint (Post):
/api/operations/run (post)
Option I: Running a scan operation of all containers
- container_names: []: This setting indicates that the scan will cover all containers.
- max_records_analyzed_per_partition: null: This setting indicates that all records of all containers will be scanned.
- remediation: append: This setting replicates source containers using an append-first strategy.
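The payload below is a reconstruction of these settings, modeled on the scan payload structure shown earlier in the Runtime Variable Assignment example; verify the exact field names against your deployment.
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [],
"max_records_analyzed_per_partition": null,
"remediation": "append"
}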
Option II: Running a scan operation of specific containers
- container_names: ["table_name_1", "table_name_2"]: This setting indicates that the scan will only cover the tables table_name_1 and table_name_2.
- max_records_analyzed_per_partition: 1000000: This setting indicates that a maximum of 1 million records per partition will be scanned.
- remediation: overwrite: This setting replicates source containers using an overwrite strategy.
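A corresponding reconstruction for scanning specific containers, again modeled on the scan payload shown in the Runtime Variable Assignment example:
{
"type": "scan",
"datastore_id": datastore-id,
"container_names": [
"table_name_1",
"table_name_2"
],
"max_records_analyzed_per_partition": 1000000,
"remediation": "overwrite"
}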
Scheduling scan operation of all containers
To schedule a scan operation, use the API payload example below and replace the placeholder values with your specific values.
Endpoint (Post):
/api/operations/schedule (post)
This payload is to run a scheduled scan operation every day at 00:00
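The payload below is a sketch of such a schedule, modeled on the profile scheduling payload shown earlier in this guide (name, datastore_id, container_names, crontab) with the type set to scan; verify the remaining fields against your API version.
{
"type": "scan",
"name": "My scheduled Scan operation",
"datastore_id": "datastore-id",
"container_names": [],
"max_records_analyzed_per_partition": null,
"remediation": "none",
"crontab": "00 00 */1 * *"
}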
Retrieving Scan Operation Information
Endpoint (Get)
/api/operations/{id} (get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "scan",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"result": "success",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"incremental": false,
"remediation": "none",
"max_records_analyzed_per_partition": -1,
"greater_than_time": null,
"greater_than_batch": null,
"high_count_rollup_threshold": 10,
"enrichment_source_record_limit": 10,
"status": {
"total_containers": 2,
"containers_analyzed": 2,
"partitions_scanned": 2,
"records_processed": 28,
"anomalies_identified": 2
},
"containers": [
{
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
{
"id": 235,
"name": "Container2",
"container_type": "table",
"table_type": "table"
}
],
"container_scans": [
{
"id": 456,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 235,
"name": "Container2",
"container_type": "table",
"table_type": "table"
},
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"records_processed": 8,
"anomaly_count": 1,
"result": "success",
"message": null
},
{
"id": 457,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"records_processed": 20,
"anomaly_count": 1,
"result": "success",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
External Scan Operation
An external scan is ideal for ad hoc scenarios, where you may receive a file intended to be replicated to a source datastore. Before loading, you can perform an external scan to ensure the file aligns with existing data standards. The schema of the file must match the target table or file pattern that has already been profiled within Qualytics, allowing you to reuse the quality checks to identify any issues before data integration.
Let’s get started 🚀
Navigation to External Scan Operation
Step 1: Select a source datastore from the side menu to perform the external scan operation.
Step 2: After selecting your preferred source datastore, you will be taken to the details page. From there, click on "Tables" and select the table you want to perform the external scan operation on.
Note
This example is based on a JDBC table, but the same steps apply to DFS as well. For DFS source datastores, you will need to click on "File Patterns" and select a File Pattern to run the external scan.
For demonstration purposes, we have selected the “CUSTOMER” table.
External Scan Configuration
Step 1: Click on the “Run” button and select the “External Scan” option.
Step 2: After selecting the "External Scan" option, a modal window will appear with an input for uploading your external file. After uploading the file, click the “Run” button to start the operation.
Note
An External Scan operation supports the following file formats: CSV, XLSX, and XLS.
Step 3: After clicking the "Run" button, the external scan operation will begin, and you will receive a confirmation message if the operation is successfully triggered.
Supported File Formats
External scan operation accepts CSV, XLSX, and XLS files. CSV is a simple text format, while XLSX and XLS are Excel formats that support more complex data structures. This versatility enables seamless integration of data from various sources.
An External Scan Operation can be configured with the following file formats:
File Extension | .csv | .xls | .xlsx |
---|---|---|---|
File Format | Comma-separated values | Microsoft Excel 97-2003 Workbook | Microsoft Excel 2007+ Workbook |
Header Row | Required for optimal reading. It should contain column names. | Recommended, but not strictly required. | Recommended, but not strictly required. |
Empty Cells | Represented as empty strings. | Allowed. | Allowed. |
Data Types | Typically inferred by Spark. | May require explicit specification for complex types. | May require explicit specification for complex types. |
Nested Data | Not directly supported. Consider flattening or using alternative file formats. | Not directly supported. Consider flattening or using alternative file formats. | Not directly supported. Consider flattening or using alternative file formats. |
Additional Considerations | Ensure consistent delimiter usage (usually commas). Avoid special characters or line breaks within fields. Enclose text fields containing commas or delimiters in double quotes. | Use a plain XLS format without macros or formatting. Consider converting to CSV for simpler handling. | Use a plain XLSX format without macros or formatting. Consider converting to CSV for simpler handling. |
Scenario
A company maintains a large sales database containing information about various transactions, customers, and products. They have received a new sales data file that will be integrated into the existing database. Before loading the data, the organization wants to ensure there are no issues with the file.
An External Scan is initiated to perform checks on the incoming file, validating that it aligns with the quality standards of the sales table.
Specific Checks:
Check | Description |
---|---|
Expected Schema | Verify that all columns have the same data type as the selected profile structure. |
Exists in | Verify that all transactions have valid customer and product references. |
Between Times | Ensure that transaction dates fall within an expected range. |
Satisfies Expression | Validate that the calculated revenue aligns with the unit price and quantity sold. The formula is: Revenue = Quantity × Unit Price. |
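For example, if a record reports a Quantity of 5 and a Unit_Price of 20.00, the Satisfies Expression check expects a revenue of 5 × 20.00 = 100.00; any stored revenue that differs from that product would be flagged.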
Potential Anomalies:
This overview highlights common issues such as data type mismatches, missing references, out-of-range dates, and inconsistent revenue calculations. Each anomaly affects data integrity and requires corrective action.
Anomaly | Description |
---|---|
Data type issue | The external resource does not follow the data type schema. |
Missing References | Transactions without valid customer or product references. |
Out-of-Range Dates | Transactions with dates outside the expected range. |
Inconsistent Revenue | Mismatch between calculated revenue and unit price times quantity. |
Benefits of External Scan:
Benefit | Description |
---|---|
Quality Assurance | Identify and rectify data inconsistencies before downstream processes. |
Data Integrity | Ensure that all records adhere to defined schema and constraints. |
Anomaly Detection | Uncover potential issues that might impact business analytics and reporting. |
CSV Table (Sales Data):
This dataset includes transaction records with details such as Transaction_ID, Customer_ID, Product_ID, Transaction_Date, Quantity, and Unit_Price. It provides essential information for tracking and analyzing sales activities.
Transaction_ID | Customer_ID | Product_ID | Transaction_Date | Quantity | Unit_Price |
---|---|---|---|---|---|
1 | 101 | 201 | 2023-01-15 | 5 | 20.00 |
2 | 102 | 202 | 2023-02-20 | 3 | 15.50 |
3 | 103 | 201 | 2023-03-10 | 2 | 25.00 |
4 | 104 | 203 | 2023-04-05 | 1 | 30.00 |
... | ... | ... | ... | ... | ... |
graph TB
subgraph Init
A[Start] --> B[Load Sales Data]
end
subgraph Checks
B --> C1[Expected schema]
B --> C2[Exists in]
B --> C3[Between times]
B --> C4[Satisfies expression]
C1 -->|Invalid| E1[Expected schema anomaly]
C2 -->|Invalid| E2[Exists in anomaly]
C3 -->|Invalid| E3[Between times anomaly]
C4 -->|Invalid| E4[Satisfies expression anomaly]
end
subgraph End
E1 --> J[Finish]
E2 --> J[Finish]
E3 --> J[Finish]
E4 --> J[Finish]
end
API Payload Examples
Running an External Scan operation
This section provides a sample payload for running an external scan operation. Replace the placeholder values with actual data relevant to your setup.
Endpoint (Post)
/api/containers/{container-id}/scan
(post)
Retrieving an External Scan Operation Status
Endpoint (Get)
/api/operations/{id}
(get)
{
"items": [
{
"id": 12345,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"type": "external_scan",
"start_time": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"end_time": null,
"result": "running",
"message": null,
"triggered_by": "user@example.com",
"datastore": {
"id": 101,
"name": "Datastore-Sample",
"store_type": "jdbc",
"type": "db_type",
"enrich_only": false,
"enrich_container_prefix": "data_prefix",
"favorite": false
},
"schedule": null,
"incremental": false,
"remediation": "none",
"max_records_analyzed_per_partition": -1,
"greater_than_time": null,
"greater_than_batch": null,
"high_count_rollup_threshold": 10,
"enrichment_source_record_limit": 10,
"status": {
"total_containers": 1,
"containers_analyzed": 0,
"partitions_scanned": 0,
"records_processed": 0,
"anomalies_identified": 0
},
"containers": [
{
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
}
],
"container_scans": [
{
"id": 456,
"created": "YYYY-MM-DDTHH:MM:SS.ssssssZ",
"container": {
"id": 234,
"name": "Container1",
"container_type": "table",
"table_type": "table"
},
"start_time": null,
"end_time": null,
"records_processed": 0,
"anomaly_count": 0,
"result": "running",
"message": null
}
],
"tags": []
}
],
"total": 1,
"page": 1,
"size": 50,
"pages": 1
}
Datastore Settings
Qualytics allows you to manage your datastore efficiently by editing source datastore information, linking an enrichment datastore for enhanced insights, establishing new connections to expand data sources, choosing connectors to integrate diverse data, adjusting the quality score to ensure data accuracy, and deleting the store. This ensures flexibility and control over your data management processes within the platform.
Let's get started 🚀
Navigation to Settings
Step 1: Select a source datastore from the side menu for which you would like to manage the settings.
Step 2: Click on the Settings icon from the top right window. A drop-down menu will appear with the following options:
- Edit
- Enrichment
- Score
- Delete
Edit Datastore
The Edit Datastore setting allows users to modify the connection details of the datastore. This includes updating the host, port, SID, username, password, schema, and any associated teams.
Note
Connection details can vary based on the type of datastore being edited. For example, details for BigQuery will differ from Snowflake or Athena.
Step 1: Click on the Edit option
Step 2: After selecting the Edit option, a modal window will appear, displaying the connection details. This window allows you to modify any specific connection details.
Step 3: After editing the connection details, click on the Save button.
Link Enrichment Datastore
An enrichment datastore is a database used to enhance your existing data by adding additional, relevant information. This helps you to provide more comprehensive insight into data and improve data accuracy.
You have the option to link an enrichment datastore to your existing source datastore. However, some datastores cannot be linked as enrichment datastores. For example, Oracle, Athena, and Timescale cannot be used for this purpose.
Step 1: Click on the Enrichment from the dropdown list.
A modal window-Link Enrichment Datastore will appear, providing you with two options to link an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Caret Down Button | Click the caret down to select either Use Enrichment Datastore or Add Enrichment Datastore. |
3. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Option I: Link New Enrichment
If the toggle for Add new connection is turned on, you will be prompted to link a new enrichment datastore from scratch, without using existing connection details.
Step 1: Click on the caret button and select Add Enrichment Datastore.
A modal window Link Enrichment Datastore will appear. Enter the following details to create an enrichment datastore with a new connection.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Name | Give a name for the enrichment datastore. |
3. | Toggle Button for add new connection | Toggle ON to create a new enrichment from scratch or toggle OFF to reuse credentials from an existing connection. |
4. | Connector | Select a datastore connector from the dropdown list. |
Step 2: Add connection details for your selected enrichment datastore connector.
Note
Connection details can vary from datastore to datastore. For illustration, we have demonstrated linking BigQuery as a new enrichment datastore.
Step 3: After adding the source datastore details, click on the Test Connection button to check and verify its connection.
If the credentials and provided details are verified, a success message will be displayed indicating that the connection has been verified.
Step 4: Click on the Save button.
Step 5: After clicking on the Save button, a modal window will appear confirming that your datastore has been successfully updated.
Option II: Link Existing Connection
If the Use an existing enrichment datastore option is selected from the dropdown menu, you will be prompted to link the enrichment datastore using existing connection details.
Step 1: Click on the caret button and select Use Enrichment Datastore.
Step 2: A modal window Link Enrichment Datastore will appear. Add a prefix name and select an existing enrichment datastore from the dropdown list.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Prefix | Add a prefix name to uniquely identify tables/files when Qualytics writes metadata from the source datastore to your enrichment datastore. |
2. | Enrichment Datastore | Select an enrichment datastore from the dropdown list. |
Step 3: View and check the connection details of the enrichment and click on the Save button.
Step 4: After clicking on the Save button, a modal window will appear confirming that your datastore has been successfully updated.
Quality Score Settings
Quality Scores are quantified measures of data quality calculated at the field and container levels, recorded as time series to enable tracking of changes over time. Scores range from 0 to 100, with higher values indicating superior quality. These scores integrate eight distinct factors, providing a granular analysis of the attributes that impact overall data quality.
Each field receives a total quality score based on eight key factors, each evaluated on a 0-100 scale. The overall score is a composite reflecting the relative importance and configured weights of these factors:
-
Completeness: Measures the average completeness of a field across all profiles.
-
Coverage: Assesses the adequacy of data quality checks for the field.
-
Conformity: Checks alignment with standards defined by quality checks.
-
Consistency: Ensures uniformity in type and scale across all data representations.
-
Precision: Evaluates the resolution of field values against defined quality checks.
-
Timeliness: Gauges data availability according to schedule inheriting the container's timeliness.
-
Volumetrics: Analyzes consistency in data size and shape over time inheriting the container's volumetrics.
-
Accuracy: Determines the fidelity of field values to their real-world counterparts.
The Quality Score Settings allow users to tailor the impact of each quality factor on the total score by adjusting their weights allowing the scoring system to align with your organization’s data governance priorities.
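For illustration only (the platform's exact scoring formula is not described here), a weighted composite behaves like a weighted average: if Completeness scores 90 with a weight of 2 and Conformity scores 70 with a weight of 1, their combined contribution is (90 × 2 + 70 × 1) / (2 + 1) ≈ 83.3, so increasing a factor's weight pulls the total score further toward that factor.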
Step 1: Click on the Score option in the settings icon.
Step 2: A modal window "Quality Score Settings" will appear.
Step 3: The Decay Period slider sets the time frame over which the system evaluates historical data to determine the quality score. The decay period for considering past data events defaults to 180 days but can be customized to fit your operational needs ensuring the scores reflect the most relevant data quality insights.
Step 4: Adjust the Factor Weights using the sliding bar. The factor weights determine the importance of different data quality aspects.
Step 5: Click on the Save button to save the quality score settings.
Delete Datastore
The Delete Datastore action permanently removes a datastore and all associated profiles, checks, and anomalies. This action cannot be undone and requires confirmation by typing the datastore name before proceeding.
Step 1: Click on the Delete option in the settings icon.
Step 2: A modal window Delete Datastore will appear.
Step 3: Enter the Name of the datastore in the given field (confirmation check) and then click on the I’M SURE, DELETE THIS DATASTORE button to delete the datastore.
Mark Datastore as Favorite
Marking a datastore as a favorite allows you to quickly access important data sources. This feature helps you prioritize and manage the datastores you frequently use, making data management more efficient.
Step 1: Click on the bookmark icon to mark the Datastores as a favorite.
After clicking on the bookmark icon, your datastore is marked as a favorite and a success message will appear stating "The datastore has been favorited".
Step 2: To unmark a datastore, simply click on the bookmark icon of the marked datastore. This will remove it from your favorites.
Ended: Source Datastores
Enrichment Datastores ↵
Enrichment Datastore Overview
An Enrichment Datastore is a user-managed storage location where the Qualytics platform records and accesses metadata through a set of system-defined tables. It is purpose-built to capture metadata generated by the platform's profiling and scanning operations.
Let’s get started 🚀
Key Points
-
Metadata Storage: The Enrichment Datastore acts as a dedicated mechanism for writing and retaining metadata that the platform generates. This includes information about anomalies, quality checks, field profiling, and additional details that enrich the source data.
-
Feature Enablement: By using the Enrichment Datastore, the platform unlocks certain features such as the previewing of source records. For instance, when an anomaly is detected, the platform typically previews a limited set of affected records. For a comprehensive view and persistent access, the Enrichment Datastore captures and maintains a complete snapshot of the source records associated with the anomalies.
-
User-Managed Location: While the Qualytics platform handles the generation and processing of metadata, the actual storage is user-managed. This means that the user maintains control over the Enrichment Datastore, deciding where and how this data is stored, adhering to their governance and compliance requirements.
-
Insight and Reporting: Beyond storing metadata, the Enrichment Datastore allows users to derive actionable insights and develop custom reports for a variety of use cases, from compliance tracking to data quality improvement initiatives.
Navigation
Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Table Types
The Enrichment Datastore contains several types of tables, each serving a specific purpose in the data enrichment and remediation process. These tables are categorized into:
- Enrichment Tables
- Remediation Tables
- Metadata Tables
Enrichment Tables
When anomalies are detected, the platform writes metadata into four primary enrichment tables:
- <enrichment_prefix>_check_metrics
- <enrichment_prefix>_failed_checks
- <enrichment_prefix>_source_records
- <enrichment_prefix>_scan_operations
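For example, if the enrichment prefix configured when linking the datastore were export (a placeholder value), these tables would be named export_check_metrics, export_failed_checks, export_source_records, and export_scan_operations.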
_CHECK_METRICS Table
Captures and logs detailed metrics for every data quality check performed within the Qualytics Platform, providing insights into asserted and anomalous records across datasets.
Columns
Name | Data Type | Description |
---|---|---|
OPERATION_ID | NUMBER | Unique Identifier for the check metric. |
CONTAINER_ID | NUMBER | Identifier for the container associated with the check metric. |
SOURCE_DATASTORE | STRING | Datastore where the source data resides. |
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
QUALITY_CHECK_ID | NUMBER | Unique identifier for the quality check performed. |
ASSERTED_RECORDS_COUNT | NUMBER | Count of records expected or asserted in the source. |
ANOMALOUS_RECORDS_COUNT | NUMBER | Count of records identified as anomalous. |
_QUALYTICS_SOURCE_PARTITION | STRING | Partition information specific to Qualytics metrics. |
_FAILED_CHECKS Table
Acts as an associative entity that consolidates information on failed checks, associating anomalies with their respective quality checks.
Columns
Name | Data Type | Description |
---|---|---|
QUALITY_CHECK_ID | NUMBER | Unique identifier for the quality check. |
ANOMALY_UUID | STRING | UUID for the anomaly detected. |
QUALITY_CHECK_MESSAGE | STRING | Message describing the quality check outcome. |
SUGGESTED_REMEDIATION_FIELD | STRING | Field suggesting remediation. |
SUGGESTED_REMEDIATION_VALUE | STRING | Suggested value for remediation. |
SUGGESTED_REMEDIATION_SCORE | FLOAT | Score indicating confidence in remediation. |
QUALITY_CHECK_RULE_TYPE | STRING | Type of rule applied for quality check. |
QUALITY_CHECK_TAGS | STRING | Tags associated with the quality check. |
QUALITY_CHECK_PARAMETERS | STRING | Parameters used for the quality check. |
QUALITY_CHECK_DESCRIPTION | STRING | Description of the quality check. |
OPERATION_ID | NUMBER | Identifier for the operation detecting anomaly. |
DETECTED_TIME | TIMESTAMP | Timestamp when the anomaly was detected. |
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
SOURCE_DATASTORE | STRING | Datastore where the source data resides. |
Info
This table is not characterized by unique ANOMALY_UUID or QUALITY_CHECK_ID values alone. Instead, the combination of ANOMALY_UUID and QUALITY_CHECK_ID serves as a composite key, uniquely identifying each record in the table.
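Because uniqueness comes from the (ANOMALY_UUID, QUALITY_CHECK_ID) pair, queries should aggregate with that composite key in mind. A minimal sketch, assuming a hypothetical enrichment prefix of _DATASTORE_PREFIX, that counts how many distinct quality checks each anomaly failed:
SELECT
    ANOMALY_UUID,
    COUNT(DISTINCT QUALITY_CHECK_ID) AS failed_check_count
FROM "_DATASTORE_PREFIX_FAILED_CHECKS"
GROUP BY ANOMALY_UUID
ORDER BY failed_check_count DESC;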
_SOURCE_RECORDS Table
Stores source records in JSON format, primarily to enable the preview source record feature in the Qualytics App.
Columns
Name | Data Type | Description |
---|---|---|
SOURCE_CONTAINER | STRING | Name of the source data container. |
SOURCE_PARTITION | STRING | Partition of the source data. |
ANOMALY_UUID | STRING | UUID for the associated anomaly. |
CONTEXT | STRING | Contextual information for the anomaly. |
RECORD | STRING | JSON representation of the source record. |
_SCAN_OPERATIONS Table
Captures and stores the results of every scan operation conducted on the Qualytics Platform.
Columns
Name | Data Type | Description |
---|---|---|
OPERATION_ID | NUMBER | Unique identifier for the scan operation. |
DATASTORE_ID | NUMBER | Identifier for the source datastore associated with the operation. |
CONTAINER_ID | NUMBER | Identifier for the container associated with the operation. |
CONTAINER_SCAN_ID | NUMBER | Identifier for the container scan associated with the operation. |
PARTITION_NAME | STRING | Name of the source partition on which the scan operation is performed. |
INCREMENTAL | BOOLEAN | Boolean flag indicating whether the scan operation is incremental. |
RECORDS_PROCESSED | NUMBER | Total number of records processed during the scan operation. |
ENRICHMENT_SOURCE_RECORD_LIMIT | NUMBER | Maximum number of records written to the enrichment for each anomaly detected. |
MAX_RECORDS_ANALYZED | NUMBER | Maximum number of records analyzed in the scan operation. |
ANOMALY_COUNT | NUMBER | Total number of anomalies identified in the scan operation. |
START_TIME | TIMESTAMP | Timestamp marking the start of the scan operation. |
END_TIME | TIMESTAMP | Timestamp marking the end of the scan operation. |
RESULT | STRING | Textual representation of the scan operation's status. |
MESSAGE | STRING | Detailed message regarding the process of the scan operation. |
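For example, the sketch below lists the most recent scan operations with their record and anomaly counts. The table name assumes a hypothetical _DATASTORE_PREFIX, and the LIMIT clause may need adjusting for your warehouse's dialect.
SELECT
    OPERATION_ID,
    PARTITION_NAME,
    INCREMENTAL,
    RECORDS_PROCESSED,
    ANOMALY_COUNT,
    START_TIME,
    END_TIME,
    RESULT
FROM "_DATASTORE_PREFIX_SCAN_OPERATIONS"
ORDER BY START_TIME DESC
LIMIT 20;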
Remediation Tables
When anomalies are detected in a container, the platform has the capability to create remediation tables in the Enrichment Datastore. These tables are detailed snapshots of the affected container, capturing the state of the data at the time of anomaly detection. They also include additional columns for metadata and remediation purposes. However, the creation of these tables depends upon the chosen remediation strategy during the scan operation.
Currently, there are three types of remediation strategies:
- None: No remediation tables will be created, regardless of anomaly detection.
- Append: Replicate source containers using an append-first strategy.
- Overwrite: Replicate source containers using an overwrite strategy.
Note
The naming convention for the remediation tables follows the pattern <enrichment_prefix>_remediation_<container_id>, where <enrichment_prefix> is user-defined during the Enrichment Datastore configuration and <container_id> corresponds to the ID of the original source container.
Illustrative Table
_{ENRICHMENT_CONTAINER_PREFIX}_REMEDIATION_{CONTAINER_ID}
This remediation table is an illustrative snapshot of the "Orders" container for reference purposes.
Name | Data Type | Description |
---|---|---|
_QUALYTICS_SOURCE_PARTITION | STRING | The partition from the source data container. |
ANOMALY_UUID | STRING | Unique identifier of the anomaly. |
ORDERKEY | NUMBER | Unique identifier of the order. |
CUSTKEY | NUMBER | The customer key related to the order. |
ORDERSTATUS | CHAR | The status of the order (e.g., 'F' for 'finished'). |
TOTALPRICE | FLOAT | The total price of the order. |
ORDERDATE | DATE | The date when the order was placed. |
ORDERPRIORITY | STRING | Priority of the order (e.g., 'urgent'). |
CLERK | STRING | The clerk who took the order. |
SHIPPRIORITY | INTEGER | The priority given to the order for shipping. |
COMMENT | STRING | Comments related to the order. |
Note
In addition to capturing the original container fields, the platform includes two metadata columns designed to assist in the analysis and remediation process.
- _QUALYTICS_SOURCE_PARTITION
- ANOMALY_UUID
Understanding Remediation Tables vs. Source Record Tables
When managing data anomalies in containers, it's important to understand the structures of Remediation Tables and Source Record Tables in the Enrichment Datastore.
Remediation Tables
Purpose: Remediation tables are designed to capture detailed snapshots of the affected containers at the time of anomaly detection. They serve as a primary tool for remediation actions.
Creation: These tables are generated based on the remediation strategy selected during the scan operation:
- None: No tables are created.
- Append: Tables are created with new data appended.
- Overwrite: Tables are created and existing data is overwritten.
Structure: The structure includes all columns from the source container, along with additional columns for metadata and remediation purposes. The naming convention for these tables is <enrichment_prefix>_remediation_<container_id>, where <enrichment_prefix> is defined during the Enrichment Datastore configuration.
Source Record Tables
Purpose: The Source Record Table is mainly used within the Qualytics App to display anomalies directly to users by showing the source records.
Structure: Unlike remediation tables, the Source Record Table stores each record in a JSON format within a single column named RECORD, along with other metadata columns like SOURCE_CONTAINER, SOURCE_PARTITION, ANOMALY_UUID, and CONTEXT.
Key Differences
-
Format: Remediation tables are structured with separate columns for each data field, making them easier to use for consulting and remediation processes.
Source Record Tables store data in a JSON format within a single column, which can be less convenient for direct data operations.
-
Usage: Remediation tables are optimal for performing corrective actions and are designed to integrate easily with data workflows.
Source Record Tables are best suited for reviewing specific anomalies within the Qualytics App due to their format and presentation.
Recommendation
For users intending to perform querying or need detailed snapshots for audit purposes, Remediation Tables are recommended.
For those who need to quickly review anomalies directly within the Qualytics App, Source Record Tables are more suitable due to their straightforward presentation of data.
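To make the difference concrete, the sketch below contrasts the two structures using a hypothetical enrichment prefix of _DATASTORE_PREFIX, a remediation table for container id 123 (the illustrative Orders snapshot above), and a placeholder anomaly UUID. The remediation table is queried column-by-column, while the source record table requires parsing the JSON held in RECORD (PARSE_JSON is Snowflake syntax; use your warehouse's JSON functions as appropriate).
-- Remediation table: anomalous rows are regular columns
SELECT ANOMALY_UUID, ORDERKEY, TOTALPRICE
FROM "_DATASTORE_PREFIX_REMEDIATION_123"
WHERE ANOMALY_UUID = '<anomaly-uuid>';

-- Source record table: the same rows live inside a JSON string
SELECT ANOMALY_UUID,
       PARSE_JSON(RECORD):ORDERKEY::string   AS ORDERKEY,
       PARSE_JSON(RECORD):TOTALPRICE::string AS TOTALPRICE
FROM "_DATASTORE_PREFIX_SOURCE_RECORDS"
WHERE ANOMALY_UUID = '<anomaly-uuid>';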
Metadata Tables
The Qualytics platform enables users to manually export metadata into the enrichment datastore, providing a structured approach to data analysis and management. These metadata tables are structured to reflect the evolving characteristics of data entities, primarily focusing on aspects that are subject to changes.
Currently, the following assets are available for exporting:
- _<enrichment_prefix>_export_anomalies
- _<enrichment_prefix>_export_checks
- _<enrichment_prefix>_export_field_profiles
Note
The strategy used for managing these metadata tables employs a create or replace approach: the export process will create a new table if one does not exist, or replace it entirely if it does, so any previous data will be overwritten.
For more detailed information on exporting metadata, please refer to the export documentation.
_EXPORT_ANOMALIES Table
Contains metadata from anomalies in a distinct normalized format. This table is specifically designed to capture the mutable states of anomalies, emphasizing their status changes.
Columns
Name | Data Type | Description |
---|---|---|
ID | NUMBER | Unique identifier for the anomaly. |
CREATED | TIMESTAMP | Timestamp of anomaly creation. |
UUID | UUID | Universal Unique Identifier of the anomaly. |
TYPE | STRING | Type of the anomaly (e.g., 'shape'). |
STATUS | STRING | Current status of the anomaly (e.g., 'Active'). |
GLOBAL_TAGS | STRING | Tags associated globally with the anomaly. |
CONTAINER_ID | NUMBER | Identifier for the associated container. |
SOURCE_CONTAINER | STRING | Name of the source container. |
DATASTORE_ID | NUMBER | Identifier for the associated datastore. |
SOURCE_DATASTORE | STRING | Name of the source datastore. |
GENERATED_AT | TIMESTAMP | Timestamp when the export was generated. |
_EXPORT_CHECKS Table
Contains metadata from quality checks.
Columns
Name | Data Type | Description |
---|---|---|
ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
CREATED | STRING | Created timestamp of the check. |
DELETED_AT | STRING | Deleted timestamp of the check. |
DESCRIPTION | STRING | Description of the check. |
FIELDS | STRING | Fields involved in the check, separated by commas. |
FILTER | STRING | Criteria used to filter data when asserting the check. |
GENERATED_AT | STRING | Indicates when the export was generated. |
GLOBAL_TAGS | STRING | Represents the global tags of the check, separated by commas. |
HAS_PASSED | BOOLEAN | Boolean indicator of whether the check has passed its last assertion. |
ID | NUMBER | Unique identifier for the check. |
INFERRED | BOOLEAN | Indicates whether the check was inferred by the platform. |
IS_NEW | BOOLEAN | Flags if the check is new. |
LAST_ASSERTED | STRING | Timestamp of the last assertion performed on the check. |
LAST_EDITOR | STRING | Represents the last editor of the check. |
LAST_UPDATED | STRING | Represents the last updated timestamp of the check. |
NUM_CONTAINER_SCANS | NUMBER | Number of container scans associated with the check. |
PROPERTIES | STRING | Specific properties for the check in a JSON format. |
RULE_TYPE | STRING | Type of rule applied in the check. |
WEIGHT | FLOAT | Represents the weight of the check. |
DATASTORE_ID | NUMBER | Identifier of the datastore used in the check. |
CONTAINER_ID | NUMBER | Identifier of the container used in the check. |
TEMPLATE_ID | NUMBER | Identifier of the template associated with the check. |
IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template or not. |
SOURCE_CONTAINER | STRING | Name of the container used in the check. |
SOURCE_DATASTORE | STRING | Name of the datastore used in the check. |
_EXPORT_CHECK_TEMPLATES Table
Contains metadata from check templates.
Columns
Name | Data Type | Description |
---|---|---|
ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
CREATED | STRING | Created timestamp of the check. |
DELETED_AT | STRING | Deleted timestamp of the check. |
DESCRIPTION | STRING | Description of the check. |
FIELDS | STRING | Fields involved in the check, separated by commas. |
FILTER | STRING | Criteria used to filter data when asserting the check. |
GENERATED_AT | STRING | Indicates when the export was generated. |
GLOBAL_TAGS | STRING | Represents the global tags of the check, separated by commas. |
ID | NUMBER | Unique identifier for the check. |
IS_NEW | BOOLEAN | Flags if the check is new. |
IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template or not. |
LAST_EDITOR | STRING | Represents the last editor of the check. |
LAST_UPDATED | STRING | Represents the last updated timestamp of the check. |
PROPERTIES | STRING | Specific properties for the check in a JSON format. |
RULE_TYPE | STRING | Type of rule applied in the check. |
TEMPLATE_CHECKS_COUNT | NUMBER | Count of checks associated with the template. |
TEMPLATE_LOCKED | BOOLEAN | Indicates whether the check template is locked or not. |
WEIGHT | FLOAT | Represents the weight of the check. |
_EXPORT_FIELD_PROFILES Table
Contains metadata from field profiles.
Columns
Name | Data Type | Description |
---|---|---|
APPROXIMATE_DISTINCT_VALUES | FLOAT | Estimated number of distinct values in the field. |
COMPLETENESS | FLOAT | Ratio of non-null entries to total entries in the field. |
CONTAINER_ID | NUMBER | Identifier for the container holding the field. |
SOURCE_CONTAINER | STRING | Name of the container holding the field. |
CONTAINER_STORE_TYPE | STRING | Storage type of the container. |
CREATED | STRING | Date when the field profile was created. |
DATASTORE_ID | NUMBER | Identifier for the datastore containing the field. |
SOURCE_DATASTORE | STRING | Name of the datastore containing the field. |
DATASTORE_TYPE | STRING | Type of datastore. |
ENTROPY | FLOAT | Measure of randomness in the information being processed. |
FIELD_GLOBAL_TAGS | STRING | Global tags associated with the field. |
FIELD_ID | NUMBER | Unique identifier for the field. |
FIELD_NAME | STRING | Name of the field being profiled. |
FIELD_PROFILE_ID | NUMBER | Identifier for the field profile record. |
FIELD_QUALITY_SCORE | FLOAT | Score representing the quality of the field. |
FIELD_TYPE | STRING | Data type of the field. |
FIELD_WEIGHT | NUMBER | Weight assigned to the field for quality scoring. |
GENERATED_AT | STRING | Date when the field profile was generated. |
HISTOGRAM_BUCKETS | STRING | Distribution of data within the field represented as buckets. |
IS_NOT_NORMAL | BOOLEAN | Indicator of whether the field data distribution is not normal. |
KLL | STRING | Sketch summary of the field data distribution. |
KURTOSIS | FLOAT | Measure of the tailedness of the probability distribution. |
MAX | FLOAT | Maximum value found in the field. |
MAX_LENGTH | FLOAT | Maximum length of string entries in the field. |
MEAN | FLOAT | Average value of the field's data. |
MEDIAN | FLOAT | Middle value in the field's data distribution. |
MIN | FLOAT | Minimum value found in the field. |
MIN_LENGTH | FLOAT | Minimum length of string entries in the field. |
NAME | STRING | Descriptive name of the field. |
Q1 | FLOAT | First quartile in the field's data distribution. |
Q3 | FLOAT | Third quartile in the field's data distribution. |
SKEWNESS | FLOAT | Measure of the asymmetry of the probability distribution. |
STD_DEV | FLOAT | Standard deviation of the field's data. |
SUM | FLOAT | Sum of all numerical values in the field. |
TYPE_DECLARED | BOOLEAN | Indicator of whether the field type is explicitly declared. |
UNIQUE_DISTINCT_RATIO | FLOAT | Ratio of unique distinct values to the total distinct values. |
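As a usage sketch, the query below surfaces fields with low completeness from the exported field profiles. The table name assumes a hypothetical enrichment prefix of _DATASTORE_PREFIX, and the 0.9 threshold is illustrative.
SELECT
    SOURCE_DATASTORE,
    SOURCE_CONTAINER,
    FIELD_NAME,
    COMPLETENESS,
    APPROXIMATE_DISTINCT_VALUES
FROM "_DATASTORE_PREFIX_EXPORT_FIELD_PROFILES"
WHERE COMPLETENESS < 0.9
ORDER BY COMPLETENESS ASC;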
Diagram
The diagram below provides a visual representation of the associations between various tables in the Enrichment Datastore. It illustrates how tables can be joined to track and analyze data across different processes.
Handling JSON and string splitting
The example queries below show how to work with exported check metadata: JSON-formatted columns such as ADDITIONAL_METADATA and PROPERTIES are parsed into individual keys, and comma-separated columns such as FIELDS and GLOBAL_TAGS are split into separate fields. The three variants use, in order, Snowflake, PostgreSQL, and MySQL syntax; adjust the table name to include your enrichment prefix.
SELECT
PARSE_JSON(ADDITIONAL_METADATA):metadata_1::string AS Metadata1_Key1,
PARSE_JSON(ADDITIONAL_METADATA):metadata_2::string AS Metadata2_Key1,
PARSE_JSON(ADDITIONAL_METADATA):metadata_3::string AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
SPLIT_PART(FIELDS, ',', 1) AS Field1,
SPLIT_PART(FIELDS, ',', 2) AS Field2,
-- Add more lines as needed up to FieldN
FILTER,
GENERATED_AT,
SPLIT_PART(GLOBAL_TAGS, ',', 1) AS Tag1,
SPLIT_PART(GLOBAL_TAGS, ',', 2) AS Tag2,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
PARSE_JSON(PROPERTIES):allow_other_fields::string AS Property_AllowOtherFields,
PARSE_JSON(PROPERTIES):assertion::string AS Property_Assertion,
PARSE_JSON(PROPERTIES):comparison::string AS Property_Comparison,
PARSE_JSON(PROPERTIES):datetime_::string AS Property_Datetime,
-- Add more lines as needed up to Property
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
(ADDITIONAL_METADATA::json ->> 'metadata_1') AS Metadata1_Key1,
(ADDITIONAL_METADATA::json ->> 'metadata_2') AS Metadata2_Key1,
(ADDITIONAL_METADATA::json ->> 'metadata_3') AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
(string_to_array(FIELDS, ','))[1] AS Field1,
(string_to_array(FIELDS, ','))[2] AS Field2,
-- Add more lines as needed up to FieldN
FILTER,
GENERATED_AT,
(string_to_array(GLOBAL_TAGS, ','))[1] AS Tag1,
(string_to_array(GLOBAL_TAGS, ','))[2] AS Tag2,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
(PROPERTIES::json ->> 'allow_other_fields') AS Property_AllowOtherFields,
(PROPERTIES::json ->> 'assertion') AS Property_Assertion,
(PROPERTIES::json ->> 'comparison') AS Property_Comparison,
(PROPERTIES::json ->> 'datetime_') AS Property_Datetime,
-- Add more lines as needed up to PropertyN
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
(ADDITIONAL_METADATA->>'$.metadata_1') AS Metadata1_Key1,
(ADDITIONAL_METADATA->>'$.metadata_2') AS Metadata2_Key1,
(ADDITIONAL_METADATA->>'$.metadata_3') AS Metadata3_Key1,
-- Add more lines as needed up to MetadataN
CONTAINER_ID,
COVERAGE,
CREATED,
DATASTORE_ID,
DELETED_AT,
DESCRIPTION,
SUBSTRING_INDEX(FIELDS, ',', 1) AS Field1,
-- Add more lines as needed up to FieldN
SUBSTRING_INDEX(GLOBAL_TAGS, ',', 1) AS Tag1,
-- Add more lines as needed up to TagN
HAS_PASSED,
ID,
INFERRED,
IS_NEW,
IS_TEMPLATE,
LAST_ASSERTED,
LAST_EDITOR,
LAST_UPDATED,
NUM_CONTAINER_SCANS,
(PROPERTIES->>'$.allow_other_fields') AS Property_AllowOtherFields,
(PROPERTIES->>'$.assertion') AS Property_Assertion,
(PROPERTIES->>'$.comparison') AS Property_Comparison,
(PROPERTIES->>'$.datetime_') AS Property_Datetime,
-- Add more lines as needed up to PropertyN
RULE_TYPE,
SOURCE_CONTAINER,
SOURCE_DATASTORE,
TEMPLATE_ID,
WEIGHT
FROM "_EXPORT_CHECKS";
Usage Notes
- Both metadata tables and remediation tables are designed to be ephemeral and should therefore be treated as temporary datasets. Users are advised to move this data to a more permanent dataset for long-term storage and reporting.
- The anomaly UUID in the remediation tables acts as a link to the detailed data in the _failed_checks enrichment table (see the example query after this list). This connection not only shows the number of failed checks but also provides insight into each one, such as the nature of the issue, the type of rule violated, and associated check tags. Additionally, when available, suggested remediation actions, including suggested field modifications and values, are presented alongside a score indicating the suggested action's potential effectiveness. This information helps users better understand the specifics of each anomaly related to the remediation tables.
- The Qualytics platform is configured to capture and write a maximum of 10 rows of data per anomaly by default for both the _source_records enrichment table and the remediation tables. To adjust this limit, users can utilize the enrichment_source_record_limit parameter within the Scan Operation settings. This parameter accepts a minimum value of 10 but allows the specification of a higher limit, up to an unrestricted number of rows per anomaly. It is important to note that if an anomaly is associated with fewer than 10 records, the platform will only write the actual number of records where the anomaly was detected.
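A minimal sketch of that linkage, assuming a hypothetical enrichment prefix of _DATASTORE_PREFIX and a remediation table for container id 123: joining on ANOMALY_UUID brings each remediated row together with its failed-check details and any suggested remediation.
SELECT
    r.ANOMALY_UUID,
    fc.QUALITY_CHECK_ID,
    fc.QUALITY_CHECK_RULE_TYPE,
    fc.QUALITY_CHECK_TAGS,
    fc.SUGGESTED_REMEDIATION_FIELD,
    fc.SUGGESTED_REMEDIATION_VALUE,
    fc.SUGGESTED_REMEDIATION_SCORE
FROM "_DATASTORE_PREFIX_REMEDIATION_123" r
JOIN "_DATASTORE_PREFIX_FAILED_CHECKS" fc
  ON fc.ANOMALY_UUID = r.ANOMALY_UUID;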
API Payload Examples
Retrieving Enrichment Datastore Tables
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/listing
[
{
"name":"_datastore_prefix_scan_operations",
"label":"scan_operations",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name":"_datastore_prefix_source_records",
"label":"source_records",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name":"_datastore_prefix_failed_checks",
"label":"failed_checks",
"datastore":{
"id":123,
"name":"My Datastore",
"store_type":"jdbc",
"type":"postgresql",
"enrich_only":false,
"enrich_container_prefix":"_datastore_prefix",
"favorite":false
}
},
{
"name": "_datastore_prefix_remediation_container_id",
"label": "table_name",
"datastore": {
"id": 123,
"name": "My Datastore",
"store_type": "jdbc",
"type": "postgresql",
"enrich_only": false,
"enrich_container_prefix": "_datastore_prefix",
"favorite": false
}
}
]
Retrieving Enrichment Datastore Source Records
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_source-record-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_source-record-table-prefix}
{
"source_records": "[{\"source_container\":\"table_name\",\"source_partition\":\"partition_name\",\"anomaly_uuid\":\"f11d4e7c-e757-4bf1-8cd6-d156d5bc4fa5\",\"context\":null,\"record\":\"{\\\"P_NAME\\\":\\\"\\\\\\\"strategize intuitive systems\\\\\\\"\\\",\\\"P_TYPE\\\":\\\"\\\\\\\"Radiographer, therapeutic\\\\\\\"\\\",\\\"P_RETAILPRICE\\\":\\\"-24.69\\\",\\\"LAST_MODIFIED_TIMESTAMP\\\":\\\"2023-09-29 11:17:19.048\\\",\\\"P_MFGR\\\":null,\\\"P_COMMENT\\\":\\\"\\\\\\\"Other take so.\\\\\\\"\\\",\\\"P_PARTKEY\\\":\\\"845004850\\\",\\\"P_SIZE\\\":\\\"4\\\",\\\"P_CONTAINER\\\":\\\"\\\\\\\"MED BOX\\\\\\\"\\\",\\\"P_BRAND\\\":\\\"\\\\\\\"PLC\\\\\\\"\\\"}\"}]"
}
Retrieving Enrichment Datastore Remediation
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_remediation-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_remediation-table-prefix}
{
"source_records": "[{\"source_container\":\"table_name\",\"source_partition\":\"partition_name\",\"anomaly_uuid\":\"f11d4e7c-e757-4bf1-8cd6-d156d5bc4fa5\",\"context\":null,\"record\":\"{\\\"P_NAME\\\":\\\"\\\\\\\"strategize intuitive systems\\\\\\\"\\\",\\\"P_TYPE\\\":\\\"\\\\\\\"Radiographer, therapeutic\\\\\\\"\\\",\\\"P_RETAILPRICE\\\":\\\"-24.69\\\",\\\"LAST_MODIFIED_TIMESTAMP\\\":\\\"2023-09-29 11:17:19.048\\\",\\\"P_MFGR\\\":null,\\\"P_COMMENT\\\":\\\"\\\\\\\"Other take so.\\\\\\\"\\\",\\\"P_PARTKEY\\\":\\\"845004850\\\",\\\"P_SIZE\\\":\\\"4\\\",\\\"P_CONTAINER\\\":\\\"\\\\\\\"MED BOX\\\\\\\"\\\",\\\"P_BRAND\\\":\\\"\\\\\\\"PLC\\\\\\\"\\\"}\"}]"
}
Retrieving Enrichment Datastore Failed Checks
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_failed-checks-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=anomaly_uuid='{uuid}'&path={_failed-checks-table-prefix}
{
"source_records": "[{\"quality_check_id\":155481,\"anomaly_uuid\":\"1a937875-6bce-4bfe-8701-075ba66be364\",\"quality_check_message\":\"{\\\"SNPSHT_TIMESTAMP\\\":\\\"2023-09-03 10:26:15.0\\\"}\",\"suggested_remediation_field\":null,\"suggested_remediation_value\":null,\"suggested_remediation_score\":null,\"quality_check_rule_type\":\"greaterThanField\",\"quality_check_tags\":\"Time-Sensitive\",\"quality_check_parameters\":\"{\\\"field_name\\\":\\\"SNPSHT_DT\\\",\\\"inclusive\\\":false}\",\"quality_check_description\":\"Must have a value greater than the value of SNPSHT_DT\",\"operation_id\":28162,\"detected_time\":\"2024-03-29T15:08:07.585Z\",\"source_container\":\"ACTION_TEST_CLIENT_V3\",\"source_partition\":\"ACTION_TEST_CLIENT_V3\",\"source_datastore\":\"DB2 Dataset\"}]"
}
Retrieving Enrichment Datastore Scan Operations
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_scan-operations-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=operation_id='{operation-id}'&path={_scan-operations-table-prefix}
{
"source_records": "[{\"operation_id\":22871,\"datastore_id\":850,\"container_id\":7239,\"container_scan_id\":43837,\"partition_name\":\"ACTION_TEST_CLIENT_V3\",\"incremental\":true,\"records_processed\":0,\"enrichment_source_record_limit\":10,\"max_records_analyzed\":-1,\"anomaly_count\":0,\"start_time\":\"2023-12-04T20:35:54.194Z\",\"end_time\":\"2023-12-04T20:35:54.692Z\",\"result\":\"success\",\"message\":null}]"
}
Retrieving Enrichment Datastore Exported Metadata
Endpoint (Get)
/api/datastores/{enrichment-datastore-id}/source-records?path={_export-metadata-table-prefix}
Endpoint With Filters (Get)
/api/datastores/{enrichment-datastore-id}/source-records?filter=container_id='{container-id}'&path={_export-metadata-table-prefix}
{
"source_records": "[{\"container_id\":13511,\"created\":\"2024-06-10T17:07:20.751438Z\",\"datastore_id\":1198,\"generated_at\":\"2024-06-11 18:42:31+0000\",\"global_tags\":\"\",\"id\":224818,\"source_container\":\"PARTSUPP-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"status\":\"Active\",\"type\":\"shape\",\"uuid\":\"f2d4fae3-982b-45a1-b289-5854b7af4b03\"}]"
}
{
"source_records": "[{\"additional_metadata\":null,\"container_id\":13515,\"coverage\":1.0,\"created\":\"2024-06-10T16:27:05.600041Z\",\"datastore_id\":1198,\"deleted_at\":null,\"description\":\"Must have a numeric value above >= 0\",\"fields\":\"L_QUANTITY\",\"filter\":null,\"generated_at\":\"2024-06-11 18:42:38+0000\",\"global_tags\":\"\",\"has_passed\":false,\"id\":196810,\"inferred\":true,\"is_new\":false,\"is_template\":false,\"last_asserted\":\"2024-06-11T18:04:24.480899Z\",\"last_editor\":null,\"last_updated\":\"2024-06-10T17:07:43.248644Z\",\"num_container_scans\":4,\"properties\":null,\"rule_type\":\"notNegative\",\"source_container\":\"LINEITEM-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"template_id\":null,\"weight\":7.0}]"
}
{
"source_records": "[{\"approximate_distinct_values\":106944.0,\"completeness\":0.7493389459,\"container_container_type\":\"file\",\"container_id\":13509,\"created\":\"2024-06-10T16:23:48.457907Z\",\"datastore_id\":1198,\"datastore_type\":\"gcs\",\"entropy\":null,\"field_global_tags\":\"\",\"field_id\":145476,\"field_name\":\"C_ACCTBAL\",\"field_profile_id\":882170,\"field_quality_score\":\"{\\\"total\\\": 81.70052209952111, \\\"completeness\\\": 74.93389459101233, \\\"coverage\\\": 66.66666666666666, \\\"conformity\\\": null, \\\"consistency\\\": 100.0, \\\"precision\\\": 100.0, \\\"timeliness\\\": null, \\\"volumetrics\\\": null, \\\"accuracy\\\": 100.0}\",\"field_type\":\"Fractional\",\"field_weight\":1,\"generated_at\":\"2024-06-11 18:42:32+0000\",\"histogram_buckets\":null,\"is_not_normal\":true,\"kll\":null,\"kurtosis\":-1.204241522,\"max\":9999.99,\"max_length\":null,\"mean\":4488.8079264033,\"median\":4468.34,\"min\":-999.99,\"min_length\":null,\"name\":\"C_ACCTBAL\",\"q1\":1738.87,\"q3\":7241.17,\"skewness\":0.0051837205,\"source_container\":\"CUSTOMER-FORMATTED.csv\",\"source_datastore\":\"TPCH GCS\",\"std_dev\":3177.3005493585,\"sum\":5.0501333575999904E8,\"type_declared\":false,\"unique_distinct_ratio\":null}]"
}
Data Preview
Data Preview in Qualytics makes it simple to explore data tables and fields within a selected enrichment dataset. It supports filtering, field selection, and record downloads for deeper analysis, ensuring streamlined and efficient data management.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Step 2: You will see a list of available enrichment datastores. Click on the specific datastore you want to preview its details and data.
For demonstration purposes, we have selected the Netsuite Financials Enrich enrichment datastore.
Step 3: After clicking on your selected enrichment datastore, you will be able to preview its enrichment, metadata, remediation, all data tables, and unlinked data tables.
Data Preview Tab
Data Preview Tab provides a clear visualization of enrichment dataset tables and fields like _FAILED_CHECKS, _SOURCE_RECORDS, and _SCAN_OPERATIONS. Users can explore remediation data, metadata, and unlinked objects, refine data with filters, select specific fields, and download records for further analysis. This tab ensures efficient data review and management for enhanced insights.
All
By selecting All, users can access a comprehensive list of data tables associated with the selected enrichment datastore. This includes all relevant tables categorized under Enrichment, Remediation, Metadata, and Unlinked sections, enabling users to efficiently explore and manage the data. Click on a specific table or dataset within the All section to access its detailed information.
After clicking on a specific table or dataset, a detailed view opens, displaying fields such as _FAILED_CHECKS, _SOURCE_RECORDS, _SCAN_OPERATIONS, remediation tables (e.g., _ENRICHMENT_CONTAINER_PREFIX_REMEDIATION_CONTAINER_ID), metadata from export containers, and unlinked objects (orphaned data) for review and action.
Enrichment
By selecting Enrichment, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Enrichment section to access its detailed information.
After clicking on a specific table or dataset, a detailed view opens, displaying fields of the selected table or dataset.
Remediation
By selecting Remediation, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Remediation section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Metadata
By selecting Metadata, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Metadata section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Unlinked
By selecting Unlinked, users can access a comprehensive view of the tables or data associated with the selected enrichment datastore. Click on a specific table or dataset within the Unlinked section to access its detailed information.
After clicking on a table or dataset, a detailed view opens, displaying all the fields and data associated with the selected table or dataset.
Filter Clause and Refresh
The Data Preview tab includes filter functionality that enables users to focus on specific fields by applying filter clauses. This refines the displayed rows based on specific criteria, enhancing data analysis and providing more targeted insights. A Refresh button is also available to update the data view with the latest data.
Filter Clause
Use the Filter Clause to narrow down the displayed rows by applying specific filter clauses, allowing for focused and precise data analysis.
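For example, a filter clause is written as a SQL-style predicate against the columns of the previewed table. A hypothetical clause on the failed checks table might look like the following (adjust column names and values to the table you are previewing):
quality_check_rule_type = 'greaterThanField' AND detected_time >= '2024-03-01'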
Refresh
Click Refresh button to update the data view with the latest information, ensuring accuracy and relevance.
Select Specific Fields
Select specific fields to display, allowing you to focus on the most relevant data for analysis. Click on the Select Fields to Show dropdown and choose the fields you want to review by checking or unchecking options.
Download Records
Download Records feature in Qualytics allows users to easily export all source records from the selected enrichment dataset. This functionality is essential for performing deeper analysis outside the platform or for sharing data with external tools and teams.
Add Enrichment Datastore
Adding an enrichment datastore in Qualytics lets you connect and configure data sources for enhanced data management. You can create a new connection or reuse an existing one, with options to securely manage credentials.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Enrichment Datastores button on the left side panel of the interface.
Step 2: Click on the Add Enrichment Datastore button located at the top-right corner of the interface.
Step 3: A modal window, Add Enrichment Datastore, will appear, providing you with the options to add an enrichment datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Specify the name of the enrichment datastore |
2. | Toggle Button | Toggle ON to create a new enrichment datastore from scratch, or toggle OFF to reuse credentials from an existing connection |
3. | Connector | Select connector from the dropdown list. |
Option I: Add Enrichment Datastore with a new Connection
If the toggle for Add New connection is turned on, then this will prompt you to add and configure the enrichment datastore from scratch without using existing connection details.
Step 1: Select the connector from the dropdown list and add connection details such as Secrets Management, temp dataset ID, service account key, project ID, and dataset ID.
For demonstration purposes we have selected the Snowflake connector.
Secrets Management: This is an optional connection property that allows you to securely store and manage credentials by integrating with HashiCorp Vault and other secret management systems. Toggle it ON to enable Vault integration for managing secrets.
Note
Once the HashiCorp Vault is set up, use the $ format in the Connection form to reference a Vault secret.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Login URL | Enter the URL used to authenticate with HashiCorp Vault. |
2. | Credentials Payload | Input a valid JSON containing credentials for Vault authentication. |
3. | Token JSONPath | Specify the JSONPath to retrieve the client authentication token from the response (e.g., $.auth.client_token). |
4. | Secret URL | Enter the URL where the secret is stored in Vault. |
5. | Token Header Name | Set the header name used for the authentication token (e.g., X-Vault-Token). |
6. | Data JSONPath | Specify the JSONPath to retrieve the secret data (e.g., $.data). |
Step 2: The configuration form will appear, requesting credential details before adding the enrichment datastore.
Note
Different connectors have different sets of fields and options appearing when selected.
REF | FIELDS | ACTIONS |
---|---|---|
1. | Account (Required) | Define the account identifier to be used for accessing the Snowflake. |
2. | Role (Required) | Specify the user role that grants appropriate access and permissions. |
3. | Warehouse (Required) | Provide the warehouse name that will be used for computing resources. |
4. | Authentication (Required) | You can choose between Basic authentication or Keypair authentication for validating and securing the connection to your Snowflake instance. Basic Authentication uses a username and password combination, where the user's credentials are directly used to access Snowflake. |
5. | Database | Specify the database name to be accessed. |
6. | Schema | Define the schema within the database that should be used. |
7. | Teams | Select one or more teams from the dropdown to associate with this source datastore. |
Step 3: After adding the details, click on the Save button.
A modal window appears, displaying a success message indicating that your enrichment datastore has been successfully updated.
Step 4: Close the success dialog. Here, you can view a list of all the enrichment datastores you have added to the system. For demonstration purposes, we have created an enrichment datastore named Snowflake_demo, which is visible in the list.
Option II: Use an Existing Enrichment Datastore
If the toggle for Add New connection is turned off, then this will prompt you to add and configure the enrichment datastore using existing connection details.
Step 1: Select a connection to reuse existing credentials.
Note
If you are using existing credentials, you can only edit the details such as Database, Schema, and Teams.
Step 2: Click on the Save button.
A modal window appears, displaying a success message indicating that your enrichment datastore has been successfully updated.
Step 3: Close the success dialog. Here, you can view a list of all the enrichment datastores you have added to the system. For demonstration purposes, we have created an enrichment datastore named Snowflake_demo, which is visible in the list.
Ended: Enrichment Datastores
Containers ↵
Containers Overview
Containers are fundamental entities representing structured data sets. These containers could manifest as tables in JDBC datastores or as files within DFS datastores. They play a pivotal role in data organization, profiling, and quality checks within the Qualytics application.
Let’s get started 🚀
Container Types
There are two main types of containers in Qualytics:
JDBC Container
JDBC containers are virtual representations of database objects, making it easier to work with data stored in relational databases. These containers include tables, which organize data into rows and columns like a spreadsheet, views that provide customized displays of data from one or more tables, and other database objects such as indexes or stored procedures. Acting as a bridge between applications and databases, JDBC enables seamless interaction with these containers, allowing efficient data management and retrieval.
DFS Container
DFS containers are used to represent files stored in distributed file systems, such as Hadoop or cloud storage. These files can include formats like CSV, JSON, or Parquet, which are commonly used for storing and organizing data. DFS containers make it easier for applications to work with these files by providing a structured way to access and process data in large-scale storage systems.
Containers Attributes
Totals
-
Quality Score: This represents the overall health of the data based on various checks. A higher score indicates better data quality and fewer issues detected.
-
Sampling: Displays the percentage of data sampled during profiling. A 100% sampling rate means the entire dataset was analyzed for the quality report.
-
Completeness: Indicates the percentage of records that are fully populated without missing or incomplete data. Lower percentages may suggest that some fields have missing values.
-
Records Profiled: Shows how many of the dataset's records were analyzed during profiling, indicating the coverage of the quality report.
-
Fields Profiled: This shows the number of fields or attributes within the dataset that have undergone data profiling, which helps identify potential data issues in specific columns.
-
Active Checks: Represents the number of ongoing checks applied to the dataset. These checks monitor data quality, consistency, and correctness.
-
Active Anomalies: Displays the total number of anomalies found during the data profiling process. Anomalies can indicate inconsistencies, outliers, or potential data quality issues that need resolution.
Observability
1. Volumetric Measurement
Volumetric measurement allows users to track the size of data stored within the table over time. This helps in monitoring how the data grows or changes, making it easier to detect sudden spikes that may impact system performance. Users can visualize data volume trends and manage the table's efficiency. This helps in optimizing storage, adjusting resource allocation, and improving query performance based on the size and growth of the computed table.
2. Anomalies Measurement
The Anomalies section helps users track any unusual data patterns or issues within the computed tables. It shows a visual representation of when anomalies occurred over a specific time period, making it easy to spot unusual activity. This allows users to quickly identify when something might have gone wrong and take action to fix it, ensuring the data stays accurate and reliable.
Actions on Container
Users can perform various operations on containers to manage datasets effectively. The actions are divided into three main sections: Settings, Add, and Run. Each section contains specific options to perform different tasks.
Settings
Settings button allows users to configure the container. By clicking on the Settings button, users can access the following options:
No. | Options | Description |
---|---|---|
1. | Settings | Configure incremental strategy, partitioning fields, and exclude specific fields from analysis. |
2. | Edit | Edit allows you to modify a computed file or table, including its name, select expressions, and filter clauses. |
3. | Score | Score allows you to adjust the decay period and factor weights for metrics like completeness, accuracy, and consistency. |
4. | Observability | Enables or disables volumetrics tracking for daily data volumes. |
5. | Export | Export quality checks, field profiles, and anomalies to an enrichment datastore for further action or analysis. |
6. | Delete | Delete the selected container from the system. |
Add
Add button allows users to add checks or computed fields. By clicking on the Add button, users can access the following options:
No. | Options | Description |
---|---|---|
1. | Checks | Checks allows you to add new checks or validation rules for the container. |
2. | Computed Field | Allows you to add a computed field. |
Run
Run button provides options to execute operations on datasets, such as profiling, scanning, and external scans. By clicking on the Run button, users can access the following options:
No. | Options | Descriptions |
---|---|---|
1. | Profile | Profile allows you to run a profile operation to analyze the data structure, gather metadata, set thresholds, and define record limits for comprehensive dataset profiling. |
2. | Scan | Scan allows you to perform data quality checks, configure scan strategies, and detect anomalies in the dataset. |
3. | External Scan | External Scan allows you to upload a file and validate its data against predefined checks in the selected table. |
Field Profiles
After profiling a container, individual field profiles offer granular insights:
Totals
1. Quality Score: This provides a comprehensive assessment of the overall health of the data, factoring in multiple checks for accuracy, consistency, and completeness. A higher score, closer to 100, indicates optimal data quality with minimal issues or errors detected. A lower score may highlight areas that require attention and improvement.
2. Sampling: This shows the percentage of data that was evaluated during profiling. A sampling rate of 100% indicates that the entire dataset was analyzed, ensuring a complete and accurate representation of the data’s quality across all records, rather than just a partial sample.
3. Completeness: This metric measures how fully the data is populated without missing or null values. A higher completeness percentage means that most fields contain the necessary information, while a lower percentage indicates data gaps that could negatively impact downstream processes or analysis.
4. Active Checks: This refers to the number of ongoing quality checks being applied to the dataset. These checks monitor aspects such as format consistency, uniqueness, and logical correctness. Active checks help maintain data integrity and provide real-time alerts about potential issues that may arise.
5. Active Anomalies: This tracks the number of anomalies or irregularities detected in the data. These could include outliers, duplicates, or inconsistencies that deviate from expected patterns. A count of zero indicates no anomalies, while a higher count suggests that further investigation is needed to resolve potential data quality issues.
Profile
This provides detailed insights into the characteristics of the field, including its type, distinct values, and length. You can use this information to evaluate the data's uniqueness, length consistency, and complexity.
No | Profile | Description |
---|---|---|
1 | Declared Type | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the ‘tailedness’ of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
Data Preview
Data Preview in Qualytics makes it easy for users to view and understand their container data. It provides a clear snapshot of the data's structure and contents, showing up to 100 rows from the source. With options to filter specific data, refresh for the latest updates, and download records, it helps users focus on the most relevant information, troubleshoot issues, and analyze data effectively. The simple grid view ensures a smooth and efficient way to explore and work with your data.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that contains the data you want to preview.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Note
Before accessing the Data Preview tab, ensure the container is profiled. If not, run a profile operation on the container. Profiling collects important information about the table structure, like column types and field names. Without profiling, no data will be shown in the Data Preview section.
Step 3: You will view the full list of tables or files belonging to the selected source datastore. Select the specific table or file whose data you want to preview.
Alternatively, you can access the tables or files by clicking the drop-down arrow on the selected datasource. This will display the full list of tables or files associated with the selected source datastore. From there, select the specific table or file whose data you want to preview.
Step 4: After selecting the specific table or file, click on the Data Preview tab.
You will see a tabular view of the data, displaying the field names (columns) and their corresponding data values, allowing you to review the data's structure, types, and sample records.
UI Caching
Upon initial access to the Data Preview section, the data may not be cached yet, which can cause longer loading times. How long it takes to load depends on the type of datastore being used (like DFS or JDBC) and whether the data warehouse is serverless. The next time you access the same data, it will load faster because it will be cached, meaning the data is stored temporarily for quicker access.
Filter Clause and Refresh
The Data Preview tab includes filter functionality that enables users to focus on specific fields by applying filter clauses. This refines the displayed rows based on specific criteria, enhancing data analysis and providing more targeted insights. A Refresh button is also available to update the data view with the latest data.
Filter Clause
Use the Filter Clause to narrow down the displayed rows by applying specific filter clauses, allowing for focused and precise data analysis.
Refresh
Click Refresh button to update the data view with the latest information, ensuring accuracy and relevance.
Select Specific Fields
Select specific fields to display, allowing you to focus on the most relevant data for analysis. Click on the Select Fields to Show dropdown and choose the fields you want to review by checking or unchecking options.
Download Records
Download Records feature in Qualytics allows users to easily export all source records from the selected table or file. This functionality is essential for performing deeper analysis outside the platform or for sharing data with external tools and teams.
Use Cases
Debugging Checks
One of the primary use cases of the Data Preview tab is for debugging checks. Users can efficiently inspect the first 100 rows of container data to identify any anomalies, inconsistencies, or errors, facilitating the debugging process and improving data quality.
Data Analysis
The Data Preview tab also serves as a valuable tool for data analysis tasks. Users can explore the dataset, apply filters to focus on specific subsets of data, and gain insights into patterns, trends, and correlations within the container data.
Examples
Example 1: Debugging Data Import
Suppose a user encounters issues with importing data into a container. By utilizing the Data Preview tab, the user can quickly examine the first 100 rows of imported data, identify any formatting errors or missing values, and troubleshoot the data import process effectively.
Example 2: Filtering Data by Date Range
In another scenario, a user needs to analyze sales data within a specific date range. The user can leverage the filter support feature of the Data Preview tab to apply date range filters, displaying only the sales records that fall within the specified timeframe. This allows for targeted analysis and informed decision-making.
Computed Tables & Files
Computed Tables and Computed Files are powerful virtual tables within the Qualytics platform, each serving distinct purposes in data manipulation. Computed Tables are created using SQL queries on JDBC source datastores, enabling advanced operations like joins and where clauses. Computed Files, derived from Spark SQL transformations on DFS source datastores, allow for efficient data manipulation and transformation directly within the DFS environment.
This guide explains how to add Computed Tables and Computed Files and discusses the differences between them.
Let's get started 🚀
Computed Tables
Use Computed Tables when you want to perform the following operation on your selected source datastore:
- Data Preparation and Transformation: Clean, shape, and restructure raw data from JDBC source datastores.
- Complex Calculations and Aggregations: Perform calculations not easily supported by standard containers.
- Data Subsetting: Extract specific data subsets based on filters using SQL's WHERE clause.
- Joining Data Across source datastores: Combine data from multiple JDBC source datastores using SQL joins.
Add Computed Tables
Step 1: Sign in to your Qualytics account and select a JDBC-type source datastore from the side menu to which you would like to add a computed table.
Step 2: After selecting your preferred source datastore, you will be redirected to the source datastore's store operation page. From this page, click on the Add button and select the Computed Table option from the dropdown menu.
Step 3: A modal window will appear prompting you to enter the name for your computed table and a valid SQL query that supports your selected source datastore.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Enter a name for your computed table. The name should be descriptive and meaningful to help you easily identify the table later (e.g., Customer_Order_Statistics). |
2. | Query | Write the valid SQL queries that support your selected source datastore. The query helps to perform joins and aggregations on your selected source datastore. |
Step 4: Click on the Add button to add the computed table to your selected source datastore.
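As an illustration of the kind of SQL a computed table can hold, the sketch below joins two hypothetical tables (customers and orders) and aggregates order statistics per customer. It is not tied to any particular source schema, so adapt the table and column names to your own datastore.
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id)  AS order_count,
    SUM(o.total_price) AS total_spend
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;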
Computed Files
Use Computed Files when you want to perform the following operation on your selected source datastore:
- Data Preparation and Transformation: Efficiently clean and restructure raw data stored in a DFS.
- Column-Level Transformations: Utilize Spark SQL functions to manipulate and clean individual columns.
- Filtering Data: Extract specific data subsets within a DFS container using Spark SQL's WHERE clause.
Add Computed Files
Step 1: Sign in to your Qualytics account and select a DFS-type source datastore from the side menu to which you would like to add a computed file.
Step 2: After selecting your preferred source datastore, you will be redirected to the source datastore's store operation page. From this page, click on the Add button and select the Computed Files option from the dropdown menu.
Step 3: A modal window will appear prompting you to enter the name for your computed file, select a source file pattern, select the expression, and define the filter clause (optional).
REF. | FIELDS | ACTION |
---|---|---|
1. | Name (Required) | Enter a name for your computed table. The name should be descriptive and meaningful to help you easily identify the table later (e.g., add a meaningful name like Customer_Order_Statistics). |
2. | Source File Pattern (Required) | Select a source file pattern from the dropdown menu to match files that have a similar naming convention. |
3. | Select Expression (Required) | Select the expression to define the data you want to include in the computed file. |
4. | Filter Clause (Optional) | Add a WHERE clause to filter the data that meets certain conditions. |
After configuring the computed file details, it will look like this:
Step 4: Click on the Add button to add the computed file with your selected source datastore.
After clicking on the Add button, a flash message will be displayed indicating that the operation was successful.
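For reference, a computed file's select expression and filter clause are written in Spark SQL. A hypothetical example for a file of orders might cast a string column to a numeric type and keep only recent rows; the column names here are illustrative.
-- Select expression: reshape and cast columns
CAST(TOTALPRICE AS DOUBLE) AS total_price, UPPER(ORDERSTATUS) AS order_status

-- Filter clause (WHERE): keep only recent orders
ORDERDATE >= '2024-01-01'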
Computed Table Vs. Computed Files
Feature | Computed Table (JDBC) | Computed File (DFS) |
---|---|---|
Source Data | JDBC source datastores | DFS source datastores |
Query Language | SQL (database-specific functions) | Spark SQL |
Supported Operations | Joins, where clauses, and database functions | Column transforms, where clauses (no joins), Spark SQL functions |
Note
Computed tables and files function like regular tables. You can profile them, create checks, and detect anomalies.
- Updating a computed table's query will trigger a profiling operation.
- Updating a computed file's select or where clause will trigger a profiling operation.
- When you create a computed table or file, a basic profile of up to 1000 records is automatically generated.
Manage Tables & Files
Managing JDBC “tables” and DFS “files” in a connected source datastore allows you to perform actions such as adding validation checks, running scans, monitoring data changes, exporting, or deleting them. For JDBC tables, you can also handle metadata, configure partitions, and manage incremental data for optimized processing. However, for DFS datastores, the default incremental field is the file’s last modified timestamp, and users cannot configure incremental or partition fields manually.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that you want to manage.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Step 3: You will view the full list of tables or files belonging to the selected source datastore.
Settings
Settings allow you to edit how data is processed and analyzed for a specific table in your connected source datastore. This includes selecting fields for incremental and partitioning strategies, grouping data, excluding certain fields from scans, and adjusting general behaviors.
Step 1: Click on the vertical ellipse next to the table of your choice and select Settings from the dropdown list.
A modal window will appear for “Table Settings”.
Step 2: Modify the table setting based on:
-
Identifiers
-
Group Criteria
-
Excluding
-
General
Identifiers
An Identifier is a field that can be used to help load the desired data from a Table in support of analysis. For more details about identifiers, you can refer to the documentation on Identifiers.
Incremental Strategy
This is crucial for tracking changes at the row level within tables. This approach is essential for efficient data processing, as it is specifically used to track which records have already been scanned. This allows for scan operations to focus exclusively on new records that have not been previously scanned, thereby optimizing the scanning process and ensuring that only the most recent and relevant data is analyzed.
Note
If you have connected a DFS datastore, no manual setup is needed for the incremental strategy; the system automatically tracks and processes the latest data changes.
For information about incremental strategy, you can refer to the Incremental Strategy section in the Identifiers documentation.
Incremental Field
Incremental Field lets you select a field that tracks changes in your data. This ensures only new or updated records are scanned, improving efficiency and reducing unnecessary processing.
Partition Field
Partition Field is used to divide the data in a table into distinct segments, or dataframes. These partitions allow for parallel analysis, improving efficiency and performance. By splitting the data, each partition can be processed independently. This approach helps optimize large-scale data operations.
For information about Partition Field, you can refer to the Partition Field section in the Identifiers documentation.
Group Criteria
Group Criteria allow you to organize data into specific groups for more precise analysis. By grouping fields, you can gain better insights and enhance the accuracy of your profiling.
For information about Group Criteria, you can refer to the documentation on Grouping.
Excluding
Excluding allows you to choose specific fields from a table that you want to exclude from data checks. This helps focus on the fields that matter most for validation while ignoring others that are not relevant to the current analysis.
For information about Excluding, you can refer to the documentation on Excluding Settings.
General
You can control the default behavior of the specific table by checking or unchecking the option to infer the data type for each field. When checked, the system will automatically determine and cast the data types as needed for accurate data processing.
For information about General, you can refer to the documentation on General Settings.
Step 3: Once you have configured the table settings, click on the Save button.
After clicking on the Save button, your table is successfully updated and a success flash message will appear stating "Table has been successfully updated".
Add Checks
Add Check allows you to create rules to validate the data within a particular table. You can choose the type of rule, link it directly to the selected table, and add descriptions or tags. This ensures that the table's data remains accurate and compliant with the required standards.
Step 1: Click on the vertical ellipsis next to the table name and select Add Checks.
A modal window will appear to add checks against the selected table.
To understand how to add checks, you can follow the remaining steps from the documentation Checks Template.
Run
Execute various operations like profiling or scanning your table or file. It helps validate data quality and ensures that the table meets the defined checks and rules, providing insights into any anomalies or data issues that need attention.
Step 1: Click on the vertical ellipsis next to the table name and select Run.
Under Run, choose the type of operation you want to perform:
-
Profile: To collect metadata and profile the table's contents.
-
Scan: To validate the data against defined rules and checks.
To understand how a profile operation is performed, you can follow the remaining steps from the documentation Profile Operation.
To understand how a scan operation is performed, you can follow the remaining steps from the documentation Scan Operation.
Observability Settings
Observability helps you track and monitor data performance in your connected source datastore’s tables and files. It provides insights into data volume, detects anomalies, and ensures smooth data processing by identifying potential issues early. This makes it easier to manage and maintain data quality over time.
Step 1: Select the table in your JDBC datastore that you would like to monitor, then click on Observability.
A modal window “Observability Settings” will appear. Here you can view the details of the table and datastore where actions have been applied.
Step 2: Check the "Allow Tracking" option to enable daily volumetric measurement tracking for the data asset.
You can enable or disable volumetric tracking by checking or unchecking the "Allow Tracking" option. This feature monitors and records the daily volume of data in the selected table. When enabled, it helps track how much data is being processed and stored over time, providing valuable insights into the data asset's growth and usage.
Step 3: Click on the Save button.
After clicking on the Save button, a success flash message will appear stating "Profile has been successfully updated".
Export
Export feature lets you capture changes in your tables. You can export metadata for Quality Checks, Field Profiles, and Anomalies from selected tables to an enrichment datastore. This helps you analyze data trends, find issues, and make better decisions based on the table data.
Step 1: Select the tables in your JDBC datastore that you would like to export, then click on Export.
A modal window will appear with the Export Metadata setting.
For the next steps, detailed information on the export metadata is available in the Export Metadata section of the documentation.
Delete
Delete allows you to remove a table from the connected source datastore. While the table and its associated data will be deleted, it is not permanent, as the table can be recreated if you run a catalog with the "recreate" option.
Note
Deleting a table is a reversible action if a catalog with the "recreate" option is run later.
Step 1: Select the tables in your connected source datastore that you would like to delete, then click on Delete.
Step 2: A confirmation modal window will appear, click on the Delete button to remove the table from the system.
Step 3: After clicking on the delete button, your table is successfully deleted and a success flash message will appear saying "Profile has been successfully deleted"
Mark Tables & Files as Favorite
Marking tables and files as favorites allows you to quickly access important items. This feature helps you prioritize and manage the tables and files you use frequently, making data management more efficient.
Step 1: Locate the table or file you want to mark as a favorite and click on the bookmark icon next to it.
After clicking on the bookmark icon, the table or file is successfully marked as a favorite and a success flash message will appear stating "The Table has been favorited".
Step 2: To unmark a table or file, simply click on the bookmark icon of the marked item. This will remove it from your favorites.
Computed Fields
Computed Fields allow you to enhance data analysis by applying dynamic transformations directly to your data. These fields let you create new data points, perform calculations, and customize data views based on your specific needs, ensuring your data is both accurate and actionable.
Let's get started 🚀
Add Computed Fields
Step 1: Log in to Your Qualytics Account, navigate to the side menu, and select the source datastore where you want to create a computed field.
Step 2: Select the Container within the chosen datastore where you want to create the computed field. This container holds the data to which the new computed field will be applied, enabling you to enhance your data analysis within that specific datastore.
For demonstration purposes, we have selected the `Bank Dataset-Staging` source datastore and the `bank_transactions_.csv` container within it to create a computed field.
Step 3: After selecting the container, click on the Add button and select Computed Field from the dropdown menu to create a new computed field.
A modal window will appear, allowing you to enter the details for your computed field.
Step 4: Enter the Name for the computed field and select Transformation Type from the dropdown menu.
REF. | FIELDS | ACTION |
---|---|---|
1. | Field Name (Required) | Add a unique name for your computed field. |
2. | Transformation Type (Required) | Select the type of transformation you want to apply from the available options. |
Info
Transformations are changes made to data, like converting formats, doing calculations, or cleaning up fields. In Qualytics, you can use transformations to meet specific needs, such as cleaning entity names, converting formatted numbers, or applying custom expressions. With various transformation types available, Qualytics enables you to customize your data directly within the platform, ensuring it’s accurate and ready for analysis.
Transformation Types | Purpose | Reference |
---|---|---|
Cleaned Entity Name | Removes business signifiers (such as 'Inc.' or 'Corp') from an entity name. | See here |
Convert Formatted Numeric | Removes formatting (such as parentheses for denoting negatives or commas as delimiters) from values that represent numeric data, converting them into a numerically typed field. | See here |
Custom Expression | Allows you to create a new field by applying any valid Spark SQL expression to one or more existing fields. | See here |
Step 5: After selecting the appropriate Transformation Type, click the Save button.
Step 6: After clicking on the Save button, your computed field is created and a success flash message will display saying "A computed field has been created successfully."
You can find your computed field by clicking on the dropdown arrow next to the container you selected when creating the computed field.
Computed Fields Details
Totals
1. Quality Score: This provides a comprehensive assessment of the overall health of the data, factoring in multiple checks for accuracy, consistency, and completeness. A higher score, closer to 100, indicates optimal data quality with minimal issues or errors detected. A lower score may highlight areas that require attention and improvement.
2. Sampling: This shows the percentage of data that was evaluated during profiling. A sampling rate of 100% indicates that the entire dataset was analyzed, ensuring a complete and accurate representation of the data’s quality across all records, rather than just a partial sample.
3. Completeness: This metric measures how fully the data is populated without missing or null values. A higher completeness percentage means that most fields contain the necessary information, while a lower percentage indicates data gaps that could negatively impact downstream processes or analysis.
4. Active Checks: This refers to the number of ongoing quality checks being applied to the dataset. These checks monitor aspects such as format consistency, uniqueness, and logical correctness. Active checks help maintain data integrity and provide real-time alerts about potential issues that may arise.
5. Active Anomalies: This tracks the number of anomalies or irregularities detected in the data. These could include outliers, duplicates, or inconsistencies that deviate from expected patterns. A count of zero indicates no anomalies, while a higher count suggests that further investigation is needed to resolve potential data quality issues.
Profile
This provides detailed insights into the characteristics of the field, including its type, distinct values, and length. You can use this information to evaluate the data's uniqueness, length consistency, and complexity.
No | Profile | Description |
---|---|---|
1 | Declared Type | Indicates whether the type is declared by the source or inferred. |
2 | Distinct Values | Count of distinct values observed in the dataset. |
3 | Min Length | Shortest length of the observed string values or lowest value for numerics. |
4 | Max Length | Greatest length of the observed string values or highest value for numerics. |
5 | Mean | Mathematical average of the observed numeric values. |
6 | Median | The median of the observed numeric values. |
7 | Standard Deviation | Measure of the amount of variation in observed numeric values. |
8 | Kurtosis | Measure of the ‘tailedness’ of the distribution of observed numeric values. |
9 | Skewness | Measure of the asymmetry of the distribution of observed numeric values. |
10 | Q1 | The first quartile; the central point between the minimum and the median. |
11 | Q3 | The third quartile; the central point between the median and the maximum. |
12 | Sum | Total sum of all observed numeric values. |
You can hover over the (i) button to view the native field properties, which provide detailed information such as the field's type (numeric), size, decimal digits, and whether it allows null values.
Manage Tags in field details
Tags can now be directly managed in the field profile within the Explore section. Simply access the Field Details panel to create, add, or remove tags, enabling more efficient and organized data management.
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the Profiles tab and select fields.
Step 3: Click on the specific field for which you want to manage tags.
A Field Details modal window will appear. Click on the + button to assign tags to the selected field.
Step 4: You can also create the new tag by clicking on the ➕ button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Filter and Sort Fields
Filter and Sort options allow you to organize your fields by various criteria, such as Name, Checks, Completeness, Created Date, and Tags. You can also apply filters to refine your list of fields based on Type and Tags.
Sort
You can sort your fields by Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1 | Anomalies | Sorts fields based on the number of detected anomalies. |
2 | Checks | Sorts fields by the number of active validation checks applied. |
3 | Completeness | Sorts fields based on their data completeness percentage. |
4 | Created Date | Sorts fields by the date they were created, showing the newest or oldest fields first. |
5 | Name | Sorts fields alphabetically by their names. |
6 | Quality Score | Sorts fields based on their quality score, indicating the reliability of the data in the field. |
7 | Type | Sorts fields based on their data type (e.g., string, boolean, etc.). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your fields based on values like Type and Tag to easily organize and prioritize them according to your needs.
No | Filter | Description |
---|---|---|
1 | Type | Filters fields based on the data type (e.g., string, boolean, date, etc.). |
2 | Tag | Select this to filter the fields based on specific tags, such as Healthcare, Compliance, or Sensitive. |
Types of Transformations
Cleaned Entity Name
This transformation removes common business signifiers from entity names, making your data cleaner and more uniform.
Options for Cleaned Entity Name
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Drop from Suffix | Removes specified terms from the end of the entity name. |
2. | Drop from Prefix | Removes specified terms from the beginning of the entity name. |
3. | Drop from Interior | Removes specified terms from the interior of the entity name. |
4. | Additional Terms to Drop (Custom) | Allows you to specify additional terms that should be dropped from the entity name. |
5. | Terms to Ignore (Custom) | Designate terms that should be ignored during the cleaning process. |
Example for Cleaned Entity Name
Example | Input | Transformation | Output |
---|---|---|---|
1 | "TechCorp, Inc." | Drop from Suffix: "Inc." | "TechCorp" |
2 | "Global Services Ltd." | Drop from Prefix: "Global" | "Services Ltd." |
3 | "Central LTD & Finance Co." | Drop from Interior: "LTD" | "Central & Finance Co." |
4 | "Eat & Drink LLC" | Additional Terms to Drop: "LLC", "&" | "Eat Drink" |
5 | "ProNet Solutions Ltd." | Terms to Ignore: "Ltd." | "ProNet Solutions" |
Convert Formatted Numeric
This transformation converts formatted numeric values into a plain numeric format, stripping out any characters like commas or parentheses that are not numerically significant.
Example for Convert Formatted Numeric
Example | Input | Transformation | Output |
---|---|---|---|
1 | "$1,234.56" | Remove non-numeric characters: ",", "$" | "1234.56" |
2 | "(2020)" | Remove non-numeric characters: "(", ")" | "-2020" |
3 | "100%" | Remove non-numeric characters: "%" | "100" |
Custom Expression
Enables the creation of a field based on a custom computation using Spark SQL. This is useful for applying complex logic or transformations that are not covered by other types.
Using Custom Expression:
You can combine multiple fields, apply conditional logic, or use any valid Spark SQL functions to derive your new computed field.
Example: To create a field that sums two existing fields, you could use the expression `field1 + field2`.
Example for Custom Expression
Example | Input Fields | Custom Expression | Output |
---|---|---|---|
1 | `field1 = 10`, `field2 = 20` | `field1 + field2` | 30 |
2 | `salary = 50000`, `bonus = 5000` | `salary + bonus` | 55000 |
3 | `hours = 8`, `rate = 15.50` | `hours * rate` | 124 |
4 | `status = 'active'`, `score = 85` | `CASE WHEN status = 'active' THEN score ELSE 0 END` | 85 |
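As a slightly richer illustration, a single custom expression can combine arithmetic with conditional logic. The field names below (`status`, `salary`) are hypothetical:

```sql
-- Computed field: pays a 10% bonus for active employees; others keep base salary.
CASE
  WHEN status = 'active' THEN round(salary * 1.10, 2)
  ELSE salary
END
```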
Export Metadata
Qualytics’ metadata export feature lets you capture the changing states of your data. You can export metadata for Quality Checks, Field Profiles, and Anomalies from selected profiles into an enrichment datastore so that you can perform deeper analysis, identify trends, detect issues, and make informed decisions based on your data.
To keep things organized, the exported files use specific naming patterns:
- Anomalies: Saved as `_<enrichment_prefix>_anomalies_export`.
- Quality Checks: Saved as `_<enrichment_prefix>_checks_export`.
- Field Profiles: Saved as `_<enrichment_prefix>_field_profiles_export`.
Note
Ensure that an enrichment datastore is already set up and properly configured to accommodate the exported data. This setup is essential for exporting anomalies, quality checks, and field profiles successfully.
Let’s get started 🚀
Step 1: Select a source datastore from the side menu from which you would like to export the metadata.
For demonstration purposes, we have selected the “COVID-19 Data” Snowflake source datastore.
Step 2: After selecting your preferred datastore, a bottom-up menu will appear on the right side of the interface. Click on “Export Metadata” alongside the Enrichment Datastore.
Step 3: After clicking “Export Metadata,” a modal window will appear providing you the options to select the assets you want to export to your Enrichment Datastore—whether it's anomalies, quality checks, or field profiles.
For demonstration purposes, we have opted to export all three assets: Anomalies, Quality Checks, and Field Profiles.
Step 4: Once you have selected the assets, click the “Next” button to continue.
Step 5: Select the profiles you wish to export. You can choose specific profiles or export them all at once. After selection, click on the Export button.
For demonstration purposes, we have checked the “All” option.
Step 6: After clicking “Export,” the process starts, and a message will confirm that the metadata will be available in your Enrichment Datastore shortly.
Review Exported Metadata
Step 1: Once the metadata has been exported, navigate to the “Enrichment Datastores” located on the left menu.
Step 2: In the “Enrichment Datastores” section, select the datastore where you exported the metadata. The exported metadata will now be visible in the selected datastore.
Step 3: Click on the exported files to view the metadata. For demonstration purposes, we have selected the “export_field_profiles” file to review the metadata.
The exported metadata is displayed in a table format, showing key details about the field profiles from the datastore. It typically includes columns that indicate the uniqueness of data, the completeness of the fields, and the data structure. You can use this metadata to check data quality, prepare for analysis, ensure compliance, and manage your data.
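If your enrichment datastore is SQL-accessible, you could also inspect the exported metadata with an ad-hoc query. The sketch below assumes an enrichment prefix of `export` and illustrative column names (`field_name`, `completeness`, `distinct_values`); check the actual exported schema before relying on it:

```sql
-- Illustrative only: the table name follows the _<enrichment_prefix>_field_profiles_export
-- pattern described earlier, and the column names are assumptions.
SELECT field_name,
       completeness,
       distinct_values
FROM _export_field_profiles_export
ORDER BY completeness ASC   -- surface the least complete fields first
LIMIT 20;
```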
Settings ↵
Identifiers
An Identifier is a field that can be used to help load the desired data from a Table in support of analysis. There are two types of identifiers that can be declared for a Table:
-
Incremental Field: Track records in the table that have already been scanned in order to support Scan operations that only analyze new (not previously scanned) data.
-
Partition Field: Divide the data in the table into distinct dataframes that can be analyzed in parallel.
Managing an Identifier
Step 1: Log in to your Qualytics account and select the source datastore (JDBC or DFS) from the left menu that you want to manage.
Step 2: Select Tables (if JDBC datastore is connected) or File Patterns (if DFS datastore is connected) from the Navigation tab on the top.
Step 3: You will view the full list of tables or files belonging to the selected source datastore.
Step 4: Click on the vertical ellipsis next to the table of your choice and select Settings from the dropdown list.
A modal window will appear for “Table Settings”, where you can manage identifiers for the selected table.
Incremental Strategy
The Incremental Strategy configuration in Qualytics is crucial for tracking changes at the row level within tables.
This approach is essential for efficient data processing, as it is specifically used to track which records have already been scanned.
This allows for scan operations to focus exclusively on new records that have not been previously scanned, thereby optimizing the scanning process and ensuring that only the most recent and relevant data is analyzed.
No | Strategy Option | Description |
---|---|---|
1 | None | No incremental strategy; every scan processes the full data set. |
2 | Last Modified | - Available types are Date or Timestamp. - Uses a "last modified column" to track changes in the data set. - This column typically contains a timestamp or date value indicating when a record was last modified. - The system compares the "last modified column" to a previous timestamp or date, updating only the records modified since that time. |
3 | Batch Value | - Available types are Integral or Fractional. - Uses a "batch value column" to track changes in the data set. - This column typically contains an incremental value that increases as new data is added. - The system compares the current "batch value" with the previous one, updating only records with a higher "batch value". - Useful when data comes from a system without a modification timestamp. |
4 | Postgres Commit Timestamp Tracking | - Utilizes commit timestamps for change tracking. |
Availability based on technologies:
Option | Availability |
---|---|
Last Modified | All |
Batch Value | All |
Postgres Commit Timestamp Tracking | PostgreSQL |
Info
- All options are useful for an incremental strategy; the best choice depends on the availability of the data and how it is modeled.
- All three options allow you to track and process only the data that has changed since the last time the system was run, reducing the amount of data that needs to be read and processed and increasing the efficiency of your system.
Incremental Strategy with DFS (Distributed File System)
For DFS in Qualytics, the incremental strategy leverages the last modified timestamps from the file metadata.
This automated process means that DFS users do not need to manually configure their incremental strategy, as the system efficiently identifies and processes the most recent changes in the data.
Example
Objective: Identify and process new or modified records in the ORDERS table since the last scan using an Incremental Strategy.
Sample Data
O_ORDERKEY | O_PAYMENT_DETAILS | LAST_MODIFIED |
---|---|---|
1 | {"date": "2023-09-25", "amount": 250.50, "credit_card": "5105105105105100"} | 2023-09-25 10:00:00 |
2 | {"date": "2023-09-25", "amount": 150.75, "credit_card": "4111-1111-1111-1111"} | 2023-09-25 10:30:00 |
3 | {"date": "2023-09-25", "amount": 200.00, "credit_card": "1234-5678-9012-3456"} | 2023-09-25 11:00:00 |
4 | {"date": "2023-09-25", "amount": 175.00, "credit_card": "5555-5555-5555-4444"} | 2023-09-26 09:00:00 |
5 | {"date": "2023-09-25", "amount": 300.00, "credit_card": "2222-2222-2222-2222"} | 2023-09-26 09:30:00 |
Incremental Strategy Explanation
In this example, an Incremental Strategy would focus on processing records that have a LAST_MODIFIED timestamp after a certain cutoff point. For instance, if the last scan was performed on 2023-09-25 at 11:00:00, then only records with O_ORDERKEY 4 and 5 would be considered for the current scan, as they have been modified after the last scan time.
graph TD
A[Start] --> B[Retrieve Orders Since Last Scan]
B --> C{Record Modified After Last Scan?}
C -->|Yes| D[Process Record]
C -->|No| E[Skip Record]
D --> F[Move to Next Record/End]
E --> F
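Conceptually, the next scan reduces to a simple filter on the incremental field. A minimal Spark SQL sketch using the sample data above (the cutoff is the example's last scan time):

```sql
-- Only rows modified after the last recorded scan time are read.
SELECT O_ORDERKEY, O_PAYMENT_DETAILS, LAST_MODIFIED
FROM ORDERS
WHERE LAST_MODIFIED > TIMESTAMP '2023-09-25 11:00:00';  -- returns order keys 4 and 5
```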
Partition Field
The Partition Field is a fundamental feature for organizing and managing large datasets. It is specifically designed to divide the data within a table into separate, distinct dataframes.
This segmentation is a key strategy for handling and analyzing data more effectively. By creating these individual dataframes, Qualytics allows for parallel processing, which significantly accelerates the analysis.
Each partition can be analyzed independently, enabling simultaneous examination of different segments of the dataset.
This not only increases the efficiency of data processing but also ensures a more streamlined and scalable approach to handling large volumes of data, making it an indispensable tool in data analysis and management.
The ideal Partition Identifier is an Incremental Identifier of type datetime, such as a last-modified field; however, alternatives are automatically identified and set during a Catalog operation.
Info
- Partition Field Selection: When selecting a partition field for a table during catalog operation, we will attempt to select a field with no nulls where possible.
- User-Specified Partition Fields: Users are permitted to specify partition fields manually. While we ensure that the user selects a field of a supported data type, we do not currently enforce non-nullability or completeness. Care should be given to select partition fields with no or a low percentage of nulls in order to avoid unbalanced partitioning.
Warning
If no appropriate partition identifier can be selected, then repeatable ordering candidates (order by fields) are used for less efficient processing of containers with a very large number of rows.
Example
Objective: Efficiently process and analyze the ORDERS table by partitioning the data based on the O_ORDERDATE field, allowing parallel processing of different date segments.
Sample Data
O_ORDERKEY | O_CUSTKEY | O_ORDERSTATUS | O_TOTALPRICE | O_ORDERDATE |
---|---|---|---|---|
1 | 123 | 'O' | 173665.47 | 2023-09-01 |
2 | 456 | 'O' | 46929.18 | 2023-09-01 |
3 | 789 | 'F' | 193846.25 | 2023-09-02 |
4 | 101 | 'O' | 32151.78 | 2023-09-02 |
5 | 202 | 'F' | 144659.20 | 2023-09-03 |
Partition Field Explanation
In this example, the O_ORDERDATE field is used to partition the ORDERS table. Each partition represents a distinct date, allowing for the parallel processing of orders based on their order date. This strategy enhances the efficiency of data analysis by distributing the workload across different partitions.
graph TD
A[Start] --> B[Retrieve Orders Data]
B --> C{Partition by O_ORDERDATE}
C --> D[Distribute Partitions for Parallel Processing]
C --> E[Identify Date Segments]
D --> F[Analyze Each Partition Independently]
E --> F
F --> G[Combine Results/End]
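As a minimal sketch of what analyzing a single partition looks like with the sample data above (one worker would handle each O_ORDERDATE segment in parallel):

```sql
-- One partition: all orders placed on 2023-09-01 (order keys 1 and 2).
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE
FROM ORDERS
WHERE O_ORDERDATE = DATE '2023-09-01';
```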
Grouping Overview
Grouping is a fundamental aspect of data analysis, allowing users to organize data into meaningful categories for in-depth examination. With the ability to set grouping on Containers, users can define how data within a container should be grouped, facilitating more focused and efficient analysis.
Usage
The `grouping` parameter accepts a list of lists of field names. Each inner list holds the field names in the order that they will be applied as grouping criteria. This flexibility allows users to customize the grouping behavior based on their specific analytical requirements.
Example
Consider the following examples of `grouping` configurations:
- `["store_id"]`: Groups data within the container by the `store_id` field.
- `["store_id", "month"]`: Groups data first by `store_id`, then by `month`.
- `["store_id", "state"]`: Groups data first by `store_id`, then by `state`.
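To apply all three of the groupings above in a single configuration, the parameter would be supplied as a list of lists, for example `[["store_id"], ["store_id", "month"], ["store_id", "state"]]` (an illustrative combination of the examples shown).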
By specifying different combinations of fields in the `grouping` parameter, users can tailor the grouping behavior to suit their analytical needs.
Impact on Data Profiles
Grouping has implications for various aspects of data profiling and analysis within Qualytics.
Field Profiles
Field Profiles are now produced with filters determined by the `grouping` specified on the Profile Operation. This means that the profiles generated will reflect the characteristics of data within each group defined by the grouping criteria.
Inferred Quality Checks
The inferred checks produced by the analytics engine will also hold the filter defined by the `grouping`. This ensures that data access controls and constraints are applied consistently across different groupings of data within the container.
Inferred Quality Check Filters
Quality Check filters, represented as Spark SQL where clauses, are set based on the `grouping` specified on the Profile Operation. This ensures that quality checks are applied appropriately to the data within each group, allowing for comprehensive data validation and quality assurance.
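For example, with a grouping of `["store_id", "month"]`, the filter attached to a check scoped to one group might be a where clause along these lines (values are illustrative):

```sql
-- Spark SQL where clause scoping an inferred check to a single group
store_id = 'S001' AND month = '2023-09'
```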
Conclusion
The introduction of Grouping for Containers in Qualytics enhances data organization and analysis capabilities, allowing users to define custom grouping criteria and analyze data at a granular level. By leveraging `grouping`, users can gain deeper insights into their data and streamline the analytical process, ultimately driving more informed decision-making and improving overall data quality and reliability.
General Configurations Overview
Excluded Fields
This configuration allows you to selectively exclude specific fields from containers. These excluded fields will be omitted from check creation during profiling operations while also being hidden in data previews, without requiring a profile run.
This can be helpful when dealing with sensitive data, irrelevant information, or large datasets where you want to focus on specific fields.
Benefits of Excluding Fields
Targeted Analysis
Focus your analysis on the fields that matter most by removing distractions from excluded fields.
Data Privacy
Protect sensitive information by excluding fields that contain personal data or confidential information.
Important Considerations
Excluding fields removes them from profile creation and data previews until you re-include them and re-profile the container.
Infer data type
The "infer data type" option in containers allows the system to automatically determine the appropriate data types (e.g., fractional, integer, date) for columns within your data containers. This setting is configurable for both JDBC and DFS containers.
Behavior in JDBC Datastores
- Default: Disabled
- Reason: JDBC datastores provide inherent schema information from the database tables. Qualytics leverages this existing schema for accurate data typing.
- Override: You can optionally enable this setting if encountering issues with automatic type detection from the source database.
Behavior in DFS Datastores
- Default:
- Enabled for CSV files
- Disabled for other file formats (Parquet, Delta, Avro, ORC, etc.)
- Reason:
- CSV files lack a defined schema. Data type inference helps ensure correct data interpretation.
- File formats like Parquet, Delta, Avro, and ORC have embedded schemas, making inference unnecessary.
- Override: You can adjust the default behavior based on your specific data sources and requirements.
Rule for the "Infer Data Type"
Schema-Based Data Sources
If the data source has a defined schema (JDBC, Delta, Parquet, Avro, ORC), the flag is set to "False".
Schema-less Data Sources
If the data source lacks a defined schema (CSV), the flag is set to "True".
Override file pattern for DFS datastores
Override the file pattern to include files that share the same schema but don't match the automatically generated pattern from the initial cataloging.
In some cases, you may have multiple files that share the same schema but don't match the automatically generated file pattern during the initial cataloging process. To address this, Qualytics has the ability to override file patterns in the UI. This allows you to specify a custom pattern that encompasses all files with the shared schema, ensuring they are properly included in profiling and analysis.
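For example, if the initial cataloging produced a pattern matching only files from a single year, you might override it with a broader glob such as `bank_transactions_*.csv` (a hypothetical pattern) so that every file sharing that schema is included.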
Explore Deeper Knowledge
If you want to go deeper into the knowledge or if you are curious and want to learn more about DFS filename globbing, you can explore our comprehensive guide here: How DFS Filename Globbing Works.
Important Considerations
Subsequent catalog operations without pruning (`Disabled`) will use the new pattern.
Ended: Settings
Ended: Containers
Weight ↵
Weight Mechanism
The Weight Mechanism for checks is designed to evaluate and prioritize checks based on three key factors: Rule Type Weighting, Anomaly Weighting, and Tag Weighting.
Let’s get started 🚀
1. Rule Type Weighting
Each quality check rule type has a specific weight based on its importance. The rule types are divided into three categories:
High Importance (Weight: 3)
These rules are assigned the highest weight of 3 to reflect their crucial role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | Entity Resolution | 3 |
2 | Expected Schema | 3 |
3 | Matches Pattern | 3 |
4 | Predicted By | 3 |
5 | Satisfies Expression | 3 |
6 | Contains Social Security Number | 3 |
7 | Time Distribution Size | 3 |
8 | User Defined Function | 3 |
9 | Is Replica Of | 3 |
10 | Metric | 3 |
11 | Aggregation Comparison | 3 |
12 | Is Address | 3 |
Medium Importance (Weight: 2)
These rules are assigned a medium weight of 2 to reflect their role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | Any Not Null | 2 |
2 | Between | 2 |
3 | Between Times | 2 |
4 | Contains Credit Card | 2 |
5 | Contains Email | 2 |
6 | Equal To | 2 |
7 | Equal To Field | 2 |
8 | Exists In | 2 |
9 | Not Exists In | 2 |
10 | Expected Values | 2 |
11 | Greater Than Field | 2 |
12 | Less Than Field | 2 |
13 | Not Future | 2 |
14 | Required Values | 2 |
15 | Unique | 2 |
16 | Contains URL | 2 |
17 | Min Partition Size | 2 |
18 | Is Credit Card | 2 |
19 | Volumetric | 2 |
Low Importance (Weight: 1)
These rules are assigned the lowest weight of 1 to reflect their role in maintaining data quality.
No. | Rule Type | Weight |
---|---|---|
1 | After Date Time | 1 |
2 | Before DateTime | 1 |
3 | Distinct Count | 1 |
4 | Field Count | 1 |
5 | Is Type | 1 |
6 | Max Length | 1 |
7 | Max Value | 1 |
8 | Min Length | 1 |
9 | Min Value | 1 |
10 | Not Exists In | 1 |
11 | Not Negative | 1 |
12 | Not Null | 1 |
13 | Positive | 1 |
14 | Sum | 1 |
2. Anomaly Weighting
Anomalies can impact the importance of a check by adjusting its weight. The adjustment is based on whether the check has anomalies and whether it is authored or inferred:
-
Authored Check with Anomalies: - The check's weight increases by 12 points.
-
Authored Check without Anomalies: - The check's weight increases by 9 points.
-
Inferred Check with Anomalies: - The check's weight increases by 6 points.
-
Inferred Check without Anomalies: - The check's weight remains 0 points.
3. Tag Weighting
Tags can further modify the weight of a check. When tags with weight modifiers are applied, their weights are added to the check’s total weight.
- Tag with Weight Modifier: Each tag that has a specific weight modifier will contribute to the overall weight of the check. For example, if Tag B has a weight of 2, it will add 2 points to the total weight of the check.
Example of Weight Calculation
Let's break down an example calculation for a check of type Authored, using the Is Credit Card rule (Medium Importance), with no anomalies, and Tag B applied:
Step-by-Step Calculation
- Step 1: Rule Type Weight – The Is Credit Card rule has a weight of 2 (Medium Importance).
- Step 2: Anomaly Weight – An Authored Check without anomalies adds 9 points.
- Step 3: Tag Weight – Tag B adds 2 points.
Total Weight = 2 (rule type) + 9 (no anomalies) + 2 (Tag B) = 13 points
Additional Notes
If the table itself has a Tag A with a weight of 10, the check will inherit that tag. In this case, the total weight will include both tag weights.
Total Weight = 2 (rule type) + 9 (no anomalies) + 2 (Tag B) + 10 (Tag A) = 23 points
Quick Calculation Formula
To make the calculation easier, here are the quick formulas for different types of checks:
-
For Authored Checks with Anomalies:
[Rule Type Weight] + 12 (Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Authored Checks without Anomalies:
[Rule Type Weight] + 9 (No Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Inferred Checks with Anomalies:
[Rule Type Weight] + 6 (Anomaly) + Check’s Tag Weight + Table’s Tag Weight
-
For Inferred Checks without Anomalies:
[Rule Type Weight] + 0 (No Anomaly) + Check’s Tag Weight + Table’s Tag Weight
Example Calculation (Extended)
Let's extend the example with the inclusion of both Tag A and Tag B:
-
For Authored Checks with Anomalies:
[Rule Type Weight] + 12 + 10 (Tag A) + 2 (Tag B)
-
For Authored Checks without Anomalies:
[Rule Type Weight] + 9 + 10 (Tag A) + 2 (Tag B)
-
For Inferred Checks with Anomalies:
[Rule Type Weight] + 6 + 10 (Tag A) + 2 (Tag B)
-
For Inferred Checks without Anomalies:
[Rule Type Weight] + 0 + 10 (Tag A) + 2 (Tag B)
Ended: Weight
Data Quality Checks ↵
Checks Overview
Checks in Qualytics are rules applied to data that ensure quality by validating accuracy, consistency, and integrity. Each check includes a data quality rule, along with filters, tags, tolerances, and notifications, allowing efficient management of data across tables and fields.
Let’s get started 🚀
Checks Types
In Qualytics, you will come across two types of checks:
Inferred Checks
Qualytics automatically generates inferred checks during a Profile operation. These checks typically cover 80-90% of the rules needed by users. They are created and maintained through profiling, which involves statistical analysis and machine learning methods.
For more details on Inferred Checks, please refer to the Inferred Checks documentation.
Authored Checks
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
For more details on Authored Checks, please refer to the Authored Checks documentation.
View & Manage Checks
The Checks tab in Qualytics provides users with an interface to view and manage various checks associated with their data. These checks are accessible through two different methods, as discussed below.
Method 1: Datastore-Specific Checks
Step 1: Log in to your Qualytics account and select the Datastore from the left menu.
Step 2: Click on the "Checks" from the Navigation Tab.
You will see a list of all the checks that have been applied to the selected datastore.
You can switch between different types of checks to view them categorically (such as All, Active, Draft, and Archived).
Method 2: Explore Section
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Checks" from the Navigation Tab.
You'll see a list of all the checks that have been applied to various tables and fields across different source datastores.
Check Templates
Check Templates empower users to efficiently create, manage, and apply standardized checks across various datastores, acting as blueprints that ensure consistency and data integrity across different datasets and processes.
Check templates streamline the validation process by enabling check management independently of specific data assets such as datastores, containers, or fields. These templates reduce manual intervention, minimize errors, and provide a reusable framework that can be applied across multiple datasets, ensuring all relevant data adheres to defined criteria. This not only saves time but also enhances the reliability of data quality checks within an organization.
For more details about check templates, please refer to the Check Templates documentation.
Apply Check Template for Quality Checks
You can apply check templates to make quality checks easier and more consistent. Using a set template lets you quickly verify that your data meets specific standards, reducing mistakes and improving data quality. Applying these templates simplifies the process, making it more efficient to find and fix errors, and ensuring your quality checks are applied across different projects or systems without starting from scratch.
For more details how to apply checks template for quality check, please refer to the Apply Checks Template for Quality Check documentation.
Export Check Templates
You can export check templates to easily share or reuse your quality check settings across different systems or projects. This saves time by eliminating the need to recreate the same checks repeatedly and ensures that your quality standards are consistently applied. Exporting templates helps maintain accuracy and efficiency in managing data quality across various environments.
For more details about export checks template, please refer to the Export Check Templates documentation.
Manage Checks in Datastore
Managing your checks within a datastore is important to maintain data integrity and ensure quality. You can categorize, create, update, archive, restore, delete, and clone checks, making it easier to apply validation rules across the datastores. The system allows for checks to be set as active, draft, or archived based on their current state of use. You can also define reusable templates for quality checks to streamline the creation of multiple checks with similar criteria. With options for important and favorite, users have full flexibility to manage data quality efficiently.
For more details how to manage checks in datastore, please refer to the Manage Checks in Datastore documentation.
Check Rule Types
In Qualytics, a variety of check rule types are provided to maintain data quality and integrity. These rules define specific criteria that data must meet, and checks apply these rules during the validation process.
For more details about check rule types, please refer to the Rule Types Overview documentation.
Rule Type | Description |
---|---|
After Date Time | Asserts that the field is a timestamp later than a specific date and time. |
Any Not Null | Asserts that one of the fields must not be null. |
Before DateTime | Asserts that the field is a timestamp earlier than a specific date and time. |
Between | Asserts that values are equal to or between two numbers. |
Between Times | Asserts that values are equal to or between two dates or times. |
Contains Credit Card | Asserts that the values contain a credit card number. |
Contains Email | Asserts that the values contain email addresses. |
Contains Social Security Number | Asserts that the values contain social security numbers. |
Contains Url | Asserts that the values contain valid URLs. |
Distinct Count | Asserts on the approximate count distinct of the given column. |
Entity Resolution | Asserts that every distinct entity is appropriately represented once and only once |
Equal To Field | Asserts that this field is equal to another field. |
Exists in | Asserts that the rows of a compared table/field of a specific Datastore exist in the selected table/field. |
Expected Schema | Asserts that all selected fields are present and that all declared data types match expectations. |
Expected Values | Asserts that values are contained within a list of expected values. |
Field Count | Asserts that there must be exactly a specified number of fields. |
Greater Than | Asserts that the field is a number greater than (or equal to) a value. |
Greater Than Field | Asserts that this field is greater than another field. |
Is Address | Asserts that the values contain the specified required elements of an address. |
Is Credit Card | Asserts that the values are credit card numbers. |
Is Replica Of | Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s). |
Is Type | Asserts that the data is of a specific type. |
Less Than | Asserts that the field is a number less than (or equal to) a value. |
Less Than Field | Asserts that this field is less than another field. |
Matches Pattern | Asserts that a field must match a pattern. |
Max Length | Asserts that a string has a maximum length. |
Max Value | Asserts that a field has a maximum value. |
Metric | Records the value of the selected field during each scan operation and asserts that the value is within a specified range (inclusive). |
Min Length | Asserts that a string has a minimum length. |
Min Partition Size | Asserts the minimum number of records that should be loaded from each file or table partition. |
Min Value | Asserts that a field has a minimum value. |
Not Exists In | Asserts that values assigned to this field do not exist as values in another field. |
Not Future | Asserts that the field's value is not in the future. |
Not Negative | Asserts that this is a non-negative number. |
Not Null | Asserts that the field's value is not explicitly set to nothing. |
Positive | Asserts that this is a positive number. |
Predicted By | Asserts that the actual value of a field falls within an expected predicted range. |
Required Values | Asserts that all of the defined values must be present at least once within a field. |
Satisfies Expression | Evaluates the given expression (any valid Spark SQL) for each record. |
Sum | Asserts that the sum of a field is a specific amount. |
Time Distribution Size | Asserts that the count of records for each interval of a timestamp is between two numbers. |
Unique | Asserts that the field's value is unique. |
User Defined Function | Asserts that the given user-defined function (as Scala script) evaluates to true over the field's value. |
Volumetrics | Asserts that the data volume (rows or bytes) remains within dynamically inferred thresholds based on historical trends (daily, weekly, monthly). |
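As an illustration of the expression-based rule types above, a Satisfies Expression check evaluates a Spark SQL predicate for each record. A minimal sketch with hypothetical field names:

```sql
-- Assert that a discount never exceeds half of the list price
discount_amount <= list_price * 0.5
```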
Manage Checks in Datastore
Managing your checks within a datastore is important to maintain data integrity and ensure quality. You can categorize, create, update, archive, restore, delete, and clone checks, making it easier to apply validation rules across the datastores. The system allows for checks to be set as active, draft, or archived based on their current state of use. You can also define reusable templates for quality checks to streamline the creation of multiple checks with similar criteria. With options for important and favorite, users have full flexibility to manage data quality efficiently.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to manage your checks.
Step 2: Click on the "Checks" from the Navigation Tab.
Categorize Checks
You can categorize your checks based on their status, such as Active, Draft, Archived (Invalid and Discarded), or All, according to your preference. This categorization offers a clear view of the data quality validation process, helping you manage checks efficiently and maintain data integrity.
All
By selecting All Checks, you can view a comprehensive list of all the checks in the datastore, including both active and draft checks, allowing you to focus on the checks that are currently being managed or are in progress. However, archived checks are not displayed in this view.
Active
By selecting Active, you can view checks that are currently applied and being enforced on the data. These operational checks are used to validate data quality in real time, allowing you to monitor all active checks and their performance.
You can also categorize the active checks based on their importance and favorites to streamline your data quality monitoring.
1. Important: Shows only checks that are marked as important. These checks are prioritized based on their significance, typically assigned a weight of 7 or higher.
Note
Important checks are prioritized based on a weight of 7 or higher.
2. Favorite: Displays checks that have been marked as favorites. This allows you to quickly access checks that you use or monitor frequently.
3. All: Displays a comprehensive view of all active checks, including important, favorite and any checks that do not fall under these specific categories.
Draft Checks
By selecting Draft, you can view checks that have been created but have not yet been applied to the data. These checks are in the drafting stage, allowing for adjustments and reviews before activation. Draft checks provide flexibility to experiment with different validation rules without affecting the actual data.
You can also categorize the draft checks based on their importance and favorites to prioritize and organize them effectively during the review and adjustment process.
1. Important: Shows only checks that are marked as important. These checks are prioritized based on their significance, typically assigned a weight of 7 or higher.
2. Favorite: Displays checks that have been marked as favorites. This allows you to quickly access checks that you use or monitor frequently.
3. All: Displays a comprehensive view of all draft checks, including important, favorite and any checks that do not fall under these specific categories.
Archived Checks
By selecting Archived, you can view checks that have been marked as discarded or invalid from use but are still stored for future reference or restoration. Although these checks are no longer active, they can be restored if needed.
You can also categorize the archived checks based on their status as Discarded, Invalid, or view All archived checks to manage and review them effectively.
1. Discarded: Shows checks that have been marked as no longer useful or relevant and have been discarded from use.
2. Invalid: Displays checks that are deemed invalid due to errors or misconfigurations, requiring review or deletion.
3. All: Provides a view of all archived checks within this category, including discarded and invalid checks.
Check Details
Check Details provides important information about each check in the system. It shows when a check was last run, how often it has been used, when it was last updated, who made changes to it, and when it was created. This section helps users understand the status and history of the check, making it easier to manage and track its use over time.
Step 1: Locate the check you want to review, then hover over the info icon to view the Check Details.
A popup will appear with additional details about the check.
Last Asserted
Last Asserted At shows the most recent time the check was run, indicating when the last validation occurred. For example, the check was last asserted on Oct 17, 2023, at 2:37 AM (GMT+5:30).
Scans
Scans show how many times the check has been used in different operations. It helps you track how often the check has been applied. For example, the check was used in 30 operations.
Updated At
Updated At shows the most recent time the check was modified or updated. It helps you see when any changes were made to the check’s configuration or settings. For example, the check was last updated on Sep 9, 2024, at 3:18 PM (GMT+5:30).
Last Editor
Last Editor indicates who most recently made changes to the check. It helps track who is responsible for the latest updates or modifications. This is useful for accountability and collaboration within teams.
Created At
Created At shows when the check was first made. It helps you know how long the check has been in use. This is useful for tracking its history. For example, the check was created on Oct 17, 2023, at 2:19 PM (GMT+5:30).
Status Management of Checks
Set Check as Draft
You can move an active check into a draft state, allowing you to work on the check, make adjustments, and refine the validation rules without affecting live data. This is useful when you need to temporarily deactivate a check for review and updates. There are two ways to move an active check to draft: you can draft a specific check or draft multiple checks in bulk.
Method I: Draft Specific Check
Step 1: Click on the active check that you want to move to the draft state.
For demonstration purposes, we have selected the "Between" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Draft" from the drop-down menu.
Step 3: After clicking on "Draft", the check will be successfully moved to the draft state, and a success flash message will appear stating, "Selected checks have been successfully updated."
Method II. Draft Checks in Bulk
You can move multiple checks into the draft state in one action, allowing you to pause or make adjustments to several checks without affecting your active validation process.
Step 1: Hover over the active checks and click on the checkbox to select multiple checks.
Step 2: Click on the vertical ellipses (⋮) and select "Draft" from the dropdown menu to move active checks to the draft state.
A confirmation modal window titled Bulk Update Checks to Draft will appear, indicating the number of checks being moved to draft.
Step 3: Click the "Update" button to move the selected active checks to draft.
After clicking the "Update" button, your selected checks will be moved to draft, and a success message will appear stating, "Selected checks have been successfully updated."
Activate Draft Check
You can activate a draft check once you have worked on it, made adjustments, and refined the validation rules. Activating the draft check makes it live and ensures that the defined criteria are enforced on the data. There are two ways to activate draft checks: you can activate a specific check or activate multiple checks in bulk.
Method I. Activate Specific Check
Step 1: Navigate to the Draft check section, and click on the drafted check that you want to activate, whether you have made changes or wish to activate it as is.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details. If you want to make any changes to the check details, you can edit them.
Step 2: Click on the down arrow icon with the Update button. A dropdown menu will appear, click on the Activate button.
Step 3: After clicking on the activate button, your check is successfully moved to the active checks, and a success flash message will appear stating "Check successfully updated".
Method II. Activate Draft Checks in Bulk
Step 1. Hover over the draft checks and click on the checkbox to select multiple checks in bulk.
When multiple checks are selected, an action toolbar appears, displaying the total number of checks chosen along with a vertical ellipsis for additional bulk action options.
Step 2. Click on the vertical ellipsis (⋮) and choose "Activate" from the dropdown menu to activate the selected checks.
Step 3. A confirmation modal window “Bulk Activate Check” will appear, click on the “Activate” button to activate the draft checks.
After clicking on the activate button, your draft checks will be activated and a success flash message will appear stating “Selected checks have been successfully updated”.
Set Check as Archived
You can move an active or draft check into the archive when it is no longer relevant but may still be needed for historical purposes or future use. Archiving helps keep your checks organized without permanently deleting them. There are two ways to archive checks: you can archive individual checks or archive multiple checks in bulk.
Method I: Archive Specific Check
You can archive a specific check using two ways: either by directly clicking the archive button on the check or by opening the check and selecting the archive option from the action menu.
1. Archive Directly
Step 1: Locate the check (whether Active or Draft) which you want to archive and click on the Archive icon (represented by a box with a downward arrow) located on the right side of the check.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window titled "Archive Check" will appear, providing you with the following archive options:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 3: Once you've made your selection, click the Archive button to proceed.
Step 4: After clicking on the Archive button, your check is moved to the archive and a flash message will appear saying "Check has been successfully archived".
2. Archive from Action Menu
Step 1: Click on the check from the list of available (whether Active or Draft) checks that you want to archive.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on the "Archive" from the drop-down menu.
Step 3: A modal window titled “Archive Check” will appear, providing you with the following archive options:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button, your check is moved to the archive and a flash message will appear saying "Check has been successfully archived".
Method II: Archive Checks in Bulk
You can archive multiple checks in a single step, deactivating and storing them for future reference or restoration while keeping your active checks uncluttered.
Step 1: Hover over the checks (whether Active or Draft) and click on the checkbox to select multiple checks.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Archive" from the dropdown menu to archive the selected checks.
A modal window will appear, providing you with the following archive options:
1. Delete all anomalies associated with the checks: Toggle this option "On" if you want to delete any anomalies related to the selected checks when archiving them.
2. Archive Options: You are presented with two options to categorize why the checks are being archived:
- Discarded: Select this option if the check is no longer relevant or suitable for the current business rules or data requirements. This helps in archiving checks that are obsolete but still exist for historical reference.
- Invalid: Choose this option if the check is not valid and should be retired from future inference. This helps the system learn from invalid checks and improves its ability to infer valid checks in the future.
Step 3: Once you've made your selections, click the "Archive" button to confirm and archive the checks.
Step 4: After clicking the "Archive" button, your selected checks (whether Active or Draft) will be archived successfully and a success flash message will appear stating, "Checks have been successfully archived."
Restore Archived Checks
If a check has been archived, you can restore it to an active or draft state. This allows you to reuse checks that were previously archived without having to recreate them from scratch.
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
Step 2: Click on the archived check which you want to restore as an active or draft check.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details.
Step 3: If you want to make any changes to the check, you can edit it. Otherwise, click on the Restore button to restore it as an active check.
To restore the check as a draft, click on the arrow icon next to the Restore button. A dropdown menu will appear—select Restore as Draft from the options.
After clicking the Restore button, the check will be successfully restored as either an active or draft check, depending on your selection. A success message will appear confirming, "Check successfully updated."
Edit Check
You can edit an existing check to modify its properties, such as the rule type, coverage, filter clause, or description. Updating a check ensures that it stays aligned with evolving data requirements and maintains data quality as conditions change. There are two methods for editing checks: you can either edit specific checks or edit multiple checks in bulk.
Note
When editing multiple checks in bulk, only the filter clause and tags can be modified.
Method I. Edit Specific Check
Step 1: Click on the check you want to edit, whether it is an active or draft check.
For demonstration purposes, we have selected the "Metric" check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences.
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the fields, filter clause, coverage, description, tags, or metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Method II. Edit Checks in Bulk
You can easily apply changes to multiple checks at once, saving time by editing several checks simultaneously without having to modify each one individually.
Step 1: Hover over the checks (whether Active or Draft) and click on the checkbox to select multiple checks.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and select "Edit" from the dropdown menu to make changes to the selected checks.
Step 3: A modal window titled "Bulk Edit Checks" will appear. Here you can only modify the "filter clause" and "tags" of the selected checks.
Step 4: Toggle on the options (Filter Clause or Tags) that you want to modify for the selected checks, and make the necessary changes.
Note
This action will overwrite the existing data for the selected checks.
Step 5: Once you have made the changes, click on the "Save" button.
After clicking the "Save" button, your selected checks will be updated with the new changes. A success message will appear stating, "Selected checks have been successfully updated."
Delete Checks
You can delete a check permanently, removing it from the system, and this is an irreversible action. Once you delete it, the check cannot be restored. By deleting the check, you ensure it will no longer appear in active or archived lists, making the system more streamlined and organized. There are two methods for deleting checks: you can either delete individual checks or delete multiple checks in bulk.
Note
You can only delete archived checks. If you want to delete an active or draft check, you must first move it to the archive, and then you can delete it.
Warning
Deleting a check is a one-time action. It cannot be restored after deletion.
Method I. Delete Specific Check
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
Step 2: Locate the check that you want to delete and click on the Delete icon located on the right side of the check.
For demonstration purposes, we have selected the "Time Distribution Size" check.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the check from the system.
Step 4: After clicking on the delete button, your check is successfully deleted and a success flash message will appear saying "Check has been successfully deleted"
Method II. Delete Check in Bulk
You can permanently delete multiple checks from the system in one action. This process is irreversible, so it should be used when you are certain that the checks are no longer needed.
Note
For archived checks, the only available bulk action is Bulk Delete. There is no option to bulk restore archived checks to a draft or active state.
Step 1: Hover over the archived checks and click on the checkbox to select checks in bulk.
When multiple checks are selected, an action toolbar appears, displaying the total number of selected checks along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Delete" from the dropdown menu to delete the selected checks.
Step 3: A confirmation modal window will appear, click on the "Delete" button to permanently delete the selected checks.
After clicking on the "Delete" button, your selected checks will be permanently deleted, and a success flash message will appear stating, "Checks have been successfully deleted."
Mark Check as Favorite
Marking a check as a favorite allows you to quickly access and prioritize the checks that are most important to your data validation process. This helps streamline workflows by keeping frequently used or critical checks easily accessible, ensuring you can monitor and manage them efficiently. By marking a check as a favorite, it will appear in the "Favorite" category for faster retrieval and management.
Step 1: Locate the check which you want to mark as a favorite and click on the bookmark icon located on the right side of the check.
After clicking on the bookmark icon, your check is successfully marked as a favorite and a success flash message will appear stating "Check has been favorited".
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites.
Clone Check
You can clone both active and draft checks to create a duplicate copy of an existing check. This is useful when you want to create a new check based on the structure of an existing one, allowing you to make adjustments without affecting the original check.
Step 1: Click on the check (whether Active or Draft) that you want to clone.
For demonstration purposes, we have selected the "Metric" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Clone" from the drop-down menu.
Step 3: After clicking the Clone button, a modal window will appear. This window allows you to adjust the cloned check's details.
1. If you toggle on the "Associate with a Check Template" option, the cloned check will be linked to a specific template.
Choose a Template from the dropdown menu that you want to associate with the cloned check. The check will inherit properties from the selected template.
- Locked: The check will automatically sync with any future updates made to the template, but you won't be able to modify the check's properties directly.
- Unlocked: You can modify the check, but future updates to the template will no longer affect this check.
2. If you toggle off the "Associate with a Check Template" option, the cloned check will not be linked to any template, which allows you full control to modify the properties independently.
Select the appropriate Rule type for the check from the dropdown menu.
Step 4: Once you have selected the template or rule type, fill in the remaining check details as required.
Step 5: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 6: Once you have a successful validation, click the "Save" button. The system will save any modifications you've made and create a clone of the check based on your changes.
After clicking on the "Save" button, your check is successfully created and a success flash message will appear stating "Check successfully created".
Create a Quality Check template
You can add checks as a Template, which allows you to create a reusable framework for quality checks. By using templates, you standardize the validation process, enabling the creation of multiple checks with similar rules and criteria across different datastores. This ensures consistency and efficiency in managing data quality checks.
Step 1: Locate the check (whether Active or Draft) that you want to save as a template and click on it.
For demonstration purposes, we have selected the "Not Exists In" check.
Step 2: A modal window will appear displaying the check details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and select "Template" from the drop-down menu.
After clicking the "Template" button, the check will be saved and created as a template in the library, and a success flash message will appear stating, "The quality check template has been created successfully." This allows you to reuse the template for future checks, streamlining the validation process.
Authored Check
Authored checks are manually created by users within the Qualytics platform or API. You can author many types of checks, ranging from simple templates for common checks to complex rules using Spark SQL and User-Defined Functions (UDF) in Scala.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu.
Step 2: Click on the "Checks" from the Navigation Tab.
Step 3: In the top-right corner, click on the "Add" button, then select "Check" from the dropdown menu.
A modal window titled “Authored Check Details” will appear, providing you the options to add the authored check details.
Step 4: Enter the following details to add the authored check:
1. Associate with a Check Template:
- If you toggle ON the "Associate with a Check Template" option, the check will be linked to a specific template.
- If you toggle OFF the "Associate with a Check Template" option, the check will not be linked to any template, which allows you full control to modify the properties independently.
2. Rule Type (Required): Select a Rule from the dropdown menu, such as checking for non-null values, matching patterns, comparing numerical values, or verifying date-time constraints. Each rule type defines the specific validation logic to be applied.
For demonstration purposes, we have selected the After Date Time rule type.
For more details about the available rule types, refer to the "Rule Types Overview" documentation.
Note
Different rule types have different sets of fields and options appearing when selected.
3. File (Required): Select a file from the dropdown menu on which the check will be performed.
4. Field (Required): Select a field from the dropdown menu on which the check will be performed.
5. Filter Clause: Specify a valid Spark SQL WHERE expression to filter the data on which the check will be applied.
The filter clause defines the conditions under which the check will be applied. It is the condition you would normally write after a WHERE keyword, specifying which rows or data points should be included in the check (see the sketch after this list).
6. Date (Required): Enter the reference date for the rule. For the After Date Time rule, records in the selected field must have a timestamp later than this specified date.
7. Coverage: Adjust the Coverage setting to specify the percentage of records that must comply with the check.
Note
The Coverage setting applies to most rule types and allows you to specify the percentage of records that must meet the validation criteria.
8. Description (Required): Enter a detailed description of the check template, including its purpose, applicable data, and relevant information to ensure clarity for users. If you're unsure of what to include, click on the "💡" lightbulb icon to apply a suggested description based on the rule type.
Example: The Date of Birth must be a timestamp later than the specified date.
This description specifies that the Date of Birth field must have a timestamp later than the reference date entered for the rule.
9. Tag: Assign relevant tags to your check to facilitate easier searching and filtering based on categories like "data quality," "financial reports," or "critical checks."
10. Additional Metadata: Add key-value pairs as additional metadata to enrich your check. Click the plus icon (+) next to this section to open the metadata input form, where you can add key-value pairs.
Enter the desired key-value pairs (e.g., DataSourceType: SQL Database and PriorityLevel: High). After entering the necessary metadata, click "Confirm" to save the custom metadata.
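As a minimal sketch of the Filter Clause from item 5 above (the field names O_ORDERSTATUS and O_ORDERDATE are hypothetical and only for illustration), only the bare condition is entered, with no WHERE keyword and no trailing semicolon:

```sql
-- Hypothetical filter clause for an authored check:
-- restricts the check to open orders placed on or after 1995-01-01
O_ORDERSTATUS = 'O' AND O_ORDERDATE >= '1995-01-01'
```

Only records matching this condition are evaluated by the check; all other rows are ignored during the scan.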
Step 5: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message will appear saying "Validation Successful".
If the validation fails, a red message will appear saying "Failed Validation". This typically occurs when the check logic or parameters do not match the data properly.
Step 6: Once you have a successful validation, click the "Save" button.
After clicking on the "Save" button, your check is successfully created and a success flash message will appear saying "Check successfully created".
Author a Check via API
Users are able to author and interact with checks through the API by passing JSON payloads. Please refer to the API documentation for details: acme.qualytics.io/api/docs
Inferred Check
Qualytics automatically generates and maintains inferred checks by Profiling the Datastore, performing statistical analysis followed by various machine learning methods.
Info
Inferred checks will be automatically updated with the next Profiling run. Manually updating an inferred check will take it out of the automatic update workflow.
Inference Engine
- After metadata is generated by a Profile Operation, the Inference Engine is initiated to kick off Inductive and Unsupervised learning methods.
- Available data is partitioned into a training set and a testing set.
- The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints.
- Those inferred constraints are then filtered by testing them against the held-out testing set, and only those that assert true above a certain threshold are converted and exposed to users as Inferred Checks.
Rule Types
Rule Types Overview
In Qualytics, a variety of rule types are provided to maintain data quality and integrity. These rules define specific criteria that data must meet, and checks apply these rules during the validation process.
Here’s an overview of the rule types and their purposes:
Check Rule Types
Rule Type | Description |
---|---|
After Date Time | Asserts that the field is a timestamp later than a specific date and time. |
Any Not Null | Asserts that one of the fields must not be null. |
Before DateTime | Asserts that the field is a timestamp earlier than a specific date and time. |
Between | Asserts that values are equal to or between two numbers. |
Between Times | Asserts that values are equal to or between two dates or times. |
Contains Credit Card | Asserts that the values contain a credit card number. |
Contains Email | Asserts that the values contain email addresses. |
Contains Social Security Number | Asserts that the values contain social security numbers. |
Contains Url | Asserts that the values contain valid URLs. |
Distinct Count | Asserts on the approximate count distinct of the given column. |
Entity Resolution | Asserts that every distinct entity is appropriately represented once and only once. |
Equal To Field | Asserts that this field is equal to another field. |
Exists In | Asserts that the rows of a compared table/field of a specific Datastore exist in the selected table/field. |
Expected Schema | Asserts that all selected fields are present and that all declared data types match expectations. |
Expected Values | Asserts that values are contained within a list of expected values. |
Field Count | Asserts that there must be exactly a specified number of fields. |
Greater Than | Asserts that the field is a number greater than (or equal to) a value. |
Greater Than Field | Asserts that this field is greater than another field. |
Is Address | Asserts that the values contain the specified required elements of an address. |
Is Credit Card | Asserts that the values are credit card numbers. |
Is Replica Of | Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s). |
Is Type | Asserts that the data is of a specific type. |
Less Than | Asserts that the field is a number less than (or equal to) a value. |
Less Than Field | Asserts that this field is less than another field. |
Matches Pattern | Asserts that a field must match a pattern. |
Max Length | Asserts that a string has a maximum length. |
Max Value | Asserts that a field has a maximum value. |
Metric | Records the value of the selected field during each scan operation and asserts that the value is within a specified range (inclusive). |
Min Length | Asserts that a string has a minimum length. |
Min Partition Size | Asserts the minimum number of records that should be loaded from each file or table partition. |
Min Value | Asserts that a field has a minimum value. |
Not Exists In | Asserts that values assigned to this field do not exist as values in another field. |
Not Future | Asserts that the field's value is not in the future. |
Not Negative | Asserts that this is a non-negative number. |
Not Null | Asserts that the field's value is not explicitly set to nothing. |
Positive | Asserts that this is a positive number. |
Predicted By | Asserts that the actual value of a field falls within an expected predicted range. |
Required Values | Asserts that all of the defined values must be present at least once within a field. |
Satisfies Expression | Evaluates the given expression (any valid Spark SQL ) for each record. |
Sum | Asserts that the sum of a field is a specific amount. |
Time Distribution Size | Asserts that the count of records for each interval of a timestamp is between two numbers. |
Unique | Asserts that the field's value is unique. |
User Defined Function | Asserts that the given user-defined function (as Scala script) evaluates to true over the field's value. |
Volumetric Check | Asserts that the volume of the data asset has not changed by more than an inclusive percentage amount for the prescribed moving daily average. |
After Date Time
Definition
Asserts that the field is a timestamp later than a specific date and time.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
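To make the usage patterns above concrete, here is a minimal sketch of filter expressions, assuming an ORDERS-style table with O_ORDERSTATUS, O_TOTALPRICE, and O_ORDERDATE fields. Conditions are written as bare Spark SQL expressions; full SELECT ... WHERE statements are not accepted as filters:

```sql
-- Direct condition: a bare boolean expression
O_ORDERSTATUS = 'F'

-- Combining conditions with logical operators
O_ORDERSTATUS = 'F' AND O_TOTALPRICE > 1000

-- Utilizing Spark SQL functions
YEAR(O_ORDERDATE) = 1992

-- Incorrect: a full statement with SELECT/WHERE is not a valid filter
-- SELECT * FROM ORDERS WHERE O_ORDERSTATUS = 'F'
```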
Specific Properties
Specify a particular date and time to act as the threshold for the rule.
Name | Description |
---|---|
Date | The timestamp used as the lower boundary. Values in the selected field should be after this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERDATE entries in the ORDERS table are later than 10:30 AM on December 31st, 1991.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 1991-12-31 10:30:00 |
2 | 1992-01-02 09:15:00 |
3 | 1991-12-14 10:25:00 |
{
"description": "Ensure that all O_ORDERDATE entries in the ORDERS table are later than 10:30 AM on December 31st, 1991.",
"coverage": 1,
"properties": {
"datetime": "1991-12-31 10:30:00"
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "afterDateTime",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_ORDERDATE
values are not after 1991-12-31 10:30:00.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Is O_ORDERDATE > '1991-12-31 10:30:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
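As an illustration only (in the TPC-H style used elsewhere in this guide, not the platform's internal implementation), a query that surfaces the records this rule would flag might look like:

```sql
-- Rows violating "O_ORDERDATE must be later than 1991-12-31 10:30:00"
select o_orderkey, o_orderdate
from orders
where o_orderdate <= timestamp '1991-12-31 10:30:00';
```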
Potential Violation Messages
Record Anomaly
The O_ORDERDATE
value of 1991-12-14 10:30:00
is not later than 1991-12-31 10:30:00
Shape Anomaly
In O_ORDERDATE
, 66.667% of 3 filtered records (2) are not later than 1991-12-31 10:30:00
Aggregation Comparison
Definition
Verifies that the specified comparison operator evaluates true when applied to two aggregation expressions.
In-Depth Overview
The Aggregation Comparison
is a rule that allows for the dynamic analysis of aggregations across different datasets. It empowers users to establish data integrity by ensuring that aggregate values meet expected comparisons, whether they are totals, averages, counts, or any other aggregated metric.
By setting a comparison between aggregates from potentially different tables or even source datastores, this rule confirms that relationships between data points adhere to business logic or historical data patterns. This is particularly useful when trying to validate interrelated financial reports, summary metrics, or when monitoring the consistency of data ingestion over time.
Field Scope
Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Facilitates the comparison between a target
aggregate metric and a reference
aggregate metric across different datasets.
Name | Description |
---|---|
Target Aggregation | Specifies the aggregation expression to evaluate |
Comparison | Select the comparison operator (e.g., greater than, less than, etc.) |
Datastore | Identifies the source datastore for the reference aggregation |
Table/File | Specifies the table or file for the reference aggregation |
Reference Aggregation | Defines the reference aggregation expression to compare against |
Reference Filter | Applies a filter to the reference aggregation if necessary |
Details
It's important to understand that each aggregation must result in a single row. Also, similar to Spark expressions, the aggregation expressions must be written in a valid format for DataFrames.
Examples
Simple Aggregations
Combining with SparkSQL Functions
Complex Aggregations
Aggregation Expressions
Here are some common aggregate functions used in SparkSQL:
- `SUM`: Calculates the sum of all values in a column.
- `AVG`: Calculates the average of all values in a column.
- `MAX`: Returns the maximum value in a column.
- `MIN`: Returns the minimum value in a column.
- `COUNT`: Counts the number of rows in a column.
For a detailed list of valid SparkSQL aggregation functions, refer to the Apache Spark SQL documentation.
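A minimal sketch of aggregation expressions at increasing levels of complexity, using the TPC-H column names from the example below; each expression must reduce to a single row:

```sql
-- Simple aggregation
SUM(O_TOTALPRICE)

-- Combining with Spark SQL functions
ROUND(SUM(O_TOTALPRICE), 2)

-- Complex aggregation with arithmetic inside the aggregate
ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
```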
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the aggregated sum of total_price
from the ORDERS
table matches the aggregated and rounded sum of calculated_price
from the LINEITEM
table.
Info
The calculated_price
in this example is represented by the sum of each product's extended price, adjusted for discount and tax.
Sample Data
Aggregated data from ORDERS (Target)
TOTAL_PRICE |
---|
5000000 |
Aggregated data from LINEITEM (Reference)
CALCULATED_PRICE |
---|
4999800 |
Inputs
- Target Aggregation: ROUND(SUM(O_TOTALPRICE))
- Comparison: eq (Equal To), lt (Less Than), lte (Less Than or Equal to), gte (Greater Than or Equal To), gt (Greater Than)
- Reference Aggregation: ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
{
"description": "Ensure that the aggregated sum of total_price from the ORDERS table matches the aggregated and sum of l_totalprice from the LINEITEM table",
"coverage": 1,
"properties": {
"comparison": "eq",
"expression": f"SUM(O_TOTALPRICE)",
"ref_container_id": ref_container_id,
"ref_datastore_id": ref_datastore_id,
"ref_expression": f"SUM(L_TOTALPRICE)",
"ref_filter": "1=1",
},
"tags": [],
"fields": ["O_TOTALPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "aggregationComparison",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the aggregated TOTAL_PRICE
from the ORDERS
table is 5000000, while the aggregated and rounded CALCULATED_PRICE
from the LINEITEM
table is 4999800. The difference between these totals indicates a potential anomaly, suggesting issues in data calculation or recording methods.
graph TD
A[Start] --> B[Retrieve Aggregated Values]
B --> C{Do Aggregated Totals Match?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query related to the rule using TPC-H tables.
with orders_agg as (
select
round(sum(o_totalprice)) as total_order_price
from
orders
),
lineitem_agg as (
select
round(sum(l_extendedprice * (1 - l_discount) * (1 + l_tax))) as calculated_price
from
lineitem
),
comparison as (
select
o.total_order_price,
l.calculated_price
from
orders_agg o
cross join lineitem_agg l
)
select * from comparison
where comparison.total_order_price != comparison.calculated_price;
Potential Violation Messages
Shape Anomaly
ROUND(SUM(O_TOTALPRICE))
is not equal to ROUND(SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)))
.
Any Not Null
Definition
Asserts that at least one of the selected fields must hold a value.
Field Scope
Multiple: The rule evaluates multiple specified fields.
Accepted Types
Type |
---|
Date |
Timestamp |
Integral |
Fractional |
String |
Boolean |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that for every record in the ORDERS table, at least one of the fields (O_COMMENT, O_ORDERSTATUS) isn't null.
Sample Data
O_ORDERKEY | O_COMMENT | O_ORDERSTATUS |
---|---|---|
1 | NULL | NULL |
2 | Good product | NULL |
3 | NULL | Shipped |
{
"description": "Ensure that for every record in the ORDERS table, at least one of the fields (O_COMMENT, O_ORDERSTATUS) isn't null",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["O_ORDERSTATUS","O_COMMENT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "anyNotNull",
"container_id": {container_id},
"template_id": {template_id},
"filter": "_PARITY = 'odd'"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
1 does not satisfy the rule because both O_COMMENT
and O_ORDERSTATUS
do not hold a value.
graph TD
A[Start] --> B[Retrieve O_COMMENT and O_ORDERSTATUS]
B --> C{Is Either Field Not Null?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
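For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows where neither O_COMMENT nor O_ORDERSTATUS holds a value
select o_orderkey, o_comment, o_orderstatus
from orders
where o_comment is null and o_orderstatus is null;
```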
Potential Violation Messages
Record Anomaly
There is no value set for any of O_COMMENT, O_ORDERSTATUS
Shape Anomaly
In O_COMMENT, O_ORDERSTATUS
, 33.333% of 3 filtered records (1) have no value set for any of O_COMMENT, O_ORDERSTATUS
Before Date Time
Definition
Asserts that the field is a timestamp earlier than a specific date and time.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify a particular date and time to act as the threshold for the rule.
Name | Description |
---|---|
Date | The timestamp used as the upper boundary. Values in the selected field should be before this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_SHIPDATE entries in the LINEITEM table are earlier than 3:00 PM on December 1st, 1998.
Sample Data
L_ORDERKEY | L_SHIPDATE |
---|---|
1 | 1998-12-01 15:30:00 |
2 | 1998-11-02 12:45:00 |
3 | 1998-08-01 10:20:00 |
{
"description": "Make sure datetime values are earlier than 3:00 PM, Dec 01, 1998",
"coverage": 1,
"properties": {
"datetime":"1998-12-01T15:00:00Z"
},
"tags": [],
"fields": ["L_SHIPDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "beforeDateTime",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with L_ORDERKEY
1 does not satisfy the rule because its L_SHIPDATE
value is not before 1998-12-01 15:00:00.
graph TD
A[Start] --> B[Retrieve L_SHIPDATE]
B --> C{Is L_SHIPDATE < '1998-12-01 15:00:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
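For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows violating "L_SHIPDATE must be earlier than 1998-12-01 15:00:00"
select l_orderkey, l_shipdate
from lineitem
where l_shipdate >= timestamp '1998-12-01 15:00:00';
```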
Potential Violation Messages
Record Anomaly
The L_SHIPDATE
value of 1998-12-01 15:30:00
is not earlier than 1998-12-01 15:00:00.
Shape Anomaly
In L_SHIPDATE
, 33.33% of 3 filtered records (1) are not earlier than 1998-12-01 15:00:00.
Between
Definition
Asserts that values are equal to or between two numbers.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Integral |
Fractional |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify both minimum and maximum boundaries, and determine if these boundaries should be inclusive.
Name | Explanation |
---|---|
Max | The upper boundary of the range. |
Inclusive (Max) | If true, the upper boundary is considered a valid value within the range. Otherwise, it's exclusive. |
Min | The lower boundary of the range. |
Inclusive (Min) | If true, the lower boundary is considered a valid value within the range. Otherwise, it's exclusive. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are between 5 and 20 (inclusive).
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | 4 |
2 | 15 |
3 | 21 |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are between 5 and 20 (inclusive)",
"coverage": 1,
"properties": {
"min":5
"inclusive_min":true,
"max":20,
"inclusive_max":true,
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "between",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are not between 5 and 20 inclusive.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is 5 <= L_QUANTITY <= 20?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
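For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows violating "L_QUANTITY must be between 5 and 20 (inclusive)"
select l_orderkey, l_quantity
from lineitem
where l_quantity < 5 or l_quantity > 20;
```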
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of 4 is not between 5.000 and 20.000.
Shape Anomaly
In L_QUANTITY
, 66.67% of 3 filtered records (2) are not between 5.000 and 20.000.
Between Times
Definition
Asserts that values are equal to or between two dates or times.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
Date |
Timestamp |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the range of dates or times that values in the selected field should fall between.
Name | Description |
---|---|
Min | The timestamp used as the lower boundary. Values in the selected field should be after this timestamp. |
Max | The timestamp used as the upper boundary. Values in the selected field should be before this timestamp. |
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERDATE entries in the ORDERS table are between 10:30 AM on January 1st, 1991 and 3:00 PM on December 31st, 1991.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 1990-12-31 10:30:00 |
2 | 1991-06-02 09:15:00 |
3 | 1992-01-01 01:25:00 |
{
"description": "Ensure that all O_ORDERDATE entries in the ORDERS table are between 10:30 AM on January 1st, 1991 and 3:00 PM on December 31st, 1991",
"coverage": 1,
"properties": {
"min_time":"1991-01-01T10:30:00Z",
"max_time":"1991-12-31T15:00:00Z"
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "betweenTimes",
"container_id": {container_id},
"template_id": {template_id},
"filter": "_PARITY = 'odd'"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_ORDERDATE
values are not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Is '1991-01-01 10:30:00' <= O_ORDERDATE <= '1991-12-31 15:00:00'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
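For illustration only (not the platform's internal implementation), the anomalous records above could be surfaced with a query such as:

```sql
-- Rows outside the 1991-01-01 10:30:00 to 1991-12-31 15:00:00 window
select o_orderkey, o_orderdate
from orders
where o_orderdate < timestamp '1991-01-01 10:30:00'
   or o_orderdate > timestamp '1991-12-31 15:00:00';
```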
Potential Violation Messages
Record Anomaly
The value for O_ORDERDATE
of 1990-12-31 10:30:00
is not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
Shape Anomaly
In O_ORDERDATE
, 66.667% of 3 filtered records (2) are not between 1991-01-01 10:30:00 and 1991-12-31 15:00:00.
Contains Credit Card
Definition
Asserts that the values contain a credit card number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
String |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that every O_PAYMENT_DETAILS in the ORDERS table contains a credit card number to confirm the payment method used for each order.
Sample Data
O_ORDERKEY | O_PAYMENT_DETAILS |
---|---|
1 | {"date": "2023-09-25", "amount": 250.50, "credit_card": "5105105105105100"} |
2 | {"date": "2023-09-25", "amount": 150.75, "credit_card": "ABC12345XYZ"} |
3 | {"date": "2023-09-25", "amount": 200.00, "credit_card": "4111-1111-1111-1111"} |
{
"description": "Ensure that every O_PAYMENT_DETAILS in the ORDERS table contains a credit card number to confirm the payment method used for each order",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CCN_JSON"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsCreditCard",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 violates the rule as the O_PAYMENT_DETAILS
does not contain a credit card number, indicating an incomplete order record.
graph TD
A[Start] --> B[Retrieve O_PAYMENT_DETAILS]
B --> C{Contains Credit Card Number?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
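For illustration only, a crude Spark SQL heuristic in the spirit of this rule (the platform's actual detection logic is more robust than this simple pattern) might flag payment details lacking a card-like digit sequence:

```sql
-- Flag O_PAYMENT_DETAILS values with no 13-19 digit card-like number
select o_orderkey, o_payment_details
from orders
where not (o_payment_details rlike '[0-9][0-9 -]{11,17}[0-9]');
```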
Potential Violation Messages
Record Anomaly
The O_PAYMENT_DETAILS
value of {"date": "2023-09-25", "amount": 150.75, "credit_card": "ABC12345XYZ"}
does not contain a credit card number.
Shape Anomaly
In O_PAYMENT_DETAILS
, 33.33% of 3 order records (1) do not contain a credit card number.
Contains Email
Definition
Asserts that the values contain email addresses.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type |
---|
String |
General Properties
Name | Description |
---|---|
Filter | Allows the targeting of specific data based on conditions |
Coverage Customization | Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Description |
---|---|
Record | Flag inconsistencies at the row level |
Shape | Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid email addresses.
Sample Data
C_CUSTKEY | C_CONTACT_DETAILS |
---|---|
1 | {"name": "John Doe", "email": "john.doe@example.com"} |
2 | {"name": "Amy Lu", "email": "amy.lu@"} |
3 | {"name": "Jane Smith", "email": "jane.smith@domain.org"} |
{
"description": "Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid email addresses",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_EMAIL_JSON"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsEmail",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CONTACT_DETAILS
value does not follow a typical email format.
graph TD
A[Start] --> B[Retrieve C_CONTACT_DETAILS]
B --> C{Does C_CONTACT_DETAILS contain an email address?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The C_CONTACT_DETAILS
value of {"name": "Amy Lu", "email": "amy.lu@"}
does not contain an email address.
Shape Anomaly
In C_CONTACT_DETAILS
, 33.333% of 3 filtered records (1) do not contain email addresses.
Contains Social Security Number
Definition
Asserts that the values contain a social security number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
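For example, an illustrative sketch using the sample CUSTOMER columns shown below (assumed names, not taken from the original guide):

```sql
-- Correct: combine conditions with AND / OR
C_CUSTKEY BETWEEN 1 AND 1000 AND C_CONTACT_DETAILS IS NOT NULL

-- Incorrect: a full SELECT statement is not a valid filter expression
SELECT * FROM CUSTOMER WHERE C_CUSTKEY BETWEEN 1 AND 1000
```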
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers.
Sample Data
C_CUSTKEY | C_CONTACT_DETAILS |
---|---|
1 | {"name": "John Doe", "ssn": "234567890"} |
2 | {"name": "Amy Lu", "ssn": "666-12-3456"} |
3 | {"name": "Jane Smith", "ssn": "429-14-2216"} |
{
"description": "Ensure that all C_CONTACT_DETAILS entries in the CUSTOMER table contain valid social security numbers",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CONTACT_DETAILS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsSocialSecurityNumber",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CONTACT_DETAILS
value contains an SSN beginning with 666, which is not a valid Social Security Number prefix.
graph TD
A[Start] --> B[Retrieve C_CONTACT_DETAILS]
B --> C{Does C_CONTACT_DETAILS contain a valid SSN format?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The C_CONTACT_DETAILS
value of {"name": "Amy Lu", "ssn": "666-12-3456"}
does not contain a social security number.
Shape Anomaly
In C_CONTACT_DETAILS
, 33.333% of 3 filtered records (1) do not contain social security numbers.
Contains URL
Definition
Asserts that the values contain valid URLs.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
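An illustrative sketch using the sample SUPPLIER columns below (assumed names, not from the original guide):

```sql
-- Correct: Spark SQL functions may be used within the expression
S_DETAILS IS NOT NULL AND length(S_DETAILS) > 0

-- Incorrect: WHERE clauses are not supported
WHERE S_DETAILS IS NOT NULL
```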
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all S_DETAILS entries in the SUPPLIER table contain valid URLs.
Sample Data
S_SUPPKEY | S_DETAILS |
---|---|
1 | {"name": "Tech Parts", "website": "www.techparts.com"} |
2 | {"name": "Hardwarepro", "website": "https://www.hardwarepro.com"} |
3 | {"name": "Smith's Tools", "website": "ftp:server:8080"} |
{
"description": "Ensure that all S_DETAILS entries in the SUPPLIER table contain valid URLs",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["S_DETAILS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "containsUrl",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with S_SUPPKEY
1 and 3 do not satisfy the rule because their S_DETAILS
values do not contain a valid URL pattern.
graph TD
A[Start] --> B[Retrieve S_DETAILS]
B --> C{Does S_DETAILS contain a valid URL?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The S_DETAILS
value of {"name": "Tech Parts", "website": "www.techparts.com"}
does not contain a valid URL.
Shape Anomaly
In S_DETAILS
, 66.667% of 3 filtered records (2) do not contain a valid URL.
Distinct Count
Definition
Asserts on the approximate count distinct of the given column.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
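For instance (an illustrative sketch only, using column names from the sample ORDERS data below):

```sql
-- Correct: direct condition on columns of the container
O_ORDERKEY > 0 AND O_ORDERSTATUS IS NOT NULL

-- Incorrect: WHERE (or any full SQL statement) is not accepted
WHERE O_ORDERSTATUS IS NOT NULL
```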
Specific Properties
Specify the distinct count expectation for the values in the field.
Name | Description |
---|---|
Value | The exact count of distinct values expected in the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that there are exactly 3 distinct O_ORDERSTATUS in the ORDERS table: 'O' (Open), 'F' (Finished), and 'P' (In Progress).
Sample Data
O_ORDERKEY | O_ORDERSTATUS |
---|---|
1 | O |
2 | F |
... | ... |
20 | X |
21 | O |
{
"description": "Ensure that there are exactly 3 distinct O_ORDERSTATUS in the ORDERS table: 'O' (Open), 'F' (Finished), and 'P' (In Progress)",
"coverage": 1,
"properties": {
"value":3
},
"tags": [],
"fields": ["O_ORDERSTATUS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "distinctCount",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the rule is violated because the O_ORDERSTATUS
contains 4 distinct values ('O', 'F', 'P', and 'X') rather than the expected 3: 'O' (Open), 'F' (Finished), and 'P' (In Progress).
graph TD
A[Start] --> B[Retrieve all O_ORDERSTATUS entries and count distinct values]
B --> C{Is distinct count of O_ORDERSTATUS != 3?}
C -->|Yes| D[Mark as Anomalous]
C -->|No| E[End]
Potential Violation Messages
Shape Anomaly
In O_ORDERSTATUS
, the distinct count of the records is not 3.
Entity Resolution
Definition
Asserts that every distinct entity is appropriately represented once and only once.
In-Depth Overview
This check performs automated entity name clustering to identify entities with similar names that likely represent
the same entity. It then assigns each cluster a unique entity identifier and asserts that every row with the same
entity identifier shares the same value for the designated distinction field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
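A brief sketch using the BUSINESS_NAME and BUSINESS_ID columns from the example below (illustrative only, not from the original guide):

```sql
-- Correct: ignore rows with blank names before clustering
BUSINESS_NAME IS NOT NULL AND trim(BUSINESS_NAME) <> ''

-- Incorrect: referencing another container in a subquery is not supported
BUSINESS_ID IN (SELECT id FROM other_container)
```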
Specific Properties
Name | Description |
---|---|
Distinction Field | The field that must hold a distinct value for every distinct entity |
Pair Substrings | Considers entities a match if one entity is part of the other |
Pair Homophones | Considers entities a match if they sound alike, even if spelled differently |
Spelling Similarity | The minimum similarity required for clustering two entity names |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: If you have a businesses table with an id field and a name field, this check can be configured to resolve name and use id as the distinction field. During each scan, similar names will be grouped and assigned the same entity identifier, and any rows that share the same entity identifier but have different values for id will be identified as anomalies.
Sample Data
BUSINESS_ID | BUSINESS_NAME |
---|---|
1 | ACME Boxing |
2 | Frank's Flowers |
3 | ACME Boxes |
{
"description": "Ensure a `businesses` table with an `BUSINESS_ID` field and a `BUSINESS_NAME` field shares the same `entity identifier`",
"coverage": 1,
"properties": {
"distinct_field_name":"BUSINESS_ID",
"pair_substrings":true,
"pair_homophones":true,
"spelling_similarity_threshold":0.6
},
"tags": [],
"fields": ["BUSINESS_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "entityResolution",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with BUSINESS_ID
1 and 3 do not satisfy the rule because their BUSINESS_NAME
values will be marked as similar, yet they do not share the same BUSINESS_ID.
graph TD
A[Start] --> B[Retrieve Original Data]
B --> C{Which entities are similar?}
C --> D[Assign each record an entity identifier]
D --> E[Cluster records by entity identifier]
E --> F{Do records with same<br/>entity identifier share the<br/>same distinction field value?}
F -->|Yes| I[End]
F -->|No| H[Mark as Anomalous]
H --> I
Equal To
Definition
Asserts that all of the selected fields equal a value.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
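For example (a sketch only; the column names come from the sample LINEITEM data below):

```sql
-- Correct: combine conditions with AND / OR
L_ORDERKEY > 0 AND L_QUANTITY IS NOT NULL

-- Incorrect: statement keywords such as WHERE are rejected
WHERE L_QUANTITY IS NOT NULL
```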
Specific Properties
Specify the value that the selected field should be equal to.
Name | Description |
---|---|
Value | Specifies the value a field should be equal to. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is equal to a value of 10.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 10 |
2 | 2 | 5 |
3 | 3 | 10 |
4 | 4 | 8 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is equal to a value of 10",
"coverage": 1,
"properties": {
"value":"10",
"inclusive":true
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "equalTo",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 4 do not satisfy the rule because their L_QUANTITY
values are not equal to the specified value of 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY = 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
Not all of the fields are equal to the value of 10
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are not equal to the value of 10
Equal To Field
Definition
Asserts that a field is equal to another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
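For example, an illustrative filter limiting the comparison to 1998 shipments (column names assumed from the sample ORDERS data below):

```sql
-- Correct: Spark SQL functions such as year() may be used
year(O_SHIPDATE) = 1998

-- Incorrect: WHERE clauses are not supported
WHERE year(O_SHIPDATE) = 1998
```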
Specific Properties
Specify the field to compare for equality with the selected field.
Name | Description |
---|---|
Field to compare | The field name whose values should match those of the selected field. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
String
String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.
Ignore Whitespace
When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.
Illustration
In this example, Value A and Value B are compared with the string comparison option Ignore Whitespace set to True.
Value A | Value B | Are equal? | Has whitespace? |
---|---|---|---|
Leonidas | Leonidas | True | No |
Beth | Beth | True | Yes |
Ana | Anna | False | Yes |
Joe | Joel | False | No |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Scenario: An e-commerce platform sells digital products. The shipping date (when the digital product link is sent) should always be the same as the delivery date (when the customer acknowledges receiving the product).
Objective: Ensure that the O_SHIPDATE in the ORDERS table matches its delivery date O_DELIVERYDATE.
Sample Data
O_ORDERKEY | O_SHIPDATE | O_DELIVERYDATE |
---|---|---|
1 | 1998-01-04 | 1998-01-04 |
2 | 1998-01-14 | 1998-01-15 |
3 | 1998-01-12 | 1998-01-12 |
{
"description": "Ensure that the O_SHIPDATE in the ORDERS table matches its delivery date O_DELIVERYDATE",
"coverage": 1,
"properties": {"field_name":"O_DELIVERYDATE", "inclusive":false},
"tags": [],
"fields": ["O_SHIPDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "equalToField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because its O_SHIPDATE
of 1998-01-14 does not match the O_DELIVERYDATE
of 1998-01-15.
graph TD
A[Start] --> B[Retrieve O_SHIPDATE and O_DELIVERYDATE]
B --> C{Is O_SHIPDATE = O_DELIVERYDATE?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_SHIPDATE
value of 1998-01-14 is not equal to the value of O_DELIVERYDATE which is 1998-01-15.
Shape Anomaly
In O_SHIPDATE
, 33.333% of 3 filtered records (1) are not equal to O_DELIVERYDATE.
Exists In
Definition
Asserts that values assigned to a field exist as values in another field.
In-Depth Overview
The ExistsIn
rule allows you to cross-validate data between different sources, whether it’s object storage systems or databases.
Traditionally, databases might utilize foreign key constraints (if available) to enforce data integrity between related tables. The ExistsIn
rule extends this concept in two powerful ways:
- Cross-System Integrity: it allows for integrity checks to span across different databases or even entirely separate systems. This is particularly advantageous in scenarios where data sources are fragmented across diverse platforms.
- Flexible Data Formats: Beyond just databases, this rule can validate values against various data formats, such as ensuring values in a file align with those in a table.
These enhancements enable businesses to maintain data integrity even in complex, multi-system environments.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
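An illustrative sketch with the sample NATION columns below (assumed names, not from the original guide):

```sql
-- Correct: simple direct condition
N_NATIONNAME IS NOT NULL

-- Incorrect: subqueries that reference other containers are not supported
N_NATIONNAME IN (SELECT COUNTRY_NAME FROM country_lookup)
```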
Specific Properties
Define the datastore, table/file, and field where the rule should look for matching values.
Name | Description |
---|---|
Datastore | The source datastore where the profile of the reference field is located. |
Table/file | The profile (e.g. table, view or file) containing the reference field. |
Field | The field name whose values should match those of the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all NATION_NAME entries in the NATION table match entries under the COUNTRY_NAME column in an external lookup file listing official country names.
Sample Data
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Algeria |
2 | Argentina |
3 | Atlantida |
{
"description": "Ensure that all NATION_NAME entries in the NATION table match entries under the COUNTRY_NAME column in an external lookup file listing official country names",
"coverage": 1,
"properties": {
"field_name":"COUNTRY_NAME",
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["NATION_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "existsIn",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Lookup File Sample
COUNTRY_NAME |
---|
Algeria |
Argentina |
Brazil |
Canada |
... |
Zimbabwe |
Anomaly Explanation
In the sample data above, the entry with N_NATIONKEY
3 does not satisfy the rule because the N_NATIONNAME
"Atlantida" does not match any COUNTRY_NAME
in the official country names lookup file.
graph TD
A[Start] --> B[Retrieve COUNTRY_NAME]
B --> C[Retrieve N_NATIONNAME]
C --> D{Does N_NATIONNAME exist in COUNTRY_NAME?}
D -->|Yes| E[Move to Next Record/End]
D -->|No| F[Mark as Anomalous]
F --> E
Potential Violation Messages
Record Anomaly
The N_NATIONNAME
value of 'Atlantida'
does not exist in COUNTRY_NAME
.
Shape Anomaly
In N_NATIONNAME
, 33.333% of 3 filtered records (1) do not match any COUNTRY_NAME
.
Expected Schema
Definition
Asserts that all of the selected fields are present in the datastore.
Behavior
The expected schema is the first check to be tested during a scan operation. If it fails, the scan operation will result as Failure
with the following message:
<container-name>
: Aborted because schema check anomalies were identified.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Specify the fields that must be present in the schema, and determine if a schema change caused by additional fields should fail or pass the assertion.
Name | Description |
---|---|
Fields | List of fields that must be present in the schema. |
Allow other fields | If true, new fields are allowed to be present in the schema. Otherwise, the assertion is stricter. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that expected fields such as L_ORDERKEY, L_PARTKEY, and L_SUPPKEY are always present in the LINEITEM table.
Sample Data
Valid
FIELD_NAME | FIELD_TYPE |
---|---|
L_ORDERKEY | NUMBER |
L_PARTKEY | NUMBER |
L_SUPPKEY | NUMBER |
L_LINENUMBER | NUMBER |
L_QUANTITY | NUMBER |
L_EXTENDEDPRICE | NUMBER |
... | ... |
Invalid
L_SUPPKEY is missing from the schema
FIELD_NAME | FIELD_TYPE |
---|---|
L_ORDERKEY | NUMBER |
L_PARTKEY | NUMBER |
L_LINENUMBER | NUMBER |
L_QUANTITY | NUMBER |
L_EXTENDEDPRICE | NUMBER |
... | ... |
{
"description": "Ensure that expected fields such as L_ORDERKEY, L_PARTKEY, and L_SUPPKEY are always present in the LINEITEM table",
"coverage": 1,
"properties": {
"allow_other_fields":false,
"list":["L_ORDERKEY","L_PARTKEY","L_SUPPKEY"]
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "expectedSchema",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
Among the presented sample schemas, the second one is missing one of the expected fields (L_SUPPKEY). Only the first schema contains all of the expected fields.
graph TD
A[Start] --> B{Check for Field Presence}
B -.->|Field is missing| C[Mark as Shape Anomaly]
B -.->|All fields present| D[End]
Potential Violation Messages
Shape Anomaly
The required fields (L_SUPPKEY
) are not present.
Expected Values
Definition
Asserts that values are contained within a list of expected values.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
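For instance (a sketch only; columns taken from the sample ORDERS data below):

```sql
-- Correct: scope the check to a subset of orders
O_ORDERKEY > 0

-- Incorrect: traditional SQL statements are rejected
SELECT O_ORDERSTATUS FROM ORDERS
```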
Specific Properties
Specify the list of expected values for the data in the field.
Name | Description |
---|---|
List | A predefined set of values against which the data is validated. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_ORDERSTATUS entries in the ORDERS table only contain expected order statuses: "O", "F", and "P".
Sample Data
O_ORDERKEY | O_ORDERSTATUS |
---|---|
1 | F |
2 | O |
3 | P |
4 | X |
{
"description": "Ensure that all O_ORDERSTATUS entries in the ORDERS table only contain expected order statuses: "O", "F", and "P"",
"coverage": 1,
"properties": {
"list":["O","F","P"]
},
"tags": [],
"fields": ["O_ORDERSTATUS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "expectedValues",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
4 does not satisfy the rule because the O_ORDERSTATUS
"X" is not on the list of expected order statuses ("O", "F", "P").
graph TD
A[Start] --> B[Retrieve O_ORDERSTATUS]
B --> C{Is O_ORDERSTATUS in 'O', 'F', 'P'?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_ORDERSTATUS
value of 'X'
does not appear in the list of expected values
Shape Anomaly
In O_ORDERSTATUS
, 25.000% of 4 filtered records (1) do not appear in the list of expected values
Field Count
Definition
Asserts that there must be exactly a specified number of fields.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Specify the exact number of fields expected in the profile.
Name | Description |
---|---|
Number of Fields | The exact number of fields that should be present in the profile. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the ORDERS profile contains exactly 9 fields.
Sample Profile
Valid
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_SHIPPRIORITY | STRING |
O_COMMENT | STRING |
Invalid
count (8) less than expected (9)
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_COMMENT | STRING |
count (10) greater than expected (9)
FIELD_NAME | FIELD_TYPE |
---|---|
O_ORDERKEY | STRING |
O_CUSTKEY | STRING |
O_ORDERSTATUS | STRING |
O_TOTALPRICE | FLOAT |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | STRING |
O_CLERK | STRING |
O_COMMENT | STRING |
EXTRA_FIELD | UNKNOWN |
{
"description": "Ensure that the ORDERS profile contains exactly 9 fields",
"coverage": 1,
"properties": {
"value": 9
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "fieldCount",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
Among the presented sample profiles, the second one is missing a field, while the third one contains an extra field. Only the first profile has the correct number of fields, which is 9.
graph TD
A[Start] --> B[Retrieve Profile Fields]
B --> C{Does the profile have exactly 9 fields?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Shape Anomaly
In ORDERS
, the field count is not 9
.
Greater Than
Definition
Asserts that the field is a number greater than (or equal to) a value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
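For example (illustrative only, using the sample LINEITEM columns below):

```sql
-- Correct: only evaluate rows that actually carry a quantity
L_QUANTITY IS NOT NULL

-- Correct: conditions can be combined
L_ORDERKEY > 0 AND L_QUANTITY IS NOT NULL
```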
Specific Properties
Allows specifying a numeric value that acts as the threshold.
Name | Description |
---|---|
Value | The number to use as the base comparison. |
Inclusive | If true, the comparison will also allow values equal to the threshold. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are greater than 10.
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | 9 |
2 | 15 |
3 | 5 |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are greater than 10",
"coverage": 1,
"properties": {
"inclusive": true,
"value": 10
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "greaterThan",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are not greater than 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY > 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 5
is not greater than the value of 10.
Shape Anomaly
In L_QUANTITY
, 66.667% of 3 filtered records (2) are not greater than 10.
Greater Than Field
Definition
Asserts that the field is greater than another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
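As a sketch (columns from the sample ORDERS data below; the scalar-subquery form against {{ _qualytics_self }} is an assumption based on the scan-time variable note above, not a documented example):

```sql
-- Correct: direct condition
O_DISCOUNT >= 0

-- Assumed correct: subquery against the current dataframe via the reserved variable
O_TOTALPRICE > (SELECT avg(O_TOTALPRICE) FROM {{ _qualytics_self }})
```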
Specific Properties
Allows specifying another field against which the value comparison will be performed.
Name | Description |
---|---|
Field to compare | Specifies the name of the field against which the value will be compared. |
Inclusive | If true, the comparison will also allow values equal to the value of the other field. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared according to the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared according to the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all O_TOTALPRICE entries in the ORDERS table are greater than their respective O_DISCOUNT.
Sample Data
O_ORDERKEY | O_TOTALPRICE | O_DISCOUNT |
---|---|---|
1 | 100 | 105 |
2 | 500 | 10 |
3 | 120 | 121 |
{
"description": "Ensure that all O_TOTALPRICE entries in the ORDERS table are greater than their respective O_DISCOUNT",
"coverage": 1,
"properties": {
"field_name": "O_DISCOUNT",
"inclusive": true
},
"tags": [],
"fields": ["O_TOTALPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "greaterThanField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
1 and 3 do not satisfy the rule because their O_TOTALPRICE
values are not greater than their respective O_DISCOUNT
values.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE and O_DISCOUNT]
B --> C{Is O_TOTALPRICE > O_DISCOUNT?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_TOTALPRICE
value of 100
is not greater than the value of O_DISCOUNT
.
Shape Anomaly
In O_TOTALPRICE
, 66.667% of 3 filtered records (2) are not greater than O_DISCOUNT
.
Is Address
Definition
Asserts that the values contain the specified required elements of an address.
In-Depth Overview
This check leverages machine learning powered by the libpostal library to support multilingual street address parsing/normalization that can handle addresses all over the world. The underlying statistical NLP model was trained using data from OpenAddress and OpenStreetMap, a total of about 1.2 billion records of data from over 230 countries, in 100+ languages. The international address parser uses Conditional Random Fields, which can infer a globally optimal tag sequence instead of making local decisions at each word, and it achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before).
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
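To make the distinction above concrete, here is a minimal, hypothetical PySpark sketch (the DataFrame and column names are illustrative, borrowed from the ORDERS examples in this guide). It only demonstrates that a filter is a boolean Spark SQL expression rather than a full query; it is not how Qualytics applies filters internally.

```python
# Minimal sketch, assuming a hypothetical DataFrame with ORDERS-like columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, 150.0, "F"), (2, 90.0, "O")],
    ["O_ORDERKEY", "O_TOTALPRICE", "O_ORDERSTATUS"],
)

# Correct usage: a boolean expression evaluated against each row.
filtered = df.filter("O_TOTALPRICE > 100 AND O_ORDERSTATUS = 'F'")
filtered.show()

# Incorrect usage: a full SELECT ... WHERE statement is not a filter expression
# and would fail to parse.
# df.filter("SELECT * FROM orders WHERE O_TOTALPRICE > 100")
```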
Specific Properties
Name | Description |
---|---|
Required Labels | The labels that must be identifiable in the value of each record |
Info
The address parser can technically use any string labels that are defined in the training data, but these are the ones currently supported:
- road: Street name(s)
- city: Any human settlement including cities, towns, villages, hamlets, localities, etc
- state: First-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to "state" as well (convention used in OSM, GeoPlanet, etc.)
- country: Sovereign nations and their dependent territories, anything with an ISO-3166 code
- postcode: Postal codes used for mail sorting
This check allows the user to define any combination of these labels as required elements of the value held in each record. Any value that does not contain every required element will be identified as anomalous, as illustrated in the sketch below.
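For intuition only, the sketch below uses the open-source libpostal Python bindings (the postal package) to parse one of the sample addresses and test whether a set of required labels is present. Qualytics performs this parsing internally; the package call, address, and label set here are purely illustrative.

```python
# Illustrative use of the libpostal Python bindings; not part of the Qualytics platform itself.
from postal.parser import parse_address

required_labels = {"road", "city", "state", "postcode"}

address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = parse_address(address)               # a list of (token, label) tuples
found_labels = {label for _, label in parsed}

# The value is anomalous if any required label is missing from the parsed address.
is_anomalous = not required_labels.issubset(found_labels)
print(found_labels, "anomalous:", is_anomalous)
```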
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all values in O_MAILING_ADDRESS include the labels "road", "city", "state", and "postcode"
Sample Data
O_ORDERKEY | O_MAILING_ADDRESS |
---|---|
1 | One-hundred twenty E 96th St, new york NY 14925 |
2 | Quatre vingt douze R. de l'Église, 75196 cedex 04 |
3 | 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA |
{
"description": "Ensure that all values in O_MAILING_ADDRESS include the labels "road", "city", "state", and "postcode"",
"coverage": 1,
"properties": {
"required_labels": ["road","city","state","country","postcode"]
},
"tags": [],
"fields": ["O_MAILING_ADDRESS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isAddress",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because the O_MAILING_ADDRESS
value includes only a road and postcode which violates the business logic that city and state also be present.
graph TD
A[Start] --> B[Retrieve O_MAILING_ADDRESS]
B --> C[Infer address labels using ML]
C --> D{Are all required labels present?}
D -->|Yes| E[Move to Next Record/End]
D -->|No| F[Mark as Anomalous]
F --> E
Potential Violation Messages
Record Anomaly
The O_MAILING_ADDRESS
value of Quatre vingt douze R. de l'Église, 75196 cedex 04
does not adhere to the required format.
Shape Anomaly
In O_MAILING_ADDRESS
, 33.33% of 3 filtered records (1) do not adhere to the required format.
Is Credit Card
Definition
Asserts that the values are credit card numbers.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all C_CREDIT_CARD entries in the CUSTOMER table are valid credit card numbers.
Sample Data
C_CUSTKEY | C_CREDIT_CARD |
---|---|
1 | 5105105105105100 |
2 | ABC12345XYZ |
3 | 4111111111111111 |
{
"description": "Ensure that all C_CREDIT_CARD entries in the CUSTOMER table are valid credit card numbers",
"coverage": 1,
"properties": {},
"tags": [],
"fields": ["C_CREDIT_CARD"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isCreditCard",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
2 does not satisfy the rule because its C_CREDIT_CARD
value is not a valid credit card number.
graph TD
A[Start] --> B[Retrieve C_CREDIT_CARD]
B --> C{Is C_CREDIT_CARD valid?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
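The guide does not spell out the exact validation applied by this check, but card numbers are conventionally validated with a Luhn checksum plus a basic length test. The sketch below shows that conventional logic purely as an illustration of what "valid" typically means; it is not necessarily Qualytics' implementation.

```python
# Illustrative Luhn checksum; an assumption about conventional card validation,
# not necessarily the exact logic used by the platform.
def looks_like_credit_card(value: str) -> bool:
    digits = value.strip().replace(" ", "").replace("-", "")
    if not digits.isdigit() or not (13 <= len(digits) <= 19):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(looks_like_credit_card("4111111111111111"))  # True
print(looks_like_credit_card("ABC12345XYZ"))       # False
```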
Potential Violation Messages
Record Anomaly
The C_CREDIT_CARD
value of ABC12345XYZ
is not a valid credit card number.
Shape Anomaly
In C_CREDIT_CARD
, 33.33% of 3 filtered records (1) are not valid credit card numbers.
Is Replica Of
Definition
Asserts that the dataset created by the targeted field(s) is replicated by the referred field(s).
In-Depth Overview
The IsReplicaOf
rule ensures that data integrity is maintained when data is replicated from one source to another. This involves checking not only the data values themselves but also ensuring that the structure and relationships are preserved.
In a distributed data ecosystem, replication often occurs to maintain high availability, create backups, or feed data into analytical systems. However, discrepancies might arise due to various reasons such as network glitches, software bugs, or human errors. The IsReplicaOf
rule serves as a safeguard against these issues by:
- Preserving Data Structure: Ensuring that the structure of the replicated data matches the original.
- Checking Data Values: Ensuring that every piece of data in the source exists in the replica.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
String | ✓ |
Boolean | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the datastore and table/file where the replica of the targeted fields is located for comparison.
Name | Description |
---|---|
Row Identifiers | The list of fields defining the compound key to identify rows in the comparison analysis. |
Datastore | The source datastore where the replica of the targeted field(s) is located. |
Table/file | The table, view or file in the source datastore that should serve as the replica. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Row Identifiers
This optional input enables row-level comparison analysis by defining a list of fields as row identifiers. It allows a more detailed comparison between tables/files, where each row's compound key is used to identify its presence or absence in the reference table/file relative to the target table/file. Qualytics can report whether a row exists and distinguish which field values differ in each row present in the reference table/file, helping to determine whether it is a replica.
Info
Anomalies produced by an IsReplicaOf quality check that makes use of Row Identifiers have their source records presented in a different visualization.
See more at: Comparison Source Records
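Conceptually, Row Identifiers turn the replica comparison into a keyed join between the target and the reference. The hypothetical PySpark sketch below (table and column names borrowed from the NATION example further down) shows that kind of keyed comparison; it is an approximation, not Qualytics' implementation.

```python
# Hypothetical sketch of a keyed replica comparison; names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("replica-sketch").getOrCreate()
nation = spark.createDataFrame(
    [(1, "Australia"), (2, "United States"), (3, "Uruguay")],
    ["N_NATIONKEY", "N_NATIONNAME"],
)
nation_backup = spark.createDataFrame(
    [(1, "Australia"), (2, "USA"), (3, "Uruguay")],
    ["N_NATIONKEY", "N_NATIONNAME"],
)

# Join on the row identifier (the compound key) and flag rows that are missing
# from the reference or whose non-key values differ.
differences = (
    nation.alias("t")
    .join(nation_backup.alias("r"), on="N_NATIONKEY", how="left")
    .where(
        F.col("r.N_NATIONNAME").isNull()
        | (F.col("t.N_NATIONNAME") != F.col("r.N_NATIONNAME"))
    )
)
differences.show()
```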
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
String
String comparators facilitate comparisons of textual data by allowing variations in spacing. This capability is essential for ensuring data consistency, particularly where minor text inconsistencies may occur.
Ignore Whitespace
When enabled, this setting allows the comparator to ignore differences in whitespace. This means sequences of whitespace are collapsed into a single space, and any leading or trailing spaces are removed. This can be particularly useful in environments where data entry may vary in formatting but where those differences are not relevant to the data's integrity.
Illustration
In this example, Value A and Value B are compared with the string comparison option Ignore Whitespace set to True.
Value A | Value B | Are equal? | Has whitespace? |
---|---|---|---|
Leonidas | Leonidas | True | No |
Beth | Beth | True | Yes |
Ana | Anna | False | Yes |
Joe | Joel | False | No |
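Putting the comparator options together, here is a small illustrative sketch of the tolerance logic the tables above describe (absolute, percentage, duration, and whitespace-insensitive string equality). It mirrors the examples above rather than the platform's internal code.

```python
# Illustrative comparator logic matching the tables above; not Qualytics' internal code.
import re

def equal_absolute(a: float, b: float, threshold: float) -> bool:
    return abs(a - b) <= threshold

def equal_percentage(a: float, b: float, threshold_pct: float) -> bool:
    # Percentage change relative to Value A, as in the formula above.
    return abs((b - a) / a) * 100 <= threshold_pct

def equal_duration(a: float, b: float, value: float) -> bool:
    # a, b and value are all expressed in the chosen unit (millis, seconds or days).
    return abs(a - b) <= value

def equal_string(a: str, b: str, ignore_whitespace: bool = True) -> bool:
    if ignore_whitespace:
        a = re.sub(r"\s+", " ", a).strip()
        b = re.sub(r"\s+", " ", b).strip()
    return a == b

print(equal_absolute(100, 150, 50))           # True  (difference of 50 is within 50)
print(equal_percentage(180, 200, 10))         # False (11.11% change exceeds 10%)
print(equal_duration(5, 7, 1))                # False (2 days exceeds 1 day)
print(equal_string("Beth", " Beth ", True))   # True once whitespace is ignored
```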
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✗ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Scenario: Consider that the fields N_NATIONKEY and N_NATIONNAME in the NATION table are being replicated to a backup database for disaster recovery purposes. The data engineering team wants to ensure that both fields in the replica in the backup accurately reflect the original.
Objective: Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table are replicas in the NATION_BACKUP table.
Sample Data from NATION
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Australia |
2 | United States |
3 | Uruguay |
Replica Sample Data from NATION_BACKUP
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Australia |
2 | USA |
3 | Uruguay |
{
"description": "Ensure that N_NATIONKEY and N_NATIONNAME from the NATION table are replicas in the NATION_BACKUP table",
"coverage": 1,
"properties": {
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["N_NATIONKEY", "N_NATIONNAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isReplicaOf",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
The datasets representing the fields N_NATIONKEY
and N_NATIONNAME
in the original and the replica are not completely identical, indicating a possible discrepancy in the replication process or an unintended change.
graph TD
A[Start] --> B[Retrieve Original Data]
B --> C[Retrieve Replica Data]
C --> D{Do datasets match for both fields?}
D -->|Yes| E[End]
D -->|No| F[Mark as Anomalous]
F --> E
-- An illustrative SQL query comparing the original to the replica for both fields.
select
    orig.n_nationkey as original_key,
    orig.n_nationname as original_name,
    replica.n_nationkey as replica_key,
    replica.n_nationname as replica_name
from nation as orig
left join nation_backup as replica on orig.n_nationkey = replica.n_nationkey
where
    replica.n_nationkey is null -- the row is missing from the replica
    or orig.n_nationname <> replica.n_nationname -- the replicated value differs
Potential Violation Messages
Shape Anomaly
There is 1 record that differs between NATION_BACKUP
(3 records) and NATION
(3 records) in <datastore_name>
Is Type
Definition
Asserts that the data is of a specific type.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Specify the expected type for the data in the field.
Name | Description |
---|---|
Field Type | The type that values in the selected field should conform to. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all L_QUANTITY entries in the LINEITEM table are of Integral type.
Sample Data
L_ORDERKEY | L_QUANTITY |
---|---|
1 | "10" |
2 | "15.5" |
3 | "Ten" |
{
"description": "Ensure that all L_QUANTITY entries in the LINEITEM table are of Integral type",
"coverage": 1,
"properties": {
"field_type":"Integral"
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "isType",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_QUANTITY
values are not of Integral type.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY of Integral type?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
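As a rough illustration of what "of Integral type" means for string values (an approximation of the rule's intent, not the platform's parser), the sample values can be tested by attempting an integer conversion:

```python
# Illustrative check of whether a string value reads as an Integral;
# an approximation of the rule's intent only.
def is_integral(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

for v in ["10", "15.5", "Ten"]:
    print(v, is_integral(v))   # True, False, False
```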
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of Ten
is not a valid Integral.
Shape Anomaly
In L_QUANTITY
, 66.667% of 3 filtered records (2) are not a valid Integral.
Less Than
Definition
Asserts that the field is a number less than (or equal to) a value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying a numeric value that acts as the threshold.
Name | Description |
---|---|
Value | The number to use as the base comparison. |
Inclusive | If true, the comparison will also allow values equal to the threshold. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all L_PRICE entries in the LINEITEM table are less than 20.
Sample Data
L_ORDERKEY | L_PRICE |
---|---|
1 | 18 |
2 | 25 |
3 | 23 |
{
"description": "Ensure that all L_PRICE entries in the LINEITEM table are less than 20",
"coverage": 1,
"properties": {
"inclusive": true,
"value": 20
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "lessThan",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_PRICE
values are not less than 20.
Potential Violation Messages
Record Anomaly
The L_PRICE
value of 23
is not less than the value of 20.
Shape Anomaly
In L_PRICE
, 66.667% of 3 filtered records (2) are not less than 20.
Less Than Field
Definition
Asserts that the field is less than another field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date | ✓ |
Timestamp | ✓ |
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying another field against which the value comparison will be performed.
Name | Description |
---|---|
Field to compare | Specifies the name of the field against which the value will be compared. |
Inclusive | If true, the comparison will also allow values equal to the value of the other field. Otherwise, it's exclusive. |
Comparators | Specifies how variations are handled, allowing for slight deviations within a defined margin of error. |
Details
Comparators
The Comparators allow you to set margins of error, accommodating slight variations in data validation. This flexibility is crucial for maintaining data integrity, especially when working with different data types such as numeric values, durations, and strings. Here's an overview of how each type of comparator can be beneficial for you:
Numeric
Numeric comparators enable you to compare numbers with a specified margin, which can be a fixed absolute value or a percentage. This allows for minor numerical differences that are often acceptable in real-world data.
Comparison Type
- Absolute Value: Uses a fixed threshold for determining equality. It's ideal when you need consistent precision across measurements.
- Percentage Value: Uses a percentage of the original value as the threshold for equality comparisons. It's suitable for floating point numbers where precision varies.
Threshold
The threshold is the value you set to define the margin of error:
- When using Absolute Value, the threshold represents the maximum allowable difference between two values for them to be considered equal.
- For Percentage Value, the threshold is the percentage that describes how much a value can deviate from a reference value and still be considered equal.
Illustration using Absolute Value
In this example, Value A and Value B are compared using the defined Threshold of 50.
Value A | Value B | Difference | Are equal? |
---|---|---|---|
100 | 150 | 50 | True |
100 | 90 | 10 | True |
100 | 155 | 55 | False |
100 | 49 | 51 | False |
Illustration using Percentage Value
In this example, Value A and Value B are compared using the defined Threshold of 10%.
Percentage Change Formula: [ (Value B - Value A) / Value A ] * 100
Value A | Value B | Percentage Change | Are equal? |
---|---|---|---|
120 | 132 | 10% | True |
150 | 135 | 10% | True |
200 | 180 | 10% | True |
160 | 150 | 6.25% | True |
180 | 200 | 11.11% | False |
Duration
Duration comparators support time-based comparisons, allowing for flexibility in how duration differences are managed. This flexibility is crucial for datasets where time measurements are essential but can vary slightly.
Unit
The unit of time you select determines how granular the comparison is:
- Millis: Measures time in milliseconds, ideal for high-precision needs.
- Seconds: Suitable for most general purposes where precision is important but doesn't need to be to the millisecond.
- Days: Best for longer durations.
Value
Value sets the maximum acceptable difference in time to consider two values as equal. It serves to define the margin of error, accommodating small discrepancies that naturally occur over time.
Illustration using Duration Comparator
Unit | Value A | Value B | Difference | Threshold | Are equal? |
---|---|---|---|---|---|
Millis | 500 ms | 520 ms | 20 ms | 25 ms | True |
Seconds | 30 sec | 31 sec | 1 sec | 2 sec | True |
Days | 5 days | 7 days | 2 days | 1 day | False |
Millis | 1000 ms | 1040 ms | 40 ms | 25 ms | False |
Seconds | 45 sec | 48 sec | 3 sec | 2 sec | False |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all O_DISCOUNT entries in the ORDERS table are less than their respective O_TOTALPRICE.
Sample Data
O_ORDERKEY | O_TOTALPRICE | O_DISCOUNT |
---|---|---|
1 | 105 | 100 |
2 | 500 | 10 |
3 | 121 | 125 |
{
"description": "Ensure that all O_DISCOUNT entries in the ORDERS table are less than their respective O_TOTALPRICE",
"coverage": 1,
"properties": {
"field_name": "O_TOTALPRICE",
"inclusive":true
},
"tags": [],
"fields": ["O_DISCOUNT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "lessThanField",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
3 does not satisfy the rule because its O_DISCOUNT
value is not less than its respective O_TOTALPRICE
value.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE and O_DISCOUNT]
B --> C{Is O_DISCOUNT < O_TOTALPRICE?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The O_DISCOUNT
value of 125
is not less than the value of O_TOTALPRICE
.
Shape Anomaly
In O_DISCOUNT
, 33.333% of 3 filtered records (1) is not less than O_TOTALPRICE
.
Matches Pattern
Definition
Asserts that a field must match a pattern.
In-Depth Overview
Patterns, typically expressed as regular expressions, allow for the enforcement of custom structural norms for data fields. For complex patterns, regular expressions offer a powerful tool to ensure conformity to the expected format.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Allows specifying a pattern against which the field will be checked.
Name | Description |
---|---|
Pattern | Specifies the regular expression pattern the field must match. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that all P_SERIAL entries in the PART table match the pattern for product serial numbers: TPCH-XXXX-####
, where XXXX
are uppercase alphabetic characters and ####
are numbers.
Sample Data
P_PARTKEY | P_SERIAL |
---|---|
1 | TPCH-ABCD-1234 |
2 | TPCH-1234-ABCD |
3 | TPCH-WXYZ-9876 |
{
"description": "Ensure that all P_SERIAL entries in the PART table match the pattern for product serial numbers: `TPCH-XXXX-####`, where `XXXX` are uppercase alphabetic characters and `####` are numbers",
"coverage": 1,
"properties": {
"pattern":"^tpch-[a-z]{4}-[0-9]{4}$"
},
"tags": [],
"fields": ["P_SERIAL"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "matchesPattern",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with P_PARTKEY
2 does not satisfy the rule because its P_SERIAL
does not match the required pattern.
graph TD
A[Start] --> B[Retrieve P_SERIAL]
B --> C{Does P_SERIAL match TPCH-XXXX-#### format?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
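As a quick, illustrative approximation of this example (the exact regex dialect used by the platform is not detailed here), the sample serials can be tested against the pattern ^TPCH-[A-Z]{4}-[0-9]{4}$:

```python
# Illustrative regex check for the TPCH-XXXX-#### serial format from the example above.
import re

pattern = re.compile(r"^TPCH-[A-Z]{4}-[0-9]{4}$")

for serial in ["TPCH-ABCD-1234", "TPCH-1234-ABCD", "TPCH-WXYZ-9876"]:
    print(serial, bool(pattern.match(serial)))
# TPCH-ABCD-1234 True, TPCH-1234-ABCD False, TPCH-WXYZ-9876 True
```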
Potential Violation Messages
Record Anomaly
The P_SERIAL
value of TPCH-1234-ABCD
does not match the pattern TPCH-XXXX-####
.
Shape Anomaly
In P_SERIAL
, 33.333% of 3 filtered records (1) do not match the pattern TPCH-XXXX-####
.
Max Length
Definition
Asserts that a string has a maximum length.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the maximum acceptable length of the string.
Name | Description |
---|---|
Length | Specifies the maximum number of characters a string in the field should have. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that P_DESCRIPTION entries in the PART table do not exceed 50 characters in length.
Sample Data
P_PARTKEY | P_DESCRIPTION |
---|---|
1 | Standard industrial widget |
2 | A product description that clearly goes way beyond the specified fifty characters limit. |
3 | Basic office equipment |
{
"description": "Ensure that P_DESCRIPTION in the PART table do not exceed 50 characters in length",
"coverage": 1,
"properties": {
"value": 3
},
"tags": [],
"fields": ["C_BLOOD_GROUP"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxLength",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with P_PARTKEY
2 does not satisfy the rule because its P_DESCRIPTION
exceeds 50 characters in length.
graph TD
A[Start] --> B[Retrieve P_DESCRIPTION]
B --> C{Is P_DESCRIPTION length <= 50 characters?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The P_DESCRIPTION
length of A product description that clearly goes way beyond the specified fifty characters limit.
is greater than the max length of 50.
Shape Anomaly
In P_DESCRIPTION
, 33.333% of 3 filtered records (1) have a length greater than 50.
Max Partition Size
Definition
Asserts the maximum number of records that should be loaded from each file or table partition.
In-Depth Overview
Managing the volume of data in each partition is critical when dealing with partitioned datasets. This is especially pertinent when system limitations or data processing capabilities are considered, ensuring that no partition exceeds the system's ability to handle data efficiently.
The Max Partition Size rule is designed to set an upper limit on the number of records each partition can contain.
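The platform performs this per-partition accounting internally during loading; as a rough, hypothetical illustration of the kind of count the rule asserts, Spark's partition IDs can be used to tally records per partition:

```python
# Hypothetical sketch: count records per Spark partition and flag any partition
# that exceeds a maximum size. The data and threshold are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.appName("max-partition-size-sketch").getOrCreate()
df = spark.range(0, 25_000).repartition(2)   # stand-in for a partitioned LINEITEM load

max_partition_size = 10_000
partition_counts = (
    df.withColumn("partition", spark_partition_id())
      .groupBy("partition")
      .agg(count("*").alias("records"))
)
partition_counts.where(partition_counts.records > max_partition_size).show()
```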
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
Specific Properties
Specifies the maximum allowable record count for each data partition
Name | Description |
---|---|
Maximum partition size | The maximum number of records that can be loaded from each partition. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✗ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that no partition of the LINEITEM table contains more than 10,000 records to prevent data processing bottlenecks.
Sample Data for Partition P3
Row Number | L_ITEM |
---|---|
1 | Data |
2 | Data |
... | ... |
10,050 | Data |
{
"description": "Ensure that no partition of the LINEITEM table contains more than 10,000 records to prevent data processing bottlenecks",
"coverage": 1,
"properties": {
"value":10000
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxPartitionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
In the sample data above, the rule is violated because partition P3 contains 10,050 records, which exceeds the set maximum of 10,000 records.
graph TD
A[Start] --> B[Retrieve Number of Records for Each Partition]
B --> C{Does Partition have <= 10,000 records?}
C -->|Yes| D[Move to Next Partition/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Shape Anomaly
In LINEITEM
, more than 10,000 records were loaded.
Max Value
Definition
Asserts that a field has a maximum value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the maximum allowable value for the field.
Name | Description |
---|---|
Value | Specifies the maximum value a field should have. |
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✓ |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
1 | 2 | 55 |
2 | 1 | 20 |
3 | 1 | 60 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50",
"coverage": 1,
"properties": {
"value": 50
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "maxValue",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values exceed the specified maximum value of 50.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY <= 50?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 55
is greater than the max value of 50
.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are greater than the max value of 50
.
Metric
Definition
Records the value of the selected field during each scan operation and asserts limits based upon an expected change or absolute range (inclusive).
In-Depth Overview
The Metric
rule is designed to monitor the values of a selected field over time. It is particularly useful in a time-series context where values are expected to evolve within certain bounds or limits. This rule allows for tracking absolute values or changes, ensuring they remain within predefined thresholds.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral | ✓ |
Fractional | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the evaluation method and allowable limits for field value comparisons over time.
Name | Description |
---|---|
Comparison | Specifies the type of comparison: Absolute Change, Absolute Value, or Percentage Change. |
Min Value | Indicates the minimum allowable increase in value. Use a negative value to represent an allowable decrease. |
Max Value | Indicates the maximum allowable increase in value. |
Details
Comparison Options
Absolute Change
The Absolute Change
comparison works by comparing the change in a numeric field's value to a pre-set limit (Min / Max). If the field's value changes by more than this specified limit since the last relevant scan, an anomaly is identified.
Illustration
Any record with a value change smaller than 30 or greater than 70 compared to the last scan should be flagged as anomalous
Thresholds: Min Change = 30, Max Change = 70
Scan | Previous Value | Current Value | Absolute Change | Anomaly Detected |
---|---|---|---|---|
#1 | - | 100 | - | No |
#2 | 100 | 150 | 50 | No |
#3 | 150 | 220 | 70 | No |
#4 | 220 | 300 | 80 | Yes |
Absolute Value
The Absolute Value
comparison works by checking a numeric field's value itself against a pre-set range between the Min and Max values. If the field's value falls outside this specified range at the time of the scan, an anomaly is identified.
Illustration
The value of the record in each scan should be within 100 and 300 to be considered normal
Thresholds: Min Value = 100, Max Value = 300
Scan | Current Value | Anomaly Detected |
---|---|---|
#1 | 150 | No |
#2 | 90 | Yes |
#3 | 250 | No |
#4 | 310 | Yes |
Percentage Change
The Percentage Change
comparison operates by tracking changes in a numeric field's value relative to its previous value. If the change exceeds the predefined percentage (%) limit since the last relevant scan, an anomaly is generated.
Illustration
An anomaly is identified if the record's value decreases by more than 20% or increases by more than 50% compared to the last scan.
Thresholds: Min Percentage Change = -20%, Max Percentage Change = 50%
Percentage Change Formula: ( (current_value - previous_value) / previous_value ) * 100
Scan | Previous Value | Current Value | Percentage Change | Anomaly Detected |
---|---|---|---|---|
1 | - | 100 | - | No |
2 | 100 | 150 | 50% | No |
3 | 150 | 120 | -20% | No |
4 | 120 | 65 | -45.83% | Yes |
5 | 65 | 110 | 69.23% | Yes |
Thresholds
At least one of the Min or Max values must be specified; including both is optional. These values determine the acceptable range or limit of change in the field's value.
Min Value
- Represents the minimum allowable increase in the field's value.
- A negative Min Value signifies an allowable decrease, determining the minimum value the field can drop to be considered valid.
Max Value
- Indicates the maximum allowable increase in the field’s value, setting an upper limit for the value's acceptable growth or change.
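To make the three comparison modes concrete, the sketch below evaluates a newly scanned value against the previous scan's value using the Min/Max limits described above; it is an illustration of the semantics in the tables, not the platform's scan logic.

```python
# Illustrative evaluation of the Metric comparison modes; limits are treated as inclusive.
def metric_anomalous(previous, current, comparison, min_value=None, max_value=None):
    if comparison == "Absolute Value":
        observed = current
    elif comparison == "Absolute Change":
        if previous is None:
            return False                      # nothing to compare on the first scan
        observed = current - previous
    elif comparison == "Percentage Change":
        if previous is None:
            return False
        observed = (current - previous) / previous * 100
    else:
        raise ValueError(f"unknown comparison: {comparison}")

    too_low = min_value is not None and observed < min_value
    too_high = max_value is not None and observed > max_value
    return too_low or too_high

# Matches the Percentage Change illustration: 120 -> 65 is -45.83%, below the -20% limit.
print(metric_anomalous(120, 65, "Percentage Change", min_value=-20, max_value=50))  # True
```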
Anomaly Types
Type | Supported |
---|---|
Record: Flag inconsistencies at the row level | ✓ |
Shape: Flag inconsistencies in the overall patterns and distributions of a field | ✗ |
Example
Objective: Ensure that the total price in the ORDERS table does not fluctuate beyond a predefined percentage limit between scans.
Thresholds: Min Percentage Change = -30%, Max Percentage Change = 30%
Sample Scan History
Scan | O_ORDERKEY | Previous O_TOTALPRICE | Current O_TOTALPRICE | Percentage Change | Anomaly Detected |
---|---|---|---|---|---|
#1 | 1 | - | 100 | - | No |
#2 | 1 | 100 | 110 | 10% | No |
#3 | 1 | 110 | 200 | 81.8% | Yes |
#4 | 1 | 200 | 105 | -47.5% | Yes |
{
"description": "Ensure that the total price in the ORDERS table does not fluctuate beyond a predefined percentage limit between scans",
"coverage": 1,
"properties": {
"comparison":"Percentage Change",
"min":-0.3,
"max":0.3
},
"tags": [],
"fields": ["O_TOTALPRICE "],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "metric",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample scan history above, anomalies are identified in scans #3 and #4. The O_TOTALPRICE
values in these scans fall outside the declared percentage change limits of -30% and 30%, indicating that something unusual might be happening and further investigation is needed.
graph TD
A[Start] --> B[Retrieve O_TOTALPRICE]
B --> C{Is Percentage Change in O_TOTALPRICE within -30% and 30%?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select *
from (
    select
        o_orderkey,
        o_totalprice,
        lag(o_totalprice) over (order by o_orderkey) as previous_o_totalprice
    from orders
) scans
where
    ((o_totalprice - previous_o_totalprice) / previous_o_totalprice) * 100 > 30
    or ((o_totalprice - previous_o_totalprice) / previous_o_totalprice) * 100 < -30;
Potential Violation Messages
Record Anomaly (Percentage Change)
The percentage change of O_TOTALPRICE
from '110' to '200' falls outside the declared limits
Record Anomaly (Absolute Change)
using hypothetical numbers
The absolute change of O_TOTALPRICE
from '150' to '300' falls outside the declared limits
Record Anomaly (Absolute Value)
using hypothetical numbers
The value for O_TOTALPRICE
of '50' is not between the declared limits
Min Length
Definition
Asserts that a string has a minimum length.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String | ✓ |
General Properties
Name | Supported |
---|---|
Filter: Allows the targeting of specific data based on conditions | ✓ |
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions | ✓ |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the minimum allowable length for the field.
Name | Description |
---|---|
Value |
Specifies the minimum length that the string field should have. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that all C_COMMENT entries in the CUSTOMER table have a minimum length of 5 characters.
Sample Data
C_CUSTKEY | C_COMMENT |
---|---|
1 | Ok |
2 | Excellent customer service, very satisfied! |
3 | Nice staff |
{
"description": "Ensure that all C_COMMENT entries in the CUSTOMER table have a minimum length of 5 characters",
"coverage": 1,
"properties": {
"value": 5
},
"tags": [],
"fields": ["C_COMMENT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minLength",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with C_CUSTKEY
1 does not satisfy the rule because the length of its C_COMMENT
value is below the required minimum length of 5 characters.
graph TD
A[Start] --> B[Retrieve C_COMMENT]
B --> C{Is C_COMMENT length >= 5?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
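For reference, a query along these lines against the sample CUSTOMER data (table and column names are taken from the example above; this is not SQL generated by the platform) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_custkey,
    c_comment
from
    customer
where
    length(c_comment) < 5;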
Potential Violation Messages
Record Anomaly
The C_COMMENT
length of Ok
is less than the min length of 5.
Shape Anomaly
In C_COMMENT
, 33.333% of 3 filtered records (1) have a length less than 5.
Min Partition Size
Definition
Asserts the minimum number of records that should be loaded from each file or table partition.
In-Depth Overview
When working with large datasets that are often partitioned for better performance and scalability, ensuring a certain minimum number of records from each partition becomes crucial. This could be to ensure that each partition is well-represented in the analysis, to maintain data consistency or even to verify that data ingestion or migration processes are functioning properly.
The Min Partition Size rule allows users to set a threshold ensuring that each partition has loaded at least the specified minimum number of records.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
Specific Properties
Sets the required minimum record count for each data partition
Name | Description |
---|---|
Minimum partition size |
Specifies the minimum number of records that should be loaded from each partition |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that each partition of the LINEITEM table has at least 1000 records.
Sample Data for Partition P3
Row Number | L_ITEM |
---|---|
1 | Data |
2 | Data |
... | ... |
900 | Data |
{
"description": "Ensure that each partition of the LINEITEM table has at least 1000 records",
"coverage": 1,
"properties": {
"value": 1000
},
"tags": [],
"fields": null,
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minPartitionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
The sample data above does not satisfy the rule because partition P3 contains only 900 records, which is less than the required minimum of 1000 records.
graph TD
A[Start] --> B[Retrieve Number of Records for Each Partition]
B --> C{Does Partition have >= 1000 records?}
C -->|Yes| D[Move to Next Partition/End]
C -->|No| E[Mark as Anomalous]
E --> D
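For reference, a query in this spirit approximates the check in plain SQL; partition_key is a hypothetical column standing in for the physical partition identifier and is not part of the sample data above:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    partition_key,
    count(*) as record_count
from
    lineitem
group by
    partition_key
having
    count(*) < 1000;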
Potential Violation Messages
Shape Anomaly
In LINEITEM
, fewer than 1000 records were loaded.
Min Value
Definition
Asserts that a field has a minimum value.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines the minimum allowable value for the field.
Name | Description |
---|---|
Value |
Specifies the minimum value a field should have. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is not below a value of 10.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
1 | 2 | 5 |
2 | 1 | 20 |
3 | 1 | 8 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is not below a value of 10",
"coverage": 1,
"properties": {
"value": 10
},
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "minValue",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
1 and 3 do not satisfy the rule because their L_QUANTITY
values are below the specified minimum value of 10.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY >= 10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
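For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity < 10;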
Potential Violation Messages
Record Anomaly
The L_QUANTITY
value of 5
is less than the min value of 10
.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are less than the min value of 10
.
Not Exists In
Definition
Asserts that values assigned to this field do not exist as values in another field.
In-Depth Overview
The Not Exists In
rule allows you to ensure data exclusivity between different sources, whether object storage systems or databases.
While databases might use unique constraints to maintain data distinctiveness between related tables, the Not Exists In
rule extends this capability in two significant ways:
- Cross-System Exclusivity: it enables checks to ensure data does not overlap across different databases or even entirely separate systems. This can be essential in scenarios where data should be partitioned or isolated across platforms.
- Flexible Data Formats: Not just limited to databases, this rule can validate values against various data formats, such as ensuring values in a file do not coincide with those in a table.
These functionalities enable businesses to maintain data exclusivity even in intricate, multi-system settings.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
Specific Properties
Define the datastore, table/file, and field where the rule should look for non-matching values.
Name | Description |
---|---|
Datastore |
The source datastore where the profile of the reference field is located. |
Table/file |
The profile (e.g. table, view or file) containing the reference field. |
Field |
The field name whose values should not match those of the selected field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Scenario: A shipping company needs to ensure that all N_NATIONNAME entries in the NATION table aren't listed in an external unsupported regions file, which lists countries they don't ship to.
Sample Data
N_NATIONKEY | N_NATIONNAME |
---|---|
1 | Antarctica |
2 | Argentina |
3 | Atlantida |
Unsupported Regions File Sample
UNSUPPORTED_REGION |
---|
Antarctica |
Mars |
... |
{
"description": "A shipping company needs to ensure that all NATION_NAME entries in the NATION table aren't listed in an external unsupported regions file, which lists countries they don't ship to",
"coverage": 1,
"properties": {
"field_name":"UNSUPPORTED_REGION",
"ref_container_id": {ref_container_id},
"ref_datastore_id": {ref_datastore_id}
},
"tags": [],
"fields": ["NATION_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notExistsIn",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with N_NATIONKEY
1 does not satisfy the rule because the N_NATIONNAME
"Antarctica" is listed as an UNSUPPORTED_REGION
in the unsupported regions file, indicating the company doesn't ship there.
graph TD
A[Start] --> B[Retrieve UNSUPPORTED_REGION]
B --> C[Retrieve N_NATIONNAME]
C --> D{Is N_NATIONNAME listed in UNSUPPORTED_REGION?}
D -->|No| E[Move to Next Record/End]
D -->|Yes| F[Mark as Anomalous]
F --> E
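For reference, a query along these lines would surface the flagged records, assuming the unsupported regions file has been loaded as a table named unsupported_regions (a hypothetical name used only for illustration):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    n.n_nationkey,
    n.n_nationname
from
    nation n
where
    exists (
        select 1
        from unsupported_regions u
        where u.unsupported_region = n.n_nationname
    );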
Potential Violation Messages
Record Anomaly
The N_NATIONNAME
value of 'Antarctica
' is an UNSUPPORTED_REGION
.
Shape Anomaly
In N_NATIONNAME
, 33.333% of 3 filtered records (1) do exist in UNSUPPORTED_REGION
.
Not Future
Definition
Asserts that the field's value is not in the future.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Field | |
---|---|
Date |
|
Timestamp |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the delivery dates (O_DELIVERYDATE) in the ORDERS table are not set in the future.
Sample Data
O_ORDERKEY | O_DELIVERYDATE |
---|---|
1 | 2023-09-20 |
2 | 2023-10-25 (Future Date) |
3 | 2023-10-10 |
{
"description": "Ensure that the delivery dates (O_DELIVERYDATE) in the ORDERS table are not set in the future",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["O_DELIVERYDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notFuture",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with O_ORDERKEY
2 does not satisfy the rule because its O_DELIVERYDATE
is set in the future.
graph TD
A[Start] --> B[Retrieve O_DELIVERYDATE]
B --> C{Is O_DELIVERYDATE <= Current Date?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
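For reference, a query along these lines against the sample ORDERS data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    o_orderkey,
    o_deliverydate
from
    orders
where
    o_deliverydate > current_date();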
Potential Violation Messages
Record Anomaly
The value for O_DELIVERYDATE
of 2023-10-25
is in the future.
Shape Anomaly
In O_DELIVERYDATE
, 33.333% of 3 filtered records (1) are future times.
Not Negative
Definition
Asserts that this is a non-negative number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a non-negative number.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
2 | 2 | -5 |
3 | 1 | 20 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a non-negative number",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notNegative",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entry with L_ORDERKEY
2 does not satisfy the rule because its L_QUANTITY
value is a negative number.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY >= 0?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
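For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity < 0;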
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of -5
is a negative number.
Shape Anomaly
In L_QUANTITY
, 33.333% of 3 filtered records (1) are negative numbers.
Not Null
Definition
Asserts that none of the selected fields' values are explicitly set to nothing.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Fields
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that every record in the CUSTOMER table has an assigned value for the C_NAME and C_ADDRESS fields.
Sample Data
C_CUSTKEY | C_NAME | C_ADDRESS |
---|---|---|
1 | Alice | 123 Oak St |
2 | Bob | NULL |
3 | Charlie | 789 Maple Ave |
4 | NULL | 456 Pine Rd |
{
"description": "Ensure that every record in the CUSTOMER table has an assigned value for the C_NAME and C_ADDRESS fields",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["C_ADDRESS","C_NAME"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "notNull",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with C_CUSTKEY
2 and 4 do not satisfy the rule because they have NULL
values in the C_NAME
or C_ADDRESS
fields.
graph TD
A[Start] --> B[Retrieve C_NAME and C_ADDRESS]
B --> C{Are C_NAME and C_ADDRESS non-null?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
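For reference, a query along these lines against the sample CUSTOMER data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_custkey,
    c_name,
    c_address
from
    customer
where
    c_name is null
    or c_address is null;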
Potential Violation Messages
Record Anomaly
There is no assigned value for C_NAME
.
Shape Anomaly
In C_NAME
and C_ADDRESS
, 50.000% of 4 filtered records (2) are not assigned values.
Positive
Definition
Asserts that this is a positive number.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a positive number.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_QUANTITY |
---|---|---|
1 | 1 | 40 |
2 | 1 | 0 |
3 | 1 | -5 |
4 | 1 | 20 |
{
"description": "Ensure that the quantity of items (L_QUANTITY) in the LINEITEM table is a positive number",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["L_QUANTITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "positive",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 3 do not satisfy the rule because their L_QUANTITY
values are not positive numbers.
graph TD
A[Start] --> B[Retrieve L_QUANTITY]
B --> C{Is L_QUANTITY Positive?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
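For reference, a query along these lines against the sample LINEITEM data (names taken from the example above; not platform-generated SQL) would surface the records this rule flags:
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_linenumber,
    l_quantity
from
    lineitem
where
    l_quantity <= 0;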
Potential Violation Messages
Record Anomaly
The value for L_QUANTITY
of -5
is not a positive number.
Shape Anomaly
In L_QUANTITY
, 50.000% of 4 filtered records (2) are not positive numbers.
Predicted By
Definition
Asserts that the actual value of a field falls within an expected predicted range.
In-Depth Overview
The Predicted By
rule is used to verify whether the actual values of a specific field align with a set of expected values that are derived from a prediction expression. This expression could be a mathematical formula, statistical calculation, or any other valid predictive logic.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Fields
Type | |
---|---|
Integral |
|
Fractional |
|
Date |
|
Timestamp |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Determines if the actual value of a field falls within an expected predicted range.
Name | Description |
---|---|
Expression |
The prediction expression or formula for the field. |
Tolerance |
The allowed deviation from the predicted value. |
Note
The tolerance level must be defined to allow a permissible range of deviation from the predicted values.
Here’s a simple breakdown:
- An expression predicts what the value of a field should be.
- A tolerance value specifies how much deviation from the predicted value is acceptable.
- The actual value is then compared against the range defined by the predicted value ± tolerance.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the discount (L_DISCOUNT) in the LINEITEM table is calculated correctly based on the actual price (L_EXTENDEDPRICE). A correct discount should be approximately 8% of the extended price, within a tolerance of ±2.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_DISCOUNT |
---|---|---|---|
1 | 1 | 100 | 8 |
2 | 1 | 100 | 12 |
3 | 1 | 100 | 9 |
Inputs
- Expression: L_EXTENDEDPRICE * 0.08
- Tolerance: 2
{
"description": "Ensure that the discount (L_DISCOUNT) in the LINEITEM table is calculated correctly based on the actual price (L_EXTENDEDPRICE). A correct discount should be approximately 8% less than the actual price, within a tolerance of ±2",
"coverage": 1,
"properties": {
"expression": "L_EXTENDEDPRICE × 0.08",
"tolerance": 2
},
"tags": [],
"fields": ["L_DISCOUNT"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "predictedBy",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
For the entry with L_ORDERKEY
2, the discount is 12, which is outside of the computed range. Based on an 8% expected discount with a tolerance of ±2, the discount should be between 6 and 10 (calculated from the actual price of 100). Therefore, this record is marked as anomalous.
graph TD
A[Start] --> B[Retrieve L_EXTENDEDPRICE and L_DISCOUNT]
B --> C{Is Discount within Predicted Range?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
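For reference, a query along these lines captures the predicted-range comparison for the sample data (an approximation of the rule using the expression and tolerance above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_extendedprice,
    l_discount,
    l_extendedprice * 0.08 as predicted_discount
from
    lineitem
where
    abs(l_discount - l_extendedprice * 0.08) > 2;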
Potential Violation Messages
Record Anomaly
The L_DISCOUNT
value of '12' is not within the predicted range defined by L_EXTENDEDPRICE * 0.08 +/- 2.0
Shape Anomaly
In L_DISCOUNT
, 33.333% of 3 filtered records (1) are not within the predicted range defined by L_EXTENDEDPRICE * 0.08 +/- 2.0
Required Values
Definition
Asserts that all of the defined values must be present at least once within a field.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Ensures that a specific set of values is present within a field.
Name | Description |
---|---|
Values |
Specifies the list of values that must exist in the field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that orders have priorities labeled as '1-URGENT', '2-HIGH', '3-MEDIUM', '4-LOW', and '5-NOT URGENT'.
Sample Data
O_ORDERKEY | O_ORDERPRIORITY |
---|---|
1 | 1-URGENT |
2 | 2-HIGH |
3 | 3-MEDIUM |
4 | 3-MEDIUM |
{
"description": "Ensure that orders have priorities labeled as '1-URGENT', '2-HIGH', '3-MEDIUM', '4-LOW', and '5-NOT URGENT'",
"coverage": 1,
"properties": {
"list":["1-URGENT","2-HIGH","3-MEDIUM","4-LOW","5-NOT URGENT"]
},
"tags": [],
"fields": ["O_ORDERPRIORITY"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "requiredValues",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the rule is violated because the values '4-LOW' and '5-NOT URGENT' are not present in the O_ORDERPRIORITY
field of the ORDERS table.
graph TD
A[Start] --> B{Check if all specified values exist in the field}
B -->|Yes| C[End: No Anomalies]
B -->|No| D[Mark as Anomalous: Missing Values]
D --> C
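For reference, a query along these lines reports which of the required values are absent from the sample ORDERS data (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    r.required_value
from
    values ('1-URGENT'), ('2-HIGH'), ('3-MEDIUM'), ('4-LOW'), ('5-NOT URGENT') as r(required_value)
where
    r.required_value not in (select distinct o_orderpriority from orders);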
Potential Violation Messages
Shape Anomaly
In O_ORDERPRIORITY
, required values are missing in 40.000% of filtered records.
Satisfies Expression
Definition
Evaluates the given expression (any valid Spark SQL) for each record.
In-Depth Overview
The Satisfies Expression
rule allows for a wide range of custom validations on the dataset. By defining a Spark SQL expression, you can create customized conditions that the data should meet.
This rule will evaluate an expression against each record, marking those that do not satisfy the condition as anomalies. It provides the flexibility to create complex validation logic without being restricted to predefined rule structures.
Field Scope
Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Evaluates each record against a specified Spark SQL expression to ensure it meets custom validation conditions.
Name | Description |
---|---|
Expression |
Defines the Spark SQL expression that each record should meet. |
Info
Refers to the Filter Guide in the General Properties topic for examples of valid Spark SQL expressions.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example 1: Satisfies Expression Using a CASE
Statement
Let's assume you want to ensure that for orders with a priority of '1-URGENT' or '2-HIGH', the orderstatus
must be 'O' (for open), and for orders with a priority of '3-MEDIUM', the orderstatus
must be either 'O' or 'P' (for pending).
CASE
WHEN o_orderpriority IN ('1-URGENT', '2-HIGH') AND o_orderstatus != 'O' THEN FALSE
WHEN o_orderpriority = '3-MEDIUM' AND o_orderstatus NOT IN ('O', 'P') THEN FALSE
ELSE TRUE
END
Example 2: Satisfies Expression Using Scalar Subqueries
Objective: To ensure that the overall effect of discounts on item prices remains within acceptable limits, we validate whether the average discounted price of all items is greater than the maximum discount applied to any single item.
Background:
In pricing analysis, it’s important to monitor how discounts affect the final prices of products. By comparing the average price after discounts with the maximum discount applied, we can assess whether the discounts are having an overly significant impact or if they are within a reasonable range.
CASE
WHEN (SELECT AVG(l_extendedprice * (1 - l_discount)) FROM lineitem) >
(SELECT MAX(l_discount) FROM {{ _qualytics_self }})
THEN TRUE
ELSE FALSE
END AS is_discount_within_limits
Use Case
Objective: Ensure that the total tax applied to each item in the LINEITEM table is not more than 10% of the extended price.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_TAX |
---|---|---|---|
1 | 1 | 10000 | 900 |
2 | 1 | 15000 | 2000 |
3 | 1 | 20000 | 1800 |
4 | 1 | 10000 | 1500 |
Inputs
- Expression: L_TAX <= L_EXTENDEDPRICE * 0.10
{
"description": "Ensure that the total tax applied to each item in the LINEITEM table is not more than 10% of the extended price",
"coverage": 1,
"properties": {
"expression":"L_TAX <= L_EXTENDEDPRICE * 0.10"
},
"tags": [],
"fields": ["L_TAX", "L_EXTENDEDPRICE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "satisfiesExpression",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2 and 4 do not satisfy the rule because the L_TAX
values are more than 10% of their respective L_EXTENDEDPRICE
values.
graph TD
A[Start] --> B[Retrieve L_EXTENDEDPRICE and L_TAX]
B --> C{Is L_TAX <= L_EXTENDEDPRICE * 0.10?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
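For reference, a query along these lines applies the same expression directly to the sample LINEITEM data (names taken from the example above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    l_orderkey,
    l_extendedprice,
    l_tax
from
    lineitem
where
    not (l_tax <= l_extendedprice * 0.10);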
Potential Violation Messages
Record Anomaly
The record does not satisfy the expression: L_TAX <= L_EXTENDEDPRICE * 0.10
Shape Anomaly
50.000% of 4 filtered records (2) do not satisfy the expression: L_TAX <= L_EXTENDEDPRICE * 0.10
Sum
Definition
Asserts that the sum of a field is a specific amount.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Integral |
|
Fractional |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Ensures that the total sum of values in a specified field matches a defined amount.
Name | Description |
---|---|
Sum |
Specifies the expected sum of the values in the field. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the total discount value in the LINEITEM table does not exceed $2000.
Sample Data
L_ORDERKEY | L_LINENUMBER | L_EXTENDEDPRICE | L_DISCOUNT | L_DISCOUNT_VALUE |
---|---|---|---|---|
1 | 1 | 10000 | 0.05 | 500 |
2 | 1 | 8000 | 0.10 | 800 |
3 | 1 | 7000 | 0.05 | 350 |
4 | 1 | 5000 | 0.10 | 500 |
{
"description": "Ensure that the total discount value in the LINEITEM table does not exceed $2000",
"coverage": 1,
"properties": {
"value": "2000"
},
"tags": [],
"fields": ["L_DISCOUNT_VALUE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "sum",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the total of the L_DISCOUNT_VALUE
column is (500 + 800 + 350 + 500 = 2150), which exceeds the specified maximum total discount value of $2000.
graph TD
A[Start] --> B[Retrieve L_DISCOUNT_VALUE]
B --> C{Sum of L_DISCOUNT_VALUE <= 2000?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
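For reference, a query along these lines checks the aggregate against the expected amount for the sample data (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    sum(l_discount_value) as total_discount_value
from
    lineitem
having
    sum(l_discount_value) <> 2000;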
Potential Violation Messages
Shape Anomaly
In L_DISCOUNT_VALUE
, the sum of the 4 records is not 2000.000
Time Distribution Size
Definition
Asserts that the count of records for each interval of a timestamp is between two numbers.
In-Depth Overview
The Time Distribution Size
rule helps in identifying irregularities in the distribution of records over time intervals such as hours, days, or months.
For instance, in a retail context, it could ensure that there’s a consistent number of orders each month to meet business targets. A sudden drop in orders might highlight operational issues or shifts in market demand that require immediate attention.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
Timestamp |
|
Date |
Specific Properties
Name | Description |
---|---|
Interval |
Defines the time interval for segmentation. |
Min Count |
Specifies the minimum count of records in each segment. |
Max Count |
Specifies the maximum count of records in each segment. |
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the number of orders for each month is consistently between 5 and 10.
Sample Data
O_ORDERKEY | O_ORDERDATE |
---|---|
1 | 2023-01-01 |
2 | 2023-01-15 |
3 | 2023-01-20 |
4 | 2023-01-25 |
5 | 2023-02-01 |
6 | 2023-02-05 |
7 | 2023-02-10 |
8 | 2023-02-15 |
9 | 2023-02-20 |
10 | 2023-02-25 |
11 | 2023-02-28 |
{
"description": "Ensure that the number of orders for each month is consistently between 5 and 10",
"coverage": 1,
"properties": {
"interval_name": "Monthly",
"min_size": 5,
"max_size": 10
},
"tags": [],
"fields": ["O_ORDERDATE"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "timeDistributionSize",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the January segment fails the rule because there are only 4 orders, which is below the specified minimum count of 5.
graph TD
A[Start] --> B[Retrieve O_ORDERDATE]
B --> C{Segment data by month}
C --> D{Is count of records in each segment between 5 and 10?}
D -->|Yes| E[End]
D -->|No| F[Mark as Anomalous]
F --> E
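For reference, a query along these lines segments the sample ORDERS data by month and flags out-of-range segments (an approximation of the rule; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    date_trunc('month', o_orderdate) as order_month,
    count(*) as order_count
from
    orders
group by
    date_trunc('month', o_orderdate)
having
    count(*) < 5 or count(*) > 10;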
Potential Violation Messages
Shape Anomaly
50.000% of the monthly segments of O_ORDERDATE
have record counts not between 5 and 10.
Unique
Definition
Asserts that every value held by a field appears only once. If multiple fields are specified, then every combination of values of the fields should appear only once.
Field Scope
Multi: The rule evaluates multiple specified fields.
Accepted Types
Type | |
---|---|
Date |
|
Timestamp |
|
Integral |
|
Fractional |
|
String |
|
Boolean |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that each combination of C_NAME and C_ADDRESS in the CUSTOMER table is unique.
Sample Data
C_CUSTKEY | C_NAME | C_ADDRESS |
---|---|---|
1 | Customer_A | 123 Main St |
2 | Customer_B | 456 Oak Ave |
3 | Customer_A | 123 Main St |
4 | Customer_C | 789 Elm St |
{
"description": "Ensure that each combination of C_NAME and C_ADDRESS in the CUSTOMER table is unique",
"coverage": 1,
"properties": null,
"tags": [],
"fields": ["C_NAME", "C_ADDRESS"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "unique",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with C_CUSTKEY
1 and 3 have the same C_NAME
and C_ADDRESS
, which violates the rule because this combination of values should be unique.
graph TD
A[Start] --> B[Retrieve C_NAME and C_ADDRESS]
B --> C{Is the combination unique?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
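For reference, a query along these lines against the sample CUSTOMER data surfaces duplicated combinations (names taken from the example above; not platform-generated SQL):
-- An illustrative SQL query demonstrating the rule applied to example dataset(s)
select
    c_name,
    c_address,
    count(*) as occurrences
from
    customer
group by
    c_name,
    c_address
having
    count(*) > 1;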
Potential Violation Messages
Shape Anomaly
In C_NAME
and C_ADDRESS
, 25.000% of 4 filtered records (1) are not unique.
User Defined Function
Definition
Asserts that the given user-defined function (provided as a Scala script) evaluates to true over the field's value.
In-Depth Overview
The User Defined Function
rule enables the application of a custom Scala function on a specified field, allowing for highly customizable and flexible validation based on user-defined logic.
Field Scope
Single: The rule evaluates a single specified field.
Accepted Types
Type | |
---|---|
String |
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved dynamic variable {{ _qualytics_self }}
.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on {{ _qualytics_self }}
The {{ _qualytics_self }}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using {{ _qualytics_self }}
may not encompass all entries from the target container.
Specific Properties
Implements a user-defined Scala script.
Name | Description |
---|---|
Scala Script |
The custom Scala script to evaluate each record. |
Note
The Scala script must define a function that takes the field's value as its sole parameter and returns a Boolean indicating whether the record is valid.
The example in the section below illustrates the expected shape of such a function.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Validate that each record in the LINEITEM table has a well-structured JSON in the L_ATTRIBUTES column by ensuring the presence of essential keys: "color", "weight", and "dimensions".
Sample Data
L_ORDERKEY | L_LINENUMBER | L_ATTRIBUTES |
---|---|---|
1 | 1 | {"color": "red", "weight": 15, "dimensions": "10x20x15"} |
2 | 2 | {"color": "blue", "weight": 20} |
3 | 1 | {"color": "green", "dimensions": "5x5x5"} |
4 | 3 | {"weight": 10, "dimensions": "20x20x20"} |
Inputs
Scala Script
(lAttributes: String) => {
import play.api.libs.json._
try {
val json = Json.parse(lAttributes)
// Define the keys we expect to find in the JSON
val expectedKeys = List("color", "weight", "dimensions")
// Check if the expected keys are present in the JSON
expectedKeys.forall(key => (json \ key).toOption.isDefined)
} catch {
case e: Exception => false // Return false if parsing fails
}
}
{
"description": "Validate that each record in the LINEITEM table has a well-structured JSON in the L_ATTRIBUTES column by ensuring the presence of essential keys: "color", "weight", and "dimensions"",
"coverage": 1,
"properties": {"assertion":"(lAttributes: String) => {\n import play.api.libs.json._\n\n try {\n val json = Json.parse(lAttributes)\n \n // Define the keys we expect to find in the JSON\n val expectedKeys = List(\"color\", \"weight\", \"dimensions\")\n \n // Check if the expected keys are present in the JSON\n expectedKeys.forall(key => (json \\ key).toOption.isDefined)\n } catch {\n case e: Exception => false // Return false if parsing fails\n }\n }"},
"tags": [],
"fields": ["L_ATTRIBUTES"],
"additional_metadata": {"key 1": "value 1", "key 2": "value 2"},
"rule": "userDefinedFunction",
"container_id": {container_id},
"template_id": {template_id},
"filter": "1=1"
}
Anomaly Explanation
In the sample data above, the entries with L_ORDERKEY
2, 3, and 4 do not satisfy the rule because they lack at least one of the essential keys ("color", "weight", "dimensions") in the L_ATTRIBUTES column.
graph TD
A[Start] --> B[Retrieve L_ATTRIBUTES]
B --> C{Does L_ATTRIBUTES contain all essential keys?}
C -->|Yes| D[Move to Next Record/End]
C -->|No| E[Mark as Anomalous]
E --> D
Potential Violation Messages
Record Anomaly
The L_ATTRIBUTES
value of {"color": "blue", "weight": 20}
does not evaluate true as a parameter to the given UDF.
Shape Anomaly
In L_ATTRIBUTES
, 75.000% of 4 filtered records (3) do not evaluate true as a parameter to the given UDF.
Volumetrics Check
The Volumetric Check has been introduced to help users monitor and maintain the stability of data volumes across various datasets. This feature ensures that the size or volume of your data (either in terms of rows or bytes) remains within an acceptable range based on historical trends. It is designed to detect significant fluctuations by comparing the current data volume against a moving daily average.
How It Works
The system automatically infers and maintains volumetric checks based upon observed daily, weekly, and monthly averages. These checks enable proactive management of data volume trends, ensuring that any unexpected deviations are identified as anomalies for review.
Automating Adaptive Volumetric Checks
The following Volumetric Checks are automatically inferred for data assets with automated volume measurements enabled:
- Daily: the expected daily volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous 7-day moving average.
- Weekly: the expected weekly volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous four weeks' weekly volume moving average.
- Monthly: the expected 4-week volume expressed as an absolute minimum and maximum threshold. The thresholds are calculated as standard deviations from the previous sixteen weeks' 4-week volume moving average.
Scan Assertion and Anomaly Creation
Volumetric Checks are asserted during a Scan Operation just like all other check types and enrichment of volumetric check anomalies is fully supported. This enables full support for custom scheduling of volumetric checks and remediation workflows of volumetric anomalies.
Adaptive Thresholds and Human Adjustments
Each time a volume measurement is recorded for a data asset, the system will automatically infer and update any inferred Volumetric Checks for that asset.
By default, thresholds are set to 2 standard deviations from the moving average, but the system will adapt over time using inference weights to fine-tune these thresholds based on historical trends.
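As a rough illustration of the threshold arithmetic described above (the table name daily_volumes and its columns measured_at and row_count are hypothetical and do not reflect the platform's internals), a 7-day moving average with 2-standard-deviation bounds could be sketched in SQL as:
-- An illustrative SQL sketch of daily thresholds derived from a 7-day moving average
select
    measured_at,
    row_count,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as moving_avg,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding)
        - 2 * stddev(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as min_threshold,
    avg(row_count) over (order by measured_at rows between 7 preceding and 1 preceding)
        + 2 * stddev(row_count) over (order by measured_at rows between 7 preceding and 1 preceding) as max_threshold
from
    daily_volumes;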
This feature is essential for maintaining data integrity and ensuring that any deviations from expected data volumes are quickly identified and addressed.
Ended: Rule Types
Check Templates
Check Templates empower users to efficiently create, manage, and apply standardized checks across various datastores, acting as blueprints that ensure consistency and data integrity across different datasets and processes.
Check templates streamline the validation process by enabling check management independently of specific data assets such as datastores, containers, or fields. These templates reduce manual intervention, minimize errors, and provide a reusable framework that can be applied across multiple datasets, ensuring all relevant data adheres to defined criteria. This not only saves time but also enhances the reliability of data quality checks within an organization.
Let's get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Step 2: Click on the “Add Check Template” button located in the top right corner.
A modal window titled “Check Template Details” will appear, providing you the options to add the check template details.
Step 3: Enter the following details to add the check template:
- Rule Type (Required)
- Filter Clause
- Description (Required)
- Tag
- Additional Metadata
- Template Locked
1. Rule Type (Required): Select a Rule type from the dropdown menu for data validation, such as checking for non-null values, matching patterns, comparing numerical values, or verifying date-time constraints. Each rule type defines the specific validation logic to be applied.
For more details about the available rule types, refer to the "Check Rule Types" section.
Note
Different rule types have different sets of fields and options appearing when selected.
2. Filter Clause: Specify a valid Spark SQL WHERE expression to filter the data on which the check will be applied.
The filter clause defines the conditions under which the check will be applied. It typically includes a WHERE statement that specifies which rows or data points should be included in the check.
Example: A filter clause might be used to apply the check only to rows where a certain column meets a specific condition, such as WHERE status = 'active'.
Adjust the Coverage setting to specify the percentage of records that must comply with the check.
Note
The Coverage setting applies to most rule types and allows you to specify the percentage of records that must meet the validation criteria.
3. Description (Required): Enter a detailed description of the check template, including its purpose, applicable data, and relevant information to ensure clarity for users. If you're unsure of what to include, click on the "💡" lightbulb icon to apply a suggested description based on the rule type.
Example: "The < field > must exist in bank_transactions_*.csv.Total_Transaction_Amount
(Bank Dataset - Staging)".
This description clarifies that the specified field must be present in a particular file (bank_transactions_*.csv
) and column (Total_Transaction_Amount
) within the Bank Dataset.
4. Tag: Assign relevant tags to your check template to facilitate easier searching and filtering based on categories like "data quality," "financial reports," or "critical checks."
5. Additional Metadata: Add key-value pairs as additional metadata to enrich your check. Click the plus icon (+) next to this section to open the metadata input form, where you can add key-value pairs.
Enter the desired key-value pairs (e.g., DataSourceType: SQL Database and PriorityLevel: High). After entering the necessary metadata, click "Confirm" to save the custom metadata.
6. Template Locked: Check or uncheck the "Template Locked" option to determine whether all checks created from this template will have their properties automatically synced to any changes made to the template.
For more information about the template state, jump to the "Template State" section below.
Step 4: Once you have entered all the required fields, click the “Save” button to finalize the template.
Warning
Once a template is saved, the selected rule type becomes locked and cannot be changed.
After clicking the "Save" button, your check template is created, and a success flash message will appear stating, "Check Template successfully created."
After saving the check template, you can now Apply a Check Template to create Quality Checks, which will enforce the validation rules defined in the template across your datastores. This ensures consistent data quality and compliance with the criteria you’ve established.
Template State
Any changes to a template may or may not impact its related checks, depending on whether the template state is locked or unlocked. Managing the template state allows you to control if updates automatically apply to all related checks or let them function independently.
Unlocked
- Quality Checks can evolve independently of the template. Subsequent updates to an unlocked Check Template do not affect its related quality checks
Locked
- Quality Checks from a locked Check Template will inherit changes made to the template. Subsequent updates to a locked Check Template do affect its related quality checks
Info
Tags are synced regardless of whether a Check Template is locked or unlocked, while Description and Additional Metadata are not synced. This behavior applies to all Check Templates.
graph TD
A[Start] -->|Is `Template Locked` enabled?| B{Yes/No}
B -->|No| E[The quality check can evolve independently]
B -->|Yes| C[They remain synchronized with the template]
C --> D[End]
E --> D[End]
Apply Check Template for Quality Checks
You can apply check templates to make quality checks easier and more consistent. Using a predefined template lets you quickly verify that your data meets specific standards, reducing mistakes and improving data quality. Applying these templates simplifies the process, making finding and fixing errors more efficient, and ensures your quality checks are applied across different projects or systems without starting from scratch.
Let’s get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Here you can view the list of all the customer data validation templates.
Step 2: Locate the template, click on the vertical ellipsis (three dots) next to it, and select “Add Check” from the dropdown menu to create a Quality Check based on this template
For demonstration purposes, we have selected the “After Date Time” template.
A modal window titled “Authored Check Template” will appear, displaying all the details of the Quality Check Template.
Step 3: Enter the following details:
1. Associate with a Check Template:
-
If you toggle ON the "Associate with a Check Template" option, the check will be linked to a specific template.
-
If you toggle OFF the "Associate with a Check Template" option, the check will not be linked to any template, which allows you full control to modify the properties independently.
Since we are applying a check template to create quality checks, it's important to keep the toggle on to ensure the template is applied as a quality check.
2. Template: Choose a Template from the dropdown menu that you want to associate with the quality check. The check will inherit properties from the selected template.
-
Locked: If the template is locked, it will automatically sync with any future updates made to the template. However, you won't be able to modify the check's properties directly, except for specific fields like Datastore, Table, and Fields, which can still be updated while maintaining synchronization with the template.
-
Unlocked: If the template is unlocked, you are free to modify the check's properties as needed. However, any future updates to the template will no longer affect this check, as it will no longer be synced with the template.
3. Datastore: Select the Datastore, Table and Field where you want to apply the check template. This ensures that the template is linked to the correct data source, allowing the quality checks to be performed on the specified datastore.
For demonstration purposes, we have selected the “MIMIC II” datastore, with the “ADMISSIONS” table and the “ADMITTIME” field.
Step 4: After completing all the check details, click on the "Validate" button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct. It ensures that the check will work as expected by running it against the data without committing any changes.
If the validation is successful, a green message will appear saying "Validation Successful".
If the validation fails, a red message will appear saying "Failed Validation". This typically occurs when the check logic or parameters do not match the data properly.
Step 5: Once you have a successful validation, click the "Save" button.
Info
You can create as many Quality checks as you want for a specific template.
After clicking on the “Save” button your check is successfully created and a success flash message will appear saying “Check successfully created”.
Export Check Templates
You can export check templates to easily share or reuse your quality check settings across different systems or projects. This saves time by eliminating the need to recreate the same checks repeatedly and ensures that your quality standards are consistently applied. Exporting templates helps maintain accuracy and efficiency in managing data quality across various environments.
Let’s get started 🚀
Step 1: Log in to your Qualytics account and click the “Library” button on the left side panel of the interface.
Step 2: Click on the “Export Check Template” button located in the top right corner.
Step 3: A modal window titled “Export Check Templates” will appear, where you have to select the enrichment store to which the check templates will be exported.
Step 4: Once you have selected the enrichment store, click on the “Export” button
After clicking “Export,” the process starts, and a message will confirm that the metadata will be available in your Enrichment Datastore shortly.
Review Exported Check Templates
Step 1: Once the checks have been exported, navigate to the “Enrichment Datastores” section located on the left menu.
Step 2: In the “Enrichment Datastores” section, select the datastore where you exported the check templates. The exported check templates will now be visible in the selected datastore.
When you export check templates, you can reuse them for other datastores, share them with teams, or save them as a backup. Once exported, the templates can be imported and customized to fit different datasets, making them versatile and easy to adapt.
You also have the option to download them as a CSV file, allowing you to share or store them for future use.
Ended: Data Quality Checks
Observability ↵
Observability
Observability gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric. The Volumetric check automatically monitors the number of rows in a table and flags unusual changes, while the Metric check focuses on specific fields, providing more detailed insights from scan operations. Together, these tools help users spot data anomalies quickly and keep their data accurate.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu that you want to monitor
Step 2: Click on the “Observability” from the Navigation tab.
Observability Categories
In Observability, data checks are divided into two categories: Volumetric and Metric. Volumetric checks track overall data volume, while Metric checks focus on specific data attributes. These two categories work together to offer comprehensive insights into data trends and anomalies.
1. Volumetric: Volumetric is a tool that automatically tracks changes in the amount of data within a table over time. It monitors row counts and compares them to expected ranges based on historical data. If the data volume increases or decreases unexpectedly, the check flags it for further review. This feature helps users easily identify unusual data patterns without manual monitoring.
2. Metric: Metric measures data based on predefined fields and thresholds, tracking changes in data values over time. It detects if the value of a specific field, like the average or absolute value, goes beyond the expected range. Using scheduled scans, it automatically records and analyzes these values, helping users quickly spot any anomalies. This check gives deeper insights into data behavior, ensuring data integrity and identifying irregular patterns easily.
Volumetric
Volumetric checks help monitor data volumes over time to keep data accurate and reliable. They automatically count rows in a table and spot any unusual changes, like problems with data loading. This makes it easier to catch issues early and keep everything running smoothly. Volumetric checks also let you track data over different time periods, like daily or weekly. The system sets limits based on past data, and if the row count goes above or below those limits, an anomaly alert is triggered.
No | Field | Description |
---|---|---|
1 | Search | This feature helps users quickly find specific identifiers or names in the data. |
2 | Report Date | Report Date lets users pick a specific date to view data trends for that day. |
3 | Time Frame | The time frame option lets users choose a period (week, month, quarter, or year) to view data trends. |
4 | Sort By | Sort By option helps users organize data by criteria like Volumetrics Count, Name, or Last Scanned for quick access. |
5 | Filter | The filter lets users easily refine results by choosing specific tags or tables to view. |
6 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
7 | Table | Displays the table for which the volumetric check is being performed (e.g., customer_view, nation). Each table has its own Volumetric Check. |
8 | Check (# ID) | Each check is assigned a unique identifier, followed by the time period it applies to (e.g., 1 Day for the customer table). This ID helps in tracking the specific check in the system. |
9 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
10 | Anomaly Detection | The Volumetric Check detects anomalies when row counts exceed set min or max thresholds, triggering an alert for sudden changes. |
11 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
12 | Group By | Users can also Group By specific intervals, such as day, week, or month, to observe trends over different periods. |
13 | Measurement Period | Defines the time period over which the volumetric check is evaluated. It can be customized to 1 day, week, or other timeframes. |
14 | Min Values | These indicate the minimum thresholds for the row count of the table being checked (e.g., 150,139 Rows) |
15 | Max Values | These indicate the maximum thresholds for the row count of the table being checked. |
16 | Last Asserted | This shows the date the last check was asserted, which is the last time the system evaluated the Volumetric Check (e.g., Oct 02, 2024). |
17 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
18 | Graph Visualization | The graph provides a visual representation of the row count trends. It shows fluctuations in data volume over the selected period. This visual allows users to quickly identify any irregularities or anomalies. |
Observability Heatmap
The heatmap provides a visual overview of data anomalies by day, using color codes for quick understanding:
- Blue square: Blue squares represent days with no anomalies, meaning data stayed within the expected range.
- Orange square: Orange squares indicate days where data exceeded the minimum or maximum threshold range but didn’t qualify as a critical anomaly.
- Red square: Red squares highlight days with anomalies, signaling significant deviations from expected values that need further investigation.
By hovering over each square, you can view additional details for that specific day, including the date, last row count, and anomaly count, allowing you to easily pinpoint and analyze data issues over time.
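For intuition only, the snippet below shows one way a day could be bucketed into the blue, orange, or red states described above, given that day's row count, the current threshold band, and whether an anomaly was recorded. The function and its inputs are assumptions for illustration, not the platform's implementation.

```python
def heatmap_color(row_count, min_rows, max_rows, anomaly_count=0):
    """Illustrative bucketing of a single day for the heatmap described above."""
    if anomaly_count > 0:
        return "red"     # anomalies recorded: significant deviation to investigate
    if row_count < min_rows or row_count > max_rows:
        return "orange"  # outside the threshold band, but not flagged as an anomaly
    return "blue"        # within the expected range

print(heatmap_color(150_000, 149_000, 152_000))                    # blue
print(heatmap_color(155_000, 149_000, 152_000))                    # orange
print(heatmap_color(120_000, 149_000, 152_000, anomaly_count=1))   # red
```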
Edit Check
Editing a Volumetric Check lets users customize settings like measurement period, row count limits, and description. This helps improve data monitoring and anomaly detection, ensuring the check fits specific needs. Users can also add tags for better organization.
Note
When editing checks, only the properties and metadata can be modified.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Fields | Description |
---|---|---|
1 | Measurement Period Days | Edit the Measurement Period Days to change how often the check runs (e.g., 1 day, 2 days, etc.). |
2 | Min Value and Max Value | Edit the Min Value and Max Value to set new row count limits. If the row count exceeds these limits, an alert will be triggered. |
3 | Description | Edit the Description to better explain what the check does. |
4 | Tags | Edit the Tags to organize and easily find the check later. |
5 | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit thresholds to set specific row count limits for your data checks. By defining minimum and maximum values, you ensure alerts are triggered when data goes beyond the expected range. This helps you monitor unusual changes in data volume. It gives you better control over tracking your data's behavior.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Mark Check as Favorite
Marking a Volumetric Check as a favorite allows you to easily access important checks quickly. This feature helps you prioritize and manage the checks you frequently use, making data monitoring more efficient.
Step 1: Click on the bookmark icon to mark the Volumetric Check as a favorite.
After Clicking on the bookmark icon your check is successfully marked as a favorite and a success flash message will appear stating “Check has been favorited”.
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites.
Metric
Metric checks track changes in data over time to ensure accuracy and reliability. They check specific fields against set limits to identify when values, like averages, go beyond expected ranges. With scheduled scans, Metrics automatically log and analyze these data points, making it easy for users to spot any issues. This functionality enhances users' understanding of data patterns, ensuring high quality and dependability. With Metrics, managing and monitoring data becomes straightforward and efficient.
No | Field | Description |
---|---|---|
1 | Search | The search bar helps users find specific metrics or data by entering an identifier or description. |
2 | Sort By | Sort By allows users to organize data by Weight, Anomalies, or Created Date for easier analysis and prioritization. |
3 | Filter | Filter lets users refine data by Tags or Tables. Use Apply to filter or Clear to reset. |
4 | Metric(ID) | Represents the tracked data metric with a unique ID. |
5 | Description | A brief label or note about the metric; in this case, it is labeled as test. |
6 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
7 | Anomalies | Anomalies show unexpected changes or issues in the data that need attention. |
8 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
9 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
10 | Field | This refers to the specific field being measured, here the max_value, which tracks the highest value observed for the metric. |
11 | Min | This indicates the minimum value for the metric, which is set to 1. If not defined, no lower limit is applied. |
12 | Max | This field shows the maximum threshold for the metric, set at 8. Exceeding this may indicate an issue or anomaly. |
13 | Created Date | This field shows when the metric was first set up, in this case, June 18, 2024. |
14 | Last Asserted | Last Asserted field shows the last time the metric was checked, in this case July 25, 2024. |
15 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
16 | Group By | This option lets users group data by periods like Day, Week, or Month. In this example, it's set to Day. |
Comparisons
When you add a metric check, you can choose from three comparison options:
- Absolute Change
- Absolute Value
- Percentage Change.
These options help define how the system will evaluate your data during scan operations on the datastore.
Once a scan is run, the system analyzes the data based on the selected comparison type. For example, Absolute Change will look for significant differences between scans, Absolute Value checks if the data falls within a predefined range, and Percentage Change identifies shifts in data as a percentage.
Based on the chosen comparison type, the system flags any deviations from the defined thresholds. These deviations are then visually represented on a chart, displaying how the metric has fluctuated over time between scans. If the data crosses the upper or lower limits during any scan, the system will highlight this in the chart for further analysis.
1. Absolute Change: The Absolute Change comparison checks how much a numeric field's value has changed between scans. If the change exceeds a set limit (Min/Max), it flags this as an anomaly.
2. Absolute Value: The Absolute Value comparison checks whether a numeric field's value falls within a defined range (between Min and Max) during each scan. If the value goes beyond this range, it identifies it as an anomaly.
3. Percentage Change: The Percentage Change comparison monitors how much a numeric field's value has shifted in percentage terms. If the change surpasses the set percentage threshold between scans, it triggers an anomaly.
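The sketch below illustrates how the three comparison types could be evaluated for a numeric field between two consecutive scans. The function name, signature, and threshold semantics are assumptions for this example; the actual evaluation happens inside the scan operation.

```python
def is_metric_anomalous(comparison, current, previous=None, min_value=None, max_value=None):
    """Illustrative evaluation of the three comparison types described above.
    Returns True when the measurement should be flagged as anomalous."""
    if comparison == "absolute_value":
        # the measured value must stay within [min_value, max_value]
        return ((min_value is not None and current < min_value)
                or (max_value is not None and current > max_value))
    if comparison == "absolute_change":
        # the change between scans must stay within the allowed band
        delta = abs(current - previous)
        return ((min_value is not None and delta < min_value)
                or (max_value is not None and delta > max_value))
    if comparison == "percentage_change":
        # the percentage shift between scans must stay under the threshold
        pct_change = abs(current - previous) / abs(previous) * 100
        return max_value is not None and pct_change > max_value
    raise ValueError(f"unknown comparison type: {comparison}")

print(is_metric_anomalous("absolute_value", current=9, min_value=1, max_value=8))          # True
print(is_metric_anomalous("percentage_change", current=120, previous=100, max_value=15))   # True
```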
Edit Check
Edit Check allows users to modify the parameters of an existing metric check. It enables adjusting values, thresholds, or comparison logic to ensure that the metric accurately reflects current monitoring needs.
Note
When editing checks, only the properties and metadata can be modified.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Field | Description |
---|---|---|
1 | Min Value and Max Value | Edit the Min Value and Max Value to set new limits for the metric. If the measured value exceeds these limits, an alert will be triggered. |
2 | Description | Edit the Description to better explain what the check does. |
3 | Tags | Edit the Tags to organize and easily find the check later. |
4 | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
Step 3: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit Threshold allows you to change the upper and lower limits of a metric. This ensures the metric tracks data within the desired range and only triggers alerts when those limits are exceeded.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Mark Check as Favorite
Marking a Metric Check as a favorite allows you to easily access important checks quickly. This feature helps you prioritize and manage the checks you frequently use, making data monitoring more efficient.
Step 1: Click on the bookmark icon to mark the Metric Check as a favorite.
After Clicking on the bookmark icon your check is successfully marked as a favorite and a success flash message will appear stating “Check has been favorited”
To unmark a check, simply click on the bookmark icon of the marked check. This will remove it from your favorites
Ended: Observability
Anomalies ↵
Anomalies
An anomaly in Qualytics is a data set (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are detected when the data does not satisfy the applied validation criteria, which could include both inferred and authored checks.
Let’s get started 🚀
Anomaly Detection Process
The anomaly detection process in Qualytics ensures data quality and reliability through a series of systematic steps as discussed below.
1. Create a Datastore and Connection
By setting up a datastore and establishing a connection to your data source (database or file system), you create a robust foundation for effective data management and analysis in Qualytics. This setup enables you to access, manipulate, and utilize your data efficiently, paving the way for advanced data quality checks, profiling, scanning, anomaly surveillance, and other analytics tasks.
Note
For more information, please refer to the documentation on Configuring Datastores
2. Catalog Operation
The Catalog operation involves systematically collecting data structures along with their corresponding metadata. This process also includes a thorough analysis of the existing metadata within the datastore. This ensures a solid foundation for the subsequent Profile and Scan operations.
Note
For more information, please refer to the documentation Catalog Operation
3. Profile Operation
The Profile operation enables training of the collected data structures and their associated metadata values. This is crucial for gathering comprehensive aggregating statistics on the selected data, providing deeper insights, and preparing the data for quality assessment.
Note
For more information, please refer to the documentation Profile Operation
4. Create Authored Checks
Authored Checks are manually created data quality checks in Qualytics, defined by users either through the user interface (UI) or via API. Each check encapsulates a specific data quality rule, along with additional context such as associated notifications, tags, filters, and tolerances.
Authored checks can range from simple, template-based checks to more complex rules implemented through SQL or user-defined functions (UDFs) in Scala. By allowing users to define precise criteria for data quality, authored checks enable detailed monitoring and validation of data within the datastore, ensuring that it meets the specified standards and requirements.
Note
For more information, please refer to the documentation Checks
5. Scan Operation
The Scan operation asserts rigorous quality checks to identify any anomalies within the data. This step ensures data integrity and reliability by recording the analyzed data in your configured enrichment datastore, facilitating continuous data quality improvement.
Note
For more information, please refer to the documentation Scan Operation
6. Anomaly Analysis
An Anomaly is a data record or column that fails a data quality check during a Scan Operation. These anomalies are identified through both Inferred and Authored Checks and are grouped together to highlight data quality issues. This process ensures that any deviations from expected data quality standards are promptly identified and addressed.
Note
For more information, please refer to the documentation Anomalies.
Types of Anomalies
Anomaly detection is categorized into two main types: Record Anomalies and Shape Anomalies. Both types play a crucial role in maintaining data integrity by identifying deviations at different levels of the dataset.
Record Anomaly
A record anomaly identifies a single record (row) as anomalous and provides specific details regarding why it is considered anomalous. The simplest form of a record anomaly is a row that lacks an expected value for a field.
Example: Consider a data quality check that requires the salary field to be greater than 40,000. Based on this rule, any record that does not meet this condition will be identified and labeled as a record anomaly by Qualytics. In the sample table illustrated below, the row with id 6 will be flagged as a record anomaly because the salary of 30,000 is less than the required 40,000. This precise identification allows for targeted investigation and correction of specific data issues.
ID | Name | Age | Salary |
---|---|---|---|
1 | John Doe | 28 | 50000 |
2 | Jane Smith | 35 | 75000 |
3 | Bob Johnson | 22 | 45000 |
4 | Alice Brown | 30 | 60000 |
5 | John Wick | 29 | 70000 |
6 | David White | 45 | 30000 |
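Outside the platform, the same rule could be expressed as a simple filter over the example rows above. The snippet below is only an illustration of the logic, using the field names from the sample table.

```python
# Sample rows mirroring the example table above
records = [
    {"id": 1, "name": "John Doe",    "age": 28, "salary": 50_000},
    {"id": 2, "name": "Jane Smith",  "age": 35, "salary": 75_000},
    {"id": 3, "name": "Bob Johnson", "age": 22, "salary": 45_000},
    {"id": 4, "name": "Alice Brown", "age": 30, "salary": 60_000},
    {"id": 5, "name": "John Wick",   "age": 29, "salary": 70_000},
    {"id": 6, "name": "David White", "age": 45, "salary": 30_000},
]

# Flag every row that fails the "salary must be greater than 40,000" check
record_anomalies = [row for row in records if not row["salary"] > 40_000]
print([row["id"] for row in record_anomalies])   # [6]
```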
Shape Anomaly
A shape anomaly identifies anomalous structure within the analyzed data. The simplest shape anomaly is a dataset that doesn't match the expected schema because it lacks one or more fields. Some shape anomalies only apply to a subset of the data being analyzed and can therefore produce a count of the number of rows that reflect the anomalous concern. Where that is possible, the shape anomaly's anomalous_record_count is populated.
Note
Sometimes, shape anomalies only affect a subset of the dataset. This means that only certain rows exhibit the structural issue, rather than the entire dataset.
Note
When a shape anomaly affects only a portion of the dataset, Qualytics can count the number of rows that have the structural problem. This count is stored in the anomalous_record_count field, providing a clear measure of how widespread the issue is within the dataset.
Example: Imagine a dataset that is supposed to have columns for id, name, age, and salary. If some rows are missing the salary column, this would be flagged as a shape anomaly. If this issue only affects 50 out of 1,000 rows, the anomalous_record_count would be 50, indicating that 50 rows have a structural issue.
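As a rough illustration of that count, the helper below checks semi-structured records against an expected set of fields and counts the rows that do not conform. The function is an assumption for illustration only.

```python
def count_anomalous_records(records, expected_fields):
    """Illustrative: count rows that lack one or more expected fields."""
    return sum(1 for row in records if not expected_fields.issubset(row.keys()))

rows = [
    {"id": 1, "name": "A", "age": 30, "salary": 50_000},
    {"id": 2, "name": "B", "age": 41},                    # missing `salary`
    {"id": 3, "name": "C", "age": 22, "salary": 45_000},
]
print(count_anomalous_records(rows, {"id", "name", "age", "salary"}))   # 1
```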
Anomaly Status
Anomaly status is a crucial feature for managing and addressing data quality issues. It provides a structured way to track and resolve anomalies detected in your data, ensuring that data integrity is maintained. Here's a breakdown of the different anomaly statuses:
Active: This status indicates that the data anomaly has been detected and is currently unresolved, requiring attention to address the underlying issue.
Acknowledged: This status indicates that the anomaly has been verified as a legitimate data quality concern but has not yet been resolved.
Resolved: This status indicates that the anomaly was a legitimate data quality concern that has been addressed and fixed.
Invalid: This status indicates that the anomaly is not a legitimate data quality concern, possibly due to it being a false positive or an error in the detection process.
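Informally, these statuses can be pictured as a small set of allowed transitions. The mapping below is a sketch drawn from this guide (archived anomalies can be restored to Acknowledged, and an acknowledged anomaly never reverts to Active); it is not an API or a complete model, and later sections also describe a Duplicate archive reason.

```python
# Informal sketch of the anomaly status lifecycle described above (assumption,
# not a platform API): which target statuses are reachable from each status.
ALLOWED_TRANSITIONS = {
    "Active":       {"Acknowledged", "Resolved", "Invalid"},
    "Acknowledged": {"Resolved", "Invalid"},
    "Resolved":     {"Acknowledged"},   # via Restore
    "Invalid":      {"Acknowledged"},   # via Restore
}

def can_transition(current_status, target_status):
    return target_status in ALLOWED_TRANSITIONS.get(current_status, set())

print(can_transition("Active", "Acknowledged"))   # True
print(can_transition("Acknowledged", "Active"))   # False: never reverts to Active
```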
Anomaly Details
The anomaly identified during the scan operation illustrates the following details:
1. Anomaly ID: A numerical identifier (e.g. #75566) used for quick search and easy identification of anomalies within the Qualytics interface.
2. Status: Reflects the status of the Anomaly. It can be active, acknowledged, resolved, or invalid.
3. Share: Copy a shareable link to a specific anomaly. This can be shared with other users or stakeholders for collaboration.
4. Anomaly UID: A longer, standardized, and globally unique identifier, displayed as a string of hexadecimal characters. This can be copied to the clipboard.
5. Type: This reflects the type to which the anomaly belongs (e.g. Record or Shape).
6. Weight: A metric that indicates the severity or importance of the anomaly. Higher weights indicate more critical issues.
7. Detected: Reflects the timestamp when the anomaly was first detected.
8. Scan: Click on this to view the outcome of a data quality scan. It includes the scan status, the time taken, the user who triggered it, the schedule status, and a detailed list of anomalies detected across various tables.
In addition to the above details, the users can also explore the following additional details of the Anomaly:
1. Name: This indicates the name of the source datastore where the anomaly was detected.
Tip
Clicking on the expand icon opens a detailed view and navigates to the dataset’s page, providing more information about the source datastore where the anomaly was found.
2. Table Name: This specifies the particular table within the dataset that contains the anomaly. It helps in pinpointing the exact location of the data quality issue.
Tip
Clicking on the expand icon navigates to the table’s page, providing more in-depth information about the table structure and contents.
3. Location: Full path or location within the data hierarchy where the table resides. It gives a complete reference to the exact position of the data in the database or data warehouse.
Source Records
The Source Records section displays all the data and fields related to the detected anomaly from the dataset. An Enrichment Datastore is used to store the analyzed results, including any anomalies and additional metadata, in files, so it is recommended to add or link an enrichment datastore to your connected source datastore.
If the Anomaly Type is Shape, you will find the highlighted column(s) having anomalies in the source record.
If the Anomaly Type is Record, you will find the highlighted row(s) in the source record indicating failed checks. In the snippet below, it can be observed that 7 checks have failed for the row.
Note
In anomaly detection, source records are displayed as part of the Anomaly Details. For a Record anomaly, the specific record is highlighted. For a Shape anomaly, 10 samples from the underlying anomalous records are highlighted.
Comparison Source Records
Anomalies identified by the Is Replica Of data quality rule type, configured with Row Identifiers, are displayed with a detailed source record comparison. This visualization highlights differences between rows, making it easier to identify specific discrepancies.
Structure of the Comparison Source Records:
1. Left and Right Fields:
Each field in the table is split into two columns: the left column represents the target table/file, while the right column represents the reference table/file.
If the value in the right column (reference) differs from the value in the left column (target), the cell is highlighted to indicate an anomalous value.
2. _qualytics_diff Column: This column provides the status of each row, which can be one of the following:
Added: The row is missing on the left side (target) but found on the right side (reference).
Removed: The row is present on the left side (target) but missing on the right side (reference).
Changed: The row is present on both sides, but there is at least one field value that differs between the target and reference.
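To make the classification concrete, here is an illustrative comparison of target and reference rows keyed by row identifiers. The helper and its names are assumptions for this sketch, not the Is Replica Of implementation.

```python
def comparison_diff(target_rows, reference_rows, id_fields):
    """Illustrative Added / Removed / Changed classification keyed by row identifiers."""
    def key(row):
        return tuple(row[field] for field in id_fields)

    target = {key(r): r for r in target_rows}
    reference = {key(r): r for r in reference_rows}

    diff = {}
    for k in reference.keys() - target.keys():
        diff[k] = "Added"        # missing on the left (target), found on the right (reference)
    for k in target.keys() - reference.keys():
        diff[k] = "Removed"      # present on the left (target), missing on the right (reference)
    for k in target.keys() & reference.keys():
        if target[k] != reference[k]:
            diff[k] = "Changed"  # present on both sides with at least one differing value
    return diff

target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
reference = [{"id": 1, "amount": 100}, {"id": 2, "amount": 300}, {"id": 3, "amount": 75}]
print(comparison_diff(target, reference, id_fields=["id"]))
# {(3,): 'Added', (2,): 'Changed'}
```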
Suggested Remediation
Suggested Remediation is a feature that offers users recommended values to correct data anomalies identified during quality checks. In the snippet below, the FIRST_CAREUNIT field has failed the check, and Qualytics suggests CSRU as the remedial value.
Failed Check Description
This allows users to view detailed explanations of each failed check by hovering over the information icon, helping users better understand the nature of the violation.
Download Source Records
Download and export all source records (up to 250MB in a compressed .csv) for further analysis or external use.
Assign Tags
Assigning tags to an anomaly serves the purpose of labeling and grouping anomalies and driving downstream workflows.
Step 1: Click on the Assign tags to this Anomaly or + button.
Step 2: A dropdown menu will appear with existing tags. Scroll through the list and click on the tag you wish to assign.
Anomaly Actions
You can update the status of anomalies by acknowledging them to confirm they are real issues, or by moving them to archive if they are false positives. This helps refine the system’s detection process and keeps your worklist focused on relevant issues.
Acknowledge Anomalies
You can acknowledge anomalies to confirm that they represent real data issues that need attention. By acknowledging an anomaly, you provide feedback to the system, validating the detection and helping it improve future checks.
For more details on how to acknowledge anomalies, please refer to the documentation on Acknowledge Anomalies.
Archive Anomalies
You can archive the anomalies when they are determined to be false positives or not significant. Archiving helps the system adjust its checks to prevent similar issues from being flagged in the future, ensuring more accurate anomaly detection and keeping your worklist streamlined.
For instance, if an anomaly is marked as invalid, the tolerances of the checks that identified the anomaly will be updated to prevent similar false positives in the future. This continuous feedback loop enhances the accuracy and relevance of data quality checks over time.
For more details on how to archive anomalies, please refer to the documentation on Archive Anomalies
API Payload Examples
Retrieving Anomaly by UUID or Id
Endpoint (Get)
/api/anomalies/{id} (get)
Example Result Response
{
"id": 0,
"created": "2024-06-10T21:29:42.695Z",
"uuid": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"type": "shape",
"is_new": true,
"archived": true,
"status": "Active",
"source_enriched": true,
"datastore": {
"id": 0,
"name": "string",
"store_type": "jdbc",
"type": "athena",
"enrich_only": true,
"enrich_container_prefix": "string",
"favorite": true
},
"container": {
"id": 0,
"name": "string",
"container_type": "table",
"table_type": "table"
},
"partition": {
"name": "string",
"location": "string"
},
"weight": 0,
"global_tags": [
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.696Z",
"status": "syncing"
}
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.696Z",
"status": "syncing"
}
}
],
"anomalous_records_count": 0,
"comments": [
{
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"message": "string",
"user": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"user_id": "string",
"email": "string",
"name": "string",
"picture": "string",
"role": "Member",
"deleted_at": "2024-06-10T21:29:42.696Z",
"teams": [
{
"id": 0,
"name": "string",
"permission": "Read"
}
]
}
}
],
"failed_checks": [
{
"quality_check": {
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"fields": [
{
"id": 0,
"created": "2024-06-10T21:29:42.696Z",
"name": "string",
"type": "Unknown",
"completeness": 0,
"weight": 0,
"global_tags": [
{
"type": "global",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.697Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.697Z",
"status": "syncing"
}
}
],
"latest_profile_id": 0,
"quality_score": {
"total": 0,
"completeness": 0,
"coverage": 0,
"conformity": 0,
"consistency": 0,
"precision": 0,
"timeliness": 0,
"volumetrics": 0,
"accuracy": 0
}
}
],
"description": "string",
"rule_type": "afterDateTime",
"coverage": 0,
"inferred": true,
"template_locked": true,
"is_new": true,
"num_container_scans": 0,
"filter": "string",
"properties": {
"allow_other_fields": true,
"assertion": "string",
"comparison": "string",
"datetime": "2024-06-10T21:29:42.697Z",
"expression": "string",
"field_name": "string",
"field_type": "Unknown",
"id_field_names": [
"string"
],
"inclusive": true,
"inclusive_max": true,
"inclusive_min": true,
"interval_name": "Yearly",
"last_value": 0,
"list": [
"string"
],
"max": 0,
"max_size": 0,
"max_time": "2024-06-10T21:29:42.697Z",
"min": 0,
"min_size": 0,
"min_time": "2024-06-10T21:29:42.697Z",
"pattern": "string",
"ref_container_id": 0,
"ref_datastore_id": 0,
"tolerance": 0,
"value": 0,
"ref_expression": "string",
"ref_filter": "string",
"required_labels": [
"road"
],
"numeric_comparator": {
"epsilon": 0,
"as_absolute": true
},
"duration_comparator": {
"millis": 0
},
"string_comparator": {
"ignore_whitespace": false
},
"distinct_field_name": "string",
"pair_substrings": true,
"pair_homophones": true,
"spelling_similarity_threshold": 0
},
"template_checks_count": 0,
"anomaly_count": 0,
"type": "global",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0
},
{
"type": "external",
"name": "string",
"color": "string",
"description": "string",
"weight_modifier": 0,
"integration": {
"id": 0,
"created": "2024-06-10T21:29:42.697Z",
"name": "string",
"type": "atlan",
"api_url": "string",
"overwrite": true,
"last_synced": "2024-06-10T21:29:42.697Z",
"status": "syncing"
}
}
],
"latest_profile_id": 0,
"quality_score": {
"total": 0,
"completeness": 0,
"coverage": 0,
"conformity": 0,
"consistency": 0,
"precision": 0,
"timeliness": 0,
"volumetrics": 0,
"accuracy": 0
}
}
],
"description": "string",
"rule_type": "afterDateTime",
"coverage": 0,
"inferred": true,
"template_locked": true,
"is_new": true,
"num_container_scans": 0,
"filter": "string",
"properties": {
"allow_other_fields": true,
"assertion": "string",
"comparison": "string",
"datetime": "2024-06-10T21:29:42.697Z",
"expression": "string",
"field_name": "string",
"field_type": "Unknown",
"id_field_names": [
"string"
],
"inclusive": true,
"inclusive_max": true,
"inclusive_min": true,
"interval_name": "Yearly",
"last_value": 0,
"list": [
"string"
],
"max": 0,
"max_size": 0,
"max_time": "2024-06-10T21:29:42.697Z",
"min": 0,
"min_size": 0,
"min_time": "2024-06-10T21:29
}
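For reference, a minimal Python sketch of calling this endpoint is shown below. The base URL, environment variable names, and authentication scheme are assumptions for illustration and should be adapted to your deployment and API token setup.

```python
import os
import requests

# Assumed configuration for this sketch; adjust to your deployment.
BASE_URL = os.environ.get("QUALYTICS_API_URL", "https://your-instance.qualytics.io/api")
TOKEN = os.environ["QUALYTICS_API_TOKEN"]   # hypothetical variable holding an API token

def get_anomaly(anomaly_id):
    """Fetch a single anomaly by id or UUID via GET /api/anomalies/{id}."""
    response = requests.get(
        f"{BASE_URL}/anomalies/{anomaly_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

anomaly = get_anomaly(75566)   # e.g. the numeric id shown in the UI as #75566
print(anomaly["status"], anomaly["type"])
```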
Retrieving Anomaly Source Records
Endpoint (Get)
/api/anomalies/{id}/source-record (get)
Example Result Response
Manage Anomalies
You can manage anomalies to ensure your data remains accurate and any quality issues are addressed efficiently. Anomalies, which occur due to errors or inconsistencies in data, can be categorized as All, Active, Acknowledged, or Archived, helping you track their status and take the appropriate actions. You can acknowledge anomalies that have been reviewed, archive those that no longer need attention, and delete any that are irrelevant or logged by mistake.
Bulk actions further simplify the process, allowing you to manage multiple anomalies at once, saving time and effort. This guide will walk you through the steps of acknowledging, archiving, restoring, editing, and deleting anomalies, whether individually or in bulk.
Let's get started 🚀
Navigation
Step 1: Log in to your Qualytics account and select the datastore from the left menu on which you want to manage your anomalies.
Step 2: Click on the “Anomalies” from the Navigation Tab.
Categories of Anomalies
Anomalies can be classified into different categories based on their status and actions taken. These categories include All, Active, Acknowledged, and Archived anomalies. Managing anomalies effectively helps in maintaining data integrity and ensuring quick response to issues.
All
By selecting All Anomalies, you can view the complete list of anomalies, regardless of their status. This option helps you get a comprehensive overview of all issues that have been detected, whether they are currently active, acknowledged, or archived.
Active
By selecting Active Anomalies, you can focus on anomalies that are currently unresolved or require immediate attention. These are the anomalies that are still in play and have not yet been acknowledged, archived, or resolved.
Acknowledged
By selecting Acknowledged Anomalies, you can see all anomalies that have been reviewed and marked as acknowledged. This status indicates that the anomalies have been noted, though they may still require further action.
Archived
By selecting Archived Anomalies, you can view anomalies that have been resolved or moved out of active consideration. Archiving anomalies allows you to keep a record of past issues without cluttering the active list.
You can also categorize the archived anomalies based on their status as Resolved, Duplicate and Invalid, to manage and review them effectively.
1. Resolved: This indicates that the anomaly was a legitimate data quality concern and has been addressed.
2. Duplicate: This indicates that the anomaly is a duplicate of an existing record and has already been addressed.
3. Invalid: This indicates that the anomaly is not a legitimate data quality concern and does not require further action.
4. All: Displays all archived anomalies, including those marked as Resolved, Duplicate, and Invalid, giving a comprehensive view of all past issues.
Acknowledge Anomalies
By acknowledging anomalies, you indicate that they have been reviewed or recognized. This can be done either individually or in bulk, depending on your workflow. Acknowledging anomalies helps you keep track of issues that have been addressed, even if further action is still required.
Warning
Once an anomaly is acknowledged, it remains acknowledged and never reverts to the active state.
Method I. Acknowledge Specific Anomaly
You can acknowledge individual anomalies either directly or through the action menu, giving you precise control over each anomaly's status.
1. Acknowledge Directly
Step 1: Locate the active anomaly you want to acknowledge, and click on the acknowledge icon (represented by an eye icon) located on the right side of the anomaly.
A modal window titled “Acknowledge Anomaly” will appear, confirming that this action acknowledges the anomaly as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 2: Click on the Acknowledge button to move the anomaly to the acknowledged state.
Step 3: After clicking on the Acknowledge button, your anomaly is successfully moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged.”
2. Acknowledge via Action Menu
Step 1: Click on the active anomaly from the list of available anomalies that you want to acknowledge.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Acknowledge” from the drop-down menu.
A modal window titled “Acknowledge Anomaly” will appear, confirming that this action acknowledges the anomaly as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 3: Click on the Acknowledge button to move the anomaly to the acknowledged state.
Step 4: After clicking on the Acknowledge button, your anomaly is successfully moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged.”
Method II. Acknowledge Anomalies in Bulk
By acknowledging anomalies in bulk, you can quickly mark multiple anomalies as reviewed at once, saving time and ensuring that all relevant issues are addressed simultaneously.
Step 1: Hover over the active anomalies and click on the checkbox to select multiple anomalies.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Acknowledge" from the dropdown menu to acknowledge the selected anomalies.
A modal window titled “Acknowledge Anomalies” will appear, confirming that this action acknowledges the anomalies as a legitimate data quality concern.
You also have the option to leave a comment in the provided field to provide additional context or details.
Step 3: Click on the Acknowledge button to acknowledge the anomalies.
Step 4: After clicking on the Acknowledge button, your anomalies are successfully moved to the acknowledged state and a flash message will appear saying “The Anomalies have been successfully acknowledged.”
Archive Anomalies
By archiving anomalies, you move them to an inactive state, while still keeping them available for future reference or analysis. Archiving helps keep your active anomaly list clean without permanently deleting the records.
Method I. Archive Specific Anomalies
You can archive individual anomalies either directly or through the action menu.
1. Archive Directly
Step 1: Locate the anomaly (whether Active or Acknowledged) you want to archive and click on the Archive icon (represented by a box with a downward arrow) located on the right side of the anomaly.
Step 2: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 3: Once you've made your selection, click the Archive button to proceed.
Step 4: After clicking on the Archive button your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”
2. Archive via Action Menu
Step 1: Click on the anomaly from the list of available (whether Active or Acknowledged) anomalies that you want to archive.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on the “Archive” from the drop-down menu.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”
Method II. Archive Anomalies in Bulk
To handle multiple anomalies efficiently, you can archive them in bulk, allowing you to quickly move large volumes of anomalies into the archived state.
Step 1: Hover over the anomaly (whether Active or Acknowledged) and click on the checkbox to select multiple anomalies.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Archive" from the dropdown menu to archive the selected anomalies.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click on the Archive button to proceed.
Step 5: After clicking on the Archive button, your anomalies are moved to the archive and a flash message will appear saying “Anomalies have been successfully archived.”
Restore Archive Anomalies
By restoring archived anomalies, you can bring them back into the acknowledged state for further investigation or review. These anomalies will not return to the active state once they have been acknowledged.
Step 1: Click on the anomaly that you want to restore from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Restore” from the drop-down menu.
Step 3: After clicking on the “Restore” button, the selected anomaly is restored to the acknowledged state.
Edit Anomalies
By editing anomalies, you can only update their tags, allowing you to categorize and organize anomalies more effectively without altering their core details.
Note
When editing multiple anomalies in bulk, only the tags can be modified.
Step 1: Hover over the anomaly (whether Active or Acknowledged) and click on the checkbox.
You can edit multiple anomalies at once by selecting the checkbox next to each anomaly.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of selected anomalies along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Edit" from the dropdown menu to edit the selected anomalies.
A modal window titled “Bulk Edit Anomalies” will appear. Here you can only modify the “tags” of the selected anomalies.
Step 3: Turn on the toggle and assign tags to the selected anomalies.
Step 4: Once you have assigned the tags, click on the “Save” button.
After clicking the Save button, the selected anomalies will be updated with the assigned tags.
Delete Anomalies
Deleting anomalies allows you to permanently remove records that are no longer relevant or were logged in error. This can be done individually or for multiple anomalies at once, ensuring that your anomaly records remain clean and up to date.
Note
You can only delete archived anomalies, not active or acknowledged anomalies. If you want to delete an active or acknowledged anomaly, you must first move it to the archive, and then you can delete it.
Warning
Deleting an anomaly is a one-time action. It cannot be restored after deletion.
Method I. Delete Specific Anomaly
You can delete individual anomalies using one of two methods:
1. Delete Directly
Step 1: Click on Archived from the navigation bar in the Anomalies section to view all archived anomalies.
Step 2: Locate the anomaly, that you want to delete and click on the Delete icon located on the right side of the anomaly.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking the Delete button, the anomaly is deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
2. Delete via Action Menu
Step 1: Click on the archived anomaly that you want to delete from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Delete” from the drop-down menu.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking the Delete button, the anomaly is deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
Method II. Delete Anomalies in Bulk
For more efficient management, you can delete multiple anomalies at once using the bulk delete option, allowing for faster cleanup of unwanted records.
Step 1: Hover over the archived anomalies and click on the checkbox to select anomalies in bulk.
When multiple anomalies are selected, an action toolbar appears, displaying the total number of anomalies chosen along with a vertical ellipsis for additional bulk action options.
Step 2: Click on the vertical ellipsis (⋮) and choose "Delete" from the dropdown menu to delete the selected anomalies.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the selected anomalies from the system.
Step 4: After clicking the Delete button, the anomalies are deleted and a success flash message will appear saying “Anomalies has been successfully deleted”.
Ended: Anomalies
Explore ↵
Explore
Explore page in Qualytics is where you can easily view and manage all your data. It provides easy access to important features through tabs like Insights, Activity, Profiles, Observability, Checks, and Anomalies. Each tab shows a different part of your data, such as its quality, activities, structure, checks, and issues. You can sort and filter the data by datastore and time frame, making it easier to track performance, spot problems, and take action. The Explore section helps you manage and understand your data all in one place.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: After clicking the Explore button, you will see the following tabs: Insights, Activity, Profiles, Observability, Checks, and Anomalies.
Insights
Insights tab provides a quick and clear overview of your data's health and performance. It shows key details like Quality Scores, active checks, profiles, scans, and anomalies in a simple and effective way. This makes it easy to monitor and track data quality, respond to issues, and take action quickly. Additionally, users can monitor specific source datastores and check for a particular report date and time frame.
For more details on Insights, please refer to the Insights documentation.
Activity
Activity tab provides a comprehensive view of all operations, helping users monitor and analyze the performance and workflows across various source datastores. Activity is categorized into Runs and Schedule operations, offering distinct insights into executed and scheduled activities.
For more details on Activity, please refer to the Activity documentation.
Profiles
Profiles tab helps you explore and manage your containers and fields. With features like filtering, sorting, tagging, and detailed profiling, it provides a clear understanding of data quality and structure. This simplifies navigation and enhances data management for quick, informed decisions.
For more details on Profiles, please refer to the Profiles documentation.
Observability
Observability tab gives users an easy way to track changes in data volume over time. It introduces two types of checks: Volumetric and Metric. The Volumetric check automatically monitors the number of rows in a table and flags unusual changes, while the Metric check focuses on specific fields, providing more detailed insights from scan operations. Together, these tools help users spot data anomalies quickly and keep their data accurate.
For more details on Observability, please refer to the Observability documentation.
Checks
Checks tab provides a quick overview of the various checks applied across different tables and fields in multiple source datastores. In Qualytics, checks act as rules applied to data tables and fields to ensure accuracy and maintain data integrity. You can filter and sort the checks based on your preferences, making it easy to see which checks are active, in draft, or archived. This section is designed to simplify the review of applied checks across datasets without focusing on data quality or performance.
For more details on Checks, please refer to the Checks documentation.
Anomalies
Anomalies tab provides a quick overview of all detected anomalies across your source datastores. In Qualytics, an anomaly refers to a dataset (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are identified when the data does not satisfy the applied validation criteria. You can filter and sort anomalies based on your preferences, making it easy to see which anomalies are active, acknowledged, or archived. This section is designed to help you quickly identify and address any issues.
For more details on Anomalies, please refer to the Anomalies documentation.
Insights
Insights in Qualytics provides a quick and clear overview of your data's health and performance. It shows key details like Quality Scores, active checks, profiles, scans, and anomalies in a simple and effective way. This makes it easy to monitor and track data quality, respond to issues, and take action quickly. Additionally, users can monitor specific source datastores and check for a particular report date and time frame.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
You will be navigated to the Insights tab to view a presentation of your data, pulled from the connected source datastore.
Filtering Controls
Filtering Controls allow you to refine the data displayed on the Insights page. You can customize the data view based on Source Datastores, Tags, Report Date, and Timeframe, ensuring you focus on the specific information that matters to you.
No | Filter | Description |
---|---|---|
1. | Select Source Datastores | Select specific source datastores to focus on their data. |
2. | Tags | Filter data by specific tags to categorize and refine results. |
3. | Report Date | Set the report date to view data from a particular day. |
4. | Timeframe | Choose a timeframe (week, month, quarter, or year) to view data for a specific period. |
Understanding Timeframes and Timeslices
When analyzing data on the Insights page, two key concepts help you uncover trends: timeframes and timeslices. These work together to give you both a broad view and a detailed breakdown of your data.
Timeframes
Timeframe is the total range of time you select to view your data. For example, you can choose to see data:
- Weekly: Summarize data for an entire week.
- Monthly: Group data by months.
- Quarterly: Cover three months at a time.
- Yearly: Show data for the entire year.
How Metrics Behave Over a Timeframe
- Quality Score and other similar metrics display an average for the selected timeframe.
Example: If you select weekly, the Quality Score shown will be the average score for the entire week.
- Historical Graphs (like Profiles or Scans) show cumulative totals over time.
Example: If you view a graph for a monthly timeframe, the graph shows how data grows or changes month by month.
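To make the difference concrete, here is a minimal sketch in Python. The daily numbers are made up for illustration and are not pulled from the Qualytics platform; the sketch only shows how an averaged metric and a cumulative graph behave over the same weekly timeframe.

```python
from statistics import mean

# Hypothetical daily values for one weekly timeframe (illustrative only,
# not data pulled from the Qualytics platform).
daily_quality_scores = [91.2, 90.8, 92.5, 93.0, 92.1, 91.7, 92.4]
daily_records_profiled = [10_000, 12_500, 9_800, 11_200, 10_700, 9_900, 12_100]

# Quality Score behaves like an average over the selected timeframe.
weekly_quality_score = mean(daily_quality_scores)

# Historical graphs such as Profiles or Scans behave like cumulative totals.
cumulative_records = []
running_total = 0
for count in daily_records_profiled:
    running_total += count
    cumulative_records.append(running_total)

print(f"Weekly Quality Score (average): {weekly_quality_score:.1f}%")
print(f"Records profiled, cumulative by day: {cumulative_records}")
```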
Timeslices
Timeslice breaks your selected timeframe into smaller parts. It helps you see more detailed trends within the overall timeframe.
For example:
- A weekly timeframe shows each day of the week.
- A monthly timeframe breaks into weekly segments.
- A quarterly timeframe highlights months within that quarter.
- A yearly timeframe divides into quarters and months.
How Timeslices Work
- When you choose a timeframe, the graph automatically breaks it into timeslices.
- Each bar or point on the graph represents one timeslice.
Example:
- If you choose a Weekly timeframe, each bar in the graph will represent one day of the week.
- If you choose a Monthly timeframe, each bar will represent one week in that month.
Metrics Within a Timeslice
Metrics like Quality Score, Profiles, or Scans are displayed for each timeslice, allowing you to identify trends and patterns over smaller intervals.
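As a rough illustration, the sketch below (plain Python with hypothetical values, not the platform's implementation) breaks a weekly timeframe into daily timeslices, one entry per bar on the graph.

```python
from datetime import date, timedelta

def weekly_timeslices(start: date, daily_values: list[float]) -> dict[str, float]:
    """Break a weekly timeframe into daily timeslices.

    Each entry corresponds to one bar on the graph: the day (the timeslice)
    mapped to the metric value observed for that day.
    """
    return {
        (start + timedelta(days=i)).isoformat(): value
        for i, value in enumerate(daily_values)
    }

# Illustrative Quality Scores for a weekly timeframe (made-up values).
slices = weekly_timeslices(date(2024, 10, 7), [91.2, 90.8, 92.5, 93.0, 92.1, 91.7, 92.4])
for day, score in slices.items():
    print(day, score)  # one line (bar) per timeslice
```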
Quality Score
Quality Score gives a clear view of your data's overall quality. It shows important measures like Completeness, Conformity, Consistency, Precision, Timeliness, Volumetrics, and Accuracy, each represented by a percentage. This helps you quickly understand the health of your data, making it easier to identify areas that need improvement.
Overview
Overview provides a quick view of your data. It shows the total amount of data being managed, along with the number of Source Datastores and Containers. This helps you easily track the size and growth of your data.
Records and Fields Data
This section shows important information about the records and fields in the connected source datastores:
- Records Profiled: This represents the total number of records that were included in the profiling process.
- Records Scanned: This refers to the number of records that were checked during a scan operation. The scan performs data quality checks on collections like tables, views, and files.
- Fields Profiled: This shows how many field profiles were updated as a result of the profiling operation.
Checks
Checks offer a quick view of active checks, categorizing them based on their results.
1. Passed Checks: Displays the real-time number of passed checks that were successfully completed during the scan or profile operation, indicating that the data met the set quality criteria.
2. Failed Checks: This shows the real-time number of checks that did not pass during the scan or profile operation, indicating data that did not meet the quality criteria.
3. Not Asserted Checks: This shows the real-time number of checks that haven't been processed or validated yet, meaning their status is still pending and they have not been confirmed as either passed or failed.
The count for each category can be viewed by hovering over the relevant check, providing real-time ratios of checks. Users can also click on these checks to navigate directly to the corresponding checks’ dedicated page in the Explore section.
Anomalies
Anomalies section provides a clear overview of identified anomalies in the system. The anomalies are categorized for better clarity and management.
Anomalies Identified shows the total issues found, divided into active, acknowledged, and resolved, helping users quickly manage and fix problems.
1. Active Anomalies: Shows the number of unresolved anomalies that require immediate attention. These anomalies are still present and have not been acknowledged, archived, or resolved in the system.
2. Acknowledged Anomalies: These are anomalies that have been reviewed and recognized by users but are not yet resolved. Acknowledging anomalies helps keep track of issues that have been addressed, even if further actions are still needed.
3. Resolved Anomalies: Represent anomalies that were valid data quality issues and have been successfully addressed. These anomalies have been resolved, indicating the data now meets the required quality standards.
The count for each category can be viewed by hovering over the relevant anomalies, providing real-time ratios of anomalies. Users can also click on these anomalies to navigate directly to the corresponding anomalies’ dedicated page in the Explore section.
Rule Type Distribution
Rule Type Distribution highlights the top rule types applied to the source datastore, each represented by a different color. The visualization allows users to quickly see which rules are most commonly applied.
By clicking the caret down 🔽 button, users can choose either the top 5 or top 10 rule types to view in the insights, based on their analysis needs.
Profiles
Profiles section provides a clear view of data profiling activities over time, showing how often profiling is performed and the amount of data (records) analyzed.
Profile Runs shows how many times data profiling has been done over a certain period. Each run processes a specific source datastore or table, helping users see how often profiling happens. The graph gives a clear view of the changes in profile runs over time, making it easier to track profiling activity.
Click on the caret down 🔽 button to choose between viewing Records Profiled or Fields Profiled, depending on your preference.
Records Profiled
Records Profiled shows the total number of records processed during the profile runs. It provides insight into the amount of data that has been analyzed during those runs. The bars in the graph show the comparison of the number of records profiled over the selected days.
Fields Profiled
Fields Profiled shows the number of fields processed during the profile runs. It shows how many individual fields within datasets have been analyzed during those runs. The bars in the graph provide a comparison of the fields profiled over the selected days.
Scans
Scans section provides a clear overview of all scanning activities within a selected period. It helps users keep track of how many scans were performed and how many anomalies were detected during those scans. This section makes it easier to understand the scanning process and manage data by offering insight into how often scans occur.
Scan Runs show how often data scans are performed over a certain period. These scans check the quality of data across tables, views, and files, helping users monitor their data regularly and identify any issues. The process can be customized to scan tables or limit the number of records checked, ensuring that data stays accurate and up to standard.
Click on the caret down 🔽 button to choose between viewing Anomalies Identified or Records Scanned, depending on your preference.
Anomalies Identified
Anomalies Identified shows the total number of anomalies detected during the scan runs. The bars in the graph allow users to compare the number of anomalies found across different days, helping them spot trends or irregularities in the data.
Records Scanned
Records Scanned shows the total number of records that were scanned during the scan runs. It gives users insight into how much data has been processed and allows them to compare the scanned records over the selected period.
Data Volume
Data Volume allows users to track the size of data stored within all source datastores present in the Qualytics platform over time. This helps in monitoring how the source datastore grows or changes, making it easier to detect irregularities or unexpected increases that could affect system performance. Users can visualize data size trends and manage the source datastore's efficiency, optimizing storage, adjusting resources, and enhancing data processing based on its size and growth.
Export
Export button allows you to quickly download the data from the Insights page. You can export data according to the selected Source Datastores, Tags, Report Date, and Timeframe. This makes it easy to save the data for offline use or share it with others.
After exporting, the data appears in a structured format, making it easy to save for offline use or to share with others.
Activity
Activity in Qualytics provides a comprehensive view of all operations, helping users monitor and analyze the performance and workflows across various source datastores. Activity is categorized into Runs and Schedule operations, offering distinct insights into executed and scheduled activities.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Activity" from the Navigation Tab.
You will be navigated to the Activity tab, where you'll see a list of operations (catalog, profile, scan, and external scan) across different source datastores.
Activity Categories
Activity is divided into two categories: Runs and Schedule. Runs provides insights into operations that have already been performed, while Schedule provides insights into operations that are scheduled to run.
Runs
Runs provide a complete record of all executed operations across various source datastores. This section enables users to monitor and review activities such as catalog, profile, scan, and external scan. Each run displays key details like the operation type, status, execution time, duration, and triggering method, offering a clear overview of system performance and data processing workflows.
No. | Field | Description |
---|---|---|
1. | Select Source Datastore | Select specific source datastores to focus on their operations. |
2. | Search | This feature helps users quickly find specific identifiers. |
3. | Sort By | Sort By option helps users organize the list of performed operations by criteria like Duration and Created Date for quick access. |
4. | Filter | The filter lets users easily refine the list of performed operations by choosing a specific Type (Scan, Catalog, Profile, or External Scan) or Status (Success, Failure, Running, or Aborted) to view. |
5. | Activity Heatmap | Activity Heatmap shows daily activity levels, with color intensity indicating operation counts. Hovering over a square reveals details for that day. |
6. | Operation List | Shows a list of performed operations (catalog, profile, scan, and external scan) across various source datastores. |
Activity Heatmap
Activity heatmap represents activity levels over a period, with each square indicating a day and the color intensity representing the number of operations or activities on that day. It is useful in tracking the number of operations performed on each day within a specific timeframe.
Note
You can click on any of the squares from the Activity Heatmap to filter operations.
By hovering over each square, you can view additional details for that specific day, such as the exact date, and the total number of operations executed.
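Conceptually, the heatmap is just a count of operations per day. The sketch below illustrates that idea with a handful of made-up dates; it is not how Qualytics computes the heatmap internally.

```python
from collections import Counter
from datetime import date

# Illustrative operation log: each entry is the day an operation ran
# (made-up dates; in practice these would come from your activity records).
operation_dates = [
    date(2024, 10, 1), date(2024, 10, 1), date(2024, 10, 2),
    date(2024, 10, 2), date(2024, 10, 2), date(2024, 10, 4),
]

# Count operations per day: each heatmap square corresponds to one day,
# and its color intensity reflects this count.
daily_counts = Counter(d.isoformat() for d in operation_dates)
print(daily_counts)  # e.g. Counter({'2024-10-02': 3, '2024-10-01': 2, '2024-10-04': 1})
```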
Operation Details
Step 1: Click on any successfully performed operation from the list to view its details.
For demonstration purposes, we have selected the profile operation.
Step 2: After clicking, a dropdown will appear, displaying the details of the selected operation.
Step 3: Users can hover over abbreviated metrics to see the full value for better clarity. For demonstration purposes, we are hovering over the Records Profiled field to display the full value.
Users can also view both profiled and non-profiled file patterns:
Step 4: Click on the Result button.
The Profile Results modal displays a list of both profiled and non-profiled containers. You can filter the view to show only non-profiled containers by turning on the toggle, which will display the complete list of unprofiled containers.
Schedule
Schedule provides a complete record of all scheduled operations across various source datastores. This section enables users to monitor and review scheduled operations such as catalog, profile, and scan operations. Each scheduled operation includes key details like operation type, scheduled time, and triggering method, giving users a clear overview of system performance and data workflows.
No. | Field | Description |
---|---|---|
1 | Selected Source Datastores | Select specific source datastores to focus on their operations. |
2 | Search | This feature helps users quickly find specific identifiers. |
3 | Sort By | Sort By option helps users organize the list of scheduled operations by criteria like Created Date and Operations for quick access. |
4 | Filter | The filter lets users easily refine the list of scheduled operations by choosing a specific type (Scan, Catalog, or Profile) to view. |
5. | Operation List | Shows a list of scheduled operations (catalog, profile, and scan) across various source datastores. |
Profiles
Profiles in Qualytics helps you explore and manage your containers and fields. With features like filtering, sorting, tagging, and detailed profiling, it provides a clear understanding of data quality and structure. This simplifies navigation and enhances data management for quick, informed decisions.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Profiles" from the Navigation Tab.
You will be navigated to the Profiles section. Here, you will see the data organized into two sections: Containers and Fields, allowing you to explore and analyze the datasets efficiently.
Containers
By selecting the Containers section, you can explore structured datasets that are organized as either JDBC or DFS containers. JDBC containers represent tables or views within relational databases, while DFS containers include files such as CSV, JSON, or Parquet, typically stored in distributed systems like Hadoop or cloud storage.
Container Details
Containers section provides key details about each container, including the last profiled and last scanned dates. Hovering over the info icon for a specific container reveals these details instantly.
Step 1: Locate the container you want to review, then hover over the info icon to view the container Details.
Step 2: A pop-up will appear with additional details about the container, such as the last profiled and last scanned dates.
Explore Tables and Fields
By clicking on a specific container, users can view its associated fields, including detailed profiling information. Additionally, clicking the arrow icon on the right side of a specific container allows users to navigate directly to its corresponding table for a more in-depth exploration.
Explore Fields
To explore the data within a container, you can view all its fields. This allows you to gain insights into the structure and quality of the data stored in the container.
Step 1: Click on the specific container whose fields you want to preview.
For demonstration purposes, we have selected the Netsuite Financials container.
Step 2: You will be directed to the fields of the selected container, where all the fields of the container will be displayed.
Explore Tables
To explore the data in more detail, you can view the corresponding table of a selected container. This provides a comprehensive look at the data stored within, allowing for deeper analysis and exploration.
Step 1: Click on the arrow icon on the right side of the container you want to preview.
Step 2: You will be directed to the corresponding table, providing a comprehensive view of the data stored in the container.
Filter and Sort
Filter and Sort options allow you to organize your containers by various criteria, such as Name, Last Profiled, Last Scanned, Quality Score, Records, and Type. You can also apply filters to refine your list of containers based on Type.
Sort
You can sort your containers by various criteria, such as Name, Last Profiled, Last Scanned, Quality Score, Records and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1. | Anomalies | Sorts containers based on the number of detected anomalies. |
2. | Checks | Sorts containers by the number of active validation checks applied. |
3. | Completeness | Sorts containers based on their data completeness percentage. |
4. | Created Date | Sorts containers by the date they were created, showing the newest or oldest fields first. |
5. | Fields | Sorts containers by the number of fields profiled. |
6. | Last Profiled | Sorts containers by their most recent profiling date. |
7. | Last Scanned | Sorts containers by their most recent scan date. |
8. | Name | Sorts containers alphabetically by their names. |
9. | Quality Score | Sorts containers based on their quality score, indicating the reliability of the data in the container. |
10. | Records | Sorts containers by the number of records profiled. |
11. | Type | Sorts containers based on their type (e.g., table, view, file). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your containers based on Type (Table, View, File, Computed Table, and Computed File) to easily organize and prioritize them according to your needs.
Mark as Favorite
Marking a container as a favorite allows you to quickly access and prioritize the containers that are most important to your work, ensuring faster navigation and improved efficiency.
Step 1: Locate the container which you want to mark as a favorite and click on the bookmark icon located on the left side of the container.
After clicking on the bookmark icon, your container is marked as a favorite and a success flash message will appear stating "The Table has been favorited".
To unmark, simply click on the bookmark icon of the marked container. This will remove it from your favorites.
Fields
By selecting the Fields section in the Qualytics platform, you can view all the fields across your data sources, including their quality scores, completeness, and metadata, for streamlined data management.
Fields Details
Field Details view in the Qualytics platform provides in-depth insights into a selected field. It displays key information, including the field’s declared type, number of distinct values, minimum and maximum length of observed values, entropy, and unique/distinct ratio. This detailed profiling allows you to understand the field's data structure, quality, and variability, enabling better data governance and decision-making.
Step 1: Click on the specific field whose details you want to preview.
A modal window will appear, providing detailed information about the selected field, such as its declared type, distinct values, length range, and more.
Manage Tags in Field Details
Tags can now be directly managed in the field profile within the Explore section. Simply access the Field Details panel to create, add, or remove tags, enabling more efficient and organized data management.
Step 1: Click on the specific field for which you want to manage tags.
A Field Details modal window will appear. Click on the + button to assign tags to the selected field.
Step 2: You can also create a new tag by clicking on the ➕ button.
A modal window will appear, providing the options to create the tag. Enter the required values to get started.
For more information on creating tags, refer to the Add Tag section.
Filter and Sort
Filter and Sort options allow you to organize your fields by various criteria, such as Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type. You can also apply filters to refine your list of fields based on Profile and Type.
Sort
You can sort your fields by various criteria, such as Anomalies, Checks, Completeness, Created Date, Name, Quality Score, and Type to easily organize and prioritize them according to your needs.
No | Sort By | Description |
---|---|---|
1. | Anomalies | Sorts fields based on the number of detected anomalies. |
2. | Checks | Sorts fields by the number of active validation checks applied. |
3. | Completeness | Sorts fields based on their data completeness percentage. |
4. | Created Date | Sorts fields by the date they were created, showing the newest or oldest fields first. |
5. | Name | Sorts fields alphabetically by their names. |
6. | Quality Score | Sorts fields based on their quality score, indicating the reliability of the data in the field. |
7. | Type | Sorts fields based on their data type (e.g., string, boolean, etc.). |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your fields based on Profiles and Type to easily organize and prioritize them according to your needs.
No. | Filter | Description |
---|---|---|
1. | Profile | Filters fields based on the Profiles (e.g., accounts, accounts.csv, etc.). |
2. | Type | Filters fields based on the data type (e.g., string, boolean, date, etc.). |
Observability
Observability in the Explore section offers a clear view of platform data, enabling users to monitor and analyze data behavior effectively. It tracks changes in data volumes and metrics, highlighting trends and anomalies. By organizing checks into Volumetric and Metric categories, it simplifies monitoring overall data volumes and specific field-based values. Visual tools like heatmaps and customizable checks make it easy to identify issues, set thresholds, and adjust monitoring parameters for efficient data management.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Observability" from the Navigation Tab.
Observability Categorized
In Observability, data checks are divided into two categories: Volumetric and Metric. Volumetric checks track overall data volume, while Metric checks measure data based on predefined fields and thresholds, tracking changes in data values over time. These two categories work together to offer comprehensive insights into data trends and anomalies.
1. Volumetric: Volumetric automatically tracks changes in the amount of data within a table over time. It monitors row counts and compares them to expected ranges based on historical data. If the data volume increases or decreases unexpectedly, the check flags it for further review. This feature helps users easily identify unusual data patterns without manual monitoring.
2. Metric: Metric measures data based on predefined fields and thresholds, tracking changes in data values over time. It detects if the value of a specific field, like the average or absolute value, goes beyond the expected range. Using scheduled scans, it automatically records and analyzes these values, helping users quickly spot any anomalies. This check gives deeper insights into data behavior, ensuring data integrity and identifying irregular patterns easily.
Volumetric
Volumetric checks help monitor data volumes over time to keep data accurate and reliable. They automatically count rows in a table and spot any unusual changes, like problems with data loading. This makes it easier to catch issues early and keep everything running smoothly. Volumetric checks also let you track data over different time periods, such as daily or weekly. The system sets limits based on past data, and if the row count goes above or below those limits, an anomaly alert is triggered (see the sketch after the table below).
No | Field | Description |
---|---|---|
1. | Select Source Datastore | Select specific source datastores to focus on their data. |
2. | Select Tag | Filter data by specific tags to categorize and refine results. |
3. | Search | This feature helps users quickly find specific identifiers or names in the data. |
4. | Report Date | Report Date lets users pick a specific date to view data trends for that day. |
5. | Time Frame | The time frame option lets users choose a period (week, month, quarter, or year) to view data trends. |
6. | Sort By | Sort By option helps users organize data by criteria like Anomalies, Created Date, Checks, Name, Type or Last Scanned for quick access. |
7 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
8. | Datastore | Displays the Datastore name for which the volumetric check is being performed. |
9. | Table | Displays the table for which the volumetric check is being performed (e.g., customer, nation). Each table has its own Volumetric Check. |
10. | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
11. | Anomaly Detection | The Volumetric Check detects anomalies when row counts exceed set min or max thresholds, triggering an alert for sudden changes. |
12. | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
13. | Volumetric Check (# ID) | Each check is assigned a unique identifier, followed by the time period it applies to (e.g., 1 Day for the customer table). This ID helps in tracking the specific check in the system. |
14. | Group By | Users can also Group By specific intervals, such as day, week, or month, to observe trends over different periods. |
15. | Measurement Period | Defines the time period over which the volumetric check is evaluated. It can be customized to 1 day, week, or other timeframes. |
16. | Min Values | These indicate the minimum thresholds for the row count of the table being checked (e.g., 150,139 Rows) |
17. | Max Values | These indicate the maximum thresholds for the row count of the table being checked. |
18. | Last Asserted | This shows the date the last check was asserted, which is the last time the system evaluated the Volumetric Check (e.g., Oct 02, 2024). |
19. | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
20. | Graph Visualization | The graph provides a visual representation of the row count trends. It shows fluctuations in data volume over the selected period. This visual allows users to quickly identify any irregularities or anomalies. |
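The anomaly detection behavior summarized above (row counts compared against min/max thresholds derived from past data) can be sketched as follows. This is a conceptual illustration only: the simple mean-plus-spread band used here is an assumption made for the example, not the platform's actual inference logic.

```python
from statistics import mean, pstdev

def volumetric_thresholds(historical_row_counts: list[int], k: float = 3.0) -> tuple[int, int]:
    """Derive illustrative min/max row-count limits from past measurements.

    A simple "mean +/- k standard deviations" band is used here purely for
    demonstration; it is not the platform's actual inference logic.
    """
    avg = mean(historical_row_counts)
    spread = pstdev(historical_row_counts)
    return int(avg - k * spread), int(avg + k * spread)

def is_volumetric_anomaly(row_count: int, min_rows: int, max_rows: int) -> bool:
    """Flag a measurement that falls outside the expected range."""
    return row_count < min_rows or row_count > max_rows

history = [150_139, 151_020, 149_876, 150_455, 150_902]  # made-up daily row counts
min_rows, max_rows = volumetric_thresholds(history)
print(is_volumetric_anomaly(175_000, min_rows, max_rows))  # True: sudden spike
print(is_volumetric_anomaly(150_300, min_rows, max_rows))  # False: within range
```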
Observability Heatmap
The heatmap provides a visual overview of data anomalies by day, using color codes for quick understanding:
- Blue square: Blue squares represent days with no anomalies, meaning data stayed within the expected range.
- Orange square: Orange squares indicate days where data exceeded the minimum or maximum threshold range but didn’t qualify as a critical anomaly.
- Red square: Red squares highlight days with anomalies, signaling significant deviations from expected values that need further investigation.
By hovering over each square, you can view additional details for that specific day, including the date, last row count, and anomaly count, allowing you to easily pinpoint and analyze data issues over time.
Edit Check
Editing a Volumetric Check lets users customize settings like measurement period, row count limits, and description. This helps improve data monitoring and anomaly detection, ensuring the check fits specific needs. Users can also add tags for better organization.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Fields | Description |
---|---|---|
1. | Measurement Periods Days | Edit the Measurement Period Days to change how often the check runs (e.g., 1 day, 2 days, etc). |
2. | Min Value and Max Value | Edit the Min Value and Max Value to set new row count limits. If the row count exceeds these limits, an alert will be triggered. |
3. | Description | Edit the Description to better explain what the check does. |
4. | Tags | Edit the Tags to organize and easily find the check later. |
5. | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
If the validation fails, a red message saying "Failed Validation" will appear. This typically occurs when the check logic or parameters do not match the data properly.
Step 4: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit thresholds to set specific row count limits for your data checks. By defining minimum and maximum values, you ensure alerts are triggered when data goes beyond the expected range. This helps you monitor unusual changes in data volume. It gives you better control over tracking your data's behavior.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new row count limits.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Metric
Metric checks track changes in data over time to ensure accuracy and reliability. They check specific fields against set limits to identify when values, like averages, go beyond expected ranges. With scheduled scans, Metric checks automatically log and analyze these data points, making it easy for users to spot any issues. This functionality enhances users' understanding of data patterns, ensuring high quality and dependability. With Metric checks, managing and monitoring data becomes straightforward and efficient.
No | Field | Description |
---|---|---|
1 | Select Source Datastore | Select specific source datastores to focus on their data. |
2 | Tag | Filter data by specific tags to categorize and refine results. |
3 | Search | The search bar helps users find specific metrics or data by entering an identifier or description. |
4 | Sort By | Sort By allows users to organize data by Weight, Anomalies, or Created Date for easier analysis and prioritization. |
5 | Metric(ID) | Represents the tracked data metric with a unique ID. |
6 | Datastore | Shows the Datastore name. |
7 | Table | Shows the table name. |
8 | Description | A brief label or note about the metric, in this case, it's labeled as test |
9 | Weight | Weight shows how important a check is for finding anomalies and sending alerts. |
10 | Anomalies | Anomalies show unexpected changes or issues in the data that need attention. |
11 | Favorite | Mark this as a favorite for quick access and easy monitoring in the future. |
12 | Edit Checks | Edit the check to modify settings, or add tags for better customization and monitoring. |
13 | Field | This refers to the specific field being measured, here the max_value, which tracks the highest value observed for the metric. |
14 | Min | This indicates the minimum value for the metric, which is set to 1. If not defined, no lower limit is applied. |
15 | Max | This field shows the maximum threshold for the metric, set at 8. Exceeding this may indicate an issue or anomaly. |
16 | Created Date | This field shows when the metric was first set up, in this case, June 18, 2024. |
17 | Last Asserted | Last Asserted field shows the last time the metric was checked, in this case July 25, 2024. |
18 | Edit Threshold | Edit Threshold lets users set custom limits for alerts, helping them control when they’re notified about changes in data. |
19 | Group By | This option lets users group data by periods like Day, Week, or Month. In this example, it's set to Day. |
Comparisons
When you add a metric check, you can choose from three comparison options:
- Absolute Change
- Absolute Value
- Percentage Change.
These options help define how the system will evaluate your data during scan operations on the datastore.
Once a scan is run, the system analyzes the data based on the selected comparison type. For example, Absolute Change will look for significant differences between scans, Absolute Value checks if the data falls within a predefined range, and Percentage Change identifies shifts in data as a percentage.
Based on the chosen comparison type, the system flags any deviations from the defined thresholds. These deviations are then visually represented on a chart, displaying how the metric has fluctuated over time between scans. If the data crosses the upper or lower limits during any scan, the system will highlight this in the chart for further analysis.
1. Absolute Change: The Absolute Change comparison checks how much a numeric field's value has changed between scans. If the change exceeds a set limit (Min/Max), it flags this as an anomaly.
2. Absolute Value: The Absolute Value comparison checks whether a numeric field's value falls within a defined range (between Min and Max) during each scan. If the value goes beyond this range, it identifies it as an anomaly.
3. Percentage Change: The Percentage Change comparison monitors how much a numeric field's value has shifted in percentage terms. If the change surpasses the set percentage threshold between scans, it triggers an anomaly.
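In essence, each comparison type is a simple numeric test. The sketch below (illustrative Python with made-up values and thresholds, not the platform's implementation) shows the logic behind the three options.

```python
def absolute_change_anomaly(previous: float, current: float, max_change: float) -> bool:
    """Absolute Change: flag when the difference between scans exceeds the limit."""
    return abs(current - previous) > max_change

def absolute_value_anomaly(current: float, min_value: float, max_value: float) -> bool:
    """Absolute Value: flag when the value falls outside the [min, max] range."""
    return current < min_value or current > max_value

def percentage_change_anomaly(previous: float, current: float, max_pct: float) -> bool:
    """Percentage Change: flag when the shift between scans exceeds the threshold (%)."""
    if previous == 0:
        return current != 0  # simplistic guard for this illustration
    return abs(current - previous) / abs(previous) * 100 > max_pct

# Illustrative values for a numeric field tracked across two scans.
print(absolute_change_anomaly(previous=100, current=112, max_change=10))    # True
print(absolute_value_anomaly(current=7, min_value=1, max_value=8))          # False
print(percentage_change_anomaly(previous=100, current=112, max_pct=10.0))   # True
```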
Edit Check
Edit Check allows users to modify the parameters of an existing metric check. It enables adjusting values, thresholds, or comparison logic to ensure that the metric accurately reflects current monitoring needs.
Step 1: Click the edit icon to modify the check.
A modal window will appear with the check details.
Step 2: Modify the check details as needed based on your preferences:
No | Field | Description |
---|---|---|
1. | Min Value and Max Value | Edit the Min Value and Max Value to set new limits for the metric. If the metric value exceeds these limits, an alert will be triggered. |
2. | Description | Edit the Description to better explain what the check does. |
3. | Tags | Edit the Tags to organize and easily find the check later. |
4. | Additional Metadata(Optional) | Edit the Additional Metadata section to add any new custom details for more context. |
Step 3: Once you have edited the check details, then click on the Validate button. This will perform a validation operation on the check without saving it. The validation allows you to verify that the logic and parameters defined for the check are correct.
If the validation is successful, a green message saying "Validation Successful" will appear.
Step 4: Once you have a successful validation, click the "Update" button. The system will update the changes you've made to the check, including changes to the properties, description, tags, or additional metadata.
After clicking on the Update button, your check is successfully updated and a success flash message will appear stating "Check successfully updated".
Edit Threshold
Edit Threshold allows you to change the upper and lower limits of a metric. This ensures the metric tracks data within the desired range and only triggers alerts when those limits are exceeded.
Note
When editing the threshold, only the min and max values can be modified.
Step 1: Click the Edit Thresholds button on the right side of the graph.
Step 2: After clicking Edit Thresholds, you enter the editing mode where the Min and Max values become editable, allowing you to input new limits for the metric.
Step 3: Once you've updated the Min and Max values, click Save to apply the changes and update the thresholds.
After clicking on the Save button, your threshold is successfully updated and a success flash message will appear stating "Check successfully updated".
Checks
Checks tab provides a quick overview of the various checks applied across different tables and fields in multiple source datastores. In Qualytics, checks act as rules applied to data tables and fields to ensure accuracy and maintain data integrity. You can filter and sort the checks based on your preferences, making it easy to see which checks are active, in draft, or archived. This section is designed to simplify the review of applied checks across datasets without focusing on data quality or performance.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Checks" from the Navigation Tab.
You will be navigated to the Checks tab, where you'll see a list of all the checks that have been applied to various tables and fields across different source datastores.
Check Categories
You can categorize your checks based on their status, such as Active, Draft, Archived (Invalid and Discarded), or All, according to your preference. This categorization offers a clear view of the data quality validation process, helping you manage checks efficiently and maintain data integrity.
All
By selecting All Checks, you can view a comprehensive list of all the checks in the datastores, including both active and draft checks, allowing you to focus on the checks that are currently being managed or are in progress. However, archived checks are not displayed in this view.
Active
By selecting Active, you can view checks that are currently applied and being enforced on the data. These operational checks are used to validate data quality in real-time, allowing you to monitor all active checks and their performance.
You can also categorize the active checks based on their importance, favorites, or specific metrics to streamline your data quality monitoring.
For more details on Active Checks, please refer to the Active Checks section in the documentation.
Draft Checks
By selecting Draft, you can view checks that have been created but have not yet been applied to the data. These checks are in the drafting stage, allowing for adjustments and reviews before activation. Draft checks provide flexibility to experiment with different validation rules without affecting the actual data.
You can also categorize the draft checks based on their importance, favorites, or specific metrics to prioritize and organize them effectively during the review and adjustment process.
For more details on Draft Checks, please refer to the Draft Checks section in the documentation.
Archived Checks
By selecting Archived, you can view checks that have been marked as discarded or invalid from use but are still stored for future reference or restoration. Although these checks are no longer active, they can be restored if needed.
You can also categorize the archived checks based on their status as Discarded, Invalid, or view All archived checks to manage and review them effectively.
For more details on Archived Checks, please refer to the Archived Checks section in the documentation.
Check Details
Check Details provides important information about each check in the system. It shows when a check was last run, how often it has been used, when it was last updated, who made changes to it, and when it was created. This section helps users understand the status and history of the check, making it easier to manage and track its use over time.
Step 1: Locate the check you want to review, then hover over the info icon to view the Check Details.
For more steps and further information, please refer to the Check Details section in the documentation.
Status Management of Checks
Set Check as Draft
You can move an active check into a draft state, allowing you to work on the check, make adjustments, and refine the validation rules without affecting live data. This is useful when you need to temporarily deactivate a check for review and updates.
Step 1: Click on the active check that you want to move to the draft state.
To understand how to draft checks, you can follow the remaining steps from the documentation Draft Specific Check.
Activate Draft Check
You can activate a draft check after you have worked on it, made adjustments, and refined its validation rules. Activating the draft check makes it live and ensures that the defined criteria are enforced on the data.
Step 1: Navigate to the Draft check section, and click on the drafted check that you want to activate, whether you have made changes or wish to activate it as is.
To understand how to activate draft checks, you can follow the remaining steps from the documentation Activate Draft Checks.
Set Check as Archived
You can move an active or draft check into the archive when it is no longer relevant but may still be needed for historical purposes or future use. Archiving helps keep your checks organized without permanently deleting them.
Step 1: Click on the check from the list of available (whether Active or Draft) checks that you want to archive.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to set check as archived, you can follow the remaining steps from the documentation Set Check As Archived.
Restore Archived Checks
If a check has been archived, you can restore it to an active or draft state. This allows you to reuse checks that were previously archived without having to recreate them from scratch.
Step 1: Click on Archived from the navigation bar in the Checks section to view all archived checks.
To understand how to restore archived checks, you can follow the remaining steps from the documentation Restore Archived Checks.
Edit Check
You can edit an existing check to modify its properties, such as the rule type, coverage, filter clause, or description. Updating a check ensures that it stays aligned with evolving data requirements and maintains data quality as conditions change.
Step 1: Click on the check you want to edit, whether it is an active or draft check.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to edit checks, you can follow the remaining steps from the Edit Checks section in the documentation.
Mark Check as Favorite
Marking a check as a favorite allows you to quickly access and prioritize the checks that are most important to your data validation process. This helps streamline workflows by keeping frequently used or critical checks easily accessible, ensuring you can monitor and manage them efficiently. By marking a check as a favorite, it will appear in the "Favorite" category for faster retrieval and management.
Step 1: Locate the check which you want to mark as a favorite and click on the bookmark icon located on the right side of the check.
To understand how to mark check as favorite, you can follow the remaining steps from the Mark Check as Favorite section in the documentation.
Clone Check
You can clone both active and draft checks to create a duplicate copy of an existing check. This is useful when you want to create a new check based on the structure of an existing one, allowing you to make adjustments without affecting the original check.
Step 1: Click on the check (whether Active or Draft) that you want to clone.
For Demonstration purposes, we have selected the "Metric" check.
To understand how to clone check, you can follow the remaining steps from the Clone Check section in the documentation.
Create a Quality Check template
You can add checks as a Template, which allows you to create a reusable framework for quality checks. By using templates, you standardize the validation process, enabling the creation of multiple checks with similar rules and criteria across different datastores. This ensures consistency and efficiency in managing data quality checks.
Step 1: Locate the check (whether Active or Draft) that you want to add as a template and click on it.
To understand how to create a quality check template, you can follow the remaining steps from the Quality Check Template section in the documentation.
Filter and Sort
Filter and Sort options allow you to organize your checks by various criteria, such as Weight, Anomalies, Coverage, Created Date, and Rules. You can also apply filters to refine your list of checks based on Selected Source Datastores, Check Type, Asserted State (Passed, Failed, Not Asserted), Tags, Files, and Fields.
Sort
You can sort your checks by Anomalies, Coverage, Created Date, Rules, and Weight to easily organize and prioritize them according to your needs.
No | Sort By Option | Description |
---|---|---|
1 | Anomalies | Sort checks based on the number of active anomalies. |
2 | Coverage | Sort checks by data coverage percentage. |
3 | Created Date | Sort checks according to the date they were created. |
4 | Rules | Sort checks based on specific rules applied to the checks. |
5 | Weight | Sort checks by their assigned weight or importance level. |
Whatever sorting option is selected, you can arrange the data either in ascending or descending order by clicking the caret button next to the selected sorting criteria.
Filter
You can filter your checks based on values like Source Datastores, Check Type, Asserted State, Rule, Tags, File, Field, and Template.
No | Filter | Filter Value | Description |
---|---|---|---|
1 | Selected Source Datastores | N/A | Select specific source datastores to focus on their checks. |
2 | Select Tags | N/A | Filter checks by specific tags to categorize and refine results. |
3 | Check Type | All | Displays all types of checks, both inferred and authored. |
Inferred | Shows system-generated checks that automatically validate data based on detected patterns or logic. | ||
Authored | Displays user-created checks, allowing the user to focus on custom validations tailored to specific requirements. | ||
4 | Asserted State | All | Displays all checks, regardless of their asserted status. This provides a full overview of both passed, failed, and not asserted checks. |
Passed | Shows checks that have been asserted successfully, meaning no active anomalies were found during the validation process. | ||
Failed | Displays checks that have failed assertion, indicating active anomalies or issues that need attention. | ||
Not Asserted | Filters out checks that have not yet been asserted, either because they haven’t been processed or validated yet. | ||
5 | Rule | N/A | Select this to filter the checks based on specific rule type for data validation, such as checking non-null values, matching patterns, comparing numerical ranges, or verifying date-time constraints. |
6 | Template | N/A | This filter allows users to view and apply predefined check templates. |
Anomalies
Anomalies tab provides a quick overview of all detected anomalies across your source datastores. In Qualytics, an anomaly refers to a dataset (record or column) that fails to meet specified data quality checks, indicating a deviation from expected standards or norms. These anomalies are identified when the data does not satisfy the applied validation criteria. You can filter and sort anomalies based on your preferences, making it easy to see which anomalies are active, acknowledged, or archived. This section is designed to help you quickly identify and address any issues.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Explore button on the left side panel of the interface.
Step 2: Click on the "Anomalies" from the Navigation Tab.
You will be navigated to the Anomalies tab, where you'll see a list of all the detected anomalies across various tables and fields from different source datastores, based on the applied data quality checks.
Anomaly Categories
Anomalies can be classified into different categories based on their status and actions taken. These categories include All, Active, Acknowledged, and Archived anomalies. Managing anomalies effectively helps in maintaining data integrity and ensuring quick response to issues.
All
By selecting All Anomalies, you can view the complete list of anomalies, regardless of their status. This option helps you get a comprehensive overview of all issues that have been detected, whether they are currently active, acknowledged, or archived.
Active
By selecting Active Anomalies, you can focus on anomalies that are currently unresolved or require immediate attention. These are the anomalies that are still in play and have not yet been acknowledged, archived, or resolved.
Acknowledged
By selecting Acknowledged Anomalies, you can see all anomalies that have been reviewed and marked as acknowledged. This status indicates that the anomalies have been noted, though they may still require further action.
Archived
By selecting Archived Anomalies, you can view anomalies that have been resolved or moved out of active consideration. Archiving anomalies allows you to keep a record of past issues without cluttering the active list.
You can also categorize the archived anomalies based on their status as Resolved, Duplicate and Invalid, to review them effectively.
1. Resolved: This indicates that the anomaly was a legitimate data quality concern and has been addressed.
2. Duplicate: This indicates that the anomaly is a duplicate of an existing record and has already been addressed.
3. Invalid: This indicates that the anomaly is not a legitimate data quality concern and does not require further action.
4. All: Displays all archived anomalies, including those marked as Resolved, Duplicate, and Invalid, giving a comprehensive view of all past issues.
Anomaly Details
The Anomaly Details window provides information about anomalies identified during scan operations. It displays details such as the anomaly ID, status, type, detection time, and where it is in the data, such as the datastore and table. Additionally, it offers options to explore datasets, share details, and collaborate, making it easier to resolve data issues.
Step 1: Click on the anomaly whose details you want to view from the list of available anomalies (whether Active, Acknowledged, or Archived).
A modal window titled “Anomaly Details” will appear, displaying all the details of the selected anomaly.
For more details on Anomaly Details, please refer to the Anomaly Details section in the documentation.
Acknowledge Anomalies
By acknowledging anomalies, you indicate that they have been reviewed or recognized. Acknowledging anomalies helps you keep track of issues that have been addressed, even if further action is still required.
Warning
Once an anomaly is acknowledged, it remains acknowledged and never reverts to the active state.
Step 1: Click on the active anomaly from the list of available anomalies that you want to acknowledge.
Step 2: A modal window will appear displaying the anomaly details. Click on the acknowledge (👁) icon located in the upper-right corner of the modal window.
Step 3: After clicking on the Acknowledge icon, your anomaly is moved to the acknowledged state and a flash message will appear saying “The Anomaly has been successfully acknowledged”.
Archive Anomalies
By archiving anomalies, you move them to an inactive state, while still keeping them available for future reference or analysis. Archiving helps keep your active anomaly list clean without permanently deleting the records.
Step 1: Click on the anomaly that you want to archive from the list of available anomalies (whether Active or Acknowledged).
Step 2: A modal window will appear displaying the anomaly details. Click on the archive (🗑) icon located in the upper-right corner of the modal window.
Step 3: A modal window titled “Archive Anomaly” will appear, providing you with the following archive options:
-
Resolved: Choose this option if the anomaly was a legitimate data quality concern and has been addressed. This helps maintain a record of resolved issues while ensuring that they are no longer active.
-
Duplicate: Choose this option if the anomaly is a duplicate of an existing record and has already been addressed. No further action is required as the issue has been previously resolved.
-
Invalid: Select this option if the anomaly is not a legitimate data quality concern and does not require further action. Archiving anomalies as invalid helps differentiate between real issues and those that can be dismissed, improving overall data quality management.
Step 4: Once you've made your selection, click the Archive button to proceed.
Step 5: After clicking on the Archive button, your anomaly is moved to the archive and a flash message will appear saying “Anomaly has been successfully archived”.
Restore Archived Anomalies
By restoring archived anomalies, you can bring them back into the acknowledged state for further investigation or review. These anomalies will not return to the active state once they have been acknowledged.
Step 1: Click on the anomaly that you want to restore from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Restore” from the drop-down menu.
Step 3: After clicking on the “Restore” button, the selected anomaly is restored to the acknowledged state.
Assign Tags
Assigning tags to an anomaly serves the purpose of labeling and grouping anomalies and driving downstream workflows.
Step 1: Click on “Assign tags to this Anomaly” or the + button.
Step 2: A dropdown menu will appear with existing tags. Scroll through the list and click on the tag you wish to assign.
Delete Anomalies
Deleting an anomaly allows you to permanently remove a record that is no longer relevant or was logged in error. This action is done individually, ensuring that your anomaly records remain clean and up to date.
Note
You can only delete archived anomalies, not active or acknowledged anomalies. If you want to delete an active or acknowledged anomaly, you must first move it to the archive, and then you can delete it.
You can delete individual anomalies using one of two methods:
1. Delete Directly
Step 1: Click on Archived from the navigation bar in the Anomalies section to view all archived anomalies.
Step 2: Locate the anomaly that you want to delete and click on the Delete icon located on the right side of the anomaly.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking on the delete button, your anomaly is successfully deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
2. Delete via Action Menu
Step 1: Click on the archived anomaly that you want to delete from the list of archived anomalies.
Step 2: A modal window will appear displaying the anomaly details. Click on the vertical ellipsis (⋮) located in the upper-right corner of the modal window, and click on “Delete” from the drop-down menu.
Step 3: A confirmation modal window will appear, click on the Delete button to permanently remove the anomaly from the system.
Step 4: After clicking on the delete button, your anomaly is successfully deleted and a success flash message will appear saying “Anomaly has been successfully deleted”.
Filter and Sort
Filter and Sort options allow you to organize your anomalies by various criteria, such as Weight, Anomalous Record, and Created Date. You can also apply filters to refine your list of anomalies based on Selected Source Datastores, Selected Tags, Timeframe, Type, and Rule.
Sort
You can sort your anomalies by Anomalous Record, Created Date, and Weight to easily organize and prioritize them according to your needs.
No | Sort By Option | Description |
---|---|---|
1 | Anomalous Record | Sorts anomalies based on the number of anomalous records identified. |
2 | Created Date | Sorts anomalies according to the date they were detected. |
3 | Weight | Sorts anomalies by their assigned weight or importance level. |
Whichever sorting option you select, you can arrange the data in either ascending or descending order by clicking the caret button next to the selected sorting criterion.
Filter
You can filter your anomalies based on values like Source Datastores, Timeframe, Type, Rule, and Tags.
No. | Filter | Description |
---|---|---|
1 | Selected Source Datastore | Select specific source datastores to focus on their anomalies. |
2 | Select Tags | Filter anomalies by specific tags to categorize and prioritize issues effectively. |
3 | Timeframe | Filters anomalies detected within a specific time range (e.g., anomalies detected in the last week or year). |
4 | Type | Filter anomalies based on anomaly type (Record or Shape). |
5 | Rule | Filter anomalies based on specific rules applied to the anomaly. |
Ended: Explore
Notifications ↵
Notifications: Overview
Notifications in Qualytics offer a powerful system for delivering crucial alerts and updates across various communication channels. By setting up notification rules with specific triggers and channels, users can ensure timely awareness of critical events. This functionality enhances productivity, optimizes incident response, and promotes effective data management within the Qualytics platform.
Let’s get started 🚀
Multiple Notification Channels
Qualytics emphasizes the configuration of multiple notification channels, which are crucial for ensuring that important alerts and updates reach users effectively through various platforms.
In-App Notifications
In-app notifications in Qualytics are real-time alerts that keep users informed about various events related to their data operations and quality checks. These notifications are displayed within the Qualytics interface and cover a range of activities, including operation completions and anomaly detections.
External Platforms
Qualytics allows users to choose external platforms where they want to receive notifications, enhancing integration with their existing workflows. You can:
- Add Email Notification
- Add HTTP Notification
- Add Microsoft Teams Notification
- Add PagerDuty Notification
- Add Slack Notification
- Add Webhook Notification
This versatility ensures that notifications are delivered through the most convenient and effective channels for each team, allowing them to stay informed and respond to data quality issues in real time.
Note
If you do not select any notification channel, by default, you will receive notifications via in-app notifications. However, if you choose a notification channel, such as Email, you will receive notifications through both the selected channel and in-app notifications.
Tip
Qualytics provides you with multiple options for receiving notifications. You can select one or more notification channels to receive your notifications.
Reducing Notification Fatigue
Users can customize their notification settings to define which alerts they wish to receive and how they want to be notified. This flexibility ensures that users are only informed about relevant events, reducing notification fatigue.
For example, assigning tags when adding a notification generates alerts only for source datastores that have the tags defined in the notification rule. If the tag “critical” is defined while adding the notification rule, then notifications will be generated only for source datastores that have the “critical” tag.
User Feedback Integration
The system allows for feedback on detected anomalies, which helps refine the anomaly detection process and reduce alert fatigue. This ensures that users are only notified about the most relevant issues. Notifications are enhanced by incorporating user feedback on actions taken.
Navigation to Notifications
Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Add Notification Rule
In Qualytics, notification rules send in-app messages by default and can also notify you via external applications like Email, Slack, Microsoft Teams, and many more. This helps you stay updated on important events, like when an operation completes, even if you're not using the app. You can customize these alerts by adding datastore tags and choosing your preferred notification channels, ensuring you get timely updates.
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window “Add Notification Rule” will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages and, if configured, via external notification channels such as email, Slack, Microsoft Teams, and others. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification channel: Select the notification channel where you want your alerts to be sent. This ensures you get notified in the way you prefer.
Channels | Description | References |
---|---|---|
Emails | Send notifications directly to your specified email addresses. | See more. |
HTTP Action | Triggers an HTTP action to notify a specific endpoint or service. | See more. |
Microsoft Teams | Sends notifications to a specified Microsoft Teams channel. | See more. |
PagerDuty | Integrates with Pager Duty to alert you through your PagerDuty setup. | See more. |
Slack | Sends notifications to a specific Slack channel. | See more. |
Webhook | Sends notifications via webhooks to custom endpoints you configure. | See more. |
Note
If you do not select any notification channel, you will receive notifications by default via in-app notifications. However, if you choose a notification channel, such as Email, you will receive notifications through both the selected channel and in-app notifications.
Tip
Qualytics provides you with multiple options for receiving notifications. You can select one or more notification channels to get notified.
Step 3: Once you have selected your preferred notification channels, then click on the “Save” button.
After clicking on the “Save” button, a confirmation message will be displayed saying “Notification successfully created”.
Available Variables
Qualytics provides a set of internal variables that you can use to customize your notification messages. These variables dynamically insert specific information related to the triggered event, ensuring that your notifications are both relevant and informative. Below is a list of variables categorized by the type of operation:
Event | Variable | Description |
---|---|---|
When an Operation Completes | {{ rule_name }} | The name of the rule associated with the operation. |
{{ target_link }} | A link related to the operation. | |
{{ datastore_name }} | The name of the datastore involved. | |
{{ operation_message }} | A custom message related to the operation. | |
{{ operation_type }} | The type of operation performed. | |
{{ operation_result }} | The result of the operation. | |
When an Anomaly is Detected | {{ rule_name }} | The name of the rule associated with the detected anomaly. |
{{ target_link }} | A link to the relevant target or source. | |
{{ datastore_name }} | The name of the datastore where the anomaly was detected. | |
{{ anomaly_message }} | A custom message related to the anomaly. | |
{{ anomaly_type }} | The type of anomaly detected. | |
{{ check_description }} | A description of the check that detected the anomaly. | |
When Anomalies Are Detected in a Table or File | {{ rule_name }} | The name of the rule associated with the anomaly detection. |
{{ target_link }} | A link to the relevant target or source. | |
{{ datastore_name }} | The name of the datastore where the anomaly was detected. | |
{{ anomaly_count }} | The number of anomalies detected. | |
{{ scan_target_name }} | The name of the scan target (table or file). | |
{{ anomaly_message }} | A custom message related to the detected anomalies. | |
{{ check_description }} | A description of the check that detected the anomaly. |
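These placeholders are replaced with values from the triggering event when the notification is delivered. As a rough illustration only (not the platform's internal implementation), the sketch below shows how {{ variable }} placeholders can be interpolated from an event payload; the payload dictionary and its values here are hypothetical.

```python
import re

# Hypothetical event payload for illustration; the variable names mirror the
# table above, but the values are made up.
event = {
    "operation_type": "Scan Operation",
    "rule_name": "Max Partition Size",
    "datastore_name": "CustomerDataStore",
    "operation_result": "Success",
}

template = (
    "Operation {{ operation_type }} completed for the rule {{ rule_name }} "
    "on datastore {{ datastore_name }}. The result of the operation is "
    "{{ operation_result }}."
)

def render(message: str, context: dict) -> str:
    """Replace each {{ variable }} with its value; unknown variables are left untouched."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        message,
    )

print(render(template, event))
# Operation Scan Operation completed for the rule Max Partition Size on
# datastore CustomerDataStore. The result of the operation is Success.
```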
Notification Examples
I. When an Operation Completes
Message Template:
Operation {{ operation_type }} completed for the rule {{ rule_name }} on datastore {{ datastore_name }}. The result of the operation is {{ operation_result }}. You can review the details here: {{ target_link }}. Additional information: {{ operation_message }}.
Custom Message Example:
Operation Scan Operation completed for the rule Max Partition Size on datastore CustomerDataStore. The result of the operation is Success. You can review the details here: https://<your-instance>.qualytics.io/datastores/<datastore-id>/activity?operation_id=<operation-id>. Additional information: The scan confirmed that all partitions are within the allowed maximum number of records, ensuring data consistency and performance efficiency.
II. When an Anomaly is Detected
Message Template:
Anomaly detected by rule {{ rule_name }} in the datastore {{ datastore_name }}. The anomaly type is {{ anomaly_type }}. For more details, see here: {{ target_link }}. Additional info: {{ anomaly_message }} and check description: {{ check_description }}.
Custom Message Example:
Anomaly detected by rule Required Values in the datastore SalesDataStore. The anomaly type is record. For more details, see here: https://<your-instance>.qualytics.io/anomaly-details. Additional info: A record was found in the CUSTOMER table that lacks a required value for the field 'Customer_Status' and check description: This rule asserts that all defined values, such as 'Customer_Status', must be present at least once within a field to ensure data completeness and integrity.
III. When Anomalies Are Detected in a Table or File
Message Template:
Alert! {{ anomaly_count }} anomalies were detected in {{ scan_target_name }} within the datastore {{ datastore_name }}. The rule {{ rule_name }} triggered this detection. Anomaly details: {{ anomaly_message }}. You can review the details here: {{ target_link }}. Description of the check: {{ check_description }}.
Custom Message Example:
Alert! 1 anomaly was detected in the LINEITEM table within the datastore SalesDataStore. The rule MaxValue_Rule_L_QUANTITY triggered this detection. Anomaly details: The quantity of items (L_QUANTITY) exceeded the maximum allowed value of 50. You can review the details here: https://<your-instance>.qualytics.io/anomaly-details. Description of the check: This check asserts that the quantity of items (L_QUANTITY) in the LINEITEM table does not exceed a value of 50, ensuring data accuracy and preventing potential overflows.
Manage Notifications
In Qualytics, managing notifications gives you full control over how and when you receive alerts. You can customize notifications to fit your needs, mute or unmute them to manage distractions and delete notifications to keep your system organized. By effectively managing your notifications, you ensure that critical updates reach you at the right time, while unnecessary alerts are minimized, allowing you to stay focused on what matters most.
Let's get started 🚀
Navigation
Qualytics gives you full control over your notifications, allowing you to edit notification rules, mute notifications, and delete unwanted notification rules. You can navigate to the notifications management settings via two different methods as discussed below.
Method I: User Interface
Step 1: Log in to your Qualytics account and click the bell "🔔" icon located in the top right navigation bar.
Here, you can view all the notifications that have been triggered based on your notification rules. You can also edit or mute the notification rule directly from this interface.
You can switch between the different types of notifications to have a categorical view (such as All, Operations, or Anomalies).
Method II: Global Settings
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here, you can view a list of all the notification rules you’ve added to the system. From this interface, you can easily manage (edit, delete, or mute) your notification rules as needed.
Edit Notifications Rule
You can edit the notification rules by adjusting trigger conditions, customizing the notification message, adding datastore tags, and adding or removing notification channels. These management options make sure your notifications fit your needs and preferences. There are two methods from which you can edit your notification rules.
Method I: From User Interface
You can directly navigate to the notification generated and appearing on the user interface and edit the rule that triggered such notification.
Note
If you edit the notification rule directly from the User Interface, the changes will apply to the entire rule, ensuring that all future notifications triggered from such an updated rule follow the latest modification.
Step 1: Log in to your Qualytics account and click the bell "🔔" icon located in the top right navigation bar.
Here, you can view all notifications that have been triggered based on the notification rules.
Step 2: Hover over any of the notifications from the list and an action menu will appear. Click on the vertical ellipsis within the menu to view the notification management options.
Step 3: Click on the Edit button to start editing and managing the trigger settings that produced the existing notification.
Step 4: A modal window will appear, allowing you to edit the notification rule. Here, you can:
- Adjust trigger conditions
- Modify notification messages
- Add or remove datastore tags
- Update Notification Channels
Step 5: Once you have edited the notification rule or made necessary changes, click on the Save button.
After clicking on the Save button, a success flash message will appear saying “Notification successfully updated”.
Method II: From Global Settings
You can also edit notification rules from the global settings. After making changes, such as adjusting the trigger conditions or adding tags, the notifications will now trigger based on the updated rule instead of the previous settings.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification rule you want to update and select the "Edit" option from the drop-down menu.
Step 3: A modal window will appear, allowing you to edit the notification rule. Here, you can:
- Adjust trigger conditions
- Modify notification messages
- Add or remove datastore tags
- Update notification channels
Step 4: Once you have edited the notification rule or made necessary changes, click on the Save button.
After clicking on the Save button, a success flash message will be displayed saying “Notification successfully updated”.
Mute Notifications
You can temporarily turn off alerts for specific events without deleting them by muting notifications. This is helpful when you want to avoid distractions but still have the option to reactivate the notifications later. By muting, you can control which alerts you receive, focusing only on what’s most important.
You can mute your notifications through two different methods, as discussed below.
Method I: From User Interface
You can mute notifications directly from the user interface, which allows you to silence specific alerts while keeping them available for later. This helps reduce distractions without losing track of important information. If needed, you can unmute them at any time.
Note
Muting a notification from the User Interface will mute the entire notification rule, stopping all future notifications from that rule.
Step 1: Log in to your Qualytics account and click the bell “🔔” icon located in the top right navigation bar.
Here, you can view all notifications that have been triggered based on the notification rule.
Step 2: Hover over any of the notifications in the list and an action menu will appear. Click on the vertical ellipsis within the menu to view the notification management options.
Step 3: Click on the Mute button from the dropdown list to mute the notification.
After clicking the Mute button, your notification will be muted, and a success flash message will appear saying, Notification successfully muted.
Method II: From Global Settings
You can also mute notifications from the global settings. This helps you focus by reducing distractions and ensuring that only the most important alerts come through on your notification channels when you need them.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification rule that you want to mute, then select Mute from the drop-down menu.
Step 3: After clicking the Mute button, all the notifications related to that rule will be muted, and a success flash message will be displayed: Notification successfully muted.
Unmute Notifications
Step 1: To unmute a notification, click on the vertical ellipsis next to the muted notification with the crossed-out bell icon, then select Unmute from the drop-down menu.
Step 2: After clicking on the Unmute button, your notification will be unmuted and a success flash message will be displayed saying Notification successfully unmuted.
Delete Notifications Rule
If you need to tidy up your notifications, you can delete a notification rule to remove it from the system permanently. Once deleted, the rule and its associated notifications will no longer appear in the user interface.
Step 1: Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Here you can view a list of all the notification rules you have added to the Qualytics system.
Step 2: Click on the vertical ellipsis next to the notification you want to delete, then select "Delete" from the drop-down menu.
Step 3: A modal window Delete Notification will appear. Click the Delete button to confirm and remove the notification.
After clicking the Delete button, your notification will be removed, and a success flash message will appear stating, Notification successfully deleted.
Channels ↵
Email Notification
Adding email notifications allows users to receive timely updates or alerts directly in their inbox. By setting up notifications with specific triggers and channels, you can ensure that you are promptly informed about critical events, such as operation completions or detected anomalies. This proactive approach allows you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Let’s get started 🚀
Navigation to Notifications
Log in to your Qualytics account and click the "Notification Rules" button on the left side panel of the interface.
Add Email Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages and, if configured, via external notification channels such as email, Slack, Microsoft Teams, and others. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select Email as your notification channel and enter the Email Address where you want the notification to be sent.
Test Email Notification
Step 1: Click the Test Notification button to send a test email to the provided address. If the email is successfully sent, you will receive a confirmation message indicating “Notification successfully sent”.
Step 2: The test email will be sent to the address you have provided. This verifies that the address provided is correct.
Save Email Notification
Step 1: Once you have entered all the values and selected Email as your notification channel, click on the Save button.
After clicking the Save button, a success message will be displayed saying Notification Successfully Created.
Post Results
Once you’ve saved your notification rules settings, all real-time notifications will be sent to the email address you specified, ensuring you receive them directly in your inbox.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the email address you provided. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
HTTP Action
Integrating HTTP Action notifications allows users to receive timely updates or alerts directly to a specified server endpoint. By setting up HTTP Action notifications with specific trigger conditions, you can ensure that you are instantly informed about critical events, such as operation completions or anomalies detected. This approach enables you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the “Notification Rules” button on the left side panel of the interface.
Add HTTP Action Notification
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window, “Add Notification Rule” will appear, providing options to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and the HTTP action. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select “HTTP Action” as your notification channel and enter the details where you want the notification to be sent.
Step 3: Enter the following detail where you want the notification to be sent.
1. Action URL: Enter the “Action URL” in this field, which specifies the server endpoint for the HTTP request, defining where data will be sent or retrieved. It must be correctly formatted and accessible, including the protocol (http or https), domain, and path.
2. HTTP Verbs: HTTP verbs specify the actions performed on server resources. Common verbs include:
-
POST: Use POST to send data to the server to create something new. For example, it's used for submitting forms or uploading files. The server processes this data and creates a new resource.
-
PUT: Updates or creates a resource, replacing it entirely if it already exists. For example, updating a user’s profile information or creating a new record with specific details.
-
GET: Retrieves data from the server without making any modifications. For example, requesting a webpage or fetching user details from a database.
3. Username: Enter the username needed for authentication.
4. Auth Type: This field specifies how to authenticate requests. Choose the method that fits your needs:
-

Basic: Uses a username and password sent with each request. Example: “Authorization: Basic <base64-encoded username:password>”.

-

Bearer: Uses a token included in the request header to access resources. Example: “Authorization: Bearer <token>”.

-

Digest: Provides a more secure authentication method by using a hashed combination of the username, password, and request details. Example: Authorization: Digest username="<username>", realm="<realm>", nonce="<nonce>", uri="<uri>", response="<response>".
5. Secret: Enter the password or token used for authentication. This is paired with the Username and Auth Type to securely access the server. Keep the secret confidential to ensure security.
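If you operate the service behind the Action URL yourself, the following is a minimal sketch of an endpoint that could receive these requests. It is only a sketch under stated assumptions: the route path and port are hypothetical, the handler simply logs whatever JSON body arrives (the exact request payload is not documented here), and it checks the standard Basic scheme described above against the Username and Secret configured in the rule.

```python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

EXPECTED_USER = "qualytics"    # must match the Username field in the rule
EXPECTED_SECRET = "change-me"  # must match the Secret field in the rule

def basic_auth_ok(header: str) -> bool:
    """Validate an 'Authorization: Basic <base64(user:secret)>' header."""
    if not header.startswith("Basic "):
        return False
    try:
        user, _, secret = base64.b64decode(header[6:]).decode().partition(":")
    except Exception:
        return False
    return user == EXPECTED_USER and secret == EXPECTED_SECRET

# Hypothetical route path; use whatever path you configure as the Action URL.
@app.route("/qualytics-notifications", methods=["POST"])
def receive_notification():
    if not basic_auth_ok(request.headers.get("Authorization", "")):
        return jsonify({"error": "unauthorized"}), 401
    payload = request.get_json(silent=True) or {}
    app.logger.info("Received notification payload: %s", payload)
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8080)
```

With this running, the Action URL would point at the /qualytics-notifications path on your host, with POST as the HTTP verb, Basic as the Auth Type, and the matching Username and Secret.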
Test HTTP Action Notification
Step 1: Click the "Test Notification" button to verify the correctness of the Action URL. If the URL is correct, a confirmation message saying "Notification successfully sent" will appear, confirming that the HTTP action is set up and functioning properly.
If you enter an incorrect Action URL, you will receive a failure message. For example, if you enter an incorrect URL endpoint like “test-message”, you will see a failure message indicating "failure: HTTP action returned 404: {"error": "Error: Endpoint not found with that path and method"}." This message shows that the specified endpoint could not be found.
Save HTTP Action Notification
Once you have provided all the necessary values, set the trigger conditions for the notification, and verified the correctness of the Action URL, click the "Save" button.
After clicking the “Save” button, a success message will be displayed saying "Notification Successfully Created".
Microsoft Teams Notification
Integrating Microsoft Teams notifications allows users to receive timely updates or alerts directly in their Teams channel. By setting up Microsoft Teams notifications with specific trigger conditions, you can ensure that you are instantly informed about critical events, such as operation completions or anomalies detected. This approach allows you to take immediate action when necessary, helping to address issues quickly and maintain the smooth and efficient operation of your processes.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the “Notification Rules” button on the left side panel of the interface.
Add Microsoft Teams Notification
Step 1: Click on the “Add Notifications” button located in the top right corner.
A modal window “Add Notification Rule” will appear providing you with options to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and the Microsoft Teams channel. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select “Microsoft Teams” as your notification channel and enter the “Webhook URL” where you want the notification to be sent.
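If you want to sanity-check the webhook URL outside Qualytics before saving the rule, the snippet below assumes a standard Microsoft Teams incoming webhook, which accepts a simple JSON payload with a "text" field; the URL shown is a placeholder.

```python
import requests

# Placeholder: paste the incoming webhook URL configured for your Teams channel.
webhook_url = "https://<your-tenant>.webhook.office.com/webhookb2/..."

# Standard incoming webhooks accept a minimal JSON payload with a "text" field.
response = requests.post(webhook_url, json={"text": "Webhook connectivity check"})
response.raise_for_status()
print("Teams accepted the message with status", response.status_code)
```

The in-product Test Notification button described below performs an equivalent check from within Qualytics.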
Test Microsoft Teams Notification
Step 1. Click the "Test Notification" button to send a test message to the provided “Webhook URL”. If the message is successfully sent, you will receive a confirmation notification indicating "Notification successfully sent".
Step 2: The test notification will be sent to the Webhook URL address you have provided. This verifies that the address provided is correct.
Save Microsoft Teams Notification
Step 1: Once you have entered all the values and selected “Microsoft Teams” as your notification channel, click on the “Save” button.
After clicking the Save button, a message will appear on the screen saying "Notification Successfully Created".
Post Results
Once you’ve saved your notification rules settings, all real-time notifications will be sent to the Webhook URL address you specified, ensuring you receive them directly in your Microsoft Teams channel.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the Microsoft Teams channel as configured. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
PagerDuty Notification
Integrating PagerDuty with Qualytics ensures that your team gets instant alerts for critical data events and system issues. With this connection, you can automatically receive real-time notifications about anomalies, operation completions and other important events directly in your PagerDuty account. By categorizing alerts based on severity, it ensures the right people are notified at the right time, speeding up decision-making and resolving incidents efficiently. This helps your team respond quickly to issues, reducing downtime and keeping data operations on track.
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Notification Rules button on the left side panel of the interface.
Add PagerDuty Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title to your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app notifications and PagerDuty. For example, the team is notified whenever the catalog operation is completed, helping them proceed with the profile operation on the datastore.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes information about the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight is not considered for this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete will suggest and recommend relevant variables in the dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the “critical” datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select PagerDuty as your notification channel and enter the Integration Key where you want the notification to be sent.
Info
For detailed instructions on creating or configuring the PagerDuty integration, please refer to the PagerDuty documentation available here.
7. Severity: Select the appropriate PagerDuty severity level to categorize incidents based on their urgency and impact. The available severity levels are:
-
Info: For informational messages that don't require immediate action but provide helpful context.
-
Warning: For potential issues that may need attention but aren't immediately critical.
-
Error: For significant problems that require prompt resolution to prevent disruption.
-
Critical: For urgent issues that demand immediate attention due to their severe impact on system operations.
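For context on how these levels are used downstream, they correspond to the severity field of a PagerDuty Events API v2 event. Qualytics delivers events on your behalf using the Integration Key you provide; the sketch below only illustrates the general shape of such an event, and the routing key, summary, and source values are placeholders.

```python
import requests

# Illustrative PagerDuty Events API v2 "trigger" event; values are placeholders.
event = {
    "routing_key": "<your-integration-key>",  # the Integration Key entered in the rule
    "event_action": "trigger",
    "payload": {
        "summary": "Anomaly detected in SalesDataStore",  # example alert text
        "source": "qualytics",
        "severity": "error",  # one of: info, warning, error, critical
    },
}

# The Events API v2 enqueue endpoint accepts the event as a JSON body.
response = requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
response.raise_for_status()
print(response.json())
```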
Test PagerDuty Notification
Click on the Test notification button to check if the integration key is functioning correctly. Once the test notification is sent, you will see a success message, "Notification successfully sent."
This confirms that the integration is properly configured and that the PagerDuty account will receive notifications as expected.
Save PagerDuty Notification
Step 1: Once you have entered all the values and selected PagerDuty as your notification channel, click on the Save button.
After clicking the Save button, a success flash message will be displayed saying "Notification Successfully Created".
Post Results
Once you’ve saved the PagerDuty notification rule, all real-time notifications will be sent to the integration key you specified.
For example, when an operation is completed or an anomaly is detected in a table or file, a notification will be sent to the integration key you provided. This keeps you informed of critical events in real-time, allowing you to take quick action to maintain smooth operations.
Slack Notification
To set up Slack notifications, start by naming your notification and selecting the triggers, such as operation completion or anomaly detection. Next, add relevant tags and configure the Slack Webhook URL to connect directly to your Slack channel.
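If you want to confirm the Slack Webhook URL independently before adding it to a rule, a standard Slack incoming webhook accepts a JSON body with a "text" field and replies with the literal text "ok"; the URL below is a dummy placeholder.

```python
import requests

# Placeholder: paste your Slack incoming webhook URL here.
webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

response = requests.post(webhook_url, json={"text": "Webhook connectivity check"})
response.raise_for_status()
print(response.text)  # a healthy incoming webhook responds with "ok"
```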
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Add Slack Notification
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title for your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the notification rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages, and a Slack notification is sent to the configured channel. For example, when a catalog operation is completed, a Slack notification is sent, allowing the team to proceed with the profile operation on the datastore efficiently.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight does not apply to this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete suggests relevant variables in a dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
6. Notification Channel: Select Slack as your notification channel and enter the Webhook URL where you want the notification to be sent.
Info
Check the official Slack documentation here for instructions on how to create or configure the Slack webhook URL.
Test Slack Notification
Step 1: Click the "Test Notification" button to send a test message to the provided Webhook URL. If the message is successfully sent, you will receive a confirmation notification indicating "Notification successfully sent".
Step 2: The test message will be sent to the Webhook URL you have provided. This verifies that the URL is correct.
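If you want to verify the webhook URL independently of Qualytics, the sketch below posts a simple message directly to a Slack incoming webhook. The URL shown is a placeholder; replace it with the incoming webhook URL generated in Slack.

```python
# A minimal sketch that posts a test message to a Slack incoming webhook,
# useful for confirming the webhook URL works before saving it in Qualytics.
# The URL below is a placeholder; use your own incoming webhook URL.
import requests

webhook_url = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # placeholder
message = {"text": "Test: the Qualytics Slack notification webhook is reachable."}

resp = requests.post(webhook_url, json=message, timeout=30)
resp.raise_for_status()
print(resp.text)  # Slack replies with the plain text "ok" on success
```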
Save Slack Notification
Step 1: Once you have entered all the values and selected Slack as the notification channel, click on the Save button.
After clicking the Save button, a message will appear on the screen saying "Notification Successfully Created".
Post Results
Once you’ve saved your notification rule settings, all real-time notifications will be sent to the Webhook URL you specified, ensuring you receive them directly in your Slack channel.
For example, when an operation is completed, or if any anomalies are detected in a table or file, a notification will be sent to the Slack webhook you provided. This ensures you are promptly alerted to critical events or irregularities, enabling immediate action when necessary. By providing real-time updates, it helps maintain the integrity and smooth operation of your processes.
Webhook Notifications
Qualytics allows you to connect external apps for notifications using webhooks, making it easy to stay updated in real time. When you set up a webhook, it sends an instant alert to the connected app whenever a specific event or condition occurs. This means you are notified about important events as they happen and can respond right away. By using webhook notifications, you can keep your system running smoothly, keep everyone informed, and manage your operations more efficiently.
Let’s get started 🚀
Navigation to Notifications
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Add Webhook Notification
By adding a webhook for any external application where you want to send your notifications, you'll receive real-time alerts whenever specific conditions or events occur, such as when an operation is completed or an anomaly is detected within a table or file.
Step 1: Click on the Add Notifications button located in the top right corner.
A modal window Add Notification Rule will appear providing you with fields to set notification rules.
Step 2: Enter the following details to add the notification rule.
1. Name: Enter a specific and descriptive title for your notification rule to easily identify its purpose.
2. Description: Provide a brief description of what the rule does or when it should trigger.
3. Trigger When: Select the event or condition from the dropdown menu that will trigger the notification. Below is the list of available events you can choose from:
-
Operation Completion: This type of notification is triggered whenever an operation, such as a catalog, profile, or scan, is completed on a source datastore. Upon completion, teams are promptly notified through in-app messages, and a webhook is triggered to send notifications to external systems or applications. For example, when a catalog operation is completed, a webhook notification is sent, allowing the team to proceed with the profile operation on the datastore efficiently.
-
An Anomaly is Identified: This type of notification is triggered when any single anomaly is identified in the data. The notification message typically includes the type of anomaly detected and the datastore where it was found. It provides specific information about the anomaly type, which helps quickly understand the issue's nature.
Tip
Users can specify a minimum anomaly weight for this trigger condition. This threshold ensures that only anomalies with a weight equal to or greater than the specified value will trigger a notification. If no value is set, all detected anomalies, regardless of their weight, will generate notifications. This feature helps prioritize alerts based on the importance of the anomalies, allowing users to focus on more critical issues.
Tip
Users can specify check rule types for this trigger condition. This selection ensures that only anomalies identified by the chosen rule types will trigger a notification. If no check rule types are selected, this filter will be ignored, resulting in all anomalies generating notifications. This feature enables users to prioritize alerts based on specific criteria, allowing them to focus on the most relevant issues.
- Anomalies are Detected in a Table or File: This notification is triggered when multiple anomalies are detected within a specific table or file, optionally filtered by check rule types. It includes the number of anomalies found and the specific scan target within the datastore. This is useful for assessing the overall health of a particular datastore. Anomaly weight does not apply to this trigger.
Factors | An Anomaly is Identified | Anomalies are Detected in a Table or File |
---|---|---|
Trigger Event | Notifies for individual anomaly detection | Notifies for multiple anomalies within a specific table or file |
Notification Content | Focuses on the type of anomaly and the affected datastore. | Provides a count of anomalies and specifies the scan target within the datastore. |
Notification Targeting | Tags, Weight, and Check Rule Types | Tags, Check Rule Types, or both |
4. Message: Enter your custom message using variables in the Message field, where you can specify the content of the notification that will be sent out.
Tip
You can write your custom notification message by utilizing the autocomplete feature. This feature allows you to easily insert internal variables such as {{ rule_name }}, {{ container_name }}, and {{ datastore_name }}. As you start typing, the autocomplete suggests relevant variables in a dropdown.
5. Datastore Tags: Use the drop-down menu to select the datastore tags. Notifications will be generated for only those source datastores that have the datastore tags you select in this step. For example, if you select the "critical" datastore tag from the dropdown menu, notifications will be generated only for source datastores having the "critical" tag applied to them.
Note
If you choose "An Anomaly is Detected" as the trigger condition, you must define the Anomaly Tag, set a minimum anomaly weight, and select the check rule types. This ensures that only anomalies with a weight equal to or greater than the specified value and matching the selected check rule types will trigger a notification. If no weight or check rule types are specified, these filters will be ignored.
Step 6: Select "Webhook" as the notification channel and enter the desired "Webhook URL" of the target system where you want to receive notifications.
Info
Please refer to the official documentation of the target system for detailed instructions on how to create or configure the webhook URL.
Test Webhook Notification
This feature lets you send a test message to your configured webhook URL to verify that it is correctly set up and receiving notifications. It helps ensure that your integration is working before real events trigger alerts.
Step 1: Click on the "Test Notification" button to send a test notification to the webhook URL you provided. If the webhook URL is correct, you will receive a confirmation message saying "Notification successfully sent." This indicates that the webhook is functioning correctly.
The test sends a sample payload as an HTTP POST request to the configured URL endpoint. This confirms that the webhook is correctly configured and that notifications will be delivered to the intended endpoint when real events occur.
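If you need a quick endpoint to receive these test requests while setting things up, the sketch below runs a minimal HTTP server that accepts the POST and prints whatever body it receives. No payload structure is assumed here; inspect the printed output to see exactly what Qualytics sends to your endpoint.

```python
# A minimal sketch of an HTTP endpoint that accepts webhook POSTs and prints
# the request body, so you can inspect exactly what Qualytics sends. Expose
# this server (for example through a public tunnel) and use its address as
# the Webhook URL in the notification rule.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            print(json.dumps(json.loads(body), indent=2))  # pretty-print JSON payloads
        except ValueError:
            print(body.decode(errors="replace"))           # fall back to raw text
        self.send_response(200)                            # acknowledge receipt
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```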
Save Webhook Notification
Once you have provided all the necessary values, set the trigger conditions for the notification, and tested the notification, click the Save button.
After clicking the “Save” button, a success message will be displayed saying "Notification Successfully Created".
Post Results
After setting up a notification in Qualytics to be sent via webhook, the system watches for the specific events or conditions you've defined. When one of these events happens, the webhook is triggered, and Qualytics sends an HTTP POST request to the external application's URL. This request contains detailed information about the event, enabling the external application to take action.
As a result, the external application can update dashboards, trigger alerts, or integrate the data into workflows in real-time. This ensures you can quickly respond to important events or data changes in Qualytics.
Ended: Channels
Ended: Notifications
Tags ↵
Tags
Tags allow users to categorize and organize data assets effectively and provide the ability to assign weights for prioritization. They drive notifications and downstream workflows, enabling users to stay informed and take appropriate actions. Tags can be configured and associated with specific properties, allowing for targeted actions and efficient management of entities across multiple datastores.
Tags can be applied to Datastores, Profiles, Fields, Checks, and Anomalies, streamlining data management and improving workflow efficiency. Overall, tags enhance organization, prioritization, and decision-making.
Let’s get started 🚀
Navigation to Tags
Step 1: Log in to your Qualytics account and click on Tags in the left side panel of the interface.
You will be navigated to the Tags section, where you can view all the tags available in the system.
Add Tag
Step 1: Click on the Add Tag button from the top right corner.
Step 2: A modal window will appear, providing the options to create the tag. Enter the required values to get started.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1. | Preview | This shows how the tag will appear to users. | Preview |
2. | Name | Assign a name to your tag. | Sensitive |
3. | Color | A color picker feature is provided, allowing you to select a color using its hex code. | #E74C3C |
4. | Description | Explain the nature of your tag. | Maintain data that is highly confidential and requires strict access controls. |
5. | Category | Choose an existing category or create a new one to group related tags for easier organization. | Demo2 |
6. | Weight Modifier | Adjust the tag's weight for prioritization, where a higher value represents greater significance. The range is between -10 and 10. | 10 |
Step 3: Click on the Save button to save your tag.
View Created Tags
Once you have created a tag, you can view it in the tags list.
Filter and Sort
Qualytics allows you to sort and filter your tags so that you can easily organize and find the most relevant tags according to your criteria, improving data management and workflow efficiency.
Sort
You can sort your tags by Color, Created Date, Name, and Weight to easily organize and prioritize them according to your needs.
Filter
You can filter your tags by type and category, which allows you to categorize and manage them more effectively.
Filter by Type
Filter by Type allows you to view and manage tags based on their origin. You can filter between Global tags created within the platform and External tags imported from integrated systems like Atlan or Alation.
-
External Tags: External tags are metadata labels imported from an integrated data catalog system, such as Atlan or Alation, into Qualytics. These tags are synchronized automatically via API integrations and cannot be created or edited manually within Qualytics. They help ensure consistency in data tagging across different platforms by using the same tags already established in the data catalog. Example: If Atlan has a tag named Customer, once integrated, this tag will automatically be synchronized and added to Qualytics as an external tag.
-
Global Tags: Global tags are metadata labels that are created and managed directly within Qualytics. These tags are not influenced by external integrations and are used internally within the Qualytics platform to organize and categorize data according to the users' requirements. Example: A tag created within Qualytics to mark datasets that need internal review. This tag is fully managed within the Qualytics platform and remains unaffected by external data catalog systems unless the Overwrite Tags option is enabled in the Integration configuration.
Filter by Category
Filter by Category allows you to organize and manage tags based on predefined groups or categories. By applying this filter, you can quickly locate tags that belong to a specific category, improving searchability and making it easier to manage large volumes of data.
Manage Tags
You can easily manage your tags by keeping them updated with current information and removing outdated or unnecessary tags. This ensures that your data remains organized and relevant, enhancing overall efficiency and workflow. By efficiently managing tags, you improve data handling and ensure high data standards across the platform.
Edit Tags
This allows you to keep your tags updated with current information and relevance.
Step 1: Click the vertical ellipsis (⋮) next to the tag that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the tag's name, color, description, category and weight as needed.
Step 3: Click the Save button to apply your changes.
Delete Tags
This allows you to remove outdated or unnecessary tags to maintain a clean and efficient tag system.
Step 1: Click the vertical ellipsis (⋮) next to the tag that you want to delete, then click on Delete from the dropdown menu.
Step 2: After clicking the Delete button, your tag will be removed from the system, and a success message will appear saying Tag successfully deleted.
Applying a Tag
Once a Tag is created, it's ready to be associated with a Datastore, Profile, Check, Notification, and ultimately an Anomaly.
Tag Inheritance
-
When a Tag is applied to a data asset, all the descendants of that data asset also receive the Tag.
- For example, if a Tag named Critical is applied to a Datastore, then all the Tables, Fields, and Checks under that Datastore also receive the Tag.
Note
Anomalies will inherit the tags if a scan has been run.
-
Likewise, if the Critical Tag is subsequently removed from one of the Tables in that Datastore, then all the Fields and Checks belonging to that Table will have the Critical Tag removed as well.
-
When a new data asset is created, it inherits the Tags from the owning data asset. For example, if a user creates a new Computed Table, it inherits all the Tags that are applied to the Datastore in which it is created.
Tagging Anomalies
-
Anomalies also inherit Tags at the time they are created. They inherit all the Tags of all the associated failed checks.
-
Thus, Anomalies do not inherit subsequent tag changes from those checks; they inherit tags from checks only once, at creation time.
-
Tags can be directly applied to or removed from Anomalies at any time after creation.
Ended: Tags
Settings ↵
Connections
The Connections Management section allows you to manage global configurations for various connections to different data sources. This provides you with a centralized interface for managing all the data connections, ensuring efficient data integration and enrichment processes. You can easily navigate and manage your connections by utilizing the search, sort, edit, and delete features.
Let's get started 🚀
Navigation to Connection
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Connection tab.
Manage Connection
You can effectively manage your connections by editing, deleting, and adding datastores to maintain accuracy and efficiency.
Warning
Before deleting a connection, ensure that all associated datastores and enrichment datastores have been removed.
Edit Connection
You can edit connections to update details like name, account, role, warehouse, and authentication to improve performance. This keeps connection settings up-to-date and suited to your data needs.
Note
You can only edit the connection name and connection details, but you are not able to edit the connector itself.
Step 1: Click the vertical ellipsis (⋮) next to the connection that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the connection details as needed.
Note
Connection details vary from connection to connection, which means that each connection may have its unique configuration settings.
Step 3: Once you have updated the values, click on the Save button to apply your changes.
Step 4: After clicking the Save button, your connection will be updated, and a success message will display saying Connection successfully updated.
Delete Connection
This allows you to remove outdated or unnecessary connections to maintain a clean and efficient network configuration.
Step 1: Click the vertical ellipsis (⋮) next to the connection that you want to delete, then click on Delete from the dropdown menu.
Step 2: A modal window Delete Connection will appear.
Warning
Associated Source Datastores and Enrichment Datastores must be removed before deleting the connection.
Step 3: Enter the Name of the Connection in the given field (confirmation check) and then click on the I’M SURE, DELETE THIS CONNECTION button to delete the connection.
Add Datastore
You can add new or existing datastores and enrichment datastores directly from the connection, making it easy to manage and access your data while ensuring all sources are connected and available.
Step 1: Click the vertical ellipsis (⋮) next to the connection where you want to add a datastore, then click on Add Datastore from the dropdown menu.
A modal window labeled Add Datastore will appear, giving you options to connect a datastore. For more information on adding a datastore, please refer to the Configuring Datastores section.
Once you have successfully added a datastore to the connection, a success message will appear saying, Your datastore has been successfully added.
View Connection
Once you have added a new datastore and enrichment datastore, you can view them in the connections list.
Sort Connection
You can sort your connections by Name and Created Date to easily find and manage them.
Filter Connection
You can filter connections by selecting specific data source types from the dropdown menu, making it easier to locate and manage the desired connections.
Integrations ↵
Overview
With Qualytics integrations, data analysts can rely on the data quality insights produced by the platform's data profiling and scanning. These insights can be visualized in your preferred data catalog tool.
Key features include:
- Leveraging data catalog tags
- Pushing alerts based on anomaly identification
- Sharing valuable data quality metrics
Supported data catalog integrations:
- Atlan
- Alation
Once an integration is set up, the synchronization process can occur in two ways:
-
Manual Sync: Manual sync occurs when the user clicks the Sync button located on the Settings page in the Integrations tab. Clicking it triggers a full sync of all matching assets.
Info
Tags are only synchronized through a manual sync.
-
Event Driven: Once the Event Driven option is enabled, triggering occurs automatically based on the listed action events executed by users. If any of the following actions occur, a sync is triggered.
Event | Description |
---|---|
Run an Operation (Profile or Scan) | Sync all target containers for the operation |
Archive an Anomaly (including bulk) | Sync the container in which the anomaly was identified |
Archive a Check (including bulk) | Sync the container to which the check belongs |
Atlan
Integrating Atlan with Qualytics allows for easy push and pull of metadata between the two platforms. Specifically, Qualytics "pushes" its metadata to the data catalog and "pulls" metadata from the data catalog. Once connected, Qualytics automatically updates when key events happen in Atlan, such as metadata changes, anomaly updates, or archiving checks. This helps maintain data quality and consistency. During the sync process, Qualytics can either replace existing tags in Atlan or skip assets that have duplicate tags to avoid conflicts. Setting it up is simple—you just need to provide an API token to allow smooth communication between the systems.
Let’s get started 🚀
Atlan Setup
Create an Atlan persona and policy
Before starting the integration process, it is recommended that you set up an Atlan persona, which grants access to the necessary data and metadata. While you can create this persona at the same time as your API token, it's easier to create it first so you can link the persona directly to the token later.
Before using Atlan with your data source, authorize the API token with access to the needed data and metadata. You do this by setting up policies within the persona for the Atlan connection that matches your Qualytics data source. Remember, you will need to do this for each data source you want to integrate.
Step 1: Navigate to Governance, then select “Personas”.
Step 2: Click on “+ New Persona Button”.
Step 3: Enter a Name and Description for a new persona, then click the “Create” button.
Step 4: Here your new Atlan persona has been created.
Step 5: After creating a new Atlan persona you have to create policies to authorize the personal access token. Click on "Add Policies" to create a new policy or to add one if there isn't any available.
Step 6: Click on "New Policy" and select "Metadata Policy" from the dropdown menu.
Step 7: Enter a "name", and choose the "connection".
Step 8: Customize the permissions and assets that Qualytics will access.
Step 9: Once the policy is created, you’ll see it listed in the Policies section.
Create Atlan Personal Access Token
After you’ve created the persona, the next step is to create a personal access token.
Step 1: Navigate to the API Tokens section in the Admin Center.
Step 2: Click on "Generate API Token" button.
Step 3: Enter a name and description, and select the persona you created earlier.
Step 4: Click the "Save" button and make sure to store the token in a secure location.
Add Atlan Integration
Integrating Atlan with Qualytics enhances your data management capabilities, allowing seamless synchronization between the two platforms. This guide will walk you through the steps to add the Atlan integration efficiently. By following these steps, you can configure essential settings, provide necessary credentials, and customize synchronization options to meet your organization’s needs.
Step 1: Log in to your Qualytics account and click the "Settings" button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the "Integration" tab.
Step 3: Click on the “Add Integration” button.
Step 4: Fill out the configuration form selecting the "Atlan" integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Provide a name for the integration. |
2. | Type (Required) | Choose the type of integration from the dropdown menu. Currently, 'Atlan' is selected |
3. | URL (Required) | The complete address for the Atlan instance, for example: https://your-company.atlan.com. |
4. | Token (Required) | Provide the authentication token needed to connect to Atlan. |
5. | Domains | Select specific domains to filter assets for synchronization. - Acts as a filtering mechanism to sync specific assets - Uses domain information from the data catalog (e.g. Sales ). Only assets under the selected domains will synchronize. |
6. | Event Driven | If enabled, the integration sync will be activated by operations, archiving anomalies, and checks. |
7. | Overwrite Tags | If enabled, Atlan tags will have precedence over Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
Step 5: Click on the Save button to set up the Atlan integration.
Step 6: Once the Atlan integration is set up with Qualytics, it will appear in Qualytics as a new integration.
Synchronization
The Atlan synchronization supports both push and pull operations. This includes pulling metadata from Atlan to Qualytics and pushing Qualytics metadata to Atlan. During the syncing process, the integration pulls tags assigned to data assets in Atlan and assigns them to Qualytics assets as an external tag.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click on the "Sync" button next to the relevant integration card.
Step 2: After clicking the "Sync" button, you will have the following options:
- Pull Atlan Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the "Start" button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Atlan begins. This process pulls metadata from Atlan and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Review the logs to verify which assets were successfully mapped from Atlan to Qualytics.
Step 6: Once synchronization is complete, the mapped assets from "Atlan" will display an external tag.
Step 7: When Qualytics detects anomalies, alerts are sent to the assets in Atlan, displaying the number of active anomalies and including a link to view the corresponding details
Metadata
The Quality Score Total, the Qualytics 8 metrics (completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy), and the count of checks and anomalies per asset identified by Qualytics are pushed to Atlan.
Alation
Integrating Alation with Qualytics allows you to pull metadata from Alation to Qualytics and push Qualytics metadata to Alation. Once integrated, Qualytics stays updated with key changes in Alation, like metadata updates and anomaly alerts, which helps ensure data quality and consistency. Qualytics updates only active checks, and metadata updates in Qualytics occur if the Event-Driven option is enabled or can be triggered manually using the "Sync" button. During sync, Qualytics can replace existing tags in Alation or skip duplicate tags to avoid conflicts. The setup is simple: just provide a refresh token for communication between the systems.
Let’s get started 🚀
Alation Setup
Create Refresh Token
Before setting up Alation Integration in Qualytics, you have to generate a Refresh token. This allows Qualytics to access Alation's API and keep data in sync between the two platforms.
Step 1: Navigate to the "Profile Settings".
Step 2: Select the "Authentication" tab.
Step 3: Click on the "Create Refresh Token" button.
Step 4: Enter a name for the token.
Step 5: After entering the name for the token, click on "Create Refresh Token".
Step 6: Your "refresh" token has been generated successfully. Please Copy and save it securely.
Step 7: Here you can view the token that is successfully added to the access tokens list.
Add Alation Integration
Step 1: Log in to your Qualytics account and click the "Settings" button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the "Integration" tab.
Step 3: Click on the "Add Integration" button.
Step 4: Complete the configuration form by choosing the Alation integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name (Required) | Provide a name for the integration. |
2. | Type (Required) | Choose the type of integration from the dropdown menu. Currently, 'Alation' is selected |
3. | URL (Required) | Enter the full address of the Alation instance, for example, https://instance.alationcloud.com. |
4. | Refresh Token (Required) | Enter the refresh token required to access the Alation API. |
5. | User ID (Required) | Provide the user ID associated with the generated token. |
6. | Domains | Select specific domains to filter assets for synchronization. - Acts as a filtering mechanism to sync specific assets - Uses domain information from the data catalog (e.g. Sales ). Only assets under the selected domains will synchronize. |
7. | Event Driven | If enabled, operations, archiving anomalies, and checks will activate the integration sync. |
8. | Overwrite Tags | If enabled, Alation tags will override Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
Step 5: Click on the Save button to integrate Alation with Qualytics.
Step 6: Here you can view the new integration appearing in Qualytics.
Synchronization
The Alation synchronization supports both push and pull operations. This includes pulling metadata from Alation to Qualytics and pushing Qualytics metadata to Alation. During the syncing process, the integration pulls tags assigned to data assets in Alation and assigns them to Qualytics assets as an external tag.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click the "Sync" button next to the relevant integration card.
Step 2: After clicking the Sync button, you will have the following options:
- Pull Alation Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the "Start" button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Alation begins. This process pulls metadata from Alation and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Once synchronization is complete, the mapped assets from Alation will display an external tag.
Alerts
When Qualytics detects anomalies, alerts are sent to the assets in Alation, showing the number of active anomalies and providing a link to view them.
Metadata
The Quality Score Total, the "Qualytics 8" metrics (completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy), and counts of checks and anomalies per asset identified by Qualytics are pushed to Alation. This enables users to analyze assets based on data profiling and scanning metrics. A link to the asset in Qualytics is also provided.
Data Health
On the Alation tables page, there's a tab called “Data Health” where Qualytics displays insights from data quality checks in a table format, showing the current status based on the number of anomalies per check.
Column | Description |
---|---|
Rule | The type of data quality check rule |
Object Name | The Table Name |
Status | The check status can be either "Alert" if there are active anomalies or "No Issues" if no active anomalies exist for the check. |
Value | The current number of active anomalies |
Description | The data quality check description |
Last Updated | The last synced timestamp |
External Tag Propagation
External tag propagation in Qualytics synchronizes metadata labels automatically from an integrated data catalog, such as Atlan or Alation. This process helps maintain consistent data tagging across various platforms by using pre-existing tags from the data catalog.
Let’s get started 🚀
Navigation
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: You will be directed to the Settings page, then click on the Integration tab.
Step 3: Click on the Add Integration button.
A modal window Add Integration will appear, providing you with the options to add integration.
Step 4: Fill out the configuration form by selecting the integration type.
REF. | FIELDS | ACTIONS |
---|---|---|
1. | Name | Provide a name for the integration. |
2. | Type | Choose the type of integration from the dropdown menu. Currently, 'Atlan' is selected |
3. | URL | The complete address for the Atlan instance, for example: https://your-company.atlan.com. |
4. | Token | Provide the authentication token needed to connect to Atlan. |
5. | Event Driven | If enabled, the integration sync will be activated by operations, archiving anomalies, and checks. |
6. | Overwrite Tags | If enabled, Atlan tags will have precedence over Qualytics tags in cases of conflicts (when tags with the same name exist on both platforms). |
For demonstration purposes we have selected Atlan integration type.
Step 5: Click on the Save button to set up the Atlan integration.
Step 6: Once the Atlan integration is set up with Qualytics, it will appear in Qualytics as a new integration.
Synchronization
Synchronization supports both push and pull operations. This includes pulling metadata from one platform to Qualytics and pushing Qualytics metadata to the other platform. During the syncing process, the integration pulls tags assigned to data assets in the source platform and assigns them to Qualytics assets as an external tag.
For demonstration purposes we have selected Atlan synchronization.
Note
Tag synchronization requires manual triggering.
Step 1: To sync tags, simply click on the Sync button next to the relevant integration card.
Step 2: After clicking the Sync button, you will have the following options:
- Pull Atlan Metadata
- Push Qualytics Metadata
Specify whether the synchronization will pull metadata, push metadata, or do both.
Step 3: After selecting the desired options, click on the Start button.
Step 4: After clicking the Start button, the synchronization process between Qualytics and Atlan begins. This process pulls metadata from Atlan and pushes Qualytics metadata, including tags, quality scores, anomaly counts, asset links, and many more.
Step 5: Review the logs to verify which assets were successfully mapped from Atlan to Qualytics.
Step 6: Once synchronization is complete, the mapped assets from Atlan will display an external tag.
Ended: Integrations
Security
You can easily manage user and team access by assigning roles and permissions within the system. This includes setting up specific access levels and roles for different users and teams. By doing so, you ensure that data and resources are accessed securely and appropriately, with only authorized individuals and groups having the necessary permissions to view or modify them. This helps maintain the integrity and security of your system.
Note
Only users with the Admin role have the authority to manage global platform settings, such as user permissions and team access controls.
Let’s get started 🚀
Navigation to Security
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Security tab.
Add Team
You can create a new team for efficient and secure data management. Teams make it easier to control who has access to what, help people work together better, keep things secure with consistent rules, and simplify managing and expanding user groups. You can assign permissions to the team, such as read and write access, by selecting the datastore and enrichment datastore to which you want them to have access. This makes data management easier.
In Qualytics, every user is assigned one of two roles: Admin or Member.
-
Admin: Admin users have full access to the system and can manage datastores, teams, and users. This means they can access everything in the application, as well as manage user accounts and team permissions.
-
Member: Members are normal users with access explicitly granted to them, usually inherited from the teams they are assigned to.
Step 1: Click on the Add Team button located in the top right corner.
Step 2: A modal window will appear, providing the options for creating the team. Enter the required values to get started.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1. | Name | Enter the name of the team | Data Insights Team |
2. | Description | Provide a brief description of the team. | Analyzes data to provide actionable insights, supporting data-driven decisions |
3. | Permission | Select the permission level for the team: Write (manage and edit data), Read (view and report), or None (no access) | Read/Write |
4. | Users | Add users to the team | John, Michael |
5. | Source Datastores | Grant access to specific source datastores (single or multiple) for the team | Athena |
6. | Enrichment Datastores | Add and grant access to additional enrichment datastores (single or multiple) for the team | Bank Enrichment |
Step 3: Click on the Save button to save your team.
After clicking on the Save button, your team is created, and a success message will appear saying, Team successfully created.
Directory Sync
Directory Sync, also known as User and Group Provisioning, automates the synchronization of users and groups between your identity provider (IDP) and the Qualytics platform. This ensures that your user data is consistent across all systems, improving security and reducing the need for manual updates.
Directory Sync Overview
Directory Sync automates the management of users and groups by synchronizing information between an identity provider (IDP) and your application. This ensures that access permissions, user attributes, and group memberships are consistently managed across platforms, eliminating the need for manual updates.
How Directory Sync Works with SCIM
SCIM is an open standard protocol designed to simplify the exchange of user identity information. When integrated with Directory Sync, SCIM automates the creation, updating, and de-provisioning of users and groups. SCIM communicates securely between the IDP and your platform’s API using OAuth tokens to ensure only authorized actions are performed.
General Setup Requirements
To set up Directory Sync, the following are required:
- Administrative access to both the identity provider and Qualytics platform
- A SCIM-enabled identity provider or custom integration
- The OAuth client set up in your IDP
- SCIM URL and OAuth Bearer Token generated from the Qualytics platform
Getting Started
Prerequisites for Setting Up Directory Sync
Before setting up Directory Sync, ensure you have the following:
- A SCIM-supported identity provider
- Administrative privileges for both your IDP and Qualytics
- A SCIM URL and OAuth Bearer Token, which will be generated from your Qualytics instance
Quick Start Guide
- Set up an OAuth client in your IDP.
- Configure the SCIM endpoints with the SCIM URL and OAuth Bearer Token.
- Assign users and groups to provision in the IDP.
- Monitor the synchronization to ensure proper operation.
What is SCIM?
SCIM is a standardized protocol used to automate the exchange of user identity information between IDPs and service providers. Its goal is to simplify the process of user provisioning and management.
SCIM improves efficiency by automating user lifecycle management (creation, updating, and de-provisioning) and ensures that data remains consistent across platforms. It also enhances security by minimizing manual errors and ensuring proper access control.
SCIM includes endpoints that are configured within your IDP and your platform. It uses OAuth tokens for secure communication between the IDP and the Qualytics API, ensuring that only authorized users can manage identity data.
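For illustration, the sketch below calls the standard SCIM 2.0 Users endpoint exposed by a Qualytics instance with a Bearer token. The base URL is a placeholder for your own instance, and the token is one generated from the Tokens tab; this is only a quick way to confirm that the endpoint and token work, not a step the IDP integrations require you to perform manually.

```python
# A minimal sketch of a SCIM 2.0 request against a Qualytics instance, assuming
# the standard /Users endpoint defined by RFC 7644. The base URL is a
# placeholder for your own instance; the token is generated in Qualytics.
import requests

SCIM_BASE_URL = "https://your-instance.qualytics.io/api/scim/v2"  # placeholder
TOKEN = "YOUR_BEARER_TOKEN"

resp = requests.get(
    f"{SCIM_BASE_URL}/Users",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/scim+json",
    },
    timeout=30,
)
resp.raise_for_status()
for user in resp.json().get("Resources", []):  # SCIM list responses use "Resources"
    print(user.get("userName"), "active" if user.get("active") else "inactive")
```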
Benefits of Using SCIM for User and Group Provisioning
By leveraging SCIM (System for Cross-domain Identity Management), Directory Sync simplifies user management with:
- Automated user provisioning and de-provisioning
- Reduced manual intervention, improving efficiency and security
- Real-time updates of user data, ensuring accuracy and compliance
- Support for scaling user management across organizations of any size
Supported Providers
Our API supports SCIM 2.0 (System for Cross-domain Identity Management) as defined in RFC 7643 and RFC 7644. It is designed to ensure seamless integration with any SCIM-compliant identity management system, supporting standardized user provisioning, de-provisioning, and lifecycle management. Additionally, we have verified support with the following providers:
- Microsoft Entra (Azure Active Directory)
- Okta
- OneLogin
- JumpCloud
Unsupported Providers
We do not support Google Workspace, as it does not offer SCIM support. Organizations using Google Workspace must use alternate methods for user provisioning.
Providers
1. Microsoft Entra
Creating an App Registration
Step 1: Log in to the Microsoft Azure Portal, and select “Microsoft Entra ID” from the main menu.
Step 2: Click on “Enterprise Applications” from the left navigation menu.
Step 3: If your application is already created, choose it from the list and move to the section Configuring SCIM Endpoints. If you haven't created your application yet, click on the New Application button.
Step 4: Click on the “Create your own application” button to create your application.
Step 5: Give your application a name (e.g., "Qualytics OAuth Client" or "Qualytics SCIM Client").
Step 6: After entering the name for your application, click the Create button to finalize the creation of your app.
Configuring SCIM Endpoints
Step 1: Click on Provisioning from the left-hand menu.
Step 2: A new window will appear, click on the Get Started button.
Step 3: In the Provisioning Mode dropdown, select “Automatic” and enter the following details in the Admin Credentials section:
-
Provisioning Mode: Select Automatic.
-
Tenant URL:
https://fhlbny.qualytics.io/api/scim/v2
-
Secret Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 4: Click on the Test Connection button to test the connection to see if the credentials are correct.
Step 5: Expand the Mappings section and enable group and user attribute mappings for your app. The default mappings should work.
Step 6: Expand the Settings section and make the following changes:
- Select Sync only assigned users and groups from the Scope dropdown.
- Confirm the Provisioning Status is set to On.
Step 7: Click on the Save button to save the credentials. You have now successfully configured the Microsoft Entra ID SCIM API integration.
Assigning Users and Groups for Provisioning
Step 1: Click on the Users and groups from the left navigation menu and then click Add user/group.
Step 2: Click on the None Selected under the Users and Groups.
Step 3: From the right side of the screen, select the users and groups you want to assign to the app.
Step 4: Once you have selected the groups and users for your app, click the “Select” button.
Step 5: Click on the Assign button to assign the users and groups to the application.
Warning
When you assign a group to an application, only users directly in the group will have access. The assignment does not cascade to nested groups.
2. Okta
Setting up the OAuth Client in Okta
Step 1: Log in to your Okta account using your administrator credentials. From the left-hand navigation menu, click Applications, then select Browse App Catalog.
Step 2: In the search bar, type SCIM 2.0 Test App (OAuth Bearer Token), and select the app called SCIM 2.0 Test App (OAuth Bearer Token) from the search results.
Step 3: On the app’s details page, click Add Integration.
Step 4: Enter a name for your application (e.g., "Qualytics SCIM Client").
Step 5: Click on the Next button.
Configuring SCIM Endpoints
Step 1: In the newly created app, go to the Provisioning tab and click Configure API Integration.
Step 2: Check the box labeled Enable API Integration, and enter the following details:
-
SCIM 2.0 Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
OAuth Bearer Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 3: Click Test API Credentials to verify the connection. Once the credentials are validated, click Save.
Step 4: A new settings page will appear. Under the To App section, enable the following settings:
- Create Users
- Update User Attributes
- Deactivate Users
After enabling these settings, your Okta SCIM API integration is successfully configured.
Assigning users for provisioning
Step 1: Click the Assignments tab and select Assign to People from the dropdown Assign.
Step 2: Select the users you want to assign to the app and click the Assign button.
Step 3: After you click the Assign button, you'll see a new popup window with various fields. Confirm the field values and click the Save and Go Back buttons.
Assigning groups for provisioning
Step 1: Navigate to the Push Groups tab and select Find group by name from the Push Groups dropdown.
Step 2: Search for the group you want to assign to the app.
Step 3: After selecting the group, click on the Save button.
3. OneLogin
Setting up the OAuth Client in OneLogin
Step 1: Log in to your OneLogin account using your administrator credentials. From the top navigation menu, click Applications, then select Add App.
Step 2: In the search bar, type SCIM and select the app called SCIM Provisioner with SAML (SCIM V2 Enterprise) from the list of apps.
Step 3: Enter a name for your app, then click Save. You have successfully created the SCIM app in OneLogin.
Configuring SCIM Endpoints
Step 1: In your created application, navigate to the Configuration tab on the left and enter the following information:
-
API Status: Enable the API status for the integration to work properly.
-
SCIM Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
SCIM Bearer Token: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
Step 2: Click on the Save button to store the credentials.
Step 3: Navigate to the Provisioning tab, and check the box labeled Enable Provisioning.
Step 4: Click on Save to apply the changes.
Step 5: Navigate to the Parameters tab and select the row for Groups.
Step 6: A popup window will appear. Check the box Include in User Provisioning, then click the Save button.
Assigning Users for Provisioning
Step 1: To assign users to your app, go to Users from the top navigation menu, and select the user you want to assign to the app.
Step 2: From the User page, click the Applications tab on the left, and click the + (plus) sign.
Step 3: A popup window will show a list of apps. Select the app you created earlier and click Continue.
Step 4: A new modal window will appear, click on the Save to confirm the assignment.
Step 5: If you see the status Pending in the table, click that text. A modal window will appear, where you can click Approve to confirm the assignment.
Assigning Groups for Provisioning
Step 1: To push groups to your app, go to the top navigation menu, click Users, select Roles from the dropdown, and click New Role to create the role.
Step 2: Enter a name for the role and select the app you created earlier.
Step 3: Click on the “Save” button.
Step 4: Click the Users tab for the role and search for the user you want to assign to the role.
Step 5: Click the Add To Role button to assign the user, then click Save to confirm the assignment.
Step 6: A modal window will appear, click on the “Save” button to confirm the assignment.
Step 7: Go back to your app and click the Rule tab on the left and click the Add Rule button.
Give the rule a name. Under Actions, select Set Groups in your-app-name from the dropdown, then select each role with values that match your-app-name.
Step 8: Click on the Save button.
Step 9: Click on the Users tab on the left, you may see Pending under the provisions state. Click on it to approve the assignment.
Step 10: A modal window will appear, click on the Approve to finalize the assignment.
4. JumpCloud
Configuring SCIM Endpoints
JumpCloud supports SCIM provisioning within an existing SAML application. Follow these steps to configure SCIM provisioning:
Step 1: Log in to JumpCloud and either choose an existing SAML application or create a new one. From the left navigation menu, click SSO and select your Custom SAML App.
Step 2: Click on the tab Identity Management within your SAML application.
Step 3: Under SCIM Version, choose SCIM 2.0 and enter the following information:
-
Base URL:
https://fhlbny.qualytics.io/api/scim/v2
-
Token Key: Generate this token from the Qualytics UI when logged in as an admin user. For more information on how to generate tokens in Qualytics, refer to the documentation on Tokens.
-
Test User Email
Step 4: Click Test Connection to ensure the credentials are correct, then click Activate to enable SCIM provisioning.
Step 5: Click Save to store your settings. Once saved, SCIM provisioning is successfully configured for your JumpCloud SAML application.
Assigning Users for Provisioning
Step 1: Click the tab User Groups within your SAML application. You can see all the available groups, select the groups you want to sync, and click Save.
If no existing groups are available, click User Groups from the left navigation menu and click on the plus (+) icon to create a new group.
Step 2: Select the Users tab and choose the users you want to assign to the group.
Step 3: Select the Applications tab and choose the app you want to assign the group to.
Manage Users
You can easily manage users by assigning roles, teams, and deactivating users who are not active. This ensures that access control is streamlined, security is maintained, and only active users have access to resources.
The Security section, visible only to Admins, allows for granting and revoking permissions for Member users.
Access controls in Qualytics are assigned at the datastore level. A non-administrator user (Member) can have one of three levels of access to any datastore connected to Qualytics:
-
Write: Allows the user to perform operations on and manage the datastore’s metadata.
-
Read: Allows the user to view and report on the datastore.
-
None: The datastore is not visible or accessible to the user.
Note
Permissions are assigned to Teams rather than directly to users. Users inherit the permissions of the teams to which they are assigned.
All users are part of the default Public team, which provides access to all Public Datastores. Admins can create and manage additional teams, assigning both users and datastores to them. When a datastore is assigned to a team, the team is granted either Read or Write access, and all team members inherit this permission.
View Users
Whenever new users are added to the system, they will appear in the Users list. Click the Users tab to view the list of users.
Edit Users
You can edit user details to update their role and team assignments, ensuring their access and team information are current and accurate.
Step 1: Click the vertical ellipsis (⋮) next to the user name that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the user details as needed, including:
- Updating their role
- Assigning them additional teams
Note
All users are in the Public team by default, and this cannot be changed. If users should have no default access to any datastore, then no datastores should be assigned to the Public team.
Step 3: Once you have made the necessary changes, then click on the Save button.
After clicking the Save button, your changes will be updated, and a success message will display saying User successfully updated.
Deactivate Users
You can deactivate users to revoke their access to the system while retaining their account information for future reactivation if needed.
Step 1: Click the vertical ellipsis (⋮) next to the user name that you want to deactivate, then click on Deactivate from the dropdown menu.
Step 2: A modal window Deactivate User will appear.
Step 3: Enter deactivate in the given field (confirmation check) and then click on the I’M SURE, DEACTIVATE THIS USER button to deactivate the user.
Sort Users
You can sort users by various criteria, such as Created date, Name, Role, and Teams, to easily manage and organize user information.
Filter Users
You can filter the users by their roles and team, to quickly find and manage particular groups of users.
Manage Teams
You can manage teams by editing their permissions, adding or removing users, and adjusting access to source and enrichment datastores. If a team is no longer needed, you can delete it from the system. This ensures that team configurations are always up-to-date and relevant, enhancing overall data management and security.
View Team
Whenever new teams are added to the system, they will appear in the Teams list. Click the Teams tab to view the list of teams.
Edit Team
You can edit a team to update its permissions, name, manage users within the team, and adjust access to source and enrichment datastores, ensuring the team's configuration is current and effective.
Note
The name and users of the Public team cannot be edited.
Step 1: Click on the vertical ellipsis (⋮) next to the team name that you want to edit, then click on Edit from the dropdown menu.
Step 2: Edit the team details as needed, including updating their permissions, users, source, and enrichment datastores.
Step 3: Once you have made the necessary changes, click on the Save button.
After clicking the Save button, your team is updated and a success message will be displayed saying Team successfully updated.
Delete Team
You can delete a team from the system when it is no longer needed, removing its access and permissions to streamline management and maintain security.
Step 1: Click the vertical ellipsis (⋮) next to the team name that you want to delete, then click on Delete from the dropdown menu.
A modal window Delete Team will appear.
Step 2: Click on the Delete button to delete the team from the system.
Sort Team
You can sort teams by various criteria, such as name or creation date, to easily organize and manage team information.
Tokens
A token is a secure way to access the Qualytics API instead of using a password. Each user gets a unique Personal API Token (PAT) for authentication. These tokens are created only once, so you need to copy and store them safely because you'll use them to log in and interact with the platform in the future.
Let’s get Started 🚀
Navigation to Tokens
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: By default, you will be navigated to the Tags section. Click on the Tokens tab.
Generate Token
Generating a token provides a secure method for authenticating and interacting with your platform, ensuring that only authorized users and applications can access your resources. Personal Access Tokens (PATs) are particularly useful for automated tools and scripts, allowing them to perform tasks without needing manual intervention. By using PATs, you can leverage our Qualytics CLI to streamline data management and operations, making your workflows more efficient and secure.
Step 1: Click on the Generate Token button located in the top right corner.
A modal window will appear providing the options for generating the token.
Step 2: Enter the following values:
- Name: Enter a name for the token (e.g., DataAccessToken)
- Expiration: Set the expiration period for the token (e.g., 30 days)
Step 3: Once you have entered the values, click on the Generate button.
Step 4: After clicking on the Generate button, your token is successfully generated.
Warning
Make sure to copy your secret key as you won't be able to see it again. Keep your secret keys confidential and avoid sharing them with anyone. Use a password manager or an encrypted vault to store your secret keys.
Revoke Token
You can revoke your token to prevent unauthorized access or actions, especially if the token has been compromised, is no longer needed, or to enhance security by limiting the duration of access.
Step 1: Click the vertical ellipsis (⋮) next to the user token that you want to revoke, then click on Revoke from the dropdown menu.
Step 2: After clicking the Revoke button, your user token will be successfully revoked. A success message will display saying User token successfully revoked. Following revocation, the token's status color will change from green to orange.
Restore Token
You can restore a token to reactivate its access, allowing authorized use again. This is useful if the token was mistakenly revoked or if access needs to be temporarily re-enabled without generating a new token.
Step 1: Click the vertical ellipsis (⋮) next to the revoked token that you want to restore, then click on the Restore button from the dropdown menu.
Step 2: After clicking on the Restore button, your secret token will be restored and a confirmation message will display saying User token successfully restored.
Delete Token
You can delete a token to permanently remove its access, ensuring it cannot be used again. This is important for maintaining security when a token is no longer needed, has been compromised, or to clean up unused tokens in your system.
Note
You can only delete revoked tokens, not active tokens. If you want to delete an active token, you must first revoke it before you can delete it.
Step 1: Click the vertical ellipsis (⋮) next to the revoked token that you want to delete, then click on the Delete button from the dropdown menu.
After clicking the delete button, a confirmation modal window Delete Token will appear.
Step 2: Click on the Delete button to delete the token.
After clicking on the Delete button, your token will be deleted and a confirmation message will display saying User token successfully deleted.
Health
System Health provides a real-time overview of your system's resources, essential for monitoring performance and diagnosing potential issues. It provides key indicators and status updates to help you maintain system health and quickly address potential issues.
Navigation to Health
Step 1: Log in to your Qualytics account and click the Settings button on the left side panel of the interface.
Step 2: You will be directed to the Settings page; then click on the Health section.
Summary Section
The Summary section displays the current platform version, along with the database status and RabbitMQ state.
REF. | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1 | Current Platform Version | Shows the current version of your platform's core software. | 20240808-3019c60 |
2 | Database | Verifies your database connection. An "OK" status means it’s connected. | Status:OK |
3 | RabbitMQ | Confirms RabbitMQ (a message broker software) is running correctly with an "OK" state. | State:OK |
Health Indicator
The health status indicator reflects the overall health of system resources. For example, in the image below, a green checkmark indicates that our system resources are healthy.
Note
Status indicators are simple: a green checkmark indicates "Healthy," and a red exclamation mark means "Critical."
Analytics Engine
The Analytics Engine section provides advanced information about the analytics engine's configuration and current state for technical users and developers.
REF | FIELD | ACTION | EXAMPLE |
---|---|---|---|
1 | Build Date | This shows the date and time when the Analytics Engine was built. | Aug 8 2024,7:39 AM (GMT+5:30) |
2 | Implementation Version | The version of the analytics engine implementation being used. | 2.0.0 |
3 | Max Executors | Maximum number of executors allocated for processing tasks. | 10 |
4 | Max Memory Per Executor | This shows the maximum amount of memory allocated to each executor. | 25000 MB |
5 | Driver Free Memory | The amount of free memory available for the driver, which manages the Spark application. | 968 MB |
6 | Spark Version | The version of Apache Spark that the Analytics Engine uses for processing. | 3.5.1 |
7 | Core Per Executor | This shows the number of CPU cores assigned to each executor. | 3 |
8 | Max Dataframe Size | The maximum size of dataframes that can be processed. | 50000 MB |
9 | Thread Pool State | Indicates the current state of the thread pool used for executing tasks. | [Running, parallelism = 3, size = 0, active = 0, running = 0, steals = 0, tasks = 0, submissions = 0] supporting 0 running operation with 0 queued requests |
Manage Health Summary
You can perform essential tasks such as copying the health summary, refreshing it, and restarting the analytics engine. These functionalities help maintain an up-to-date overview of system performance and ensure accurate analytics.
Copy Health Summary
The Copy Health Summary feature lets you duplicate all data from the Health Section for easy sharing or saving.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Copy Health Summary from the drop-down menu.
Step 2: After clicking on Copy Health Summary, a success message will display saying Copied.
Refresh Health Summary
The Refresh Health Summary option updates the Health Section with the latest data. This ensures that you see the most current performance metrics and system status.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Refresh Health Summary to update the latest data.
Restart Analytics Engine
The Restart Analytics Engine option restarts the analytics processing system. This helps resolve issues and ensures that analytics data is accurately processed.
Step 1: Click the vertical ellipsis from the right side of the summary section and choose Restart Analytics Engine from the drop-down menu.
Step 2: A modal window will pop up. Click the Restart button in this window to restart the analytics engine. Restarting the engine helps resolve any issues and ensures that your analytics data is up-to-date and accurately processed.
Step 3: After clicking on the Restart button, a success message will display saying Successfully triggered Analytics Engine restart.
Ended: Settings
Qualytics CLI ↵
Qualytics CLI
Qualytics CLI is a command-line tool designed to interact with the Qualytics API. With this tool, users can manage configurations, export and import checks, run operations and more.
You can check the latest version in Qualytics CLI.
Installation and Upgrading
You can install the Qualytics CLI via pip:
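A minimal example, assuming the CLI is published on PyPI as `qualytics-cli`:

```bash
pip install qualytics-cli
```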
You can upgrade the Qualytics CLI via pip:
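For example, assuming the same `qualytics-cli` package name:

```bash
pip install qualytics-cli --upgrade
```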
Usage
Help
To view available commands and their usage:
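For example (the standard `--help` flag lists the available commands; the exact output varies by version):

```bash
qualytics --help
```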
Initializing Configuration
To set up your Qualytics URL and token:
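A sketch using the options documented below; the `init` subcommand name is inferred from this section's title, and the URL and token values are placeholders:

```bash
qualytics init \
  --url "https://your-instance.qualytics.io" \
  --token "YOUR_PERSONAL_ACCESS_TOKEN"
```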
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--url` | TEXT | The URL of your Qualytics instance | None | Yes |
`--token` | TEXT | The personal access token for accessing Qualytics | None | Yes |
Display Configuration
To view the currently saved configuration:
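A sketch; the `show-config` subcommand name is an assumption, so confirm it with `qualytics --help`:

```bash
qualytics show-config
```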
Export Checks
To export checks to a file:
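A sketch using the options documented below (the `checks export` subcommand name is inferred from the `checks export-templates` command shown later; all values are placeholders):

```bash
qualytics checks export \
  --datastore 123 \
  --containers "1,2,3" \
  --tags "tag1,tag2" \
  --output "$HOME/.qualytics/data_checks.json"
```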
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--datastore` | INTEGER | Datastore ID | None | Yes |
`--containers` | List of INTEGER | Container IDs | None | No |
`--tags` | List of TEXT | Tag names | None | No |
`--output` | TEXT | Output file path | `$HOME/.qualytics/data_checks.json` | No |
Export Check Templates
To export check templates:
qualytics checks export-templates
--enrichment_datastore_id 123
[--check_templates "1, 2, 3" or "[1,2,3]"]
[--status `true` or `false`]
[--rules "afterDateTime, aggregationComparison" or "[afterDateTime, aggregationComparison]"]
[--tags "tag1, tag2, tag3" or "[tag1, tag2, tag3]"]
[--output "/home/user/.qualytics/data_checks_template.json"]
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--enrichment_datastore_id` | INTEGER | The ID of the enrichment datastore where check templates will be exported. | None | Yes |
`--check_templates` | TEXT | IDs of specific check templates to export (comma-separated or array-like). | None | No |
`--status` | BOOL | Check template status: send `true` if it is locked or `false` if it is unlocked. | None | No |
`--rules` | TEXT | Comma-separated list of check template rule types or array-like format. Example: "afterDateTime, aggregationComparison" or "[afterDateTime, aggregationComparison]". | None | No |
`--tags` | TEXT | Comma-separated list of tag names or array-like format. Example: "tag1, tag2, tag3" or "[tag1, tag2, tag3]". | None | No |
`--output` | TEXT | Output file path (example: `/home/user/.qualytics/data_checks_template.json`). | None | No |
Import Checks
To import checks from a file:
import qualytics.qualytics as qualytics
TARGET_DATASTORE_ID = 1172
qualytics.checks_import(
datastore=TARGET_DATASTORE_ID,
input_file="/home/user/.qualytics/data_checks.json"
)
Quality check id: 195646 for container: CUSTOMER created successfully
Quality check id: 195647 for container: CUSTOMER created successfully
Quality check id: 195648 for container: CUSTOMER created successfully
Quality check id: 195649 for container: CUSTOMER created successfully
Quality check id: 195650 for container: CUSTOMER created successfully
Quality check id: 195651 for container: CUSTOMER created successfully
Quality check id: 195652 for container: CUSTOMER created successfully
Quality check id: 195653 for container: CUSTOMER created successfully
Quality check id: 195654 for container: CUSTOMER created successfully
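The example above uses the Python module directly. A CLI sketch using the options documented below (the `checks import` subcommand name is an assumption; values are placeholders):

```bash
qualytics checks import \
  --datastore 1172 \
  --input "$HOME/.qualytics/data_checks.json"
```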
Options:
Option | Type | Description | Default | Required |
---|---|---|---|---|
`--datastore` | TEXT | Datastore IDs to import checks into (comma-separated or array-like). | None | Yes |
`--input` | TEXT | Input file path | `$HOME/.qualytics/data_checks.json` | No |
Note: Errors during import will be logged in `$HOME/.qualytics/errors.log`.
Run a Catalog Operation on a Datastore
Allows you to trigger a catalog operation on any current datastore (datastore permission granted by an Admin is required).
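A sketch using the options documented below; the `run catalog` subcommand follows the pattern of the `run profile` and `run scan` commands shown later, and the values are placeholders:

```bash
qualytics run catalog \
  --datastore "1,2,3" \
  --include "table,view" \
  --prune \
  --recreate \
  --background
```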
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--include` | TEXT | Comma-separated list of include types or array-like format. Example: "table,view" or "[table,view]" | No |
`--prune` | BOOL | Prune the operation. Omit this flag if you want prune to be false | No |
`--recreate` | BOOL | Recreate the operation. Omit this flag if you want recreate to be false | No |
`--background` | BOOL | Starts the catalog operation but does not wait for it to finish | No |
Run a Profile Operation on a Datastore
Allows you to trigger a profile operation on any current datastore (datastore permission granted by an Admin is required).
qualytics run profile
--datastore "DATSTORE_ID_LIST"
--container_names "CONTAINER_NAMES_LIST"
--container_tags "CONTAINER_TAGS_LIST"
--infer_constraints
--max_records_analyzed_per_partition "MAX_RECORDS_ANALYZED_PER_PARTITION"
--max_count_testing_sample "MAX_COUNT_TESTING_SAMPLE"
--percent_testing_threshold "PERCENT_TESTING_THRESHOLD"
--high_correlation_threshold "HIGH_CORRELATION_THRESHOLD"
--greater_than_time "GREATER_THAN_TIME"
--greater_than_batch "GREATER_THAN_BATCH"
--histogram_max_distinct_values "HISTOGRAM_MAX_DISTINCT_VALUES"
--background
import qualytics.qualytics as qualytics
DATASTORE_ID = "844"
CONTAINER_NAMES = "CUSTOMER, NATION"
qualytics.profile_operation(
datastores=DATASTORE_ID,
container_names=CONTAINER_NAMES,
container_tags=None,
infer_constraints=True,
max_records_analyzed_per_partition=None,
max_count_testing_sample=None,
percent_testing_threshold=None,
high_correlation_threshold=None,
greater_than_time=None,
greater_than_batch=None,
histogram_max_distinct_values=None,
background=False
)
Successfully Started Profile 29466 for datastore: 844
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Successfully Finished Profile operation 29466 for datastore: 844
Processing... ---------------------------------------- 100% 0:00:46
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--container_names` | TEXT | Comma-separated list of container names or array-like format. Example: "container1,container2" or "[container1,container2]" | No |
`--container_tags` | TEXT | Comma-separated list of container tags or array-like format. Example: "tag1,tag2" or "[tag1,tag2]" | No |
`--infer_constraints` | BOOL | Infer quality checks during profiling. Omit this flag if you want infer_constraints to be false | No |
`--max_records_analyzed_per_partition` | INT | Maximum number of records analyzed per partition | No |
`--max_count_testing_sample` | INT | The number of records accumulated during profiling for validation of inferred checks. Capped at 100,000 | No |
`--percent_testing_threshold` | FLOAT | Percent of testing threshold | No |
`--high_correlation_threshold` | FLOAT | Correlation threshold value | No |
`--greater_than_time` | DATETIME | Only include rows where the incremental field's value is greater than this time. Use one of these formats: %Y-%m-%dT%H:%M:%S or %Y-%m-%d %H:%M:%S | No |
`--greater_than_batch` | FLOAT | Only include rows where the incremental field's value is greater than this number | No |
`--histogram_max_distinct_values` | INT | Maximum number of distinct values in the histogram | No |
`--background` | BOOL | Starts the profile operation but does not wait for it to finish | No |
Run a Scan Operation on a Datastore
Allows you to trigger a scan operation on a datastore (datastore permission granted by an Admin is required).
qualytics run scan
--datastore "DATSTORE_ID_LIST"
--container_names "CONTAINER_NAMES_LIST"
--container_tags "CONTAINER_TAGS_LIST"
--incremental
--remediation
--max_records_analyzed_per_partition "MAX_RECORDS_ANALYZED_PER_PARTITION"
--enrichment_source_record_limit "ENRICHMENT_SOURCE_RECORD_LIMIT"
--greater_than_date "GREATER_THAN_DATE"
--greater_than_batch "GREATER_THAN_BATCH"
--background
import qualytics.qualytics as qualytics
DATASTORE_ID = 1172
CONTAINER_NAMES = "CUSTOMER, NATION"
qualytics.scan_operation(
datastores=str(DATASTORE_ID),
container_names=None,
container_tags=None,
incremental=False,
remediation="none",
enrichment_source_record_limit=10,
greater_than_batch=None,
greater_than_time=None,
max_records_analyzed_per_partition=10000,
background=False
)
Successfully Started Scan 29467 for datastore: 1172
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Waiting for operation to finish
Successfully Finished Scan operation 29467 for datastore: 1172
Processing... ---------------------------------------- 100% 0:03:04
Options:
Option | Type | Description | Required |
---|---|---|---|
`--datastore` | TEXT | Comma-separated list of Datastore IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
`--container_names` | TEXT | Comma-separated list of container names or array-like format. Example: "container1,container2" or "[container1,container2]" | No |
`--container_tags` | TEXT | Comma-separated list of container tags or array-like format. Example: "tag1,tag2" or "[tag1,tag2]" | No |
`--incremental` | BOOL | Process only records that are new or updated since the last incremental scan | No |
`--remediation` | TEXT | Replication strategy for source tables in the enrichment datastore. Either 'append', 'overwrite', or 'none' | No |
`--max_records_analyzed_per_partition` | INT | Maximum number of records analyzed per partition. Value must be greater than or equal to 0 | No |
`--enrichment_source_record_limit` | INT | Limit of enrichment source records. Value must be greater than or equal to -1 | No |
`--greater_than_date` | DATETIME | Only include rows where the incremental field's value is greater than this time. Use one of these formats: %Y-%m-%dT%H:%M:%S or %Y-%m-%d %H:%M:%S | No |
`--greater_than_batch` | FLOAT | Only include rows where the incremental field's value is greater than this number | No |
`--background` | BOOL | Starts the scan operation but does not wait for it to finish | No |
Note: Errors during any of the three operations will be logged in `$HOME/.qualytics/operation-error.log`.
Check Operation Status
To check the status of operations:
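A sketch using the option documented below; the subcommand name is illustrative only, so confirm the exact command with `qualytics --help`:

```bash
# "operation check_status" is an assumed subcommand name
qualytics operation check_status --ids "1,2,3"
```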
Options:
Option | Type | Description | Required |
---|---|---|---|
`--ids` | TEXT | Comma-separated list of Operation IDs or array-like format. Example: 1,2,3,4,5 or "[1,2,3,4,5]" | Yes |
Ended: Qualytics CLI
FAQ ↵
Quality Scores
Quality Scores are quantified measures of data quality calculated at the field and container levels, recorded as time-series to enable tracking of changes over time. Scores range from 0 to 100, with higher values indicating superior quality. These scores integrate eight distinct factors, providing a granular analysis of the attributes that impact overall data quality.
Quality Scoring a Field
Each field receives a total quality score based on eight key factors, each evaluated on a 0-100 scale. The overall score is a composite reflecting the relative importance and configured weights of these factors:
- Completeness: Measures the average completeness of a field across all profiles.
- Coverage: Assesses the adequacy of data quality checks for the field.
- Conformity: Checks alignment with standards defined by quality checks.
- Consistency: Ensures uniformity in type and scale across all data representations.
- Precision: Evaluates the resolution of field values against defined quality checks.
- Timeliness: Gauges data availability according to schedule, inheriting the container's timeliness.
- Volumetrics: Analyzes consistency in data size and shape over time, inheriting the container's volumetrics.
- Accuracy: Determines the fidelity of field values to their real-world counterparts.
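As an illustrative sketch only (the actual aggregation and weights are configured within the platform), the composite score can be thought of as a weighted average of the eight factor scores:

$$
\text{score} = \frac{\sum_{i=1}^{8} w_i \, f_i}{\sum_{i=1}^{8} w_i}, \qquad f_i \in [0, 100]
$$

where $f_i$ is the score for factor $i$ and $w_i$ is its configured weight.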
Quality Scoring a Container
A container is any structured data entity, such as a table or dataframe, that comprises multiple fields. Containers are scored using the same eight factors, with each factor's score derived from a weighted average across its fields. Additional container-specific metrics also influence the total quality score:
- Shape anomaly adjustments
- Volumetric checks
- Scanning frequency
- Profiling frequency
- Timeliness assessments through Freshness SLA
- Impact of Freshness SLA violations
Customizing Quality Score Weights and Decay Time
You can tailor the impact of each quality factor on the total score by adjusting their weights, allowing the scoring system to align with your organization’s data governance priorities. Additionally, the decay period for considering past data events defaults to 180 days but can be customized to fit your operational needs, ensuring the scores reflect the most relevant data quality insights.
Factor Impacting Rule Types
Specific check rule types are considered for factor score calculations at the field level for the following factors.
Conformity Rule Types
RuleType.matchesPattern
RuleType.minLength
RuleType.maxLength
RuleType.isReplicaOf
RuleType.isType
RuleType.entityResolution
RuleType.expectedSchema
RuleType.fieldCount
RuleType.isCreditCard
RuleType.isAddress
RuleType.containsCreditCard
RuleType.containsUrl
RuleType.containsEmail
RuleType.containsSocialSecurityNumber
RuleType.minPartitionSize
RuleType.maxPartitionSize
Precision Rule Types
RuleType.afterDateTime
RuleType.beforeDateTime
RuleType.between
RuleType.betweenTimes
RuleType.equalTo
RuleType.equalToField
RuleType.greaterThan
RuleType.greaterThanField
RuleType.lessThan
RuleType.lessThanField
RuleType.maxValue
RuleType.minValue
RuleType.notFuture
RuleType.notNegative
RuleType.positive
RuleType.predictedBy
RuleType.sum
Volumetric Rule Types
Printing This Guide
This guide changes often, so we can't recommend printing it or saving it offline. However, we recognize that there are circumstances beyond some users' control that make doing so more convenient. Thus, we are providing this link to our userguide as a single page appropriate for saving as a PDF. If you find this useful, please let us know.
Ended: FAQ
Misc ↵
SSO (Single Sign-On)
SSO for PaaS Deployments
Qualytics platform harnesses the power of Auth0's Single Sign-On (SSO) technology to create a frictionless authentication journey for our PaaS users. Once users have successfully logged in to Qualytics, they can conveniently access all linked external applications and services without the need for additional sign-ins. Depending on the application and its compatibility with federated SSO protocols such as SAML, OIDC, or any proprietary authentication methods, Qualytics, with the help of Auth0, establishes a secure connection for user authentication. In essence, SSO allows one central domain to authenticate and then share the session across various other domains. The method of sharing may vary between SSO protocols, but the principle remains constant.
Through Auth0's Integration Network (OIN), Qualytics extends SSO access to an extensive range of supported cloud-based applications. These integrations can utilize OpenID Connect (OIDC), SAML, SWA, or proprietary APIs for SSO. Maintenance of SSO protocols and provisioning APIs is reliably managed by Auth0.
In addition to this, Qualytics also leverages Auth0's capabilities to provide SSO integrations for on-premises web-based applications. You have the option to integrate these applications via SWA or SAML toolkits. In addition, Auth0 supports user provisioning and deprovisioning with applications that publicly offer their provisioning APIs.
Further enhancing our SSO integrations, Qualytics provides seamless access to mobile applications. Whether they are web applications optimized for mobile devices, native iOS apps, or Android apps, users can access web app integrations in the OIN using SSO from any mobile device. These mobile web apps can employ industry-standard OIDC, SAML, or Auth0 SWA technologies. To illustrate, Qualytics, in conjunction with Auth0, can integrate with native applications such as Box Mobile using SAML for registration and OAuth for continuous use.
Auth0 supports the following enterprise providers out of the box: - OAuth2 - Active Directory/LDAP
SSO for On-Premise Deployments
In addition to the option of leveraging our robust Auth0 support for federated authentication, customer-managed deployments can choose to integrate directly with their IdP (Identity Provider, such as Active Directory, ForgeRock, etc.) using OpenID Connect (OIDC). Once configured for direct federated authentication using OIDC, the customer's own user login requirements fully govern the authentication process, supporting a fully air-gapped deployment of Qualytics with no egress required for operations.
Deployment options
Overview
The following two deployment options are supported for the Qualytics platform:
- Platform as a Service Deployment: to a single-tenant virtual private cloud (VPC) provisioned by Qualytics on infrastructure that Qualytics manages
- On-Premises Deployment: to a CNCF compliant kubernetes control plane on Customer managed infrastructure
Platform as a Service (PaaS) Deployment:
Depending on Customer’s cloud infrastructure, this option uses one of the following:
- EKS (Elastic Kubernetes Service)
- AKS (Azure Kubernetes Service)
- GKE (Google Kubernetes Engine)
- Oracle OKE (Oracle's Container Engine for Kubernetes)
The Qualytics platform is deployed to a single-tenant virtual private cloud provisioned by Qualytics, with the provider and in the region of the Customer’s choosing. This VPC is not shared (single-tenant) and contains a single Customer Qualytics deployment. This model requires that the provisioned VPC have the ability to access the Customer’s datastore(s). In the case of publicly routable datastores such as Snowflake or S3, no extra configuration is required. In the case of private datastore(s) with no public IP address or route, the hosted VPC will require private routing using PrivateLink, Transit Gateway peering, point-to-point VPN, or similar support to enable network access to that private datastore.
Considerations: This is Qualytics’ preferred model of deployment. In this model, Qualytics is fully responsible for the provisioning and operation of the Qualytics platform. The Customer is only responsible for granting the Qualytics platform the necessary access.
On-Premises Deployment:
This option supports deployments to any Kubernetes control plane that meets the following system requirements:
- A Kubernetes version that is officially supported (still receiving patches), running on any CNCF compliant control plane
- A minimum 16 cores and 80 gigabytes of memory available for workload allocation
- A Customer-resolvable fully-qualified domain name assigned for the HTTPS ingress to the Qualytics UI
- (optional) Grant Qualytics an admin-level ServiceAccount to the cluster for pushing automated updates
This option requires that the kubernetes nodes supporting Qualytics’ analytics engine have the ability to access Customer’s datastore(s). Because Customer hosts the Qualytics deployment, Customer is solely responsible for ensuring the necessary network configuration and support.
Considerations: This option supports organizations that, due to regulatory or other restrictions, cannot permit READ access to their datastore(s) from a third-party hosted product. This model requires the Customer to manage and operate the appropriate infrastructure and ensure it is granted all necessary access to the targeted datastore(s). For deployments to supported commercial Kubernetes control planes (EKS, AKS, GKE, OKE) and at the Customer’s discretion, Qualytics will provision the deployment and transfer ownership of the applicable infrastructure to the Customer. Otherwise, the Customer shall be responsible for both the provisioning of a cluster meeting the requisite system requirements and the deployment of the Qualytics platform via the Qualytics provided Helm chart.
Installing helm for Qualytics single-tenant instance
Welcome to the Installation Guide for setting up Helm for your Qualytics Single-Tenant Instance.
Qualytics is a closed source container-native platform for assessing, monitoring, and ameliorating data quality for the Enterprise.
Learn more about our product and capabilities here.
Important Note for Deployment Type
Before proceeding with the installation of Helm for Qualytics Single-Tenant Instance, please note the following:
-
This installation guide is specifically designed for on-premises customers who manage their own infrastructure.
-
If you are a Qualytics Software as a Service (SaaS) customer, you do not need to perform this installation. The Helm setup is managed by Qualytics for SaaS deployments.
If you are unsure about your deployment type or have any questions, please reach out to your Qualytics account manager for clarification.
What is in this chart?
This chart will deploy a single-tenant instance of the Qualytics platform to a CNCF compliant Kubernetes control plane.
How should I use this chart?
Work with your account manager at Qualytics to securely obtain the appropriate values for your licensed deployment. If you don't yet have an account manager, please write us here to say hello!
At minimum, you will need credentials for our Docker Private Registry and a set of Auth0 secrets that will be used in the following steps.
1. Create a CNCF compliant cluster
Qualytics fully supports kubernetes clusters hosted in AWS, GCP, and Azure as well as any CNCF compliant control plane.
Node Requirements
Node(s) with the following labels must be made available:
appNodes=true
sparkNodes=true
Nodes with the `sparkNodes=true` label will be used for Spark jobs, and nodes with the `appNodes=true` label will be used for all other needs.
It is possible to provide a single node with both labels if that node provides sufficient resources to operate the entire cluster according to the specified chart values.
However, it is highly recommended to set up autoscaling for Apache Spark operations by providing a group of nodes with the `sparkNodes=true` label that will grow on demand.
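For example, on an existing cluster the labels can be applied with kubectl (the node names below are placeholders):

```bash
# label the node(s) reserved for application workloads
kubectl label nodes app-node-1 appNodes=true

# label the node group used for Spark workloads
kubectl label nodes spark-node-1 spark-node-2 sparkNodes=true
```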
 | Application Nodes | Spark Nodes |
---|---|---|
Label | `appNodes=true` | `sparkNodes=true` |
Scaling | Fixed (1 node on-demand pricing) | Autoscaling (1 - 21 nodes spot pricing) |
EKS | t3.2xlarge | r5d.2xlarge |
GKE | n2-standard-8 | c2d-highmem-8 |
AKS | Standard_D8_v5 | Standard_E8s_v5 |
Docker Registry Secrets
Execute the command below using the credentials supplied by your account manager as replacements for "your-name" and "your-pword". The secret created will provide access to Qualytics private registry and the required images that are available there.
kubectl create secret docker-registry regcred --docker-server=artifactory.qualytics.io/docker --docker-username=<your-name> --docker-password=<your-pword>
Important
If you are unable to directly connect your cluster to our image repository for technical or compliance reasons, then you can instead import our images into your preferred registry using these same credentials. You'll need to update the image URLs in the values.yaml file in the next step to point to your repository instead of ours.
2. Update values.yaml with appropriate values
Update values.yaml
according to your requirements. At minimum, the "secrets" section at the top should be updated with the Auth0 settings supplied by your Qualytics account manager.
auth0_audience: changeme-api
auth0_organization: org_changeme
auth0_spa_client_id: spa_client_id
auth0_client_id: m2m_client_id
auth0_client_secret: m2m_client_secret
auth0_user_client_id: m2m_user_client_id
auth0_user_client_secret: m2m_user_client_secret
Contact your Qualytics account manager for assistance.
3. Deploy Qualytics to your cluster
The following command will first ensure that all chart dependencies are available and then proceed with an installation of the Qualytics platform.
helm repo add qualytics https://qualytics.github.io/qualytics-helm-public
helm upgrade --install qualytics qualytics/qualytics --namespace qualytics --create-namespace -f values.yaml
As part of the install process, an nginx ingress will be configured with an inbound IP address. Make note of this IP address as it is needed for the fourth and final step!
4. Register your deployment's web application
Send your account manager the IP address for your cluster ingress gathered from step 3. Qualytics will assign a DNS record to it under *.qualytics.io
so that your end users can securely access the deployed web application from a URL such as https://acme.qualytics.io
Upgrade Qualytics Helm chart
Do you have the Qualytics Helm chart repository locally?
Make sure you have the Qualytics Helm chart repository in your local Helm repositories. Run the following command to add them:
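This is the same repository used during installation:

```bash
helm repo add qualytics https://qualytics.github.io/qualytics-helm-public
```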
Update Qualytics Helm Chart:
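Refresh your local index of the chart repositories so the latest chart versions are visible:

```bash
helm repo update
```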
Target Helm chart version?
The target Helm chart version must be higher than the current Helm chart version.
To see all available Helm chart versions of the specific product, run this command:
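For example, for the `qualytics/qualytics` chart referenced above:

```bash
helm search repo qualytics/qualytics --versions
```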
Upgrade Qualytics Helm Chart:
helm upgrade --install qualytics qualytics/qualytics --namespace qualytics --create-namespace -f values.yaml
Monitor Update Progress:
Monitor the progress of the update by running the following command:
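For example, assuming the `qualytics` namespace used during installation:

```bash
kubectl get pods --namespace qualytics --watch
```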
Watch the status of the pods in real-time. Ensure that the pods are successfully updated without any issues.
Verify Update
Once the update is complete, verify the deployment by checking the pods' status:
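For example:

```bash
kubectl get pods --namespace qualytics
```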
Ensure that all pods are running, indicating a successful update.
Can I run a fully "air-gapped" deployment?
Yes. The only egress requirement for a standard self-hosted Qualytics deployment is to https://auth.qualytics.io
which provides Auth0 powered federated authentication.
This is recommended for ease of installation and support, but not a strict requirement. If you have need of a fully private deployment with no access to the public internet, you can instead configure an OpenID Connect (OIDC) integration with your enterprise identity provider (IdP).
Simply contact your Qualytics account manager for more details.
Qualytics Scheduled Operations
Users may want to create their own scheduled operations in Qualytics for various reasons, such as automating routine tasks, data exports, or running specific operations at regular intervals. This guide will walk you through the process of creating a scheduled task.
On Linux machine
Prerequisites
Before proceeding, ensure that you have the following:
- Access to the terminal on your machine.
- The `curl` command-line tool installed.
- The desired Qualytics instance details, including the instance URL and authentication token.
Steps to Create a Scheduled Operation
1. Open the Crontab Editor
Run the following command in your terminal to open the crontab editor:
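For example:

```bash
# opens your user crontab in the default editor
crontab -e
```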
2. Add the Cron Job Entry
In the crontab editor, add the following line to execute the curl command at your specified schedule:
<cronjob-expression> /usr/bin/curl --request POST --url 'https://<your-instance>.qualytics.io/api/export/<operation>?datastore=<datastore-id>&containers=<container-id-one>&containers=<container-id-two>' --header 'Authorization: Bearer <your-token>' >> <path-to-show-logs> 2>&1
3. Example:
For example, to run the command every 5 minutes:
*/5 * * * * /usr/bin/curl --request POST --url 'https://your-instance.qualytics.io/api/export/anomalies?datastore=123&containers=14&containers=16' --header 'Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...' >> /path/to/show/logs.txt 2>&1
4. Verify or List Cron Jobs:
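To confirm the entry was saved, list your active cron jobs:

```bash
# lists the cron jobs configured for the current user
crontab -l
```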
Customize the placeholders based on your specific details and requirements. Save the crontab file to activate the scheduled operation.
On Windows machine
Prerequisites
Before proceeding, ensure that you have the following:
- Access to the PowerShell on your machine.
- The desired Qualytics instance details, including the instance URL and authentication token.
Steps to Create a Scheduled Operation
1. Open your text editor of your preference and add the script entry
In the text editor, add the following line to execute the Invoke-RestMethod
command:
Invoke-RestMethod -Method 'Post' -Uri 'https://<your-instance>/api/export/anomalies?datastore=<datastore-id>&containers=<container-id-one>&containers=<container-id-two>' -Headers @{'Authorization' = 'Bearer <your-token>'; 'Content-Type' = 'application/json'}
2. Example:
For example, to run the command every 5 minutes:
Invoke-RestMethod -Method 'Post' -Uri 'https://your-instance.qualytics.io/api/export/anomalies?datastore=123&containers=44&containers=22' -Headers @{'Authorization' = 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'; 'Content-Type' = 'application/json'}
Customize the placeholders based on your specific details and requirements. Save the script with the desired name and the `.ps1` extension.
3. Add the script to the Task Scheduler:
- Open Task Scheduler:
    - Press `Win + S` to open the Windows search bar.
    - Type "Task Scheduler" and select it from the search results.
- Create a Basic Task:
    - In the Task Scheduler window, click on `Create Basic Task...` on the right-hand side.
- Provide a Name and Description:
    - Enter a name and description for your task. Click `Next` to proceed.
- Choose Trigger:
    - Select when you want the task to start. Options include `Daily`, `Weekly`, or `At log on`.
    - Choose the one that fits your schedule. Click `Next`.
- Set the Start Date and Time:
    - If you selected a trigger that requires a specific start date and time, set it accordingly. Click `Next`.
- Choose Action:
    - Select `Start a program` as the action and click `Next`.
- Specify the Program/Script:
    - In the `Program/script` field, provide the path to the PowerShell executable (`powershell.exe`), typically located at `C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe`. Alternatively, you can just type `powershell.exe`.
    - In the `Add arguments (optional)` field, provide the path to your PowerShell script. For example: `-File "C:\Path\To\Your\GeneratedScript.ps1"`.
    - Click `Next`.
- Review Settings:
    - Review your task settings. If everything looks correct, click `Finish`.
- Finish:
    - You should now see your task listed in the Task Scheduler Library.
Installing Qualytics CLI:
Prerequisites
Before installing the Qualytics CLI, ensure you have the following prerequisites:
- Python: Make sure Python is installed on your machine. You can download Python from python.org.
- pip (Python Package Installer): Verify that pip is installed. It usually comes with Python installations.
- Qualytics Account: Obtain your Qualytics API access token.
Installation
Open a Terminal and Install Qualytics CLI:
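Assuming the CLI is published on PyPI as `qualytics-cli`:

```bash
pip install qualytics-cli
```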
Verify Installation:
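For example, by printing the CLI help (output varies by version):

```bash
qualytics --help
```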
Initialization:
To use the Qualytics CLI, initialize it with your Qualytics instance details and API token:
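A sketch with placeholder values (see the Initializing Configuration section above for the option details):

```bash
qualytics init \
  --url "https://your-instance.qualytics.io" \
  --token "YOUR_PERSONAL_ACCESS_TOKEN"
```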
Replace placeholders with your Qualytics instance URL and API token.
Automated Setup Using Qualytics CLI:
For Linux and Windows Users
Use the Qualytics CLI to schedule a task automatically.
qualytics schedule export-metadata --crontab "<cronjob-expression>" --datastore <datastore-id> --containers <container-ids> --options <metadata-options>
Replace placeholders as needed.
Behaviour on Linux:
The CLI will create the files inside your `home/user/.qualytics` folder.
The scheduled operation commands are located in `home/user/.qualytics/schedule-operation.txt`.
Log files for the options you selected are also created there, containing the output of each cronjob run.
A cronjob entry is created for you automatically; you can run `crontab -l` to list all cronjobs.
Behaviour on Windows:
The CLI will create the files inside your `home/user/.qualytics` folder.
The script files are located in `home/user/.qualytics` with the pattern `task_scheduler_script_<option-you-selected>_<datastore-number>.ps1`; you then just need to follow the steps above to register them in the Task Scheduler.
Explanation of Placeholders:
- `<cronjob-expression>`: Replace this with your desired cron expression. For example, `*/5 * * * *` means "every 5 minutes." You can check `crontab.guru` for more examples.
- `<your-instance>`: Replace with the actual Qualytics instance URL.
- `<operation>`: Replace with the specific operation (e.g., "anomalies", "checks" or "field-profiles").
- `<datastore-id>`: Replace with the ID of the target datastore.
- `<container-id-one>` and `<container-id-two>`: Replace with the IDs of the containers. You can add more containers as needed.
- `<container-ids>`: Comma-separated list of container IDs or array-like format. Example: "1, 2, 3" or "[1,2,3]".
- `<options>`: Comma-separated list of options to export, or `all` for everything. Example: anomalies, checks, field-profiles or all.
- `<your-token>`: Replace with the access token obtained from Qualytics (`Settings` -> `Security` -> `API Keys`).
- `<path-to-show-logs>`: Replace with the file path where you want to store the logs.
DFS multi-token filename globbing
Overview
Our data quality product offers a sophisticated feature that facilitates the organization and categorization of files on a distributed filesystem. This feature, known as Multi-Token Filename Globbing, enables the system to recursively scan files and intelligently group them based on shared filename conventions. It achieves this through a combination of filename pattern analysis and globbing techniques.
Process
- Delimiter Identification: The first step involves identifying a common delimiter in filenames, such as an underscore (_) or dash (-). This delimiter is used to split the filenames into tokens.
- Tokenization and Grouping: Once the filenames are tokenized, the system groups them based on shared tokens. This is achieved through a method called applyMultiTokenGlobbing.
- Glob Pattern Formation: The core of this feature lies in forming glob patterns that represent groups of files sharing a schema. These patterns are created using the tokens derived from the filenames.
Methodology
- Initial Token Grouping: The method begins by grouping filenames based on each token. It considers the number of tokens and processes each token index separately.
-
Left or Right Side Grouping Decision: The system decides whether to group tokens starting from the left side or the right side of the filename, based on the distribution of tokens.
-
Pattern Creation Logic:
-
For filenames with a single token, the system avoids globbing and keeps the filenames as they are.
- For multi-token filenames, the method constructs a container name (glob pattern) by iterating through each token.
-
At each token, the method decides whether to include the token as-is or replace it with a wildcard (*). This decision is based on several factors, such as:
- The uniqueness of the token in the context of other filenames.
- The nature of the token (e.g., all letters).
- The comparison of token counts in adjacent indexes.
-
Special Cases Handling: The method includes logic to handle special cases, such as all-letter tokens, tokens at the beginning or end of a filename, and unique tokens.
- Glob Pattern Optimization: Finally, the system optimizes the glob patterns, ensuring that each pattern uniquely represents a group of files with a shared schema. This is done by comparing new patterns with existing ones and updating them based on the latest file modifications.
Detailed Methodology: Multi-Token Filename Globbing
Step-by-Step Process
Delimiter Identification and Tokenization
The system identifies a common delimiter in the filenames, typically an underscore (_) or dash (-), and splits the filenames into tokens.
Token Grouping and Indexing
- Each token in a filename is indexed (0, 1, 2, ...).
- Filenames are grouped based on the value of tokens at each index.
Determining Grouping Strategy
- The system decides whether to group tokens from the left (start of filename) or right (end of filename) based on the distribution and variation of tokens at each index.
Pattern Creation Logic
- Single-Token Filenames: No globbing is applied to filenames with only one token.
- Multi-Token Filenames: The method constructs glob patterns by analyzing each token. It considers factors like token uniqueness, commonality, and special cases like all-letter tokens.
Uniqueness vs. Commonality:
- Unique tokens (unique in their position across all filenames) are replaced with a wildcard "*".
- Common tokens across many files are kept as they are in the pattern.
Special Considerations for All-Letter Tokens:
- Tokens composed entirely of letters are often grouped together, unless they are unique identifiers.
- Tokens at the start or end of a filename are treated with contextual logic, considering their potential roles (like identifiers or file types).
Adjacent Token Group Sizes:
The method compares the group sizes of adjacent tokens to determine if a token leads to a tighter grouping, influencing whether it's kept as literal or replaced with a wildcard.
Constructing Container Names (Glob Patterns)
-
For each token index, the method constructs a container name, deciding whether to include the token as-is or replace it with "*".
-
This decision is influenced by factors like the uniqueness of the token, the nature of the token (all letters or not), and the comparison of token counts in adjacent indexes.
Optimization and Finalization
- The system optimizes the glob patterns to ensure each pattern uniquely represents a group of files with a shared schema.
- It compares new patterns with existing ones and updates them based on the latest file modifications.
Example Scenarios
- Filename: `"project_data_2023_v1.csv"`
    - Potential Pattern: `"project_data_*_*.csv"` (if "2023" and "v1" vary across files).
- Filename: `"user_123_profile_2023-06-01.json"`
    - Potential Pattern: `"user_*_profile_*.json"` (if "123" and dates vary, and "user" and "profile" are consistent).
- Filename: `"log2023-06_error.txt"`
    - Potential Pattern: `"*_error.txt"` (if dates vary but "error" is a constant token).
Limitations
Context
While the Multi-Token Filename Globbing feature is a powerful tool for organizing files in distributed filesystems, including object storage systems like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage, it's important to understand the limitations of using glob patterns with wildcards in these environments.
Wildcard Mechanics in Directory Listings
Wildcard Character (*):
In glob patterns, the asterisk (*) is used as a wildcard that matches any character, any number of times. This flexibility is powerful for grouping a wide range of file patterns but has limitations in precision.
Behavior in Object Storage Systems:
- Systems like AWS S3, GCS, and Azure Blob interpret the wildcard in a glob pattern to match any sequence of characters in a filename.
- This means a pattern with a wildcard can encompass a broad range of filenames, potentially grouping files that were not intended to be grouped together.
Specific Limitation Example
Consider the following scenario to illustrate this limitation:
Intended File Grouping Patterns:
- Pattern A: `project_data_*.txt`
- Pattern B: `project_data_*_*.txt`
Example Filenames:
- `project_data_1234.txt`
- `project_data_1234_suffix.txt`
Limitation in Practice:
- In this case, Pattern A (`project_data_*.txt`) is intended to match files like `project_data_1234.txt`. However, due to the nature of the wildcard, this pattern will also inadvertently match `project_data_1234_suffix.txt`.
- The wildcard in Pattern A extends to any length of characters following `project_data_`, making it impossible to exclusively group files that strictly follow the `project_data_1234.txt` format without including those with additional suffixes like `project_data_1234_suffix.txt`.
Addressing the Limitations:
Understanding the inherent limitations of glob patterns, particularly when dealing with wildcards in object storage systems, is crucial for effective file management.
When users encounter scenarios where filenames within a folder are incompatible due to these limitations, several practical options are available.
Ensure appropriate file grouping:
Separation into Distinct Folders:
One effective strategy is to organize files with conflicting name formats into separate folders.
By doing so, the resultant glob patterns within each folder will be distinct and won’t overlap, ensuring precise file grouping.
Leveraging Folder-Globbing Feature:
For added flexibility, users can also utilize our folder-globbing feature.
This feature simplifies the grouping process by aggregating all files in the same folder, regardless of their filename patterns. This approach is particularly useful in scenarios where filename-based grouping is less critical or when dealing with a wide variety of filename formats within the same directory.
Customized Filename Conventions:
Users are encouraged to adopt filename conventions that align better with the capabilities and limitations of glob patterns. By designing filenames with clear, distinct segments and predictable structures, users can more effectively leverage the globbing feature for accurate file categorization.
Conclusion
The Multi-Token Filename Globbing feature stands out as a powerful and efficient tool for organizing and categorizing files within a distributed filesystem.
By astutely analyzing filename patterns and forming optimized glob patterns, this feature significantly streamlines the process of managing files that share common schemas, thereby elevating the overall data quality and accessibility within the system.
Ended: Misc
Ended: Getting Started
Release Notes
2024.12.11
Feature Enhancements
- Add `Max Parallelization` Field on Datastore Connection
- Users can now configure the maximum parallelization level for certain datastores, providing greater control over operation performance.
General Fixes
- General Fixes and Improvements.
2024.11.29
Feature Enhancements
- Activity List
- Removed the `Warning` status for a cleaner and more concise status display.
- Added an alert icon to indicate if an operation completed with warnings, improving visibility into operation outcomes.
General Fixes
- Better handling of Oracle Date and Numeric columns during Catalog operations for improved partition field selection.
- General Fixes and Improvements.
2024.11.21
Feature Enhancements
- Improved Operations Container Dialogs
- Added container status details based on profile and scan results, providing better visibility of container-level operations.
- Introduced a loading tracker component for containers, enhancing feedback during operation processing.
- Made the entire modal reactive to operation updates, enabling real-time tracking of operation progress within the modal.
- Removed "containers requested" and "containers analyzed" dialogs for a cleaner interface.
General Fixes
- Resolved an issue where the table name was not rendering correctly in notifications when using the `{{ customer_name }}` variable.
- General Fixes and Improvements.
2024.11.12
Feature Enhancements
-
Enhance Data Catalog Integration
- Introduced a new domain input field that allows users to select specific domains, enabling more granular control over assets synchronization.
-
Scan Results Enhancements
- Added partition label to the scan results modal for improved partition identification.
- Removed unnecessary metadata partitions created solely for volumetric checks, reducing clutter in scan results.
-
Activity Tab
- Display of Unprocessed Containers in the Operation List
- Unprocessed containers are now visible in the operation list within the operation summary.
- A total count label was added to indicate if the number of analyzed containers exceeds the total requested.
- The search icon now highlights in a different color if not all containers were analyzed, making it easier to identify incomplete operations.
- Reorder the Datastore Column in the Activity Tab
- Users can now reorder columns in the Activity tab for easier navigation and data organization.
- Profile Operations
- Users can now view added, updated, and total inferred checks within Profile operations.
- Triggered by Column
- Updated the term "Triggered by API" to "Triggered by System" for clarity.
General Fixes
- General Fixes and Improvements.
2024.11.01
Feature Enhancements
-
Observability Enhancements
- An observability heatmap was added to the volumetric card in the Observability tab.
- The heatmap allows users to monitor volumetric status and check for new anomalies.
- Improved observability chart for clearer insights.
- Users can now view the count of volumetric anomalies produced over time, along with the last recorded measurements for each period.
- Introduced new color indicators to help distinguish volumetric measures outside thresholds that didn’t produce anomalies from those that did.
-
Editable Tags in Field Details
- Users with write permissions can now manage tags directly in the Field Details within the Explore context.
-
Distinct Count Rule Update
- The Distinct Count rule now excludes the Coverage field for more accurate assessments.
-
Support for Pasting into Expected Values
- Users can now paste values from spreadsheets directly into Expected Values, saving time on data entry.
General Fixes
- General Fixes and Improvements.
2024.10.23
Feature Enhancements
-
Dremio Connector
- We’ve expanded our connectivity options by supporting a new connection with Dremio.
-
Full View of Abbreviated Metrics in Operation Summary
- Users can now hover over abbreviated metrics to see the full value for better clarity.
-
Redirect to Conflicting Check
- Added a redirect link to the conflicting check from the error message, improving navigation when addressing errors.
-
Enhanced Visibility and Engagement for Tags and Notifications Setup
- Introduced a Call to Action to encourage users to manage Tags and Notifications for better engagement.
-
Favorite Containers
- Users can now favorite individual containers.
- The option to favorite datastores and containers is now available in both card and list views.
General Fixes
- General Fixes and Improvements.
2024.10.16
Feature Enhancements
-
Improved Anomaly Modal
- Introduced an information icon in each failed check to display the check's description.
- Anomaly links now persist filters for sort order and displayed fields.
- Added integration details to fields in a source record.
-
Secrets Management
- Added support for Secrets Manager in connection properties, enabling integration with Vault and other secrets management systems.
-
Alation Data Dictionary
- Enhanced the dictionary to display friendly names in anomaly screens for improved usability.
- Added integration information to the datastore, container, and fields in the tree view footer.
-
Tag Category
- Introduced support for tag categories to improve tag management, with sorting and filtering options based on the category field.
-
Call to Action for Volumetric Measurements
- A call to action was added in the overview tab within the container context, and the observability page per container was added to enable volumetric measurements.
-
Error Display for Check Operations
- Bulk operations like Edit, Activate, Update, and Template Edit now display error messages clearly when validation fails.
-
Check Validation
- Improved check validation logic to enhance bulk check validation speed and prevent timeouts.
-
Tag Filtering for Fields
- Users can now filter fields by tags in the field list under the datastore context.
-
Field Remarks in Native Field Properties
- Added support for displaying field remarks alongside other native field properties.
-
Customer Support Link
- Users can now access the Qualytics Helpdesk via the Discover menu in the main header.
General Fixes
- General Fixes and Improvements.
2024.10.04
Feature Enhancements
-
Insights Page Redesign
- Introduced a new Overview card displaying key metrics such as Data Under Management, Source Datastores, and Containers.
- Added a doughnut chart visualization for checks and anomalies, providing a clearer view of data health.
- Expanded available metrics to include profile runs and scan runs.
- Users can now easily navigate to Checks and Anomalies based on their current states and statuses.
- Implemented data volume visualizations to give users better insight into data trends.
- Introduced a legend option that allows users to compare specific metrics against the primary one.
- Enhanced the check distribution visualization across the platform within the overview tabs.
-
Check Filter
- Now users can filter Not Asserted checks.
-
Team Management
- Now admin users can modify the Read and Write permissions of the Public Team.
-
Reapplying Clone Field
- Improved check cloning by attempting to reapply the field from the original (source) check when a new container is selected. If the selected container has a field matching the name and type from the original check, the cloned field is reapplied automatically.
General Fixes
-
Allow saving checks with attached templates as drafts
- Adjusted the behavior to allow checks attached to a template to be saved as drafts. The Save as draft feature now remains functional when a template is attached.
-
Unexpected Incremental Identifier Behavior
- Addressed an issue where the Incremental Modifier was set to null when a user modified the query of a computed table.
-
General Fixes and Improvements
2024.09.25
Feature Enhancements
-
Observability
- Time-series charts are presented to monitor data volume and related anomalies for each data asset.
- Custom thresholds were added to adjust minimum and maximum volume expectations.
- The Metrics tab has been moved to the Observability tab.
- The Observability tab has replaced the Freshness page.
-
Check Category Options for Scan Operations
- Users can select one or multiple check categories when running a scan operation.
-
Anomaly Trigger Rule Type Filter
- Added a filter by check rule types to anomaly triggers. A help component was added to the tags selector to improve clarity.
-
Auto-Archive Anomalies
- A new Duplicate status has been introduced for anomalies.
- Users can now use Incremental Identifier ranges to auto-archive anomalies with the new Duplicate status.
- An option has been added to scan operations to automatically archive anomalies identified as duplicates if the containers analyzed have incremental identifiers configured.
-
A dedicated tab for filtering duplicate anomalies has been added for better visibility.
-
Tree View and Breadcrumb Context Menu
- A context menu has been added, allowing users to copy essential information and open links in new tabs.
- Users can access the context menu by right-clicking on the assets.
-
Incremental Identifier Support
- Users can manage incremental identifiers for computed tables and computed files.
-
Native Field Properties
- Users can now see native field properties in the field profile, displayed through an info icon next to the Type Inferred section.
-
Qualytics CLI Update
- Users can now import check templates.
- A status filter has been added to check exports. Users can filter by Active, Draft, or Archived (which will include Invalid and Discarded statuses).
General Fixes
- The Oracle connector now handles invalid schemas when creating connections.
- Fixed an issue where anomaly counts for scan operations did not account for archived statuses.
- Improved error message when a user creates a schedule name longer than 50 characters.
- General Fixes and Improvements.
Breaking Changes
- Freshness and SLA references have been removed from user notifications and notification rules; users should migrate to Observability using volumetric checks.
2024.09.14
Feature Enhancements
-
Volumetric Measurement
- We are excited to introduce support for volumetric measurements of views, computed tables and computed files.
-
Enhanced Source Record CSV Download
- Users can now download as CSV all source records that have been written to the enrichment datastores.
-
Tags and Notifications Moved to Left-Side Navigation
- Users can now quickly switch between Tags, Notifications, and Data Assets through the left-side navigation.
- Access to the Settings page is restricted to admin users.
-
Last Asserted Information in Checks
- The Created Date information has been replaced with Last Asserted to improve visibility.
- Users can hover over an info icon to view the Created Date.
-
Auto-Generated Description in Check Template Dialog
- Descriptions are now automatically generated in the Template Dialog based on the rule type, ensuring consistency with the check form.
-
Exposed Properties in Profile and Scan Operations
- Profile and scan operations now expose properties when listed:
- Record Limit
- Infer As Draft
- Starting Threshold
- Enrichment Record Limit
General Fixes
- Fixed a bug where the container list would not update when a user created a computed container.
- Fixed an issue where deactivated users were not filtered on the Settings page under the Security tab.
- Improved error messages when operations fail.
- Fixed a bug where the Last Editor field was empty after a user was deactivated by an admin.
- General Fixes and Improvements.
2024.09.10
Feature Enhancements
-
Add Source Datastore Modal
- Enhanced text messages and labels for better clarity and user experience.
-
Add Datastore
- Users can now add a datastore directly from the Settings page under the Connections tab, simplifying connection management.
General Fixes
- General Fixes and Improvements
2024.09.06
Feature Enhancements
- Introducing Bulk Activation on Draft Checks
- Users can now activate and validate multiple draft checks at once, streamlining the workflow and reducing manual effort.
General Fixes
-
Improved error message for BigQuery temporary dataset configuration exceptions.
-
Added a retry operation for Snowflake when no active warehouse is selected in the current session.
-
General Fixes and Improvements
Breaking Changes
- API fields (type and container_type) are now mandatory in request payloads where they were previously optional (see the illustrative request below):
- POST /global-tags: type is now required.
- PUT /global-tags/{name}: type is now required.
- POST /containers: container_type is now required.
- PUT /containers/{id}: container_type is now required.
- POST /operations/schedule: type is now required.
- PUT /operations/schedule/{id}: type is now required.
- POST /operations/run: type is now required.
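As a rough illustration of the new requirement, the sketch below sends a tag-creation request with the now-mandatory type field included. The base URL, token, payload values, and the extra fields shown (name, color) are placeholder assumptions for illustration only; consult the Qualytics API reference for the authoritative schema.

```python
import requests

# Hypothetical base URL and token -- replace with your deployment's values.
BASE_URL = "https://your-instance.qualytics.io/api"
TOKEN = "YOUR_API_TOKEN"

# After this release, `type` must be present in the payload.
# `name`, `color`, and the value of `type` are illustrative placeholders.
payload = {
    "name": "critical",
    "color": "#FF0000",
    "type": "global",  # now required -- omitting it returns a validation error
}

response = requests.post(
    f"{BASE_URL}/global-tags",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
```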
2024.09.03
Feature Enhancements
- Introducing Catalog Scheduling
- Users can now schedule a Catalog operation like Profile and Scan Operations, allowing automated metadata extraction.
General Fixes
- General Fixes and Improvements
2024.08.31
Feature Enhancements
-
New Draft Status for Checks
- Introduced a new 'draft' status for checks to enhance lifecycle management, allowing checks to be prepared and reviewed without impacting scan operations.
- Validation is only applied to active checks, ensuring draft checks remain flexible for adjustments without triggering automatic validations.
-
Introduce Draft Check Inference in Profile Operations
- Added a new option to infer checks as drafts, offering more flexibility during data profiling.
-
Improve Archive Capabilities for Checks and Anomalies
- Enhanced the archive capabilities for both checks and anomalies, allowing recovery of archived items.
- Introduced a hard delete option that allows permanent removal of archived items, providing greater control over their management.
- The Anomaly statuses 'Resolved' and 'Invalid' are now treated as archived states, aligning with the consistent approach used for checks.
-
Introduce a new Volumetric Check
- Introduced the Volumetric Check to monitor and maintain data volume stability within a specified range. This check ensures that the volume of data assets does not fluctuate beyond acceptable limits based on a moving daily average.
- Automatically inferred and maintained by the system for daily, weekly, and monthly averages, enabling proactive management of data volume trends (a conceptual sketch of the idea follows below).
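The platform infers and maintains these checks automatically; the snippet below is only a minimal conceptual sketch of the underlying idea, flagging a daily record count that drifts too far from a moving daily average. The window size and allowed deviation are arbitrary assumptions, not Qualytics' actual parameters.

```python
from statistics import mean

def volume_out_of_range(daily_counts, today_count, window=7, allowed_deviation=0.2):
    """Return True if today's volume deviates from the moving average of the
    last `window` days by more than `allowed_deviation` (as a fraction).

    Purely illustrative -- not the platform's inference logic.
    """
    moving_avg = mean(daily_counts[-window:])
    lower = moving_avg * (1 - allowed_deviation)
    upper = moving_avg * (1 + allowed_deviation)
    return not (lower <= today_count <= upper)

# Example: a sudden drop well below the recent average is flagged.
history = [1000, 1020, 990, 1010, 1005, 998, 1012]
print(volume_out_of_range(history, today_count=500))  # True
```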
-
Incremental Identifier Warning in Scan Dialog
- Enhanced the dialog to notify users when they attempt an incremental scan on containers lacking an incremental identifier, ensuring transparency and preventing unexpected full scans.
General Fixes
-
- Improved enrichment writes by queuing all writes (up to a queue threshold) for the entire scan operation. This dramatically reduces the number of write operations performed.
-
- Added explicit casting to work around weak typing support in the CSV parser.
-
General Fixes and Improvements
2024.08.19
Feature Enhancements
-
Enhance Auto-Refresh Mechanism on Tree View
- The datastore and container tree footers are now automatically refreshed after specific actions, eliminating the need for manual page refreshes.
-
Support Oracle Client-Side Encryption
- Connections with Oracle now feature end-to-end encryption. Database connection encryption adds an extra layer of protection, especially for transmissions over long-distance, insecure channels.
General Fixes
-
UI Label on Explore Page
- Fixed an issue where the labels on the Explore page did not change based on the selected time frame.
-
Inferred Field Type Enhancements
- Behavior updated to infer field types at data load time rather than implicitly casting them to the latest profiled type. This change supports more consistent expected schema verification for delimited file types and resolves issues when comparing inferred fields to non-inferred fields in some rule types.
-
Boolean Type Inference
- Behavior updated to align boolean inference with Spark Catalyst so that profiled types are more robustly handled during Spark-based comparisons.
-
General Fixes and Improvements
2024.08.10
Feature Enhancements
- Introducing Profile Inference Threshold
- This feature allows users to adjust which check types will be automatically created and updated during data profiling, enabling them to manage data quality expectations based on the complexity of inferred data quality rules.
- Anomaly Source Records Retrieval Retry Option
- Enabled users to manually retry fetching anomaly source records when the initial request fails.
General Fixes
- General Fixes and Improvements
2024.07.31
Feature Enhancements
-
Introducing Field Count to the Datastore Overview
- This enhancement allows users to easily view the total number of fields present in a datastore across all containers.
-
Search Template
- Added a check filter to the templates page.
- Added a template filter to the checks page in the datastore context and explore.
-
Driver Free Memory
- Added driver free memory information on the Health Page.
-
Anomalous Record Count to the Anomaly Sidebar Card
- Added the anomalous record count information to the anomaly sidebar card located under the Scan Results dialog.
General Fixes
-
Enhanced write performance on scan operations with enrichment and relaxed hard timeouts.
-
Updated Azure Blob Storage connector to use TLS encrypted access by default.
-
- Fixed an issue where the Overview tab was not refreshing asset details automatically.
-
General Fixes and Improvements
2024.07.26
Feature Enhancements
-
Introducing Event Bus for Extended Auto-Sync with Data Catalog Integrations
- We are excited to expand our auto-sync capabilities with data catalog integrations by implementing an event bus pattern.
- Added functionality to delete any DQ values that do not meet important checks.
- Included support for a WARNING status in the Alation Data Health tab for checks that have not been asserted yet.
-
Add Autocomplete to the Notification Form
- Improved the notification message form by implementing autocomplete. Users can now easily include internal variables when crafting custom messages, streamlining the message creation process.
-
Redesign the Analytics Engine Functions
- The functions are now accessible through a menu, which displays the icon and full functionality.
- Added a modal to alert users before proceeding with the restart. The modal informs users that the system will be unavailable for a period during the restart process.
-
Improve Qualytics metadata presentation in Alation
- Previously, multiple custom fields were used to persist data quality metrics measured by Qualytics. This process has been simplified by consolidating the metrics into a single rich text custom field formatted in HTML, making it easier for users to analyze the data.
General Fixes
-
Normalize Enrichment Internal Containers
- To improve user recognition and differentiate between our internal tables and those in source systems, we now preserve the original case of table names.
-
Validation Error on Field Search Result
- Resolved the logic for cascade deletion of dependencies on containers that have been soft deleted, ensuring proper handling of related data.
-
Members Cannot Add Datastore on the Onboarding Screen
- Updated permissions so that members can no longer add Datastores during the onboarding process. Only Admins now have this capability.
-
General Fixes and Improvements
2024.07.19
Feature Enhancements
- Global Search
- We are thrilled to introduce the “Global Search” feature into Qualytics! This enhancement is designed to streamline the search across the most crucial assets: Datastores, Containers, and Fields. It provides quick and precise search results, significantly improving navigation and user interaction.
- Navigation Update: To integrate the new global search bar seamlessly, we have relocated the main menu icons to the left side of the interface. This adjustment ensures a smoother user experience.
- Teradata Connector
- We’ve expanded our connectivity options by supporting a new connection with Teradata. This enhancement allows users to connect and interact with Teradata databases directly from Qualytics, facilitating more diverse data management capabilities.
- Snowflake Key-pair Authentication
- In our ongoing efforts to enhance security, we have implemented support for Snowflake Key-pair authentication. This new feature provides an additional layer of security for our users accessing Snowflake, ensuring that data transactions are safe and reliable.
General Fixes
- General Fixes and Improvements
2024.07.15
Feature Enhancements
-
Alation Data Catalog Integration
- We're excited to introduce integration with Alation, enabling users to synchronize and manage assets across both Qualytics and Alation.
- Metadata Customization:
- Trust Check Flags: We now support warning flags at both the container and field levels, ensuring users are aware of deprecated items.
- Data Health: Qualytics now pushes important checks to Alation's Data Health tab, providing a comprehensive view of data health at the container level.
- Custom Fields: Quality scores and related metadata are pushed under a new section in the Overview page of Alation. This includes quality scores, quality score factors, URLs, anomaly counts, and check counts.
-
Support for Never Expiration Option for Tokens
- Users now have the option to create tokens that never expire, providing more flexibility and control over token management.
General Fixes
- General Fixes and Improvements
2024.07.05
Feature Enhancements
- Enhanced Operations Listing Performance
- Optimized the performance of operations listings and streamlined the display of container-related information dialogs. These enhancements include improved handling of operations responses and the addition of pagination for enhanced usability
General Fixes
-
Fix Computed Field Icon Visibility
- Resolved an issue where the computed field icon was not being displayed in the table header.
-
General Fixes and Improvements
2024.06.29
Feature Enhancements
-
Computed Field Support
- Introduced computed fields allowing users to dynamically create new virtual fields within a container by applying transformations to existing data.
- Computed fields offer three transformation options to cater to various data manipulation needs. Each transformation type is designed to address specific data characteristics:
- Cleaned Entity Name: Automates the removal of business signifiers such as 'Inc.' or 'Corp.' from entity names, simplifying entity recognition.
- Convert Formatted Numeric: Strips formatting like parentheses (for negatives) and commas (as thousand separators) from numeric data, converting it into a clean, numerically typed format.
- Custom Expression: Allows users to apply any valid Spark SQL expression to combine or transform fields, enabling highly customized data manipulations (an illustrative sketch follows this list).
- Users can define specific checks on computed fields to automatically detect anomalies during scan operations.
- Computed fields are also visible in the data preview tab, providing immediate insight into the results of the defined transformations.
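For instance, a Custom Expression might concatenate two existing fields or strip numeric formatting. The sketch below uses PySpark only to show what such Spark SQL expressions look like; the column names are hypothetical, and computed fields are actually defined through the Qualytics UI rather than user code.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("computed-field-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Ada", "Lovelace", "(1,234.50)")],
    ["first_name", "last_name", "amount_text"],  # hypothetical columns
)

# A Custom Expression combining two fields into a new virtual field.
df = df.withColumn("full_name", F.expr("concat(first_name, ' ', last_name)"))

# An expression similar in spirit to "Convert Formatted Numeric":
# strip commas and parentheses, then cast to a numeric type
# (negative-sign handling is omitted in this sketch).
df = df.withColumn(
    "amount",
    F.expr("cast(regexp_replace(amount_text, '[,()]', '') as decimal(12,2))"),
)

df.show()
```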
-
Autogenerated Descriptions for Authored Checks
- Implemented an auto-generation feature for check descriptions to streamline the check authoring process. This feature automatically suggests descriptions based on the selected rule type, reducing manual input and simplifying the setup of checks.
-
Event-Driven Catalog Integrations and Sync Enhancements
- Enhanced the Atlan integration and synchronization functionalities to include event-driven support, automatically syncing assets during Profile and Scan operations. This update also refines the Sync and Integration dialogs, offering clearer control options and flexibility.
-
Sorting by Anomalous Record Count
- Added a new sorting filter in the Anomalies tabs that allows users to sort anomalies by record count, improving the manageability and analysis of detected anomalies.
-
Refined Tag Sorting Hierarchy:
- Updated the tag sorting logic to consistently apply a secondary alphabetical sort by name. This ensures that tags will additionally be organized by name within any primary sorting category.
General Fixes
-
Profile Operation Support for Empty Containers
- Resolved an issue where profiling operations failed to record fields in empty containers. Now, fields are generated even if no data rows are present.
-
Persistent Filters on the Explore Page
- Fixed a bug that caused Explore to disable when switching tabs on the Explore page. Filters now remain active and consistent, enhancing user navigation and interaction.
-
Visibility of Scan Results Button
- Corrected the visibility issue of the 'results' button in the scan operation list at the container level. The button now correctly appears whenever at least one anomaly is detected, ensuring users have immediate access to detailed anomaly results.
-
General Fixes and Improvements
2024.06.18
Feature Enhancements
-
Improvement to Anomaly Dialog
- Enhanced the anomaly dialog to include a direct link to the operation that generated the anomaly. Users can now easily navigate from an anomaly to view other anomalies generated by the same operation directly from the Activity tab.
-
Sorting by Duration in Activity Tab
- Introduced the ability to sort by the duration of operations in the Activity tab by ascending or descending order.
-
Last Editor Information for Scheduled Operations
- Added visibility of which users have created or last updated scheduled operations, enhancing traceability in scheduling management.
-
Display Total Anomalous Records for Anomalies
- Added the total count of anomalous records in the anomalies listing view.
General Fixes
-
Performance Fixes on Computed Table Creation and Check Validation
- Optimized the processes for creating computed tables and validating checks. Users previously experiencing slow performance or timeouts during these operations will now find the processes significantly faster and more reliable.
-
General Fixes and Improvements
2024.06.14
Feature Enhancements
-
Improvements to Atlan Integration
-
When syncing Qualytics with Atlan, badges now display the "Quality Score Total," increasing visibility and emphasizing key data quality indicators on Atlan assets.
-
Improved performance of the synchronization operation.
-
Implemented the propagation of external tags to checks, now automatically aligned with the container synchronization process, enabling better accuracy and relevance of data tagging.
-
-
Refactor Metric Check Creation
- Enhanced the encapsulated Metric Check creation flow to improve user experience and efficiency. Users can now seamlessly create computed tables and schedule operations simultaneously with the metric check creation.
-
Support Update of Weight Modifier for External Tags
-
Add Validation on Updated Connections
- Added support for testing the connection if there's at least one datastore attached to the connection, ensuring more reliable and accurate connection updates.
-
Standardize Inner Tabs under the Settings Page
-
Tags and Notifications Improvements: The layout has been revamped for better consistency and clarity. General headers have been removed, and now each item features specific headers to enhance readability.
-
Security Tab Improvements: The redesign features chip tabs for improved navigation and consistency. Filters have been updated to ensure they meet application standards.
-
Tokens Tab Accessibility: Moved the action button to the top of the page to make it more accessible.
-
Refine Connector Icons Display: Improved the display of connector icons for Datastores and Enrichments in the Connections Tab.
-
-
Streamlined Container Profiling and Scanning
- In the container context, the profile and scan modals have been updated to automatically display the datastore and container, eliminating the need for a selection step and streamlining the process.
-
Swap Order During Check Creation
-
Rule Type Positioning: The Rule Type now appears before the container selection, making the form more intuitive.
-
Edit Mode Header: In edit mode, the Rule Type is prominently displayed in the modal header, immediately under the check ID.
-
General Fixes
-
Address Minor Issues in the Datastore Activity Page
-
Operation ID Auto-Search: Restored the auto-search feature by operation ID for URL access, enhancing navigation, especially for Catalog Operations.
-
Tree View Auto-Refresh: Implemented an auto-refresh feature for the tree view, which activates after any operation in the CTA flow (Catalog, Profile, Scan).
-
-
Fix "Greater Than Field" Quality Check
- Corrected the inclusive property of the greater than field quality check.
-
Fix Exporting Field Profiles for Non-Admin User with Write Permission
- Resolved issues for non-admin users with write permissions to allow proper exporting of field profile metadata to enrichment.
-
Fix "Is Replica Of" Quality Check validation on Field Names with Special Characters
- Improved validation logic to handle field names with special characters.
-
General Fixes and Improvements
2024.06.07
Feature Enhancements
-
Atlan Integration Improvements
- Enhanced the Atlan assets fetch and external tags syncing.
- Added support for external tag propagation to checks and anomalies.
- Merged Global and External tags section for streamlined tag management.
-
Restart Button for Analytics Engine
- Introduced a new "Restart" button under the Settings - Health section, allowing admins to manually restart the Analytics Engine if it is offline or unresponsive.
-
Interactive Tooltip Component
- Added a new interactive tooltip component that remains visible upon hovering, enhancing user interaction across various modules of the application.
- Refactored existing tooltip usage to integrate this new component for a more consistent user experience.
-
Defaulting to Last-Used Enrichment Datastore for Check Template Exports
- Improved user experience by persisting the last selected enrichment datastore as the default option when exporting a check template.
General Fixes
-
Shared Links Fixes
- Fixed issues with shared operation result links, ensuring that dialogs for scan/profile results and anomalies now open correctly.
- Addressed display inaccuracies in the "Field Profiles Updated" metrics.
-
General Fixes and Improvements
2024.06.04
Feature Enhancements
-
Atlan Data Catalog Integration
- We're excited to introduce integration with Atlan, enabling users to synchronize and manage assets across both Qualytics and Atlan:
- Tag Sync: Sync tags assigned to data assets in Atlan with the corresponding assets in Qualytics, enabling tag-based quality score reporting, notifications, and bulk data quality operations using Atlan-managed tags.
- Metadata Sync: Automatically synchronize Atlan with Qualytics metadata, including asset URL, total score, and factor scores such as completeness, coverage, conformity, consistency, precision, timeliness, volume, and accuracy.
-
Entity Resolution Check
- We've removed the previous limitation on the maximum number of distinct entity names that could be resolved with the Entity Resolution rule type. This release includes various performance enhancements that support an unlimited number of entity names.
-
Enhancements to Catalog Operation Results
- We've improved the catalog operation results by now including detailed information on whether tables, views, or both were involved in each catalog operation.
-
Enhancements to 'Equal to Field' Rule Type
- The 'Equal to Field' rule now supports string values, allowing for direct comparisons between text-based data fields.
-
Enhancements to Enrichment
- Qualytics now includes a property for anomalousRecordCount on shape anomaly, which previously was neither populated nor persisted. This aims to accurately capture and record the total number of anomalous records identified in ShapeAnomaly, regardless of the max_source_records threshold.
-
Dynamic Meta Titles
- Pages such as Datastore Details, Container Details, and Field Details now feature dynamic meta titles that accurately describe the page content and are visible in browser tabs providing better searchability.
General Fixes
-
Fix Trends of Quality Scores on the Insights Page
- Addressed issues with displaying trends on the Insights page. Trends now accurately reflect changes and comparisons to the previous report period, providing more reliable and insightful analytics.
-
Resolved a bug in Entity Resolution where the distinction constraint was only applied to entity names that differed.
-
General Fixes and Improvements
2024.05.22
Feature Enhancements
-
Datastore Connection Updates:
- Users can now update the connection on a datastore if the new one has the same type as the current one.
-
Enrichment Datastore Redirection:
- Enhanced the user interface to facilitate easier redirection to enrichment datastores, streamlining the process and improving user experience.
-
Label Enhancements for Data Completeness:
- Updated labels to better distinguish between completeness percentages and Factor Scores. The label for completeness percentage has been changed to provide clear context when viewed alongside Factor Scores.
General Fixes
-
Rule Type Anomaly Corrections:
- Fixed an issue where the violation messages for record anomalies incorrectly included "None" for some rule types. This update ensures accurate messaging across all scenarios.
-
Shape Anomaly Logic Adjustment:
- Revised the logic for Shape Anomalies to prevent the combination of failed checks for high-count record checks on the same field. This change ensures that displayed sample rows have definitively failed the specific checks shown, enhancing the accuracy of anomaly reporting.
-
Entity Resolution Anomalies:
- Addressed an inconsistency where some Entity Resolution Checks did not return source records. Ongoing investigations and fixes have improved the reliability of finding source records for entity resolution checks across DFS and JDBC datastores.
-
General Fixes and Improvements
2024.05.16
Feature Enhancements
-
Entity Resolution Check
- Introduced rule "Entity Resolution" to determine if multiple records reference the same real-world entity. This feature uses customizable fields and similarity settings to ensure accurate and tailored comparisons.
-
Support for Rerunning Operations
- Added an option to rerun operations from the operations listing, allowing users to reuse the configuration from previously executed operations.
General Fixes
-
Export Operations
- Fixed metadata export operations silently failing on writing to the enrichment datastores.
-
Computed File/Table Creation
- Resolved an issue that prevented the creation of computed files/tables with the same name as previously deleted ones, even though it is a valid action.
-
General Fixes and Improvements
2024.05.13
General Fixes
-
Enhanced Quality Score Factors Computation
- Addressed issues in the quality score calculation and its associated factors, ensuring accuracy.
-
General Fixes and Improvements
2024.05.11
Feature Enhancements
-
Introducing Quality Score Factors
- This new feature allows users to control the quality score factor weights at the datastore and container levels.
- Quality Score Detail Expansion: Users can now click on the quality score number to expand its details, revealing the contribution of each factor to the overall score. This enhancement aids in understanding what drives the quality score.
- Insights Page Overhaul: The Insights page has been restructured to better showcase the quality score breakdown. This redesign aims to make the page more informative and focused on quality score metrics.
- Customization of Factor Weights: Users can now customize the weights of different factors at the Datastore and Container levels. This feature is essential for adapting the quality score to meet specific user needs, such as disregarding the Timeliness factor for dimensional tables where it might be irrelevant (an illustrative weighting sketch follows this list).
- Enhanced Inferred Checks: Introduced a new property in the Check Listing schema and a feature in the Check modal that displays validity metrics, which help quantify the accuracy of inferred checks. A timezone handling issue in the last_updated property of the Check model has also been addressed.
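To illustrate why adjustable weights matter, the sketch below computes a simple weighted average over hypothetical factor scores; setting a factor's weight to zero (for example, Timeliness on a dimensional table) removes its influence. This is purely illustrative arithmetic and is not Qualytics' actual scoring formula.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Illustrative weighted average; not Qualytics' actual formula."""
    total_weight = sum(weights.get(factor, 1) for factor in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[f] * weights.get(f, 1) for f in scores) / total_weight

scores = {"completeness": 95, "conformity": 88, "timeliness": 40}   # hypothetical scores
weights = {"completeness": 1, "conformity": 1, "timeliness": 0}     # Timeliness disregarded

print(round(weighted_score(scores, weights), 1))  # 91.5
```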
-
Quality Score UI Enhancements
- Enhancements have been made to the user interface to provide a clearer and more detailed view of the quality score metrics, including Completeness, Coverage, Conformity, Consistency, Precision, Timeliness, Volumetrics, and Accuracy. These changes aim to provide deeper insight into the components that contribute to the overall quality score.
General Fixes
-
Fixes to JDBC Incremental Support
- Updated the conditional logic in the catalog operation for update tables to ensure the incremental identifier is preserved if already established.
-
General Fixes and Improvements
2024.05.02
Feature Enhancements
-
Datastore Connections:
- Users can now create connections that can be shared across different datastores. This introduces a more flexible approach to managing connections, allowing users to streamline their workflow and reduce duplication of effort. With shared connections, users can easily reuse common elements such as hostname and credentials across various datastores, enhancing efficiency and simplifying management.
-
File Container Header Configuration:
- Adds support for setting the hasHeader boolean property on File Containers, enabling users to specify whether their flat file data sources include a header row. This enhances compatibility and flexibility when working with different file formats.
-
Improved Error Handling in Delete Dialogs:
- Error handling within delete dialogs has been revamped across the application. Error messages will now be displayed directly within the dialog itself, providing clearer feedback and preventing misleading success messages in case of deletion issues.
General Fixes
-
Locked Template Field Editing:
- Resolves an issue where selecting a new container in the check form would reset check properties, causing problems for locked templates. The fix ensures that checks derived from templates retain their properties, allowing users to modify the field_to_compare field as needed.
-
General Fixes and Improvements
2024.04.25
Feature Enhancements
-
Profile Results Modal:
- Introducing a detailed Results Modal for each profile operation. Users can now view comprehensive statistics about the produced container profiles and their partitions, enhancing their ability to analyze data effectively.
-
Checks Synchronized Count:
- The operations list now includes the count of synchronized checks for datastore and explore operations. This addition streamlines the identification of operations, improving user experience.
General Fixes
- General Fixes and Improvements
2024.04.23
Feature Enhancements
-
Introduction of Comparators for Quality Checks:
- Launched new Comparator properties across several rule types, enhancing the flexibility in defining quality checks. Comparators allow users to set margins of error, accommodating slight variations in data validation:
- Numeric Comparators: Enables numeric comparisons with a specified margin, which can be set as either a fixed absolute value or a percentage, accommodating datasets where minor numerical differences are acceptable.
- Duration Comparators: Supports time-based comparisons with flexibility in duration differences, essential for handling time-based data with variable precision.
- String Comparators: Facilitates string comparisons by allowing for variations in spacing, ideal for textual data where minor inconsistencies may occur.
- Applicable to rule types such as Equal To, Equal To Field, Greater Than, Greater Than Field, Less Than, Less Than Field, and Is Replica Of (see the illustrative sketch below).
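As a rough sketch of how a numeric comparator margin behaves, the function below treats two values as equal when they differ by no more than a fixed absolute margin or a percentage of the expected value. The function, names, and thresholds are illustrative assumptions, not the platform's implementation.

```python
def within_margin(actual, expected, absolute=None, percentage=None):
    """Illustrative numeric comparator: equality within a margin of error."""
    diff = abs(actual - expected)
    if absolute is not None:
        return diff <= absolute
    if percentage is not None:
        return diff <= abs(expected) * (percentage / 100.0)
    return diff == 0

print(within_margin(100.4, 100.0, absolute=0.5))    # True  (within 0.5)
print(within_margin(103.0, 100.0, percentage=2.0))  # False (outside 2%)
```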
-
Introduced Row Comparison in the isReplicaOf Rule:
- Improved the rule to support row comparison by id, enabling more precise anomaly detection by allowing users to specify row identifiers for unique row comparison. Key updates include:
- Revamp of the source record presentation to highlight differences between the left and right containers at the cell level, enhancing visibility into anomalies.
- New input for specifying unique row identifiers, transitioning from symmetric difference to row comparison when set.
- The original behavior of symmetric comparison remains unchanged if no row identifiers are provided (see the conceptual sketch below).
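The two comparison modes can be sketched in plain Python: without identifiers, rows are compared as whole sets (symmetric difference); with an identifier, rows are matched by id so cell-level differences can be pinpointed. The sample data and key name are hypothetical, and the snippet is only a conceptual illustration.

```python
left = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
right = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

def as_tuples(rows):
    """Represent each row as a hashable tuple so rows can be compared as sets."""
    return {tuple(sorted(r.items())) for r in rows}

# Symmetric difference: rows present on one side but not the other, compared as a whole.
print(as_tuples(left) ^ as_tuples(right))  # both versions of row id=2 show up as differences

# Row comparison by id: match rows on the identifier, then compare cell by cell.
right_by_id = {r["id"]: r for r in right}
for row in left:
    other = right_by_id.get(row["id"])
    if other and other != row:
        changed = {k: (row[k], other[k]) for k in row if row[k] != other[k]}
        print(row["id"], changed)  # 2 {'amount': (20, 25)}
```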
-
New equalTo Rule Type for Direct Value Comparisons
- Introduced the equalTo rule type, enabling precise assertions that selected fields match a specified value. This new rule not only simplifies the creation of checks for constant values across datasets but also supports the use of comparators, allowing for more flexible and nuanced data validation.
-
Redirect Links for Requested Containers in Operation Details:
- Introduced redirect links in the "Containers Requested" section of operation results. This enhancement provides direct links to the requested containers (such as tables or files), facilitating quicker navigation and streamlined access to relevant operational data.
-
Enhanced Description Input with Expandable Option:
- Implemented an expandable option for the Description input in the Check Form & Template Form. This enhancement allows users to more comfortably manage lengthy text entries, improving the usability of the form by accommodating extensive descriptions without compromising the interface's usability.
General Fixes
-
Addressed Data Preview Timeout Issues:
- Tackled the timeout problems in the data preview feature, ensuring that data retrieval processes complete successfully within the new extended timeout limits.
-
General Fixes and Improvements
2024.04.12
Feature Enhancements
- File Pattern Overrides:
- We have added support in the UI to override a file pattern. A file pattern overridden by a user now replaces the one the system generated during the initial catalog operation. To get a new system-generated file pattern in the UI, users need to run a new catalog operation without pruning.
- Batch Edit in the Check Templates Library:
- We now support batch edits for check templates in the Library. This enhancement includes support for filters and tags.
- Improved Presentation of Incremental, Remediation, and Infer Constraints:
- We have improved the presentation of Incremental, Remediation, and Infer Constraints in the operation listing for catalog, profile, and scan operations. The Incremental, Remediation, and Infer Constraints icons have been added to the list of items, and the visualization of these items has been enhanced.
- Default Placeholders for Computed File in UI:
- We are now automatically populating the form dialog with fields from the selected container. This improvement simplifies the process for users, especially in scenarios where they wish to select or cast specific fields directly from the source container.
General Fixes
-
Tree View Default Ordering:
- We have updated the tree view default ordering. Datastore names are now grouped and presented in alphabetical order.
-
General Fixes and Improvements
2024.04.06
Breaking Changes
-
Remediation Naming Convention Update:
- Updated the naming convention for remediation to {enrich_container_prefix}_remediation_{container_id}, standardizing remediation identifiers.
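- For example, assuming a hypothetical enrichment container prefix of qualytics_enrich and a container id of 42, the remediation table would be named qualytics_enrich_remediation_42.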
-
Add file extension for DFS Enrichment:
- Introduced the .delta extension to files in the enrichment process on DFS, aligning with data handling standards.
Feature Enhancements
-
Revamp Enrichment Datastore Main Page:
- Tree View & Data Navigation: Enhanced the enrichment page with an updated tree view that now lists source datastores linked to enrichment datastores, improving navigability. A newly introduced page for enrichment datastore enables:
- Data preview across enrichment, remediation, and metadata tables with the ability to apply "WHERE" filters for targeted insights.
- Direct downloading of preview data as CSV.
- UI Performance Optimization: Implemented UI caching to boost performance, reducing unnecessary network requests and smoothly preserving user-inputted filters and recent data views.
- Tree View & Data Navigation: Enhanced the enrichment page with an updated tree view that now lists source datastores linked to enrichment datastores, improving navigability. A newly introduced page for enrichment datastore enables:
-
User Sorting by Role:
- Introduced a sorting feature in the Settings > Users tab, allowing users to be sorted by their roles in ascending or descending order, facilitating easier user management.
-
Expanded Entity Interaction Options:
- Enhanced entity lists and breadcrumbs with new direct action capabilities. Users can now right-click on an item to access useful functions: copy the entity's ID or name, open the entity's link in a new tab, and copy the entity's link. This enhancement simplifies data management by making essential actions more accessible.
General Fixes
-
Record Quality Scores Overlap Correction:
- Resolved a problem where multiple violations could be open for the same container simultaneously, which should not occur. This fix ensures violations for containers are uniquely recorded, eliminating parallel open violations.
-
Anomaly Details Text Overflow:
- Corrected text overflow issues in the anomaly details' violation box, ensuring all content is properly contained and readable.
-
Enhanced "Not Found" Warnings with Quick Filters:
- Improved user guidance for Checks and Anomalies list filters by adding hints for "not found" items, suggesting users check the "all" group for unfiltered search results, clarifying navigation and search results.
-
General Fixes and Improvements
2024.03.29
Feature Enhancements
-
Data Preview
- Introducing the "Data Preview" tab, providing users with a streamlined preview of container data within the platform. This feature aims to enhance the user experience for tasks such as debugging checks, offering a grid view showcasing up to 100 rows from the container's source.
- Data Preview Tab: Implemented a new tab for viewing container data, limited to displaying a maximum of 100 rows for improved performance.
- Filter Support: Added functionality to apply filter clauses to the data preview, enabling users to refine displayed rows based on specific criteria.
- UI Caching: Implemented a caching layer within the UI to enhance performance and reduce unnecessary network requests, storing the latest refreshed data along with applied filters.
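- For example, a filter clause such as country = 'US' AND amount > 1000 (with hypothetical column names) would limit the previewed rows to those matching the condition.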
-
Enhanced Syntax Highlight Inputs
- Improved the syntax highlight inputs for seamless inline editing, minimizing the friction of entering expressions. This feature includes a dual-mode capability, allowing users to type directly within the input field or utilize an expanded dialog for more complex entries, significantly improving user experience.
-
Volumetric Measurements
- Container volumetrics are now measured periodically for a more robust approach. This update measures only containers without a volume measure in the last 24 hours and schedules multiple runs of the job daily.
-
Sort Tags by Color
- Users can now sort tags by color, visually grouping similar colors for easier navigation and management.
-
Download Source Records
- Added a "Download Source Records" feature to the Anomaly view in the UI, allowing users to export data held in the enrichment store for that anomaly in CSV format.
-
Check Templates Navigation
- Implemented a breadcrumb trail for the Check Template page to improve user navigation.
General Fixes
-
Fix Scheduling Issues
- Resolved scheduling issues affecting specific sets of containers, particularly impacting scheduled profile and scan operations. Users must manually add new profiles after catalog operations or computed file/table creation for inclusion in existing scheduled operations.
-
Fix Notifications Loading Issue on Large Screens
- Fixed an issue where the infinity loading feature for the user notification list was not functioning properly on large screens. The fix ensures correct triggering of infinity loading regardless of screen size, allowing all notifications to be accessed properly.
-
General Fixes and Improvements
2024.03.15
Feature Enhancements
- Enhanced Observability
- Automated daily volumetric measurements for all tables and file patterns
- Time-series capture and visualizations for volume, freshness, and identified anomalies
-
Overview Tab:
- Introduced a new "Overview" tab with information related to monitoring at the datastore and container level.
- This dashboard interface is designed for monitoring and managing Qualytics-related data for datastores and containers.
- Users can see:
- Totals: Quality Score, Tables, Records, Checks and Anomalies
- Total of Quality Checks grouped by Rule type
- Data Volume Over Time: A line graph that shows the total amount of data associated with the project over time.
- Anomalies Over Time: A line graph that shows the number of anomalies detected in the project over time.
-
Datastore Field List Update:
- The datastore field profiles list has been updated to match the existing list views design.
- All card-listed pages now display information in a column format, conditionally using scrolling for smaller and larger screens.
- Field details are now shown in a modal with Profiling and Histogram information.
-
Heatmap Simplification:
- Simplified the heatmap to consider only operation counts.
-
Datastore Metrics:
- Improved distinction between 0 and null values in the datastore metrics (total records, total fields, etc).
-
Explore Page Update:
- Added new metrics to the Explore page.
- We are now adding data volume over time (records and size).
- Improved distinction between 0 and null values in metrics (total records, total fields, etc).
General Fixes
-
UI Wording and Display for Cataloged vs Profiled Fields:
- Addressed user confusion surrounding the display and wording used to differentiate between fields that have been cataloged versus those that have been profiled.
- Updated the messaging within the tree view and other relevant UI components to accurately reflect the state of fields post-catalog operation.
- Implemented a clear distinction between non-profiled and profiled fields in the field count indicators.
- Conducted a thorough review of the CTAs and descriptive text surrounding the Catalog, Profile, and Scan operations to improve clarity and user understanding.
-
General Fixes and Improvements
2024.03.07
General Fixes
-
Corrected MatchesPattern Checks Inference:
- Fixed an issue where the inference engine generated MatchesPattern checks that erroneously asserted false on more than 10% of training data. This resolution ensures all inferred checks now meet the 99% coverage criterion, aligning accurately with their training datasets.
-
Fixed Multi-Field Check Parsing Error in DFS:
- Addressed a bug in DFS environments that caused parsing errors for checks asserting against multiple fields, such as AnyNotNull and NotNull, when selected fields contained white spaces. This resolution ensures that checks involving multiple fields with spaces are now accurately parsed and executed.
-
Volumetric Measurements Tracking Fix:
- Addressed a bug that prevented the recording of volumetric measurements for containers without a last modified time. This fix corrects the problem by treating last_modification_time as nullable, ensuring that containers are now accurately tracked for volumetric measurements regardless of their modification date status.
-
General Fixes and Improvements
2024.03.05
Feature Enhancements
- Check Validation Improvement:
- Enhanced the validation process for the "Is Replica Of" check. Previously, the system did not validate the field name and type, potentially leading to undetected issues until a Scan Operation was executed. Now, the validation process includes checking the field name and type, providing users with immediate feedback on any issues.
General Fixes
-
Matches Pattern Data Quality Check Handling White Space:
- Resolved a bug in the Matches Pattern data quality check that caused white space to be ignored during training. With this fix, the system now accounts for white space during training, ensuring accurate pattern inference even with data containing significant white space. If 1% or more of the training data contains blanks, the system will derive a pattern that includes blanks as a valid value, improving data quality assessment. A hypothetical blank-tolerant pattern is sketched below.
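As a hypothetical illustration of a blank-tolerant pattern, the snippet below shows a regular expression that accepts either a five-digit code or a blank/whitespace-only value; the patterns Qualytics actually infers depend on the training data.

```python
import re

# Hypothetical inferred pattern: a five-digit code OR a blank value.
pattern = re.compile(r"^(\d{5}|\s*)$")

for value in ["90210", "", "   ", "9021A"]:
    print(repr(value), bool(pattern.match(value)))
# '90210' True, '' True, '   ' True, '9021A' False
```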
-
General Fixes and Improvements
2024.02.28
Feature Enhancements
- User Token Management:
- Transitioned from Generic Tokens to a more robust User Token system accessible under Settings for all users. This enhancement includes features to list, create, revoke, and delete tokens, offering granular control of API access. User activities through the API are now attributable, aligning actions with user accounts for improved accountability and traceability.
General Fixes
-
Datetime Validation in API Requests:
- Strict validation of datetime entries in API requests has been implemented to require the Zulu datetime format. This update addresses and resolves issues where incomplete datetime entries could disrupt Scan operations, enhancing API reliability. A format example follows below.
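For reference, a Zulu-format (UTC, ISO 8601) datetime looks like 2024-02-28T14:30:00Z; the minimal snippet below produces one in Python.

```python
from datetime import datetime, timezone

# Produce an ISO 8601 datetime in Zulu (UTC) form, e.g. "2024-02-28T14:30:00Z".
zulu = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(zulu)
```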
-
Context-Aware Redirection Post-Operation:
- Enhanced the operation modal redirect functionality to be context-sensitive, ensuring that users are directed to the appropriate activity tab after an operation, whether at the container or datastore level. This enhancement ensures a logical and intuitive post-operation navigation experience.
-
Template Details Page Responsiveness:
- Addressed layout issues on the Template Details page caused by long descriptions. Adjustments ensure that the description section now accommodates larger text volumes without disrupting the page layout, maintaining a clean and accessible interface.
-
General Fixes and Improvements
2024.02.23
Feature Enhancements
-
Introduction of Operations Management at the Table/File Level:
- The Activity tab has been added at the table/file level, extending its previous implementation at the source datastore level. This update provides users with the ability to view detailed information on operations for individual tables/files, including scan metrics, and histories of operation runs and schedules. It enhances the user's ability to monitor and analyze operations at a granular level.
-
Enhanced Breadcrumb Navigation UX:
- Breadcrumb navigation has been improved for better user interaction. Users can now click on the breadcrumb representing their current context, enabling more intuitive navigation. In addition, selecting the Source Datastore breadcrumb takes users directly to the Activity tab, streamlining the flow of user interactions.
General Fixes
-
Improved Accuracy in Profile and Scan Metrics:
- Enhanced the accuracy of metrics for profiled and scanned operations by excluding failed containers from the count. Now, metrics accurately reflect only those containers that have been successfully processed.
-
Streamlined input display for Aggregation Comparison rule in Check/Template forms:
- Removed the "Coverage" input for the "Aggregation Comparison" rule in Check/Template Forms, as the rule does not support coverage customization. This simplification helps avoid confusion during rule configuration.
-
Increased Backend Process Timeouts:
- In response to frequent timeout issues, the backend process timeouts have been adjusted. This change aims to reduce interruptions and improve service reliability by ensuring that processes have sufficient time to complete.
-
General Fixes and Improvements
2024.02.19
Feature Enhancements
-
Support for exporting Check Templates to the Enrichment Datastore:
- Added the ability to export Check Library metadata to the enrichment datastore. This feature helps users export their Check Library, making it easier to share and analyze check templates.
-
File Upload Size Limit Handling:
- Implemented a user-friendly error message for file uploads that exceed the 20MB limit. This enhancement aims to improve user experience by providing clear feedback when the file size limit is breached, replacing the generic error message previously displayed.
General Fixes
-
Resolved Parsing Errors in Expected Values Rule:
- Fixed an issue where single quotes in the list of expected values caused parsing errors in the Analytics Engine, preventing the Expected Values rule from asserting correctly. This correction ensures values, including those with quotes or special characters, are now accurately parsed and asserted.
-
General Fixes and Improvements
2024.02.17
General Fixes
-
Corrected Typing for Expected Values Check:
- Resolved an issue with the expectedValues rule, where numeric comparisons were inaccurately processed due to a misalignment between the API and the analytics engine. This fix ensures numeric values are correctly typed and compared, enhancing the reliability of validations.
-
Fixed Anomaly Filtering in Scan Results dialog:
- Addressed a flaw where scan results did not consistently filter anomalies based on the operation ID. The fix guarantees that anomalies are only displayed once the operation ID parameter is accurately defined in the URL, ensuring more precise and relevant scan outcome presentations.
-
Check Validation Sampling Behavior Adjustment:
- Fixed intermittent validation issues encountered in specific source datastore types (DB2, Microsoft SQL Server). The problem, where validation could unpredictably fail or succeed based on container size, was corrected by fine-tuning the sampling method for these technologies, leading to consistent validation performance.
-
General Fixes and Improvements
2024.02.15
Feature Enhancements
-
UX Improvements for Profile and Scan Operation Dialogs:
- Implemented significant UX enhancements to Profile & Scan Operation Dialogs for improved clarity and user flow. Key improvements include:
- Visibility of incremental fields and their current starting positions in Scan Operation dialogs.
- Logical reordering of Profile and Scan Operation steps to align with user workflows, including prioritizing container selection and clarifying the distinction between "Starting Threshold" and "Limit" settings.
- Simplified operation initiation, allowing users to start operations directly before the final scheduling step, streamlining the process for immediate execution.
- Implemented significant UX enhancements to Profile & Scan Operation Dialogs for improved clarity and user flow. Key improvements include:
-
Naming for Scheduled Operations:
- Added a name field to scheduled operations, enabling users to assign descriptive names or aliases. This feature aids in distinguishing and managing multiple scheduled operations more effectively.
-
Container Name Filters for Operations:
- Provided filtering options for operations and scheduled operations by container name, improving the ability to quickly locate and manage specific operations.
-
Improved Design for Field Identifiers in Tooltips:
- The design of field identifiers within tooltips has been refined for greater clarity. Enhancements focus on displaying Grouping Fields, Excluded Fields, Incremental Fields, and Partition Fields, aiming to offer users a more intuitive experience.
General Fixes
-
External Scan Rollup Threshold Correction:
- Fixed an issue in external scans where the rollup threshold was not applied as intended. This correction ensures that anomalies exceeding the threshold are now accurately consolidated into a single shape anomaly, rather than being reported as multiple individual record anomalies.
-
Repetitive Release Notification and Live Update Fixes:
- Resolved a recurring issue with release notifications continually prompting users to refresh despite acknowledgment. Additionally, restored the live update notifications' functionality, ensuring users are correctly alerted to new features while actively using the system, with a suggestion to perform a hard refresh to access the latest version.
-
Corrected Field Input Logic in Check & Template Forms:
- Addressed a logic error that incorrectly disabled field inputs for certain rules in check and template forms. This correction re-enables the necessary field input, removing a significant barrier that previously prevented users from creating checks affected by this issue.
-
Addressed Absence of Feedback for No-Match Field Filters on Explore Page:
- Rectified the absence of feedback when field filters on the Explore Page yield no results, ensuring users receive a clear message indicating no items match the specified filter criteria.
-
General Fixes and Improvements
2024.02.10
Feature Enhancements
-
Immediate Execution Option for Scheduled Operations:
- Introduced a "Run Now" feature for scheduled operations, enabling users to execute operations immediately without waiting for the scheduled time. This addition provides flexibility in operation management, ensuring immediate execution as needed without altering the original schedule.
-
Simplified Customization of Notification Messages:
- Removed the "use custom message" toggle from the notification form, making the message input field always editable. This change simplifies the user interface and improves usability by allowing direct editing of notification messages.
- Enhanced default messages for each notification trigger type have also been implemented to improve clarity.
-
Performance Improvement in User Notifications Management:
- Implemented infinite scrolling pagination for the user notifications side panel. This update addresses performance issues with loading large numbers of notifications, ensuring a smoother and more responsive experience for users with extensive notification histories.
-
Enhanced Archive Template Confirmation:
- Updated the archive dialog for templates to include information on the number of checks associated with archiving the template. This enhancement ensures users are aware of the impact of checks linked to the template, promoting informed decision-making.
-
Improved Interaction with Computed Tables:
- Refined the Containers list UX to allow navigation to container details immediately after the creation of a computed table, addressing delays caused by background profiling. This improvement ensures users can access computed table details without waiting for the profile operation to complete, drawing inspiration from Tree View functionality for a more seamless experience.
General Fixes
- General Fixes and Improvements
2024.02.02
Feature Enhancements
- Excluded Fields Inclusion in Drop-downs:
- Refined container settings to incorporate previously excluded fields in the dropdown list, enhancing user flexibility. In addition, a warning message has been added to notify users if a profile operation is required when deselecting excluded fields that were previously selected.
General Fixes
-
Linkable Scan Results for Direct Access:
- Made Scan Results dialogs accessible via direct URL links, addressing previous issues with broken anomaly notification links. This enhancement provides users with a straightforward path to detailed scan outcomes.
-
Property Display Refinement for Various Field Types:
- Corrected illogical property displays for specific field types like Date/Timestamp. The system now intelligently displays only properties relevant to the selected data type, eliminating inappropriate options. This update also includes renaming 'Declared Type' to 'Inferred Type' and adjusting the logic for accurate representation.
-
Timezone Consistency in Insights and Activity Pages:
- Implemented improvements in timezone handling across Insights and Activity pages. These changes ensure that date aggregations are accurately aligned with the user's local time, eliminating previous inconsistencies compared to the Operations list results.
-
Fixed Breadcrumb Display in the Datastore for Members with Restricted Permissions:
- Enhanced the datastore interface to address issues faced by members with limited permissions. This update also fixes misleading breadcrumb displays and ensures that correct datastore enhancement information is visible.
-
Resolved State Issue in Bulk Check Archive:
- Addressed a bug in the bulk selection process for archiving checks. The fix corrects an issue where the system recognized individual selections instead of the intended group selection due to an overlooked edge case.
-
Improved Operation Modal State Management:
- Tackled state management inconsistencies in Operation Modals. Fixes include resetting the remediation strategy to its default and ensuring 'include' options do not carry over previous states erroneously.
-
Eliminating Infinite Load for Non-Admin Enrichment Editing:
- Solved a persistent loading issue in the Enrichment form for non-admin users. Updates ensure a smoother, error-free interaction for these users, improving accessibility and functionality.
-
General Fixes and Improvements
2024.01.30
Feature Enhancements
-
Enhanced External Scan Operations:
- Improved data handling in External Scans by applying type casting to uploaded data using Spark. This update is particularly significant for date-time fields, which now expect and conform to ISO 8601 standards.
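For date-time fields in files submitted to an External Scan, ISO 8601 values can be produced with standard tooling. The snippet below is a minimal Python sketch (the timestamp value is illustrative), not part of the platform itself:

```python
from datetime import datetime, timezone

# Illustrative only: serialize a timestamp to ISO 8601 (e.g. "2024-01-30T14:05:00+00:00")
# before including it in a file submitted to an External Scan.
event_time = datetime(2024, 1, 30, 14, 5, tzinfo=timezone.utc)
iso_value = event_time.isoformat()            # '2024-01-30T14:05:00+00:00'

# Parsing an ISO 8601 string back into a datetime works the same way:
parsed = datetime.fromisoformat(iso_value)
assert parsed == event_time
```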
-
Optimized DFS File Reading:
- Streamlined file reading in DFS by storing and utilizing the 'file_format' identified during the Catalog operation. This change eliminates the need for repeated format inspection on each read, significantly reducing overhead, especially for partitioned file types.
General Fixes
-
Resolved DFS Reading Issues with Special Character Headers:
- Fixed a DFS reading issue where columns with headers containing special characters (like pipes |) adversely affected field profiling, including inaccuracies in histogram generation.
-
General Fixes and Improvements
2024.01.26
Feature Enhancements
-
Incremental Scan Starting Threshold:
- Introduced a "Starting Threshold" option for incremental Scans. This feature allows users to manually set a starting value for the incremental field in large tables, bypassing the need to scan the entire dataset initially. It's handy for first-time scans of massive databases, facilitating more efficient and targeted data scanning.
-
Add Support for Archiving Anomalies:
- Implemented the capability of archiving anomalies. Users can now remove anomalies from view without permanently deleting them, providing greater control and flexibility in anomaly management.
-
External Scan Operation for Ad hoc Processes:
- Introduced 'External Scan Operation' as a new feature enabling ad hoc data validation for all containers. This operation allows users to validate ad hoc data, such as Excel or CSV files, against a container's existing checks and enrichment configuration. The provided file's structure must align with the container's schema, ensuring a seamless validation process.
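As a rough, hedged illustration of the alignment requirement (not the platform's own validation logic), a file's header can be compared to the container's expected columns before submitting it; the column names below are hypothetical:

```python
import csv

# Hypothetical expected schema for the target container (illustrative field names).
expected_columns = ["order_id", "customer_id", "order_date", "total_amount"]

def columns_match(csv_path: str) -> bool:
    """Return True if the CSV header exactly matches the expected column list."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return header == expected_columns

# Example usage:
# if not columns_match("orders_adhoc.csv"):
#     raise ValueError("File structure does not align with the container's schema")
```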
General Fixes
-
Preventing Unrelated Entity Selection in Check Form:
- Fixed an issue in the Check Form where users could inadvertently select unrelated entities. Selecting datastores, containers, and fields is restricted during any ongoing data loading, preventing mismatched entity selections.
-
- Performance enhancements for BigQuery and Snowflake, removing the need for count operations during full table analysis
-
General Fixes and Improvements
2024.01.23
Feature Enhancements
-
Introduction of 'Expected Schema' Rule for Advanced Schema Validation:
- Introduced the 'Expected Schema' rule, replacing the 'Required Fields' rule. This new rule asserts that all selected fields are present and their data types match predefined expectations, offering more comprehensive schema validation. It also includes an option to validate additional fields added to the schema, allowing users to specify whether the presence of new fields should cause the check to fail.
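A minimal sketch of the assertion the rule expresses, assuming a simple mapping of field names to declared types; the field names, types, and option flag below are illustrative rather than the platform's internal representation:

```python
# Illustrative only: the Expected Schema rule asserts that every expected field is
# present with the expected type, and (optionally) that no unexpected fields appear.
expected = {"id": "integer", "email": "string", "created_at": "timestamp"}
observed = {"id": "integer", "email": "string", "created_at": "timestamp", "notes": "string"}

missing_or_mismatched = {
    name: dtype for name, dtype in expected.items()
    if observed.get(name) != dtype
}
extra_fields = set(observed) - set(expected)

allow_additional_fields = True   # toggle corresponding to the "additional fields" option
passed = not missing_or_mismatched and (allow_additional_fields or not extra_fields)
print(passed)   # True with the values above
```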
-
Refined Tree Navigation Experience:
- Updated the tree navigation to prevent automatic expansion of nodes upon selection and eliminated the auto-reset behavior when re-selecting an active node. These changes provide a smoother and more user-friendly navigation experience, especially in tables/files with numerous fields.
-
Locked/Unlocked Status Filter in Library Page:
- Added a new filter feature to the Library page, enabling users to categorize and view check templates based on their Locked or Unlocked status. This enhancement simplifies the management and selection of templates.
-
Improved Messaging for Locked Template Properties in Check Form:
- Enhanced the Check Form UX by adding informative messages explaining why certain inputs are disabled when a check is associated with a locked template. This update enhances user understanding and interaction with the form.
General Fixes
-
Corrected Insights Metrics for Check Templates:
- Fixed an issue where check templates were incorrectly counted as checks in related metrics and counts on the Insights page. Templates are now appropriately filtered out, ensuring accurate representation of check-related data.
-
Enabled Template Creation with Calculated Rules:
- Resolved a limitation that prevented the creation of templates using calculated rules like 'Satisfies Expression' and 'Aggregation Comparison'. This fix expands the capabilities and flexibility of template creation.
-
General Fixes and Improvements
2024.01.11
Feature Enhancements
-
Introduction of Check Templates:
- Implemented Check Templates to offer a balance between flexibility and consistency in quality check management. Checks can now be associated with templates in either a 'locked' or 'unlocked' state, allowing for synchronized properties or independent customization, respectively. This feature streamlines check management and enables efficient tracking and review of anomalies across all checks associated with a template.
-
isType Rule Implementation:
- Replaced the previous dataType rule with the new isType rule for improved accuracy and understanding. The isType rule is now specifically tailored to assert only against string fields, enhancing its applicability and effectiveness.
-
Enhanced Container Details Page with Identifier Icons:
- Updated the Container Details page to display icons for key container identifiers, including Partition Field, Grouping Fields, and Exclude Fields. This enhancement provides a more intuitive and informative user interface, facilitating easier identification and understanding of container characteristics.
General Fixes
-
Notification System Reliability Improvement:
- Fixed intermittent failures in the notifications system. Users will now receive reliable notifications for identified anomalies, ensuring timely awareness and response to data irregularities.
-
Safeguard Against Overlapping Scheduled Operations:
- Implemented a mechanism to prevent the overloading of deployments due to overlapping scheduled operations. If a scheduled operation doesn’t complete before its next scheduled run, the subsequent run will be skipped, thereby avoiding potential strain on system resources.
-
Correction of Group-by Field Display in Containers:
- Resolved an issue where selected grouping fields were not appearing in the list fields of a container. This fix ensures that user-specified fields for group-by operations are correctly displayed, maintaining the integrity of data organization and analysis.
-
General Fixes and Improvements
2024.01.04
Feature Enhancements
- Enhanced Warnings for Schema Inconsistencies in Files Profiled
- Improved the warning message for cases where the user profiles files with different schemas under a single glob pattern. This update ensures users receive clear, helpful information when files within a glob have inconsistent structures.
General Fixes
-
Containers with 'Group By' settings Leading to Erroneous Profile Operation
- Fixed an issue affecting profile operations which included containers with 'Group By' settings. Previously, running a profile without inferring checks resulted in all fields being erroneously removed from the field list.
-
General Fixes and Improvements
2023.12.20
General Fixes
-
Resolved Datastore Creation Issue with Databricks:
- Fixed an issue encountered when creating source datastores using Databricks with catalog names other than the default `hive_metastore`. This fix ensures a smoother and more flexible datastore creation process in Databricks environments.
-
Conflict Resolution for 'anomaly_uuid' Field in Source Container:
- Corrected a problem where source containers with a field named `anomaly_uuid` were unable to run scan operations. This fix eliminates the conflict with internal system columns, allowing for uninterrupted operation of these containers.
-
General Fixes and Improvements
2023.12.14
Feature Enhancements
-
Auto-Detection of Partitioned Files:
- Improved file handling to automatically detect partitioned files like `*.delta` without the need for an explicit extension. This update resolves the issue of previously unrecognized delta tables.
-
Anomaly Weight Threshold for Notifications:
- Enhanced the notification system to support a minimum anomaly weight threshold for the trigger type "An anomaly is detected". Notifications will now be triggered only for anomalies that meet or exceed the defined weight threshold.
-
Team Assignment in Datastore Forms:
- Updated the Datastore Forms to enable users to manage teams. This enhancement provides Admins with the flexibility to assign or adjust teams right at the point of datastore setup, moving away from the default assignment to the Public team.
General Fixes
-
Corrected Health Page Duplication:
- Addressed an issue on the Health Page where "Max Executors" information was being displayed twice. This duplication has been removed for clearer and more accurate reporting.
-
General Fixes and Improvements
2023.12.12
Feature Enhancements
- Incremental Catalog Results Posting:
- Enhanced the catalog operation to post results incrementally for each container catalogued. Previously, results were only available after the entire operation was completed. With this enhancement, results from successfully catalogued containers are now preserved and posted incrementally, ensuring containers identified are not lost even if the operation does not complete successfully.
General Fixes
-
Aggregation Comparison Rule Filter:
- Resolved an issue where filters were not being applied to the Aggregation Comparison Check, affecting both the reference and target filters.
-
Case-Insensitive File Extension Support:
- Addressed a limitation in handling file extensions, ensuring that uppercase formats like .TXT and .CSV are now correctly recognized and processed. This update enhances the system's ability to handle files consistently, irrespective of extension case.
-
SLA Violation Notification Adjustment:
- Modified the SLA violation notifications to trigger only once per violation, preventing a flood of repetitive alerts and improving the overall user experience.
-
Source Record Not Available for Max Length Rule:
- Addressed a bug where the Max Length Rule was not producing source records in cases involving null values. The rule has been updated to correctly handle null values, ensuring accurate anomaly marking and data enrichment.
-
General Fixes and Improvements
2023.12.08
Breaking Changes
-
Renaming of Enrichment Datastore Tables
Due to a lack of consistency and to avoid conflicts between different categories of Enrichment tables, the table name patterns were changed:
- The Enrichment table previously named `<enrichment_prefix>_anomalies` has been renamed to `<enrichment_prefix>_failed_checks` to reflect its content and granularity.
- The terms `remediation` and `export` were added to distinguish Enrichment Remediation and Export tables from others, resulting in:
  - `<enrichment_prefix>_remediation_<container_name>` for Remediation tables.
  - `<enrichment_prefix>_export_<asset>` for Export tables.
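The new patterns can be illustrated with a small sketch; the prefix, container name, and asset name below are placeholders, not values produced by the platform:

```python
# Placeholders for illustration; substitute your own enrichment prefix and names.
prefix = "qualytics"
container_name = "orders"
asset = "checks"

failed_checks_table = f"{prefix}_failed_checks"                  # formerly <prefix>_anomalies
remediation_table   = f"{prefix}_remediation_{container_name}"   # Remediation tables
export_table        = f"{prefix}_export_{asset}"                 # Export tables

print(failed_checks_table, remediation_table, export_table)
```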
Feature Enhancements
- Refactor Notifications Panel:
- Introduced a new side panel for Notifications, categorizing alerts by type (Operations, Anomalies, SLA) for improved organization.
- Added notification tags, receivers, and an action menu enabling users to mute or edit notifications directly from the panel
- Enhanced UI for better readability and interaction, providing an overall improved user experience.
- Added Anomalies as an available Enrichment Export asset:
- Anomalies are now supported as a type of asset for export to an enrichment datastore, enhancing data export capabilities.
- Added file count metric to profile operation summary:
- Displayed file count (number of partitions) in addition to existing file patterns count metric in profile operations for DFS datastores.
- Improved Globbing Logic:
- Optimized support for multiple subgroups when globbing files from DFS datastores during profile operations, enhancing efficiency.
General Fixes
- General Fixes and Improvements
2023.12.05
Feature Enhancements
- Navigation Improvements in Explore Profiles Page:
- Upgraded the Explore Profiles Page by adding direct link icons for more precise navigation. Users can now use these links on container and field cards/lists for a direct redirection to detailed views.
General Fixes
- General Fixes and Improvements
2023.12.01
Feature Enhancements
-
List View Layout Support:
- Introduced list view layouts for Datastores, Profiles, Checks, and Anomalies, providing users with an alternative way to display and navigate through their data.
-
Bulk Acknowledgement Performance:
- Improved the performance of bulk acknowledging in-app notifications, streamlining the user experience and enhancing the application's responsiveness.
General Fixes
-
Checks and Anomalies Dialog Navigation:
- Resolved an issue with arrow key navigation in Checks and Anomalies dialogs where unintended slider movement occurred when using keyboard navigation. This fix ensures that arrow keys will only trigger slider navigation when the dialog is the main focus.
-
Profiled Container Count Inconsistency
- Ensured that containers that fail to load data during profiling are not mistakenly counted as successfully profiled, improving the accuracy of the profiling process.
-
Histogram Field Selection Update:
- Fixed a bug where histograms were not updating correctly when navigating to a new field. Histograms now properly reflect the data of the newly selected field.
-
General Fixes and Improvements
2023.11.28
Feature Enhancements
-
Operations with Tag Selectors:
- Users can now configure operations (including schedules) with multiple tags, enabling dynamic profile evaluation based on tags at the operation's trigger time.
-
Asserted State Filter for Checks:
- Introduced a new check list filter, allowing users to filter checks by those that have passed or identified active anomalies.
-
Bulk Delete for Profiles:
- Enhanced the system to allow bulk deletion of multiple profiles, streamlining the management process where previously only individual deletions were possible.
-
Resizable Columns in Source Records Table:
- Columns in the anomaly dialog source records can now be manually resized, improving visibility and preventing content truncation.
-
Automated Partition Field Setting for BigQuery:
- For BigQuery tables constrained by a required partition filter, the profile partition field setting is now automatically populated during the Catalog operation.
General Fixes
-
Sharable Link Authentication Flow:
- Fixed an issue where direct links did not work if the user was not signed in. Now, users are redirected to the intended page post-authentication.
-
Clarified Violation Messages for 'isUnique' Check:
- Updated the violation message for the 'isUnique' check to clearly describe the anomaly, reducing misinterpretation.
-
Access Restriction and Loading Fix for Health Page:
- Corrected the health page visibility so only admin users can view it, and improved loading behavior for Qualytics services.
-
Availability of Requested Tables During Operations:
- The dialog displaying requested tables/files is now accessible immediately after an operation starts, enhancing transparency for both Profile and Scan operations.
-
General Fixes and Improvements
2023.11.14
Feature Enhancements
- Qualytics App Color Palette and Design Update:
- Implemented a comprehensive design update across the Qualytics App, introducing a new color palette for a refreshed and modern look. This update includes a significant change to the anomalies color, transitioning from red to orange for a more distinct visual cue. Additionally, the font-family has been updated to enhance readability and provide a more cohesive aesthetic experience across the application.
- System Health Readout:
- A new `Health` tab has been added to the Admin menu, offering a comprehensive view of each deployment's operational status. This feature encompasses critical details such as the status of app services, current app version, and analytics engine information, enabling better control over system health.
- Enhanced Check with Metadata Input:
- The Check form now includes a new input field for custom metadata. This enhancement allows users to add key-value pairs for tailored metadata, significantly increasing the flexibility and customization of the Check definition.
- Responsiveness Improvement in Cards Layout:
- The Cards layout has been refined to improve responsiveness and compactness. This adjustment addresses previous UI inconsistencies and ensures a consistent visual experience across different devices, enhancing overall usability and aesthetic appeal.
- Source Record Enrichment for 'isUnique' Checks:
- The `isUnique` check has been enhanced to support source record enrichment. This significant update allows users to view specific records that fail to meet the 'isUnique' condition. This feature adds a layer of transparency and detail to data validation processes, enabling users to easily identify and address data uniqueness issues.
- New Enrichment Data:
- Scan operations now record operation metadata in a new enrichment table with the suffix `scan_operations`, including an entry for each table/file scanned with the number of records processed and anomalies identified, as well as start/stop time and other relevant details.
- Insights Enhancement with Check Pass/Fail Metrics:
- Insights now features the checks section with new metrics indicating the total number of checks passed and failed. This enhancement also offers a visual representation through a chart, detailing the passed and failed checks over a specified reporting period.
General Fixes
- `isAddress` now supports defining multiple checks against the same field with different required label permutations
- General Fixes and Improvements
2023.11.08
Feature Enhancements
-
Is Address Check:
- Introduced a new check for address conformity that ensures the presence of required components such as road, city, and state, enhancing data quality controls for address fields. This check leverages machine learning to support multilingual street address parsing/normalization trained on over 1.2 billion records of data from over 230 countries, in 100+ languages. It achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before).
-
Revamped Heatmap Flow in Activity Tab:
- Improved the user interaction with the heatmap by filtering the operation list upon selecting a date. A new feature has been added to operation details allowing users to view comprehensive information about the profiles scanned, with the ability to drill down to partitions and anomalies.
-
Link to Schedule in Operation List:
- Enhanced the operation list with a new "Schedule" column, providing direct links to the schedules triggering the operations, thus improving traceability and scheduling visibility.
-
Insights Tag Filtering Improvement:
- Enhanced the tag filtering capability on the Insights page to now include table/file-level analysis. This ensures a more granular and accurate reflection of data when using tags to filter insights.
-
Support for Incremental Scanning of Partitioned Files:
- Optimized the incremental scanning process by tracking changes at the record level rather than the last modified timestamp of the folder. This enhancement prevents the unnecessary scanning of all records and focuses on newly added data.
General Fixes
- General Fixes and Improvements
2023.11.02
Feature Enhancements
-
Auto Selection of All Fields in Check Form:
- Improved the user experience in the Check Form by introducing a "select all" option for fields. Users can now auto-select all fields when applying rules that expect a multi-select input, streamlining the process, especially for profiles with a large number of fields.
-
Enhanced Profile Operations with User-Defined Starting Points for Profiling:
- Users can now specify a value for the incremental identifier to determine the set of records that will be analyzed.
- Two new options have been added:
- Greater Than Time: Targets profiles with incremental timestamp strategies, allowing the inclusion of rows where the incremental field's value surpasses a specified time threshold.
- Greater Than Batch: Tailored for profiles employing an incremental batch strategy, focusing the analysis on rows where the incremental field’s value is beyond a certain numeric threshold.
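Conceptually, both options apply a lower bound on the incremental field; the sketch below shows the equivalent filtering logic with illustrative field names and thresholds:

```python
from datetime import datetime

# Illustrative rows with an incremental timestamp field and an incremental batch field.
rows = [
    {"updated_at": datetime(2023, 10, 1), "batch_id": 41},
    {"updated_at": datetime(2023, 11, 5), "batch_id": 57},
]

# "Greater Than Time": include rows whose incremental timestamp exceeds a time threshold.
time_threshold = datetime(2023, 11, 1)
time_scoped = [r for r in rows if r["updated_at"] > time_threshold]

# "Greater Than Batch": include rows whose incremental batch value exceeds a numeric threshold.
batch_threshold = 50
batch_scoped = [r for r in rows if r["batch_id"] > batch_threshold]

print(len(time_scoped), len(batch_scoped))   # 1 1
```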
-
Configurable Enrichment Source Record Limit in Scan Operations:
- Users can now configure the `enrichment_source_record_limit` to dictate the number of anomalous records retained for analysis, adapting to various use case necessities beyond the default sample limit of 10 per anomaly. This improvement allows for a more tailored and comprehensive analysis based on user requirements.
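How the limit is supplied depends on how the scan is triggered; the sketch below shows a hypothetical API payload carrying the setting. The endpoint path and the other payload fields are assumptions for illustration only, so consult the API reference for the actual request shape:

```python
import json
import urllib.request

# Hypothetical scan-operation payload; only `enrichment_source_record_limit` is the
# setting described above — the endpoint and remaining field names are assumptions.
payload = {
    "datastore_id": 123,
    "container_names": ["orders"],
    "enrichment_source_record_limit": 50,   # retain up to 50 anomalous records per anomaly
}

req = urllib.request.Request(
    "https://your-qualytics-instance/api/scan-operations",   # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer <token>"},
    method="POST",
)
# urllib.request.urlopen(req)   # uncomment to actually send the request
```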
-
Introduction of Passed Status in Check Card:
- A new indicative icon has been added to the Check Card to assure users of a "passed" status based on the last scan. This icon will be displayed only when there are no active anomalies.
-
Inclusion of Last Asserted Time in Check Card:
- Enhanced the Check Card by including the last asserted time, offering users more detailed and up-to-date information regarding the checks.
-
Enhanced Anomaly Search with UUID Support:
- Improved the anomaly search functionality by enabling users to search anomalies using the UUID of the anomaly, making the search process more flexible and comprehensive.
General Fixes
- General Fixes and Improvements
2023.10.27
Feature Enhancements
-
Check Creation through Field Details Page:
- Users can now initiate check creation directly from the Field Details page, streamlining the check creation process and improving usability.
-
Tree View Enhancements:
- Introduced a favorite group feature where favorite datastores are displayed in a specific section, making them quicker and easier to access.
- Added search functionalities at both Profile and Field levels to improve the navigation experience.
- Nodes now follow the default sorting of pages, creating consistency across various views.
- Enhanced the descriptions in tree view nodes for non-catalogued datastores and non-profiled profiles, providing a clearer explanation for the absence of sub-items.
-
Bulk Actions for Freshness & SLAs:
- Users can now perform bulk actions in Freshness & SLAs, enabling or disabling freshness tracking and setting or unsetting SLAs for profiles efficiently.
-
Archived Check Details Visualization:
- Enhanced the anomaly modal to allow users to view the details of archived checks in a read-only mode, improving the visibility and accessibility of archived checks’ information.
-
User Pictures as Avatars:
- User pictures have been incorporated across the application as avatars, enhancing the visual representation in user listings, teams, and anomaly comments.
-
Slide Navigation in Card Dialogs:
- Introduced a slide navigation feature in the Anomalies and Checks dialogs, enhancing user navigation. Users can now effortlessly navigate between items using navigational arrows, eliminating the need to close the dialog to view next or previous items.
General Fixes
- General Fixes and Improvements
2023.10.23
Feature Enhancements
-
Enhanced Data Asset Navigation:
- Tree View Implementation: Easily navigate through your data assets with our new organized tree view structure
- Context-Specific Actions: Access settings and actions that matter most depending on your current level of interaction.
- Simplified User Experience: This update is designed to streamline and simplify your data asset navigation and management.
-
Aggregation Comparison Check:
- New Rule Added: Ensure valid comparisons by checking the legitimacy of operators between two aggregation expressions.
- Improved Monitoring: Conduct in-depth comparisons, such as verifying if total row counts match across different source assets.
-
Efficient Synchronization for Schema Changes:
- Seamless Integration: Our system now adeptly synchronizes schema changes in source datastores with Qualytics profiles.
- Avoid Potential Errors: We reduced the risk of creating checks with fields that have been removed or altered in the source datastore.
-
Clarity in Quality Check Editors:
- Distinct Update Sources: Easily identify if an update was made manually by a user or automatically through the API.
-
Dynamic Quality Score Updates:
- Live Anomaly Status Integration: Quality Scores now reflect real-time changes based on anomaly status updates.
General Fixes
- Various bug fixes and system improvements for a smoother experience.
2023.10.13
Feature Enhancements
-
Export Metadata Enhancements:
- Added a "weight" property to the quality check asset
-
New AWS Athena Connector:
- Introduced support for a new connector, AWS Athena, expanding the options and flexibility for users managing data connections.
-
Operations List:
- Introduced a multi-select filter to the operation list, enabling users to efficiently view operations based on their status such as running, success, failure, and warning, thereby streamlining navigation and issue tracking.
General Fixes
- Logging Adjustments:
- Enhanced logging for catalog operations, ensuring that logs are visible and accessible even for catalogs with a warning status, facilitating improved tracking and resolution of issues.
- General Fixes and Improvements
2023.10.09
Feature Enhancements
-
Check Categorization:
- Introduced new check categories on the checks page to streamline UX and prioritize viewing:
- Important: Designed around a check's weight value, this category will by default comprise authored checks and inferred checks with active anomalies.
- Favorite: Featuring all user-favorited checks
- Metrics: Incorporating all metric checks
- All: Displaying all checks, whether inferred, authored, or anomalous
- The default view is set to "Important" (if available) to highlight critical checks and avoid overwhelming users
-
Anomalies Page Update:
- Revamped the Anomalies page with a simplified status filter, adopting a design in alignment with the checks page:
- Quick Status Filter: Facilitates an effortless switch between anomaly statuses.
- The "Active" tab is presented as the default, providing immediate visibility into ongoing anomalies.
-
Notification Testing:
- Enhanced the Notification Form with a "Test Notification" button, enabling users to validate notification settings before saving
-
Metadata Export to Enrichment Stores:
- Enabled users to export metadata from their datastore directly into enrichment datastores, with initial options for quality checks and field profiles.
- Users can specify which profiles to include in the export operation, ensuring relevant data transfer.
General Fixes
- General Fixes and Improvements
2023.10.04
Feature Enhancements
-
Anomalies Details User Experience:
- Implemented a "skeleton loading" feature in the Anomaly Details dialog, enhancing user feedback during data loading.
-
Enhanced Check Dialog:
- Added "Last Updated" date to the Check Dialog to provide users with additional insights regarding check modifications.
-
API Engine Control:
- Exposed a new endpoint allowing users to gracefully restart the analytics engine through the API.
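A hedged example of invoking such an endpoint is shown below; the URL path and authentication header are placeholders rather than documented values:

```python
import urllib.request

# Placeholder endpoint and token: the actual route for gracefully restarting the
# analytics engine is defined by the Qualytics API, not by this sketch.
req = urllib.request.Request(
    "https://your-qualytics-instance/api/engine/restart",   # hypothetical path
    headers={"Authorization": "Bearer <token>"},
    method="POST",
)
# urllib.request.urlopen(req)   # uncomment to issue the restart request
```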
General Fixes
- Timezone Handling on MacOS:
- Resolved an issue affecting timezone retrieval due to MacOS privacy updates, ensuring accurate timezone handling.
- Notifications and Alerts:
- Pager Duty Integration: Resolved issues preventing message sending and improved UI for easier configuration.
- HTTP Action Notification: Fixed Anomaly meta-data serialization issues affecting successful delivery in some circumstances.
- Scan Duration Accuracy:
- Adjusted scan duration calculations to accurately represent the actual processing time, excluding time between a failed scan and a successful retry.
- Spark Partitioning:
- Certain datastores may fail to properly coerce types into Spark-compatible partition column values if that column itself contains anomalous values. When this occurs, an attempt will be made to load the data without a partition column and a warning will be generated for the user.
- General Fixes and Improvements
2023.09.29
Feature Enhancements
-
Operations & Schedules UI Update:
- Redesigned the UI for the operations and schedules lists for a more intuitive UX and to provide additional information.
- Introduced pagination, filtering, and sorting for the schedules list.
- Added a "Next Trigger" column to the schedules list to inform users of upcoming schedule triggers.
- Improved Profile List Modal:
- Enhanced the profile list modal accessible from operations and schedules.
- Users can now search by both ID and profile name.
-
Check Navigation Enhancements:
- Enhanced navigation between Standard and Metric Cards by introducing direct links that allow users to access metric charts seamlessly from check forms.
- The checks page navigation state is now reflected in the URL, enhancing UX and enabling precise redirect capabilities.
-
Computed Table Enhancements:
- Upon the creation or update of a computed table, a minimalistic profile operation is now automatically triggered. This basic profile limits sampling to 1,000 records and does not infer quality checks.
- This enhancement streamlines the process when working with computed tables. Users can now directly create checks after computed table creation without manually initiating a profile operation, as the system auto-fetches required field data types.
-
Analytics Engine Enhancements:
- This release replaces our previous consistency model with a more robust one relying upon AMQP brokered durable messaging. The change dramatically improves Qualytics' internal fault tolerance with accompanying performance enhancements for common operations.
General Fixes
- Insights Filter Consistency:
- Fixed an inconsistency issue with the datastore filter that was affecting a couple of charts in Insights
- General Fixes and Improvements
2023.09.21
Feature Enhancements
-
Anomalies Modal Redesign:
- Streamlined the presentation of Failed Checks by removing the Anomalous Fields grouping. The new layout focuses on a list of Failed Checks, each tagged with the associated field(s) name, if applicable. This eliminates redundancy and simplifies the UI, making it easier to compare failed checks directly against the highlighted anomalous fields in the Source Record.
- Added the ability to filter Failed Checks by anomalous fields.
- Introduced direct links to datastores and profiles for enhanced navigation.
- Updated the tag input component for better UX.
- Removed the 'Hide Anomalous' option and replaced it with an 'Only Anomalous' option for more focused analysis.
- Included a feature to display the number of failed checks a field has across the modal.
- Implemented a menu allowing users to copy Violation messages easily.
-
Bulk Operation for Profiles:
- Extended the profile selection functionality to allow initiating bulk operations like profiling and scanning directly from the selection interface.
General Fixes
- DFS Incremental Scans:
- Addressed an issue that caused incremental scans to fail when no new files were detected on globs. Scans will now proceed without failure or warning in such cases.
- Improve performance of the Containers endpoint
- General Fixes and Improvements
2023.09.16
Feature Enhancements
-
Insights Timeframe and Grouping:
- Trend tooltips have been refined to change responsively based on the selected timeframe and grouping, ensuring that users receive the most relevant information at a glance.
-
Enhanced PDF export for Insights:
- Incorporated the selected timeframe and grouping settings into the exported PDF, ensuring that users experience consistent detail and clarity both within the application and in the exported document.
- Added a "generated at" timestamp to the PDF exports, providing traceability and context to when the data was captured, further enhancing the comprehensiveness of exported insights.
-
Source Record Display Improvements:
- The internal columns' background color has been calibrated to offer a seamless appearance in both light and dark themes.
General Fixes
-
Time Series Chart Rendering:
- Addressed an issue where the time series chart would not display data points despite having valid measurements. The core of the problem was pinpointed to how the system handled `0` values, especially when set as min and/or max thresholds.
- Resolved inconsistencies in how undefined min/max thresholds were displayed across different comparison types. While a UI indicator was previously displayed for some comparison types, it was missing for "Absolute Change" and "Absolute Value".
-
General Fixes and Improvements
2023.09.14
Feature Enhancements
-
Insights Improvements:
- Performance has been significantly optimized for smoother interactions.
- Introduced timeframe filters, allowing users to view insights data by week, month, quarter, or year.
- Introduced grouping capabilities, enabling users to segment visualizations within a timeframe, such as by days or weeks.
-
Metric Checks Enhancements:
- Introduced a new Metric Checks tab in both the datastore and explore perspectives.
- Added a Time Series Chart within the Metric Checks tab:
- Displays check measurements over time.
- Allows on-the-fly adjustments of min/max threshold values.
- Showcases enhanced check metadata including tags, active anomaly counts, and check weights.
-
Check Form Adjustments:
- Disabled the `Comparison Type` input for asserted checks
General Fixes
- Configuring Metric Checks through the Check Form:
- Resolved a bug where users were unable to clear optional inputs such as "min" or "max".
- General Fixes and Improvements
2023.09.08
Feature Enhancements
- Presto & Trino Connectors:
- We've enhanced our suite of JDBC connectors by introducing dedicated support for both Presto and Trino. Whether you're utilizing the well-established Presto or the emerging Trino, our platform ensures seamless compatibility to suit your data infrastructure needs.
General Fixes
- Incremental Scan:
- Resolved an issue where the scan operation would fail during the "Exists In Check" if there were no records to be processed.
- General Fixes and Improvements
2023.09.07
Feature Enhancements
-
Concurrent Operations:
- Introduced the ability to run multiple operations of the same type concurrently within a single datastore, even if one is yet to finish. This brings more flexibility and efficiency in executing operations
-
Autocomplete Widget:
- A hint for a shortcut has been added, allowing users to manually trigger the autocomplete widget and enhancing usability
-
Source Record Display Enhancements:
- Added a new 'Hide Anomalous' option, providing users with the choice to hide anomalous records for clearer viewing
- Transitioned from hover-based tooltips to click-activated ones for better UX
- For a consistent data presentation, internal columns will now always be displayed first
-
Check Form Improvements:
- Users now receive feedback directly within the form upon successful validation, replacing the previous toast notification method
- Additionally, for 504 validation timeouts, a more detailed and context-specific message is provided
General Fixes
- Addressed issues for 'Is Replica Of' failed checks in source record handling
- General Fixes and Improvements
2023.08.31
General Fixes
- Fixed an issue where the Source Record remediation was incorrectly displayed for all fields
- Adjusted the display of field Quality Scores and Suggestion Scores within the Source Record
- Fixed a bug in the Check Form where the field input wouldn’t display when cloning a check that hasn’t been part of a scan yet
- Resolved an issue where failed checks for shape anomalies were not receiving violation messages
2023.08.30
Feature Enhancements
-
Anomaly Dialog Updates:
- Optimized Source Data Columns Presentation: To facilitate faster identification of issues, anomalous fields are now presented first. This enhancement will prove particularly useful for data sources with a large number of columns.
- Enhanced Sorting Capabilities: Users can now sort the source record data by name, weight, and quality score, providing more flexible navigation and ease of use.
- Field Information at a Glance: A new menu box has been introduced to deliver quick insights about individual fields. Users can now view weight, quality score, and suggested remediation for each field directly from this menu box.
-
Syntax Highlighting Autocomplete Widget:
- Improved UX: The widget has been enhanced to better identify and display hint types, including distinctions between tables, keywords, views, and columns. This enhancement enriches the autocomplete experience.
General Fixes
- Check Dialog Accessibility:
- Addressed an issue where the check dialog was not opening as expected when accessed through a direct link from the profile page.
- General Fixes and Improvements
2023.08.23
Feature Enhancements
-
Profiles Page:
- Introduced two new sorting methods to provide users with more intuitive ways to explore their profiles: Sort by last profiled and Sort by last scanned.
- Updated the default sorting behavior. Profiles will now be ordered by name right from the start, rather than by their creation date.
-
Add New isNotReplicaOf Check:
- With this rule, users can assert that certain datasets are distinct and don't contain matching data, enhancing the precision and reliability of data comparisons and assertions.
-
Introduce new Metric Check
- We've added a new Metric check tailored specifically for handling timeseries data. This new check is set to replace the previous Absolute and Relative Change Checks.
- To offer a more comprehensive and customizable checking mechanism, the Metric check comes with a comparison input:
- Percentage Change: Asserts that the field hasn't deviated by more than a certain percentage (inclusive) since the last scan.
- Absolute Change: Ensures the field hasn't shifted by more than a predetermined fixed amount (inclusive) from the previous scan.
- Absolute Value: During each scan, this option records the field value and asserts that it remains within a specified range (inclusive).
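A minimal sketch of the three comparison types, assuming a previous and a current measurement and user-supplied thresholds; all values are illustrative:

```python
previous, current = 200.0, 212.0

# Percentage Change: the field must not deviate by more than X percent (inclusive).
pct_limit = 5.0
pct_change_ok = abs(current - previous) / abs(previous) * 100 <= pct_limit

# Absolute Change: the field must not shift by more than a fixed amount (inclusive).
abs_limit = 15.0
abs_change_ok = abs(current - previous) <= abs_limit

# Absolute Value: the recorded value must stay within a specified range (inclusive).
min_value, max_value = 150.0, 250.0
abs_value_ok = min_value <= current <= max_value

print(pct_change_ok, abs_change_ok, abs_value_ok)   # False True True
```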
General Fixes
-
Schema Validation:
- We've resolved an issue where the system was permitting the persistence of empty values under certain conditions for datastores and checks. This fix aims to prevent unintentional data inconsistencies, ensuring data integrity.
-
General Fixes and Improvements
2023.08.18
Feature Enhancements
-
Auditing:
- Introduced significant enhancements to the auditing capabilities of the platform, designed to provide better insights and control over changes. The new auditing features empower users to keep track of change sets across all entities, offering transparency and accountability like never before. A new activity endpoint has been introduced, providing a log of user interactions across the application.
-
Search Enhancements:
- Profiles and Anomalies lists can now be searched by both identifiers and descriptions using the same search input.
-
Catalog Operation Flow Update:
- Made a minor update to the datastore creation and catalog flow to enhance user flexibility and experience. Instead of automatically running a catalog operation post datastore creation, users now have a clearer, intuitive manual process. This change offers users the flexibility to set custom catalog configurations, like syncing only tables or views.
-
Operation Flow Error Handling:
- Enhanced user experience during failures in the Operation Flow. Along with the failure message, a "Try Again" link has been added. Clicking this link will revert to the configuration state, allowing users to make necessary edits without restarting the entire operation process.
-
Sorting Enhancements:
- Introduced new sorting options: "Completeness" and "Quality Score". These options are now available on the profiles & fields pages.
General Fixes
-
Datastore Connection Edit:
- Improved the Datastore connection edit experience, especially for platforms like BigQuery. Resolved an issue where file inputs were previously obligatory for minor edits. For instance, renaming a BigQuery Datastore no longer requires a file input, addressing this past inconvenience.
-
Pagination issues:
- Resolved an issue with paginated endpoints returning 500 instead of 422 on requests with invalid parameters.
2023.08.11
Feature Enhancements
- Insights Export: Added a new feature that allows users to export Insights directly to PDF, making it easier to share and review data insights.
- Check Form UX:
- Fields in the Check Form can now be updated if the check hasn't been used in a Scan operation, offering more flexibility to users.
- Enhanced visual cues in the form with boxed information to clarify the limitations certain properties have, depending on the state of the form.
- A new icon has been introduced to represent the number of scan operations that have utilized the check, providing users with a clearer overview.
- SLA Form UX:
- Revamped Date Time handling for enhanced time zone coverage, allowing for user-specified date time configurations based on their preferred time zone.
- Filter and Sorting:
- Added Datastore Type filter and sorting for source datastores
- Added Profile Completeness sorting, as well as profile type filtering and sorting
- Added Check search by identifier or description
General Fixes
- SparkSQL Expressions: Added support for field names with special characters in SparkSQL expressions using backticks (see the sketch after this list)
- Pagination Adjustment: The pagination limit has been fine-tuned to support a maximum of 100 items per page, improving readability and navigation.
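A minimal illustration of the backtick quoting mentioned above; the field names are invented for the example:

```python
# Illustrative only: a SparkSQL expression that references field names containing
# special characters (a pipe and a space) by quoting them with backticks.
expression = "`order|id` IS NOT NULL AND `total amount` >= 0"
print(expression)
```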
2023.08.03
Maintenance Release
- Updated enrichment sidebar details design.
- Tweaked SQL input dialog sizing.
- Fixed filter components width bug.
- Retain the start time of operation on restart.
- Fixed exclude fields to throw exceptions on errors.
- Improved performance when using DFS to load reference data.
2023.07.31
Maintenance Release
- Changed UX verbiage and iconography for Anomaly status updates.
- Fixed intermittent notification template failure.
- Fixed UI handling of certain rule types where unused properties were required.
- Improved error messages when containers are no longer accessible.
- Fixed Hadoop authentication conflicts with ABFS.
- Fixed an issue where a Profile operation run on an empty container threw a runtime exception.
2023.07.29
Feature Enhancements
- Added a NotExistsIn Check Type: Introducing a new rule type that asserts that values assigned to this field do not exist as values in another field.
- Check Authoring UI enhancements: Improved user interface with larger edit surfaces and parenthesis highlighting for better usability.
- Container Details UI enhancement: Improved presentation of container information in sidebars for easier accessibility and understanding.
- Added Check Authoring Validation: Users can now perform a dry run of the proposed check against representative data to ensure accuracy and effectiveness.
- Change in default linkage between Checks and Anomalies: Filters now default to "Active" status, providing more refined results and support for specific use cases.
2023.07.25
Feature Enhancements
- Satisfies Expression Enhancement: The Satisfies Expression feature has been upgraded to automatically bind fields referenced in the user-defined expressions, streamlining integration and improving usability.
Added Support
- Extended Support for ExistsIn Checks: The ExistsIn checks now offer support for computed tables, empowering users to perform comprehensive data validation on computed data.
General Fixes
-
Enhanced Check Referencing: Checks can now efficiently reference the full dataframe by using the alias "qualytics_self," simplifying referencing and providing better context within checks.
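For illustration only, an expression might reference the full dataframe through this alias; the field name is invented and the exact expression shapes supported depend on the check type:

```python
# Illustrative only: a SparkSQL-style expression that refers to the full dataframe
# through the "qualytics_self" alias described above. This merely shows the alias in use.
expression = "amount <= (SELECT max(amount) FROM qualytics_self)"
```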
-
Improved Shape Anomaly Descriptions: Shape anomaly descriptions now include totals alongside percentages, providing more comprehensive insights into data irregularities.
-
Fix for Computed Table Record Calculation: A fix has been implemented to ensure accurate calculation of the total number of records in computed tables, improving data accuracy and reporting.
-
Enhanced Sampling Source Records Anomaly Detection: For shape anomalies, sampling source records now explicitly exclude replacement, leading to more precise anomaly detection and preserving data integrity during analysis.
2023.07.23
Bug Fixes
- Fix for total record counts when profiling large tables
2023.07.21
Feature Enhancements
- Notification Form: Enhanced the user interface and experience by transforming the Channel and Tag inputs into a more friendly format.
- Checks & Anomalies: Updated the default Sort By criterion to be based on "Weight", enabling a more effective overview of checks and anomalies.
- Profile Details (Side Panel): Introduced a tooltip to display the actual value of the records metric, providing clearer and instant information.
- Freshness Page: Added a new navigation button that directly leads to the Profile Details page, making navigation more seamless.
- Profile Details: Introduced a settings option for the user to perform actions identical to those from the Profile Card, such as changing profile settings and configuring Checks and SLAs.
- SparkSQL Inputs: Implemented a new autocomplete feature to enhance user experience. Writing SQL queries is now more comfortable and less error-prone.
2023.07.19
General Fixes
- General Fixes and Improvements
2023.07.14
Feature Enhancements
- API enhancements
- Improved performance of our json validation through the adoption of Pydantic 2.0
- Upgraded our API specification to OpenAPI 3.1.0 compatible, this uses JSON Schema 2020-12.
- Upgraded to Spark 3.4
- Significant performance enhancements for long-running tasks and shuffles
- Added support for Kerberos authentication for Hive datastores
- Enhanced processing for large dataframes with JDBC sources
- Handle arbitrarily large tables and views by chunking into sequentially processed dataframes
- Improvements for Insights view when limited data is available
- Various user experience enhancements
Bug Fixes
- Date Picker fix for Authored Checks
- Allow tags with special characters to be edited
2023.07.03
Feature Enhancements
- Insights Made Default View on Data Explorer
- Gain valuable data insights more efficiently with the revamped Insights feature, now set as the default view on the Data Explorer.
- Reworked Freshness with Sorting and Grouping
- Easily analyze and track data freshness based on specific requirements thanks to the improved Freshness feature, now equipped with sorting and grouping functionalities.
- Enhanced Tables/Files Cards Design:
- Experience improved data analysis with the updated design of tables/files cards, including added average completeness information and reorganized identifiers.
Added Support
-
Support for Recording Sample Shape Anomalies to Remediation Tables
- Address potential data shape issues more effectively as the platform now supports recording a sample of shape anomalies to remediation tables.
-
New Metrics and Redirect to Anomalies for Profile/Scan Results
- Access additional metrics for profile/scan results and easily redirect to anomalies generated by a scan from Activity tab for efficient identification and resolution of data issues.
General Fixes
- Reduced Margin Between Form Input Fields:
- Enjoy a more compact and streamlined design with a reduced margin between form input fields for an improved user experience.
Bug Fixes
- Fixed Pagination Reset Issue During Check Updates
- Pagination will no longer reset when checks are updated, providing a smoother user experience, with reset now occurring only during filtering.
- Resolved Vertical Misalignment of Check and Anomaly Icons
- The issue causing vertical misalignment between Check and Anomaly icons on the Field Profile page has been fixed, resulting in a visually pleasing and intuitive user interface.
2023.06.24
Feature Enhancements
- Refactored Partition Reads on JDBC
- Refactored partitioned reads on JDBC to improve performance, resulting in faster and more efficient data retrieval.
Bug Fixes
-
Fixed Inputs on Change Checks
- Refined inputs on change checks to differentiate between Absolute and Relative measurements, ensuring precise detection and handling of data modifications based on numeric values (Absolute) and percentage (Relative) variations.
-
Resolved Enum Type Ordering Bug for Paginated Views
- Fixed bug causing inconsistent and incorrect sorting of enum values across all paginated views, ensuring consistent and accurate sorting of enum types.
General Fixes
- Added Success Effect
- Added effect when a datastore is configured successfully, enhancing the user experience by providing visual confirmation of a successful configuration process.
2023.06.20
Feature Enhancements
-
Reworked Tags View
- Improved the usability and visual appeal of the tags view. Added new properties like description and weight modifier to provide more detailed information and assign relative importance to tags. The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.
-
Inherited Tags Support
- Implemented support for inherited tags in taggable entities. Now tags can be inherited from parent entities, streamlining the tagging process and ensuring consistency across related items. Inherited Tags will be applied to anomalies AFTER a Scan operation.
-
Added Total Data Under Management to Insights
- Introduced a new metric under Insights that displays the total data under management. This provides users with valuable insights into the overall data volume being managed within the system.
Added Support
-
Bulk Update Support
- Introduced bulk update functionality for tables, files, and fields. Users can now efficiently Tag multiple items simultaneously, saving time and reducing repetitive tasks.
-
Smart Partitioning of BigQuery
- Enabled smart partitioning in BigQuery using cluster keys. Optimized data organization within BigQuery for improved query performance and cost savings.
Bug Fixes
- Fixed Scheduling Operation Issues
- Addressed a bug causing scheduling operations to fail with invalid days in crontabs. Users can now rely on accurate scheduling for time-based tasks without encountering errors.
General Fixes
-
Improved Backend Performance
- Implemented various internal fixes to optimize backend performance. This results in faster response times, smoother operations, and an overall better user experience.
-
Enhanced Tag Input:
- Improved tag input functionality in the Check form dialog. Users can now input tags more efficiently with enhanced suggestions and auto-complete features, streamlining the tagging process.
-
Enhanced File Input Component
- Upgraded the file input component in the Datastore form dialog, providing a more intuitive and user-friendly interface for uploading files. Simplifies attaching files to data entries and improves overall usability.
2023.06.12
Feature Enhancements
- Explore is the new centralized view of Activities, Containers (Profiles, Tables, Computed Tables), Checks, Anomalies and Insights across ALL Datastores. This new view allows for filtering by Datastores & Tags, which will persist the filters across all of the submenu tabs. The goal is to help with Critical Data Elements and filter out irrelevant information.
- Enhanced Navigation Features
- The navigation tabs have been refined for increased user-friendliness.
- Enhanced the Profile View and added a toggle between card and list views.
- `Datastores` and `Enrichment Datastores` have been unified, with a tabular view introduced to distinguish between your Source Datastores and Enrichment Datastores.
- `Explore` has been added to the main navigation, and `Insights` has been conveniently relocated into the Explore submenu.
- Renamed `Tables/Files` to `Profiles` in the Datastore details page.
Added Support
-
We're thrilled to introduce two new checks, the `Absolute Change Limit` and the `Relative Change Limit`, tailored to augment data change monitoring. These checks enable users to set thresholds on their numeric data fields and monitor fluctuations from one scan to the next. If the changes breach the predefined limits, an anomaly is generated.
- The `Absolute Change Limit` check is designed to monitor changes in a field's value by a fixed amount. If the field's value changes by more than the specified limit since the last applicable scan, an anomaly is generated.
- The `Relative Change Limit` check works similarly but tracks changes in terms of percentages. If the change in a field's value exceeds the defined percentage limit since the last applicable scan, an anomaly is generated.
General Fixes
- General UI fixes with new navigational tabs
- Resolved an issue when creating a computed table
- Added the ability to delete operations along with their related results.
- Renamed "Rerun" button to "Retry" in the operation list
2023.06.02
General Fixes
- Added GCS connector with Keyfile support:
- The GCS connector now supports Keyfile authentication, allowing users to securely connect to Google Cloud Storage (a minimal keyfile sketch follows at the end of this list).
- Improved BigQuery connector by removing unnecessary inputs:
- Enhancements have been made to the BigQuery connector by streamlining the inputs, eliminating any unnecessary fields or options.
- This results in a more user-friendly and efficient experience.
- Renamed satisfiesEquation to satisfiesExpression:
- The function "satisfiesEquation" has been renamed to "satisfiesExpression" to better reflect its functionality.
- This change makes it easier for users to understand and use the function.
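To illustrate the kind of expression the renamed satisfiesExpression check evaluates, the sketch below filters rows that violate a Spark SQL expression using PySpark. The column names and expression are invented, and this is a local illustration rather than how Qualytics executes the check.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical rows: total should equal price * quantity.
df = spark.createDataFrame(
    [(10.0, 2, 20.0), (5.0, 3, 14.0)],
    ["price", "quantity", "total"],
)

# Rows where the expression is not satisfied are the candidates for anomalies.
violations = df.filter(~F.expr("total = price * quantity"))
violations.show()
```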
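For reference, the snippet below shows Keyfile authentication against Google Cloud Storage using the official Python client. The file path and bucket name are placeholders, and the connector itself is configured through the Qualytics Datastore form rather than code.

```python
from google.cloud import storage

# Authenticate with a service-account keyfile (path is a placeholder).
client = storage.Client.from_service_account_json("/path/to/keyfile.json")

# List a few objects to confirm the credentials work (bucket name is a placeholder).
for blob in client.list_blobs("my-bucket", max_results=5):
    print(blob.name)
```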
Added Support
- Added Check Description to Notification rule messages:
- Notification rule messages now include the Check Description.
- This allows users to add additional context and information about the specific rule triggering the notification and pass that information to downstream workflows.
- Added API support for tuning operations with a high correlation threshold for profiles and a high count rollup threshold for anomalies in scan:
- The API now supports tuning operations by allowing users to set a higher correlation threshold for profiles.
- It also enables users to set a higher count rollup threshold for anomalies in scan.
- This customization capability helps users fine-tune the behavior of the system according to their specific needs and preferences.
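A hedged sketch of what calling the API with these thresholds could look like is shown below. The endpoint path and JSON keys are hypothetical placeholders (consult the API reference for the real names), and the values are examples only.

```python
import requests

API = "https://your-instance.qualytics.io/api"     # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}   # token placeholder

# Hypothetical payloads; consult the API reference for the real field names.
profile_payload = {"type": "profile", "high_correlation_threshold": 0.95}
scan_payload = {"type": "scan", "high_count_rollup_threshold": 100}

for payload in (profile_payload, scan_payload):
    resp = requests.post(f"{API}/operations", headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
```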
2023.05.26
Usability
- Improved the navigation in the Activity tab's side panel for easier, more intuitive browsing, including the ability to comment directly on an anomaly.
- Added a redirect to the Activity tab when an operation is initiated for a smoother workflow.
Bug Fixes
- Resolved an issue where the date and time were not displaying correctly for the highest value in profiles.
- Fixed a problem with scheduled operations when the configured timing was corrupted.
- Addressed an issue where filtered checks were causing unexpected errors outside of the intended dataset.
2023.05.23
Feature Enhancements
- Scheduled operation editing
- Added the ability for users to edit a scheduled operation. This allows users to make changes to the schedule of an operation.
- Catalog includes filters
- Added catalog include filters to only process tables, views, or both in JDBC datastores. This allows users to control which object types are processed in the datastore.
- isReplicaOf check filters
- Added filter support to the isReplicaOf check. This allows users to control which tables are checked for replication.
- Side panel updates
- Updated side panel design and added an enrichment redirect option.
Added Support
- IBM DB2 datastore
- Added support for the IBM DB2 datastore. This allows users to connect to and process data from IBM DB2 databases.
- API support for tagging fields
- Added API support for tagging fields. This allows users to tag fields in the datastore with custom metadata.
Bug Fixes
- Freshness attempting to measure views
- Fixed an issue with freshness attempting to measure views.
- Enrichment to Redshift and string data types
- Fixed an issue with enrichment to Redshift and string data types. This issue caused enrichment to fail for tables that contained string data types.
2023.05.10
Feature Enhancements
- Container Settings
- Introducing the ability to Group fields for improved insights and profiling precision.
- Added functionality to Exclude fields from the container, allowing associated checks to be ignored during operations, leading to reduced processing time and power consumption.
- We now support identifiers on computed tables during profiling operations.
- Checks
- Improved usability by enabling quick cloning of checks within the same datastore.
- Users can now easily create a new check with minor edits to tables, fields, descriptions, and tags based on an existing check.
- Introducing the ability to write Check Descriptions to the Enrichment store, enabling better organization and management of check-related data downstream.
- Note: Updating the Enrichment store data requires a new Scan operation.
- Enhanced anomaly management by providing a convenient way to filter and view all anomalies generated by a specific check.
- Users can now access the Anomaly warning sign icon within the Check dialog, providing quick access to two options: View Anomalies and Archive Anomalies.
- Usability
- Introducing the ability to generate an API token from within the user interface.
- This can be done through the Settings > Security section, providing a convenient way to manage API authentication (a brief usage sketch follows this list).
- Added the ability to search tables/files and apply filters to running operations.
- This feature eliminates the need to rely solely on pagination, making it easier to select specific tables/files for operations.
- Included API and SparkSQL links in the documentation for easy access to additional resources and reference materials.
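Once a token has been generated under Settings > Security, it is sent as a bearer credential on API calls. The sketch below is a minimal example with a placeholder instance URL and endpoint, not an excerpt from the official API reference.

```python
import requests

# Placeholder instance URL and endpoint; the token comes from Settings > Security.
resp = requests.get(
    "https://your-instance.qualytics.io/api/datastores",
    headers={"Authorization": "Bearer <API_TOKEN>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```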
Added Support
- Hive datastore support has been added, allowing seamless integration with Hive data sources.
- Timescale datastore support has been added, enabling efficient handling of time-series data.
- Added support for HTTP(S) and SOCKS5 proxies, allowing users to configure proxy settings for data operations.
- Default encryption for RabbitMQ has been implemented, enhancing security for data transmission.
Bug Fixes
- Resolved a bug related to updating tag names, ensuring that tag name changes are properly applied.
- Fixed an overflow bug in freshness measurements for data size, resulting in accurate measurements and improved reliability.
General Fixes
- Updated default weighting for shape anomalies, enhancing the accuracy of anomaly detection and analysis.
- Increased datastore connection timeouts, improving stability and resilience when connecting to data sources.
- Implemented general bug fixes and made various improvements to enhance overall performance and user experience.
2023.04.19
We're pleased to announce the latest update, which includes enhancements to the UI for an overall better experience:
Feature Enhancements
- Added Volumetric measurements to Freshness Dashboard:
- Gain valuable insights into your data's scale and storage requirements with our new volumetric measurements. Sort by Row Count or Data Size to make informed decisions about your data resources.
- Added isReplicaOf check:
- The new isReplicaOf check allows you to easily compare data between two different tables or fields, helping you identify and resolve data inconsistencies across your datastores.
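As a rough picture of what a replica comparison asserts, the sketch below uses PySpark's exceptAll to surface rows that differ between a source table and its supposed replica. The data and column names are invented, and this is not the check's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Invented example data for a source table and its supposed replica.
source = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
replica = spark.createDataFrame([(1, "a"), (2, "x")], ["id", "value"])

# Rows present in the source but missing or different in the replica.
source.exceptAll(replica).show()
```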
Added Support
- Redesigned Checks and Anomalies listing:
- Enjoy a cleaner, more organized layout with more information that makes navigating and managing checks and anomalies even easier.
- Redesigned Anomaly Details view:
- The updated anomaly view provides a more thoughtful and organized layout.
- Improved Filter components:
- With a streamlined layout and organized categories, filtering your data is now more intuitive. Dropdown options now open to the right so that the Clear and Apply buttons remain visible.
- Updated Importance score to Weight & added SortBy support:
- Manage checks and anomalies more effectively with our updated 'Weight' feature (formerly 'Importance Score') and the new SortBy support function, allowing you to quickly identify high-priority issues.
General Fixes
- General Fixes and Performance Improvements
2023.04.07
Feature Enhancements
- We've just deployed an MVP version of the Freshness Dashboard! This feature lets you create, manage, and monitor all of the SLAs for each of your datastores and their child files/tables/containers, all in one place. It's like having a birds-eye view of how your datastores are doing in relation to their freshness.
- To access the Freshness Dashboard, just locate and click on the clock icon in the top navigation between Insights and Anomalies. By default, you'll see a rollup of all the datastores in a list view with their child files/tables/containers collapsed. Simply click on a datastore row to expand the list.
- We've also made some improvements to the UI, including more sorting and filtering options in Datastores, Files/Tables, Checks, and Anomalies. Plus, we've added the ability to search the description field in checks, making it easier to find what you're looking for.
- Last but not least, we've added a cool new feature to checks - the ability to archive ALL anomalies generated by a check. Simply click on the anomaly warning icon at the top of the check details box to bring up the archive anomalies dialog box.