Glossary

Anomaly

Something that deviates from the standard, normal, or expected. This can be in the form of a single data point, record, or a batch of data.

Accuracy

The data accurately represents the real-world values it is expected to model.

Catalog Operation

Used to read fundamental metadata from a Datastore required for the proper functioning of subsequent Operations such as Profile, Hash, and Scan.

Comparison

An evaluation to determine if the structure and content of the source and target Datastores match.

Comparison Runs

An execution of a Comparison between the source and target Datastores.

Completeness

Required fields are fully populated.

Conformity

Alignment of the content to the required standards, schemas, and formats.

Connectors

Components that can be easily connected to and used to integrate with other applications and databases. Common uses include sending and receiving data.

Info

Qualytics provides verified connectors for a wide range of datastores, including: Files (CSV, JSON, XLSX, Parquet, Delta, Iceberg) on Object Storage (S3, Azure Blob, GCS); Data Warehouses (BigQuery, Snowflake, Redshift); and Databases (Oracle, MSSQL, MySQL, PostgreSQL, Trino, etc.). Because Qualytics is built on Apache Spark, additional JDBC-accessible datastores or file formats may be technically compatible. If yours is not listed, contact us — our team will evaluate feasibility and work with you to determine whether a supported connection can be established.

Consistency

The value is the same across all datastores within the organization.

Container (of a Datastore)

The uniquely named abstractions within a Datastore that hold data adhering to a known schema. The Containers within an RDBMS are tables, the Containers in a filesystem are well-formatted files, etc.

Data-at-rest

Data that is stored in a database, warehouse, file system, data lake, or other datastore.

Data Drift

Changes in a data set’s properties or characteristics over time.

Data-in-flight

Data that is on the move, transporting from one location to another, such as through a message queue, API, or other pipeline.

Data Lake

A centralized repository that allows you to store all your structured and unstructured data at any scale.

Data Quality

Ensuring data is free from errors, including duplicates, inaccuracies, inappropriate fields, irrelevant data, missing elements, non-conforming data, and poor data entry.

Data Quality Check

Also known as a "Check", an expression regarding the values of a Container that can be evaluated to determine whether the actual values are expected or not.

Datastore

Where data is persisted in a database, file system, or other connected retrieval systems. You can check more in Datastore Overview.

Data Warehouse

A system that aggregates data from different sources into a single, central, consistent datastore to support data analysis, data mining, artificial intelligence (AI), and machine learning.

Distinctness (of a Field)

The ratio of distinct values (those that appear at least once) to the total number of values that appear in a Field.
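As an illustrative sketch (not Qualytics' internal implementation), distinctness of a field can be computed like this:

```python
def distinctness(values):
    """Ratio of distinct values to total non-null values in a field."""
    present = [v for v in values if v is not None]  # ignore missing entries
    if not present:
        return 0.0
    return len(set(present)) / len(present)

# A field with repeated values has distinctness below 1.0:
distinctness(["a", "b", "b", "c"])  # 3 distinct / 4 total = 0.75
```

A distinctness of 1.0 indicates every value in the field is unique.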

Enrichment Datastore

A datastore holding additional properties that are added to a data set to enhance its meaning. Qualytics enrichment includes whether a record is anomalous, what caused it to be an anomaly, what characteristics it was expected to have, and flags that allow other systems to act upon the data.

Excluded Field

A field that has been manually removed from quality monitoring by a user. Its quality checks are archived (except Expected Schema, which is updated), and dependent computed fields are also excluded. Excluded fields can be restored to active status.

Favorite

Users can mark instances of an abstraction (Field, Container, Datastore, Check, Anomaly, etc.) as a personalized favorite to ensure it ranks higher in default ordering and is prioritized in other personalized views & workflows.

Field Status

A property assigned to every field in Qualytics that determines how the platform interacts with it. The four statuses are Active, Masked, Missing, and Excluded. Field status controls whether a field is included in profiling, scanning, and quality check evaluations. Learn more in Field Status Overview.

Compute Daemon

An application that protects a system from contamination due to inputs, reducing the likelihood of contamination from an outside source. The Compute Daemon will quarantine data that is problematic, allowing the user to act upon quarantined items.

Incremental Identifier

A Field that can be used to group the records in a Table Container into distinct, ordered Qualytics Partitions in support of incremental Operations upon those partitions:

  • If the Field is a whole number, all records with the same partition_id value are considered part of the same partition.
  • If the Field is a float or timestamp, all records between two defined values are considered part of the same partition (the defining values are set by incremental scan/profile business logic).

Because Qualytics Partitions are required to support Incremental Operations, an Incremental Identifier is required for a Table Container to support incremental Operations.
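The two grouping rules above can be sketched as follows (illustrative only; the field names `partition_id` and `updated_at` are hypothetical, and the boundary values stand in for the scan/profile business logic):

```python
from bisect import bisect_right
from itertools import groupby

def partitions_by_id(records, id_field="partition_id"):
    """Whole-number identifier: records sharing a value form one partition."""
    keyed = sorted(records, key=lambda r: r[id_field])
    return {k: list(g) for k, g in groupby(keyed, key=lambda r: r[id_field])}

def partitions_by_range(records, boundaries, id_field="updated_at"):
    """Float/timestamp identifier: records between two boundary values
    share a partition; bucket index = how many boundaries precede the value."""
    buckets = {}
    for r in records:
        buckets.setdefault(bisect_right(boundaries, r[id_field]), []).append(r)
    return buckets
```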

Incremental Scan Operation

A Scan Operation where only new records (inserted since the last Scan Operation) are analyzed. The underlying Container must support determining which records are new for incremental scanning to be a valid option for it.

Inference Engine

After the Compute Daemon gathers all the metadata generated by a profiling operation, it feeds that metadata into our Inference Engine. The Inference Engine then initiates a "true machine learning" process (specifically, Inductive Learning) whereby the available customer data is partitioned into a training set and a testing set. The engine applies numerous machine learning models & techniques to the training data in an effort to discover well-fitting data quality constraints. Those inferred constraints are then filtered by testing them against the held-out testing set, and only those that assert true are converted to inferred data quality Checks.
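A toy illustration of the train/test filtering idea described above (not the actual engine; the [min, max] range constraint is just one assumed example of a candidate constraint):

```python
import random

def infer_range_check(values, train_frac=0.8, seed=0):
    """Partition values into train/test splits, infer a [min, max] constraint
    from the training split, and keep it only if every held-out test value
    asserts true against it."""
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * train_frac)
    train, test = shuffled[:split], shuffled[split:]
    lo, hi = min(train), max(train)          # candidate constraint from training data
    if all(lo <= v <= hi for v in test):     # validate against the held-out set
        return (lo, hi)                      # promoted to an inferred Check
    return None                              # discarded: failed on test data
```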

Masked Field

A field that remains fully operational (profiled, scanned, and quality-checked) but whose actual values are hidden across the platform by default. Users with Editor permission can reveal masked values, and every access is recorded in the masking audit log. Learn more in Mask a Field.

Merge Fields

An operation that combines two fields on the same container — typically used when a column is renamed in the source data. The source field keeps its history (quality checks, anomalies, profiles) and adopts the target field's name. The target field is removed. Learn more in Merge Fields.

Metadata

Data about other data, including descriptions and additional information.

Missing Field

A field that was previously active but is no longer found in the source data during a profile operation. Missing fields are automatically restored to Active when they reappear. They cannot be manually restored by a user. Learn more in Field Status Lifecycle.

Object Storage

A type of data storage used for handling large amounts of unstructured data managed as objects.

Operation

The asynchronous (often long-running) tasks that operate on Datastores are collectively referred to as "Operations". Examples include Catalog, Profile, Hash, and Scan.

Partition Identifier

A Field that can be used by Spark to group the records in a Dataframe into smaller sets that fit within a Spark worker's memory. The ideal Partition Identifier is an Incremental Identifier of type datetime, since it can serve both purposes; if one is not available, we identify alternatives.

Pipeline

A workflow that processes and moves data between systems.

Precision

The data is at the resolution that is expected: how tightly can you define your data?

Profile Operation

An operation that generates metadata describing the characteristics of your actual data values.

Profiling

The process of collecting statistics on the characteristics of a dataset involving examining, analyzing, and reviewing the data.

Proprietary Algorithms

A procedure utilizing a combination of processes, tools, or systems of interrelated connections that are the property of a business or individual in order to solve a problem.

Quality Score

A measure of data quality calculated at the Field, Container, and Datastore level. Quality Scores are recorded as time-series enabling you to track movement over time. You can read more in Quality Scoring.

Qualytics App

Also known as the "App", the user interface for our Product, delivered as a web application.

Qualytics Deployment

A single instance of our Product (the Kubernetes cluster, PostgreSQL database, controlplane/app/compute daemon pods, etc.).

Qualytics Compute Daemon

Also known as the "Compute Daemon", the layer of our Product that connects to Datastores and operates directly on users' data.

Qualytics Implementation

A customer’s Deployment plus any associated integrations.

Controlplane

Also known as the "controlplane", the layer of our Product that exposes an Application Programming Interface (API).

Qualytics Partition

The smallest grouping of records that can be incrementally processed. For DFS datastores, each file is a Qualytics Partition. For JDBC datastores, partitions are defined by each table’s incremental identifier values.

Record (of a Container)

A distinct set of values for all Fields defined for a Container (e.g. a row of a table).

Schema

The organization of data in a datastore. This could be the columns of a table, the header of a CSV file, the fields in a JSON file, or other structural constraints.

Schema Differences

Differences in the organization of information between two datastores that are supposed to hold the same content.

Source

The origin of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets extracted.

Tag

Users can assign Tags to Datastores, Profiles (Files, Tables, Containers), Checks, and Anomalies. Each Tag can carry a Description and a Weight; the weight value directly correlates with the level of importance, where a higher weight indicates higher significance.

Target

The destination of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets loaded.

Third-party data

Data acquired from a source outside of your company which may not be controlled by the same data quality processes. You may not have the same level of confidence in the data and it may not be as trustworthy as internally vetted datasets.

Timeliness

Whether data is available when it is expected. Timeliness can be calculated as the time between when information should be available and when it actually becomes available.

Volumetrics

The data has the same size and shape across similar cycles. Volumetrics includes statistics about the size of a data set, including calculations or predictions of its rate of change over time.
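A minimal sketch of one such statistic, the relative change in row count between consecutive cycles (illustrative; Qualytics' actual volumetric measurements are not specified here):

```python
def volume_change_rates(row_counts):
    """Relative change in data-set size between consecutive cycles.
    An unusually large rate may indicate a volumetric anomaly."""
    return [
        (curr - prev) / prev
        for prev, curr in zip(row_counts, row_counts[1:])
        if prev  # skip empty cycles to avoid division by zero
    ]

rates = volume_change_rates([100, 110, 105])  # +10% then roughly -4.5%
```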

Weight

The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.