Understanding DFS
What is DFS?
A Distributed File System (DFS) is a storage architecture that allows data to be stored across multiple machines or locations, while providing access to it as if it were on a single local system. In the context of modern cloud platforms, DFS encompasses cloud object storage services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage — which store data as objects (files) organized in buckets and folders, supporting formats like Parquet, AVRO, CSV, and JSON.
How DFS Works in Qualytics
Qualytics uses DFS connectors powered by Apache Spark to connect to cloud object storage and distributed file systems. When you add a DFS datastore, Qualytics:
- Establishes a connection to your storage platform using credentials (access keys, service accounts, service principals, or connection strings).
- Walks the directory tree during the Sync operation — reading files with supported extensions and creating containers based on file metadata and naming patterns.
- Reads data through Spark using native cloud connectors for optimized, parallel file reads across partitions.
- Writes enrichment data back to the storage platform to persist scan results, anomalies, and remediation records. All DFS connectors support enrichment.
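The directory walk in the Sync step can be pictured with a minimal local sketch. It is an illustration only: Qualytics reads from cloud storage through Spark, whereas this uses the local filesystem, and the `SUPPORTED_EXTENSIONS` set and the `walk_supported_files` helper are hypothetical names chosen for the example.

```python
import os
import tempfile

# Hypothetical extension list for illustration; the real list is defined by Qualytics.
SUPPORTED_EXTENSIONS = {".parquet", ".avro", ".csv", ".json", ".orc"}


def walk_supported_files(root):
    """Yield (relative_folder, filename) for each file under root with a supported extension."""
    for folder, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lower()
            if extension in SUPPORTED_EXTENSIONS:
                yield os.path.relpath(folder, root), name


if __name__ == "__main__":
    # Build a small throwaway tree and list what the walk would pick up.
    demo_root = tempfile.mkdtemp()
    os.makedirs(os.path.join(demo_root, "sales"))
    for rel_path in ("sales/orders_1.csv", "sales/readme.txt"):
        open(os.path.join(demo_root, rel_path), "w").close()
    for folder, name in walk_supported_files(demo_root):
        print(folder, name)
```

Files with other extensions (such as `readme.txt` above) are skipped, which mirrors the "supported extensions" filter described in the list.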
Connections and Security
For details on connection configuration, authentication methods (Shared Key, Service Principal, IAM roles), and secrets management (HashiCorp Vault integration), see the How Connections Work documentation.
Data Organization
In DFS datastores, data is organized as:
- Containers: Files in a folder that share a common schema. Qualytics automatically groups files with similar naming patterns into a single globbed container (e.g., orders_*.csv). This grouping process is called Filename Globbing and treats each file as a partition of the same logical dataset.
- Fields: Columns within each file, detected automatically based on the file format and schema.
- Records: Rows of data within each file, analyzed during Profile and Scan operations.
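To make the idea of a globbed container concrete, here is a small sketch of one possible grouping heuristic. It is an assumption for illustration, not the actual Qualytics algorithm: runs of digits (optionally joined by `-` or `_`) in the file stem are replaced with `*`, and files that produce the same pattern are grouped together. The function names `glob_pattern` and `group_into_containers` are hypothetical.

```python
import os
import re
from collections import defaultdict


def glob_pattern(filename):
    """Turn a filename into a glob-like pattern (assumed heuristic, not Qualytics' own)."""
    stem, extension = os.path.splitext(filename)
    # Collapse digit runs such as "1" or "2024-01-01" into a single wildcard.
    return re.sub(r"\d+(?:[-_]\d+)*", "*", stem) + extension


def group_into_containers(filenames):
    """Map each pattern to the files (partitions) that match it."""
    containers = defaultdict(list)
    for name in sorted(filenames):
        containers[glob_pattern(name)].append(name)
    return dict(containers)


if __name__ == "__main__":
    files = ["orders_1.csv", "orders_2.csv", "customers.csv"]
    for pattern, members in group_into_containers(files).items():
        print(pattern, members)
```

Under this heuristic, `orders_1.csv` and `orders_2.csv` become partitions of one `orders_*.csv` container, while `customers.csv` stays a container of its own.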
Containers
For a detailed understanding of how Qualytics manages containers in DFS datastores, see the Containers Overview documentation.
Field Type Inference
During the Sync operation, Qualytics infers field types automatically based on the file format:
- Schema-aware formats (Parquet, AVRO, ORC, Delta, Iceberg) — Field types are read directly from the file's embedded schema.
- Schema-less formats (CSV, JSON) — Qualytics uses weighted histogram analysis to infer field types from actual data values, detecting integers, decimals, dates, timestamps, and text fields. Inferred types can be reviewed and overridden manually on each field.
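The source does not describe how the weighted histogram is computed, so the following is only a rough sketch of histogram-based inference under stated assumptions: each non-empty value is classified into a candidate type, the classifications are counted, and the dominant type wins if it covers at least a `threshold` share of the values. The `classify` and `infer_field_type` helpers, the date formats, and the 0.9 threshold are all hypothetical choices.

```python
from collections import Counter
from datetime import datetime


def classify(value):
    """Return the most specific candidate type for a single string value."""
    text = value.strip()
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            caster(text)
            return label
        except ValueError:
            pass
    # Assumed formats for illustration; real data may use many others.
    for fmt, label in (("%Y-%m-%d", "date"), ("%Y-%m-%d %H:%M:%S", "timestamp")):
        try:
            datetime.strptime(text, fmt)
            return label
        except ValueError:
            pass
    return "text"


def infer_field_type(values, threshold=0.9):
    """Pick the dominant type from a histogram of per-value classifications."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return "text"
    histogram = Counter(classify(v) for v in non_empty)
    # Integers are also valid decimals, so they count toward "decimal" when both appear.
    if histogram["decimal"]:
        histogram["decimal"] += histogram.pop("integer", 0)
    best_type, count = histogram.most_common(1)[0]
    # Fall back to text when no single type clearly dominates.
    return best_type if count / len(non_empty) >= threshold else "text"
```

A column that mixes numbers with free text falls below the threshold and is treated as text, which is one reason a manual override on the field can be useful.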
Bulk Creation
DFS datastores do not support the multi-schema creation flow (which is designed for JDBC connectors with catalog/schema hierarchy). However, you can bulk-create multiple DFS datastores via the API by providing a list of root_paths in the bulk creation request. Each root path creates a separate DFS datastore. See the Datastore API for details.
Getting Started
- Add with New Connection: Create a new DFS datastore by setting up a new connection from scratch.
- Add with Existing Connection: Create a new DFS datastore by reusing credentials from an existing connection.
- Available DFS Connectors: Browse the full list of supported DFS connectors.
- Connections: Configure connection details, authentication, and secrets management.
- Supported File Formats: Parquet, AVRO, CSV, JSON, Delta, Iceberg, ORC, and more.
- Filename Globbing: How files are grouped into containers and best practices for organizing your data.
Available Operations
Once a DFS datastore is added, you can run the following operations:
| Operation | Description |
|---|---|
| Sync | Walks the directory tree, reads files with supported extensions, and creates containers based on file metadata and naming patterns. Detects new, changed, or removed files incrementally. |
| Profile | Analyzes records across containers to compute statistics, detect data patterns, and automatically infer quality checks. |
| Scan | Executes quality checks against the data, measures data quality metrics, and detects anomalies at the record and schema levels. |
| External Scan | Runs scan operations using externally provided data files. |
Tip
The recommended sequence is Sync → Profile → Scan. This cycle is repeatable — as your data evolves, re-running these operations keeps your quality checks and anomaly detection up to date.