Computed Join

A Computed Join Container allows you to combine data from two containers, which can be from the same source datastore or different source datastores (e.g., a database table vs. a file system container). You can choose the join type (Inner, Left, Right, or Full Outer) and apply transformations, filters, and custom queries to the joined result.

This feature is useful when you want to:

Merge information from multiple source datastores into a single dataset.
Perform cross-datastore analysis (e.g., JDBC tables with DFS files).
Apply Spark SQL transformations and filters on top of the joined data.

Let's get started 🚀

How It Works

The Add Computed Join form consists of:

REF.	FIELDS	DESCRIPTION
1	Name	The unique name for your computed join container.
2	Join Type	Choose one of the following: • Inner Join: Keeps only rows with matching keys in both containers. • Left Join: Keeps all rows from the left container, matching rows from the right. • Right Join: Keeps all rows from the right container, matching rows from the left. • Full Outer Join: Keeps all rows from both containers.
3	Left Reference	• Datastore: Source datastore where the computed join container will be created. • Container: The left container to join. • Field: The key (column) to join on. • Prefix: A label (e.g., `left`) applied to all columns from this container.
4	Right Reference	• Datastore: Source datastore containing the second container. • Container: The right container to join. • Field: The key (column) to join on. • Prefix: A label (e.g., `right`) applied to all columns from this container.
5	Select Expression	A list of columns to include in the result. Columns are automatically prefixed (e.g., `left_name`, `right_name`) to avoid conflicts.
6	Filter Clause (WHERE)	Additional filters applied to the join result.

computed-join

Example Use Case

Scenario

We want to join:

Left Container: customers
Right Container: orders
Join Key: customer_id
Join Type: Left Join
Prefixes: cust_ and order_

Input Tables

customers

customer_id	name	city
1	Alice	Berlin
2	Bob	London
3	Charlie	Paris

orders

order_id	customer_id	product
101	1	Laptop
102	1	Mouse
103	2	Keyboard

Joined Result (Left Join)

cust_customer_id	cust_name	cust_city	order_order_id	order_product
1	Alice	Berlin	101	Laptop
1	Alice	Berlin	102	Mouse
2	Bob	London	103	Keyboard
3	Charlie	Paris	NULL	NULL

Visual Diagram

+------------+                 +--------+
| customers  |    LEFT JOIN    | orders |
+------------+ <-------------> +--------+
    |                               |
    +-- customer_id = customer_id --+

API Example

Endpoint

POST:/api/containers

Expected response: 200 OK

Request Payload

{
    "container_type": "computed_join",
    "name": "customer_orders_join",
    "select_clause": "cust_customer_id, cust_name, cust_city, order_order_id, order_product",
    "where_clause": null,
    "left_join_field_name": "customer_id",
    "left_prefix": "cust",
    "right_join_field_name": "customer_id",
    "right_prefix": "order",
    "join_type": "left",
    "left_container_id": 101,
    "right_container_id": 202
}

Tips

Always set prefixes to avoid column name collisions.
Use Select Expression to choose only the columns you need.
Apply a Filter Clause for better performance by reducing unnecessary data.
Test the join type with sample data to verify expected behavior.