Combining data
Confidential Computing operates on isolated datasets from different providers and allows to securely combine them into a single, coherent input for computation.
This article explains how to perform such combinations, focusing on:
The sections below describe how to combine confidential datasets using these operations, including how to handle missing values, duplicated records, and how to resolve records with duplicated ids.
Once the data has been combined into a single confidential dataset, it is possible to compute confidential statistics using the query language.
Inner join
This operation is similar to SQL INNER JOIN
. It matches values in two tables and creates a combined row containing
data from both tables.
Consider the following two tables:
Hospital A
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
3 | positive | 2010-05-21 |
Citizen statistical data
citizen_id | education | graduation |
---|---|---|
1 | 0 | 1989-06-01 |
1 | 1 | 1991-07-01 |
2 | 1 | 2000-06-01 |
The citizen statistical data table contains several rows for the same citizen_id
value.
This means that this individual has completed several educational qualifications at varying levels.
For example, a single citizen_id
may be associated with records for both a high school diploma and a university
bachelor's degree.
When we perform Inner Join on the tables above, matching on citizen_id
, we get the following merged table:
citizen_id | cancer | encounter | education | graduation |
---|---|---|---|---|
1 | positive | 2024-02-02 | 0 | 1989-06-01 |
1 | positive | 2024-02-02 | 1 | 1991-07-01 |
2 | negative | 2020-03-03 | 1 | 2000-06-01 |
The merged table includes all columns from the source tables, with the merge column citizen_id
appearing only once.
When multiple rows share the same citizen_id
value, all matching combinations are included in the result.
If our Hospital A table included several rows with citizen_id = 1
, the inner join would contain all possible
combinations of matching rows.
Rows with citizen_id values that do not appear in both tables are excluded from the output (e.g. citizen_id = 3
).
Left Outer Join
This operation is similar to SQL LEFT OUTER JOIN
. It retains all rows from the "left" table (the first table for this
merge operation) and matching rows from the "right" table.
If no match is found, the missing values from the right table are filled with missing
.
Consider the tables from the previous example, but now we want to keep all the rows in the merged table, even for the cases where there is no match. Left outer join would yield the following merged table:
citizen_id | cancer | encounter | education | graduation |
---|---|---|---|---|
1 | positive | 2024-02-02 | 0 | 1989-06-01 |
1 | positive | 2024-02-02 | 1 | 1991-07-01 |
2 | negative | 2020-03-03 | 1 | 2000-06-01 |
3 | positive | 2010-05-21 | missing | missing |
This operation is useful when data from the right table may be missing but the corresponding rows from the left table must still be included in the analysis. For example, it allows identifying patients diagnosed with cancer who lack education data.
Union
Join operations allow combining datasets containing different information about the same entity. For example,
medical and education data about a patient.
In contrast, the Union operation allows for combining the same information about different entities,
such as patient records from multiple hospitals.
This operation is similar to SQL UNION ALL
.
In this example, each hospital maintains medical data for citizens within a specific region, with one hospital per region.
Hospital A
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
3 | positive | 2010-05-21 |
Hospital B
citizen_id | cancer | encounter |
---|---|---|
10 | negative | 2022-11-02 |
11 | positive | 2011-12-01 |
The union of tables for Hospital A and Hospital B is a table containing all rows from both tables
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
3 | positive | 2010-05-21 |
10 | negative | 2022-11-02 |
11 | positive | 2011-12-01 |
Info
Union is only possible if the two tables have the same columns.
Selecting Unique Row
In some situations, the analysis requires having a single row for a patient to be considered. Quite often, this can be done for each table individually. However, when taking the union of two tables, one might end up with several columns for the same patient.
In some cases, the analysis requires a single row per patient. While this can often be achieved within each table, combining two tables using a union may result in multiple rows for the same patient. In our running example, imagine that two different hospitals have records for the same patient because the patient has relocated.
Hospital A
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
Hospital B
citizen_id | cancer | encounter |
---|---|---|
2 | negative | 2024-02-02 |
3 | negative | 2020-03-03 |
The union of the two tables will contain two rows with citizen_id = 2
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
2 | negative | 2024-02-02 |
3 | negative | 2020-03-03 |
However, suppose we want to include only the first encounter of the patient.
The unique row operation achieves this by retaining only the row that meets a specific selection criterion, such as the
earliest encounter
date in this example.
After applying Unique row we get
citizen_id | cancer | encounter |
---|---|---|
1 | positive | 2024-02-02 |
2 | negative | 2020-03-03 |
3 | negative | 2020-03-03 |
Combine Data in Multiple Ways
The merge operations described above can be combined in various ways to produce data suitable for further computation. The analyst can define a merge tree that specifies how merge operations pass their output further to other merge operations.
For example, if Hospital A and Hospital B have patient records and the analyst wants to combine them with education
data, the analyst first performs a union of the hospital datasets.
This may result in duplicated records for a patient who received treatment at both hospitals.
To remove duplicates, the analyst can use the unique row operation, selecting
the row with the minimum encounter
date. The data can then be joined with the education data.