Unity Catalog (UC) is an open source data catalog.
It's a server that exposes two sets of OpenAPI endpoints:
(1) Unity Catalog API ➜ "native"
(2) Iceberg REST Catalog API ➜ converted to (1) under the hood
UC uses an embedded H2 database as its metastore.
The metastore contains the data served by the APIs.
Clients call the APIs to interact with data "assets".
Clients are typically compute engines like Spark, Trino, and DuckDB.
Assets can be:
- table ➜ structured data
- volume ➜ semi-structured or unstructured data
- function ➜ data transformation logic
- (more types on the roadmap, but not yet available)
They are identified in a three-level namespace: <catalog>.<schema>.<asset>
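A minimal sketch of how a client might split such a name into its three levels. The class name `AssetName` and the example name `unity.default.numbers` are illustrative assumptions, not part of UC, and the sketch ignores quoted names that contain dots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetName:
    """Hypothetical helper for a <catalog>.<schema>.<asset> name."""

    catalog: str
    schema: str
    asset: str

    @classmethod
    def parse(cls, full_name: str) -> "AssetName":
        # Split on dots and require exactly three non-empty levels.
        parts = full_name.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected <catalog>.<schema>.<asset>, got {full_name!r}"
            )
        return cls(*parts)

    def __str__(self) -> str:
        # Re-join the three levels into the dotted form.
        return f"{self.catalog}.{self.schema}.{self.asset}"


name = AssetName.parse("unity.default.numbers")
print(name.catalog, name.schema, name.asset)
```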
Each level in the namespace has its own API(s):
(1) CatalogsApi ➜ catalog information
(2) SchemasApi ➜ schema information
(3) TablesApi, VolumesApi, FunctionsApi ➜ asset metadata
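To make the per-level split concrete, here is a rough sketch that maps one asset name to one metadata URL per level. The base URL, the port, and the path layout (`/catalogs/...`, `/schemas/...`, `/tables/...`) are assumptions about a locally running server; check the OpenAPI spec of your UC version before relying on them.

```python
from urllib.parse import quote

# Assumed base URL of a local UC server; adjust to your deployment.
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"

# Assumed path segments for the three asset types.
ASSET_KINDS = {"tables", "volumes", "functions"}


def metadata_urls(full_name: str, asset_kind: str = "tables") -> dict:
    """Return one (assumed) metadata URL per namespace level."""
    if asset_kind not in ASSET_KINDS:
        raise ValueError(f"unknown asset kind: {asset_kind!r}")
    catalog, schema, _asset = full_name.split(".")
    return {
        # Level 1: catalog information
        "catalog": f"{BASE_URL}/catalogs/{quote(catalog)}",
        # Level 2: schema information, addressed as <catalog>.<schema>
        "schema": f"{BASE_URL}/schemas/{quote(catalog + '.' + schema)}",
        # Level 3: asset metadata, addressed by the full three-level name
        "asset": f"{BASE_URL}/{asset_kind}/{quote(full_name)}",
    }


for level, url in metadata_urls("unity.default.numbers").items():
    print(level, url)
```

The function only builds URLs and sends nothing, so it can be tried without a running server.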
There are also APIs for "credentials vending":
- TemporaryTableCredentialsApi ➜ temporary storage credentials for a table
- TemporaryVolumeCredentialsApi ➜ temporary storage credentials for a volume
Credentials vending: UC requests temporary credentials from a storage provider and gives them to a client.
Typical case: client GETs (b) and (c) to retrieve a data asset from storage.
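One way to picture this exchange is as two calls: first the asset metadata, then the temporary credentials. The sketch below assumes the metadata response carries `table_id` and `storage_location`, and that credentials are requested with a `table_id` plus an `operation`. These field names, the paths, and the `fake_send` stand-in are assumptions for illustration, not a confirmed API.

```python
from typing import Any, Callable, Dict, Optional

# A transport function: (method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def resolve_table_access(send: Send, full_name: str) -> Dict[str, Any]:
    """Fetch where a table lives and credentials to read it (assumed flow)."""
    # Step 1: asset metadata, including the storage location.
    table = send("GET", f"/tables/{full_name}", None)
    # Step 2: short-lived credentials scoped to that table.
    creds = send(
        "POST",
        "/temporary-table-credentials",
        {"table_id": table["table_id"], "operation": "READ"},
    )
    return {"storage_location": table["storage_location"], "credentials": creds}


def fake_send(method: str, path: str, body: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for a UC server with made-up responses (not the real schema)."""
    if method == "GET" and path == "/tables/unity.default.numbers":
        return {
            "table_id": "hypothetical-id-123",
            "storage_location": "s3://example-bucket/numbers",
        }
    if method == "POST" and path == "/temporary-table-credentials":
        return {"token": "placeholder", "for_table": body["table_id"]}
    raise ValueError(f"unexpected request: {method} {path}")


access = resolve_table_access(fake_send, "unity.default.numbers")
print(access["storage_location"])
```

With the location and the credentials in hand, the client reads the data directly from storage. UC itself never serves the data files.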
UC can be used as a central access layer.
If all clients access assets via the catalog, permissions can be managed in a single place.
Open source Unity Catalog was released three weeks ago.
The project is in an early stage.
For example, credentials vending is currently only supported for AWS S3, and ML models are not yet available as an asset type.
#dataengineering #softwareengineering