Architecture & Core Concepts
This document explains what the Dataset API is, the core domain model, and how the major subsystems fit together.
What the Dataset API is
The Dataset API is a governed, read-only data access layer that:
- publishes a dataset catalogue (DCAT-AP compatible)
- exposes a restricted SQL query interface over catalogued datasets
- provides schema and metadata introspection for clients and UIs
- integrates with OpenLineage (via Marquez) to keep provenance and trust metadata up to date
The API is not an ingestion tool. Pipelines produce data; the Dataset API governs and serves it.
System Context
Actors
- Producers (pipelines): create/refresh physical tables and emit lineage (OpenLineage)
- Operators: manage catalogue definitions via CLI and ensure OPA/policy config is correct
- Consumers (apps/DTs/BI): discover datasets, fetch schemas, run governed queries
External Dependencies
- Physical storage: PostgreSQL (tables, views)
- Authorization: OPA (policy decision point)
- Lineage backend: Marquez (OpenLineage ingestion/query)
- Identity provider: issues JWTs (users + service accounts)
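To make the OPA dependency concrete, here is a minimal sketch of how a policy decision could be requested. It relies on OPA's standard Data API (a POST to `/v1/data/<path>` with an `input` document). The policy path `datasets/allow`, the host and port, and every field inside `input` are assumptions chosen for illustration, not the Dataset API's actual contract.

```python
import json
from urllib import request

# Hypothetical OPA address and policy path; adjust to the real deployment.
OPA_URL = "http://localhost:8181/v1/data/datasets/allow"


def build_opa_request(subject: str, action: str, dataset_id: str,
                      access_level: str) -> request.Request:
    """Build (but do not send) a decision request for OPA's Data API."""
    body = json.dumps({
        "input": {  # field names are illustrative assumptions
            "subject": subject,
            "action": action,
            "dataset_id": dataset_id,
            "access_level": access_level,
        }
    }).encode("utf-8")
    return request.Request(
        OPA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(opa_response: dict) -> bool:
    """Treat anything other than an explicit `true` result as a deny.

    OPA omits "result" when the decision is undefined, so a missing
    key must not be read as permission.
    """
    return opa_response.get("result") is True
```

Sending the request (for example with `urllib.request.urlopen`) and decoding the JSON body is left out so the sketch stays free of network calls. Denying by default keeps a misconfigured or missing policy from silently opening access.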
High-level Architecture
               +------------------------+
               | Pipelines (ETL/dbt/..) |
               +-----------+------------+
                           |
                           | OpenLineage events
                           v
                     +-----------+
                     |  Marquez  |
                     +-----+-----+
                           |
                           | export lineage + metadata
                           v
+---------+     +---------------------+      +--------------------+
| Clients | --> | Dataset API         | <--> | OPA (Policy)       |
| (apps)  |     |                     |      | allow/deny         |
+---------+     | - Catalogue         |      +--------------------+
                | - Query Engine      |
                | - Schema API        |
                | - Metadata API      |
                +----------+----------+
                           |
                           v
                   +---------------+
                   |  PostgreSQL   |
                   |  tables/views |
                   +---------------+
Core Domain Model
Dataset
A dataset is a governed contract over a physical data asset.
Dataset identity
- dataset_id (stable string; often namespace-qualified)
Dataset governance
- access_level: open | internal | restricted
- ownership / stewardship fields
- classification, tags, retention hints
Dataset physical mapping
- resolved storage reference (e.g., Postgres table/view)
- schema and column metadata derived from reflection
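The dataset contract described above could be modelled roughly as follows. This is a minimal sketch, not the API's actual schema: only `dataset_id` and the three `access_level` values come from the text, while the class names, the `storage_ref`, `owner`, and `tags` fields, the example values, and the use of `.` as the namespace separator are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    OPEN = "open"
    INTERNAL = "internal"
    RESTRICTED = "restricted"


@dataclass(frozen=True)
class Dataset:
    # Identity
    dataset_id: str
    # Governance
    access_level: AccessLevel
    # Physical mapping (hypothetical field name), e.g. "schema.table"
    storage_ref: str
    owner: str = ""
    tags: tuple[str, ...] = ()

    @property
    def namespace(self) -> str:
        """Namespace prefix, assuming ids shaped like "gold.sales_daily".

        Ids without a "." separator are treated as unqualified.
        """
        head, sep, _ = self.dataset_id.partition(".")
        return head if sep else ""


example = Dataset(
    dataset_id="gold.sales_daily",
    access_level=AccessLevel("internal"),  # rejects values outside the enum
    storage_ref="public.sales_daily",
)
```

Making the dataclass frozen mirrors the read-only nature of the contract from the consumer's point of view: a dataset definition changes only through a new import, not by in-place mutation.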
Namespace
Namespaces are a first-class taxonomy for lifecycle and intent:
- raw: ingestion/staging
- silver: enriched internal
- gold: curated/exposed
Namespaces drive:
- catalogue selection filters
- policy rules (e.g., only gold exposed externally)
- operational grouping
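As an illustration of namespaces driving catalogue filters and policy, the sketch below applies the "only gold exposed externally" rule in plain Python. In the architecture described here that decision belongs to OPA; the in-process function, its name, and the dictionary shape of a catalogue entry are assumptions made to keep the example self-contained.

```python
ALL_NAMESPACES = frozenset({"raw", "silver", "gold"})
EXTERNALLY_EXPOSED = frozenset({"gold"})


def select_catalogue(entries, *, external_caller):
    """Return the entries a caller may see, based on namespace alone.

    `entries` is a list of dicts with at least a "namespace" key
    (a hypothetical shape). Unknown namespaces are never returned.
    """
    allowed = EXTERNALLY_EXPOSED if external_caller else ALL_NAMESPACES
    return [entry for entry in entries if entry["namespace"] in allowed]


catalogue = [
    {"dataset_id": "raw.events", "namespace": "raw"},
    {"dataset_id": "silver.events_enriched", "namespace": "silver"},
    {"dataset_id": "gold.events_daily", "namespace": "gold"},
]

external_view = select_catalogue(catalogue, external_caller=True)
internal_view = select_catalogue(catalogue, external_caller=False)
```

With this data, an external caller sees only `gold.events_daily`, while an internal caller sees all three entries.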
Distribution
In DCAT terms, a dataset can expose one or more distributions (e.g., SQL endpoint, files, API resource). In practice:
- the API exposes a query distribution
- optional documentation and external references may be included
Read-only Contract
Consumers cannot mutate data or catalogue state. Mutations happen only via:
- data pipelines (tables/views)
- CLI-managed catalogue imports
- admin endpoints used by CLI
This guarantees:
- reproducibility
- auditability
- consistent governance enforcement
Catalogue vs Storage Reality
The catalogue is validated against storage.
Expected behaviors:
- if a dataset points to a missing table/view, it should be marked invalid and/or removed during cleanup
- imports reconcile desired state (YAML) vs actual DB objects
- schema endpoints reflect what exists in storage today
The catalogue never creates physical data.
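The reconciliation of desired state against storage can be sketched as a pure function. This is an assumption-laden illustration rather than the CLI's real logic: the function and class names are hypothetical, the desired state is reduced to a mapping from `dataset_id` to a storage reference, and the set of existing objects is passed in instead of being read from PostgreSQL (where it might come from `information_schema.tables`, which lists both tables and views).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReconcileResult:
    valid: list[str]    # dataset ids whose table/view exists
    invalid: list[str]  # dataset ids pointing at a missing object


def reconcile(desired: dict[str, str],
              existing_objects: set[str]) -> ReconcileResult:
    """Compare catalogue entries with objects present in storage.

    The function only classifies entries. It never creates anything in
    storage, in line with the rule that the catalogue does not create
    physical data.
    """
    valid: list[str] = []
    invalid: list[str] = []
    for dataset_id, storage_ref in sorted(desired.items()):
        target = valid if storage_ref in existing_objects else invalid
        target.append(dataset_id)
    return ReconcileResult(valid=valid, invalid=invalid)


result = reconcile(
    desired={
        "gold.sales_daily": "public.sales_daily",
        "gold.churn_scores": "public.churn_scores",
    },
    existing_objects={"public.sales_daily", "public.orders"},
)
```

Here `gold.churn_scores` ends up in `result.invalid` because its table is absent; a cleanup step could then mark it invalid or remove it. Objects that exist in storage but are not catalogued (`public.orders`) are simply ignored.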
Lifecycle of a Dataset (Conceptual)
- Pipeline creates/refreshes physical table/view
- Lineage is emitted to Marquez (OpenLineage)
- Operator exports lineage-derived candidates (CLI)
- Operator curates YAML (titles, descriptions, access levels, tags, docs)
- CLI imports catalogue (create/update)
- API exposes dataset in catalogue (if allowed)
- Consumers query datasets under governance
- Cleanup removes stale entries when physical assets disappear
Data Integrity & Guardrails
- SQL must be validated (AST-based, allowlisted)
- dataset references must resolve to catalogued assets
- access is policy-controlled (OPA)
- limits and pagination protect the system from unbounded workloads
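Two of the guardrails above, reference resolution and row limits, can be sketched without a SQL parser. The example assumes the dataset names referenced by a query have already been extracted by the AST-based validator, which is out of scope here. The limit values, function names, and error type are all hypothetical.

```python
from __future__ import annotations

DEFAULT_LIMIT = 100   # assumed default page size
MAX_LIMIT = 1000      # assumed hard ceiling


class GuardrailError(ValueError):
    """Raised when a request violates a guardrail."""


def clamp_limit(requested: int | None) -> int:
    """Apply a default when no limit is given and cap oversized requests."""
    if requested is None:
        return DEFAULT_LIMIT
    if requested < 1:
        raise GuardrailError("limit must be a positive integer")
    return min(requested, MAX_LIMIT)


def resolve_references(referenced: list[str],
                       catalogue: dict[str, str]) -> dict[str, str]:
    """Map each referenced dataset id to its storage reference.

    Any name that is not catalogued rejects the whole query, so
    uncatalogued tables are never reachable through the API.
    """
    missing = sorted(set(referenced) - set(catalogue))
    if missing:
        raise GuardrailError(f"unknown datasets: {', '.join(missing)}")
    return {name: catalogue[name] for name in referenced}


catalogue = {"gold.sales_daily": "public.sales_daily"}
resolved = resolve_references(["gold.sales_daily"], catalogue)
page_size = clamp_limit(50_000)
```

Rejecting the entire query on a single unknown reference is a deliberately strict choice in this sketch; it avoids partially executing a statement whose governance status is unclear.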