
Catalogue Management

This document covers how datasets are defined, imported, reconciled, and cleaned up.


Catalogue as Code

Catalogue state is defined in YAML and treated like application config:

  • version controlled
  • reviewed
  • validated before import

The API database stores the result of the import, but YAML remains the source of truth.


A compact pattern:

defaults:
  access_level: internal
  classification: green
  tags: []
  ownership: []
  retention_days: 365

sources:
  datasets.gold.example:
    access_level: external
    title: Example dataset
    description: Curated indicator for X.
    tags: [gold, example]
    documentation_url: https://...
    source_system: Example producer

Key fields you typically need:

  • dataset_id
  • title, description
  • access_level
  • tags, classification
  • source_system, documentation_url
  • optional ownership, license, retention hints
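As a rough illustration of how the defaults block could be applied, the sketch below overlays each entry under sources on a copy of defaults and reports missing fields. It works on a plain dict standing in for the parsed YAML. The names resolve_entries, missing_fields, and REQUIRED_FIELDS are hypothetical, and the choice of which fields are required is an assumption, not something this document specifies.

```python
import copy

# Assumption: these three fields are treated as mandatory for illustration only.
REQUIRED_FIELDS = ("title", "description", "access_level")


def resolve_entries(config):
    """Overlay every entry under 'sources' on a deep copy of 'defaults'."""
    defaults = config.get("defaults", {})
    resolved = {}
    for dataset_id, overrides in config.get("sources", {}).items():
        # Deep copy so list defaults (tags, ownership) are not shared between entries.
        entry = copy.deepcopy(defaults)
        entry.update(overrides or {})
        entry["dataset_id"] = dataset_id
        resolved[dataset_id] = entry
    return resolved


def missing_fields(entry):
    """Return the required fields that are absent or empty in an entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


# A dict shaped like the YAML pattern above (documentation_url omitted).
config = {
    "defaults": {
        "access_level": "internal",
        "classification": "green",
        "tags": [],
        "ownership": [],
        "retention_days": 365,
    },
    "sources": {
        "datasets.gold.example": {
            "access_level": "external",
            "title": "Example dataset",
            "description": "Curated indicator for X.",
            "tags": ["gold", "example"],
        }
    },
}

entries = resolve_entries(config)
```

Entry values win over defaults, so datasets.gold.example ends up with access_level external while still inheriting classification and retention_days.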


Import Semantics

Imports are reconciling:

  • create missing dataset entries
  • update metadata on existing entries
  • optionally delete or disable entries that are no longer present in YAML

This makes environments reproducible: importing the same YAML into any environment converges it on the same catalogue state.

Create vs Update

  • if dataset_id exists → update metadata and refresh schema references
  • if not → create entry, then validate physical mapping
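The reconciling behaviour can be sketched as a planning step that compares the dataset_ids declared in YAML with those already stored. This is a minimal sketch under assumptions: ImportPlan, plan_import, and the prune flag are hypothetical names, and both inputs are plain dicts keyed by dataset_id.

```python
from dataclasses import dataclass, field


@dataclass
class ImportPlan:
    create: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)


def plan_import(desired, existing, prune=False):
    """Decide what an import would do, without touching any store.

    desired  -- entries declared in YAML, keyed by dataset_id
    existing -- entries currently stored, keyed by dataset_id
    prune    -- hypothetical switch for the optional delete/disable step
    """
    plan = ImportPlan()
    for dataset_id in sorted(desired):
        if dataset_id in existing:
            plan.update.append(dataset_id)   # exists: update metadata
        else:
            plan.create.append(dataset_id)   # missing: create entry
    if prune:
        # Entries no longer present in YAML.
        plan.delete = sorted(set(existing) - set(desired))
    return plan


plan = plan_import(
    desired={"datasets.gold.a": {}, "datasets.gold.b": {}},
    existing={"datasets.gold.b": {}, "datasets.gold.c": {}},
    prune=True,
)
```

Keeping planning separate from applying means the same plan can feed both a preview and the real write path.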

Selection & Filters

To manage large catalogues, imports support dataset selection filters.

Recommended semantics:

  • +pattern includes (glob)
  • -pattern excludes (glob)

Example:

  • include only gold: +datasets.gold.*
  • exclude one: -datasets.gold.experimental_*

The import command should resolve the final selection list before applying changes.
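One possible way to resolve the selection is shown below, using glob matching from Python's standard fnmatch module. The function name select_datasets is hypothetical, and two behaviours are assumptions rather than rules stated here: with no + pattern every dataset starts out included, and an exclude always wins over an include.

```python
from fnmatch import fnmatchcase


def select_datasets(dataset_ids, filters):
    """Resolve +include / -exclude glob filters into a final, sorted selection."""
    unknown = [f for f in filters if not f.startswith(("+", "-"))]
    if unknown:
        raise ValueError(f"filters must start with '+' or '-': {unknown}")
    includes = [f[1:] for f in filters if f.startswith("+")]
    excludes = [f[1:] for f in filters if f.startswith("-")]

    selected = []
    for dataset_id in sorted(dataset_ids):
        # Assumption: no include patterns means "start from everything".
        included = not includes or any(fnmatchcase(dataset_id, p) for p in includes)
        # Assumption: excludes take precedence over includes.
        excluded = any(fnmatchcase(dataset_id, p) for p in excludes)
        if included and not excluded:
            selected.append(dataset_id)
    return selected


# Illustrative dataset_ids, not taken from a real catalogue.
ids = [
    "datasets.gold.example",
    "datasets.gold.experimental_x",
    "datasets.silver.raw",
]
selection = select_datasets(ids, ["+datasets.gold.*", "-datasets.gold.experimental_*"])
```

Note that in fnmatch a * also matches dots, so +datasets.* selects every nesting depth under datasets.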


Dry Run

--dry-run should:

  • print the selected dataset_ids after filters
  • show what would be created/updated/deleted
  • perform no writes

This is essential for safe operations: the effect of an import can be reviewed before any change reaches the catalogue.
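A dry run could be implemented by building the same report in both modes and returning before any write. The sketch below is illustrative only: apply_import is a hypothetical name, the plan is a plain dict of create/update/delete lists, and an in-memory dict stands in for the API database.

```python
def apply_import(selected, plan, desired, store, dry_run=False):
    """Apply (or only preview) an import plan against a dict-based store."""
    report = [f"selected: {dataset_id}" for dataset_id in selected]
    for action in ("create", "update", "delete"):
        verb = f"would {action}" if dry_run else action
        report.extend(f"{verb}: {dataset_id}" for dataset_id in plan.get(action, []))

    if dry_run:
        return report  # preview only: the store is never touched

    for dataset_id in plan.get("create", []) + plan.get("update", []):
        store[dataset_id] = desired[dataset_id]
    for dataset_id in plan.get("delete", []):
        store.pop(dataset_id, None)
    return report


store = {"b": {"title": "old"}, "c": {"title": "gone"}}
desired = {"a": {"title": "A"}, "b": {"title": "B"}}
plan = {"create": ["a"], "update": ["b"], "delete": ["c"]}

preview = apply_import(["a", "b"], plan, desired, store, dry_run=True)
```

Because the preview and the real run share one code path for the report, the dry-run output cannot drift from what the import would actually do.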


Physical Validation & Reflection

During import (or post-import), the system should:

  • verify the referenced physical table/view exists
  • reflect schema to build columns/types
  • optionally generate JSON Schema artifacts

If reflection fails:

  • mark dataset as invalid (or reject import for that dataset)
  • surface actionable error
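To make the validation step concrete, here is a small sketch that uses SQLite from the Python standard library as a stand-in for the real warehouse. The helper names reflect_columns and validate_dataset are hypothetical, as is the idea that each dataset_id maps to a single table name. A real deployment would query its own engine's catalogue (for example information_schema) instead.

```python
import sqlite3


def reflect_columns(conn, table):
    """Return [(column_name, declared_type), ...] for a table or view.

    Raises LookupError when the physical object does not exist.
    """
    found = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type IN ('table', 'view') AND name = ?",
        (table,),
    ).fetchone()
    if found is None:
        raise LookupError(f"physical table or view not found: {table}")
    # PRAGMA arguments cannot be bound parameters, so quote the identifier.
    quoted = table.replace('"', '""')
    rows = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in rows]


def validate_dataset(conn, dataset_id, table):
    """Reflect one dataset; on failure mark it invalid with a readable error."""
    try:
        columns = reflect_columns(conn, table)
    except LookupError as exc:
        return {"dataset_id": dataset_id, "valid": False, "columns": [], "error": str(exc)}
    return {"dataset_id": dataset_id, "valid": True, "columns": columns, "error": None}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gold_example (id INTEGER, value REAL)")

ok = validate_dataset(conn, "datasets.gold.example", "gold_example")
bad = validate_dataset(conn, "datasets.gold.dropped", "gold_dropped")
```

Returning a result record instead of raising lets the caller choose between marking the dataset invalid and rejecting the import for that dataset.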


Cleanup of Stale Entries

A robust import process includes a cleanup phase.

Goal: remove catalogue entries whose physical tables no longer exist.

Recommended algorithm:

  1. list catalogue entries
  2. for each, check existence (reflection / information_schema)
  3. if missing, delete entry unless protected
  4. support skip-list for datasets just imported or explicitly pinned

This addresses real-world drift when pipelines drop or rename tables.
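The cleanup steps above might look roughly like this. It is a sketch under assumptions: find_stale is a hypothetical name, table_exists is any callable that answers whether the physical object behind a dataset_id still exists, and protected and skip are simple sets of dataset_ids.

```python
def find_stale(catalogue_ids, table_exists, protected=frozenset(), skip=frozenset()):
    """Return dataset_ids whose physical table is gone and that may be deleted.

    catalogue_ids -- iterable of dataset_ids currently in the catalogue
    table_exists  -- callable(dataset_id) -> bool, e.g. backed by reflection
    protected     -- dataset_ids that must never be removed automatically
    skip          -- dataset_ids just imported or explicitly pinned
    """
    stale = []
    for dataset_id in catalogue_ids:
        if dataset_id in protected or dataset_id in skip:
            continue  # never a deletion candidate, so no existence check needed
        if not table_exists(dataset_id):
            stale.append(dataset_id)
    return stale


# Illustrative data: only one physical table still exists.
physical = {"datasets.gold.example"}
catalogue = [
    "datasets.gold.example",
    "datasets.gold.dropped",
    "datasets.gold.pinned",
    "datasets.gold.new",
]

stale = find_stale(
    catalogue,
    table_exists=lambda dataset_id: dataset_id in physical,
    protected={"datasets.gold.pinned"},
    skip={"datasets.gold.new"},
)
```

Returning the candidates rather than deleting them directly keeps the cleanup phase compatible with a dry run.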


DCAT Exposure Rules

The API should only expose datasets in the public catalogue that meet both:

  • configured exposure rules (e.g. namespace in {gold})
  • access level compatible with anonymous viewing (typically open)

You can still keep internal/restricted datasets in the internal catalogue, but hide them from public endpoints.
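The two exposure conditions can be expressed as a single predicate. The sketch below is illustrative: is_publicly_exposed is a hypothetical name, the namespace is assumed to be the second dot-separated segment of dataset_id (as in datasets.gold.example), and the default sets mirror the examples given above rather than any fixed configuration.

```python
def is_publicly_exposed(entry, exposed_namespaces=frozenset({"gold"}),
                        public_levels=frozenset({"open"})):
    """Return True only when both exposure conditions hold for an entry."""
    parts = entry["dataset_id"].split(".")
    # Assumption: the namespace is the second segment of the dataset_id.
    namespace = parts[1] if len(parts) > 1 else ""
    rule_ok = namespace in exposed_namespaces
    access_ok = entry.get("access_level") in public_levels
    return rule_ok and access_ok


entries = [
    {"dataset_id": "datasets.gold.example", "access_level": "open"},
    {"dataset_id": "datasets.gold.internal_kpi", "access_level": "internal"},
    {"dataset_id": "datasets.silver.raw", "access_level": "open"},
]

public = [e["dataset_id"] for e in entries if is_publicly_exposed(e)]
```

Entries that fail either check stay in the internal catalogue and are simply filtered out of the public listing.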


Operational Tips

  • keep titles/descriptions in YAML (reviewable)
  • use tags to express domain/tenant scoping for OPA
  • keep dataset_ids stable; rename through controlled migration
  • do not overload YAML with physical implementation details unless necessary