Catalogue Management
This document covers how datasets are defined, imported, reconciled, and cleaned up.
Catalogue as Code
Catalogue state is defined in YAML and treated like application config:

- version controlled
- reviewed
- validated before import
The API database stores the result of the import, but YAML remains the source of truth.
YAML Structure (Recommended)
A compact pattern:
```yaml
defaults:
  access_level: internal
  classification: green
  tags: []
  ownership: []
  retention_days: 365

sources:
  datasets.gold.example:
    access_level: external
    title: Example dataset
    description: Curated indicator for X.
    tags: [gold, example]
    documentation_url: https://...
    source_system: Example producer
```
Key fields you typically need:
- dataset_id
- title, description
- access_level
- tags, classification
- source_system, documentation_url
- optional ownership, license, retention hints
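Because the YAML is validated before import, it helps to merge the defaults into each entry and check required fields up front. A minimal Python sketch, assuming the structure above; the required-field set and the function name are illustrative, not a fixed API:

```python
from typing import Any

# Assumed minimum; extend to match your own policy.
REQUIRED_FIELDS = {"title", "description", "access_level"}

def merge_and_validate(defaults: dict[str, Any],
                       sources: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge catalogue defaults into each dataset entry and check required fields."""
    merged: dict[str, dict[str, Any]] = {}
    errors: list[str] = []
    for dataset_id, entry in sources.items():
        record = {**defaults, **(entry or {})}
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            errors.append(f"{dataset_id}: missing {sorted(missing)}")
        merged[dataset_id] = record
    if errors:
        raise ValueError("catalogue validation failed:\n" + "\n".join(errors))
    return merged
```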
Import Semantics
Imports reconcile the catalogue against the YAML:
- create missing dataset entries
- update metadata on existing entries
- optionally delete or disable entries that are no longer present in YAML
This makes environments reproducible.
Create vs Update
- if dataset_id exists → update metadata and refresh schema references
- if not → create entry, then validate physical mapping
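A sketch of the reconciliation step, keyed on dataset_id. The `repo` object and its create/update/delete methods are placeholders for whatever persistence layer backs the catalogue; deletion is opt-in, matching the semantics above.

```python
def plan_import(yaml_entries: dict[str, dict],
                existing_ids: set[str],
                allow_delete: bool = False) -> dict[str, set[str]]:
    """Compute which dataset_ids to create, update, or delete."""
    yaml_ids = set(yaml_entries)
    return {
        "create": yaml_ids - existing_ids,
        "update": yaml_ids & existing_ids,
        "delete": (existing_ids - yaml_ids) if allow_delete else set(),
    }

def apply_plan(plan: dict[str, set[str]], yaml_entries: dict[str, dict], repo) -> None:
    """Apply the plan; `repo` stands in for the catalogue persistence layer."""
    for dataset_id in sorted(plan["create"]):
        repo.create(dataset_id, yaml_entries[dataset_id])  # then validate the physical mapping
    for dataset_id in sorted(plan["update"]):
        repo.update(dataset_id, yaml_entries[dataset_id])  # refresh schema references
    for dataset_id in sorted(plan["delete"]):
        repo.delete(dataset_id)                            # or disable, depending on policy
```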
Selection & Filters
To manage large catalogues, imports support dataset selection filters.
Recommended semantics:
- +pattern includes (glob)
- -pattern excludes (glob)
Example:
- include only gold: +datasets.gold.*
- exclude one: -datasets.gold.experimental_*
The import command should resolve the final selection list before applying changes.
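One way to implement the +/- semantics with Python's standard fnmatch; the precedence rules (excludes win, and an empty include list means "include everything") are an assumption to adapt as needed:

```python
from fnmatch import fnmatchcase

def select_datasets(dataset_ids: list[str], filters: list[str]) -> list[str]:
    """Apply +include / -exclude glob filters; with no +filters, everything is included."""
    includes = [f[1:] for f in filters if f.startswith("+")]
    excludes = [f[1:] for f in filters if f.startswith("-")]
    return [
        dataset_id
        for dataset_id in dataset_ids
        if (not includes or any(fnmatchcase(dataset_id, p) for p in includes))
        and not any(fnmatchcase(dataset_id, p) for p in excludes)
    ]
```

For example, `select_datasets(["datasets.gold.example", "datasets.gold.experimental_x"], ["+datasets.gold.*", "-datasets.gold.experimental_*"])` returns only `datasets.gold.example`.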
Dry Run
--dry-run should:
- print the selected dataset_ids after filters
- show what would be created/updated/deleted
- perform no writes
This is essential for safe ops.
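A dry run can reuse the reconciliation plan and simply stop before the write phase. A minimal sketch, with illustrative function names:

```python
def run_import(selected_ids: list[str], plan: dict[str, set[str]], dry_run: bool) -> None:
    """Report the selection and pending changes; write only when dry_run is False."""
    print(f"selected {len(selected_ids)} dataset(s) after filters:")
    for dataset_id in selected_ids:
        print(f"  {dataset_id}")
    for action in ("create", "update", "delete"):
        for dataset_id in sorted(plan.get(action, set())):
            print(f"  would {action}: {dataset_id}")
    if dry_run:
        print("dry-run: no changes applied")
        return
    # ...apply the plan here; writes happen only past this point
```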
Physical Validation & Reflection
During import (or post-import), the system should:

- verify the referenced physical table/view exists
- reflect the schema to build column names/types
- optionally generate JSON Schema artifacts

If reflection fails:

- mark the dataset as invalid (or reject the import for that dataset)
- surface an actionable error
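A reflection sketch using SQLAlchemy's inspector, assuming the physical layer is reachable through an engine URL and that a dataset_id maps to a schema/table pair; error handling is reduced to a single exception:

```python
from sqlalchemy import create_engine, inspect

def reflect_dataset(engine_url: str, schema: str, table: str) -> list[dict]:
    """Verify the physical table/view exists and return its column metadata."""
    inspector = inspect(create_engine(engine_url))
    known = set(inspector.get_table_names(schema=schema)) | set(inspector.get_view_names(schema=schema))
    if table not in known:
        raise LookupError(f"{schema}.{table}: physical table/view not found")
    # Each entry carries at least 'name' and 'type'; map these onto catalogue columns
    # (and, if needed, onto generated JSON Schema artifacts).
    return inspector.get_columns(table, schema=schema)
```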
Cleanup of Stale Entries
A robust import process includes a cleanup phase.
Goal: remove catalogue entries whose physical tables no longer exist.
Recommended algorithm:

1. list catalogue entries
2. for each, check existence (reflection / information_schema)
3. if missing, delete the entry unless it is protected
4. support a skip-list for datasets just imported or explicitly pinned
This addresses real-world drift when pipelines drop or rename tables.
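A sketch of the cleanup phase; the `exists` and `delete` callables stand in for the real existence check (reflection or an information_schema query) and persistence layer, and are not actual APIs:

```python
from typing import Callable, Iterable, Optional

def cleanup_stale_entries(entry_ids: Iterable[str],
                          exists: Callable[[str], bool],
                          delete: Callable[[str], None],
                          skip: Optional[set[str]] = None,
                          protected: Optional[set[str]] = None) -> list[str]:
    """Remove catalogue entries whose physical tables are gone, honouring skip/protected lists."""
    skip = skip or set()
    protected = protected or set()
    removed: list[str] = []
    for dataset_id in entry_ids:
        if dataset_id in skip or dataset_id in protected:
            continue
        if not exists(dataset_id):
            delete(dataset_id)
            removed.append(dataset_id)
    return removed
```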
DCAT Exposure Rules
The API should expose a dataset in the public catalogue only if it meets both:
- configured exposure rules (e.g. namespace in {gold})
- access level compatible with anonymous viewing (typically open)
You can still keep internal/restricted datasets in the internal catalogue, but hide them from public endpoints.
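A sketch of the exposure check, assuming dataset_ids follow the `datasets.<namespace>.<name>` pattern from the example above; the allowed namespaces and access levels would come from configuration rather than constants:

```python
PUBLIC_NAMESPACES = {"gold"}      # assumed exposure rule; configuration-driven in practice
PUBLIC_ACCESS_LEVELS = {"open"}   # access levels safe for anonymous viewing

def is_publicly_exposed(dataset_id: str, access_level: str) -> bool:
    """A dataset appears in the public DCAT feed only if both conditions hold."""
    parts = dataset_id.split(".")
    namespace = parts[1] if len(parts) > 1 else ""
    return namespace in PUBLIC_NAMESPACES and access_level in PUBLIC_ACCESS_LEVELS
```

With the example entry above (`access_level: external`), the dataset would stay out of the public feed even though it sits in the gold namespace.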
Operational Tips
- keep titles/descriptions in YAML (reviewable)
- use tags to express domain/tenant scoping for OPA
- keep dataset_ids stable; rename through controlled migration
- do not overload YAML with physical implementation details unless necessary