CELINE Utils

A collection of shared utilities, libraries, and command-line tools that form the technical backbone of the CELINE data platform. Provides reusable building blocks for data pipelines, governance, lineage, metadata management, and platform integrations.

CELINE Utils is not an end-user application. It is a platform utility layer embedded into CELINE applications and executed within orchestrated environments that use Meltano, dbt, Prefect, and OpenLineage.


Scope and goals

  • Centralise cross-cutting platform logic used by multiple CELINE projects
  • Provide opinionated but extensible tooling for data pipelines
  • Enforce consistent governance and lineage semantics
  • Reduce duplication across pipeline applications
  • Act as a stable foundation for CELINE-compatible services and workflows

Key capabilities

Governance framework

A declarative governance.yaml specification defines the metadata, access control, and dataspace exposure rules for each dataset.

The GovernanceRule model covers:

  • Dataset ownership (owner, attribution)
  • License and access level (open, internal, restricted, secret)
  • Data classification (pii, green, yellow, red) and retention
  • Tags, documentation links, and source system
  • user_filter_column — the column used for per-subject consent-based row filtering
  • expose: true — controls whether the dataset appears in the DCAT catalogue and is registered as an EDC asset

Extended blocks for DCAT-AP 3.0 and dataspace integration:

dcat: block (propagated to the DCAT-AP catalogue by dataset-api):

  • publisher_uri: overrides the API-level fallback publisher
  • themes: EU Publications Office data-theme URIs
  • language_uris: dct:language URIs
  • spatial_uris: dct:spatial URIs
  • accrual_periodicity: dct:accrualPeriodicity URI
  • conforms_to: dct:conformsTo URI
  • temporal.start / temporal.end: dct:temporal coverage

dataspace: block (consumed by export_governance.py when registering datasets in EDC):

  • contract_required: enables the ds:contractRequired ODRL constraint
  • consent_required: enables the ds:consentStatus ODRL constraint and consent-based row filtering
  • odrl_action: default ODRL action (defaults to use)
  • purpose: ODRL purpose values
  • medallion: data quality level (gold / silver / bronze)

Governance rules are resolved with pattern matching via GovernanceResolver — defaults cascade from the defaults: block into each source entry, with per-source values taking precedence. The expose and dcat/dataspace fields use an OR-merge for booleans and override-merge for objects.
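
The cascade described above can be illustrated with a small, standard-library-only sketch. It is not the GovernanceResolver implementation. The sources key, the access_level field name, the dataset patterns, the first-match-wins rule, and the key-by-key reading of "override-merge" are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for a parsed governance.yaml. Only `defaults`, `expose`,
# `dcat`, and `dataspace` are names taken from the text; the rest is assumed.
GOVERNANCE = {
    "defaults": {
        "access_level": "internal",
        "expose": False,
        "dcat": {"conforms_to": "https://example.org/spec"},
    },
    "sources": {
        "datasets.gold.*": {
            "access_level": "open",
            "expose": True,
            "dataspace": {"medallion": "gold"},
        },
        "datasets.raw.*": {"access_level": "restricted"},
    },
}


def merge(defaults, source):
    """Cascade defaults into one source entry; per-source values win."""
    merged = dict(defaults)
    for key, value in source.items():
        base = defaults.get(key)
        if isinstance(value, bool) and isinstance(base, bool):
            merged[key] = base or value        # OR-merge for booleans
        elif isinstance(value, dict) and isinstance(base, dict):
            merged[key] = {**base, **value}    # source keys override default keys
        else:
            merged[key] = value                # plain override
    return merged


def resolve(dataset_name, governance):
    """Return the effective rule for the first pattern that matches (assumed order)."""
    defaults = governance.get("defaults", {})
    for pattern, source in governance.get("sources", {}).items():
        if fnmatchcase(dataset_name, pattern):
            return merge(defaults, source)
    return dict(defaults)


rule = resolve("datasets.gold.meters", GOVERNANCE)
print(rule["access_level"], rule["expose"], rule["dataspace"])
```

A dataset that matches no pattern falls back to the defaults: block unchanged, which keeps unlisted datasets unexposed in this sketch.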

Both celine-utils (pipeline side) and dataset-api/cli/export_governance.py (catalogue side) parse the same governance.yaml format. EDC-specific sub-objects in the dataspace: block are silently ignored by celine-utils via model_config = ConfigDict(extra="ignore").

Pipeline orchestration

Structured execution layer for:

  • Meltano ingestion pipelines
  • dbt transformations and tests
  • Prefect-based Python flows

The PipelineRunner coordinates execution, logging, error handling, and lineage emission consistently across tools.
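
As a rough illustration of what coordinating execution, error handling, and lineage emission can look like, the sketch below runs named steps in order and records lifecycle events. MiniRunner, its methods, and the step names are hypothetical and do not reflect the PipelineRunner API. Events are appended to a list rather than sent to a lineage backend.

```python
from dataclasses import dataclass, field


@dataclass
class MiniRunner:
    """Hypothetical runner: executes steps in order and records lifecycle events."""
    job_name: str
    events: list = field(default_factory=list)

    def emit(self, event_type, detail=""):
        # A real runner would send this to a lineage backend; here it is only recorded.
        self.events.append((event_type, self.job_name, detail))

    def run(self, steps):
        """Run (name, callable) pairs; stop at the first failure."""
        self.emit("START")
        for name, step in steps:
            try:
                step()
            except Exception as exc:
                self.emit("FAIL", f"{name}: {exc}")
                return False
        self.emit("COMPLETE")
        return True


def failing_step():
    raise RuntimeError("dbt test failed")


runner = MiniRunner("demo_pipeline")
ok = runner.run([("ingest", lambda: None), ("transform", failing_step)])
print(ok, [event[0] for event in runner.events])
```

Wrapping every tool invocation in the same start/finish bookkeeping is what lets Meltano, dbt, and Python steps report status in one consistent shape.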

See the pipeline tutorial.

OpenLineage integration

  • Automatic emission of START, COMPLETE, FAIL, and ABORT events
  • Dataset-level schema facets
  • Data quality assertions from dbt tests
  • Custom CELINE governance facets (including userFilterColumn, medallion, classification)
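
To make the event shape concrete, here is a minimal sketch that assembles an OpenLineage-style run event as a plain dictionary, with a custom governance facet attached to an output dataset. The top-level fields follow the general OpenLineage run event layout. The producer URL, the facet key celineGovernance, the facet schema URL, the namespace, and the spec version in the schemaURL are placeholders and should be checked against the actual CELINE facets and the spec version in use.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values, assumed for illustration only.
PRODUCER = "https://example.org/celine-utils"
RUN_EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"
FACET_SCHEMA = "https://example.org/schemas/celine-governance-facet.json"


def build_run_event(event_type, job_name, run_id, output_name, governance):
    """Build one run event carrying a custom governance facet on its output dataset."""
    governance_facet = {
        "_producer": PRODUCER,       # every facet names its producer
        "_schemaURL": FACET_SCHEMA,  # and the schema that describes it
        **governance,
    }
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, or ABORT
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": RUN_EVENT_SCHEMA,
        "run": {"runId": run_id},
        "job": {"namespace": "celine", "name": job_name},
        "inputs": [],
        "outputs": [
            {
                "namespace": "celine",
                "name": output_name,
                "facets": {"celineGovernance": governance_facet},
            }
        ],
    }


event = build_run_event(
    "COMPLETE",
    "demo_pipeline",
    str(uuid.uuid4()),
    "datasets.gold.meters",
    {"userFilterColumn": "user_id", "medallion": "gold", "classification": "pii"},
)
print(json.dumps(event, indent=2))
```

The same runId would be reused across the START and terminal events of one run so that a lineage backend can correlate them.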

Dataset tooling

The DatasetClient enables:

  • Schema and table introspection
  • Column metadata inspection
  • Safe query construction
  • Export to Pandas
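
One common way to approach introspection and safe query construction is to validate identifiers against the introspected schema and pass values as bound parameters. The sketch below shows that idea with the standard-library sqlite3 module. It is not the DatasetClient API: list_columns, select_where, and the meters table are hypothetical, and the Pandas export is left out to keep the example dependency-free.

```python
import re
import sqlite3

# Identifiers cannot be bound as parameters, so they are validated instead.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def list_columns(conn, table):
    """Introspect column names for a table (SQLite-specific PRAGMA)."""
    if not IDENTIFIER.match(table):
        raise ValueError(f"invalid table name: {table!r}")
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def select_where(conn, table, column, value):
    """Select rows where column equals value, without interpolating the value."""
    columns = list_columns(conn, table)
    if column not in columns:
        raise ValueError(f"unknown column {column!r} for table {table!r}")
    sql = f'SELECT * FROM "{table}" WHERE "{column}" = ?'
    return conn.execute(sql, (value,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meters (user_id TEXT, kwh REAL)")
conn.executemany("INSERT INTO meters VALUES (?, ?)", [("u1", 1.5), ("u2", 2.0)])
print(list_columns(conn, "meters"))
print(select_where(conn, "meters", "user_id", "u1"))
```

The same pattern extends naturally to per-subject row filtering: the filter column comes from validated metadata, and the subject identifier is always a bound parameter.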

Platform integrations

  • Keycloak for identity and access management
  • Apache Superset for analytics platform integration
  • MQTT for lightweight messaging

CLI

celine-utils governance generate   # generate governance.yaml template
celine-utils pipeline init         # scaffold a new pipeline
celine-utils pipeline run          # run a pipeline

Repository structure

celine/
  admin/
  cli/
  common/
  datasets/
  pipelines/
schemas/
tests/

Configuration

Environment-driven via pydantic-settings:

  • Environment variables first
  • Optional .env files
  • Typed validation with container-friendly defaults
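
The precedence listed above can be mimicked with the standard library alone. The sketch below deliberately does not use pydantic-settings. It only illustrates the lookup order (environment variable, then .env value, then typed default). The Settings fields, their defaults, and the CELINE_ prefix are hypothetical.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    """Hypothetical settings with container-friendly defaults."""
    database_url: str = "postgresql://localhost:5432/datasets"
    log_level: str = "INFO"
    request_timeout: int = 30


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def load_settings(environ, dotenv_text="", prefix="CELINE_"):
    """Environment variables win over .env values, which win over defaults."""
    dotenv = parse_dotenv(dotenv_text)
    overrides = {}
    for f in fields(Settings):
        key = prefix + f.name.upper()
        raw = environ.get(key, dotenv.get(key))
        if raw is not None:
            overrides[f.name] = f.type(raw)  # coerce to the annotated type
    return Settings(**overrides)


settings = load_settings(
    os.environ,
    dotenv_text="CELINE_LOG_LEVEL=WARNING\nCELINE_REQUEST_TIMEOUT=5",
)
print(settings)
```

A value that cannot be coerced, such as a non-numeric request timeout, raises ValueError at load time, so misconfiguration surfaces before any pipeline step runs.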

Documentation

  • Pipeline Tutorial — end-to-end pipeline setup guide
  • Governance — governance.yaml format, access levels, pattern matching, dcat/dataspace blocks
  • Schemas — JSON Schema definitions including governance.schema.json
  • CLI — full CLI reference

Installation

pip install celine-utils

Intended audience

  • Data engineers
  • Platform engineers
  • CELINE application developers

License

Copyright © 2025 Spindox Labs

Licensed under the Apache License, Version 2.0.