Pipeline Overview

Standard Pipeline Anatomy

Every CELINE pipeline is a self-contained application following a canonical structure. Each pipeline lives under apps/<name>/ and bundles all the tooling needed to ingest, transform, govern, and publish a dataset.

Data Layers

Layer	Purpose
RAW	Verbatim data as received from the source — no transformations
STAGING	Technical normalization: type casting, deduplication, field renaming
SILVER	Enriched and curated datasets, domain-ready for analysis
GOLD	Shareable, publication-ready datasets with full governance metadata

Each layer is a set of dbt models (or Meltano taps for RAW). Governance rules apply from STAGING onward.

Tooling Per Layer

Layer	Tool	Description
RAW	Meltano / Singer tap	Source API ingestion
STAGING → GOLD	dbt	SQL transformations and tests
Orchestration	Prefect	Flow execution, scheduling, retries
Lineage	OpenLineage	Automatic metadata emission per run
Governance	`governance.yaml`	Dataset license, access level, attribution

governance.yaml in Pipelines

Each pipeline includes a governance.yaml at the app root. This file declares:

datasets:
  - name: om_weather_hourly
    namespace: ds_dev_silver
    license: ODbL-1.0
    access_level: internal
    attribution: "Open-Meteo contributors"
    retention_days: 365
    tags:
      - weather
      - meteorology

Rules are resolved using pattern matching. The celine-utils governance engine injects these rules into OpenLineage events at run time.

Pipeline Container Structure

Each pipeline application is containerized:

apps/<name>/
  flows/
    pipeline.py       # Prefect flow definition
  models/
    staging/          # dbt staging models
    silver/           # dbt silver models
    gold/             # dbt gold models
  meltano.yml         # Meltano configuration
  governance.yaml     # Dataset governance rules
  pyproject.toml
  Dockerfile

The container entrypoint runs the Prefect flow. Pipelines can also run as scheduled Prefect deployments.

Multi-Flow Pipelines

A single app can host multiple Prefect flows that share the same Docker image but run on independent schedules. Each flow gets its own pipeline_<name>.py and config_<name>.yaml. dbt models are organized in subdirectories within the standard layer folders and selected using dbt tags (comma = intersection):

apps/om/
  flows/
    pipeline.py           # om-flow (weather)
    pipeline_wind.py      # om-wind-flow (wind grid)
    config.yaml
    config_wind.yaml
  dbt/models/
    staging/
      stg_om_weather.sql            # weather (no tag)
      wind/
        stg_om_wind.sql             # wind (tag: wind)
    silver/
      om_weather_hourly.sql
      wind/
        om_wind_hourly.sql
    gold/
      om_weather_features.sql
      wind/
        om_wind_gusts.sql

Each flow selects only its own models via dbt tags:

# wind pipeline uses intersection: layer AND tag
dbt_run("-s staging,tag:wind", cfg)   # only wind staging
dbt_run("-s silver,tag:wind", cfg)    # only wind silver
dbt_run("-s gold,tag:wind", cfg)      # only wind gold
dbt_run("test -s tag:wind", cfg)      # only wind tests

Tags are defined in per-directory schema YAML files:

# dbt/models/silver/wind/wind_schema.yml
models:
  - name: om_wind_hourly
    config:
      tags: ["wind"]

This pattern avoids cross-flow interference: the weather flow's dbt_run("staging", cfg) runs all staging models (including wind), but wind models are harmless when run by the weather flow (views are cheap, incremental tables skip when no new data).

Note on dbt selectors: In dbt, comma within --select is intersection (AND), while space-separated or multiple -s flags is union (OR). This is the opposite of what most people expect. Verified in dbt-core source (dbt/graph/cli.py).