Governance & Security

This document defines the governance model (access levels, identity), the security posture (validation, isolation), and the authorization flow.

Security Goals

prevent data exfiltration by escaping the catalogue
prevent side effects (no writes, no DDL)
ensure authorization decisions are explainable and consistent
keep resource usage bounded and predictable

Access Levels

Each dataset declares an access_level:

Access level	Intended exposure	Authentication	Authorization
`open`	public catalogue + read access	not required	not required
`internal`	internal audiences	required	policy may still apply
`restricted`	limited audiences	required	required (OPA)

Notes: - open does not mean “unsafe”; SQL hardening and limits still apply. - restricted should be used when access depends on groups/roles/attributes.

Identity Model

JWT Inputs

The API receives a JWT from an Identity Provider. The raw token can vary by IdP. The API normalizes claims to a consistent internal representation.

Typical normalized fields: - sub (subject) - preferred_username or display name - groups (list) - roles (list) - scopes (list) (if provided)

Users vs Service Accounts

Service accounts (client credentials) are treated as first-class identities. They should carry: - a distinct subject (client id) - roles/groups appropriate for their operating scope - optionally environment scoping (tenant/community)

You should avoid “god tokens” unless policy explicitly supports it.

Authorization with OPA

OPA is the Policy Decision Point (PDP). The Dataset API is the Policy Enforcement Point (PEP).

Decision flow

API normalizes JWT → identity object
API collects dataset metadata and requested action
API builds OPA input document
API calls OPA
OPA returns decision (allow/deny + optional reason)
API enforces result (deny by default)

Example OPA input shape (illustrative)

{
  "input": {
    "action": "query",
    "dataset": {
      "dataset_id": "datasets.gold.example",
      "namespace": "gold",
      "access_level": "restricted",
      "tags": ["energy", "forecast"]
    },
    "identity": {
      "sub": "service:dt-app",
      "groups": ["community:alpha", "ops"],
      "roles": ["service"],
      "scopes": ["dataset:query"]
    },
    "request": {
      "sql_tables": ["datasets.gold.example"],
      "limit": 100,
      "offset": 0
    }
  }
}

Your actual field names may vary; keep them stable for policy authors.

SQL Hardening (Non-negotiable)

All SQL is validated using an AST parser and an allowlist.

Allowed

SELECT queries only
explicit projections and filters
bounded queries (LIMIT required or enforced)
references only to catalogued datasets

Rejected

DDL: CREATE, ALTER, DROP, …
DML: INSERT, UPDATE, DELETE, MERGE, …
multiple statements / statement chaining
functions not on allowlist
referencing raw physical table names to bypass catalogue
unbounded scans (policy + query guards)

Defense-in-depth Execution Pipeline

Authentication (if required by access level)
SQL parse + validation
Extract referenced tables/datasets
Resolve to catalogued mapping
OPA authorization (dataset + action + identity)
Execute in restricted context with limits/timeouts
Return paginated results only

If any step fails → request fails.

Auditability & Observability

Log (at least): - request id - subject - dataset ids referenced - allow/deny decision - validation errors (sanitized) - execution time and rows returned

Do not log raw SQL without sanitization if it may contain sensitive literals.

Common Policy Patterns

Namespace gating: allow external access only to gold datasets
Group-based: allow if identity belongs to dataset:{id}:read
Tenant/community scoping: allow if identity has community:{X} group matching dataset tag community:{X}
Service account capabilities: allow if role includes service and scope includes dataset:query