Governance & Security
This document defines the governance model (access levels, identity), the security posture (validation, isolation), and the authorization flow.
Security Goals
- prevent data exfiltration by escaping the catalogue
- prevent side effects (no writes, no DDL)
- ensure authorization decisions are explainable and consistent
- keep resource usage bounded and predictable
Access Levels
Each dataset declares an access_level:
| Access level | Intended exposure | Authentication | Authorization |
|---|---|---|---|
open |
public catalogue + read access | not required | not required |
internal |
internal audiences | required | policy may still apply |
restricted |
limited audiences | required | required (OPA) |
Notes:
- open does not mean “unsafe”; SQL hardening and limits still apply.
- restricted should be used when access depends on groups/roles/attributes.
Identity Model
JWT Inputs
The API receives a JWT from an Identity Provider. The raw token can vary by IdP. The API normalizes claims to a consistent internal representation.
Typical normalized fields:
- sub (subject)
- preferred_username or display name
- groups (list)
- roles (list)
- scopes (list) (if provided)
Users vs Service Accounts
Service accounts (client credentials) are treated as first-class identities. They should carry: - a distinct subject (client id) - roles/groups appropriate for their operating scope - optionally environment scoping (tenant/community)
You should avoid “god tokens” unless policy explicitly supports it.
Authorization with OPA
OPA is the Policy Decision Point (PDP). The Dataset API is the Policy Enforcement Point (PEP).
Decision flow
- API normalizes JWT → identity object
- API collects dataset metadata and requested action
- API builds OPA input document
- API calls OPA
- OPA returns decision (allow/deny + optional reason)
- API enforces result (deny by default)
Example OPA input shape (illustrative)
{
"input": {
"action": "query",
"dataset": {
"dataset_id": "datasets.gold.example",
"namespace": "gold",
"access_level": "restricted",
"tags": ["energy", "forecast"]
},
"identity": {
"sub": "service:dt-app",
"groups": ["community:alpha", "ops"],
"roles": ["service"],
"scopes": ["dataset:query"]
},
"request": {
"sql_tables": ["datasets.gold.example"],
"limit": 100,
"offset": 0
}
}
}
Your actual field names may vary; keep them stable for policy authors.
SQL Hardening (Non-negotiable)
All SQL is validated using an AST parser and an allowlist.
Allowed
SELECTqueries only- explicit projections and filters
- bounded queries (
LIMITrequired or enforced) - references only to catalogued datasets
Rejected
- DDL:
CREATE,ALTER,DROP, … - DML:
INSERT,UPDATE,DELETE,MERGE, … - multiple statements / statement chaining
- functions not on allowlist
- referencing raw physical table names to bypass catalogue
- unbounded scans (policy + query guards)
Defense-in-depth Execution Pipeline
- Authentication (if required by access level)
- SQL parse + validation
- Extract referenced tables/datasets
- Resolve to catalogued mapping
- OPA authorization (dataset + action + identity)
- Execute in restricted context with limits/timeouts
- Return paginated results only
If any step fails → request fails.
Auditability & Observability
Log (at least): - request id - subject - dataset ids referenced - allow/deny decision - validation errors (sanitized) - execution time and rows returned
Do not log raw SQL without sanitization if it may contain sensitive literals.
Common Policy Patterns
- Namespace gating: allow external access only to
golddatasets - Group-based: allow if identity belongs to
dataset:{id}:read - Tenant/community scoping: allow if identity has
community:{X}group matching dataset tagcommunity:{X} - Service account capabilities: allow if role includes
serviceand scope includesdataset:query