Query Engine

This document explains the governed SQL interface: request/response semantics, validation rules, dataset resolution, and performance guardrails.

Endpoints (Conceptual)

Your exact route names may differ; the engine typically exposes: - POST query: run a SQL query against a dataset - GET schema: fetch JSON Schema for a dataset - GET metadata: column and dataset metadata for UI/clients

The README should link to the concrete API reference.

Query Request Model

Typical POST body:

{
  "sql": "SELECT ...",
  "limit": 100,
  "offset": 0
}

Notes: - limit/offset are applied server-side. - the engine may override limit with a maximum.

Response Model

A typical paginated response:

{
  "items": [
    {"col_a": 1, "col_b": "x"},
    {"col_a": 2, "col_b": "y"}
  ],
  "limit": 100,
  "offset": 0,
  "count": 2
}

SQL Validation Rules

Statement restrictions

single statement only
SELECT only

Table restrictions

references must map to catalogued datasets
if a query contains multiple table references, each must be resolvable

Function and expression allowlist

only allow safe scalar functions
block functions that can access filesystem, network, or server internals

Projection safety

avoid SELECT * if you want strict contracts (optional)
optionally enforce explicit column selection for restricted datasets

Dataset Resolution

The engine enforces a separation between: - logical ids: what the client references (dataset identifiers) - physical names: actual table/view names

Resolution steps: 1. parse SQL and extract table identifiers 2. map identifiers to catalogue entries 3. substitute physical references into the execution query (or bind via prepared mapping) 4. reject if any identifier cannot be mapped

This prevents “escaping” to arbitrary tables.

Pagination, Limits & Timeouts

The engine must guard the storage backend.

Recommended controls: - hard max limit (e.g. 1k / 10k rows) - max offset (to prevent deep scans) or encourage keyset pagination - statement timeout - max query complexity (joins, subqueries, regex-like operations)

Even for open datasets, resource controls must remain enforced.

Join Policy

Joins can be permitted, but only within controlled boundaries:

join only catalogued datasets
join only within allowed namespaces (e.g. silver+gold)
reject cartesian products
optionally limit join count (e.g. <= 3 tables)

If you want safer defaults: - forbid joins by default, allow per-dataset tags/policy

Error Semantics (Examples)

Validation error (400)

invalid SQL
forbidden keyword
unknown dataset reference

Authorization error (403)

identity not permitted by OPA
missing required scopes/roles/groups

Not found (404)

dataset id does not exist in catalogue

Execution error (500/502)

database error
upstream dependency failure (OPA/lineage)

Examples

Query a dataset

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT * FROM datasets.gold.example WHERE ts >= now() - interval \'1 day\'","limit":100,"offset":0}' \
  https://host/api/dataset/datasets.gold.example/query

Paginate

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT id, ts, value FROM datasets.gold.example ORDER BY ts DESC","limit":100,"offset":100}' \
  https://host/api/dataset/datasets.gold.example/query

Performance Recommendations for Clients

always filter by time windows where possible
request only needed columns
prefer indexed predicates
avoid deep offsets; paginate with stable ordering