Query Engine
This document explains the governed SQL interface: request/response semantics, validation rules, dataset resolution, and performance guardrails.
Endpoints (Conceptual)
Your exact route names may differ; the engine typically exposes: - POST query: run a SQL query against a dataset - GET schema: fetch JSON Schema for a dataset - GET metadata: column and dataset metadata for UI/clients
The README should link to the concrete API reference.
Query Request Model
Typical POST body:
{
"sql": "SELECT ...",
"limit": 100,
"offset": 0
}
Notes:
- limit/offset are applied server-side.
- the engine may override limit with a maximum.
Response Model
A typical paginated response:
{
"items": [
{"col_a": 1, "col_b": "x"},
{"col_a": 2, "col_b": "y"}
],
"limit": 100,
"offset": 0,
"count": 2
}
SQL Validation Rules
Statement restrictions
- single statement only
SELECTonly
Table restrictions
- references must map to catalogued datasets
- if a query contains multiple table references, each must be resolvable
Function and expression allowlist
- only allow safe scalar functions
- block functions that can access filesystem, network, or server internals
Projection safety
- avoid
SELECT *if you want strict contracts (optional) - optionally enforce explicit column selection for restricted datasets
Dataset Resolution
The engine enforces a separation between: - logical ids: what the client references (dataset identifiers) - physical names: actual table/view names
Resolution steps: 1. parse SQL and extract table identifiers 2. map identifiers to catalogue entries 3. substitute physical references into the execution query (or bind via prepared mapping) 4. reject if any identifier cannot be mapped
This prevents “escaping” to arbitrary tables.
Pagination, Limits & Timeouts
The engine must guard the storage backend.
Recommended controls:
- hard max limit (e.g. 1k / 10k rows)
- max offset (to prevent deep scans) or encourage keyset pagination
- statement timeout
- max query complexity (joins, subqueries, regex-like operations)
Even for open datasets, resource controls must remain enforced.
Join Policy
Joins can be permitted, but only within controlled boundaries:
- join only catalogued datasets
- join only within allowed namespaces (e.g. silver+gold)
- reject cartesian products
- optionally limit join count (e.g. <= 3 tables)
If you want safer defaults: - forbid joins by default, allow per-dataset tags/policy
Error Semantics (Examples)
Validation error (400)
- invalid SQL
- forbidden keyword
- unknown dataset reference
Authorization error (403)
- identity not permitted by OPA
- missing required scopes/roles/groups
Not found (404)
- dataset id does not exist in catalogue
Execution error (500/502)
- database error
- upstream dependency failure (OPA/lineage)
Examples
Query a dataset
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT * FROM datasets.gold.example WHERE ts >= now() - interval \'1 day\'","limit":100,"offset":0}' \
https://host/api/dataset/datasets.gold.example/query
Paginate
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT id, ts, value FROM datasets.gold.example ORDER BY ts DESC","limit":100,"offset":100}' \
https://host/api/dataset/datasets.gold.example/query
Performance Recommendations for Clients
- always filter by time windows where possible
- request only needed columns
- prefer indexed predicates
- avoid deep offsets; paginate with stable ordering