Query Engine
This document explains the governed SQL interface: request/response semantics, validation rules, dataset resolution, and performance guardrails.
Endpoints (Conceptual)
Your exact route names may differ; the engine typically exposes: - POST query: run a SQL query against a dataset - GET schema: fetch JSON Schema for a dataset - GET metadata: column and dataset metadata for UI/clients
The README should link to the concrete API reference.
Query Request Model
Typical POST body:
{
"sql": "SELECT ...",
"limit": 100,
"offset": 0
}
Notes:
- limit/offset are applied server-side.
- the engine may override limit with a maximum.
Response Model
A typical paginated response:
{
"items": [
{"col_a": 1, "col_b": "x"},
{"col_a": 2, "col_b": "y"}
],
"limit": 100,
"offset": 0,
"count": 2
}
SQL Validation Rules
Statement restrictions
- single statement only
SELECTonly
Table restrictions
- references must map to catalogued datasets
- if a query contains multiple table references, each must be resolvable
Function and expression allowlist
- only allow safe scalar functions
- block functions that can access filesystem, network, or server internals
Projection safety
- avoid
SELECT *if you want strict contracts (optional) - optionally enforce explicit column selection for restricted datasets
Dataset Resolution
The engine enforces a separation between: - logical ids: what the client references (dataset identifiers) - physical names: actual table/view names
Resolution steps:
- parse SQL and extract table identifiers
- map identifiers to catalogue entries
- substitute physical references into the execution query (or bind via prepared mapping)
- reject if any identifier cannot be mapped
This prevents “escaping” to arbitrary tables.
Pagination, Limits & Timeouts
The engine must guard the storage backend.
Recommended controls:
- hard max limit (e.g. 1k / 10k rows)
- max offset (to prevent deep scans) or encourage keyset pagination
- statement timeout
- max query complexity (joins, subqueries, regex-like operations)
Even for open datasets, resource controls must remain enforced.
Join Policy
Joins can be permitted, but only within controlled boundaries:
- join only catalogued datasets
- join only within allowed namespaces (e.g. silver+gold)
- reject cartesian products
- optionally limit join count (e.g. <= 3 tables)
If you want safer defaults: - forbid joins by default, allow per-dataset tags/policy
Error Semantics (Examples)
Validation error (400)
- invalid SQL
- forbidden keyword
- unknown dataset reference
Authorization error (403)
- identity not permitted by OPA
- missing required scopes/roles/groups
Not found (404)
- dataset id does not exist in catalogue
Execution error (500/502)
- database error
- upstream dependency failure (OPA/lineage)
Examples
Query a dataset
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT * FROM datasets.gold.example WHERE ts >= now() - interval \'1 day\'","limit":100,"offset":0}' \
https://host/api/dataset/datasets.gold.example/query
Paginate
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT id, ts, value FROM datasets.gold.example ORDER BY ts DESC","limit":100,"offset":100}' \
https://host/api/dataset/datasets.gold.example/query
Performance Recommendations for Clients
- always filter by time windows where possible
- request only needed columns
- prefer indexed predicates
- avoid deep offsets; paginate with stable ordering