Run SQL over encrypted data without revealing the query.
What it is · Quickstart · Docs · Use cases · What it is not · Building · Cluster · References · License · Donate
What it is
OpaqueDB answers SQL queries over data without learning what you asked for. A client encrypts the value it is searching for under its own key. The server evaluates the match over encrypted data and returns an encrypted result that only the client can decrypt. The operator runs the query but never sees the query value or the secret key.
It is a computational private information retrieval (PIR) system built on Microsoft SEAL with the BFV scheme. The privacy guarantee is precise:
- Privacy rests on Ring-LWE, a lattice assumption with no known quantum attack.
- It holds against a semi-honest operator. There is no non-collusion assumption between servers, so the whole cluster is one trust domain.
- Today the system is QueryPrivate: the operator never learns the query value. DataPrivate mode, where the operator also never learns the stored data, is a later release.
The deployed unit is a sharded cluster of identical nodes. Sharding spreads the linear scan that PIR requires across many machines.
For where this fits, from private credential checks to confidential lookups, see the use cases on the docs site.
How matching works
The match key is binary-expanded into key_bits SIMD slots per record and many
records are packed into the slots of one ciphertext. For each query the server
computes, in parallel across the whole batch:
diff = query - keythensq = diff * diff, giving(q-k)^2, which is 0 where bits match and 1 where they differ;eq = 1 - sq, the per-bit XNOR, then an AND across each record's bits by a rotate-and-multiply tree of depthlog2(key_bits), leaving the equality indicator per record;- a plaintext-masked retrieve: multiply the indicator by the packed payload and sum, so only the matching record's payload survives, plus an encrypted presence count so a no-match query returns an empty result.
This costs only 1 + log2(key_bits) ciphertext-times-ciphertext multiplies per
batch (the dominant FHE cost) regardless of how many records the batch holds, so
the expensive multiplies amortize across thousands of rows. The parameters were
chosen by measuring against examples/crypto_bench: poly_modulus_degree
16384 (required to fit the multiplicative depth in the noise budget), a 349-bit
coefficient modulus, and key_bits 16 by default (all configurable). The query
travels as a single ciphertext.
A secondary INDEX column reuses this exact pipeline. Each searchable column's
key is packed side by side in one record on disk, and the server slices out the
column the query names, so matching on an index is the same single-key
evaluation with no extra encrypted multiplies. Because data is sharded by the
primary key, an index query simply fans out to every shard and sums the
partials, the same path a key query takes.
IN (...) and a same-column OR (city = "Tokyo" OR city = "Cairo") match a
set of values on one column. The client encrypts one operand per value, the
server builds one equality indicator per operand and sums them. A key equals at
most one listed value, so the indicators are disjoint and the sum is their union.
This adds one square plus AND-tree per extra value, not a deeper circuit, so the
multiplicative depth stays the same as a single equality.
Quickstart
Build the binary (see Building), then declare a table, load a CSV, serve it, and run a private query.
Declare a schema with CREATE TABLE. Exactly one column is the primary KEY
(it is searchable and shards the data). Any column may also be marked INDEX to
make it searchable too; the rest are typed payload columns that are returned but
not matched:
-- weather.sql
CREATE TABLE weather (
id INT KEY,
city TEXT INDEX,
country TEXT INDEX,
temperature INT,
humidity INT,
conditions TEXT INDEX
);
A query matches on whichever column its WHERE names, so this table can be
looked up by id, city, country, or conditions. Column types are INT,
REAL, TEXT, and JSON. A JSON column is stored and returned just like
TEXT but its value is validated as well-formed JSON on insert, so clients get
back parseable JSON, not an opaque string. A KEY or INDEX column must be
INT or TEXT (not REAL or JSON, which are payload only). An INDEX column
is stored both as a search key and as payload, so it is also returned; the KEY
column is the one exception that is matched but not returned.
The CSV's header names the columns:
id,city,country,temperature,humidity,conditions
1,Amsterdam,NL,18,72,Cloudy
2,Tokyo,JP,27,61,Clear
Load it, start a node, and query by the key. The example files live in
examples/:
Load the example data and start a node (local insecure dev mode):
opaquedb run --set auth.mode=none --set auth.enable_insecure=true &
opaquedb load --schema examples/weather.sql --csv examples/weather.csv
Then open the interactive shell and run private queries by any searchable
column, the primary KEY or any INDEX:
Statements end with a semicolon and may span lines. Up and down recall history
(saved across sessions), tab completes keywords and known table and column
names, \d <table> shows a schema, and \timing toggles query timing.
$ opaquedb repl
OpaqueDB shell. \help for commands, \quit to exit.
opaquedb(default)> SELECT city, temperature, conditions FROM weather WHERE id = 1;
city | temperature | conditions
-----------+-------------+------------
Amsterdam | 18 | Cloudy
opaquedb(default)> SELECT city, country FROM weather WHERE conditions = "Sunny";
city | country
---------+---------
Nairobi | KE
Cairo | EG
opaquedb(default)> SELECT country FROM weather WHERE city = "Atlantis";
(no rows)
opaquedb(default)> \quit
The query value is encrypted whichever column you match on; matching on a
secondary INDEX costs the operator no more information and the same encrypted
round trip as matching on the key.
A one-shot query works the same way and prints the decoded row:
$ opaquedb query 'SELECT country, temperature FROM weather WHERE city = "Tokyo"'
country | temperature
---------+-------------
JP | 27
"Tokyo" is encrypted before it leaves the client. The node scans every row
under encryption and returns only the encrypted match.
Multiple matches: LIMIT and OFFSET
A searchable value can match many rows. The default is LIMIT 10, so a bare
query returns up to ten matching rows. Two cities share conditions = "Sunny",
and both come back:
$ opaquedb query 'SELECT city, country FROM weather WHERE conditions = "Sunny"'
city | country
---------+---------
Nairobi | KE
Cairo | EG
LIMIT n caps the rows and OFFSET m pages through them; rows come back in a
stable order across queries, so LIMIT 1 OFFSET 1 returns the second match:
$ opaquedb query 'SELECT city, country FROM weather WHERE conditions = "Sunny" LIMIT 1 OFFSET 1'
city | country
-------+---------
Cairo | EG
LIMIT/OFFSET are public (they are not secret, so they stay in the
plaintext template) and applied on the client. Under the hood the server
partitions matches into crypto.result_buckets buckets (default 16) and packs
every bucket into one result, so LIMIT counts rows rather than bucket slots
and the encrypted result size does not grow with the limit. A single value can
therefore return up to result_buckets rows in one round trip; raise
result_buckets for more.
Scan, sort, and de-duplicate
A SELECT with no WHERE reads the table directly. There is no value to match,
so nothing is hidden and the rows come back in the clear (a plaintext scan, not
the encrypted matcher). It still defaults to LIMIT 10:
$ opaquedb query 'SELECT city, temperature FROM weather ORDER BY temperature DESC LIMIT 3'
city | temperature
---------+-------------
Cairo | 33
Tokyo | 27
Nairobi | 24
ORDER BY col [ASC|DESC], DISTINCT, and column aliases (col AS name) are
applied on the client over the rows it gets back:
$ opaquedb query 'SELECT DISTINCT country FROM weather ORDER BY country LIMIT 4'
country
---------
CA
CL
EG
GB
Exclude a value: <>
WHERE col <> :v (or !=) matches every row whose value differs. It costs the
operator no more than = and hides the value the same way. The count is exact:
$ opaquedb query 'SELECT COUNT(*) FROM weather WHERE conditions <> "Sunny"'
7
Returning the rows themselves follows the same bucket rules as any multi-row
result (see below): a <> usually matches many rows, so some can collide in a
bucket and be dropped from the returned set. Raise crypto.result_buckets or
page with OFFSET to recover them; the COUNT(*) stays exact regardless.
Match a set of values: IN and OR
WHERE col IN (...) and a same-column OR match several values on one column in
a single query. Each value is encrypted separately, so the operator still learns
nothing about any of them:
$ opaquedb query 'SELECT city, country, temperature FROM weather WHERE city IN ("Tokyo", "Cairo", "London")'
city | country | temperature
--------+---------+-------------
London | GB | 11
Tokyo | JP | 27
Cairo | EG | 33
A flat OR on the same column is the same union written differently:
$ opaquedb query 'SELECT city, temperature FROM weather WHERE city = "Tokyo" OR city = "Nairobi"'
city | temperature
---------+-------------
Nairobi | 24
Tokyo | 27
IN works on the key too, for example WHERE id IN (1, 5, 9). Combining
conditions across different columns (col1 = a AND col2 = b, or OR spanning
two columns) is not evaluated yet.
Count matches privately
SELECT COUNT(*) returns the number of matching rows as a single number. The
count is exact and the operator still never sees the value:
$ opaquedb query 'SELECT COUNT(*) FROM weather WHERE conditions = "Sunny"'
2
What it is not
- Not a full SQL engine yet. The evaluated query matches one searchable column,
the primary
KEYor anyINDEX, with=,<>(!=),IN (...), or a same-columnOR, optionally withLIMIT/OFFSETorCOUNT(*). ASELECTwith noWHEREis a plaintext full scan. Cross-column conditions (col1 = a AND col2 = b, orORspanning two columns),LIKE, and ranges (<,>,BETWEEN) parse but are not evaluated under encryption yet. Widening the set of operators the engine can evaluate privately is active work, so expect more SQL support over time; the docs track what is supported. - Some SQL runs on the client, not the server.
ORDER BY,DISTINCT, column aliases (AS), and theLIMIT/OFFSETwindow are applied by the client over the rows the server returns, not pushed into the encrypted scan. So they sort and de-duplicate only what came back, which for a matched query is at mostcrypto.result_bucketsrows per query, and for a scan is what the server returned (it caps how many rows one scan reads). A bareORDER BY ... LIMIT 5over a table larger than that window is not a global top-5. This keeps the server's job to the one thing it must do under encryption, the private match, and leaves presentation to the client. - A no-
WHEREscan is not private and is single-node only. With no value to match there is nothing to hide, so the scan returns plaintext rows; do not read it as a private query. It also runs on one node's shard, so a full scan against a sharded cluster is rejected rather than returning a partial answer. - Not a way to skip work. PIR requires a full linear scan. Sharding improves latency and throughput, not total work.
- Not anonymity. Authentication is access control: OpaqueDB hides the query value, never who is asking. The server always learns which authenticated principal sent a query. You can build anonymous authentication on top of OpaqueDB (for example an anonymizing proxy, a mix layer, or anonymous credentials in your application), but the database alone will not provide it. Client anonymity is an extra layer you add above the database, not a property it gives you.
- It ships no client SDK. The gRPC
.protofiles are the wire contract. Thequerysubcommand is a dev test client.
Features
CREATE TABLEschemas with typed columns (int, real, text, json), one matchKEY, and any number of secondaryINDEXcolumns to search on- Private equality (
=) and inequality (<>) lookup over encrypted data via Microsoft SEAL (BFV), matching on the key or any secondary index, with<>costing the same as= - Match a set of values on one column with
IN (...)or a same-columnOR, each operand encrypted, with no added multiplicative depth - Private
SELECT COUNT(*)that returns an exact match count and nothing else - Plaintext full scan for a
SELECTwith noWHERE(no value to hide), with client-sideORDER BY,DISTINCT, and column aliases - Multi-row results with public
LIMIT/OFFSET(defaultLIMIT 10): a value matching many rows returns them in one round trip, at a result size that does not grow with the limit - Interactive
replwith persistent history, multi-line statements, tab completion, aligned table output, and\d/\timing - Sharded cluster with etcd leader election, membership, and query fan-out
- Versioned immutable epochs: write-ahead log, atomic publish, rollback
- Token, mTLS, and no-auth modes with constant-time token comparison
- One management facade (AdminService) shared by the gRPC API and the CLI
- Unified logging to stdout or a file, in text or JSON, set from config
- Strict layering enforced by CMake target visibility
Building
The build uses CMake with Ninja and vcpkg in manifest mode. Dependencies are
pinned by the builtin-baseline in vcpkg.json.
Common tasks run through the Makefile, which is the single source of truth for
build, test, lint, and packaging commands (CI invokes the same targets):
export VCPKG_ROOT=/opt/vcpkg # provided by the dev container
make configure # first run builds dependencies via vcpkg
make build
make test
Run make help to list every target. Useful ones: make lint (clang-format and
clang-tidy), make package (release .deb and .tar.gz), and PRESET=release
on configure/build for an optimized build.
The dev container in .devcontainer/ provides the C++20 toolchain and vcpkg.
The first configure is slow because vcpkg builds dependencies from source; later
builds use the binary cache.
Multi-node cluster
The Docker Compose setup in docker/ brings up one etcd and three nodes that
elect a leader, load a disjoint shard each, and answer a cross-shard private
query:
docker compose -f docker/docker-compose.yml up --build -d
docker compose -f docker/docker-compose.yml run --rm tools \
query 'SELECT country, temperature FROM weather WHERE city = "Santiago"' \
--target node1:50051
Any node can be the target; each coordinates the query across all shards.
References
OpaqueDB's private matching builds on the homomorphic-encryption techniques for querying over encrypted data described in the following paper. If you use OpaqueDB in academic work, please cite it:
@article{karacay2020intrusion,
title = {Intrusion Detection Over Encrypted Network Data},
author = {Kara{\c{c}}ay, Leyli and Sava{\c{s}}, Erkay and Alptekin, Halit},
journal = {The Computer Journal},
volume = {63},
number = {4},
pages = {604--619},
year = {2020},
publisher = {Oxford University Press},
doi = {10.1093/comjnl/bxz111}
}
Comments