OpaqueDB

Run SQL over encrypted data without revealing the query.

What it is · Quickstart · Docs · Use cases · What it is not · Building · Cluster · References · License · Donate


What it is

OpaqueDB answers SQL queries over data without learning what you asked for. A client encrypts the value it is searching for under its own key. The server evaluates the match over encrypted data and returns an encrypted result that only the client can decrypt. The operator runs the query but never sees the query value or the secret key.

It is a computational private information retrieval (PIR) system built on Microsoft SEAL with the BFV scheme. The privacy guarantee is precise:

  • Privacy rests on Ring-LWE, a lattice assumption with no known quantum attack.
  • It holds against a semi-honest operator. There is no non-collusion assumption between servers, so the whole cluster is one trust domain.
  • Today the system is QueryPrivate: the operator never learns the query value. DataPrivate mode, where the operator also never learns the stored data, is a later release.

The deployed unit is a sharded cluster of identical nodes. Sharding spreads the linear scan that PIR requires across many machines.

For where this fits, from private credential checks to confidential lookups, see the use cases on the docs site.

How matching works

The match key is binary-expanded into key_bits SIMD slots per record and many records are packed into the slots of one ciphertext. For each query the server computes, in parallel across the whole batch:

  1. diff = query - key then sq = diff * diff, giving (q-k)^2, which is 0 where bits match and 1 where they differ;
  2. eq = 1 - sq, the per-bit XNOR, then an AND across each record's bits by a rotate-and-multiply tree of depth log2(key_bits), leaving the equality indicator per record;
  3. a plaintext-masked retrieve: multiply the indicator by the packed payload and sum, so only the matching record's payload survives, plus an encrypted presence count so a no-match query returns an empty result.

This costs only 1 + log2(key_bits) ciphertext-times-ciphertext multiplies per batch (the dominant FHE cost) regardless of how many records the batch holds, so the expensive multiplies amortize across thousands of rows. The parameters were chosen by measuring against examples/crypto_bench: poly_modulus_degree 16384 (required to fit the multiplicative depth in the noise budget), a 349-bit coefficient modulus, and key_bits 16 by default (all configurable). The query travels as a single ciphertext.

A secondary INDEX column reuses this exact pipeline. Each searchable column's key is packed side by side in one record on disk, and the server slices out the column the query names, so matching on an index is the same single-key evaluation with no extra encrypted multiplies. Because data is sharded by the primary key, an index query simply fans out to every shard and sums the partials, the same path a key query takes.

IN (...) and a same-column OR (city = "Tokyo" OR city = "Cairo") match a set of values on one column. The client encrypts one operand per value, the server builds one equality indicator per operand and sums them. A key equals at most one listed value, so the indicators are disjoint and the sum is their union. This adds one square plus AND-tree per extra value, not a deeper circuit, so the multiplicative depth stays the same as a single equality.

Quickstart

Build the binary (see Building), then declare a table, load a CSV, serve it, and run a private query.

Declare a schema with CREATE TABLE. Exactly one column is the primary KEY (it is searchable and shards the data). Any column may also be marked INDEX to make it searchable too; the rest are typed payload columns that are returned but not matched:

-- weather.sql
CREATE TABLE weather (
  id INT KEY,
  city TEXT INDEX,
  country TEXT INDEX,
  temperature INT,
  humidity INT,
  conditions TEXT INDEX
);

A query matches on whichever column its WHERE names, so this table can be looked up by id, city, country, or conditions. Column types are INT, REAL, TEXT, and JSON. A JSON column is stored and returned just like TEXT but its value is validated as well-formed JSON on insert, so clients get back parseable JSON, not an opaque string. A KEY or INDEX column must be INT or TEXT (not REAL or JSON, which are payload only). An INDEX column is stored both as a search key and as payload, so it is also returned; the KEY column is the one exception that is matched but not returned.

The CSV's header names the columns:

id,city,country,temperature,humidity,conditions
1,Amsterdam,NL,18,72,Cloudy
2,Tokyo,JP,27,61,Clear

Load it, start a node, and query by the key. The example files live in examples/:

Load the example data and start a node (local insecure dev mode):

opaquedb run --set auth.mode=none --set auth.enable_insecure=true &
opaquedb load --schema examples/weather.sql --csv examples/weather.csv

Then open the interactive shell and run private queries by any searchable column, the primary KEY or any INDEX:

Statements end with a semicolon and may span lines. Up and down recall history (saved across sessions), tab completes keywords and known table and column names, \d <table> shows a schema, and \timing toggles query timing.

$ opaquedb repl
OpaqueDB shell. \help for commands, \quit to exit.
opaquedb(default)> SELECT city, temperature, conditions FROM weather WHERE id = 1;
 city      | temperature | conditions
-----------+-------------+------------
 Amsterdam | 18          | Cloudy
opaquedb(default)> SELECT city, country FROM weather WHERE conditions = "Sunny";
 city    | country
---------+---------
 Nairobi | KE
 Cairo   | EG
opaquedb(default)> SELECT country FROM weather WHERE city = "Atlantis";
(no rows)
opaquedb(default)> \quit

The query value is encrypted whichever column you match on; matching on a secondary INDEX costs the operator no more information and the same encrypted round trip as matching on the key.

A one-shot query works the same way and prints the decoded row:

$ opaquedb query 'SELECT country, temperature FROM weather WHERE city = "Tokyo"'
 country | temperature
---------+-------------
 JP      | 27

"Tokyo" is encrypted before it leaves the client. The node scans every row under encryption and returns only the encrypted match.

Multiple matches: LIMIT and OFFSET

A searchable value can match many rows. The default is LIMIT 10, so a bare query returns up to ten matching rows. Two cities share conditions = "Sunny", and both come back:

$ opaquedb query 'SELECT city, country FROM weather WHERE conditions = "Sunny"'
 city    | country
---------+---------
 Nairobi | KE
 Cairo   | EG

LIMIT n caps the rows and OFFSET m pages through them; rows come back in a stable order across queries, so LIMIT 1 OFFSET 1 returns the second match:

$ opaquedb query 'SELECT city, country FROM weather WHERE conditions = "Sunny" LIMIT 1 OFFSET 1'
 city  | country
-------+---------
 Cairo | EG

LIMIT/OFFSET are public (they are not secret, so they stay in the plaintext template) and applied on the client. Under the hood the server partitions matches into crypto.result_buckets buckets (default 16) and packs every bucket into one result, so LIMIT counts rows rather than bucket slots and the encrypted result size does not grow with the limit. A single value can therefore return up to result_buckets rows in one round trip; raise result_buckets for more.

Scan, sort, and de-duplicate

A SELECT with no WHERE reads the table directly. There is no value to match, so nothing is hidden and the rows come back in the clear (a plaintext scan, not the encrypted matcher). It still defaults to LIMIT 10:

$ opaquedb query 'SELECT city, temperature FROM weather ORDER BY temperature DESC LIMIT 3'
 city    | temperature
---------+-------------
 Cairo   | 33
 Tokyo   | 27
 Nairobi | 24

ORDER BY col [ASC|DESC], DISTINCT, and column aliases (col AS name) are applied on the client over the rows it gets back:

$ opaquedb query 'SELECT DISTINCT country FROM weather ORDER BY country LIMIT 4'
 country
---------
 CA
 CL
 EG
 GB

Exclude a value: <>

WHERE col <> :v (or !=) matches every row whose value differs. It costs the operator no more than = and hides the value the same way. The count is exact:

$ opaquedb query 'SELECT COUNT(*) FROM weather WHERE conditions <> "Sunny"'
7

Returning the rows themselves follows the same bucket rules as any multi-row result (see below): a <> usually matches many rows, so some can collide in a bucket and be dropped from the returned set. Raise crypto.result_buckets or page with OFFSET to recover them; the COUNT(*) stays exact regardless.

Match a set of values: IN and OR

WHERE col IN (...) and a same-column OR match several values on one column in a single query. Each value is encrypted separately, so the operator still learns nothing about any of them:

$ opaquedb query 'SELECT city, country, temperature FROM weather WHERE city IN ("Tokyo", "Cairo", "London")'
 city   | country | temperature
--------+---------+-------------
 London | GB      | 11
 Tokyo  | JP      | 27
 Cairo  | EG      | 33

A flat OR on the same column is the same union written differently:

$ opaquedb query 'SELECT city, temperature FROM weather WHERE city = "Tokyo" OR city = "Nairobi"'
 city    | temperature
---------+-------------
 Nairobi | 24
 Tokyo   | 27

IN works on the key too, for example WHERE id IN (1, 5, 9). Combining conditions across different columns (col1 = a AND col2 = b, or OR spanning two columns) is not evaluated yet.

Count matches privately

SELECT COUNT(*) returns the number of matching rows as a single number. The count is exact and the operator still never sees the value:

$ opaquedb query 'SELECT COUNT(*) FROM weather WHERE conditions = "Sunny"'
2

What it is not

  • Not a full SQL engine yet. The evaluated query matches one searchable column, the primary KEY or any INDEX, with =, <> (!=), IN (...), or a same-column OR, optionally with LIMIT/OFFSET or COUNT(*). A SELECT with no WHERE is a plaintext full scan. Cross-column conditions (col1 = a AND col2 = b, or OR spanning two columns), LIKE, and ranges (<, >, BETWEEN) parse but are not evaluated under encryption yet. Widening the set of operators the engine can evaluate privately is active work, so expect more SQL support over time; the docs track what is supported.
  • Some SQL runs on the client, not the server. ORDER BY, DISTINCT, column aliases (AS), and the LIMIT/OFFSET window are applied by the client over the rows the server returns, not pushed into the encrypted scan. So they sort and de-duplicate only what came back, which for a matched query is at most crypto.result_buckets rows per query, and for a scan is what the server returned (it caps how many rows one scan reads). A bare ORDER BY ... LIMIT 5 over a table larger than that window is not a global top-5. This keeps the server's job to the one thing it must do under encryption, the private match, and leaves presentation to the client.
  • A no-WHERE scan is not private and is single-node only. With no value to match there is nothing to hide, so the scan returns plaintext rows; do not read it as a private query. It also runs on one node's shard, so a full scan against a sharded cluster is rejected rather than returning a partial answer.
  • Not a way to skip work. PIR requires a full linear scan. Sharding improves latency and throughput, not total work.
  • Not anonymity. Authentication is access control: OpaqueDB hides the query value, never who is asking. The server always learns which authenticated principal sent a query. You can build anonymous authentication on top of OpaqueDB (for example an anonymizing proxy, a mix layer, or anonymous credentials in your application), but the database alone will not provide it. Client anonymity is an extra layer you add above the database, not a property it gives you.
  • It ships no client SDK. The gRPC .proto files are the wire contract. The query subcommand is a dev test client.

Features

  • CREATE TABLE schemas with typed columns (int, real, text, json), one match KEY, and any number of secondary INDEX columns to search on
  • Private equality (=) and inequality (<>) lookup over encrypted data via Microsoft SEAL (BFV), matching on the key or any secondary index, with <> costing the same as =
  • Match a set of values on one column with IN (...) or a same-column OR, each operand encrypted, with no added multiplicative depth
  • Private SELECT COUNT(*) that returns an exact match count and nothing else
  • Plaintext full scan for a SELECT with no WHERE (no value to hide), with client-side ORDER BY, DISTINCT, and column aliases
  • Multi-row results with public LIMIT/OFFSET (default LIMIT 10): a value matching many rows returns them in one round trip, at a result size that does not grow with the limit
  • Interactive repl with persistent history, multi-line statements, tab completion, aligned table output, and \d/\timing
  • Sharded cluster with etcd leader election, membership, and query fan-out
  • Versioned immutable epochs: write-ahead log, atomic publish, rollback
  • Token, mTLS, and no-auth modes with constant-time token comparison
  • One management facade (AdminService) shared by the gRPC API and the CLI
  • Unified logging to stdout or a file, in text or JSON, set from config
  • Strict layering enforced by CMake target visibility

Building

The build uses CMake with Ninja and vcpkg in manifest mode. Dependencies are pinned by the builtin-baseline in vcpkg.json.

Common tasks run through the Makefile, which is the single source of truth for build, test, lint, and packaging commands (CI invokes the same targets):

export VCPKG_ROOT=/opt/vcpkg   # provided by the dev container
make configure                 # first run builds dependencies via vcpkg
make build
make test

Run make help to list every target. Useful ones: make lint (clang-format and clang-tidy), make package (release .deb and .tar.gz), and PRESET=release on configure/build for an optimized build.

The dev container in .devcontainer/ provides the C++20 toolchain and vcpkg. The first configure is slow because vcpkg builds dependencies from source; later builds use the binary cache.

Multi-node cluster

The Docker Compose setup in docker/ brings up one etcd and three nodes that elect a leader, load a disjoint shard each, and answer a cross-shard private query:

docker compose -f docker/docker-compose.yml up --build -d
docker compose -f docker/docker-compose.yml run --rm tools \
  query 'SELECT country, temperature FROM weather WHERE city = "Santiago"' \
  --target node1:50051

Any node can be the target; each coordinates the query across all shards.

References

OpaqueDB's private matching builds on the homomorphic-encryption techniques for querying over encrypted data described in the following paper. If you use OpaqueDB in academic work, please cite it:

@article{karacay2020intrusion,
  title   = {Intrusion Detection Over Encrypted Network Data},
  author  = {Kara{\c{c}}ay, Leyli and Sava{\c{s}}, Erkay and Alptekin, Halit},
  journal = {The Computer Journal},
  volume  = {63},
  number  = {4},
  pages   = {604--619},
  year    = {2020},
  publisher = {Oxford University Press},
  doi     = {10.1093/comjnl/bxz111}
}