OLake: Fast Database-to-Iceberg Replication Tool

If you work with data, you've probably been through this pain before. You have a Postgres or MongoDB database running in production, and someone on the analytics team needs that data in a data lake. So you start looking at tools — Debezium, Kafka Connect, Airbyte, Fivetran — and suddenly you're managing a whole Kafka cluster, writing custom configs, debugging connector issues, and wondering why a "simple" data pipeline turned into a full-time job.

OLake is trying to fix that. It's an open-source data ingestion engine built by the team at Datazip, and its whole thing is getting data from your databases into Apache Iceberg (or plain Parquet) as fast as possible, without all the infrastructure overhead.

What Does OLake Actually Do?

In simple terms: it replicates your database into a data lakehouse. You point it at your Postgres, MySQL, MongoDB, Oracle, or even Kafka — and it pulls the data out and writes it directly into Apache Iceberg tables or Parquet files sitting on S3, MinIO, GCS, or Azure.

No Kafka required. No Spark. No Flink. No Debezium. Just OLake talking directly to your source and writing to your destination.

It supports both full snapshots (dump everything) and CDC (Change Data Capture), so after the initial load it keeps watching for changes and syncs them in near real time. It does that by reading each database's own change log: the Postgres write-ahead log via the pgoutput plugin, binlogs for MySQL, and the oplog for MongoDB.
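To make the snapshot-then-CDC idea concrete, here's a minimal Python sketch. It is not OLake's code: the event shape (`op`, `key`, `row`) and the name `apply_changes` are made up for illustration. It starts from a full snapshot keyed by primary key and replays change events in log order.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(snapshot: Dict[int, Row], events: Iterable[Dict[str, Any]]) -> Dict[int, Row]:
    """Replay change events, in log order, on top of a full snapshot.

    Hypothetical event shape:
    {"op": "insert" | "update" | "delete", "key": int, "row": dict}
    """
    table = dict(snapshot)  # copy so the original snapshot is left untouched
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: replaying the same event twice gives the same result.
            table[key] = event["row"]
        elif op == "delete":
            table.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


snapshot = {1: {"name": "ada"}, 2: {"name": "bob"}}
events = [
    {"op": "update", "key": 2, "row": {"name": "bob b."}},
    {"op": "insert", "key": 3, "row": {"name": "cy"}},
    {"op": "delete", "key": 1},
]
current = apply_changes(snapshot, events)
print(current)
```

Treating inserts and updates as the same upsert keeps the replay idempotent, which is handy if a batch of events ever gets applied twice.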

[Figure: OLake architecture]

The Numbers Are Kinda Wild

The benchmarks the OLake team published are hard to ignore. We're talking 580K records per second for Postgres full loads and 338K records per second for MySQL. The team claims OLake is around 15x faster than Debezium and significantly cheaper than managed tools like Fivetran or Airbyte.
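For a sense of scale, here's some back-of-the-envelope math in Python. The 1-billion-row table is a made-up example, and the calculation assumes the published throughput holds steady for the whole load, which a real workload may not match.

```python
def full_load_minutes(rows: int, records_per_second: float) -> float:
    """Estimated wall-clock minutes to copy `rows` at a constant rate."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return rows / records_per_second / 60


ROWS = 1_000_000_000  # hypothetical table size, not taken from the benchmarks

# Published full-load throughput figures, in records per second.
PUBLISHED_RATES = {"postgres": 580_000, "mysql": 338_000}

estimates = {src: full_load_minutes(ROWS, rate) for src, rate in PUBLISHED_RATES.items()}
for src, minutes in estimates.items():
    print(f"{src}: ~{minutes:.0f} minutes for {ROWS:,} rows")
```

Under those assumptions, that works out to roughly half an hour for Postgres and a bit under 50 minutes for MySQL.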

The whole thing is written in Go, which helps explain the performance. Go's memory efficiency and concurrency model are a good fit for this kind of high-throughput data shuffling.

Supported Sources and Destinations

Here's what OLake can pull from right now:

PostgreSQL — Full refresh, incremental sync, and pgoutput-based CDC
MySQL — Full refresh, incremental sync, and binlog-based CDC
MongoDB — Full refresh, incremental sync, and oplog-based CDC
Oracle — Full refresh and incremental sync
MSSQL — Full refresh and incremental sync
DB2 — Full refresh and incremental sync
Apache Kafka — Consumer group-based streaming
S3 — Object store ingestion
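If you want to check that list from a script, say to decide which pipelines can use CDC, one way is a small lookup table. This is just the list above transcribed into Python; the constant and function names are my own, not anything OLake ships.

```python
from typing import Dict, FrozenSet

FULL, INCREMENTAL, CDC = "full_refresh", "incremental", "cdc"

# Sync modes per database source, transcribed from the list above.
# Kafka and S3 are left out because they are described differently.
SOURCE_MODES: Dict[str, FrozenSet[str]] = {
    "postgresql": frozenset({FULL, INCREMENTAL, CDC}),
    "mysql": frozenset({FULL, INCREMENTAL, CDC}),
    "mongodb": frozenset({FULL, INCREMENTAL, CDC}),
    "oracle": frozenset({FULL, INCREMENTAL}),
    "mssql": frozenset({FULL, INCREMENTAL}),
    "db2": frozenset({FULL, INCREMENTAL}),
}


def supports(source: str, mode: str) -> bool:
    """Return True if the list above names `mode` for `source`."""
    return mode in SOURCE_MODES.get(source.lower(), frozenset())


cdc_sources = sorted(s for s in SOURCE_MODES if supports(s, CDC))
print(cdc_sources)
```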

And where it writes to:

Apache Iceberg — with support for Glue, Hive, JDBC, and REST catalogs (Nessie, Polaris, Unity Catalog, AWS S3 Tables)
S3 Parquet — plain Parquet files on MinIO, S3, or GCS

Once your data lands in Iceberg, you can query it with basically anything — Athena, Trino, Spark, Presto, Dremio, Databricks, Snowflake, ClickHouse, you name it.

How to Get Started

OLake ships with a web UI that runs via Docker Compose. Getting it up is genuinely a one-liner:

curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d

That spins up the full stack — the UI, backend, Temporal for workflow orchestration, Postgres for internal state, and everything else.

Then just open http://localhost:8000 in your browser and log in with the default creds:

Username: admin
Password: password

From there it's pretty straightforward:

Create a Job — go to the Jobs tab, click Create Job
Configure your Source — point it at your Postgres, MySQL, MongoDB, whatever
Configure your Destination — pick Iceberg or Parquet, set your catalog and storage
Select your tables — choose which tables or collections to sync
Hit sync — and watch it go

If you prefer the command line, there's also a full CLI with four main commands: spec, check, discover, and sync. The CLI is great for automation and plugging into orchestration tools like Airflow or Kubernetes.
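Here's a rough idea of what wrapping those commands in an automation script could look like. This is a sketch, not OLake's documented usage: the step order (check, then discover, then sync) is my assumption, `base_cmd` is a placeholder for however you launch the CLI, and the real commands need config arguments that are deliberately left out here, so take those from the OLake docs.

```python
import subprocess
from typing import Callable, List, Sequence

# Subcommand order is an assumption: validate, then discover, then sync.
STEPS = ("check", "discover", "sync")


def run_steps(base_cmd: Sequence[str],
              run: Callable[[List[str]], int] = subprocess.call) -> List[str]:
    """Run each step as `base_cmd + [step]`, stopping at the first failure.

    `base_cmd` is whatever launches the OLake CLI in your setup (a placeholder here).
    Returns the list of steps that completed successfully.
    """
    done: List[str] = []
    for step in STEPS:
        exit_code = run([*base_cmd, step])
        if exit_code != 0:
            raise RuntimeError(f"step {step!r} failed with exit code {exit_code}")
        done.append(step)
    return done


if __name__ == "__main__":
    # Stand-in runner that only prints, so the sketch runs without OLake installed.
    def dry_run(cmd: List[str]) -> int:
        print("would run:", " ".join(cmd))
        return 0

    run_steps(["olake-cli-placeholder"], run=dry_run)
```

Failing fast on a non-zero exit code is what lets a scheduler like Airflow mark the task as failed instead of syncing against a bad config.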

Want to Kick the Tires First? Try the Playground

If you just want to see how everything fits together without connecting your own databases, OLake has a Playground. It's a self-contained Docker Compose environment that comes preconfigured with sample data, a source database, MinIO for storage, and Presto for querying.

One docker-compose up and you have a full lakehouse running on your laptop. Pretty neat for understanding the flow before committing to anything.

Why Not Just Use Airbyte or Fivetran?

Fair question. The short answer is speed and cost.

Airbyte and Fivetran are great products, but they come with trade-offs. Fivetran is managed but expensive. Airbyte is open-source but still relies on a connector framework that adds overhead, and for high-volume CDC workloads the performance gap becomes noticeable.

OLake was specifically built for the database-to-Iceberg use case. It doesn't try to be a universal connector platform for 300+ sources. It does a handful of things really well — and the "really well" part is the speed. The parallel chunking strategy, native BSON extraction for MongoDB, Arrow-based writes, and exactly-once delivery are all purpose-built for this specific workflow.
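To give a feel for what "parallel chunking" means in general, here's a toy Python version. It is not OLake's actual algorithm: it assumes an integer primary key, splits the key range into fixed-size chunks, and reads the chunks concurrently. The in-memory `TABLE` and `read_chunk` stand in for a real database and a range query.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Chunk = Tuple[int, int]


def make_chunks(min_id: int, max_id: int, chunk_size: int) -> List[Chunk]:
    """Split the inclusive key range [min_id, max_id] into half-open [start, end) chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, max_id + 1))
            for start in range(min_id, max_id + 1, chunk_size)]


def copy_in_parallel(chunks: List[Chunk],
                     read: Callable[[int, int], List[str]],
                     workers: int = 4) -> int:
    """Read every chunk on a thread pool and return the total number of rows read."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: len(read(*c)), chunks))


# Fake source table with ids 1..10.
TABLE = {i: f"row-{i}" for i in range(1, 11)}


def read_chunk(start: int, end: int) -> List[str]:
    # Stand-in for: SELECT ... WHERE id >= start AND id < end
    return [TABLE[i] for i in range(start, end) if i in TABLE]


chunks = make_chunks(1, 10, 4)
total = copy_in_parallel(chunks, read_chunk)
print(chunks, total)
```

Because the chunks don't overlap, each worker can read its range independently, which is where the speedup on big tables comes from.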

The other big thing is infrastructure simplicity. No Kafka cluster to manage. No Spark jobs to tune. The whole stack runs in Docker containers and the operational surface area is way smaller.

What's the Catch?

Like any young open-source project, there are rough edges. A few things to keep in mind:

Delta Lake and Hudi destinations aren't supported yet (Iceberg and Parquet only for now)
Oracle and MongoDB CDC are still maturing
The project is moving fast, which means breaking changes can happen
If you need 300+ connectors, this isn't the tool for that — it's laser-focused on databases to lakehouse

That said, the community is growing fast. The repo has 1.3k+ stars, an active Slack, and they run a contributor program and even participate in Google Summer of Code.

The Bigger Picture

The whole data lakehouse movement is about getting away from expensive, locked-in warehouses and moving to open formats that you control. Iceberg is winning that battle, and tools like OLake make the "getting data into Iceberg" part way less painful.

If you're a data engineer who's tired of managing Kafka + Debezium + Spark just to get database changes into your lake, OLake is worth a serious look. The setup is fast, the performance is legit, and the permissive Apache 2.0 license means you own your pipeline.
