Let’s be honest: you’re tired of stitching together half-baked evaluation dashboards, custom LangChain trace loggers, and hand-rolled guardrail wrappers just to know if your AI agent actually works — or worse, breaks in production. You’ve tried LangSmith (but hit the $29/mo wall at 50k traces), fiddled with Phoenix (only to realize it’s tracing-first, eval-second), and maybe even ran a janky sqlite-backed eval script that died when your dataset hit 200 rows. Enter future-agi: an Apache 2.0, self-hostable, end-to-end platform built for engineers who treat AI applications like real software — with observability, reproducible evals, deterministic simulations, and actual guardrail enforcement — not just “let’s log some JSON and pray”.
As of today, it’s sitting at 83 GitHub stars, written in Python, and — crucially — not another SaaS wrapper or vendor-locked telemetry sink. It’s lean, opinionated, and designed to run on your homelab or VPS. I’ve been running it alongside my self-hosted Llama 3.2 1B agent stack for 11 days. Here’s what’s real, what’s rough, and whether it deserves a spot in your docker-compose.yml.
## What Is Future-AGI? Not Just Another Tracer
Future-AGI isn’t a tracing tool with evals tacked on. It’s built around five tightly coupled pillars, each with first-class CLI, API, and UI support:
- Tracing: Structured, hierarchical trace trees (not flat spans) with `run_id`, `parent_id`, `input`/`output`, metadata, and nested agent steps — all stored in Postgres (no vendor lock-in).
- Evals: Define reusable, parameterized evaluation suites (e.g., `correctness`, `toxicity`, `tool-use-accuracy`) in YAML or Python — then run them across traces, across models, or against live gateways.
- Simulations: Record real user sessions → convert to replayable `*.sim` files → run them against candidate models (e.g., `Qwen2.5-7B` vs `Phi-3.5-mini`) with side-by-side diffing. No more “it felt better” — you get `+23% tool-call precision`, `-8% hallucination rate`.
- Datasets: Import from CSV/JSONL or generate synthetically with built-in LLM-based data augmentation. Each row can carry `input`, `expected_output`, `category`, `tags`, and `metadata` — and be tied directly to evals.
- Gateway & Guardrails: A lightweight, pluggable API gateway (think: `fastapi` + `litellm` proxy) that enforces rules before hitting your LLM — e.g., block PII in inputs, rate-limit by user ID, rewrite prompts to avoid jailbreaks, or inject safety context. Guardrails are Python functions — not YAML configs.
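Because guardrails are plain Python functions, the mental model is simple: a function inspects the request before it ever reaches the LLM. Here’s a minimal sketch of a pre-LLM PII blocker — note that the function signature and return shape are my assumptions for illustration, not the project’s documented API:

```python
import re

# Crude email matcher — a real deployment would use a proper PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def block_pii(request: dict) -> dict:
    """Hypothetical pre-LLM guardrail: reject inputs containing an email address."""
    if EMAIL_RE.search(request.get("input", "")):
        return {"allow": False, "reason": "PII (email) detected in input"}
    return {"allow": True}
```

The point is that the gateway calls this *before* the LLM sees anything — enforcement, not post-hoc logging.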
That said: it is not an LLM inference engine. It doesn’t host models. It doesn’t do RAG indexing. It’s the observability and quality control layer — the “Datadog for AI agents”, if Datadog had eval harnesses and simulation replay.
## Installation & Docker-First Setup (No Python Hell)
Future-AGI ships with a production-ready `docker-compose.yml` — and thank god for that. I skipped `poetry install` entirely. Here’s what I ran on my Ubuntu 24.04 box (16GB RAM, 4c/8t, NVMe SSD):
```bash
git clone https://github.com/future-agi/future-agi.git
cd future-agi
# Pull latest stable (v0.4.2 as of 2024-10-15 — check releases!)
git checkout v0.4.2
```
Then I edited `docker-compose.yml` to match my infra. Key changes:
- Swapped `postgres:15` → `postgres:16-alpine` (smaller, same compat)
- Mounted persistent volumes for `/data` (traces, evals, simulations)
- Added `TZ: "Asia/Jakarta"` (yes, I care about timestamps)
- Disabled the demo `ollama` service (I use `litellm` + vLLM)
Here’s my trimmed-down docker-compose.yml snippet (full version in their repo):
```yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://futureagi:futureagi@postgres:5432/futureagi
      - REDIS_URL=redis://redis:6379/0
      - LITELLM_BASE_URL=http://litellm:4000
      - FUTURE_AGI_ENV=production
    depends_on:
      - postgres
      - redis
      - litellm
  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=futureagi
      - POSTGRES_USER=futureagi
      - POSTGRES_PASSWORD=futureagi
    volumes:
      - ./volumes/postgres:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    command: redis-server --save 60 1 --loglevel warning
    volumes:
      - ./volumes/redis:/data
  # Optional: I run litellm separately — but you can drop their service in
```
Then: docker compose up -d --build. Wait ~45 seconds. Hit http://localhost:8000/docs — you’ll see the full OpenAPI spec. The web UI (http://localhost:8000) loads in <2s. No Node.js build step. No npm install failures. Just Python + FastAPI + Postgres doing what they do best.
## Future-AGI vs. LangSmith, Phoenix, and Custom Scripts
If you’re comparing tools, here’s the unfiltered breakdown:
| Feature | Future-AGI (v0.4.2) | LangSmith (v0.1.122) | Phoenix (v2.0.1) | DIY SQLite + Pandas |
|---|---|---|---|---|
| Self-hosted, zero SaaS | ✅ Apache 2.0 | ❌ Cloud-first (self-host possible, but unsupported, no Helm) | ✅ MIT, but tracing-only | ✅ (but fragile) |
| Simulation replay | ✅ Native `.sim` format + diff UI | ❌ (only manual “compare runs”) | ❌ | ❌ |
| Guardrail enforcement (pre-LLM) | ✅ Python hooks + gateway | ❌ (post-hoc filtering only) | ❌ | ⚠️ (custom middleware) |
| Eval composition (e.g., `AND(toxicity < 0.2, tool_call_accuracy > 0.9)`) | ✅ YAML + Python DSL | ✅ (but locked to LangChain ecosystem) | ❌ (basic pass/fail only) | ⚠️ (manual) |
| Resource footprint (idle) | ~320MB RAM, 0.1 CPU | ~1.2GB RAM (backend + frontend) | ~850MB RAM | ~150MB (but no UI, no auth) |
| Trace storage backend | Postgres (configurable) | Postgres + S3 (for artifacts) | SQLite (dev), Postgres (prod) | SQLite (breaks at ~10k traces) |
Here’s the kicker: LangSmith’s eval UI is slick — but if you’re not using LangChain, good luck. Future-AGI’s `/evals/run` endpoint accepts any trace ID from any source (even if you logged it manually via `curl -X POST http://localhost:8000/traces -d '{"input":"...", "output":"..."}'`). That flexibility is why I switched after 3 days of LangSmith’s `langchain-core` versioning hell.
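That same manual logging works fine from Python with nothing but the stdlib. A quick sketch — the `/traces` endpoint mirrors the curl call above, but the `metadata` field and any payload shape beyond `input`/`output` are my assumptions:

```python
import json
import urllib.request

def build_trace(input_text: str, output_text: str, **metadata) -> dict:
    """Assemble a minimal trace payload like the manual curl example."""
    return {"input": input_text, "output": output_text, "metadata": metadata}

def log_trace(payload: dict, base_url: str = "http://localhost:8000") -> None:
    # POST to the same endpoint the curl example uses.
    req = urllib.request.Request(
        f"{base_url}/traces",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    log_trace(build_trace("What is 2+2?", "4", model="Qwen2.5-7B"))
```

Drop that into any middleware or agent loop and you get traces without touching LangChain.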
## Why Self-Host This? Who Actually Needs It?
Let’s cut the fluff: Future-AGI is for teams building AI agents that ship to real users — not demos or hackathon projects. Specifically:
- AI Infra Engineers who need to prove their RAG agent reduced support ticket hallucinations by 40% (not just “seems better”)
- Product Teams shipping agent-powered features (e.g., “customer onboarding bot”) and needing auditable eval reports for compliance (SOC2, HIPAA prep)
- ML Ops folks tired of `git diff`-ing `eval_results.json` and wanting reproducible simulation replay across model upgrades
- Security-conscious devs who won’t send PII through a third-party tracing service — and want guardrails enforced at the gateway, not logged and ignored
It is not for:
- Solo devs doing one-off LLM prompts (“just wanna try Qwen”)
- Teams already locked into LangChain + LangSmith + Vertex AI and happy with it
- Anyone needing built-in vector DBs or fine-tuning pipelines (it doesn’t do those — and shouldn’t)
Self-hosting isn’t just about privacy — it’s about control. I run mine behind caddy with forward_auth to my existing Authelia instance. I back up Postgres nightly to S3 with rclone. I set up alerting on evals.failed > 5% via Prometheus + Grafana (Future-AGI exposes /metrics). That level of integration? Impossible with SaaS.
## Hardware Requirements & Real-World Resource Usage
Future-AGI is lean — but don’t run it on a Raspberry Pi 4 (8GB) and expect joy. Here’s what I see on my setup:
- Baseline (API + Postgres + Redis, idle): ~480MB RAM, <0.05 CPU load
- Under load (100 evals running concurrently, 500 traces/day): ~1.1GB RAM, 0.3–0.6 CPU (peaking at 1.2 during simulation diffing)
- Storage: 1M traces ≈ 4.2GB (Postgres, with indexes). I’m at 87k traces after 11 days — using 380MB on disk.
- Minimum viable: 2 vCPU, 4GB RAM, 20GB SSD (for Postgres + traces + simulations)
- Recommended for teams: 4 vCPU, 8GB RAM, 100GB SSD (with room for dataset snapshots and eval artifacts)
No GPU required — all eval logic is CPU-bound Python (heavily optimized NumPy/Pandas where possible). Simulations do leverage litellm’s async, but that’s your LLM backend’s problem — not Future-AGI’s.
One note: the simulations diff UI loads large JSON traces into memory. If you’re replaying 50-step agent chains with 10KB outputs, keep an eye on the api container’s RSS. I added mem_limit: 2g to my compose file — and haven’t hit it.
## The Honest Take: Worth Deploying? Rough Edges?
Yes — if you’re building production AI agents and care about quality as much as speed.
I deployed Future-AGI on a Thursday. By Friday afternoon, I’d:
- Replayed 12 real user sessions (`.sim` files generated from my FastAPI middleware logs)
- Run them against `Qwen2.5-7B` and `Phi-3.5-mini`
- Discovered `Phi-3.5-mini` hallucinated 3× more on date-related tool calls, but was 40% faster
- Written a guardrail that strips `Authorization:` headers from incoming requests (we hit this in staging — oops)
- Exported an eval report as PDF for our product lead
That’s real value. Fast.
But — and this is critical — it’s early. Rough edges do exist:
- No RBAC yet (v0.4.2): Everyone with an API key is admin. Fine for a homelab; unacceptable for multi-team orgs. Workaround: I run it behind Authelia + Caddy with path-based auth (`/api/admin/*` blocked).
- UI feels “dev preview”: The trace tree view is functional, but lacks search-by-tag or bulk export. The simulation diff UI highlights changes but doesn’t show why (e.g., “model changed ‘2024-10-12’ → ‘2024-10-13’ because it misparsed relative time”). You still need to dive into raw JSON.
- Eval authoring isn’t visual: You write YAML or Python — no drag-and-drop eval builder. Not a dealbreaker, but a learning curve for non-dev PMs.
- Docs are light on advanced guardrail patterns: The example shows PII blocking, but not chaining (e.g., “if PII + high-risk intent → reject, else if PII → redact”). I had to read the source in `future_agi/guardrails/` to figure it out.
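For what it’s worth, the chaining pattern the docs skip is easy to express once you accept that a guardrail is just a function. A rough sketch under my own assumptions — the request/response shapes are invented for illustration, and the intent check is a toy stand-in for a real classifier:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
# Toy high-risk phrases — a real deployment would call an intent classifier.
HIGH_RISK = ("delete account", "wire transfer", "change password")

def looks_high_risk(text: str) -> bool:
    return any(phrase in text.lower() for phrase in HIGH_RISK)

def chained_guardrail(request: dict) -> dict:
    """Reject on PII + high-risk intent; redact on PII alone; else pass through."""
    text = request.get("input", "")
    has_pii = bool(EMAIL_RE.search(text))
    if has_pii and looks_high_risk(text):
        return {"action": "reject", "reason": "PII in high-risk request"}
    if has_pii:
        return {"action": "redact", "input": EMAIL_RE.sub("[REDACTED]", text)}
    return {"action": "pass", "input": text}
```

Reading `future_agi/guardrails/` was how I confirmed the general shape; treat the above as a pattern, not the library’s API.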
Also: the project is very new (first commit: Aug 2024). 83 stars in 2 months is promising — but it’s not LangChain-tier mature. That said, the maintainers are responsive (I filed an issue about Redis connection pooling — got a PR merged in 18 hours).
## Final Verdict
Future-AGI isn’t perfect. It won’t replace your model host. It won’t generate your RAG pipeline. But if you’re asking “Did this agent actually solve the user’s problem — and how do I prove it?”, then it’s the most coherent, self-hostable answer available today. It’s built like infrastructure — not a shiny dashboard. It favors composability over convenience. And in the AI tooling space, that’s rare.
I’m keeping it deployed. I’m contributing eval templates. And I’ll be watching v0.5 like a hawk.
Want to try it? Run this right now:
```bash
git clone https://github.com/future-agi/future-agi.git && \
cd future-agi && \
git checkout v0.4.2 && \
docker compose up -d --build && \
echo "✅ Check http://localhost:8000 — trace your first run in <60s"
```
No credit card. No telemetry opt-out toggle. Just open source, Apache 2.0, and the quiet confidence of knowing your AI doesn’t just run — it performs, improves, and stays safe. That’s not future AGI. That’s today’s sanity — served local.