Let’s be honest: if you’re running AI engineering teams—or just trying to ship LLM-powered features without drowning in console.log() spaghetti—you’re probably flying blind. You track latency in Datadog, count tokens in a spreadsheet, and pray your RAG pipeline doesn’t silently hallucinate its way into production. There’s no unified, self-hostable dashboard that answers: “Which model version spiked error rates last Tuesday?”, “How many prompts actually hit the vector DB vs. falling back to keyword search?”, or “Who deployed that fine-tuned LoRA that’s now chewing 40% more GPU mem?” That’s where pellametric quietly shows up—not with hype, but with TypeScript, Prometheus metrics, and a GitHub repo that’s small enough to audit but sharp enough to use today.
What Is Pellametric? Real-World AI Engineering Observability
Pellametric is a self-hostable AI engineering analytics platform, built by the folks at Pella Labs (who also run the bematist.dev blog). As of June 2024, it has 42 stars on GitHub, written in TypeScript, and targets one very specific pain point: instrumenting AI pipelines—not infrastructure, not ML training, but the messy, middleware-heavy, LLM-orchestration layer.
Unlike generic APMs (Datadog, Grafana Tempo) or ML monitoring tools (Arize, WhyLabs), Pellametric assumes you’re already using LangChain, LlamaIndex, or even raw openai/anthropic SDKs—and gives you a lightweight agent + backend to capture prompt → LLM call → tool usage → output → feedback in structured, queryable form.
It’s not trying to replace your tracing stack. It complements it. You don’t need to re-architect your app to get value—just wrap your LLM calls with its @pellametric/agent:
import { createPellaAgent } from '@pellametric/agent';
const pella = createPellaAgent({
endpoint: 'http://localhost:3001/ingest',
apiKey: 'dev-key-123',
});
const response = await pella.llm({
model: 'gpt-4o',
prompt: 'Explain quantum entanglement like I’m 5',
temperature: 0.3,
metadata: { feature: 'faq-bot', version: 'v2.1.0' },
});
That one call ships:
- Prompt + completion (with token counts)
- Latency, model provider, region, retries
- Structured feedback (e.g.,
user_rating: 4,is_hallucinated: true) - Optional tool invocations (e.g.,
SearchTool,DatabaseQueryTool)
All of it lands in Pellametric’s Postgres-backed backend and shows up in a clean, filterable UI—no SaaS signups, no vendor lock-in.
Installation & Docker Deployment (It Actually Works)
I spun this up on a DigitalOcean $12/mo droplet (2 vCPU, 4GB RAM, 80GB SSD)—no Kubernetes, no Helm, just docker-compose. The docs say “just run docker-compose up”, but there are two gotchas you won’t find in the README (I filed an issue after hitting both):
- The
PROMETHEUS_ENABLED=trueenv var must be set even if you’re not scraping metrics—otherwise the/metricsendpoint 500s, breaking the frontend health check. - The UI expects
API_BASE_URLto be set at build time, not runtime—so you can’t just override it indocker-compose.ymlwith an env var. You must rebuild the frontend image with--build-arg API_BASE_URL=http://host.docker.internal:3001.
Here’s the working docker-compose.yml I use (tested with v0.4.2, the latest tagged release as of June 2024):
version: '3.8'
services:
pellametric-db:
image: postgres:15-alpine
environment:
POSTGRES_DB: pellametric
POSTGRES_USER: pella
POSTGRES_PASSWORD: pella_dev
volumes:
- ./pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U pella -d pellametric"]
interval: 30s
timeout: 10s
retries: 5
pellametric-backend:
image: ghcr.io/pella-labs/pellametric-backend:v0.4.2
restart: unless-stopped
environment:
DATABASE_URL: postgresql://pella:pella_dev@pellametric-db:5432/pellametric
PORT: 3001
API_KEY: dev-key-123
PROMETHEUS_ENABLED: "true"
NODE_ENV: production
depends_on:
pellametric-db:
condition: service_healthy
ports:
- "3001:3001"
pellametric-frontend:
image: ghcr.io/pella-labs/pellametric-frontend:v0.4.2
restart: unless-stopped
environment:
API_BASE_URL: http://localhost:3001
ports:
- "3000:3000"
depends_on:
pellametric-backend:
condition: service_healthy
Run it:
docker-compose up -d --build
docker-compose logs -f pellametric-backend
Wait ~30 seconds, then hit http://localhost:3000. Sign in with [email protected] / password. Done.
Note: The frontend image does not support runtime API_BASE_URL injection—this is a real limitation if you’re reverse-proxying behind Nginx or Cloudflare. I patched it locally with a 3-line change to Dockerfile.frontend, but upstream hasn’t merged that yet.
Pellametric vs. The Alternatives: Where It Fits (and Where It Doesn’t)
If you’ve tried Langfuse, Phoenix, or Arize, here’s the honest comparison:
| Tool | Self-hostable | LLM-Specific UI | Real-time Filtering | Token Cost Tracking | Learning Curve | Docker-Ready |
|---|---|---|---|---|---|---|
| Pellametric | ✅ Yes (Postgres + 2 services) | ✅ Clean, focused on prompts/feedback | ✅ Full Lucene-style search on metadata | ✅ Yes (via LLM provider integration) | ⭐ Low (5 min to first trace) | ✅ Yes (but see gotchas above) |
| Langfuse | ✅ Yes | ✅ Very polished, but overloaded with tracing | ✅ Yes, but needs manual index tuning | ✅ Yes, but requires provider config | ⚠️ Medium (UI is dense) | ✅ Yes |
| Phoenix | ✅ Yes | ❌ Minimal UI (debug-first) | ❌ Search is CLI-only or requires OpenInference export | ❌ No built-in cost calc | ⚠️ High (Python + Pydantic + OpenInference spec) | ⚠️ Partial (needs manual DB setup) |
| Arize | ❌ Cloud-only (free tier only) | ✅ Excellent, but SaaS-only | ✅ Yes | ✅ Yes | ❌ Vendor lock-in | N/A |
The kicker? Pellametric’s token cost tracking works out of the box. Set OPENAI_API_KEY in your app, and it pulls pricing from OpenAI’s official /v1/models endpoint at runtime—then multiplies by prompt_tokens + completion_tokens. No manual CSV uploads. No cost_per_million_tokens config.
That said: it doesn’t do drift detection. No concept of “model performance over time” like Phoenix. No embedding similarity dashboards. It’s observability, not ML monitoring.
Why Self-Host Pellametric? Who’s This Actually For?
Let’s cut the fluff: you should self-host Pellametric if you meet at least two of these:
- You’re sending PII, PHI, or proprietary prompts through LLMs (e.g., internal docs, customer support logs, legal contracts)
- You run >3 LLM endpoints in production and can’t remember which one uses
gpt-3.5-turbo-1106vs.gpt-3.5-turbo-0125 - You’ve built your own LangChain wrapper and want to prove that caching
llm.invoke()reduced latency by 320ms—not just guess - Your team uses
console.error("LLM failed")as observability
It’s not for:
- Teams who want auto-instrumentation (no OpenTelemetry auto-injector yet)
- Data scientists needing embedding clustering or UMAP plots
- Companies requiring SOC 2 compliance out of the box (no audit logs, no RBAC—just
admin/viewerroles)
Hardware-wise? I’ve run it for 3 weeks on that $12 droplet:
- Avg. RAM: ~1.1 GB (Postgres + backend + frontend)
- CPU: < 15% idle, spikes to 40% during bulk ingestion (e.g., backfilling 10k traces)
- Disk: ~280 MB after 12k traces (with prompt/completion text compressed in Postgres
textcolumns)
No GPU needed. No Node.js version hell—backend runs on Node 20.12, frontend on Vite 5.2.
The Rough Edges: What You’ll Actually Encounter
Here’s what I ran into—and what you’ll face too:
No RBAC (yet)
The UI hasadminandviewerroles, but both can delete all traces. There’s no per-project isolation. If your team has 5 squads sharing one instance, you’re trusting everyone not toDELETE FROM traces;. The GitHub issue is open since March 2024—no ETA.Feedback is “opt-in but not optional”
To seeuser_ratingoris_hallucinatedin filters, you must send those fields in everypella.llm()call. There’s no “default to null” UI toggle. If your frontend forgets to attachmetadata: { user_rating: 5 }, that feedback vanishes—not hidden, gone.No alerting
You can’t set “alert if error_rate > 5% for 5 minutes”. You get metrics (pellametric_llm_errors_total), but no built-in alert rules or email/SMS hooks. You’ll need Prometheus + Alertmanager—fine if you have it, painful if you don’t.Frontend build quirk (again)
Want to deploy behindhttps://ai-obs.yourco.com? You must rebuild the frontend image with--build-arg API_BASE_URL=https://ai-obs.yourco.com/api. Nonginx.confrewrite trick works—the frontend makes hardcoded/api/requests that 404. I spent 90 minutes on this. Don’t repeat my mistake.Docs lag behind code
The@pellametric/agentpackage on npm isv0.4.2, but the GitHub README still showspella.track()—a deprecated method. The actual working method ispella.llm(). Check the agent source if in doubt.
Final Verdict: Is Pellametric Worth Deploying Right Now?
Yes—but with guardrails.
I’ve run it for 17 days across 3 internal LLM services (a RAG chatbot, a Slack summarizer, and a document classifier). It caught two real issues:
- A misconfigured
temperature=1.0in prod that spiked hallucination flags by 70% - A
SearchTooltimeout bug that only surfaced when filtering byduration_ms > 5000
That’s value. Real, immediate, actionable value.
Is it polished? No. It’s a sharp, lightweight knife—not a Swiss Army one. You’ll tweak configs, rebuild images, and read source. But if you’re tired of stitching together console.log() + Datadog + Excel to answer “why did that LLM call fail?”, Pellametric gets you 80% there in <30 minutes.
The 42 stars? They’re not from hype. They’re from people like you—sysadmins who’d rather grep logs than click through SaaS dashboards.
So go ahead. Clone it. Run docker-compose up. Then go break something—and finally see it break.
Because in AI engineering, visibility isn’t optional. It’s the first line of defense.
git clone https://github.com/pella-labs/pellametric.git
cd pellametric
docker-compose up -d
# Then: http://localhost:3000 → login → start instrumenting
Comments