Let’s be honest: you’re tired of juggling API keys across half a dozen LLM providers, manually tracking spend in spreadsheets, and praying your internal RAG app doesn’t leak OPENAI_API_KEY to a misconfigured Next.js build. You need a secure, auditable, centralized AI gateway — not another “AI proxy” that just slaps NGINX in front of /v1/chat/completions. Enter ThinkWatch: a Rust-built, enterprise-grade AI bastion host that’s quietly gaining traction (190+ GitHub stars as of May 2024) and actually delivers on its bold claim: “Secure API and MCP access, unified proxying, RBAC, audit logs, rate limiting, and cost tracking — across OpenAI, Anthropic, Gemini, and self-hosted LLMs.”

This isn’t just another llm-proxy crate. ThinkWatch is built like a zero-trust API gateway — hardened, minimal, and opinionated — and it’s the first tool I’ve seen that treats AI traffic like production infrastructure, not a sidecar experiment.

What Is ThinkWatch — And Why Does It Exist?

ThinkWatch isn’t a model host, a UI, or a LangChain wrapper. It’s a secure AI API bastion host: a hardened, single-binary reverse proxy that sits between your internal apps and external (or self-hosted) LLM endpoints. Think of it as the nginx + auth0 + prometheus + cloudwatch combo — but purpose-built for AI.

Its core value isn’t “making API calls easier.” It’s enforcing policy:

  • Your junior dev can’t accidentally burn $2k on gpt-4o by leaving a debug flag on in staging
  • Your compliance officer can pull a CSV of every /messages call made by team-marketing, including prompt tokens, response tokens, provider, model, and cost (yes — actual dollar cost, inferred from provider pricing tables)
  • Your SRE team can set rate_limit = "100 requests/hour" for anthropic.claude-3-5-sonnet-20240620 per service account, not per IP
  • Your security team can rotate one THINKWATCH_MASTER_KEY, not 17 .env files scattered across repos

ThinkWatch v0.4.2 (latest stable as of May 2024) supports OpenAI v1, Anthropic v1, Google Gemini (via google.generativeai REST), and any OpenAI-compatible self-hosted LLM (e.g., Ollama, vLLM, TGI, llama.cpp). It also supports MCP (Model Context Protocol) — yes, that emerging standard — for structured tool calling. That’s rare. Most proxies ignore MCP entirely.

How ThinkWatch Compares to Alternatives

If you’ve tried other tools, you’ve probably hit these pain points:

  • llama.cpp + llama-server: Great for local inference, zero auth, zero logging, zero cost tracking. You’re on your own for RBAC.
  • text-generation-inference (TGI): Powerful, but built for one model, no multi-provider proxying, and RBAC is bolted on via external auth (e.g., Keycloak).
  • Ollama: Developer-friendly, but no enterprise auth, no audit logs, no rate limiting beyond basic --num-ctx, and no cost visibility.
  • Plexus / LlamaIndex Gateway: More ML-framework than infrastructure — built for agents, not security. No native RBAC or billing.
  • NGINX + Lua scripts: Possible, but you’re writing and maintaining auth logic, token parsing, logging schema, rate-limiting counters, and cost math — all in Lua. No thanks.

ThinkWatch avoids this sprawl. It’s not a model server — it proxies. It’s not a framework — it’s a binary you deploy and forget (mostly). It ships with SQLite (default) or PostgreSQL for persistence, Rust-level memory safety, and a minimal attack surface (<15MB binary, no libc dependency when built with musl).

Here’s the kicker: ThinkWatch’s config is declarative and granular, not “set and pray.” You define:

  • providers: with API keys encrypted at rest (via age or openssl)
  • services: mapping internal service IDs (marketing-llm-client) to provider/model combos
  • policies: RBAC rules (e.g., role: analyst → allow: read on /v1/chat/completions)
  • quotas: per-service rate limits and spend caps (max_spend_usd_per_month = 450.0)

That’s not marketing fluff. That’s in config.yaml.

Installation & Docker-First Deployment

ThinkWatch is built for production — but it’s also delightfully simple to test locally. I spun it up on my M2 Mac (16GB RAM) in <90 seconds. For production, I run it on a t3.xlarge (4 vCPU / 16GB RAM) — overkill, but safe.

Quick Start (Docker Compose)

Use the official docker-compose.yml from the repo (v0.4.2):

# docker-compose.yml
version: '3.8'
services:
  thinkwatch:
    image: ghcr.io/thinkwatchproject/thinkwatch:0.4.2
    ports:
      - "8080:8080"
    environment:
      - THINKWATCH_CONFIG_PATH=/etc/thinkwatch/config.yaml
      - THINKWATCH_LOG_LEVEL=info
    volumes:
      - ./config.yaml:/etc/thinkwatch/config.yaml:ro
      - ./secrets:/run/secrets:ro
    restart: unless-stopped

You’ll need a config.yaml. Here’s a minimal working version (with real values I use in staging):

# config.yaml
server:
  bind_addr: "0.0.0.0:8080"
  tls: null  # disable TLS for local dev; enable in prod with cert paths

database:
  type: "sqlite"
  path: "/data/thinkwatch.db"

providers:
  - id: "openai-prod"
    type: "openai"
    base_url: "https://api.openai.com/v1"
    api_key: "age1...encrypted-key"  # encrypt with: echo "sk-..." | age -r "$(age-keygen -y key.txt)" -a
  - id: "ollama-local"
    type: "openai"
    base_url: "http://host.docker.internal:11434/v1"
    api_key: "ollama"  # dummy key — Ollama ignores it

services:
  - id: "marketing-rag"
    provider_id: "openai-prod"
    model: "gpt-4o"
    allowed_paths: ["/v1/chat/completions"]
  - id: "internal-qa"
    provider_id: "ollama-local"
    model: "llama3:8b"
    allowed_paths: ["/v1/chat/completions"]

policies:
  - role: "dev"
    permissions:
      - action: "read"
        resource: "/v1/chat/completions"
        service_id: "marketing-rag"
  - role: "analyst"
    permissions:
      - action: "read"
        resource: "/v1/chat/completions"
        service_id: "internal-qa"

quotas:
  - service_id: "marketing-rag"
    rate_limit: "1000/hour"
    max_spend_usd_per_month: 1200.0

Then run:

# First, generate an age keypair (do this once)
age-keygen -o key.txt

# Encrypt your OpenAI key (replace sk-abc123... with your real key).
# Note: `age -r` takes the *public* key; `age-keygen -y` derives it from key.txt.
echo "sk-abc123..." | age -r "$(age-keygen -y key.txt)" -a > secrets/openai_key.age

# Spin it up
docker compose up -d

That’s it. Your apps now hit http://localhost:8080/v1/chat/completions with an X-ThinkWatch-Service-ID: marketing-rag header — and ThinkWatch handles auth, routing, logging, and cost attribution.
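To smoke-test the route, here is a minimal client call sketched with Python's standard library. The URL, header name, and service ID come from the setup above; the payload is the standard OpenAI chat-completions shape:

```python
# Build a request against the gateway from the docker-compose setup above.
# The gateway URL and X-ThinkWatch-Service-ID header are from this post;
# the body is the standard OpenAI chat-completions format.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Say hello."}],
    }).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "X-ThinkWatch-Service-ID": "marketing-rag",
    },
    method="POST",
)

# With the stack running, uncomment to send it and print the proxied reply:
# print(urllib.request.urlopen(req).read().decode())
```

Swap marketing-rag for internal-qa and the same request routes to the local Ollama backend instead, no client-side key changes required.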

Why Self-Host ThinkWatch? (Who Is This Actually For?)

Let’s cut the enterprise buzzword bingo. ThinkWatch isn’t for:

  • Solo devs doing local ollama run mistral
  • Teams using only one LLM, with no compliance or spend concerns
  • Companies already running Kong + Auth0 + Datadog + custom billing pipelines

It is for:

  • AI platform teams building internal LLM platforms for 50+ engineers — who need to enforce usage policies before someone fine-tunes claude-3-opus on prod data
  • FinTech / HealthTech startups where “we log all API calls” isn’t a nice-to-have — it’s an audit requirement (SOC2, HIPAA)
  • ML Ops teams tired of writing custom middleware in FastAPI just to add X-Request-ID + X-Model-Cost headers
  • Self-hosting purists who want zero vendor lock-in and zero blind spots in their AI stack

Hardware-wise: it’s lightweight. On my test instance (Rust release build, SQLite backend), ThinkWatch uses ~85MB RAM idle, peaks at ~220MB under 50 RPS. CPU stays under 0.3 cores. No GPU needed — it’s a proxy, not a model runner. Disk usage? Under 15MB for the binary + ~100MB/month for logs + audit DB (with 10k requests/day). You can run it on a $5/month DO droplet if you’re not under heavy load.

Audit Logs, Cost Tracking, and That MCP Thing

ThinkWatch’s audit log isn’t just “request timestamp + status.” It’s structured, queryable, and enriched. Every log entry (SQLite or PostgreSQL) includes:

  • service_id, provider_id, model
  • prompt_tokens, completion_tokens, total_tokens
  • cost_usd (calculated using built-in pricing tables — e.g., gpt-4o input: $5.00/million tokens)
  • user_agent, client_ip, X-Request-ID
  • policy_matched, quota_remaining, rate_limit_remaining

You can query it directly:

SELECT service_id, model, SUM(cost_usd) as monthly_spend
FROM audit_logs
WHERE created_at >= '2024-05-01'
GROUP BY service_id, model
ORDER BY monthly_spend DESC;
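If you want to sanity-check that query without touching a production database, it runs unmodified against a toy SQLite table. The schema below is inferred from the field list above (the real table has more columns, and the types are my guesses):

```python
import sqlite3

# Toy audit_logs table with only the columns the query touches; the real
# schema also has token counts, client_ip, policy_matched, etc.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE audit_logs (
    service_id TEXT, model TEXT, cost_usd REAL, created_at TEXT)""")
con.executemany(
    "INSERT INTO audit_logs VALUES (?, ?, ?, ?)",
    [
        ("marketing-rag", "gpt-4o", 0.0105, "2024-05-02"),
        ("marketing-rag", "gpt-4o", 0.0210, "2024-05-03"),
        ("internal-qa", "llama3:8b", 0.0, "2024-05-03"),
    ],
)
results = list(con.execute("""
    SELECT service_id, model, SUM(cost_usd) AS monthly_spend
    FROM audit_logs
    WHERE created_at >= '2024-05-01'
    GROUP BY service_id, model
    ORDER BY monthly_spend DESC"""))
for row in results:
    print(row)
```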

And yes — it supports MCP. As of v0.4.2, ThinkWatch transparently proxies POST /v1/mcp/ (and GET /v1/mcp/tools) to Anthropic and OpenAI endpoints that support it, and rewrites tool calls for self-hosted LLMs that don’t — injecting mcp:// tool URIs into the system prompt. This is huge if you’re building tool-using agents and want to avoid vendor-specific tool schemas.
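To make that rewrite concrete, here is roughly what injecting tool URIs into a system prompt could look like. To be clear, the mcp:// rendering and prompt wording below are my illustration of the idea, not ThinkWatch's actual output:

```python
# Illustrative only: render MCP-style tool definitions into a system prompt
# for a backend with no native MCP support. The URI format and wording are
# assumptions, not ThinkWatch's real rewrite logic.
def inject_mcp_tools(system_prompt: str, tools: list[dict]) -> str:
    lines = [system_prompt, "", "Available tools (invoke by URI):"]
    for tool in tools:
        lines.append(f"- mcp://{tool['name']}: {tool['description']}")
    return "\n".join(lines)

tools = [{"name": "search_docs",
          "description": "Full-text search over internal docs"}]
print(inject_mcp_tools("You are a helpful assistant.", tools))
```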

The Rough Edges (My Honest Take)

I’ve run ThinkWatch in staging for 17 days. Here’s what’s solid — and where it stings:

  • Rust is real: Zero crashes, no memory leaks, no weird segfaults. The binary just works.
  • RBAC is actually usable: Unlike some “RBAC” implementations that just check a header, ThinkWatch validates against the full policy tree, including path + method + service ID.
  • Cost tracking is shockingly accurate: I compared its cost_usd against a raw OpenAI billing export — matched within $0.02 over 3 days.
  • Docker image is lean: ghcr.io/thinkwatchproject/thinkwatch:0.4.2 is 48MB — smaller than most Python-based proxies.

  • Docs are sparse: The GitHub README covers setup, but the config.yaml schema docs are buried in docs/config.md. It took me 20 minutes to find how to configure PostgreSQL.
  • No built-in dashboard: You get /health, /metrics (Prometheus), and raw DB access — but no Grafana dashboards or web UI for logs. You’ll need to build that yourself (or use sqlite-web).
  • MCP support is new: Only Anthropic + OpenAI endpoints work out of the box. Gemini’s MCP support is “coming soon” (per issue #42).
  • No OAuth2 / SAML out of the box: It supports API key auth and JWT, but you’ll need to wire up your IdP manually (e.g., via jwks_uri in config).
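For the JWT route, the wiring is a config block rather than code. A hypothetical fragment, assuming the shape of the rest of config.yaml: only the jwks_uri key name is confirmed by the docs; the surrounding structure and other keys are my guesses, so check docs/config.md before copying.

```yaml
# Hypothetical auth block; only `jwks_uri` is a confirmed key name.
auth:
  jwt:
    jwks_uri: "https://idp.example.com/.well-known/jwks.json"
    audience: "thinkwatch"
    issuer: "https://idp.example.com/"
```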

Is it worth deploying today? Yes — if you need production-grade AI governance and are willing to spend 2–3 hours wiring up logging + auth. It’s not “install and go” like Ollama, but it’s far more mature than 90% of the “AI gateway” crates on crates.io.

The project is young (first commit: Dec 2023), but the maintainer is responsive (3 PRs merged in last week), the issue tracker is well-organized, and the Rust codebase is clean — not “Rust because it’s trendy,” but Rust because it matters here.

TL;DR: Should You Deploy ThinkWatch?

  • Do it if: You’re building an internal AI platform, care about cost control, need audit trails, and want to avoid stitching together 5 different tools.
  • Skip it if: You just want a quick local proxy, or you’re not comfortable reading Rust docs or writing a config.yaml with nested structs.
  • Watch closely if: You use MCP or need SAML — those features are rolling out fast (v0.5 roadmap includes IdP integrations and a basic web UI).

ThinkWatch isn’t perfect. But it’s the first AI bastion host that feels like it was designed, not assembled. And in a world of half-baked LLM proxies, that’s rare — and worth your time.

I’m keeping it in staging. Next week, it hits prod. Let’s see how long it takes before someone tries to curl it without X-ThinkWatch-Service-ID. (Spoiler: the audit log will know.)