floe-labs/floe-guard: open-source utility for developers and tinkerers

A local budget guardrail for AI agents. It hard-stops your agent before its next LLM call when it would cross a spend ceiling — so a runaway loop dies at $0.10 instead of $4,000. No account, no signup, no network. Runs in your process.

pip install floe-guard

from floe_guard import BudgetGuard

guard = BudgetGuard(limit_usd=5.00)   # your ceiling
guard.check()                         # before each LLM call — raises if it'd cross
response = call_your_llm(...)         # your existing call
guard.record("gpt-4o", response.usage.prompt_tokens, response.usage.completion_tokens)

When the next call would cross the ceiling, the guard raises BudgetExceeded and prints:

BUDGET EXCEEDED — call blocked
  spent so far: $5.001250  |  ceiling: $5.000000
  The next call would cross your budget; floe-guard stopped your agent before it ran.

[Demo recording: floe-guard hard-stopping a runaway loop before it crosses a $0.10 ceiling]

Run it yourself: python examples/runaway_loop.py — no API key, no account, no network.

See it stop a loop (no API key needed)

python examples/runaway_loop.py

This rigs a loop against a stub LLM — no real API key, no account, no network. It prices each fake gpt-4o call offline and the guard halts the loop after a few iterations. This is the reproducible "stop the loop" demo.

Why floe-guard?

You can already see what your agent spends — the problem is seeing it too late. floe-guard is the part that stops the call, not the part that reports the damage.

max_tokens / max_rpm cap size and rate, not dollars — a cheap model stuck in a loop still drains the budget.
Usage logs and provider dashboards tell you what you spent after it's gone. floe-guard refuses the call before it crosses your ceiling.
A cost callback that just logs is notified after the fact and can't halt the run — enforcement has to stand in front of the next call. That's where it lives.
A hand-rolled spent += cost counter races under parallel agents (CrewAI fan-out, asyncio, Promise.all): N calls read the same under-limit total and all fire. floe-guard reserves atomically (reserve()/settle()), so the ceiling holds under concurrency.

The whole job: a hard stop before the next call that still holds under fan-out. No account, no network, no crypto.
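The race described above, and the reserve-then-settle remedy, can be sketched with a plain lock. This is a toy model under stated assumptions, not floe-guard's implementation: `ToyBudget`, `run_fan_out`, and the `reserve(amount_usd)` / `settle(reserved_usd, actual_usd)` signatures are hypothetical names chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyBudget:
    """Hypothetical stand-in for an atomic budget; not floe-guard's real class."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0      # settled spend
        self.reserved_usd = 0.0   # held by in-flight calls
        self._lock = threading.Lock()

    def reserve(self, amount_usd: float) -> bool:
        """Atomically hold amount_usd, or refuse if it would cross the limit."""
        with self._lock:
            if self.spent_usd + self.reserved_usd + amount_usd > self.limit_usd:
                return False
            self.reserved_usd += amount_usd
            return True

    def settle(self, reserved_usd: float, actual_usd: float) -> None:
        """Release the hold and book what the call actually cost."""
        with self._lock:
            self.reserved_usd -= reserved_usd
            self.spent_usd += actual_usd


def run_fan_out(workers: int, cost_usd: float, limit_usd: float) -> tuple[int, float]:
    """Fire `workers` fake calls in parallel; return (calls allowed, USD spent)."""
    budget = ToyBudget(limit_usd)

    def one_call(_: int) -> bool:
        if not budget.reserve(cost_usd):
            return False                    # blocked before the fake LLM call
        budget.settle(cost_usd, cost_usd)   # fake call cost exactly the estimate
        return True

    with ThreadPoolExecutor(max_workers=workers) as pool:
        allowed = sum(pool.map(one_call, range(workers)))
    return allowed, budget.spent_usd


allowed, spent = run_fan_out(workers=20, cost_usd=0.03, limit_usd=0.10)
```

Because the check and the hold happen under one lock, only three of the twenty workers get through at $0.03 each, whatever order the threads run in. A bare `spent += cost` counter has no such hold, so every worker can read the same under-limit total before any of them writes.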

How it works

The guard sits in the call path, not on an event bus. A passive listener is told about spend after the fact and can't halt anything — so enforcement has to be the thing standing in front of the next call:

check() runs before each LLM call. It predicts the next call's cost from the last one and raises BudgetExceeded if that would cross your ceiling — the call never runs. (A running-total check also catches an overshoot if an estimate came in low.)
record(model, prompt_tokens, completion_tokens) runs after each response. It prices the tokens offline from a bundled LiteLLM cost map and adds the USD to a running total.
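The check/record cycle above can be modeled in a few lines. This is a minimal sketch of the described behavior, not the library's source: `ToyGuard`, `ToyBudgetExceeded`, and the `TOY_PRICES` rates are made up for illustration, and the exact comparison operators are an assumption.

```python
class ToyBudgetExceeded(Exception):
    """Hypothetical stand-in for floe-guard's BudgetExceeded."""


# Made-up USD-per-token rates (input, output); not real vendor prices.
TOY_PRICES = {"toy-model": (1e-5, 2e-5)}


class ToyGuard:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.last_cost_usd = 0.0   # 0 until the first record()

    def check(self):
        # Running-total check: an earlier estimate may have come in low.
        if self.spent_usd > self.limit_usd:
            raise ToyBudgetExceeded("already over the ceiling")
        # Predictive check: assume the next call costs what the last one did.
        if self.spent_usd + self.last_cost_usd > self.limit_usd:
            raise ToyBudgetExceeded("next call would cross the ceiling")

    def record(self, model, prompt_tokens, completion_tokens):
        in_rate, out_rate = TOY_PRICES[model]
        cost = prompt_tokens * in_rate + completion_tokens * out_rate
        self.spent_usd += cost
        self.last_cost_usd = cost


guard = ToyGuard(limit_usd=0.10)
calls_made = 0
try:
    while True:                                 # a runaway loop
        guard.check()                           # raises before the 4th call
        calls_made += 1                         # stand-in for the LLM call
        guard.record("toy-model", 1000, 1000)   # $0.03 per call at the toy rates
except ToyBudgetExceeded as exc:
    stop_reason = str(exc)
```

With a $0.10 ceiling and $0.03 calls, three calls go through ($0.09 spent) and the fourth is refused because $0.09 + $0.03 would cross the limit.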

Unpriceable models fail closed

If a model isn't in the cost map and you didn't supply a price, the guard warns loudly and refuses (UnpriceableModelError) rather than silently treat it as free — you can't cap spend you can't measure. Give it a price to enforce it:

from floe_guard import BudgetGuard, ManualPrice

guard = BudgetGuard(
    limit_usd=5.00,
    price_overrides={"my-self-hosted-model": ManualPrice(1e-6, 2e-6)},  # USD/token
)
# or, set fail_closed=False to warn-and-skip for models you accept un-metered.
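The fail-closed rule itself is small enough to sketch standalone. The code below is an illustration only: `price_usd`, `ToyUnpriceableModelError`, and the plain-dict cost map are hypothetical, the (input, output) order of the two rates is an assumption, and treating a skipped model as $0.00 is one reading of "warn-and-skip".

```python
import warnings


class ToyUnpriceableModelError(Exception):
    """Hypothetical stand-in for floe-guard's UnpriceableModelError."""


def price_usd(model, prompt_tokens, completion_tokens, cost_map, fail_closed=True):
    """Price a call from (input_rate, output_rate) USD-per-token pairs."""
    rates = cost_map.get(model)
    if rates is None:
        warnings.warn(f"no price for model {model!r}")
        if fail_closed:
            # Refuse rather than silently meter the call as free.
            raise ToyUnpriceableModelError(model)
        return 0.0   # assumption: warn-and-skip leaves the call un-metered
    in_rate, out_rate = rates
    return prompt_tokens * in_rate + completion_tokens * out_rate


# Same rates as the ManualPrice(1e-6, 2e-6) override above.
cost_map = {"my-self-hosted-model": (1e-6, 2e-6)}
known_cost = price_usd("my-self-hosted-model", 1000, 500, cost_map)
```

Here `known_cost` is 1000 × 1e-6 + 500 × 2e-6 = $0.002. Any model missing from `cost_map` raises unless the caller opts out with `fail_closed=False`.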

Context-aware budgeting

The hard-stop is the guarantee; advisory() is the upside. Read it before a step to let your agent adapt as it nears the cap — taper to a cheaper model, shrink the task, or wrap up — instead of getting cut off mid-run.

guard = BudgetGuard(limit_usd=0.10, near_limit_bps=7000)   # flag at 70% used

adv = guard.advisory()
# BudgetAdvisory(near_limit=False, used_bps=125, remaining_usd=0.0987, ...)
model = "gpt-4o-mini" if adv.near_limit else "gpt-4o"        # downshift near the cap

guard.check()                  # still the hard line — taper or not, this holds
response = call_your_llm(model)
guard.record(model, response.usage.prompt_tokens, response.usage.completion_tokens)

advisory() returns near_limit, used_bps (utilization in basis points), remaining_usd, and the budget totals. It's a soft signal — the model may ignore it; check() is what enforces the ceiling. See examples/budget_aware.py for a runnable taper demo (no API key).
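For readers unfamiliar with basis points: 10,000 bps is 100% of the budget, so used_bps=125 on a $0.10 ceiling means $0.00125 spent. The arithmetic can be sketched as follows; `ToyAdvisory` and `toy_advisory` are hypothetical names, and the rounding and the `>=` threshold comparison are assumptions rather than the library's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAdvisory:
    """Hypothetical, trimmed-down version of the advisory fields."""
    near_limit: bool
    used_bps: int
    remaining_usd: float


def toy_advisory(spent_usd, limit_usd, near_limit_bps):
    # Utilization in basis points: 1 bps = 0.01% of the ceiling.
    used_bps = round(spent_usd / limit_usd * 10_000)
    return ToyAdvisory(
        near_limit=used_bps >= near_limit_bps,   # assumption: threshold is inclusive
        used_bps=used_bps,
        remaining_usd=max(limit_usd - spent_usd, 0.0),
    )


early = toy_advisory(spent_usd=0.00125, limit_usd=0.10, near_limit_bps=7000)
late = toy_advisory(spent_usd=0.08, limit_usd=0.10, near_limit_bps=7000)
```

`early` sits at 125 bps and is not near the limit; `late` sits at 8000 bps, past the 7000 bps threshold, so an agent reading it would downshift.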

This is the same advisory shape that hosted Floe returns on every proxied call (in the X-Floe-Budget-Advisory header), so the logic you write here ports unchanged. The hosted version differs only in what it can see: it answers across every vendor and cap, using server-side balances and rolling-window reset timing that a single local budget can't know. The TS package exposes the identical guard.advisory().

Framework adapters (optional extras)

CrewAI

pip install floe-guard[crewai]

from crewai import Crew
from floe_guard import BudgetGuard
from floe_guard.integrations.crewai import guard_crew

guard = BudgetGuard(limit_usd=1.00)
guard_crew(guard)              # one line — enforces across the whole crew
Crew(agents=[...], tasks=[...]).kickoff()

CrewAI runs on LiteLLM, so one callback caps every agent and task under a single budget.

LiteLLM

pip install floe-guard[litellm]

from floe_guard import BudgetGuard
from floe_guard.integrations.litellm import guarded_completion

guard = BudgetGuard(limit_usd=1.00)
response = guarded_completion(guard, model="gpt-4o", messages=[...])

Prefer the LiteLLM-native callback? Register budget_guard_callback(guard) on litellm.callbacks.

LangChain

pip install floe-guard[langchain] langchain-openai   # langchain-openai only for the ChatOpenAI example below

from langchain_openai import ChatOpenAI
from floe_guard import BudgetGuard
from floe_guard.integrations.langchain import budget_guard_callback_handler

guard = BudgetGuard(limit_usd=1.00)
llm = ChatOpenAI(model="gpt-4o", callbacks=[budget_guard_callback_handler(guard)])
llm.invoke("hello")            # checks budget before the call, records spend after

The handler checks the budget on LLM start (raising BudgetExceeded aborts the call before it runs) and records token usage on LLM end.

Vercel AI SDK

The Vercel AI SDK is TypeScript-only, so this adapter ships as a separate npm package that lives in js/.

npm i floe-guard ai@4 @ai-sdk/openai

import { wrapLanguageModel } from "ai";
import { openai } from "@ai-sdk/openai";
import { BudgetGuard, budgetGuardMiddleware } from "floe-guard";

const guard = new BudgetGuard(5.0);                   // your ceiling, in USD
const model = wrapLanguageModel({
  model: openai("gpt-4o"),
  middleware: budgetGuardMiddleware(guard),           // throws before crossing
});

The middleware check()s before each call (throwing BudgetExceeded to halt the run) and record()s priced usage after — same semantics as the Python guard. See js/README.md.

Honest about what this is

floe-guard is a local, estimate-based guardrail. It prices tokens from a vendored cost map inside your process:

The cost map can drift as vendors change prices — refresh it like any snapshot.
It only sees the vendors you instrument.
A determined agent or a bug could route around an in-process check.
Under heavy or cold-start concurrency it bounds steady-state spend, not the first parallel wave. Reservations size from the last call's cost (0 until the first record()), so the opening fan-out has nothing to estimate from. Pass a known per-call max to reserve() to bound it, or use hosted Floe for a hard cap under arbitrary concurrency.
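The cold-start limit in the last point is easy to see in a deterministic toy simulation. This is a sketch of the reasoning only: `first_wave` is a hypothetical helper that models a wave in which every worker reserves before any call settles, and it does not use floe-guard's actual reserve() signature.

```python
def first_wave(limit_usd, workers, actual_cost_usd, reserve_usd):
    """Simulate a parallel wave where all reservations happen before any settle.

    Returns (calls admitted, USD actually spent by the admitted calls).
    """
    held = 0.0
    admitted = 0
    for _ in range(workers):
        # Each worker reserves its estimate against the ceiling.
        if held + reserve_usd <= limit_usd:
            held += reserve_usd
            admitted += 1
    return admitted, admitted * actual_cost_usd


# Cold start: nothing recorded yet, so the estimate per call is $0.
cold_admitted, cold_spent = first_wave(0.10, workers=20, actual_cost_usd=0.03, reserve_usd=0.0)

# Same wave, but each reservation uses a known per-call max of $0.03.
capped_admitted, capped_spent = first_wave(0.10, workers=20, actual_cost_usd=0.03, reserve_usd=0.03)
```

With a zero estimate all 20 calls are admitted and about $0.60 is spent against a $0.10 ceiling. With a $0.03 per-call max only 3 are admitted and spend stays at about $0.09.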

It's genuinely useful on its own, and it's honest about its limits. No inflated metrics, no "zero defaults" claims — it's a free local stop, not a vault.

Upgrade to hosted Floe

When you need the ceiling to be un-bypassable and cross-vendor, hosted Floe moves enforcement server-side against a real credit line:

Un-bypassable — enforced at the spend rail, not in your process.
Cross-vendor — one budget over LLM tokens and paid (x402) tool calls.
Team budgets + analytics — shared ceilings, per-agent isolation, spend history.

Set FLOE_API_KEY (your agent key, floe_<hex>) and floe-guard can read your agent's server-side remaining budget from the live Floe endpoint:

from floe_guard import hosted_enforcement_available, hosted_remaining_usd

if hosted_enforcement_available():       # True when FLOE_API_KEY is set
    remaining = hosted_remaining_usd()   # USD left, read from Floe's server

hosted_remaining_usd() GETs /v1/agents/credit-remaining and returns the USD remaining — the minimum of your auto-borrow headroom and your session spend remaining. It raises HostedEnforcementError on a bad/missing key (401), a closed or suspended agent (403), an unprovisioned agent (404), or a network failure.

Env vars:

FLOE_API_KEY — your agent key. Required for the read.
FLOE_API_BASE_URL — override the API host (defaults to https://credit-api.floelabs.xyz).

Honest scope: this call only reads the remaining budget. The un-bypassable, cross-vendor enforcement is the hosted Floe product running server-side — not this client. Use the number to inform a local ceiling; the server stays the source of truth.
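One way to use the number to inform a local ceiling is to take the smaller of your static limit and the server-reported remainder. The sketch below is an assumption about how you might wire this up, not part of floe-guard: `local_ceiling_usd` and `failing_reader` are hypothetical, the reader is injected as a callable so the example runs without a key, and falling back to the static limit on a failed read is a design choice you may not want.

```python
def local_ceiling_usd(default_usd, read_remaining_usd):
    """Pick a local ceiling no higher than the server-reported remaining budget."""
    try:
        remaining = read_remaining_usd()
    except Exception:
        # Assumption: if the read fails, keep the static local ceiling.
        # In real use, narrow this to the library's HostedEnforcementError.
        return default_usd
    return max(0.0, min(default_usd, remaining))


def failing_reader():
    raise RuntimeError("network down")   # stand-in for a failed hosted read


tight = local_ceiling_usd(5.00, lambda: 1.25)    # server has less than the default
roomy = local_ceiling_usd(5.00, lambda: 40.00)   # default is already the tighter cap
fallback = local_ceiling_usd(5.00, failing_reader)
```

In real use you would pass `hosted_remaining_usd` as the reader and feed the result into `BudgetGuard(limit_usd=...)`. The server remains the source of truth; this only tightens the local stop.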

→ dev-dashboard.floelabs.xyz · floelabs.xyz

Built with floe-guard

Using floe-guard in your project? Add the badge so others find it:

Development

pip install -e ".[dev]"
pytest
ruff check .