Tools: LLM Cost Tracking and Spend Management for Engineering Teams - 2025 Update

2026-04-01, by admin

The actual problem with LLM costs

What cost tracking actually requires

How we built cost tracking in Bifrost

Model Catalog with auto-synced pricing

Four-tier budget hierarchy

LogStore: per-request cost audit trail

Getting started

Cache-aware cost tracking

How other tools handle cost tracking

What we learned building this

Wrapping up

Your team ships a feature using GPT-4, it works great in staging, and then production traffic hits. Suddenly you are burning through API credits faster than anyone expected. Multiply that across three providers, five teams, and a few hundred thousand requests per day. Good luck figuring out where the money went.

We built Bifrost, an open-source LLM gateway in Go, and cost tracking was one of the first problems we had to solve properly. This post covers what we learned, how we designed spend management into the gateway layer, and what the alternatives look like. You can get started with the setup guide in under a minute.

TL;DR: Bifrost gives you per-request cost logging, four-tier budget hierarchies (Customer, Team, Virtual Key, Provider Config), auto-synced model pricing, and cache-aware cost calculations, all at 11 microseconds of latency overhead. You can run it right now with npx -y @maximhq/bifrost. Full details are in the Bifrost docs.

The actual problem with LLM costs

Cloud compute costs are predictable. You pick an instance type, you know the hourly rate, and you can forecast monthly spend within a few percent.

LLM costs are nothing like that. A single API call costs somewhere between $0.0001 and $0.50 depending on the model, the input length, the output length, whether you are sending images or audio, and whether the context crosses the 128k token threshold (where pricing tiers change). That is per request.

Now add multi-provider routing. Your app might use OpenAI for chat, Anthropic for analysis, and a smaller model for classification. Each provider has different pricing structures, different token counting methods, and different billing cycles. The result: engineering teams have no idea what they are spending until the invoice arrives.

What cost tracking actually requires

Most teams start with "we will check the provider dashboard." That breaks down fast for three reasons.

Per-request granularity. You need to know the cost of every single API call, tied to which customer, which team, and which feature triggered it.
Provider dashboards give you aggregate numbers, not per-request attribution.

Real-time budget enforcement. Knowing you overspent last month does not help. You need the system to reject requests when a budget limit is hit, before the money is gone.

Multi-modal cost calculation. If your app sends images, audio, or very long contexts, the cost calculation is not a simple token multiplication. You need tiered pricing support, per-image costs, per-second audio costs, and character-based pricing for certain models.

How we built cost tracking in Bifrost

We wanted cost management to be a gateway-level concern, not something each application team has to implement. Here is how the pieces fit together.

Model Catalog with auto-synced pricing

The Model Catalog is the foundation. It maintains pricing data for every supported model across all providers. On startup, Bifrost downloads the latest pricing sheet and loads it into memory. When a ConfigStore (SQLite or PostgreSQL) is available, it also persists the data and re-syncs every 24 hours automatically. You can also force a pricing sync at any time via the API. All lookups are O(1) from memory.

The pricing data covers multiple modalities: text (token-based and character-based), audio (token-based and duration-based), images (per-image), and tiered rates above 128k tokens. This means cost calculation is accurate for every request type, not an approximation based on token count alone.

Four-tier budget hierarchy

This is where spend management happens. Bifrost supports budgets at four levels: Customer, Team, Virtual Key, and Provider Config. Each budget has a max_limit, a reset_duration (daily, weekly, monthly), and tracks current_usage in real time.

Creating a customer with a budget is a single POST to /api/governance/customers carrying a name and a budget object. The response includes the budget object with current_usage tracked automatically. When current_usage hits max_limit, requests are rejected. No surprises on the invoice.

LogStore: per-request cost audit trail

Every request that passes through Bifrost gets logged with full cost data. The LogStore captures the provider and model used, the input, output, and total tokens, the calculated cost (broken down into input cost, output cost, request cost, and total cost), the status, and the full input/output content.

You can query this data with filters. Want to see all requests to OpenAI that cost more than $0.10 in the last hour? That is a single API call.
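As a rough illustration of the "last hour" query described above, here is a minimal Python sketch that builds the search body with a computed start_time and posts it to the log search endpoint. The endpoint path and field names are taken from the curl example in this article; the function names, the default base URL, and the timestamp format are assumptions for illustration, and the shape of the response is not assumed beyond being JSON.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_log_search_payload(now, provider="openai", min_cost=0.10, limit=50):
    """Build a search body covering the hour before `now`.

    `now` must be a timezone-aware datetime. Field names mirror the curl
    example in the article; this helper itself is hypothetical.
    """
    start = (now - timedelta(hours=1)).astimezone(timezone.utc)
    return {
        "filters": {
            "providers": [provider],
            "min_cost": min_cost,
            "start_time": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        "pagination": {"limit": limit, "sort_by": "cost", "order": "desc"},
    }


def search_logs(payload, base_url="http://localhost:8080"):
    """POST the payload to /api/logs/search and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/logs/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the request body; call search_logs(body) against a running gateway.
    body = build_log_search_payload(datetime.now(timezone.utc))
    print(json.dumps(body, indent=2))
```

Computing start_time at call time keeps the query aligned with "the last hour" instead of a fixed date.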
The response includes aggregated stats alongside individual logs: total requests, success rate, average latency, total tokens, and total cost for the query. This is the data you need for cost attribution and chargeback.

Getting started

You can have this running in under a minute with npx -y @maximhq/bifrost, or with Docker if you prefer containerized deployment.

Then point your LLM calls at the Bifrost endpoint instead of directly at the provider. It works as a drop-in replacement for the OpenAI SDK, Anthropic SDK, and Bedrock SDK. Cost tracking, budget enforcement, and logging happen automatically. Check the setup docs for configuration details.

Cache-aware cost tracking

This is a detail that matters more than you would expect. Bifrost includes a dual-layer semantic cache (exact hash matching + semantic similarity via Weaviate). When a request hits the cache, the cost calculation changes: a direct (exact-match) hit costs nothing, a semantic hit costs only the embedding generation, and a cache miss with storage costs the base model usage plus the embedding generation for storing the result.

If you are not tracking cache-aware costs, your cost reports will overcount. Every cache hit that gets reported at full model price inflates your numbers and hides the ROI of caching.

How other tools handle cost tracking

Credit where it is due. There are several tools in this space, and they each take a different approach.

Helicone is a proxy-based observability platform. It logs requests and provides cost analytics through a dashboard. The cost tracking is solid, with per-request granularity. Where it differs from Bifrost: Helicone is primarily an observability tool. Budget enforcement and cache-aware cost calculations are not its focus. It is a good choice if you want analytics without gateway-level controls.

OpenRouter acts as a unified API layer across multiple LLM providers. It handles routing and gives you a single bill, which simplifies accounting. However, OpenRouter is a hosted proxy, so your requests pass through their infrastructure. There is no self-hosted option, no budget enforcement at the gateway level, and no per-customer or per-team spend hierarchy. If you need cost attribution beyond "which model was called," you will need to build that yourself on top of their logs.

AWS API Gateway + Bedrock is what many AWS-native teams reach for. You get IAM-based access control and CloudWatch metrics. The limitation is that cost tracking is coarse-grained: you get aggregate billing through AWS Cost Explorer, not per-request cost breakdowns tied to your internal teams or customers. Building a four-tier budget hierarchy on top of AWS services means stitching together Lambda, DynamoDB, and custom billing logic. It works, but it is a lot of glue code.

Kong AI Gateway and Cloudflare AI Gateway both provide rate limiting and basic analytics for AI API traffic. Kong gives you plugin-based extensibility, and Cloudflare gives you edge caching and DDoS protection. Neither provides built-in per-request cost calculation with multi-modal pricing awareness, and neither offers the kind of budget hierarchy where you can set spending caps at the customer, team, and key level with automatic enforcement.

LiteLLM is the most well-known Python-based proxy. It supports cost tracking and has wide model coverage. The trade-off is performance. LiteLLM adds roughly 8ms of latency overhead per request. Bifrost adds 11 microseconds; on those two figures, that is roughly 700 times less overhead per request (8 ms / 11 µs ≈ 727). At 5,000 RPS, that difference compounds. The math is straightforward: at 5,000 requests per second, 8ms of overhead means 40 seconds of cumulative latency overhead per second of wall time. At 11 microseconds, it is 0.055 seconds. If your use case is low-throughput internal tooling, LiteLLM works fine. If you are running production workloads at scale, the latency overhead matters.

What we learned building this

A few things surprised us during development.

Pricing data goes stale fast. Providers update pricing regularly. We started with a static pricing file and quickly realized it needed to be auto-synced. The 24-hour sync interval with O(1) memory lookups was the balance we settled on. You can also trigger a manual pricing sync via POST /api/pricing/force-sync if a provider drops prices and you want immediate accuracy.
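To make the "sync periodically, look up in memory" idea concrete, here is a minimal Python sketch. It is not Bifrost's Go implementation: the class and field names are hypothetical, the prices in any usage are placeholders, and the tiering rule (the whole request billed at the higher rate once input exceeds 128k tokens) is an assumption chosen for illustration. Only the 24-hour interval and the 128k threshold come from the article.

```python
from dataclasses import dataclass

SYNC_INTERVAL_SECONDS = 24 * 60 * 60  # 24-hour re-sync interval from the article
TIER_THRESHOLD_TOKENS = 128_000       # 128k tiering threshold from the article


@dataclass(frozen=True)
class ModelPrice:
    """Per-token rates for one model (hypothetical structure)."""
    input_per_token: float
    output_per_token: float
    # Rates applied when the input exceeds the threshold (assumed tiering rule).
    input_per_token_above: float
    output_per_token_above: float


class ModelCatalog:
    """In-memory pricing table keyed by (provider, model)."""

    def __init__(self, prices, synced_at):
        self._prices = dict(prices)
        self._synced_at = synced_at  # seconds since some epoch

    def needs_sync(self, now):
        """True once the last sync is at least one interval old."""
        return now - self._synced_at >= SYNC_INTERVAL_SECONDS

    def replace(self, prices, now):
        """Swap in a freshly downloaded pricing sheet (scheduled or forced sync)."""
        self._prices = dict(prices)
        self._synced_at = now

    def cost(self, provider, model, input_tokens, output_tokens):
        """Text-only cost; a dict lookup is O(1) on average."""
        price = self._prices[(provider, model)]
        if input_tokens > TIER_THRESHOLD_TOKENS:
            return (input_tokens * price.input_per_token_above
                    + output_tokens * price.output_per_token_above)
        return (input_tokens * price.input_per_token
                + output_tokens * price.output_per_token)
```

Replacing the whole table on sync keeps the per-request path to a single dictionary lookup, which is the trade-off the paragraph above describes.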

Budget enforcement needs to be in the hot path. We tried implementing budgets as an async check initially. The problem: by the time the async check ran, the request was already sent to the provider and the cost was incurred. Budget checks have to happen before the request goes upstream. That is why Bifrost handles it at the gateway layer with in-memory state.

Multi-modal cost calculation is harder than it looks. Text-only cost is straightforward: multiply tokens by price per token. But when a request includes images, the cost depends on the image resolution and the token context length. Audio adds per-second pricing. Some models charge per character instead of per token. The Model Catalog handles all of this, but getting it right required modelling each provider's pricing structure individually.

Cost attribution needs hierarchy. Flat per-key budgets are not enough for real organizations. An engineering team needs to know: "How much is Customer X spending? How much of that is Team Y? Which virtual key is burning through budget?" That is why we built the four-tier hierarchy (Customer, Team, Virtual Key, Provider Config). You can create virtual keys via the API and attach budgets to each level.

Wrapping up

LLM cost management is not optional for production systems. If you are routing requests across multiple providers without per-request cost tracking, budget enforcement, and cache-aware calculations, you are flying blind.

For enterprise teams, Bifrost also supports audit logs, log exports, and intelligent load balancing.

Bifrost is open-source, written in Go, and runs with a single command. It handles cost tracking at the gateway layer so your application code does not have to.

If you are dealing with LLM spend management, give it a try and let us know what is missing. We are actively building based on what teams actually need.
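The hot-path budget check described under "What we learned building this" can be sketched as follows. This is a simplified Python illustration, not Bifrost's implementation: the Budget class, the BudgetExceeded exception, and the call_provider callable are hypothetical, and reset_duration handling is left out. The rejection rule (reject once current_usage reaches max_limit) follows the article.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """One budget in the hierarchy; field names follow the article."""
    max_limit: float
    current_usage: float = 0.0

    def exhausted(self):
        return self.current_usage >= self.max_limit


class BudgetExceeded(Exception):
    """Raised with the name of the level whose budget is used up."""


def check_budgets(budgets):
    """Raise before the upstream call if any level is at or over its limit."""
    for level, budget in budgets.items():
        if budget.exhausted():
            raise BudgetExceeded(level)


def record_cost(budgets, cost):
    """Charge the same request cost to every level it belongs to."""
    for budget in budgets.values():
        budget.current_usage += cost


def handle_request(budgets, call_provider):
    check_budgets(budgets)   # hot path: runs before any money is spent
    cost = call_provider()   # hypothetical callable returning the request's cost
    record_cost(budgets, cost)
    return cost
```

Because the cost is only known after the provider responds, a check-then-charge design like this one can overshoot a limit by up to one request's cost; the next request is then rejected before it goes upstream.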

Examples and lists referenced above

Model Catalog: pricing modalities

- Text: token-based and character-based pricing for chat completions, text completions, and embeddings
- Audio: token-based and duration-based pricing for speech synthesis and transcription
- Images: per-image costs with tiered pricing for high-token contexts
- Tiered pricing: automatic rate changes above 128k tokens, reflecting actual provider pricing

Four-tier budget hierarchy: budget levels

- Customer - set a spending cap for an entire customer account
- Team - limit spend per team within a customer
- Virtual Key - control costs per API key (useful for per-feature or per-environment budgets)
- Provider Config - cap total spend on a specific provider

Creating a customer with a budget:

```shell
curl --request POST \
  --url http://localhost:8080/api/governance/customers \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "acme-corp",
    "budget": {
      "max_limit": 500,
      "reset_duration": "monthly"
    }
  }'
```

Response:

```json
{
  "customer": {
    "id": "cust-abc123",
    "name": "acme-corp",
    "budget": {
      "id": "bdgt-xyz",
      "max_limit": 500,
      "reset_duration": "monthly",
      "current_usage": 0
    }
  }
}
```

LogStore: fields captured per request

- Provider and model used
- Input tokens, output tokens, total tokens
- Calculated cost (broken down into input cost, output cost, request cost, total cost)
- Status (success or error)
- Full input/output content (serialized as JSON)

Searching logs for OpenAI requests that cost more than $0.10:

```shell
curl --request POST \
  --url http://localhost:8080/api/logs/search \
  --header 'Content-Type: application/json' \
  --data '{
    "filters": {
      "providers": ["openai"],
      "min_cost": 0.10,
      "start_time": "2026-03-31T00:00:00Z"
    },
    "pagination": {
      "limit": 50,
      "sort_by": "cost",
      "order": "desc"
    }
  }'
```

Getting started:

```shell
npx -y @maximhq/bifrost
```

Cache-aware cost tracking: cost by cache outcome

- Direct cache hit (exact match): zero cost. The response comes from cache, no provider API call is made.
- Semantic cache hit (similar query found): the cost is the embedding generation cost only. No model inference cost.
- Cache miss with storage: the cost is the base model usage plus the embedding generation cost for storing the result.

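The three cache outcomes listed above map naturally onto a small cost function. The following Python sketch is only an illustration of those rules as the article states them; the enum, the function name, and the idea of passing the model and embedding costs in as precomputed numbers are assumptions, not Bifrost's API.

```python
from enum import Enum


class CacheOutcome(Enum):
    DIRECT_HIT = "direct_hit"                # exact hash match
    SEMANTIC_HIT = "semantic_hit"            # similar query found
    MISS_WITH_STORAGE = "miss_with_storage"  # provider called, result stored


def request_cost(outcome, model_cost, embedding_cost):
    """Return the cost to report for one request, given its cache outcome.

    model_cost is what the provider call would cost at full price;
    embedding_cost is the cost of generating the embedding.
    """
    if outcome is CacheOutcome.DIRECT_HIT:
        return 0.0  # served from cache, no provider call
    if outcome is CacheOutcome.SEMANTIC_HIT:
        return embedding_cost  # embedding only, no model inference
    if outcome is CacheOutcome.MISS_WITH_STORAGE:
        return model_cost + embedding_cost  # inference plus embedding for storage
    raise ValueError(f"unknown cache outcome: {outcome!r}")
```

Reporting every request at model_cost regardless of outcome is exactly the overcounting the article warns about: the first two branches would be charged at full price even though no inference happened.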