# How to Cut Your AI API Costs by 30% Without Changing Models
2026-02-27
admin
Most teams overpay for AI API calls. Not because they picked the wrong model, but because they're ignoring three optimizations that require minimal code changes: prompt caching, smart model routing, and batch processing. Here's a breakdown of each technique with real numbers.

## 1. Prompt Caching: The Biggest Win

### How It Works

If your application sends the same system prompt with every request, you're paying full price for tokens the provider has already processed.

OpenAI caches prompts automatically for inputs over 1,024 tokens. Cached tokens cost 50% of the standard input price, and you don't need to change anything in your code.

Anthropic uses explicit caching via `cache_control` breakpoints. The write cost is 25% higher than standard input, but reads cost 90% less. The cache TTL is 5 minutes, extended on each hit.

### The Math

Take a typical customer support bot:

- System prompt: 2,000 tokens
- User message: 200 tokens average
- 5,000 requests/day using Claude Sonnet 4.6

Without caching:
```
Daily input cost = 5,000 × 2,200 tokens × $3.00/1M = $33.00
```
With Anthropic prompt caching (assuming a 95% cache hit rate):
```
Cache writes:   250 × 2,200 × $3.75/1M = $2.06
Cache reads:  4,750 × 2,200 × $0.30/1M = $3.14
User tokens:  5,000 ×   200 × $3.00/1M = $3.00
Daily total = $8.20 (75% savings on input costs)
```
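These figures are easy to sanity-check by parameterizing the arithmetic. A minimal sketch using the per-1M-token prices quoted above (the helper name is mine, not part of any SDK):

```python
def cached_daily_cost(requests, prompt_tokens, user_tokens, hit_rate,
                      input_price, write_price, read_price):
    """Daily input cost in USD with prompt caching; prices are per 1M tokens."""
    misses = requests * (1 - hit_rate)    # requests that write the cache
    hits = requests * hit_rate            # requests that read from it
    tokens = prompt_tokens + user_tokens  # 2,200 in the example above
    write_cost = misses * tokens * write_price / 1e6
    read_cost = hits * tokens * read_price / 1e6
    user_cost = requests * user_tokens * input_price / 1e6  # dynamic tokens at full price
    return write_cost + read_cost + user_cost

baseline = 5_000 * 2_200 * 3.00 / 1e6                                  # $33.00
cached = cached_daily_cost(5_000, 2_000, 200, 0.95, 3.00, 3.75, 0.30)  # ~$8.20
```

Note that the model mirrors the breakdown above: cached requests still pay the standard input price for the dynamic user tokens.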
### Implementation

For OpenAI models, caching is automatic. Just make sure your prompts exceed 1,024 tokens and keep the static prefix consistent across requests.

For Anthropic models, mark the static part of the prompt with a `cache_control` breakpoint:
```python
from anthropic import Anthropic

client = Anthropic(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc"
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Check cache performance in response.usage:
# cache_creation_input_tokens vs cache_read_input_tokens
```
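To confirm the cache is actually being hit in production, track those two usage counters over a window of requests. A small sketch (the helper and the dict-shaped usage records are mine; in the SDK the counters live on `response.usage`):

```python
def cache_hit_rate(usages):
    """Fraction of cacheable input tokens that were read from cache."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = reads + writes
    return reads / total if total else 0.0

# First request writes the cache; the next two read from it
usages = [
    {"cache_creation_input_tokens": 2200, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2200},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2200},
]
rate = cache_hit_rate(usages)  # 2/3 of cacheable tokens came from cache
```

If the rate is low, check that requests arrive within the 5-minute TTL and that the cached prefix is byte-identical across calls.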
## 2. Smart Model Routing: Use the Right Model for Each Task

### The Routing Strategy

Not every request needs your most expensive model. A classification task that GPT-4.1 handles for $2.00/1M input tokens works just as well with GPT-4.1-mini at $0.40/1M, a 5x cost reduction.

### Implementation
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-lemon-xxx",
    base_url="https://api.lemondata.cc/v1"
)

def route_request(task_type: str, messages: list) -> str:
    """Pick the cheapest model that handles this task well."""
    model_map = {
        "classification": "gpt-4.1-mini",
        "extraction": "gpt-4.1-mini",
        "summarization": "gpt-4.1-mini",
        "complex_reasoning": "gpt-4.1",
        "creative_writing": "claude-sonnet-4-6",
        "code_generation": "claude-sonnet-4-6",
    }
    model = model_map.get(task_type, "gpt-4.1-mini")
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response.choices[0].message.content
```
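The router above expects the caller to know the task type. When it doesn't, a cheap heuristic can fill the gap before falling back to the strong model. This classifier is a hypothetical sketch, not part of the setup above:

```python
def infer_task_type(messages: list) -> str:
    """Guess a task type from keywords; defaults to the expensive tier."""
    text = " ".join(str(m.get("content", "")) for m in messages).lower()
    keyword_map = [
        ("classification", ("classify", "which category", "label this")),
        ("summarization", ("summarize", "tl;dr", "key points")),
        ("extraction", ("extract", "pull out")),
        ("code_generation", ("write a function", "implement", "refactor")),
    ]
    for task_type, keywords in keyword_map:
        if any(k in text for k in keywords):
            return task_type
    return "complex_reasoning"  # when unsure, don't sacrifice quality

task = infer_task_type([{"role": "user", "content": "Summarize this thread"}])
# → "summarization"
```

Misrouting hard tasks to a cheap model costs more in retries than it saves, which is why the fallback here is the strong tier, not the cheap one.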
### Real Savings

A coding assistant that routes 60% of requests (linting, formatting, simple completions) to GPT-4.1-mini and 40% (architecture, debugging) to Claude Sonnet 4.6:
```
Before (all Claude Sonnet 4.6):
  1,000 req/day × 3K input × $3.00/1M = $9.00/day

After (60/40 split):
  600 req × 3K × $0.40/1M = $0.72/day (mini)
  400 req × 3K × $3.00/1M = $3.60/day (sonnet)
  Total = $4.32/day (52% savings)
```
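The same comparison works for any traffic mix. A tiny helper (mine, for illustration) that prices a blended workload:

```python
def blended_daily_cost(requests_per_day, input_tokens, splits):
    """splits: list of (traffic_fraction, input_price_per_1M) pairs."""
    return sum(
        requests_per_day * frac * input_tokens * price / 1e6
        for frac, price in splits
    )

before = blended_daily_cost(1_000, 3_000, [(1.0, 3.00)])              # $9.00
after = blended_daily_cost(1_000, 3_000, [(0.6, 0.40), (0.4, 3.00)])  # ~$4.32
```

Plugging in your own routing fractions shows the break-even point before you touch any production code.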
## 3. Batch Processing: Lower Prices for Non-Urgent Work

OpenAI offers a Batch API with a 50% discount on input and output tokens. The trade-off: results are delivered within 24 hours instead of in real time.

Good candidates for batching:

- Nightly content generation
- Bulk document classification
- Dataset labeling
- Scheduled report generation
```python
import json

# Create a batch file (JSONL format)
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-mini",
            "messages": [
                {"role": "system", "content": "Classify this document..."},
                {"role": "user", "content": doc}
            ]
        }
    })

# Write the JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Submit the batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
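Once the batch completes, the output file is also JSONL, one result per line keyed by `custom_id`. A parsing sketch with a deliberately simplified result shape (the real Batch API output carries more fields, such as status and error information):

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to the model's reply text."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        body = item["response"]["body"]
        results[item["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

sample = json.dumps({
    "custom_id": "doc-0",
    "response": {"body": {"choices": [{"message": {"content": "invoice"}}]}},
})
parsed = parse_batch_output(sample)  # → {"doc-0": "invoice"}
```

Keying results by `custom_id` matters because the Batch API does not guarantee output lines in submission order.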
## 4. Bonus: Reduce Token Count

Before optimizing at the API level, check whether you're sending more tokens than necessary. A 30% reduction in prompt length translates directly into 30% lower input costs. Common sources of waste:
- Verbose system prompts that repeat instructions the model already follows
- Including full conversation history when only the last 3-5 turns matter
- Sending raw HTML/markdown when plain text would work
- Not using `max_tokens` to cap output length
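The conversation-history point above is the easiest to automate. A minimal sketch (the helper name is mine) that keeps the system prompt plus only the most recent turns:

```python
def trim_history(messages: list, keep_turns: int = 6) -> list:
    """Drop older turns; the system prompt always survives."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_turns:]

history = [{"role": "system", "content": "You are a support agent."}]
history += [{"role": "user", "content": f"question {i}"} for i in range(10)]
trimmed = trim_history(history, keep_turns=4)  # system prompt + last 4 turns
```

For long sessions, pair this with a one-time summarization of the dropped turns so the model keeps the gist without the token bill.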
## Putting It All Together

These techniques compound. A team that implements all four can realistically cut its monthly API bill from $3,000 to under $1,000 without any degradation in output quality.

The key insight: cost optimization for AI APIs isn't about finding cheaper providers. It's about using the right model, at the right price tier, with the right caching strategy, for each specific task.

Start optimizing today: lemondata.cc gives you access to 300+ models through one API key, with full prompt caching support for OpenAI and Anthropic models.

Try LemonData free (300+ AI models, one key, 30-70% cheaper than official pricing) → lemondata.cc/r/DEVTO-CUT-AI-API-COSTS

Tags: how-to, tutorial, guide, dev.to, ai, ml, openai, gpt, routing