Tools: Prompt Engineering System: Managing 50+ Prompts in Production

Why You Can't Store Prompts in Code

Anatomy of a Prompt Engineering System

Registry: Centralized Prompt Storage

Approach 1: Langfuse Prompt Management

Approach 2: Prompts-as-Code

Testing: Eval Before Deploying a Prompt

Datasets: The Gold Standard

Eval Pipeline

CI/CD Integration

Deploy: Shipping Prompts Without Deploying Code

Instant Switch

Canary Deploy

Feature Flags

Monitor: Tying Metrics to Prompt Versions

Tracing with Prompt Version

Version Dashboard

Regression Alerts

Prompt Organization Patterns

Composition Over Monoliths

Naming Convention

Prompt Metadata

Scaling: From 5 to 500 Prompts

Prompt Management Tools

Common Mistakes

Context Engineering for Prompts

Where to Start

The average LLM project in production uses 20–50 prompts: classification, summarization, data extraction, response generation, quality evaluation. Each prompt requires iteration, and each iteration can break something that was working. At 50 prompts, managing them manually becomes chaos: who changed the classifier prompt? Why did summarizer accuracy drop? Which version is in production right now?

This article covers how to build a prompt management system that scales from 5 to 500 prompts.

Why You Can't Store Prompts in Code

A prompt looks like a string. Developers store it in code, next to the call logic. This works fine when there are only a few prompts and iterations are infrequent. Problems start at scale:

- Changing a prompt requires deploying the app. The prompt is hardcoded. To fix a single word in a system prompt, you need a PR, review, merge, and deploy. The iteration cycle takes hours instead of minutes.
- No versioning. Git stores history, but a diff on a 2,000-character prompt is unreadable. There's no fast path to roll back a prompt to a previous version without rolling back the entire app.
- No link between version and metrics. The prompt changed and quality dropped. Connecting a specific prompt version to specific metrics is manual work when the prompt lives in code.
- Cross-team chaos. The product manager wants to adjust the tone. The ML engineer is optimizing tokens. The developer is refactoring the template. All three are editing the same file, and the outcome is unpredictable.

Anatomy of a Prompt Engineering System

A mature prompt management system has four layers:

- Registry: a centralized prompt store with versioning, metadata, and access control.
- Testing: automated quality evaluation of a prompt against test datasets before deploying to production.
- Deploy: a mechanism to push a new prompt version to production without deploying the application.
- Monitor: tracking quality metrics tied to specific prompt versions.

You don't need to build all four layers at once. A minimum viable system is registry + deploy.
Without testing and monitoring, though, you're flying blind.

Registry: Centralized Prompt Storage

The registry solves the basic problem: a single source of truth for all prompts. There are two approaches.

Approach 1: Langfuse Prompt Management

Langfuse provides prompt management out of the box. Each prompt is a named entity with versions, labels, and variables. Application code fetches a prompt by name and label, then compiles it with variables.

The prompt is decoupled from code. A product manager edits the prompt in the UI, assigns the staging label, tests it, and switches to production. The application code stays the same.

Approach 2: Prompts-as-Code

For teams that prefer Git as the single source of truth, each prompt lives in the repository as a YAML file and is loaded through a small registry class.

The two approaches can be combined in a hybrid variant: prompts live in Git, and CI/CD syncs them to Langfuse on every merge to main.

Testing: Eval Before Deploying a Prompt

A prompt without tests is a gamble. Every change can silently break edge cases. Automated evaluation before deployment catches regressions before they reach users.

Datasets: The Gold Standard

Every prompt needs a test dataset. Minimum size: 20–30 examples covering the main scenarios and edge cases.

Eval Pipeline

For complex cases, LLM-as-Judge fits well. A judge model evaluates response quality against defined criteria: relevance, completeness, tone.

CI/CD Integration

Every PR touching prompts automatically runs the eval pipeline and posts results as a comment.

Deploy: Shipping Prompts Without Deploying Code

There are three strategies for delivering a new prompt version to production.

Instant Switch

The simplest option. Flip the production label to a new prompt version. Good for non-critical prompts and quick fixes. Risk: 100% of traffic hits the new version at once.

Canary Deploy

Gradual traffic shift: 5% → 25% → 50% → 100%. Canary and production metrics are compared in real time. If the canary degrades, it is rolled back automatically.

Feature Flags

Teams with an existing feature flag system (LaunchDarkly, Unleash, or homegrown) can select the prompt version through a flag. You can also target specific users, segments, or regions.

Monitor: Tying Metrics to Prompt Versions

Monitoring without version context is useless. Quality dropped, but what broke: the prompt, the model, the data?

Tracing with Prompt Version

Every LLM call should include the prompt version in its trace metadata.

Version Dashboard

Key metrics to monitor:

Prompt Organization Patterns

Composition Over Monoliths

A 3,000-token monolithic prompt is hard to test and maintain.
Break it into components.

Naming Convention

At 50+ prompts, consistent naming matters. The listings use a {domain}-{task}-{variant} pattern.

Prompt Metadata

Each prompt should carry metadata for auditing: owner, creation and last-tested dates, compatible models, average tokens, cost per call, test accuracy, and dataset size.

Scaling: From 5 to 500 Prompts

How the system evolves as the number of prompts grows:

- 10 prompts: you need a registry. Prompts in code become unmanageable.
- 30 prompts: you need CI eval. Manual testing doesn't scale; regressions slip through.
- 50 prompts: you need RBAC. Different teams own different prompts; access control becomes non-optional.
- 100 prompts: you need auto-rollback. Humans can't respond to regressions fast enough in real time.

Prompt Management Tools

Langfuse covers most scenarios: registry with versioning, prompt-to-trace linking, dataset-based evals, MCP server for IDE management. A detailed walkthrough is in the Langfuse guide.

Common Mistakes

- Prompts in .env or config files. No versioning, no testing, no connection to metrics. Fine for prototypes, falls apart in production.
- Testing on three examples. The prompt passes three tests and ships to production. A week later you discover it breaks on long inputs or edge case categories.
- No baseline. The new prompt version "works well." Without a baseline, there's nothing to compare against. The previous version may have been better.
- Optimizing tokens at the expense of quality. Prompt reduced from 800 to 300 tokens. Cost drops 60%. Accuracy drops from 0.94 to 0.81. Saving $50/month costs dozens of wrong responses every day.

Context Engineering for Prompts

A prompt doesn't exist in isolation. Quality depends on what's fed alongside it: context engineering determines which data enters the context window and in what order.

Three rules for production prompts:

- Variables instead of hardcoded values. Anything that might change (categories, languages, formats) goes into variables. The prompt stays stable.
- Few-shot examples at the end. Models tend to "see" the end of the context more clearly, so placing examples after the instructions often improves accuracy.
- Minimal context. Every extra token in the prompt dilutes the model's attention. If an instruction doesn't affect quality, remove it.

Where to Start

Week 1. Inventory.
Collect all prompts from your codebase into one place: YAML files in Git or Langfuse. Standardize the format: name, version, model, messages, variables.

Week 2. Datasets. For each prompt, collect 20–30 test examples from production logs. Label the expected output.

Week 3. Eval pipeline. A script that runs the prompt against the dataset and outputs accuracy. Triggered in CI when prompts change.

Week 4. Monitoring. Prompt version in every trace's metadata. Dashboard with metrics per version. Alert on > 10% degradation.

After a month, you have a working system where every prompt change is tested, versioned, and monitored. No chaos, no regressions, no "who changed this prompt?"
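The Week 1 step of standardizing the format can be backed by a small check that runs over every collected prompt definition. The sketch below is one possible shape, not part of the original system: the function name, the set of accepted roles, and the "unused variable" rule are assumptions for illustration, and the required fields are taken from the Week 1 list (name, version, model, messages, variables). It works on an already-parsed dictionary, so it is independent of whether the definition came from YAML or Langfuse.

```python
import re

# Field names taken from the Week 1 checklist; adjust to your own schema.
REQUIRED_FIELDS = ("name", "version", "model", "messages", "variables")
# Matches {{placeholder}} as used in the prompt templates.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def validate_prompt_definition(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition looks valid."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in data]

    used = set()
    for index, message in enumerate(data.get("messages") or []):
        # Assumed role set for chat prompts (hypothetical rule).
        if message.get("role") not in {"system", "user", "assistant"}:
            problems.append(f"messages[{index}]: unknown role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"messages[{index}]: empty content")
            continue
        used.update(PLACEHOLDER.findall(content))

    # A declared default that no message references is usually a leftover.
    for name in sorted(set(data.get("variables") or {}) - used):
        problems.append(f"unused variable: {name}")
    return problems
```

Placeholders without a declared default (such as ticket_text) are deliberately not flagged, since they are expected to be supplied at call time.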

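The LLM-as-Judge idea from the Eval Pipeline section (a judge model scoring relevance, completeness, and tone) could be sketched roughly as follows. Everything here is an assumption for illustration: the 1–5 scale, the prompt wording, the `min_mean` threshold, and the function names are hypothetical. The judge model is passed in as a plain callable (`call_judge`) that takes a prompt string and returns the model's text, so the sketch stays independent of any particular SDK.

```python
import json
from statistics import mean
from typing import Callable

CRITERIA = ("relevance", "completeness", "tone")

# Hypothetical judge prompt; doubled braces are literal braces for str.format.
JUDGE_TEMPLATE = """You are grading a model response.
Score each criterion from 1 (poor) to 5 (excellent): {criteria}.
Return only JSON, for example {{"relevance": 4, "completeness": 3, "tone": 5}}.

User input:
{user_input}

Response to grade:
{response}
"""


def build_judge_prompt(user_input: str, response: str) -> str:
    return JUDGE_TEMPLATE.format(
        criteria=", ".join(CRITERIA), user_input=user_input, response=response
    )


def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON reply; reject missing or out-of-range scores."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    scores = {}
    for criterion in CRITERIA:
        value = data.get(criterion)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {criterion}: {value!r}")
        scores[criterion] = value
    return scores


def judge_responses(pairs, call_judge: Callable[[str], str], min_mean: float = 4.0) -> dict:
    """Score (input, response) pairs and compare the overall mean to a threshold."""
    per_example = [parse_scores(call_judge(build_judge_prompt(i, r))) for i, r in pairs]
    if not per_example:
        raise ValueError("no examples to judge")
    means = {c: mean(s[c] for s in per_example) for c in CRITERIA}
    overall = mean(means.values())
    return {"means": means, "overall": overall, "passed": overall >= min_mean}
```

In practice `call_judge` would wrap a chat completion call to the judge model. Keeping it injectable also makes the aggregation logic testable in CI without spending tokens.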
Code Block

Copy

┌─────────────────────────────────────────────────┐ │ Prompt Engineering System │ ├────────────┬────────────┬────────────┬──────────┤ │ Registry │ Testing │ Deploy │ Monitor │ │ │ │ │ │ ├────────────┼────────────┼────────────┼──────────┤ │ Storage │ Pre-deploy │ Canary / │ Metrics │ │ + versions │ eval │ A/B rollout│ + alerts │ └────────────┴────────────┴────────────┴──────────┘ ┌─────────────────────────────────────────────────┐ │ Prompt Engineering System │ ├────────────┬────────────┬────────────┬──────────┤ │ Registry │ Testing │ Deploy │ Monitor │ │ │ │ │ │ ├────────────┼────────────┼────────────┼──────────┤ │ Storage │ Pre-deploy │ Canary / │ Metrics │ │ + versions │ eval │ A/B rollout│ + alerts │ └────────────┴────────────┴────────────┴──────────┘ ┌─────────────────────────────────────────────────┐ │ Prompt Engineering System │ ├────────────┬────────────┬────────────┬──────────┤ │ Registry │ Testing │ Deploy │ Monitor │ │ │ │ │ │ ├────────────┼────────────┼────────────┼──────────┤ │ Storage │ Pre-deploy │ Canary / │ Metrics │ │ + versions │ eval │ A/B rollout│ + alerts │ └────────────┴────────────┴────────────┴──────────┘ from langfuse import Langfuse langfuse = Langfuse() # Get the production version of a prompt prompt = langfuse.get_prompt( name="ticket-classifier", label="production" # or "staging", "latest" ) # Prompt with variables system_message = prompt.compile( categories="billing,technical,general,urgent", language="en" ) from langfuse import Langfuse langfuse = Langfuse() # Get the production version of a prompt prompt = langfuse.get_prompt( name="ticket-classifier", label="production" # or "staging", "latest" ) # Prompt with variables system_message = prompt.compile( categories="billing,technical,general,urgent", language="en" ) from langfuse import Langfuse langfuse = Langfuse() # Get the production version of a prompt prompt = langfuse.get_prompt( name="ticket-classifier", label="production" # or "staging", "latest" ) # Prompt with variables 
system_message = prompt.compile( categories="billing,technical,general,urgent", language="en" ) prompts/ ├── ticket-classifier/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ │ ├── dataset.jsonl │ └── eval.py ├── summarizer/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ └── prompt_registry.py prompts/ ├── ticket-classifier/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ │ ├── dataset.jsonl │ └── eval.py ├── summarizer/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ └── prompt_registry.py prompts/ ├── ticket-classifier/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ │ ├── dataset.jsonl │ └── eval.py ├── summarizer/ │ ├── prompt.yaml │ ├── config.yaml │ └── tests/ └── prompt_registry.py # prompts/ticket-classifier/prompt.yaml name: ticket-classifier type: chat model: gpt-4o-mini temperature: 0 messages: - role: system content: | You are a support ticket classifier. Categories: {{categories}}. Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."} Response language: {{language}}. - role: user content: "{{ticket_text}}" variables: categories: "billing,technical,general,urgent" language: "en" # prompts/ticket-classifier/prompt.yaml name: ticket-classifier type: chat model: gpt-4o-mini temperature: 0 messages: - role: system content: | You are a support ticket classifier. Categories: {{categories}}. Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."} Response language: {{language}}. - role: user content: "{{ticket_text}}" variables: categories: "billing,technical,general,urgent" language: "en" # prompts/ticket-classifier/prompt.yaml name: ticket-classifier type: chat model: gpt-4o-mini temperature: 0 messages: - role: system content: | You are a support ticket classifier. Categories: {{categories}}. Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."} Response language: {{language}}. 
- role: user content: "{{ticket_text}}" variables: categories: "billing,technical,general,urgent" language: "en" # prompt_registry.py import yaml from pathlib import Path class PromptRegistry: def __init__(self, prompts_dir: str = "prompts"): self.prompts_dir = Path(prompts_dir) self._cache = {} def get(self, name: str) -> dict: if name not in self._cache: prompt_path = self.prompts_dir / name / "prompt.yaml" with open(prompt_path) as f: self._cache[name] = yaml.safe_load(f) return self._cache[name] def compile(self, name: str, **variables) -> list[dict]: prompt = self.get(name) messages = [] for msg in prompt["messages"]: content = msg["content"] for key, value in {**prompt.get("variables", {}), **variables}.items(): content = content.replace(f"{{{{{key}}}}}", str(value)) messages.append({"role": msg["role"], "content": content}) return messages # prompt_registry.py import yaml from pathlib import Path class PromptRegistry: def __init__(self, prompts_dir: str = "prompts"): self.prompts_dir = Path(prompts_dir) self._cache = {} def get(self, name: str) -> dict: if name not in self._cache: prompt_path = self.prompts_dir / name / "prompt.yaml" with open(prompt_path) as f: self._cache[name] = yaml.safe_load(f) return self._cache[name] def compile(self, name: str, **variables) -> list[dict]: prompt = self.get(name) messages = [] for msg in prompt["messages"]: content = msg["content"] for key, value in {**prompt.get("variables", {}), **variables}.items(): content = content.replace(f"{{{{{key}}}}}", str(value)) messages.append({"role": msg["role"], "content": content}) return messages # prompt_registry.py import yaml from pathlib import Path class PromptRegistry: def __init__(self, prompts_dir: str = "prompts"): self.prompts_dir = Path(prompts_dir) self._cache = {} def get(self, name: str) -> dict: if name not in self._cache: prompt_path = self.prompts_dir / name / "prompt.yaml" with open(prompt_path) as f: self._cache[name] = yaml.safe_load(f) return self._cache[name] 
def compile(self, name: str, **variables) -> list[dict]: prompt = self.get(name) messages = [] for msg in prompt["messages"]: content = msg["content"] for key, value in {**prompt.get("variables", {}), **variables}.items(): content = content.replace(f"{{{{{key}}}}}", str(value)) messages.append({"role": msg["role"], "content": content}) return messages # ci/sync_prompts.py — called in CI pipeline from langfuse import Langfuse from prompt_registry import PromptRegistry langfuse = Langfuse() registry = PromptRegistry() for prompt_name in ["ticket-classifier", "summarizer", "response-generator"]: prompt_data = registry.get(prompt_name) langfuse.create_prompt( name=prompt_name, prompt=prompt_data["messages"], config={"model": prompt_data["model"], "temperature": prompt_data["temperature"]}, labels=["production"], ) # ci/sync_prompts.py — called in CI pipeline from langfuse import Langfuse from prompt_registry import PromptRegistry langfuse = Langfuse() registry = PromptRegistry() for prompt_name in ["ticket-classifier", "summarizer", "response-generator"]: prompt_data = registry.get(prompt_name) langfuse.create_prompt( name=prompt_name, prompt=prompt_data["messages"], config={"model": prompt_data["model"], "temperature": prompt_data["temperature"]}, labels=["production"], ) # ci/sync_prompts.py — called in CI pipeline from langfuse import Langfuse from prompt_registry import PromptRegistry langfuse = Langfuse() registry = PromptRegistry() for prompt_name in ["ticket-classifier", "summarizer", "response-generator"]: prompt_data = registry.get(prompt_name) langfuse.create_prompt( name=prompt_name, prompt=prompt_data["messages"], config={"model": prompt_data["model"], "temperature": prompt_data["temperature"]}, labels=["production"], ) {"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}} {"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}} {"input": "I 
want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}} {"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}} {"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}} {"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}} {"input": "I want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}} {"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}} {"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}} {"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}} {"input": "I want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}} {"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}} import json from openai import OpenAI from prompt_registry import PromptRegistry client = OpenAI() registry = PromptRegistry() def evaluate_prompt(prompt_name: str, dataset_path: str, threshold: float = 0.85): """Evaluate a prompt against a dataset. 
Return pass/fail.""" with open(dataset_path) as f: examples = [json.loads(line) for line in f] correct = 0 total = len(examples) failures = [] for example in examples: messages = registry.compile(prompt_name, ticket_text=example["input"]) response = client.chat.completions.create( model=registry.get(prompt_name)["model"], messages=messages, temperature=0, ) result = json.loads(response.choices[0].message.content) if result["category"] == example["expected"]["category"]: if result["confidence"] >= example["expected"]["confidence_min"]: correct += 1 else: failures.append({ "input": example["input"], "reason": f"low confidence: {result['confidence']}", }) else: failures.append({ "input": example["input"], "reason": f"wrong category: {result['category']}", }) accuracy = correct / total passed = accuracy >= threshold return { "accuracy": accuracy, "threshold": threshold, "passed": passed, "failures": failures, } import json from openai import OpenAI from prompt_registry import PromptRegistry client = OpenAI() registry = PromptRegistry() def evaluate_prompt(prompt_name: str, dataset_path: str, threshold: float = 0.85): """Evaluate a prompt against a dataset. 
Return pass/fail.""" with open(dataset_path) as f: examples = [json.loads(line) for line in f] correct = 0 total = len(examples) failures = [] for example in examples: messages = registry.compile(prompt_name, ticket_text=example["input"]) response = client.chat.completions.create( model=registry.get(prompt_name)["model"], messages=messages, temperature=0, ) result = json.loads(response.choices[0].message.content) if result["category"] == example["expected"]["category"]: if result["confidence"] >= example["expected"]["confidence_min"]: correct += 1 else: failures.append({ "input": example["input"], "reason": f"low confidence: {result['confidence']}", }) else: failures.append({ "input": example["input"], "reason": f"wrong category: {result['category']}", }) accuracy = correct / total passed = accuracy >= threshold return { "accuracy": accuracy, "threshold": threshold, "passed": passed, "failures": failures, } import json from openai import OpenAI from prompt_registry import PromptRegistry client = OpenAI() registry = PromptRegistry() def evaluate_prompt(prompt_name: str, dataset_path: str, threshold: float = 0.85): """Evaluate a prompt against a dataset. 
Return pass/fail.""" with open(dataset_path) as f: examples = [json.loads(line) for line in f] correct = 0 total = len(examples) failures = [] for example in examples: messages = registry.compile(prompt_name, ticket_text=example["input"]) response = client.chat.completions.create( model=registry.get(prompt_name)["model"], messages=messages, temperature=0, ) result = json.loads(response.choices[0].message.content) if result["category"] == example["expected"]["category"]: if result["confidence"] >= example["expected"]["confidence_min"]: correct += 1 else: failures.append({ "input": example["input"], "reason": f"low confidence: {result['confidence']}", }) else: failures.append({ "input": example["input"], "reason": f"wrong category: {result['category']}", }) accuracy = correct / total passed = accuracy >= threshold return { "accuracy": accuracy, "threshold": threshold, "passed": passed, "failures": failures, } # .github/workflows/prompt-eval.yml name: Prompt Evaluation on: pull_request: paths: - 'prompts/**' jobs: eval: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install dependencies run: pip install openai langfuse pyyaml - name: Run prompt evaluations env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: python ci/eval_prompts.py --changed-only - name: Comment PR with results uses: actions/github-script@v7 with: script: | const fs = require('fs'); const results = JSON.parse(fs.readFileSync('eval_results.json')); let body = '## Prompt Eval Results\n\n'; for (const [name, result] of Object.entries(results)) { const status = result.passed ? 
'✅' : '❌'; body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`; } github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body }); # .github/workflows/prompt-eval.yml name: Prompt Evaluation on: pull_request: paths: - 'prompts/**' jobs: eval: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install dependencies run: pip install openai langfuse pyyaml - name: Run prompt evaluations env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: python ci/eval_prompts.py --changed-only - name: Comment PR with results uses: actions/github-script@v7 with: script: | const fs = require('fs'); const results = JSON.parse(fs.readFileSync('eval_results.json')); let body = '## Prompt Eval Results\n\n'; for (const [name, result] of Object.entries(results)) { const status = result.passed ? '✅' : '❌'; body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`; } github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body }); # .github/workflows/prompt-eval.yml name: Prompt Evaluation on: pull_request: paths: - 'prompts/**' jobs: eval: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install dependencies run: pip install openai langfuse pyyaml - name: Run prompt evaluations env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: python ci/eval_prompts.py --changed-only - name: Comment PR with results uses: actions/github-script@v7 with: script: | const fs = require('fs'); const results = JSON.parse(fs.readFileSync('eval_results.json')); let body = '## Prompt Eval Results\n\n'; for (const [name, result] of Object.entries(results)) { const status = result.passed ? 
'✅' : '❌'; body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`; } github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body }); # In Langfuse UI: assign label "production" to prompt v14 # The app picks it up automatically on the next request prompt = langfuse.get_prompt( name="ticket-classifier", label="production", cache_ttl_seconds=300, # 5-minute cache ) # In Langfuse UI: assign label "production" to prompt v14 # The app picks it up automatically on the next request prompt = langfuse.get_prompt( name="ticket-classifier", label="production", cache_ttl_seconds=300, # 5-minute cache ) # In Langfuse UI: assign label "production" to prompt v14 # The app picks it up automatically on the next request prompt = langfuse.get_prompt( name="ticket-classifier", label="production", cache_ttl_seconds=300, # 5-minute cache ) import random def get_prompt_with_canary( name: str, canary_percentage: int = 10, ) -> tuple[dict, str]: """Return a prompt and its version (production or canary).""" if random.randint(1, 100) <= canary_percentage: prompt = langfuse.get_prompt(name=name, label="canary") return prompt, "canary" else: prompt = langfuse.get_prompt(name=name, label="production") return prompt, "production" import random def get_prompt_with_canary( name: str, canary_percentage: int = 10, ) -> tuple[dict, str]: """Return a prompt and its version (production or canary).""" if random.randint(1, 100) <= canary_percentage: prompt = langfuse.get_prompt(name=name, label="canary") return prompt, "canary" else: prompt = langfuse.get_prompt(name=name, label="production") return prompt, "production" import random def get_prompt_with_canary( name: str, canary_percentage: int = 10, ) -> tuple[dict, str]: """Return a prompt and its version (production or canary).""" if random.randint(1, 100) <= canary_percentage: prompt = langfuse.get_prompt(name=name, label="canary") return prompt, 
"canary" else: prompt = langfuse.get_prompt(name=name, label="production") return prompt, "production" def get_prompt_version(name: str, user_id: str) -> str: """Determine the prompt version via feature flag.""" flag = feature_flags.get(f"prompt_{name}_version") if flag.is_enabled(user_id): return flag.get_variant(user_id) # "v14", "v15" return "production" def get_prompt_version(name: str, user_id: str) -> str: """Determine the prompt version via feature flag.""" flag = feature_flags.get(f"prompt_{name}_version") if flag.is_enabled(user_id): return flag.get_variant(user_id) # "v14", "v15" return "production" def get_prompt_version(name: str, user_id: str) -> str: """Determine the prompt version via feature flag.""" flag = feature_flags.get(f"prompt_{name}_version") if flag.is_enabled(user_id): return flag.get_variant(user_id) # "v14", "v15" return "production" trace = langfuse.trace( name="ticket-classification", metadata={ "prompt_name": "ticket-classifier", "prompt_version": prompt.version, # 14 "prompt_label": "production", "model": "gpt-4o-mini", }, ) generation = trace.generation( name="classify", model="gpt-4o-mini", prompt=prompt, # Langfuse automatically links the version input=messages, output=response, ) trace = langfuse.trace( name="ticket-classification", metadata={ "prompt_name": "ticket-classifier", "prompt_version": prompt.version, # 14 "prompt_label": "production", "model": "gpt-4o-mini", }, ) generation = trace.generation( name="classify", model="gpt-4o-mini", prompt=prompt, # Langfuse automatically links the version input=messages, output=response, ) trace = langfuse.trace( name="ticket-classification", metadata={ "prompt_name": "ticket-classifier", "prompt_version": prompt.version, # 14 "prompt_label": "production", "model": "gpt-4o-mini", }, ) generation = trace.generation( name="classify", model="gpt-4o-mini", prompt=prompt, # Langfuse automatically links the version input=messages, output=response, ) # Example: automatic comparison of two 
prompt versions def compare_prompt_versions( prompt_name: str, version_a: int, version_b: int, metric: str = "accuracy", ) -> dict: """Compare metrics for two prompt versions from Langfuse.""" traces_a = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_a}, limit=1000, ) traces_b = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_b}, limit=1000, ) scores_a = [t.scores[metric] for t in traces_a if metric in t.scores] scores_b = [t.scores[metric] for t in traces_b if metric in t.scores] return { "version_a": {"version": version_a, "mean": sum(scores_a) / len(scores_a)}, "version_b": {"version": version_b, "mean": sum(scores_b) / len(scores_b)}, "diff": (sum(scores_b) / len(scores_b)) - (sum(scores_a) / len(scores_a)), } # Example: automatic comparison of two prompt versions def compare_prompt_versions( prompt_name: str, version_a: int, version_b: int, metric: str = "accuracy", ) -> dict: """Compare metrics for two prompt versions from Langfuse.""" traces_a = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_a}, limit=1000, ) traces_b = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_b}, limit=1000, ) scores_a = [t.scores[metric] for t in traces_a if metric in t.scores] scores_b = [t.scores[metric] for t in traces_b if metric in t.scores] return { "version_a": {"version": version_a, "mean": sum(scores_a) / len(scores_a)}, "version_b": {"version": version_b, "mean": sum(scores_b) / len(scores_b)}, "diff": (sum(scores_b) / len(scores_b)) - (sum(scores_a) / len(scores_a)), } # Example: automatic comparison of two prompt versions def compare_prompt_versions( prompt_name: str, version_a: int, version_b: int, metric: str = "accuracy", ) -> dict: """Compare metrics for two prompt versions from Langfuse.""" traces_a = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_a}, limit=1000, ) 
traces_b = langfuse.fetch_traces( name=f"{prompt_name}-eval", metadata={"prompt_version": version_b}, limit=1000, ) scores_a = [t.scores[metric] for t in traces_a if metric in t.scores] scores_b = [t.scores[metric] for t in traces_b if metric in t.scores] return { "version_a": {"version": version_a, "mean": sum(scores_a) / len(scores_a)}, "version_b": {"version": version_b, "mean": sum(scores_b) / len(scores_b)}, "diff": (sum(scores_b) / len(scores_b)) - (sum(scores_a) / len(scores_a)), } # Check metrics every 15 minutes (cron job or Langfuse webhook) def check_prompt_regression(prompt_name: str): current_version = langfuse.get_prompt(name=prompt_name, label="production").version recent_scores = get_recent_scores(prompt_name, current_version, hours=1) baseline = get_baseline_scores(prompt_name, current_version) if recent_scores["accuracy"] < baseline["accuracy"] * 0.9: # > 10% degradation alert( channel="slack", message=f"Regression detected: {prompt_name} v{current_version}. " f"Accuracy: {recent_scores['accuracy']:.2f} " f"(baseline: {baseline['accuracy']:.2f})", ) # Automatic rollback to previous version rollback_prompt(prompt_name, to_version=current_version - 1) # Check metrics every 15 minutes (cron job or Langfuse webhook) def check_prompt_regression(prompt_name: str): current_version = langfuse.get_prompt(name=prompt_name, label="production").version recent_scores = get_recent_scores(prompt_name, current_version, hours=1) baseline = get_baseline_scores(prompt_name, current_version) if recent_scores["accuracy"] < baseline["accuracy"] * 0.9: # > 10% degradation alert( channel="slack", message=f"Regression detected: {prompt_name} v{current_version}. 
" f"Accuracy: {recent_scores['accuracy']:.2f} " f"(baseline: {baseline['accuracy']:.2f})", ) # Automatic rollback to previous version rollback_prompt(prompt_name, to_version=current_version - 1) # Check metrics every 15 minutes (cron job or Langfuse webhook) def check_prompt_regression(prompt_name: str): current_version = langfuse.get_prompt(name=prompt_name, label="production").version recent_scores = get_recent_scores(prompt_name, current_version, hours=1) baseline = get_baseline_scores(prompt_name, current_version) if recent_scores["accuracy"] < baseline["accuracy"] * 0.9: # > 10% degradation alert( channel="slack", message=f"Regression detected: {prompt_name} v{current_version}. " f"Accuracy: {recent_scores['accuracy']:.2f} " f"(baseline: {baseline['accuracy']:.2f})", ) # Automatic rollback to previous version rollback_prompt(prompt_name, to_version=current_version - 1) # prompts/components/output-format.yaml name: output-format-json content: | Respond STRICTLY in JSON. No text before or after the JSON. If you cannot determine the answer, return {"error": "unable to classify"}. # prompts/components/language-rules.yaml name: language-rules content: | Response language: {{language}}. Do not translate proper nouns or technical terms. # prompts/components/output-format.yaml name: output-format-json content: | Respond STRICTLY in JSON. No text before or after the JSON. If you cannot determine the answer, return {"error": "unable to classify"}. # prompts/components/language-rules.yaml name: language-rules content: | Response language: {{language}}. Do not translate proper nouns or technical terms. # prompts/components/output-format.yaml name: output-format-json content: | Respond STRICTLY in JSON. No text before or after the JSON. If you cannot determine the answer, return {"error": "unable to classify"}. # prompts/components/language-rules.yaml name: language-rules content: | Response language: {{language}}. Do not translate proper nouns or technical terms. 
def compose_prompt(*component_names: str, **variables) -> str:
    """Assemble a prompt from components."""
    parts = []
    for name in component_names:
        component = registry.get(f"components/{name}")
        content = component["content"]
        # Fill every {{key}} placeholder with its value
        for key, value in variables.items():
            content = content.replace(f"{{{{{key}}}}}", str(value))
        parts.append(content)
    return "\n\n".join(parts)

# Usage
system_prompt = compose_prompt(
    "ticket-classifier-core",
    "output-format-json",
    "language-rules",
    categories="billing,technical,general",
    language="en",
)

{domain}-{task}-{variant}

ticket-classifier-v2
ticket-classifier-multilingual
order-summarizer-short
order-summarizer-detailed
response-generator-formal
response-generator-casual
quality-judge-relevance
quality-judge-toxicity

name: ticket-classifier
metadata:
  owner: ml-team
  created: 2026-01-15
  last_tested: 2026-03-20
  model_compatibility:
    - gpt-4o-mini
    - claude-3-5-sonnet-20241022
  avg_tokens: 450
  cost_per_call_usd: 0.002
  test_accuracy: 0.92
  dataset_size: 150

- Production logs. Real requests with labeled responses. The most valuable source.
- Manual labeling. For new prompts with no production data yet.
- Synthetic data. An LLM generates variations of existing examples. Useful for expanding edge case coverage.

- Variables instead of hardcoded values. Anything that might change (categories, languages, formats) goes into variables. The prompt stays stable.
- Few-shot examples at the end. Models tend to attend more reliably to the end of the context, so placing examples after the instructions often improves accuracy.
- Minimal context. Every extra token in the prompt dilutes the model's attention. If an instruction doesn't affect quality, remove it.
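The guidelines on variables and few-shot placement can be sketched in a few lines. This is an illustration under assumptions, not the article's implementation: the helper names `render` and `build_prompt` and the `Input:`/`Output:` example layout are hypothetical, and only the `{{name}}` placeholder syntax is taken from the component files.

```python
import re

# Matches {{name}} placeholders, the syntax used in the component files
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders; raise if a placeholder has no value."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing prompt variable: {key}")
        return str(variables[key])

    return PLACEHOLDER.sub(substitute, template)


def build_prompt(instructions: str, examples: list, **variables) -> str:
    """Render the instructions first, then append few-shot examples at the end."""
    parts = [render(instructions, variables)]
    if examples:
        shots = "\n\n".join(f"Input: {text}\nOutput: {label}" for text, label in examples)
        parts.append("Examples:\n" + shots)
    return "\n\n".join(parts)


prompt = build_prompt(
    "Classify the ticket into one of: {{categories}}.",
    [("I was charged twice", "billing"), ("The app crashes on login", "technical")],
    categories="billing,technical,general",
)
```

Unlike a plain `str.replace` loop, this version raises when a placeholder is left unfilled, so a missing variable shows up as an error at build time instead of a literal `{{categories}}` reaching the model.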