Tools: How to Add LLM Drift Monitoring to Your CI/CD Pipeline (Free, 5 Minutes)

Most LLM monitoring advice says "run evals in CI." What it doesn't say: how to structure those evals so you catch the class of failures that actually breaks production — format regressions, instruction compliance drift, punctuation changes in single-token outputs. Here's a practical CI/CD setup using DriftWatch's free tier that catches behavioral drift before it reaches production.

The Problem with Standard LLM CI

A typical LLM test in CI looks like this:

```python
def test_sentiment_classifier():
    response = call_llm("Classify: 'great product'. Return one word.")
    assert response.strip().lower() in ["positive", "negative", "neutral"]
```

This passes even when the model starts returning "Positive" instead of "positive": the `.strip().lower()` normalization hides exactly the casing and whitespace changes that break any downstream code doing exact-match comparison. Unit tests verify your code. They don't verify whether the model is still behaving the same way it did when you wrote the code.

What Drift Monitoring Adds

Drift monitoring compares live model behavior against a saved baseline:

- Establish baseline — run your production prompts, save outputs
- Run on schedule (or in CI) — same prompts, same parameters
- Score the delta — format compliance + semantic similarity + output length
- Alert on threshold — 0.3 = investigate, 0.5 = page

The key difference: you're comparing against previous model behavior, not against a hardcoded expected value.

Setting Up DriftWatch in 5 Minutes (Free Tier)

Free tier: 3 prompts, no card required.

Step 1: Register and get your API key

```bash
# Register
curl -X POST https://your-driftwatch-url/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "[email protected]", "password": "yourpassword"}'

# Save the api_key from the response
API_KEY="dw_your_api_key_here"
```

Step 2: Add your production prompt

```bash
curl -X POST https://your-driftwatch-url/prompts \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sentiment-classifier",
    "prompt_text": "Classify the sentiment as exactly one word: positive, negative, or neutral. Review: \"The product works fine but packaging was damaged.\"",
    "model": "gpt-4o",
    "validators": ["single_word", "word_in:positive,negative,neutral"]
  }'
```

Step 3: Run a drift check in CI

```yaml
# .github/workflows/llm-drift-check.yml
name: LLM Drift Check
on:
  schedule:
    - cron: '0 * * * *'  # hourly
  push:
    branches: [main]
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - name: Run drift check
        run: |
          RESULT=$(curl -s -X POST https://your-driftwatch-url/monitor/run \
            -H "Authorization: Bearer ${{ secrets.DRIFTWATCH_API_KEY }}")
          MAX_DRIFT=$(echo "$RESULT" | jq '.summary.max_drift')
          echo "Max drift: $MAX_DRIFT"
          # Fail CI if drift exceeds threshold
          if (( $(echo "$MAX_DRIFT > 0.5" | bc -l) )); then
            echo "BREAKING CHANGE: drift score $MAX_DRIFT exceeds threshold"
            exit 1
          fi
          if (( $(echo "$MAX_DRIFT > 0.3" | bc -l) )); then
            echo "WARNING: drift score $MAX_DRIFT above alert threshold"
          fi
```

What This Catches That Unit Tests Miss

The checks above are format and behavioral consistency checks — not correctness checks — which is exactly the class of change your unit tests won't catch.

Real Example: Why This Matters

In our own test run (same model, two consecutive calls, no update), we got:

- inst-01 (single-word classifier): drift score 0.575 — "Neutral." → "Neutral". Both pass the `word_in:positive,negative,neutral` validator, but `response.strip() == "Neutral."` is now false.
- json-01 (JSON extraction): drift score 0.316 — whitespace stripped, trailing period removed from a value. `json.loads()` works; `baseline == current` does not.

These are natural variance samples — not even from a model update. When a model actually gets updated, drift scores are higher and more consistent.

When to Alert vs When to Fail

For most production pipelines, I'd set the CI failure threshold at 0.5 and the alert threshold at 0.3. You get a Slack/email notification at 0.3 to investigate, and CI blocks the deploy at 0.5.

Getting Started

The free tier gives you 3 prompts and hourly monitoring — enough to protect your most critical LLM calls.

→ Set up your first drift check — 5 minutes, no card required.

The GitHub repo (including the drift detection algorithm) is at GenesisClawbot/llm-drift if you want to self-host.

What's the highest-risk LLM call in your production stack? For most teams it's either a JSON extraction or a classification prompt — those are the ones where small format changes have the biggest downstream impact.
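The post points to the GenesisClawbot/llm-drift repo for the real drift detection algorithm rather than spelling it out, so here is only a toy sketch of the scoring idea described above (format compliance + semantic similarity + output length). The weights are made up, and `difflib` string similarity stands in for real semantic similarity:

```python
import difflib

def drift_score(baseline: str, current: str) -> float:
    """Toy drift score: 0.0 = identical behavior, higher = more drift.

    Combines the three deltas the article lists: format compliance,
    textual similarity, and output length. Weights are illustrative only.
    """
    # Format delta: did surface features change? (whitespace, casing, trailing period)
    format_delta = sum([
        baseline.strip() != current.strip(),
        baseline.strip().lower() != current.strip().lower(),
        baseline.rstrip(".") != current.rstrip("."),
    ]) / 3.0

    # Similarity delta: difflib as a cheap stand-in for semantic similarity
    similarity_delta = 1.0 - difflib.SequenceMatcher(None, baseline, current).ratio()

    # Length delta: relative change in output length
    max_len = max(len(baseline), len(current), 1)
    length_delta = abs(len(baseline) - len(current)) / max_len

    return round(0.5 * format_delta + 0.3 * similarity_delta + 0.2 * length_delta, 3)

print(drift_score("positive", "positive"))  # 0.0: identical outputs
print(drift_score("Neutral.", "Neutral"))   # small but nonzero: punctuation-only drift
```

Punctuation-only drift scores low but nonzero under weights like these, which matches the intuition; the actual numbers in the article (0.575, 0.316) come from DriftWatch's own algorithm, not this sketch.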
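The prompt registration above names two validators, `single_word` and `word_in:positive,negative,neutral`. Their real implementations live in the DriftWatch codebase; a plausible minimal reading, assuming punctuation- and case-insensitive matching (which would explain why both "Neutral." and "Neutral" pass while exact-match code breaks), looks like:

```python
def single_word(output: str) -> bool:
    """Pass if the response is exactly one whitespace-delimited token."""
    return len(output.split()) == 1

def word_in(allowed: str):
    """Build a validator from a 'positive,negative,neutral'-style allow list.

    Case and edge punctuation are normalized before matching, so
    'Neutral.' still passes even though exact-match code downstream breaks.
    """
    allowed_words = {w.strip().lower() for w in allowed.split(",")}

    def check(output: str) -> bool:
        return output.strip().strip(".!,").lower() in allowed_words

    return check

validators = [single_word, word_in("positive,negative,neutral")]
response = "Neutral."
print(all(v(response) for v in validators))  # True: both validators pass...
print(response.strip() == "Neutral")         # False: ...but exact match breaks
```

This is exactly the inst-01 failure mode: lenient validators stay green while strict downstream comparisons go red.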
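The json-01 result is easy to reproduce in miniature. The field name and values below are invented for illustration; the point is only that parse-level tests stay green while byte-level comparison flags the drift:

```python
import json

# Hypothetical baseline vs. current snapshot: the model dropped a
# trailing period inside a JSON string value (the json-01 pattern).
baseline = '{"name": "Widget Pro."}'
current = '{"name": "Widget Pro"}'

# json.loads() works on both, so a "does it parse?" test passes...
print(json.loads(baseline))  # {'name': 'Widget Pro.'}
print(json.loads(current))   # {'name': 'Widget Pro'}

# ...but exact-match comparison against the baseline now fails
print(baseline == current)  # False
```

A schema validator would also pass both snapshots, which is why drift monitoring compares against the saved baseline rather than against a structural rule alone.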