# Why Your Chaos Experiments Are Probably Wasting Time (and How to Fix It)
2026-03-06
## The actual problem

You have 20 microservices. You want to run chaos experiments. Where do you start?

If your answer is "the payment service" — why? Because it feels important? Because it failed last week? Because LitmusChaos defaulted to it?

Most teams pick chaos targets the same way they pick where to eat lunch — gut feel, recent memory, or whoever spoke loudest in the meeting. That's fine when you're running 2 services. It breaks down fast when you're running 20.

Chaos engineering has a prioritization gap. The tooling is excellent at *how* to break things — LitmusChaos, Chaos Mesh, and Gremlin all do this well. None of them tell you *what* to break next. The result: teams either test the same high-visibility services repeatedly, or they run random experiments and hope they hit something real. Both approaches leave systematic gaps.

The framing that fixed this for me came from fault tree analysis:

```
risk = impact × likelihood
```

- **Impact** — if this service degrades, how many others are affected?
- **Likelihood** — based on history, how probable is a failure?

A payment service with 15 downstream dependents and a clean incident history is a different risk than an auth service with 3 dependents and 5 high-severity incidents this month. Treating them identically isn't principled chaos engineering — it's just noise.

## What I built

ChaosRank takes your Jaeger trace export and incident history and produces a ranked list of services to target next, with a suggested fault type and confidence level for each:

| Rank | Service           | Risk | Blast | Fragility | Suggested Fault   | Confidence |
|------|-------------------|------|-------|-----------|-------------------|------------|
| 1    | payment-service   | 0.91 | 0.88  | 0.96      | partial-response  | high       |
| 2    | auth-service      | 0.84 | 0.79  | 0.91      | latency-injection | high       |
| 3    | inventory-service | 0.71 | 0.82  | 0.54      | pod-failure       | medium     |

Blast radius is computed from your dependency graph — a blend of PageRank and in-degree centrality. PageRank captures transitive failure propagation; in-degree captures direct dependents. Neither alone is sufficient: a shallow-wide hub (payment called by 10 services) scores differently from a deep chain (A→B→C→D→E), but both are high-risk.

Fragility is where the interesting engineering lives. The naive approach — count incidents, divide by traffic — produces ranking inversions at high traffic differentials: a service with 5x more incidents can rank below a quieter one if you normalize after aggregation. The fix is per-incident normalization: evaluate each event in its own traffic context before aggregating. Order matters.

## Does it actually work?

I evaluated it on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset, a published research dataset with real microservice call graphs and anomaly-injection traces. The result: 9.8x faster to the first real weakness — the mean across 20 trials, with seeded weaknesses placed based on structural importance and incident history.

## The honest limitations

ChaosRank builds its graph from synchronous Jaeger spans. If your architecture is heavily event-driven — Kafka, SQS, async choreography — your blast radius scores will be wrong. A Kafka producer consumed by 10 downstream services shows zero dependents in trace data. That's a ranking inversion, not just a missing feature. A warning fires at startup when async messaging services are detected.

It also only reads Jaeger JSON today. OTel OTLP is next.

Note: two warnings will fire against the included sample data — a phantom-node warning for nginx-compose-post (expected: it's an entry point with no callers) and a signal-misalignment warning (expected: the benchmark dataset has uniform incident severity by design). With your own incident history these warnings won't fire.

## Try it

```shell
pip install chaosrank-cli
git clone https://github.com/Medinz01/chaosrank
cd chaosrank
chaosrank rank --traces benchmarks/real_traces/social_network.json --incidents benchmarks/real_traces/social_network_incidents.csv
```

The output pipes directly to LitmusChaos:

```shell
chaosrank rank --output litmus | kubectl apply -f -
```

If you're running chaos experiments and want to talk through whether this fits your setup — particularly if you have async dependencies — I'm around in the comments.
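To make the fault-tree framing concrete, here's a toy sketch that combines the two axes into a single score and sorts by it. The blast and fragility numbers come from the example table above; note that the table's risk column is evidently not a bare product of the other two, so this illustrates the framing only, not ChaosRank's actual scoring.

```python
# Toy illustration of risk = impact x likelihood.
# Blast/fragility scores are taken from the example table above; the
# real risk column differs, so this is the framing, not the tool's math.
services = {
    # service: (blast radius = impact, fragility = likelihood)
    "payment-service":   (0.88, 0.96),
    "auth-service":      (0.79, 0.91),
    "inventory-service": (0.82, 0.54),
}

# Combine the two axes multiplicatively and rank descending.
risk = {svc: impact * likelihood for svc, (impact, likelihood) in services.items()}
ranked = sorted(risk, key=risk.get, reverse=True)

for rank, svc in enumerate(ranked, start=1):
    print(rank, svc, round(risk[svc], 2))
```

Even this crude product reproduces the table's ordering: payment-service is high on both axes, while inventory-service's wide blast radius is discounted by its clean history.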
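The blast-radius blend described above can be sketched in a few lines. Everything here is a stand-in for illustration — the power-iteration PageRank, the 50/50 blend weight `alpha`, and the max-normalization are my assumptions, not ChaosRank's implementation.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iters=50):
    """Plain power-iteration PageRank over (caller, callee) edges.
    Edges point at the dependency, so heavily-depended-upon services
    accumulate rank."""
    nodes = {n for edge in edges for n in edge}
    out = defaultdict(list)
    for caller, callee in edges:
        out[caller].append(callee)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out[v] or list(nodes)  # dangling nodes spread evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank

def blast_radius(edges, alpha=0.5):
    """Blend max-normalized PageRank with max-normalized in-degree.
    alpha is an invented blend weight, not ChaosRank's parameter."""
    pr = pagerank(edges)
    indeg = defaultdict(int)
    for _, callee in edges:
        indeg[callee] += 1
    max_pr = max(pr.values())
    max_in = max(indeg.values())
    return {v: alpha * pr[v] / max_pr + (1 - alpha) * indeg[v] / max_in
            for v in pr}

# A shallow-wide hub (payment called by 10 services) plus a deep chain.
edges = [(f"svc{i}", "payment") for i in range(10)]
edges += [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
scores = blast_radius(edges)
print(max(scores, key=scores.get))  # the wide hub tops the ranking
```

The blend is what keeps both shapes visible: the end of the deep chain scores through accumulated PageRank, while the wide hub scores through in-degree, and neither signal alone would rank both correctly.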
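The normalization-order issue described above is easy to reproduce with a toy example. The traffic figures, service names, and helper functions below are invented for illustration; this is not ChaosRank's code.

```python
# Toy demonstration that normalization order matters for fragility.
# All numbers and names are invented for illustration.

# Each incident recorded as (count, traffic in requests/min at the time).
incidents = {
    # 5 incidents; one happened during a quiet-traffic period
    "checkout": [(1, 200), (1, 10_000), (1, 10_000), (1, 10_000), (1, 10_000)],
    # 1 incident at moderate traffic
    "reports":  [(1, 1_000)],
}

def fragility_aggregate(events):
    """Naive: total incidents divided by average traffic."""
    total = sum(count for count, _ in events)
    avg_traffic = sum(traffic for _, traffic in events) / len(events)
    return total / avg_traffic

def fragility_per_incident(events):
    """Fix: normalize each incident in its own traffic context first."""
    return sum(count / traffic for count, traffic in events)

for fn in (fragility_aggregate, fragility_per_incident):
    ranked = sorted(incidents, key=lambda s: fn(incidents[s]), reverse=True)
    print(fn.__name__, "->", ranked)
```

With these numbers the aggregate version ranks reports above checkout — the 5-incident service loses to the quieter one because its high-traffic periods dilute the average — while the per-incident version restores checkout to the top. That is the inversion the ordering fix exists to prevent.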
Tags: tools, utilities, security-tools, chaos