# Circuit Breakers Under Stress: Anatomy of a Payment Cascade

A flash sale hit us at 10x baseline RPS. Within four minutes, our Payment Service circuit breaker tripped to OPEN, error rate climbed to 92%, and p99 latency on the payment path went from 200ms to 14.2 seconds.

Here's the part nobody tells you on the conference circuit: the circuit breaker didn't fail. It worked exactly as designed. The failure was everywhere else. This is a postmortem of what we saw, why Resilience4j's defaults weren't enough, and the four changes that made the next sale boring.

## The setup

Standard Java microservices stack. Spring Cloud Gateway in front, JWT auth via Keycloak, Resilience4j wrapping every outbound call. Payment Service synchronously calls Stripe. Order Service synchronously calls Payment. PostgreSQL for orders, Redis for circuit breaker state, Kafka for the dead-letter queue. Six services. Five circuit breakers. One very stressed thread pool.

## What 10x RPS actually does

Baseline was around 1,000 RPS. The flash sale pushed us to 10,243. The edge layer absorbed it fine — NGINX did its job, the rate limiter degraded gracefully, the CDN cached anything cacheable. Spring Cloud Gateway routed cleanly. The wheels came off at the Payment Service.

Stripe's p99 latency under load climbed from a healthy 800ms to 14.2 seconds. That doesn't sound catastrophic until you do the math: every Payment thread now holds for ~14s instead of <1s. With a fixed thread pool, throughput collapses long before the breaker notices.

```yaml
# What we had — Resilience4j defaults, lightly tuned
resilience4j.circuitbreaker:
  instances:
    paymentService:
      failureRateThreshold: 50
      slidingWindowSize: 100
      slidingWindowType: COUNT_BASED
      waitDurationInOpenState: 30s
      permittedNumberOfCallsInHalfOpenState: 10
```

A 50% failure threshold over 100 calls means the breaker waits for 50 failures before tripping. At 10x load with timeouts, that's roughly four minutes of users staring at spinners. By the time the breaker opened, the thread pool was already 98% saturated.

## The cascade, step by step

1. Flash-sale spike hits the gateway at 10x RPS.
2. Order Service synchronously calls Payment for every checkout.
3. Stripe's p99 spikes to 14s under provider-side load.
4. Payment Service threads block on those timeouts.
5. failureRateThreshold=50% breached → Payment CB transitions to OPEN.
6. Subsequent calls fail-fast → fallback handler enqueues "deferred order" responses to Kafka.
7. Order Service's own CB drops to HALF-OPEN, probing with limited concurrency.
8. Bulkhead isolation prevents the cascade from reaching Inventory, Notifications, or User services.

Step 8 is the only reason this incident wasn't a full-platform outage. Without per-endpoint bulkheads, a slow Stripe would have eaten every thread in the gateway's pool, and User Service login requests would have queued behind dead Payment calls.

## The state machine, practically

If you've only read the docs, the circuit breaker looks like a tidy three-state diagram. In production it's noisier:

```java
// Resilience4j state transitions, simplified
CircuitBreaker cb = CircuitBreaker.of("paymentService", config);
cb.getEventPublisher()
  .onStateTransition(event -> {
      log.warn("CB {} : {} -> {}",
          event.getCircuitBreakerName(),
          event.getStateTransition().getFromState(),
          event.getStateTransition().getToState());
      meterRegistry.counter("cb.transition",
          "name", event.getCircuitBreakerName(),
          "to", event.getStateTransition().getToState().name()
      ).increment();
  });
```

That listener saved us during the postmortem. We could replay exactly when each breaker tripped, when probing started, and which trial calls failed. If you don't emit metrics on every state transition, you're flying blind.

The HALF-OPEN state is the dangerous one. Resilience4j permits a small number of trial calls; if any of them fail, you slam back to OPEN for another waitDuration. Set the trial pool too low and you'll never recover; set it too high and you'll hammer a still-broken downstream.

## Four changes that fixed it

### 1. Tighter, faster breakers

We dropped the threshold and shrunk the window:

```yaml
resilience4j.circuitbreaker:
  instances:
    paymentService:
      failureRateThreshold: 30               # was 50
      slowCallRateThreshold: 50              # NEW — slow calls also count
      slowCallDurationThreshold: 2s          # NEW
      slidingWindowSize: 20                  # was 100
      minimumNumberOfCalls: 10
      waitDurationInOpenState: 15s           # was 30s
      permittedNumberOfCallsInHalfOpenState: 5
```

Two non-obvious knobs matter here. slowCallRateThreshold lets you trip on latency, not just errors — critical when a downstream is dying slowly rather than 500-ing. And the smaller window means the breaker reacts in seconds, not minutes.

### 2. Per-endpoint bulkheads

A single thread pool for "Payment Service" is too coarse. Split by downstream:

```java
@Bean
public ThreadPoolBulkhead stripeBulkhead() {
    ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(20)
        .coreThreadPoolSize(10)
        .queueCapacity(50)
        .keepAliveDuration(Duration.ofMillis(500))
        .build();
    return ThreadPoolBulkhead.of("stripe", config);
}

@Bean
public ThreadPoolBulkhead fraudBulkhead() {
    // Smaller — fraud is allowed to be slow, not allowed to starve payment
    return ThreadPoolBulkhead.of("fraud", ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(8)
        .coreThreadPoolSize(4)
        .build());
}
```

Now a slow fraud engine can't drain Stripe's threads, and vice versa. Bulkhead-per-dependency is more YAML, but it's the only way to guarantee isolation when one downstream misbehaves.

### 3. Async outbox + Kafka retry

The synchronous Order → Payment → Stripe chain was the real sin. We moved Payment to an outbox pattern: orders write a payment intent to Postgres in the same transaction, a relay publishes to Kafka, and a worker calls Stripe asynchronously. The user gets an immediate "order placed" response; the charge happens within seconds, with retries handled by the consumer.

```java
@Transactional
public Order placeOrder(OrderRequest req) {
    Order order = orderRepo.save(Order.from(req));
    outboxRepo.save(new OutboxEvent(
        "payment.charge.requested",
        order.getId(),
        objectMapper.writeValueAsString(req.payment())
    ));
    return order; // returns in <50ms regardless of Stripe latency
}
```

Decoupling time-of-order from time-of-charge means a 14-second Stripe doesn't translate to a 14-second user experience. It also gives us natural retry and dead-lettering through Kafka, instead of bolting retry logic onto every caller.

### 4. HPA on RPS and queue depth

The Payment Service was scaled on CPU, which is useless when threads are blocked on I/O. We swapped to a custom Prometheus metric — RPS plus Kafka consumer lag — and let the HPA add pods when the queue grew faster than it drained. CPU never crossed 40% during the incident; if we'd been watching the right signal, we'd have scaled out three minutes earlier.

## What I'd tell past me

The circuit breaker is a fire alarm, not a fire suppression system. By the time it trips, you've already had a fire for a while. The real defenses are the things that stop the fire from starting: bulkhead isolation per downstream, slow-call detection, async boundaries on anything you don't fully control, and autoscaling on signals that actually correlate with load.

Resilience4j is excellent. The defaults are not your friend in production.

## Takeaways

If you take three things from this:

- Trip on latency, not just errors. slowCallRateThreshold is the most underused knob in Resilience4j.
- One bulkhead per downstream, always. Coarse pools will betray you the moment two dependencies fail differently.
- Synchronous chains across third-party APIs are tech debt. An outbox + queue is more code, but it's the difference between a postmortem and an incident report.

The next flash sale ran 12x baseline. Payment p99 stayed under 600ms. Nobody paged.
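A footnote on the capacity math in "What 10x RPS actually does": the collapse follows directly from Little's law (concurrency = arrival rate times latency). A minimal sketch; the 200-thread pool size is an assumed illustration, not the incident's actual pool:

```java
// Little's law sketch: max sustainable RPS for a pool of blocking threads.
// The thread count (200) is a hypothetical number for illustration; the
// latencies (800ms healthy, 14.2s degraded) are from the article.
public class CapacityMath {
    // A pool of N threads, each holding a request for W seconds,
    // sustains at most N / W requests per second.
    static double maxRps(int threads, double latencySeconds) {
        return threads / latencySeconds;
    }

    public static void main(String[] args) {
        int threads = 200; // assumption, not from the incident

        System.out.printf("healthy capacity:  %.0f RPS%n", maxRps(threads, 0.8));  // 250
        System.out.printf("degraded capacity: %.0f RPS%n", maxRps(threads, 14.2)); // 14

        // At 10,243 incoming RPS, a ~14 RPS pool saturates almost instantly,
        // which is why the breaker's four-minute reaction time was far too slow.
        System.out.printf("excess load: ~%.0f requests/second%n",
            10243 - maxRps(threads, 14.2));
    }
}
```

The same arithmetic explains why the pool was 98% saturated before the breaker ever opened: the queue grows by roughly the full incoming rate once capacity drops that far.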
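The gap between the old and new breaker settings in "1. Tighter, faster breakers" can be made concrete with a toy simulation of a COUNT_BASED sliding window. This is illustrative counting logic only, not Resilience4j's implementation, and it assumes the window is full of successes when the timeouts begin:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy COUNT_BASED sliding-window model: how many consecutive failures
// does it take before the failure rate crosses the trip threshold?
public class WindowSim {
    static int failuresUntilOpen(int windowSize, double failThresholdPct) {
        Deque<Boolean> window = new ArrayDeque<>();
        // Steady state before the spike: a window full of successes.
        for (int i = 0; i < windowSize; i++) window.add(true);

        int failures = 0;
        while (true) {
            failures++;
            window.add(false); // timeout, recorded as a failure
            window.poll();     // slide: drop the oldest outcome
            long failed = window.stream().filter(ok -> !ok).count();
            if (100.0 * failed / window.size() >= failThresholdPct) {
                return failures;
            }
        }
    }

    public static void main(String[] args) {
        // Old config: 50% threshold over a 100-call window
        System.out.println(failuresUntilOpen(100, 50)); // 50
        // New config: 30% threshold over a 20-call window
        System.out.println(failuresUntilOpen(20, 30));  // 6
    }
}
```

Under these assumptions the old settings need 50 straight failures to trip; the tuned window needs 6, which is the "seconds, not minutes" reaction the article describes (and the slow-call threshold trips even earlier, since 14s calls count before they fail).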
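Section 3 shows the write side of the outbox but not the relay that drains it. Below is a minimal in-memory sketch of that loop; OutboxRelay, Event, and Publisher are illustrative stand-ins, not the production code — a real relay would poll Postgres (ideally with SELECT ... FOR UPDATE SKIP LOCKED) and publish through a Kafka producer:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of an outbox relay tick. All types are in-memory stand-ins.
public class OutboxRelay {
    record Event(long id, String topic, String payload) {}

    interface Publisher { void send(String topic, String payload); }

    private final Deque<Event> pending = new ArrayDeque<>();
    private final List<Long> published = new ArrayList<>();
    private final Publisher publisher;

    OutboxRelay(Publisher publisher) { this.publisher = publisher; }

    void enqueue(Event e) { pending.add(e); }
    List<Long> publishedIds() { return published; }

    // One relay tick: drain up to batchSize events. Publish first, mark
    // done after — a crash between the two re-sends the event, giving
    // at-least-once delivery rather than losing the charge.
    void drainOnce(int batchSize) {
        for (int i = 0; i < batchSize && !pending.isEmpty(); i++) {
            Event e = pending.poll();
            publisher.send(e.topic(), e.payload());
            published.add(e.id());
        }
    }

    public static void main(String[] args) {
        List<String> wire = new ArrayList<>();
        OutboxRelay relay = new OutboxRelay((topic, payload) -> wire.add(topic + ":" + payload));
        relay.enqueue(new Event(1, "payment.charge.requested", "{\"orderId\":1}"));
        relay.enqueue(new Event(2, "payment.charge.requested", "{\"orderId\":2}"));
        relay.drainOnce(100);
        System.out.println(wire.size() + " events published"); // 2 events published
    }
}
```

The at-least-once ordering is the design choice that matters: the downstream Stripe worker must be idempotent (e.g. keyed on the order ID), because a crashed relay will replay events rather than drop them.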
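Section 4 describes scaling on a custom Prometheus metric but shows no manifest. A sketch of what such an HPA can look like, assuming the custom metrics are already exposed to the HPA controller via an adapter such as prometheus-adapter; the metric names, targets, and replica bounds below are illustrative assumptions, not the incident's values:

```yaml
# Illustrative HPA on custom metrics — names and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric:
          name: payment_requests_per_second   # assumed metric name
        target:
          type: AverageValue
          averageValue: "150"                 # scale out above ~150 RPS/pod
    - type: Pods
      pods:
        metric:
          name: kafka_consumer_lag            # assumed metric name
        target:
          type: AverageValue
          averageValue: "1000"                # scale out when lag/pod grows
```

With multiple metrics the HPA takes the largest proposed replica count, so either rising RPS or a growing Kafka backlog can trigger scale-out even while CPU stays flat, which is exactly the blind spot described in the incident.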