Resilience Benchmark
A comprehensive benchmarking suite for evaluating fault tolerance patterns in Spring Boot microservices. This research project, part of my MSc thesis, compares circuit breakers, retries, bulkheads, and rate limiters under various failure scenarios.
Overview
In distributed systems, failures are inevitable. The question isn't whether services will fail, but how the system behaves when they do. Resilience patterns like circuit breakers and bulkheads can prevent cascading failures, but choosing the right pattern for a given scenario requires empirical data.
This project provides a reproducible benchmark environment to measure the effectiveness of different resilience patterns against common failure modes: network latency, service crashes, resource exhaustion, and dependency failures.
Patterns Evaluated
Circuit Breaker
Prevents repeated calls to failing services, allowing them time to recover while failing fast for callers.
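As a rough sketch of what a circuit-breaker-guarded call can look like, assuming Resilience4j (the thresholds and the `callBackendService` helper below are illustrative, not this project's actual configuration):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {

    public static void main(String[] args) {
        // Open the circuit when 50% of the last 10 calls fail, then stay open
        // for 30 seconds before probing the backend again.
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .slidingWindowSize(10)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("backend", config);

        // While the breaker is open, calls fail fast with
        // CallNotPermittedException instead of hitting the backend.
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, CircuitBreakerExample::callBackendService);
        System.out.println(guarded.get());
    }

    // Placeholder for the real remote call (e.g. an HTTP client invocation).
    private static String callBackendService() {
        return "response";
    }
}
```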
Retry with Backoff
Automatically retries failed requests with exponential backoff and jitter to avoid thundering-herd effects when many callers retry at once.
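A minimal sketch of such a retry policy, again assuming Resilience4j; the attempt count and backoff values are illustrative only:

```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.util.function.Supplier;

public class RetryExample {

    public static void main(String[] args) {
        // Up to 3 attempts; waits of ~500 ms then ~1 s between attempts,
        // each randomized by +/-50% so retrying clients don't synchronize.
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0, 0.5))
                .build();

        Retry retry = Retry.of("backend", config);

        Supplier<String> resilient =
                Retry.decorateSupplier(retry, RetryExample::callBackendService);
        System.out.println(resilient.get());
    }

    // Placeholder for the real remote call.
    private static String callBackendService() {
        return "response";
    }
}
```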
Bulkhead
Isolates failures by limiting concurrent calls to each dependency, preventing resource exhaustion.
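A minimal sketch of a semaphore-based bulkhead, assuming Resilience4j; the concurrency limit and wait duration are illustrative:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class BulkheadExample {

    public static void main(String[] args) {
        // Allow at most 10 concurrent calls to this dependency; additional
        // callers wait up to 20 ms, then fail with BulkheadFullException
        // instead of piling up and exhausting threads.
        BulkheadConfig config = BulkheadConfig.custom()
                .maxConcurrentCalls(10)
                .maxWaitDuration(Duration.ofMillis(20))
                .build();

        Bulkhead bulkhead = Bulkhead.of("backend", config);

        Supplier<String> guarded =
                Bulkhead.decorateSupplier(bulkhead, BulkheadExample::callBackendService);
        System.out.println(guarded.get());
    }

    // Placeholder for the real remote call.
    private static String callBackendService() {
        return "response";
    }
}
```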
Rate Limiter
Protects services from overload by limiting request rates, with fair queuing for burst traffic.
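A minimal sketch of a rate-limited call, assuming Resilience4j; the permit count and refresh period are illustrative:

```java
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class RateLimiterExample {

    public static void main(String[] args) {
        // Permit 50 calls per second; callers in a burst wait up to 100 ms
        // for a permit and are rejected (RequestNotPermitted) after that.
        RateLimiterConfig config = RateLimiterConfig.custom()
                .limitForPeriod(50)
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .timeoutDuration(Duration.ofMillis(100))
                .build();

        RateLimiter limiter = RateLimiter.of("backend", config);

        Supplier<String> limited =
                RateLimiter.decorateSupplier(limiter, RateLimiterExample::callBackendService);
        System.out.println(limited.get());
    }

    // Placeholder for the real remote call.
    private static String callBackendService() {
        return "response";
    }
}
```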
Key Findings
The benchmarks revealed that pattern combinations outperform individual patterns. A circuit breaker alone can still overwhelm a recovering service, because once the breaker closes, traffic returns at full volume; combined with rate limiting and bulkheads, recovery is significantly smoother.
We also found that default configurations are rarely optimal. Circuit breaker thresholds, bulkhead sizes, and retry counts all need tuning based on actual service characteristics and SLAs.
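As a rough sketch of how such a combination can be wired around a single call, assuming Resilience4j (default configurations are used here purely for brevity; as noted above, they would need tuning in practice):

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.ratelimiter.RateLimiter;

import java.util.function.Supplier;

public class CombinedPatternsExample {

    public static void main(String[] args) {
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("backend");
        RateLimiter limiter = RateLimiter.ofDefaults("backend");
        Bulkhead bulkhead = Bulkhead.ofDefaults("backend");

        // Decorators compose from the inside out: the bulkhead caps concurrency,
        // the rate limiter throttles throughput toward the (possibly recovering)
        // service, and the circuit breaker sits outermost to fail fast when open.
        Supplier<String> call = CombinedPatternsExample::callBackendService;
        call = Bulkhead.decorateSupplier(bulkhead, call);
        call = RateLimiter.decorateSupplier(limiter, call);
        call = CircuitBreaker.decorateSupplier(breaker, call);

        System.out.println(call.get());
    }

    // Placeholder for the real remote call.
    private static String callBackendService() {
        return "response";
    }
}
```

The decoration order is a design choice: with the circuit breaker outermost, an open breaker rejects calls immediately without consuming bulkhead permits or rate-limiter tokens.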
The optimal circuit breaker configuration reduced 99th percentile latency by 73% during partial outages compared to no resilience patterns.
Tech Stack
What I Learned
This research taught me that resilience engineering is as much about understanding failure modes as it is about implementing patterns. Characterizing how services fail (gracefully, suddenly, or partially) is essential for choosing appropriate mitigations.
I also gained experience with chaos engineering practices, learning to inject failures in controlled ways to validate system behavior before real outages occur.