Resilience Benchmark

Java
Microservices, Research, Fault Tolerance

A comprehensive benchmarking suite for evaluating fault tolerance patterns in Spring Boot microservices. This research project, part of my MSc thesis, compares circuit breakers, retries, bulkheads, and rate limiters under various failure scenarios.

Overview

In distributed systems, failures are inevitable. The question isn't whether services will fail, but how the system behaves when they do. Resilience patterns like circuit breakers and bulkheads can prevent cascading failures, but choosing the right pattern for a given scenario requires empirical data.

This project provides a reproducible benchmark environment to measure the effectiveness of different resilience patterns against common failure modes: network latency, service crashes, resource exhaustion, and dependency failures.

Patterns Evaluated

Circuit Breaker

Prevents repeated calls to failing services, allowing them time to recover while failing fast for callers.
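As an illustration of the idea only (not the Resilience4j implementation the project benchmarks), here is a minimal sketch of a count-based circuit breaker in plain Java. The class name `SimpleCircuitBreaker`, the consecutive-failure threshold, and the explicit `nowMillis` argument are all assumptions made for the example; passing the time in keeps the state transitions deterministic and easy to test.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;    // consecutive failures before the breaker opens
    private final long openDurationMillis; // how long to fail fast before allowing a probe
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long openDurationMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    /**
     * Runs the action if the breaker allows it; otherwise returns the fallback
     * without invoking the action. Synchronized for simplicity, so calls are serialized.
     */
    synchronized <T> T call(Supplier<T> action, T fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < openDurationMillis) {
                return fallback;         // still open: fail fast
            }
            state = State.HALF_OPEN;     // wait time elapsed: let one probe through
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;     // any success closes the breaker again
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed probe, or too many failures in a row, (re)opens the breaker.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMillis = nowMillis;
            }
            return fallback;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would wrap each remote call, for example `breaker.call(() -> client.fetch(), cachedValue, System.currentTimeMillis())`, where `client` and `cachedValue` are placeholders.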

Retry with Backoff

Automatically retries failed requests with exponential backoff and jitter, so that many clients do not retry in lockstep and create a thundering herd.
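The delay schedule can be sketched as follows. This is a minimal example that assumes a "full jitter" strategy (a uniform random delay between zero and the capped exponential bound); the source does not say which jitter variant the benchmarks used, and the names `BackoffRetry`, `baseMillis`, and `capMillis` are hypothetical.

```java
import java.util.Random;
import java.util.concurrent.Callable;

/** Hypothetical retry helper with capped exponential backoff and full jitter. */
final class BackoffRetry {

    /** Upper bound of the delay before retry number `attempt` (0-based): base * 2^attempt, capped. */
    static long maxDelayMillis(int attempt, long baseMillis, long capMillis) {
        // The shift is clamped to 20 to avoid overflow; this assumes a modest baseMillis.
        long exponential = baseMillis << Math.min(attempt, 20);
        return Math.min(capMillis, exponential);
    }

    /** Full jitter: a uniform random delay in [0, maxDelayMillis]. */
    static long jitteredDelayMillis(int attempt, long baseMillis, long capMillis, Random rng) {
        long max = maxDelayMillis(attempt, baseMillis, capMillis);
        return (long) (rng.nextDouble() * (max + 1));
    }

    /** Calls the action up to maxAttempts times, sleeping a jittered delay between attempts. */
    static <T> T retry(Callable<T> action, int maxAttempts, long baseMillis,
                       long capMillis, Random rng) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts - 1) { // no sleep after the final attempt
                    Thread.sleep(jitteredDelayMillis(attempt, baseMillis, capMillis, rng));
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the delay bounds grow as 100, 200, 400, 800 and then stay at 1000; the jitter spreads the actual retry times across each window.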

Bulkhead

Isolates failures by limiting concurrent calls to each dependency, preventing resource exhaustion.
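One common way to express this limit is a semaphore per dependency. The sketch below is a simplified stand-in, not the library's bulkhead: the class name `SemaphoreBulkhead` and the choice to reject immediately instead of waiting for a permit are assumptions made for the example.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore bulkhead: at most maxConcurrentCalls run at once. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns `rejected` immediately. */
    <T> T call(Supplier<T> action, T rejected) {
        if (!permits.tryAcquire()) {
            return rejected;       // bulkhead full: shed the call instead of queuing it
        }
        try {
            return action.get();
        } finally {
            permits.release();     // always return the permit, even if the action throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream dependency its own instance means a slow dependency can exhaust only its own permits, not the threads needed to call the others.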

Rate Limiter

Protects services from overload by limiting request rates, with fair queuing for burst traffic.
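A token bucket is one simple way to cap request rates while still tolerating short bursts. The sketch below is an assumption-laden illustration: the class `TokenBucket` is hypothetical, it rejects excess requests instead of queuing them (so the fair queuing mentioned above is not modelled), and time is passed in explicitly to keep the behaviour deterministic.

```java
/** Hypothetical token-bucket rate limiter with an explicit clock. */
final class TokenBucket {
    private final double capacity;        // maximum burst size, in permits
    private final double refillPerMillis; // steady-state refill rate
    private double tokens;
    private long lastRefillMillis;

    TokenBucket(int capacity, double permitsPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMillis = permitsPerSecond / 1000.0;
        this.tokens = capacity;           // start full so an initial burst is allowed
        this.lastRefillMillis = nowMillis;
    }

    /** Takes one token if available; returns false when the caller should be rejected. */
    synchronized boolean tryAcquire(long nowMillis) {
        // Refill in proportion to elapsed time, never exceeding the bucket capacity.
        long elapsed = Math.max(0, nowMillis - lastRefillMillis);
        tokens = Math.min(capacity, tokens + elapsed * refillPerMillis);
        lastRefillMillis = Math.max(lastRefillMillis, nowMillis);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The capacity controls how large a burst is admitted, while the refill rate sets the sustained throughput; tuning the two separately is what lets the limiter absorb bursts without raising the long-run rate.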

Key Findings

The benchmarks showed that combinations of patterns outperform individual patterns. A circuit breaker alone can still expose a recovering service to a full burst of traffic the moment it closes, but combined with rate limiting and bulkheads, recovery is significantly smoother.

I also found that default configurations are rarely optimal. Circuit breaker thresholds, bulkhead sizes, and retry counts all need tuning against actual service characteristics and SLAs.

The best-performing circuit breaker configuration reduced 99th percentile latency by 73% during partial outages, compared with a baseline that used no resilience patterns.
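For readers unfamiliar with the metric, the sketch below shows one way a 99th percentile and a relative reduction can be computed. It assumes the nearest-rank percentile definition, which may differ from what the project's tooling reports, and the class `LatencyStats` is hypothetical. Any numbers passed to it here are illustrative inputs, not measurements from the benchmark.

```java
import java.util.Arrays;

/** Hypothetical helpers for percentile latency and relative improvement. */
final class LatencyStats {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Relative reduction in percent: (baseline - treated) / baseline * 100. */
    static double reductionPercent(double baseline, double treated) {
        return (baseline - treated) / baseline * 100.0;
    }
}
```

A reduction of 73% therefore means the p99 with the pattern enabled is 27% of the p99 measured without it.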

Tech Stack

Java 17, Spring Boot, Resilience4j, Gatling, Prometheus, Grafana, Docker Compose

What I Learned

This research taught me that resilience engineering is as much about understanding failure modes as it is about implementing patterns. Characterizing how services fail (gracefully, suddenly, or partially) is essential for choosing appropriate mitigations.

I also gained experience with chaos engineering practices, learning to inject failures in controlled ways to validate system behavior before real outages occur.