Building Resilient Software: Circuit Breakers, Retries, and Failure Strategies That Actually Work

Learn to build resilient software that survives network failures, database crashes & service outages. Master circuit breakers, retries, fallbacks & error handling patterns.

Building software that can withstand unexpected problems is a craft. I think of it like building a house in a region with storms. You don’t just hope the weather stays nice. You design a roof that won’t leak, a foundation that won’t flood, and a structure that won’t collapse at the first strong wind. In our world, the storms are network timeouts, database hiccups, and third-party services going offline. Let’s talk about building software that doesn’t just crash when these things happen.

First, consider what happens when you call an external service, like a payment gateway. If it starts responding slowly or failing, your application might keep trying. Each attempt uses a thread, waits, and eventually fails. This can quickly use up all your server’s resources, causing your entire app to slow down or crash because one small external part is broken. This is a cascading failure.

We can prevent this with a pattern that acts like an electrical circuit breaker. When a service fails too many times, we “trip” the breaker. All further calls immediately fail fast, without even trying the service. This gives the failing system time to recover. After a set period, we cautiously try again to see if it’s healthy.

# A simple version to show the idea
class PaymentService
  def initialize
    @failures = 0
    @breaker_tripped_at = nil
    @threshold = 3
    @cooldown = 30.seconds
  end

  def charge(order)
    # Check if the breaker is "open"
    if @breaker_tripped_at && (Time.now - @breaker_tripped_at < @cooldown)
      puts "Circuit is open! Using fallback."
      return queue_for_later(order)
    end

    # Try the call
    begin
      response = ExternalPaymentGateway.charge(amount: order.total, token: order.token)
      reset_breaker
      response
    rescue Timeout::Error, SocketError => e
      record_failure
      raise e
    end
  end

  private

  def record_failure
    @failures += 1
    if @failures >= @threshold
      puts "Threshold reached! Tripping the circuit breaker."
      @breaker_tripped_at = Time.now
    end
  end

  def reset_breaker
    @failures = 0
    @breaker_tripped_at = nil
  end

  def queue_for_later(order)
    PaymentRetryJob.perform_later(order.id)
    { status: :queued, message: 'Will try again shortly' }
  end
end

In this example, after three failed charges, the breaker trips. For the next 30 seconds, any call to charge skips the external gateway entirely and just queues the job. This stops the flood of failing requests. After the cooldown, the next request will attempt the gateway again. If it works, the breaker resets. If it fails, the cooldown period starts over. This simple logic protects your app.
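
If you want to see the behavior concretely, here is a rough usage sketch; the order object and the failing gateway are assumed, and the numbers match the defaults above.

# A rough sketch of the breaker in action (order and a failing gateway are assumed)
service = PaymentService.new

3.times do
  begin
    service.charge(order) # each timeout is recorded as a failure
  rescue Timeout::Error, SocketError
    # the third failure trips the breaker
  end
end

service.charge(order) # => { status: :queued, message: 'Will try again shortly' }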

Not all errors mean you should stop trying forever. Some are transient. A network blip or a brief database lock might be gone in a second. For these, we retry. But we must be smart about it. If every failed request retries immediately, you can create a “retry storm” that overwhelms the recovering service.

The smarter approach is to wait a bit between tries, and to increase the wait each time. This is called exponential backoff. Adding a small random variation, called jitter, prevents many clients from retrying at the exact same moment.

require 'net/http'
require 'json'

def fetch_remote_data(url)
  retries = 0
  max_retries = 4

  begin
    response = Net::HTTP.get_response(URI(url))
    JSON.parse(response.body)
  rescue Net::ReadTimeout, Net::OpenTimeout => e
    retries += 1
    if retries <= max_retries
      wait_time = (2 ** retries) + rand(0.0..1.0) # Exponential backoff + jitter
      puts "Retry #{retries}/#{max_retries} after #{wait_time.round(2)}s. Error: #{e.message}"
      sleep(wait_time)
      retry
    else
      raise "Failed after #{max_retries} retries: #{e.message}"
    end
  end
end

Here, the first retry waits about 2 seconds, the second about 4, the third about 8. The rand adds jitter. This gives the remote service breathing room. It’s a polite way to say, “I’ll come back in a moment.” You should only retry errors that make sense to retry, like timeouts or connection errors. Never retry a “404 Not Found” or a “card declined” error; the result will be the same.
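
One way to keep that rule explicit is a small allow-list of retryable exception classes. This is just a sketch; the exact list depends on your HTTP client and how it reports failures.

# A sketch of an allow-list for retryable errors; adjust for your HTTP client
RETRYABLE_ERRORS = [
  Net::ReadTimeout,
  Net::OpenTimeout,
  Errno::ECONNRESET,
  Errno::ECONNREFUSED
].freeze

def retryable?(error)
  RETRYABLE_ERRORS.any? { |klass| error.is_a?(klass) }
end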

Sometimes, you can’t get the perfect answer, but a good-enough answer is better than an error page. This is the idea of a fallback. Your primary source of data might be a fast cache. If the cache is empty or the cache service is down, you fall back to the slower database.

The key is to make the fallback automatic and seamless for the user.

class ProductFinder
  def find_for_homepage
    # Try the fast path first
    featured = Rails.cache.read('homepage_featured_products')
    if featured
      return featured
    end

    # Fallback to the slow path
    puts "Cache miss! Falling back to database query."
    Product.where(featured: true).limit(10).to_a
  rescue Redis::CannotConnectError => e
    # If the cache server itself is down, go straight to fallback
    puts "Cache service down! #{e.message}"
    Product.where(featured: true).limit(10).to_a
  end
end

You can get more sophisticated. Your fallback could be a stale copy of data, a default value, or even a call to a different, more reliable but slower service. The user might see slightly older data for a moment, but they see something, which is almost always better than seeing an error. I often use this for non-critical features like recommendation widgets or “recent news” sidebars.
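
For the stale-copy variant, one sketch is to write a long-lived copy alongside the fresh query, so a database hiccup can still serve slightly old results. The cache key and TTL here are illustrative, not a prescription.

# A sketch of a stale-copy fallback; the cache key and TTL are illustrative
def featured_products
  fresh = Product.where(featured: true).limit(10).to_a
  Rails.cache.write('featured_products:stale', fresh, expires_in: 12.hours)
  fresh
rescue ActiveRecord::ConnectionNotEstablished, ActiveRecord::StatementInvalid
  # Serve slightly old data (or an empty default) rather than an error page
  Rails.cache.read('featured_products:stale') || []
end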

Picture a ship with bulkheads. If one compartment floods, the watertight doors seal it off to keep the rest of the ship from sinking. We can do the same with software resources. If your payment processing starts using too many threads and slowing down, it shouldn’t be allowed to steal threads from your email sending or report generation.

We achieve this by giving each type of task its own isolated pool of resources.

# Using a concurrency library for thread pools
require 'concurrent'

class ResourceManager
  def initialize
    @payment_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: 3, # Only 3 threads max for payments
      max_queue: 5    # Only 5 jobs can wait in line
    )
    @email_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: 2,
      max_queue: 10
    )
  end

  def process_payment_async(order)
    Concurrent::Future.execute(executor: @payment_pool) do
      PaymentService.new.charge(order)
    end
  end

  def send_welcome_email_async(user)
    Concurrent::Future.execute(executor: @email_pool) do
      UserMailer.welcome(user).deliver_now
    end
  end
end

Now, even if payment processing gets swamped with 100 requests, only 3 will run concurrently and 5 will wait in the queue. The remaining 92 are rejected immediately, which we can handle gracefully (maybe by showing the user a “try again soon” message). Crucially, the email pool with its 2 threads remains untouched. The flood is contained. This is bulkheading.
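
With the default :abort fallback policy, concurrent-ruby raises Concurrent::RejectedExecutionError once the queue is full, so a sketch of the graceful handling might look like this (the :busy response shape is just an example).

# A sketch of handling pool rejection (assumes the default :abort fallback policy)
def process_payment_async(order)
  Concurrent::Future.execute(executor: @payment_pool) do
    PaymentService.new.charge(order)
  end
rescue Concurrent::RejectedExecutionError
  # The payment bulkhead is full; tell the user instead of piling on more work
  { status: :busy, message: 'Payments are busy, please try again shortly' }
end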

A timeout is a simple form of bulkheading for time. You might use it for a call you know should be fast.

def call_with_timeout(service, max_time = 2)
  future = Concurrent::Future.execute { service.call }

  # Future#value blocks for up to max_time seconds and returns nil if the
  # result is not ready in time.
  result = future.value(max_time)

  if future.incomplete?
    future.cancel # Stop the task if it hasn't started yet
    raise ServiceTimeoutError, "#{service.class.name} took longer than #{max_time}s"
  end

  raise future.reason if future.rejected? # Re-raise errors from the task itself

  result
end

When an error occurs, what you do next shouldn’t be a mystery. A “404 Not Found” is very different from a “database connection lost.” The first is a client error we should handle politely. The second is a server problem that might need a retry or an alert to the operations team.

Classifying errors lets you apply the right strategy automatically.

class ErrorHandler
  def self.handle(error, context)
    case error
    when ActiveRecord::RecordNotFound
      # User asked for something that doesn't exist.
      Rails.logger.info("Record not found: #{context}")
      { status: 404, json: { error: 'Not Found' } }
    when Net::ReadTimeout
      # Network issue, likely temporary.
      Rails.logger.warn("Timeout for #{context}. Will retry.")
      { status: 503, json: { error: 'Service Temporarily Unavailable' } }
    when PaymentService::DeclinedError
      # Business logic failure. Don't retry.
      Rails.logger.info("Payment declined: #{context}")
      { status: 422, json: { error: 'Payment Declined' } }
    else
      # Unexpected, serious error. Sound the alarms.
      Rails.logger.error("Unhandled error: #{error.message} - #{context}")
      ErrorMonitoringService.notify(error, context)
      { status: 500, json: { error: 'Internal Server Error' } }
    end
  end
end

# In your controller
rescue_from StandardError do |error|
  result = ErrorHandler.handle(error, request_details)
  render result
end

This turns a tangled mess of begin/rescue blocks into a clear decision tree. You can see at a glance how each error type is treated. Adding a new error type becomes a matter of adding another when clause. The context—like the user ID or request parameters—is passed along for better logging.
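
The request_details helper is left undefined above; a minimal sketch, assuming a typical Rails controller with a current_user helper, could be:

# A minimal sketch of the request_details helper assumed above
def request_details
  {
    user_id: current_user&.id,
    path: request.path,
    params: request.filtered_parameters
  }
end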

What happens when your system is truly struggling? Maybe database CPU is at 99% or response times are skyrocketing. Continuing to accept all requests will make things worse until everything crashes. A better strategy is to gracefully shed load. You identify requests that are less critical and start rejecting them with a friendly “busy” signal (HTTP 503). This preserves capacity for the most important traffic.

You need a way to measure your own system’s health. This could be based on response times, error rates, or queue lengths.

class LoadShedder
  def initialize
    @request_times = []
    @shedding = false
  end

  def track_request(duration)
    # Keep only the last 100 request times
    @request_times.shift if @request_times.size >= 100
    @request_times << duration

    # Decide if we're unhealthy
    avg_time = @request_times.sum / @request_times.size
    if avg_time > 1.0 # Average response over 1 second is bad
      enable_shedding
    elsif avg_time < 0.5
      disable_shedding
    end
  end

  def should_accept?(request)
    return true unless @shedding

    # When shedding, reject requests to expensive or non-critical endpoints
    !non_critical_path?(request.path)
  end

  private

  def non_critical_path?(path)
    path.include?('/api/analytics') || path.include?('/admin/reports')
  end

  def enable_shedding
    return if @shedding # Avoid re-logging on every slow request
    puts "System slow. Enabling load shedding for non-critical paths."
    @shedding = true
  end

  def disable_shedding
    return unless @shedding
    puts "System healthy. Disabling load shedding."
    @shedding = false
  end
end

# In a Rack middleware
class SheddingMiddleware
  def initialize(app, shedder)
    @app = app
    @shedder = shedder
  end

  def call(env)
    req = Rack::Request.new(env)

    unless @shedder.should_accept?(req)
      return [503, { 'Content-Type' => 'text/plain' }, ['Service Overloaded']]
    end

    start = Time.now
    status, headers, body = @app.call(env)
    duration = Time.now - start

    @shedder.track_request(duration)

    [status, headers, body]
  end
end
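
Wiring the middleware in is a one-liner; this sketch assumes a single shared shedder instance created at boot.

# A sketch of wiring the shedder into the Rack stack (e.g. in config/application.rb)
shedder = LoadShedder.new
config.middleware.use SheddingMiddleware, shedder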

This is a self-preservation mechanism. It’s better to tell some users “try again later” than to let the system fall over for everyone. You typically shed the fanciest features first—the complex reports, the real-time analytics dashboards. You keep the core “buy this product” or “log in” functionality running as long as possible.

In a complex operation, especially one that touches multiple services (like charging a card and updating inventory), things can fail in the middle. You’re left in an inconsistent state. The payment went through, but the inventory was never reserved. This is a bad place to be.

The solution is to plan for failure. For every action, you can define a compensating action—a way to undo it.

class OrderProcessor
  def process(order)
    steps_completed = []

    begin
      # Step 1: Reserve items
      InventoryService.reserve(order.items)
      steps_completed << :inventory_reserved

      # Step 2: Charge card
      PaymentService.charge(order.total, order.card_token)
      steps_completed << :payment_taken

      # Step 3: Ship order
      ShippingService.schedule(order)
      steps_completed << :shipping_scheduled

      order.mark_as_completed

    rescue => error
      puts "Failed at step! Rolling back. Error: #{error.message}"
      # Undo steps in reverse order
      steps_completed.reverse_each do |step|
        undo(step, order)
      end
      raise error # Re-raise after cleanup
    end
  end

  private

  def undo(step, order)
    case step
    when :inventory_reserved
      InventoryService.release(order.items)
    when :payment_taken
      PaymentService.refund(order.transaction_id)
    when :shipping_scheduled
      ShippingService.cancel(order.shipment_id)
    end
  rescue => undo_error
    # Log this aggressively! A failed compensation is serious.
    puts "CRITICAL: Failed to undo #{step}: #{undo_error.message}"
  end
end

This is like leaving a trail of breadcrumbs. If you make it to the end, great. If you fail at any point, you walk back along the trail, undoing each step. Notice the reverse_each. You undo the last thing you did first. This pattern, sometimes called a Saga, is crucial for maintaining data consistency across different systems without a single, global transaction.

Building these patterns into your Rails application changes how you think about code. It moves you from hoping things work to designing for when they don’t. You start asking: “If this API call times out, what should the user see?” or “If the database is slow, which features can we turn off first?”

This isn’t about preventing every single error—that’s impossible. It’s about containing failures, providing reasonable alternatives, and keeping the core experience intact. It’s the difference between a building that collapses in an earthquake and one that sways, loses some windows, but remains standing, protecting the people inside. Your code becomes not just functional, but resilient. And that is a quality users feel, even if they never see the careful engineering behind it.

Start small. Add a retry with backoff to one external API call. Wrap a non-critical feature in a fallback. The complexity you add is a trade-off, but for the critical paths of your application, it’s a trade-off that builds trust and reliability.



