Building software that can withstand unexpected problems is a craft. I think of it like building a house in a region with storms. You don’t just hope the weather stays nice. You design a roof that won’t leak, a foundation that won’t flood, and a structure that won’t collapse at the first strong wind. In our world, the storms are network timeouts, database hiccups, and third-party services going offline. Let’s talk about building software that doesn’t just crash when these things happen.
First, consider what happens when you call an external service, like a payment gateway. If it starts responding slowly or failing, your application might keep trying. Each attempt uses a thread, waits, and eventually fails. This can quickly use up all your server’s resources, causing your entire app to slow down or crash because one small external part is broken. This is a cascading failure.
We can prevent this with a pattern that acts like an electrical circuit breaker. When a service fails too many times, we “trip” the breaker. All further calls immediately fail fast, without even trying the service. This gives the failing system time to recover. After a set period, we cautiously try again to see if it’s healthy.
# A simple version to show the idea
class PaymentService
  def initialize
    @failures = 0
    @breaker_tripped_at = nil
    @threshold = 3
    @cooldown = 30.seconds
  end

  def charge(order)
    # Check if the breaker is "open"
    if @breaker_tripped_at && (Time.now - @breaker_tripped_at < @cooldown)
      puts "Circuit is open! Using fallback."
      return queue_for_later(order)
    end

    # Try the call
    begin
      response = ExternalPaymentGateway.charge(amount: order.total, token: order.token)
      reset_breaker
      response
    rescue Timeout::Error, SocketError => e
      record_failure
      raise e
    end
  end

  private

  def record_failure
    @failures += 1
    if @failures >= @threshold
      puts "Threshold reached! Tripping the circuit breaker."
      @breaker_tripped_at = Time.now
    end
  end

  def reset_breaker
    @failures = 0
    @breaker_tripped_at = nil
  end

  def queue_for_later(order)
    PaymentRetryJob.perform_later(order.id)
    { status: :queued, message: 'Will try again shortly' }
  end
end
In this example, after three failed charges, the breaker trips. For the next 30 seconds, any call to charge skips the external gateway entirely and just queues the job. This stops the flood of failing requests. After the cooldown, the next request will attempt the gateway again. If it works, the breaker resets. If it fails, the cooldown period starts over. This simple logic protects your app.
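One practical note: the breaker only helps if its state survives between requests, so you want a single shared instance rather than a fresh PaymentService per call. Here is a rough usage sketch; CheckoutController, current_order, and order_path are hypothetical pieces of a typical Rails app.
# Hypothetical controller usage. A single shared instance (per process) lets
# the failure count and breaker state accumulate across requests.
class CheckoutController < ApplicationController
  PAYMENT_SERVICE = PaymentService.new

  def create
    result = PAYMENT_SERVICE.charge(current_order)

    if result.is_a?(Hash) && result[:status] == :queued
      # The breaker was open; the charge has been queued for a retry.
      redirect_to order_path(current_order), notice: result[:message]
    else
      redirect_to order_path(current_order), notice: 'Payment successful.'
    end
  rescue Timeout::Error, SocketError
    # The breaker is still closed but the gateway failed; show a plain error.
    flash.now[:alert] = 'Payment failed. Please try again.'
    render :new, status: :unprocessable_entity
  end
end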
Not all errors mean you should stop trying forever. Some are transient. A network blip or a brief database lock might be gone in a second. For these, we retry. But we must be smart about it. If every failed request retries immediately, you can create a “retry storm” that overwhelms the recovering service.
The smarter approach is to wait a bit between tries, and to increase the wait each time. This is called exponential backoff. Adding a small random variation, called jitter, prevents many clients from retrying at the exact same moment.
require 'net/http'
require 'json'

def fetch_remote_data(url)
  retries = 0
  max_retries = 4

  begin
    response = Net::HTTP.get_response(URI(url))
    JSON.parse(response.body)
  rescue Net::ReadTimeout, Net::OpenTimeout => e
    retries += 1
    if retries <= max_retries
      wait_time = (2 ** retries) + rand(0.0..1.0) # Exponential backoff + jitter
      puts "Retry #{retries}/#{max_retries} after #{wait_time.round(2)}s. Error: #{e.message}"
      sleep(wait_time)
      retry
    else
      raise "Failed after #{max_retries} retries: #{e.message}"
    end
  end
end
Here, the first retry waits about 2 seconds, the second about 4, the third about 8, and the fourth about 16. The rand adds jitter. This gives the remote service breathing room. It’s a polite way to say, “I’ll come back in a moment.” You should only retry errors that make sense to retry, like timeouts or connection errors. Never retry a “404 Not Found” or a “card declined” error; the result will be the same.
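If you want to make that rule explicit, one option is a small allowlist of transient error classes; anything not on the list fails immediately instead of being retried. A minimal sketch, with an illustrative set of error classes:
# Only errors on this list are worth retrying. A 404 or a declined card is a
# definitive answer; repeating the request won't change it.
RETRYABLE_ERRORS = [
  Net::ReadTimeout,
  Net::OpenTimeout,
  Errno::ECONNRESET,
  Errno::ECONNREFUSED
].freeze

def retryable_error?(error)
  RETRYABLE_ERRORS.any? { |klass| error.is_a?(klass) }
end

# Inside a rescue block:
#   raise unless retryable_error?(e)
#   sleep(backoff); retry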
Sometimes, you can’t get the perfect answer, but a good-enough answer is better than an error page. This is the idea of a fallback. Your primary source of data might be a fast cache. If the cache is empty or the cache service is down, you fall back to the slower database.
The key is to make the fallback automatic and seamless for the user.
class ProductFinder
  def find_for_homepage
    # Try the fast path first
    featured = Rails.cache.read('homepage_featured_products')
    return featured if featured

    # Fallback to the slow path
    puts "Cache miss! Falling back to database query."
    Product.where(featured: true).limit(10).to_a
  rescue Redis::CannotConnectError => e
    # If the cache server itself is down, go straight to fallback
    puts "Cache service down! #{e.message}"
    Product.where(featured: true).limit(10).to_a
  end
end
You can get more sophisticated. Your fallback could be a stale copy of data, a default value, or even a call to a different, more reliable but slower service. The user might see slightly older data for a moment, but they see something, which is almost always better than seeing an error. I often use this for non-critical features like recommendation widgets or “recent news” sidebars.
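For the stale-copy variant, one approach is to keep two cache entries: a short-lived fresh copy and a long-lived stale copy, and reach for the stale one when the database lets you down. A rough sketch; the cache keys, expiry times, and rescued exception classes are illustrative choices:
class FeaturedProducts
  FRESH_KEY = 'featured_products/fresh'
  STALE_KEY = 'featured_products/stale'

  def fetch
    # Fast path: serve the fresh copy if we have one.
    fresh = Rails.cache.read(FRESH_KEY)
    return fresh if fresh

    # Recompute from the database and refresh both copies.
    products = Product.where(featured: true).limit(10).to_a
    Rails.cache.write(FRESH_KEY, products, expires_in: 5.minutes)
    Rails.cache.write(STALE_KEY, products, expires_in: 12.hours)
    products
  rescue ActiveRecord::ConnectionNotEstablished, ActiveRecord::StatementInvalid
    # Database trouble: serve the long-lived stale copy, or an empty list so
    # the page still renders without the widget.
    Rails.cache.read(STALE_KEY) || []
  end
end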
Picture a ship with bulkheads. If one compartment floods, the watertight doors seal it off to keep the rest of the ship from sinking. We can do the same with software resources. If your payment processing starts using too many threads and slowing down, it shouldn’t be allowed to steal threads from your email sending or report generation.
We achieve this by giving each type of task its own isolated pool of resources.
# Using a concurrency library for thread pools
require 'concurrent'

class ResourceManager
  def initialize
    @payment_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: 3, # Only 3 threads max for payments
      max_queue: 5    # Only 5 jobs can wait in line
    )
    @email_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: 2,
      max_queue: 10
    )
  end

  def process_payment_async(order)
    Concurrent::Future.execute(executor: @payment_pool) do
      PaymentService.new.charge(order)
    end
  end

  def send_welcome_email_async(user)
    Concurrent::Future.execute(executor: @email_pool) do
      UserMailer.welcome(user).deliver_now
    end
  end
end
Now, even if payment processing gets swamped with 100 requests, only 3 will run concurrently and 5 will wait in the queue. The other 92 are rejected immediately, which we can handle gracefully (maybe by showing the user a “try again soon” message). Crucially, the email pool with its 2 threads remains untouched. The flood is contained. This is bulkheading.
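What does that rejection look like in practice? With the pool’s default :abort fallback policy, concurrent-ruby raises Concurrent::RejectedExecutionError when the queue is full, and you can translate that into the friendly message. A sketch, assuming a shared RESOURCE_MANAGER instance (not shown) and a typical Rails controller action:
# Hypothetical controller action that turns a full bulkhead into a 503 rather
# than letting the work pile up.
def create
  RESOURCE_MANAGER.process_payment_async(current_order)
  head :accepted
rescue Concurrent::RejectedExecutionError
  # The payment pool and its queue are full; ask the user to come back soon.
  render plain: 'We are very busy right now. Please try again in a moment.',
         status: :service_unavailable
end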
A timeout is a simple form of bulkheading for time. You might use it for a call you know should be fast.
# ServiceTimeoutError is a custom error; define it wherever your app keeps its error classes.
class ServiceTimeoutError < StandardError; end

def call_with_timeout(service, max_time = 2)
  future = Concurrent::Future.execute { service.call }

  # Future#value(timeout) returns nil if the task hasn't finished in time;
  # it doesn't raise, so we check the future's state ourselves.
  result = future.value(max_time)

  if future.incomplete?
    future.cancel # Stop the task if it hasn't started running yet
    raise ServiceTimeoutError, "#{service.name} took longer than #{max_time}s"
  end

  result
end
When an error occurs, what you do next shouldn’t be a mystery. A “404 Not Found” is very different from a “database connection lost.” The first is a client error we should handle politely. The second is a server problem that might need a retry or an alert to the operations team.
Classifying errors lets you apply the right strategy automatically.
class ErrorHandler
  def self.handle(error, context)
    case error
    when ActiveRecord::RecordNotFound
      # User asked for something that doesn't exist.
      Rails.logger.info("Record not found: #{context}")
      { status: 404, json: { error: 'Not Found' } }
    when Net::ReadTimeout
      # Network issue, likely temporary.
      Rails.logger.warn("Timeout for #{context}. Will retry.")
      { status: 503, json: { error: 'Service Temporarily Unavailable' } }
    when PaymentService::DeclinedError
      # Business logic failure. Don't retry.
      Rails.logger.info("Payment declined: #{context}")
      { status: 422, json: { error: 'Payment Declined' } }
    else
      # Unexpected, serious error. Sound the alarms.
      Rails.logger.error("Unhandled error: #{error.message} - #{context}")
      ErrorMonitoringService.notify(error, context)
      { status: 500, json: { error: 'Internal Server Error' } }
    end
  end
end

# In your controller
rescue_from StandardError do |error|
  result = ErrorHandler.handle(error, request_details)
  render result
end
This turns a tangled mess of begin/rescue blocks into a clear decision tree. You can see at a glance how each error type is treated. Adding a new error type becomes a matter of adding another when clause. The context—like the user ID or request parameters—is passed along for better logging.
What happens when your system is truly struggling? Maybe database CPU is at 99% or response times are skyrocketing. Continuing to accept all requests will make things worse until everything crashes. A better strategy is to gracefully shed load. You identify requests that are less critical and start rejecting them with a friendly “busy” signal (HTTP 503). This preserves capacity for the most important traffic.
You need a way to measure your own system’s health. This could be based on response times, error rates, or queue lengths.
class LoadShedder
  def initialize
    @request_times = []
    @shedding = false
  end

  def track_request(duration)
    # Keep only the last 100 request times
    @request_times.shift if @request_times.size >= 100
    @request_times << duration

    # Decide if we're unhealthy
    avg_time = @request_times.sum / @request_times.size
    if avg_time > 1.0 # Average response over 1 second is bad
      enable_shedding
    elsif avg_time < 0.5
      disable_shedding
    end
  end

  def should_accept?(request)
    return true unless @shedding

    # When shedding, reject requests to expensive or non-critical endpoints
    !non_critical_path?(request.path)
  end

  private

  def non_critical_path?(path)
    path.include?('/api/analytics') || path.include?('/admin/reports')
  end

  def enable_shedding
    puts "System slow. Enabling load shedding for non-critical paths."
    @shedding = true
  end

  def disable_shedding
    puts "System healthy. Disabling load shedding."
    @shedding = false
  end
end

# In a Rack middleware
class SheddingMiddleware
  def initialize(app, shedder)
    @app = app
    @shedder = shedder
  end

  def call(env)
    req = Rack::Request.new(env)
    unless @shedder.should_accept?(req)
      return [503, { 'Content-Type' => 'text/plain' }, ['Service Overloaded']]
    end

    start = Time.now
    status, headers, body = @app.call(env)
    duration = Time.now - start
    @shedder.track_request(duration)

    [status, headers, body]
  end
end
This is a self-preservation mechanism. It’s better to tell some users “try again later” than to let the system fall over for everyone. You typically shed the fanciest features first—the complex reports, the real-time analytics dashboards. You keep the core “buy this product” or “log in” functionality running as long as possible.
In a complex operation, especially one that touches multiple services (like charging a card and updating inventory), things can fail in the middle. You’re left in an inconsistent state. The payment went through, but the inventory was never reserved. This is a bad place to be.
The solution is to plan for failure. For every action, you can define a compensating action—a way to undo it.
class OrderProcessor
  def process(order)
    steps_completed = []

    begin
      # Step 1: Reserve items
      InventoryService.reserve(order.items)
      steps_completed << :inventory_reserved

      # Step 2: Charge card
      PaymentService.charge(order.total, order.card_token)
      steps_completed << :payment_taken

      # Step 3: Ship order
      ShippingService.schedule(order)
      steps_completed << :shipping_scheduled

      order.mark_as_completed
    rescue => error
      puts "Failed at step! Rolling back. Error: #{error.message}"

      # Undo steps in reverse order
      steps_completed.reverse_each do |step|
        undo(step, order)
      end

      raise error # Re-raise after cleanup
    end
  end

  private

  def undo(step, order)
    case step
    when :inventory_reserved
      InventoryService.release(order.items)
    when :payment_taken
      PaymentService.refund(order.transaction_id)
    when :shipping_scheduled
      ShippingService.cancel(order.shipment_id)
    end
  rescue => undo_error
    # Log this aggressively! A failed compensation is serious.
    puts "CRITICAL: Failed to undo #{step}: #{undo_error.message}"
  end
end
This is like leaving a trail of breadcrumbs. If you make it to the end, great. If you fail at any point, you walk back along the trail, undoing each step. Notice the reverse_each. You undo the last thing you did first. This pattern, sometimes called a Saga, is crucial for maintaining data consistency across different systems without a single, global transaction.
Building these patterns into your Rails application changes how you think about code. It moves you from hoping things work to designing for when they don’t. You start asking: “If this API call times out, what should the user see?” or “If the database is slow, which features can we turn off first?”
This isn’t about preventing every single error—that’s impossible. It’s about containing failures, providing reasonable alternatives, and keeping the core experience intact. It’s the difference between a building that collapses in an earthquake and one that sways, loses some windows, but remains standing, protecting the people inside. Your code becomes not just functional, but resilient. And that is a quality users feel, even if they never see the careful engineering behind it.
Start small. Add a retry with backoff to one external API call. Wrap a non-critical feature in a fallback. The complexity you add is a trade-off, but for the critical paths of your application, it’s a trade-off that builds trust and reliability.