When a user clicks a button in your Rails application, they expect something to happen quickly. But what if that “something” takes ten seconds, or a minute, or an hour? Making them wait isn’t an option. This is where background job processing becomes essential. It’s the system that quietly handles the heavy lifting—sending emails, generating reports, processing payments—while your application remains snappy and responsive.
I think of it like a restaurant kitchen. The waiter (your web server) takes the order and immediately returns to the customer with a smile, a receipt, and a “your food is being prepared.” The real work happens in the kitchen (your job queue) by specialized chefs (your job workers). The customer isn’t kept waiting at the table, and the kitchen can manage multiple orders efficiently. Over the years, I’ve learned that building this kitchen reliably requires more than just installing a job processor like Sidekiq or Resque. It needs thoughtful patterns.
Let’s start with the most fundamental idea: making jobs safe to run multiple times. In a distributed system, jobs can and will execute more than once. Networks fail, servers restart, and retries happen. If processing an order twice charges a customer twice, you have a serious problem.
An idempotent job guarantees the same outcome no matter how many times it runs. The key is to design jobs so that repeated execution is harmless. Here’s how I typically approach it.
First, I make the job check the current state before doing any work. I look for a flag or a timestamp that indicates the work is already done. If I find it, I exit early. This is a simple but powerful guard.
Second, for operations that must not run concurrently, I use an advisory lock. This is a coordination mechanism, often using Redis, that acts like a “do not disturb” sign. Before the job starts its core work, it tries to acquire a unique lock for that specific task. If it gets the lock, it proceeds. If it doesn’t, it means another job worker is already handling it, so the current job can safely quit.
class InvoiceGenerationJob
  include Sidekiq::Job

  def perform(invoice_id)
    invoice = Invoice.find(invoice_id)

    # Guard: Is this already done?
    return if invoice.generated_at.present?

    # Coordination: Is someone else doing this right now?
    lock_key = "invoice_lock:#{invoice_id}"
    lock_acquired = Redis.current.set(lock_key, "1", nx: true, ex: 30)
    unless lock_acquired
      Rails.logger.info("Invoice #{invoice_id} generation is already in progress.")
      return
    end

    begin
      # The actual work, which is now safe.
      PDFGenerator.generate_for(invoice)
      invoice.update!(generated_at: Time.current)
    ensure
      # Always clean up the lock, even if an error occurs.
      Redis.current.del(lock_key)
    end
  end
end
This pattern transforms a potentially dangerous operation into a safe one. The early-return check handles retries that arrive after the work is already complete, and the lock handles simultaneous triggers. It gives me peace of mind.
The next consideration is what you send to the job queue. A job queue is not a database; it’s a messaging system. It serializes your job arguments into a format like JSON to store them temporarily. A common mistake is passing full, complex Ruby objects.
Passing an entire User object might seem convenient, but it’s fragile. That object’s attributes might change between when the job is enqueued and when it runs. It also makes the job payload large and slow to serialize. Instead, I always pass primitive, stable identifiers—usually just the database ID.
The job is then responsible for fetching the fresh data it needs. This keeps the queue fast and the data current.
# Less optimal: passing a whole object
SendWelcomeEmailJob.perform_async(@user) # the entire User object gets serialized into the queue payload

# Better: passing an identifier
class SendWelcomeEmailJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id) # Fresh fetch on execution
    UserMailer.welcome_email(user).deliver_now
  end
end

# To call it:
SendWelcomeEmailJob.perform_async(current_user.id)
Sometimes you need to pass more than an ID. For that, I use simple, serializable data structures: strings, numbers, arrays, and basic hashes. I avoid passing things like Time or Date objects directly; I convert them to strings.
class GenerateReportJob
  include Sidekiq::Job

  # Pass IDs and simple parameters
  def perform(user_id, report_type, start_date_string, end_date_string)
    user = User.find(user_id)
    start_date = Date.parse(start_date_string)
    end_date = Date.parse(end_date_string)
    # ... generate report logic
  end
end

# Enqueue with clean arguments
GenerateReportJob.perform_async(
  user.id,
  "monthly_sales",
  Date.today.beginning_of_month.to_s,
  Date.today.end_of_month.to_s
)
This practice of passing minimal, serializable data keeps the job system lean and predictable.
Now, things will go wrong. An external API will be down. A payment gateway will timeout. A third-party service will return an unexpected error. A robust job system doesn’t just fail; it has a plan for failure. That plan is a retry strategy.
The simplest strategy is immediate retry, but this can overwhelm a struggling service. A better approach is exponential backoff. This means waiting longer between each retry attempt—maybe 10 seconds, then 30, then 90. It gives the external service time to recover.
Most job frameworks have built-in retry mechanisms. In Sidekiq, you can configure it easily. But I often need more control. I want to handle a temporary network timeout differently from a permanent configuration error.
Here’s an example where I customize the retry logic based on the type of failure.
class ProcessWebhookJob
  include Sidekiq::Job
  sidekiq_options retry: 5 # Try up to 5 times

  # Define custom retry delays
  sidekiq_retry_in do |count, exception|
    case exception
    when Net::OpenTimeout, Net::ReadTimeout
      # For network timeouts, retry soon at first, backing off as failures accumulate.
      (count ** 3) + 10
    when VendorAPI::RateLimitError
      # For rate limits, wait as long as the API tells us to.
      # The exception might hold a 'retry-after' header value.
      60 * 5 # Wait 5 minutes
    when VendorAPI::ClientError
      # A 4xx error is likely our fault (bad data). Don't retry much.
      count < 2 ? 10 : :discard
    else
      # Default linear backoff for other errors.
      10 * (count + 1)
    end
  end

  def perform(webhook_payload_id)
    payload = WebhookPayload.find(webhook_payload_id)
    result = VendorAPI.process(payload.data)

    # Raise specific errors for the retry logic above to catch
    raise VendorAPI::RateLimitError if result.rate_limited?
    raise VendorAPI::ClientError if result.client_error?

    payload.update!(processed: true, result: result)
  end
end
This level of control is crucial. It means a temporary network blip won’t cause a job to fail permanently, but a genuine data error won’t waste resources on endless retries. I also add a small random “jitter” to the delay in some systems to prevent all failed jobs from retrying at the exact same moment, which could cause a retry storm.
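To make the jitter idea concrete, here is a minimal sketch of a sidekiq_retry_in hook with randomness folded into the delay. The class name and the specific delay formula are placeholders, not code from the job above.

class ExampleJob
  include Sidekiq::Job
  sidekiq_options retry: 5

  sidekiq_retry_in do |count, _exception|
    base_delay = 10 * (count + 1)
    # Add up to 50% random jitter so a batch of failures doesn't
    # all retry at the exact same moment and cause a retry storm.
    base_delay + rand(0..(base_delay / 2))
  end

  def perform(record_id)
    # ... work that may raise and be retried
  end
end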
Some tasks are just big. Importing 100,000 records, processing a high-resolution video, or calculating analytics for an entire year. You can’t do it in one giant, monolithic job. If it fails when it’s 99% done, you lose all of that progress. It also blocks a worker for a very long time.
The solution is batch processing. Break the large task into many small, independent units of work. Enqueue a job for each unit, or for small batches of units. This gives you automatic parallelism, finer-grained retries, and progress tracking.
# A master job that orchestrates the batch
class LargeDataExportJob
  include Sidekiq::Job

  def perform(export_id)
    export = DataExport.find(export_id)
    export.update!(status: 'preparing', total_items: 0)

    user_ids = User.where(created_at: export.date_range).pluck(:id)
    export.update!(total_items: user_ids.count)

    # Enqueue one job per user, or per small batch
    user_ids.each_slice(100) do |user_id_batch|
      GenerateUserExportSliceJob.perform_async(export_id, user_id_batch)
    end

    export.update!(status: 'enqueued')
  end
end

# A worker job that does a small piece
class GenerateUserExportSliceJob
  include Sidekiq::Job

  def perform(export_id, user_ids)
    export = DataExport.find(export_id)

    slice_data = []
    User.where(id: user_ids).find_each do |user|
      slice_data << calculate_export_data_for(user)
    end

    # Append this slice's data to a shared store (e.g., a file on S3, a Redis list)
    append_to_export_file(export, slice_data)

    # Update a counter to track progress
    Redis.current.incrby("export_progress:#{export_id}", user_ids.count)
  end
end
You can then have a separate process or a finalizing job that checks the progress counter. When it matches the total_items, it knows all slices are done and can compile the final result. This pattern turns a scary, long-running process into a manageable flow of small, reliable steps.
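As a sketch of what that finalizer might look like, it can simply re-enqueue itself until the counter catches up with the total. FinalizeExportJob, compile_final_result, and the status value here are hypothetical names, not part of the jobs above.

class FinalizeExportJob
  include Sidekiq::Job

  def perform(export_id)
    export = DataExport.find(export_id)
    processed = Redis.current.get("export_progress:#{export_id}").to_i

    if processed >= export.total_items
      # Every slice has reported in; assemble the final artifact.
      compile_final_result(export) # hypothetical helper
      export.update!(status: 'completed')
    else
      # Not finished yet; check again in a minute.
      FinalizeExportJob.perform_in(60, export_id)
    end
  end
end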
Not all jobs are created equal. A job to charge a customer’s credit card is more urgent than a job to update a recommendation algorithm. If both jobs are in the same queue, a backlog of recommendation jobs could delay critical payments.
This is where priority queues come in. You define multiple queues—like critical, high, default, and low—and assign your jobs to them based on importance. Your job processing system then allocates more workers to higher-priority queues, or processes jobs from those queues first.
Configuring this depends on your job backend. In Sidekiq, you specify the queue when defining the job.
class ProcessPaymentJob
  include Sidekiq::Job
  sidekiq_options queue: :critical, retry: 5

  def perform(payment_id)
    # ... critical payment logic
  end
end

class UpdateSearchIndexJob
  include Sidekiq::Job
  sidekiq_options queue: :low, retry: 3

  def perform(record_id, record_type)
    # ... less urgent indexing logic
  end
end
You then start your Sidekiq processes so they listen on specific queues, giving the important queues more threads and higher precedence.
# One process with 5 threads that drains 'critical' first, then 'default', then 'low'
bundle exec sidekiq -q critical -q default -q low -c 5
# In a separate terminal, a single-threaded process that only handles low-priority tasks
bundle exec sidekiq -q low -c 1
The pattern here is intentional separation. By segregating jobs, you ensure that a traffic spike generating thousands of low-priority jobs won’t impact your application’s core money-making functionality. It’s a way of building resilience and guaranteeing service levels for what matters most.
You can’t manage what you can’t measure. Once you have jobs running in the background, you need visibility. Are they succeeding? How long are they taking? Is a specific job type failing more often today than yesterday?
Basic logging is not enough. I implement structured job monitoring. Every job execution logs its start time, end time, and outcome. I track metrics like average duration, 95th percentile duration, and failure rate.
This data serves two purposes. First, it’s for alerting. If the failure rate for ProcessPaymentJob jumps above 2%, I want a notification in my operations channel immediately. Second, it’s for capacity planning. Seeing that a job’s average duration is creeping up over weeks tells me I might need to optimize the code or add more workers.
Here’s a simplified version of a monitoring module I might include in my jobs.
module JobInstrumentation
  def perform_with_instrumentation(*args)
    start_time = Time.current
    job_id = jid # Sidekiq assigns a unique job ID (jid) to every execution

    Rails.logger.info(
      event: 'job_started',
      job_class: self.class.name,
      job_id: job_id,
      arguments: args
    )

    begin
      result = perform_without_instrumentation(*args)
      duration = Time.current - start_time

      Rails.logger.info(
        event: 'job_succeeded',
        job_class: self.class.name,
        job_id: job_id,
        duration_ms: (duration * 1000).round
      )

      # Send metrics to a monitoring system like StatsD
      StatsD.measure("jobs.#{self.class.name.underscore}.duration", duration)
      StatsD.increment("jobs.#{self.class.name.underscore}.success")

      result
    rescue StandardError => e
      duration = Time.current - start_time

      Rails.logger.error(
        event: 'job_failed',
        job_class: self.class.name,
        job_id: job_id,
        duration_ms: (duration * 1000).round,
        error: e.message,
        backtrace: e.backtrace.first(5)
      )

      StatsD.measure("jobs.#{self.class.name.underscore}.duration", duration)
      StatsD.increment("jobs.#{self.class.name.underscore}.failure")

      raise # Re-raise the error so Sidekiq's retry mechanism still works
    end
  end

  # Hook this module into a job class. The alias is set up at include time,
  # so the module must be included *after* the class defines #perform.
  def self.included(base)
    base.alias_method :perform_without_instrumentation, :perform
    base.alias_method :perform, :perform_with_instrumentation
  end
end

class MyJob
  include Sidekiq::Job

  def perform(user_id)
    # Your original job logic here
  end

  # Included last, once #perform exists, so the alias chain can wrap it.
  include JobInstrumentation
end
With this pattern, I have a rich stream of data about my job system’s health. It moves me from wondering “is the email queue working?” to knowing its exact status and historical performance.
Finally, as your application grows, one job server might not be enough. You need to scale horizontally. This means running job workers on multiple servers. The queue (usually Redis) becomes the central coordination point, and multiple worker processes across multiple machines pull jobs from it.
The pattern here is stateless workers. Each job worker should be independent. It should not rely on local memory or files that other workers can’t access. All state must be in the shared database or cache (like Redis).
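For example, a job that renders a thumbnail should write its output to shared object storage rather than the worker’s local disk. Here’s a minimal sketch assuming Active Storage attachments named original and thumbnail; ThumbnailRenderer is a hypothetical stand-in for whatever image library you use.

class BuildThumbnailJob
  include Sidekiq::Job

  def perform(photo_id)
    photo = Photo.find(photo_id)

    # Hypothetical helper that returns resized image bytes.
    thumbnail_bytes = ThumbnailRenderer.render(photo.original.download)

    # Attach via Active Storage so the result lands in shared object storage,
    # not in a path that only this worker's machine can read.
    photo.thumbnail.attach(
      io: StringIO.new(thumbnail_bytes),
      filename: "thumb_#{photo.id}.jpg",
      content_type: "image/jpeg"
    )
  end
end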
Configuration management becomes important. You need to ensure all worker servers have the same code version, environment variables, and access to external services.
# A simplified Docker Compose snippet showing horizontal scaling
version: '3.8'
services:
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
  web:
    build: .
    command: bundle exec puma -C config/puma.rb
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis
  job_worker:
    build: .
    command: bundle exec sidekiq -c 10 -q critical -q default -q low
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis
    deploy:
      replicas: 3 # Run three identical worker containers
In this setup, three job_worker containers all connect to the same Redis instance. Because each worker pops jobs from Redis atomically, the same job is never grabbed by two workers simultaneously, and the load is naturally distributed.
The key to scaling horizontally is ensuring your job logic and its dependencies are designed for it. Avoid file system assumptions, use centralized logging, and make sure your database connections are pooled correctly.
Building a reliable background job system in Rails is about applying these layered patterns together. Start with idempotency and safe arguments for correctness. Add intelligent retries for resilience. Use batch processing and priority queues for efficiency and control. Implement monitoring for visibility, and design for horizontal scaling from the start.
Each pattern solves a specific problem that appears as your application moves from a simple prototype to a system serving real users with real expectations. By thinking of your background jobs as a core, production-grade subsystem, you build an application that isn’t just fast for the user, but is also robust, maintainable, and ready to grow. The kitchen, to return to my earlier metaphor, becomes a well-organized, scalable operation that can handle a steady Tuesday lunch or a frantic Saturday night rush with equal competence.