Mastering Rust Application Observability: From Logging to Distributed Tracing in Production

Learn essential Rust logging and observability techniques from structured logging to distributed tracing. Master performance monitoring for production applications.

Let’s talk about making sense of what your Rust application is doing, especially when it’s running somewhere you can’t directly see. When a program works on your machine, it’s one thing. When it’s serving a thousand requests per second in the cloud, it’s a different world. You need windows into its soul. That’s what logging and observability are for. They are your eyes and ears.

Rust gives you the power to build fast, reliable systems. But that speed is useless if you can’t understand why it suddenly slows down or, worse, stops responding. The good news is that Rust’s ecosystem provides superb tools for this. I want to walk you through several practical methods I use to make my applications transparent and debuggable. We’ll go from simple print statements to complex distributed systems tracking.

The first and biggest leap forward from using simple println! statements is moving to structured logging. Imagine your logs as a stream of random text versus a stream of data. The former is for humans to read once; the latter is for both humans and machines to analyze forever.

Libraries like tracing are built for this. Instead of a line of text, you log an event with named fields. This means every log entry is a small packet of data that can be searched, filtered, and graphed by tools like Loki, Elasticsearch, or Datadog.

use tracing::{info, Level};
use tracing_subscriber::fmt::format::FmtSpan;

fn main() {
    // Initialize the default subscriber to format and output logs
    tracing_subscriber::fmt()
        .with_max_level(Level::INFO)
        .with_span_events(FmtSpan::CLOSE) // Record when spans close
        .init();

    let user_id = 42;
    let transaction_id = "tx-abc123";
    // This log has structured fields `user_id` and `transaction_id`
    info!(
        user_id,
        transaction_id,
        amount = 99.99,
        "Charge processed for user"
    );
}

When this runs, instead of just “Charge processed for user”, you get a line like this: INFO Charge processed for user user_id=42 transaction_id="tx-abc123" amount=99.99. A log aggregator can now easily find every log for user_id=42, or alert you when errors show up for transaction_id="tx-abc123". It changes debugging from grepping through text files to running queries.

Once you have structured logging, you’ll quickly face a new problem. In development, you want DEBUG level logs. In production, you usually want INFO or WARN. But when something goes wrong in production, you desperately wish you had those DEBUG logs for a specific module or request. Restarting the service with a new config is slow and might hide the bug.

This is where dynamic log level control saves you. You can change how verbose your logging is while the application is running. The tracing-subscriber crate has a reload layer that allows this.

use std::{thread, time::Duration};
use tracing_subscriber::{filter::EnvFilter, fmt, prelude::*, reload};

fn main() {
    // Start with an INFO global filter
    let filter = EnvFilter::new("info");
    // Create a reloadable filter layer plus a handle for changing it later
    let (reloadable_filter, filter_handle) = reload::Layer::new(filter);

    tracing_subscriber::registry()
        .with(reloadable_filter)
        .with(fmt::layer()) // A formatting layer so the surviving events are printed
        .init();

    // Your application starts here
    tracing::info!("Application started at INFO level.");
    tracing::debug!("This debug message is NOT visible yet.");

    // Simulate an admin action 10 seconds later to enable debug logs
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(10));
        println!("[Admin] Switching to DEBUG logging...");
        // This modifies the filter for all subsequent events
        filter_handle
            .modify(|filter| *filter = EnvFilter::new("debug"))
            .expect("Could not modify filter");
    });

    // Main loop
    for i in 0..20 {
        thread::sleep(Duration::from_secs(1));
        tracing::debug!("Loop iteration {} - becomes visible after the change", i);
    }
}

I’ve used this pattern to diagnose tricky race conditions. I had a service where a specific user’s request would occasionally fail. By dynamically enabling TRACE logging for just that user’s ID when they made a request, I could see the exact flow without drowning in logs from every other user. It felt like turning on a spotlight in a dark room.
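
As a sketch of how that can look (assuming the reload handle from the example above, plus a hypothetical handle_request span that records a user_id field), you can swap in an EnvFilter directive that raises verbosity only for matching spans:

use tracing_subscriber::{filter::EnvFilter, reload::Handle, Registry};

// Hypothetical admin hook, reusing the `filter_handle` from the reload example.
// EnvFilter directives of the form `target[span{field=value}]=level` raise the
// level only for spans carrying a matching field value.
fn enable_trace_for_user(filter_handle: &Handle<EnvFilter, Registry>, user_id: u64) {
    // Keep INFO globally, but record TRACE events inside any `handle_request`
    // span whose `user_id` field matches the given value.
    let directive = format!("info,[handle_request{{user_id={}}}]=trace", user_id);
    filter_handle
        .modify(|f| *f = EnvFilter::new(directive))
        .expect("failed to update filter");
}

Everything outside those spans stays at INFO, so the overall log volume barely changes.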

Modern applications are rarely a single process. A web request might go from a load balancer, to a gateway, to an authentication service, to three different microservices, and finally to a database. When there’s an error, which link in the chain broke? Traditional logs leave you guessing. Distributed tracing gives you the answer.

The idea is to generate a unique trace ID at the very start of a request (e.g., at the load balancer) and pass it through every single service. Each service creates its own spans (representing units of work) that are children of this trace. In the end, you can see the entire life of that request as a timeline.

OpenTelemetry is the standard for this. In Rust, you can use the opentelemetry and tracing-opentelemetry crates.

use std::collections::HashMap;

use opentelemetry::global;
use opentelemetry_sdk::propagation::TraceContextPropagator;
use tracing::{info, info_span, Instrument};
use tracing_opentelemetry::OpenTelemetrySpanExt; // for `set_parent`
use tracing_subscriber::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Set the global propagator to use W3C Trace Context
    global::set_text_map_propagator(TraceContextPropagator::new());

    // Build a Jaeger export pipeline. The exact constructor has changed across
    // opentelemetry_jaeger versions, and newer setups usually export OTLP instead.
    let tracer = opentelemetry_jaeger::new_pipeline()
        .with_service_name("my_service")
        .install_simple()?;

    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    // Simulate handling an incoming HTTP request with W3C trace headers.
    // A HashMap<String, String> implements the propagator's `Extractor` trait.
    let mut fake_headers = HashMap::new();
    fake_headers.insert(
        "traceparent".to_string(),
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01".to_string(),
    );

    // Extract the trace context from the headers
    let parent_cx = global::get_text_map_propagator(|prop| prop.extract(&fake_headers));

    // Create a span that is a child of the extracted context
    let root_span = info_span!("handle_request", http.method = "GET", http.route = "/api");
    root_span.set_parent(parent_cx);

    // All work within this instrumented block is part of this trace
    async {
        info!("Starting request processing");
        call_database().await;
        info!("Request complete");
    }
    .instrument(root_span)
    .await;

    // Flush and shut down the tracer before exiting
    global::shutdown_tracer_provider();
    Ok(())
}

async fn call_database() {
    // Instrument the async work instead of holding an entered span across `.await`
    async {
        tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    }
    .instrument(info_span!("database_query", db.system = "postgres"))
    .await;
}

When you run this and send a request, the Jaeger UI (or a similar tool) will show you a visual graph. You’ll see the handle_request span, and inside it, the database_query span, with their exact durations. If the database call is slow, it’s immediately obvious. I remember the first time I used this on a complex pipeline; seeing the entire flow of a single user’s action across six services, perfectly timed, was a revelation. It turned days of detective work into minutes of reading a chart.
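
The example above covers the receiving side of propagation. The sending side injects the current span's context into outgoing headers so the next service can pick it up. Here is a minimal sketch, assuming the same tracing-opentelemetry setup as before:

use std::collections::HashMap;

use opentelemetry::global;
use tracing::Span;
use tracing_opentelemetry::OpenTelemetrySpanExt; // for `Span::context()`

// Sketch: build a set of outgoing headers carrying the current trace context.
// HashMap<String, String> implements the propagator's `Injector` trait, so after
// this call it contains a `traceparent` entry for the downstream service.
fn outgoing_headers() -> HashMap<String, String> {
    let mut headers = HashMap::new();
    let cx = Span::current().context(); // OpenTelemetry context of the active span
    global::get_text_map_propagator(|prop| prop.inject_context(&cx, &mut headers));
    headers
}

In a real service you would copy these entries onto your HTTP client's request before sending it.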

Logs tell you what happened. Traces show you the journey. Metrics tell you how much and how often. They are the numbers you put on a dashboard: requests per second, error rates, response time percentiles, memory usage.

The metrics crate provides a simple, powerful API. You define counters, gauges, and histograms, and then expose them via an endpoint for a tool like Prometheus to scrape.

use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;
use std::thread;
use std::time::{Duration, Instant};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Install a Prometheus recorder/exporter that listens on port 9000
    let builder = PrometheusBuilder::new();
    builder.install()?;

    // Simulate some application work in a loop
    loop {
        let start = Instant::now();

        // Increment a counter for each "request"
        counter!("http.requests.total", "method" => "GET", "route" => "/api/status").increment(1);

        // Simulate work
        thread::sleep(Duration::from_millis(rand::random::<u64>() % 100));

        // Record the duration as a histogram
        histogram!("http.request.duration.seconds").record(start.elapsed().as_secs_f64());

        // Set a gauge for a cache size (simulated)
        let simulated_cache_size = rand::random::<f64>() * 1000.0;
        gauge!("cache.items.count").set(simulated_cache_size);

        thread::sleep(Duration::from_secs(2));
    }
}

Now, if you visit http://localhost:9000/metrics, you’ll see output in the Prometheus exposition format (dotted names like http.requests.total are exported with underscores, e.g. http_requests_total, since Prometheus metric names can’t contain dots). You can create alerts for when http_requests_total stops increasing (service is down) or when the 99th percentile of http_request_duration_seconds rises above 1 second (service is slow). Metrics give you the “big picture” health of your system at a glance.

During development, you want logs right in your terminal. In production, you need them in a file for persistence and in a system like journald or a network socket for aggregation. You shouldn’t have to choose. You can send logs to multiple places at once.

The tracing-subscriber registry model is perfect for this. You add multiple layers, each with its own formatter and writer.

use tracing_subscriber::{fmt, prelude::*};
use tracing_appender::rolling;
use std::io;

fn main() {
    // Layer 1: Write formatted logs to stdout (for local/dev)
    let stdout_layer = fmt::layer()
        .with_writer(io::stdout)
        .with_target(true); // Include the target (module path)

    // Layer 2: Write JSON logs to a daily rotating file (for production ingestion)
    let file_appender = rolling::daily("/var/log/myapp", "app.log");
    let (file_writer, _guard) = tracing_appender::non_blocking(file_appender);
    let json_file_layer = fmt::layer()
        .with_writer(file_writer)
        .json() // Output as JSON, perfect for Logstash or similar
        .with_current_span(true); // Include current span info in JSON

    // Combine both layers
    tracing_subscriber::registry()
        .with(stdout_layer)
        .with(json_file_layer)
        .init();

    // Now logs go to both your terminal and a structured JSON file
    tracing::info!("This message appears in two places with two formats.");
}

The _guard is important. It ensures the non-blocking writer thread flushes all logs before the program exits. I learned this the hard way after losing the last few crucial log lines from a crashing process. Always keep that guard in scope for the program’s life.
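
One pattern I like (a sketch, assuming you factor logging setup into its own function) is to return the guard from the init function and bind it in main, so it cannot be dropped early by accident:

use tracing_appender::{non_blocking::WorkerGuard, rolling};
use tracing_subscriber::{fmt, prelude::*};

// Sketch: return the WorkerGuard so the caller decides how long it lives.
// Dropping it flushes and stops the background writer thread.
fn init_logging() -> WorkerGuard {
    let file_appender = rolling::daily("/var/log/myapp", "app.log");
    let (file_writer, guard) = tracing_appender::non_blocking(file_appender);

    tracing_subscriber::registry()
        // JSON output needs tracing-subscriber's "json" feature
        .with(fmt::layer().with_writer(file_writer).json())
        .init();

    guard
}

fn main() {
    // Bound to a named variable, the guard is dropped (and logs flushed) only when main returns.
    let _guard = init_logging();
    tracing::info!("Logging initialized; the guard lives for the whole program.");
}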

Adding context like a user_id or request_id to every single log line in a function is tedious and error-prone. What if you have a function that calls ten other functions, and you need that ID everywhere? Spans solve this elegantly.

A span represents a period of time. When you enter a span, any log event created inside it automatically inherits the fields of that span. It’s like setting a scope for your logs.

use tracing::{info, info_span, Instrument};

fn handle_http_request(request_id: &str, user_agent: &str) {
    // Create a span for the entire request handling.
    // The fields `request_id` and `user_agent` are attached to this span.
    let request_span = info_span!("http_request",
        request_id = request_id,
        user_agent = user_agent,
        http.method = "GET",
    );

    // Execute the future within this span's context.
    let body = async {
        info!("Starting to process request.");
        let result = validate_and_process().await;
        info!(?result, "Request processing finished."); // `?result` means debug-print the result
        result
    }.instrument(request_span);

    // ... spawn or await the future
}

async fn validate_and_process() -> Result<(), String> {
    // This function is called inside the `http_request` span.
    // Its logs will automatically include `request_id` and `user_agent`.
    info!("Inside processing function.");
    Ok(())
}

When you look at the logs, every single one from the moment the request enters handle_http_request until it finishes will have request_id="abc" and user_agent="Mozilla..." attached. This makes it trivial to filter an entire user’s session or a specific failed request across all your logs. It cleans up your code dramatically.

You’ve got logs, traces, and metrics. Now you need to get them off your servers and into a place where you can see them. This is where exporters and integrations come in. You configure your Rust application to send data to your existing observability stack.

A common setup is: logs to Loki via a sidecar or direct output, traces to Tempo or Jaeger via OTLP, and metrics to Prometheus via the exposition endpoint. The tracing-opentelemetry layer makes sending traces to an OpenTelemetry Protocol (OTLP) endpoint straightforward.

use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::{prelude::*, EnvFilter};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Set up OTLP tracing pipeline to send to a collector
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://otel-collector:4317") // Your collector address
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // 2. Set up a level filter from RUST_LOG, defaulting to INFO
    let env_filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));

    // 3. Combine filter, console output, and the OTLP trace layer into one subscriber
    tracing_subscriber::registry()
        .with(env_filter)
        .with(tracing_subscriber::fmt::layer()) // human-readable logs on stdout
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    // Your application code
    tracing::info!("This log is in stdout and its trace is sent to the collector.");

    // Ensure all spans are exported before shutdown
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}

The collector (like the OpenTelemetry Collector) then becomes the central hub. It can receive OTLP, batch the data, and forward traces to Jaeger, metrics to Prometheus, and logs to Loki based on your configuration. Your Rust code only needs to know about one destination.

Rust is fast, but even in Rust, creating a detailed debug string for a large data structure can take microseconds. If you’re logging that at the DEBUG level but running in production at the INFO level, you’re paying that cost for no benefit. You must be lazy.

In tracing, field values are evaluated lazily when the event is actually recorded. If the event is filtered out because its level is disabled, the costly operation to compute the value is never run.

use tracing::{debug, info, trace};
use std::collections::HashMap;

fn process_complex_data(data: &HashMap<String, Vec<u64>>) {
    // The `debug!` macro checks whether the event is enabled first.
    // If DEBUG logging is disabled, this block expression is never evaluated.
    debug!(
        processed_data = ?{
            // This is a potentially expensive serialization/summary
            let mut total = 0;
            for vec in data.values() {
                total += vec.iter().sum::<u64>();
            }
            format!("Processed {} keys, sum={}", data.len(), total)
        },
        "Finished processing"
    );

    // Another example: only serialize this large struct for tracing at the TRACE level.
    trace!(full_input = ?serde_json::to_string(data).ok(), "Trace detail");

    info!("This info log is cheap and always runs if INFO is enabled.");
}

This is a critical habit. I once optimized a hot path by 5% just by moving expensive debug formatting behind a level check. It ensures your observability doesn’t become the source of your performance problems.
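
When the expensive part is bigger than a single field expression, tracing’s enabled! macro (available in recent versions of the crate) gives you an explicit guard. A small sketch:

use tracing::{debug, enabled, Level};

// Hypothetical helper: only build the expensive summary if DEBUG would be recorded.
fn dump_state(state: &[u64]) {
    if enabled!(Level::DEBUG) {
        // Skipped entirely when the effective level is INFO or higher.
        let total: u64 = state.iter().sum();
        debug!(total, items = state.len(), "Expensive state dump");
    }
}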

Bringing it all together, effective observability in Rust isn’t about using one perfect tool. It’s about combining these techniques into a cohesive strategy. You use structured logging for discrete events, spans to group them contextually, and dynamic control to adjust detail on the fly. You emit metrics for system health and dashboards. You propagate traces to understand cross-service flows. You send all this data to external platforms where you can visualize and alert on it. And you do it all in a way that respects the performance profile of your application.

The goal is to shift from reacting to outages to understanding behavior. When a user reports a problem, you should be able to find their trace, see their logs, and check the relevant metrics for the time of the error, often without needing to reproduce the issue locally. It turns production debugging from a panic-stricken hunt into a methodical investigation. Rust, with its focus on zero-cost abstractions, lets you build this visibility without giving up the speed and safety that brought you to the language in the first place. Start with structured logging, then add spans, then metrics. Soon, you’ll have a system that tells you its own story.
