Observability: Logs vs. Metrics vs. Tracing — The 'Doctor's Kit' Mental Model
Your app is slow. Is it the DB? The queue? The network? A mastery guide to the three pillars of observability and how they work together.
Your production app is slow. A user complains. Your manager pings you on Slack.
You SSH into the server and… stare at it.
Without observability, debugging production is like a doctor diagnosing a patient they cannot see, touch, or talk to. This guide gives you the doctor’s instruments.
Part 1: Foundations (The Mental Model)
The Doctor’s Kit
A doctor has three instruments for different questions:
- Stethoscope (Logs): “Let me LISTEN to what the patient is saying moment to moment.” Raw events. Timestamped. Detailed. Noisy.
- Vital Signs Dashboard (Metrics): “Let me CHECK key numbers: heart rate, blood pressure, temperature.” Aggregated. Numeric. Great for dashboards.
- X-Ray (Tracing): “Let me SEE inside and follow this exact problem through every organ.” End-to-end request flow across services.
| Pillar | Question Answered | Unit | Tool |
|---|---|---|---|
| Logs | “What exactly happened?” | Text events | Datadog, Loki, ELK Stack |
| Metrics | “How is the system in general?” | Numbers over time | Prometheus, Grafana, Datadog |
| Tracing | “Where did this request slow down?” | Spans + Traces | Jaeger, Zipkin, OpenTelemetry |
Part 2: The Investigation (Each Pillar Deep Dive)
1. Logs — “The Patient’s Diary”
Logs are timestamped records of what happened. The key discipline: structured logging.
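# ❌ BAD: Unstructured (hard to query)
print(f"Error processing order for user {user_id}")

# ✅ GOOD: Structured JSON (searchable, filterable)
import structlog

# structlog prints key=value console output by default; ask for JSON explicitly.
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

# Called inside an `except Exception as e:` handler, with user_id and order_id in scope.
log.error(
    "order_processing_failed",
    user_id=user_id,
    order_id=order_id,
    reason=str(e),
    duration_ms=123,
)
# Output (key order may differ): {"event": "order_processing_failed", "user_id": 42, "order_id": 999, ...}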
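To make "searchable, filterable" concrete, here is a minimal sketch that queries JSON log lines by field using only the standard library. The sample lines and the `query` helper are hypothetical, invented for illustration; a real log backend (Loki, ELK, Datadog) does this at scale with its own query language.

```python
import json

# Hypothetical JSON log lines, one event per line.
raw_lines = [
    '{"event": "order_processing_failed", "user_id": 42, "order_id": 999, "duration_ms": 123}',
    '{"event": "order_created", "user_id": 7, "order_id": 1000, "duration_ms": 40}',
    '{"event": "order_processing_failed", "user_id": 7, "order_id": 1001, "duration_ms": 2150}',
]

def query(lines, **filters):
    """Return parsed log records whose fields equal every given filter."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

# "Show me every failed order for user 42."
failed_for_42 = query(raw_lines, event="order_processing_failed", user_id=42)
print([r["order_id"] for r in failed_for_42])
```

Doing the same against the unstructured `print` line would need a regular expression per message format, which is why the structured form pays off.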
Log Levels (Use them correctly!):
- DEBUG: Verbose detail for developers. Off in production.
- INFO: Normal operations. “User 42 logged in.”
- WARNING: Something bad might happen. “Disk is 80% full.”
- ERROR: Something failed. “Payment failed for order 999.”
- CRITICAL: The app is dying. On-call engineer wakes up.
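The effect of "DEBUG is off in production" can be sketched with Python's standard `logging` module. The logger name and the `build_logger` and `emit_all` helpers are hypothetical, and the threshold is set to INFO as an assumed production setting.

```python
import io
import logging

def build_logger(level, stream):
    """Return a logger that writes 'LEVEL message' lines to `stream`."""
    logger = logging.getLogger("levels_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

def emit_all(logger):
    """Emit one message at each of the five levels."""
    logger.debug("cache miss for key user:42")
    logger.info("User 42 logged in")
    logger.warning("Disk is 80% full")
    logger.error("Payment failed for order 999")
    logger.critical("Database unreachable")

# Assumed production setting: INFO and above are kept, DEBUG is dropped.
prod_out = io.StringIO()
emit_all(build_logger(logging.INFO, prod_out))
prod_lines = prod_out.getvalue().splitlines()
print(prod_lines)
```

Four of the five messages survive; the DEBUG line never reaches the handler, so it costs no storage in production.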
2. Metrics — “The Vital Signs”
Metrics tell you how things are trending over time. The four Golden Signals (Google SRE):
- Latency: How long do requests take? (p50, p95, p99)
- Traffic: How many requests per second?
- Errors: What % of requests are failing?
- Saturation: How full is your system? (CPU %, Memory %, Queue depth)
# Using Prometheus with Python (via prometheus-client)
from prometheus_client import Counter, Histogram

REQUEST_LATENCY = Histogram("api_request_duration_seconds", "API request latency")
ERROR_COUNT = Counter("api_errors_total", "Total API errors", ["endpoint"])

@REQUEST_LATENCY.time()  # Automatically times the function
def process_order(order_id):
    try:
        ...
    except Exception:
        ERROR_COUNT.labels(endpoint="/orders").inc()
        raise  # Count the failure, then let the caller handle it
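To show what the latency, traffic, and error signals actually compute, here is a minimal standard-library sketch over raw request records. The `Request` type, the sample data, and the nearest-rank percentile method are assumptions for illustration; Prometheus estimates percentiles from histogram buckets instead. Saturation is left out because it comes from resource gauges (CPU, memory, queue depth), not from request records.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float
    status: int

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def golden_signals(requests, window_s):
    """Summarise latency, traffic, and errors for one time window."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_pct": 100 * errors / len(requests),
    }

# Hypothetical sample: 100 requests observed in a 10-second window.
sample = [Request(0.1, 200)] * 94 + [Request(0.5, 200)] * 3 + [Request(3.0, 500)] * 3
signals = golden_signals(sample, window_s=10)
print(signals)
```

In this sample the median is a healthy 0.1 s while p99 is 3.0 s, which is why the tail percentiles are tracked alongside p50.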
3. Tracing — “The X-Ray”
In a microservices system, one user request might touch 7 services. If it takes 2 seconds, where did those 2 seconds go?
Tracing answers this with a Waterfall diagram:
Request: GET /checkout (2.0s total)
├── Auth Service (0.05s) ✅
├── Cart Service (0.08s) ✅
├── Inventory Service (1.6s) ← 🔴 THE BOTTLENECK
│ └── DB Query: SELECT ... (1.55s) ← THE ACTUAL SLOW THING
└── Payment Service (0.27s) ✅
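Reading the waterfall above amounts to finding the span with the most "self time", meaning its own duration minus the time spent in its children. A minimal sketch of that idea follows; the `Span` class here is a hypothetical stand-in and not the OpenTelemetry type, and the arithmetic assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    name: str
    duration_s: float
    children: List["Span"] = field(default_factory=list)

def self_time(span):
    """Time spent in the span itself, excluding its children."""
    return span.duration_s - sum(c.duration_s for c in span.children)

def bottleneck(span):
    """Return the span with the largest self time anywhere in the tree."""
    best = span
    for child in span.children:
        candidate = bottleneck(child)
        if self_time(candidate) > self_time(best):
            best = candidate
    return best

# The same trace as the waterfall diagram.
db_query = Span("DB Query", 1.55)
inventory = Span("Inventory Service", 1.6, [db_query])
trace = Span("GET /checkout", 2.0, [
    Span("Auth Service", 0.05),
    Span("Cart Service", 0.08),
    inventory,
    Span("Payment Service", 0.27),
])

slowest = bottleneck(trace)
print(slowest.name, round(self_time(slowest), 2))
```

The Inventory Service span is 1.6 s long, but only about 0.05 s of that is its own work; the rest belongs to the database query underneath it.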
OpenTelemetry is the open standard. Write it once; send to any backend (Jaeger, Datadog, etc.):
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def checkout(user_id):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("fetch_cart"):
            cart = get_cart(user_id)  # Child span of "checkout"
        with tracer.start_as_current_span("process_payment"):
            charge(cart.total)  # Child span of "checkout"
Part 3: The Diagnosis (When to Use What)
| Problem | First Look At | Why |
|---|---|---|
| “Error 500 for user 42” | Logs | Find the exact error message and stack trace. |
| “API is slow since 3 PM” | Metrics | Find the P95 latency spike on the dashboard. |
| “This request takes 2s” | Tracing | See the Waterfall and find which service is slow. |
| “Server is down” | Metrics | CPU/Memory alerts triggered before the crash. |
Part 4: The Resolution (The Alerting Stack)
You cannot watch dashboards 24/7. You need alerts.
- Good Alert: “P99 latency exceeded 2 seconds for 5 minutes.” (Specific, actionable.)
- Bad Alert: “CPU is above 50%.” (Vague, fires all the time, causes alert fatigue.)
# Prometheus alerting rule
groups:
  - name: api_alerts
    rules:
      - alert: HighLatencyP99
        # Fire if 99th percentile latency > 2s for 5 minutes.
        # histogram_quantile needs per-second bucket rates, not raw cumulative counters.
        expr: histogram_quantile(0.99, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "API P99 latency is critically high"
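The `for: 5m` clause is what keeps a brief spike from paging anyone: the condition has to hold continuously for the whole duration before the alert fires. Below is a simplified sketch of that logic in plain Python. It is not Prometheus's implementation, and the `firing_times` helper and the sample P99 values are hypothetical.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the timestamps at which the alert is firing.

    samples: (timestamp_s, value) pairs sorted by time, one per rule evaluation.
    The alert fires only once the value has stayed above the threshold
    for at least `for_seconds` without interruption.
    """
    firing = []
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # condition just became true: start the clock
            if ts - breach_start >= for_seconds:
                firing.append(ts)
        else:
            breach_start = None  # condition cleared: reset the clock
    return firing

# Hypothetical P99 latency (seconds), evaluated once a minute.
p99_samples = [
    (0, 1.0), (60, 2.5), (120, 2.5), (180, 1.0),    # short spike, then recovery
    (240, 2.4), (300, 2.6), (360, 3.0), (420, 2.2),  # sustained breach begins at t=240
    (480, 2.1), (540, 2.3), (600, 2.5),
]
fired = firing_times(p99_samples, threshold=2.0, for_seconds=300)
print(fired)
```

The two-minute spike at t=60 never fires. The sustained breach that starts at t=240 fires from t=540 onward, five minutes after it began.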
Final Mental Model
Logs -> The Patient's Diary. "What happened? When? With what details?"
Metrics -> The Vital Signs. "How is the system trending over time?"
Tracing -> The X-Ray. "Where did THIS specific request go wrong?"
Use Logs to understand events.
Use Metrics to build dashboards and alerts.
Use Tracing to debug slow distributed transactions.
You need all three. Logs without Metrics is flying blind. Metrics without Tracing is knowing you’re sick but not knowing why.