Troubleshooting

Never hypothesize before you scope. Never scope before you clarify.

The 8-Step Framework

1. CLARIFY    — what does the symptom actually mean?
2. SCOPE      — how widespread? when did it start?
3. CHANGES    — what deployed recently?
4. METRICS    — quantify before diagnosing
5. LAYERS     — top-down hypothesis tree
6. LOGS       — evidence before conclusion
7. HYPOTHESIZE — cheapest test first
8. FIX vs MITIGATE → POSTMORTEM

Step 1: CLARIFY

The symptom is always vague. Your job is to make it specific.

"Slow" → ask:

This isn't stalling — this is the job. A marketing manager's "slow" could be a 200ms regression or a 30-second timeout. They're completely different problems.

Step 2: SCOPE

Once clarified, scope the blast radius:

Scope tells you urgency and which layer to start at. If 100% of users are affected, suspect shared infrastructure first. If only about 10% of users are affected, the cause is more likely tied to a specific cohort (region, device, account tier).
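One way to make the cohort question concrete is to slice failures along each dimension and see which slice stands out. The sketch below is only illustrative: the record fields (region, device, failed) and the function name are assumptions, not part of any real logging schema.

```python
from collections import defaultdict

def affected_fraction_by_cohort(requests, cohort_key):
    """Return {cohort value: fraction of its requests that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for req in requests:
        cohort = req[cohort_key]
        totals[cohort] += 1
        if req["failed"]:
            failures[cohort] += 1
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}

# Hypothetical request records, invented for illustration.
sample = [
    {"region": "eu", "device": "ios", "failed": True},
    {"region": "eu", "device": "android", "failed": True},
    {"region": "us", "device": "ios", "failed": False},
    {"region": "us", "device": "android", "failed": False},
]

by_region = affected_fraction_by_cohort(sample, "region")
by_device = affected_fraction_by_cohort(sample, "device")
print(by_region)
print(by_device)
```

In this made-up sample the failure rate splits cleanly by region and not at all by device, which would point the investigation at something region-specific rather than at a client issue.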

Step 3: CHANGES

Before touching metrics, ask: what changed recently?

As a rule of thumb, most production incidents (roughly 80%) trace back to a recent change, which makes this the cheapest hypothesis to test.
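A minimal sketch of that check, assuming you can export a change log as a list of timestamped entries. The entry format, the 24-hour window, and the function name are all assumptions made for illustration; a real deploy tool will have its own format.

```python
from datetime import datetime, timedelta

def changes_before_incident(changes, incident_start, window=timedelta(hours=24)):
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [c for c in changes if earliest <= c["at"] <= incident_start]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)

# Hypothetical change log, invented for illustration.
incident_start = datetime(2024, 1, 10, 14, 0)
changes = [
    {"what": "api deploy", "at": datetime(2024, 1, 10, 13, 30)},
    {"what": "feature flag flip", "at": datetime(2024, 1, 10, 9, 0)},
    {"what": "db migration", "at": datetime(2024, 1, 3, 11, 0)},
    {"what": "config push", "at": datetime(2024, 1, 10, 15, 0)},
]

suspects = changes_before_incident(changes, incident_start)
for change in suspects:
    print(change["at"], change["what"])
```

Sorting newest first puts the change closest to the incident at the top, which is usually the cheapest one to roll back or rule out. Changes that landed after the incident started are excluded because they cannot have caused it.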

Step 4: METRICS

Now you look at data. The four numbers that matter:

Quick reminder on percentiles: p50 is the median request, so half of all requests are faster than it. p99 is the latency that 99% of requests beat; only the slowest 1% are worse. If p50 is fine but p99 is bad, you have a tail latency problem, likely a specific slow query or cold start. If both are bad, you have a systemic problem.
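The p50/p99 rule can be sketched in a few lines. This uses a simple nearest-rank percentile, and the budget values are assumptions chosen for the example, not recommended thresholds. A monitoring system may compute percentiles differently (for example, by interpolation or from histograms).

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def classify(latencies_ms, p50_budget_ms, p99_budget_ms):
    """Apply the rule: p50 fine and p99 bad is a tail problem, both bad is systemic."""
    p50_bad = percentile(latencies_ms, 50) > p50_budget_ms
    p99_bad = percentile(latencies_ms, 99) > p99_budget_ms
    if p50_bad and p99_bad:
        return "systemic"
    if p99_bad:
        return "tail"
    if p50_bad:
        return "unclassified"  # median over budget but tail within it; not covered by the rule
    return "healthy"

# Hypothetical samples: 98 fast requests plus 2 very slow ones.
tail_sample = [100] * 98 + [5000] * 2
print(classify(tail_sample, p50_budget_ms=200, p99_budget_ms=1000))
print(classify([2000] * 100, p50_budget_ms=200, p99_budget_ms=1000))
```

With the first sample the median stays at 100 ms while p99 lands on one of the 5000 ms outliers, so the sketch reports a tail problem. In the second sample every request is slow, so both percentiles exceed their budgets.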