Never hypothesize before you scope. Never scope before you clarify.
1. CLARIFY — what does the symptom actually mean?
2. SCOPE — how widespread? when did it start?
3. CHANGES — what deployed recently?
4. METRICS — quantify before diagnosing
5. LAYERS — top-down hypothesis tree
6. LOGS — evidence before conclusion
7. HYPOTHESIZE — cheapest test first
8. FIX vs MITIGATE → POSTMORTEM
The symptom is always vague. Your job is to make it specific.
This isn't stalling — this is the job. A marketing manager's "slow" could be a 200ms regression or a 30-second timeout. They're completely different problems.
Scope tells you urgency and which layer to start at. If 100% of users are affected, start at shared infrastructure. If only around 10% are affected, suspect a specific cohort (region, device, account tier).
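One way to check whether a problem is cohort-specific is to break the failure rate down by a single dimension at a time. The sketch below is illustrative only: the event shape (`region`, `failed`) and the function name `affected_share_by_cohort` are assumptions, not part of any real tool.

```python
from collections import Counter

def affected_share_by_cohort(events, key):
    """Return {cohort: fraction of that cohort's events that failed}."""
    total = Counter()
    failed = Counter()
    for event in events:
        cohort = event[key]
        total[cohort] += 1
        if event["failed"]:
            failed[cohort] += 1
    return {cohort: failed[cohort] / total[cohort] for cohort in total}

# Hypothetical sample: every failure comes from one region.
events = [
    {"region": "eu", "failed": True},
    {"region": "eu", "failed": True},
    {"region": "us", "failed": False},
    {"region": "us", "failed": False},
    {"region": "ap", "failed": False},
]
shares = affected_share_by_cohort(events, "region")
```

If one cohort carries nearly all the failures, start there. If the rate is similar across every cohort, the shared layers are the better first suspect.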
As a rule of thumb, around 80% of production incidents trace back to a recent change. This is also the cheapest hypothesis to test: compare the change timeline against when the symptom started.
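A minimal sketch of that comparison, assuming you can export recent changes as a list of records with a timestamp. The record fields, the 24-hour window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta

def changes_before(incident_start, changes, window=timedelta(hours=24)):
    """Return changes deployed within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [
        change for change in changes
        if earliest <= change["deployed_at"] <= incident_start
    ]
    return sorted(candidates, key=lambda change: change["deployed_at"], reverse=True)

# Hypothetical change log.
changes = [
    {"name": "api v2.3.1", "deployed_at": datetime(2024, 5, 1, 9, 0)},
    {"name": "feature flag: new-checkout", "deployed_at": datetime(2024, 5, 2, 13, 30)},
    {"name": "db index migration", "deployed_at": datetime(2024, 5, 2, 15, 0)},
]
incident_start = datetime(2024, 5, 2, 14, 0)
suspects = changes_before(incident_start, changes)
```

Changes that landed after the symptom started are excluded, since they cannot have caused it. The newest change inside the window is usually the first one worth testing.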
Quick reminder on reading percentiles: p50 is the median request, so half of all requests are faster than it. p99 is the latency that only the slowest 1% of requests exceed. If p50 is fine but p99 is bad, you have a tail latency problem, likely a specific slow query or a cold start. If both are bad, you have a systemic problem.
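The p50/p99 rule can be sketched in a few lines. This uses the nearest-rank definition of a percentile, which is one of several common definitions. The latency budgets and sample values are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def classify(latencies_ms, p50_budget_ms, p99_budget_ms):
    """Label a latency sample as 'systemic', 'tail', or 'ok' against two budgets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    if p50 > p50_budget_ms:
        # The typical request is slow; p99 is always at least as high as p50.
        return "systemic"
    if p99 > p99_budget_ms:
        # The median is fine, only the slowest requests are affected.
        return "tail"
    return "ok"

# Hypothetical samples: 98 fast requests plus 2 very slow ones, then uniformly slow.
tail_sample = [100] * 98 + [3000] * 2
slow_sample = [800] * 100

tail_label = classify(tail_sample, p50_budget_ms=200, p99_budget_ms=500)
slow_label = classify(slow_sample, p50_budget_ms=200, p99_budget_ms=500)
```

Here `tail_label` comes out as "tail" and `slow_label` as "systemic". Real budgets would come from your own SLOs rather than the numbers used here.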