Never hypothesize before you scope. Never scope before you clarify.

The 8-Step Framework

1. CLARIFY    — what does the symptom actually mean?
2. SCOPE      — how widespread? when did it start?
3. CHANGES    — what deployed recently?
4. METRICS    — quantify before diagnosing
5. LAYERS     — top-down hypothesis tree
6. LOGS       — evidence before conclusion
7. HYPOTHESIZE — cheapest test first
8. FIX vs MITIGATE → POSTMORTEM

Step 1: CLARIFY

The reported symptom is almost always vague. Your job is to make it specific.

Asking clarifying questions isn't stalling; it is the job. A marketing manager's "slow" could be a 200ms regression or a 30-second timeout, and those are completely different problems.


Step 2: SCOPE

Scope tells you how urgent the incident is and which layer to start at. If 100% of users are affected, suspect shared infrastructure first. If around 10% of users are affected, suspect a specific cohort (region, device, account tier).
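
One way to make the cohort check concrete is to count how the affected users are distributed across each dimension. The sketch below is illustrative only: the dictionary keys (region, device, account_tier) and the function name dominant_cohorts are assumptions, not part of any real tool, and a high share only means something once compared against that cohort's share of normal traffic.

```python
from collections import Counter

def dominant_cohorts(affected, dimensions=("region", "device", "account_tier")):
    """For each dimension, return the most common value among affected
    users and the share of affected users that have it.

    `affected` is a list of dicts; the keys are hypothetical field names.
    """
    result = {}
    for dim in dimensions:
        # Count only the users that actually report this dimension.
        counts = Counter(user[dim] for user in affected if dim in user)
        if counts:
            value, n = counts.most_common(1)[0]
            result[dim] = (value, n / len(affected))
    return result

# Hypothetical sample of affected users.
affected_users = [
    {"region": "eu", "device": "ios"},
    {"region": "eu", "device": "android"},
    {"region": "eu", "device": "ios"},
    {"region": "us", "device": "ios"},
]
cohorts = dominant_cohorts(affected_users)
```

If one value accounts for nearly all affected users but only a small part of overall traffic, that cohort is a reasonable place to start looking.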


Step 3: CHANGES

As a rule of thumb, about 80% of production incidents trace back to a recent change. That makes "what changed?" the cheapest hypothesis to test.
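
A minimal sketch of that check, assuming you can export a list of recent changes with timestamps. The Change record, the suspect_changes function, and the 24-hour lookback window are all assumptions made for illustration; a real deploy log will have its own shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Change:
    name: str              # hypothetical label, e.g. a deploy or config flag
    deployed_at: datetime

def suspect_changes(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes deployed within `lookback` before the incident,
    most recent first, since the latest change is usually checked first."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes
        if window_start <= c.deployed_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)

# Hypothetical change log.
incident_start = datetime(2024, 1, 10, 12, 0)
change_log = [
    Change("checkout-service deploy", datetime(2024, 1, 10, 11, 30)),
    Change("db index migration", datetime(2024, 1, 8, 9, 0)),
    Change("feature flag flip", datetime(2024, 1, 10, 9, 0)),
    Change("post-incident hotfix", datetime(2024, 1, 10, 13, 0)),
]
suspects = suspect_changes(change_log, incident_start)
```

Changes deployed after the incident began are excluded because they cannot have caused it, and anything older than the window is set aside until the recent candidates are ruled out.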


Step 4: METRICS

Quick reminder on direction: p50 is the median request, so half of all requests are faster than it. p99 is the threshold that only the slowest 1% of requests exceed. If p50 is fine but p99 is bad, you have a tail latency problem, likely a specific slow query or cold start. If both are bad, you have a systemic problem.
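
The p50/p99 reasoning can be sketched as a small classifier. This is an illustration under stated assumptions: it uses the nearest-rank percentile definition (monitoring systems often interpolate instead), and the budget thresholds and the function names percentile and classify_latency are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

def classify_latency(samples_ms, p50_budget_ms, p99_budget_ms):
    """Label a latency sample as healthy, tail, or systemic."""
    p50_bad = percentile(samples_ms, 50) > p50_budget_ms
    p99_bad = percentile(samples_ms, 99) > p99_budget_ms
    if p50_bad and p99_bad:
        return "systemic"      # both the median and the tail are slow
    if p99_bad:
        return "tail"          # median fine, slowest requests are slow
    if p50_bad:
        return "median-only"   # not covered by the text; inspect manually
    return "healthy"

# Hypothetical samples: 98 fast requests and 2 very slow ones.
tail_heavy = [100] * 98 + [5000] * 2
label = classify_latency(tail_heavy, p50_budget_ms=200, p99_budget_ms=1000)
```

With these samples the median stays at 100 ms while the 99th percentile lands on a 5000 ms request, so the result is "tail" rather than "systemic".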