Offline eval tells you the system can be good. Observability tells you whether it is good right now. This is where the job description's "tokens/sec, cost-per-request" lives.
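To make the two metrics concrete, here is a minimal sketch of how they might be computed from per-request records. Everything named here is an assumption for illustration: the `RequestRecord` fields, the `summarize` helper, and especially the prices, which are placeholders rather than any provider's real rates. Latency is taken as wall-clock time for the whole request, so this tokens/sec figure includes time to first token.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class RequestRecord:
    """One logged request (hypothetical schema)."""
    input_tokens: int
    output_tokens: int
    latency_s: float  # wall-clock seconds from request sent to last token received


def tokens_per_sec(r: RequestRecord) -> float:
    """Output tokens per second of wall-clock latency for one request."""
    return r.output_tokens / r.latency_s if r.latency_s > 0 else 0.0


def cost_usd(r: RequestRecord) -> float:
    """Cost of one request under the placeholder prices above."""
    return (
        r.input_tokens * PRICE_PER_M_INPUT
        + r.output_tokens * PRICE_PER_M_OUTPUT
    ) / 1_000_000


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate throughput and mean cost over a window of requests."""
    n = len(records)
    if n == 0:
        return {"requests": 0, "tokens_per_sec": 0.0, "cost_per_request": 0.0}
    total_out = sum(r.output_tokens for r in records)
    total_time = sum(r.latency_s for r in records)
    total_cost = sum(cost_usd(r) for r in records)
    return {
        "requests": n,
        # Total tokens over total time, so long requests are weighted by duration.
        "tokens_per_sec": total_out / total_time if total_time > 0 else 0.0,
        "cost_per_request": total_cost / n,
    }


if __name__ == "__main__":
    window = [
        RequestRecord(input_tokens=1000, output_tokens=500, latency_s=5.0),
        RequestRecord(input_tokens=2000, output_tokens=1000, latency_s=10.0),
    ]
    print(summarize(window))
```

Running `summarize` over a sliding window of recent requests, rather than over all history, is what turns these numbers into a "right now" signal.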