Slow Model Loading on Fresh Nodes
Loading Whisper large-v2 onto a GPU can take upwards of 70 seconds, depending on disk speed and model optimizations. If the service scales horizontally only when traffic spikes, that cold start means the first few hundred API requests hitting each new node queue or time out. Pre-loading doesn't help when nodes are ephemeral or you're running on spot/preemptible GPUs.
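One common mitigation is to gate readiness on the model load, so the orchestrator (e.g. a Kubernetes readinessProbe) holds traffic until the node is warm instead of letting requests time out. A minimal sketch of the pattern, with a fake loader standing in for the real Whisper load (the class and function names here are illustrative, not from any specific library):

```python
import threading
import time


class ModelServer:
    """Load the model in a background thread and expose a readiness
    flag; route traffic to this node only once is_ready() is True."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._ready = threading.Event()
        # Start the slow load immediately, off the request path.
        threading.Thread(target=self._warm_up, daemon=True).start()

    def _warm_up(self):
        self._model = self._loader()  # slow: tens of seconds on a fresh node
        self._ready.set()

    def is_ready(self):
        # Wire this to a readiness endpoint; return 503 until it's True.
        return self._ready.is_set()

    def transcribe(self, audio):
        # Block briefly rather than failing outright if a request
        # slips in before the model finishes loading.
        if not self._ready.wait(timeout=30):
            raise RuntimeError("model still loading")
        return self._model(audio)


def fake_loader():
    time.sleep(0.1)  # stand-in for reading weights from disk
    return lambda audio: f"transcript of {audio}"


server = ModelServer(fake_loader)
print(server.transcribe("a.wav"))
print(server.is_ready())
```

This doesn't shorten the load itself, but it keeps cold nodes out of rotation so requests aren't dropped while the weights stream in.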