Generative AI on Kubernetes

396 pages

This early release O'Reilly book focuses on the real-world challenge of running large language models in production on Kubernetes. Rather than diving deep into model theory, it tackles deployment, GPU scheduling, scaling, observability, and tuning. The authors break down inference phases, memory demands, KV caching, and production-readiness concerns such as latency and cost control, and also explore agentic workflows and AI-driven applications. The core message is clear: Kubernetes can handle LLM workloads, but doing it well requires new patterns, smarter resource management, and strong operational discipline.
