GenAI isn’t magic — it’s a workload. Inference, training, serving — Kubernetes eats them for breakfast.
But Agentic Apps on K8s bring new scheduling and utilization challenges. Low GPU utilization? Wrong operators? Misaligned resource requests? That’s wasted money.
Here’s how I see it coming together:
1. Treat GenAI as a First-Class Workload
- Training, inference, serving — schedule them like any microservice
- Separate control-plane and data-plane for AI pipelines
- Apply standard K8s patterns — don’t reinvent orchestration
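Treating a training run like any other batch workload means plain K8s primitives do the job. A minimal sketch (image name and resource sizes are placeholders, not from a real setup):

```yaml
# Hypothetical fine-tuning Job: scheduled with the same primitives
# as any other batch workload (requests, limits, restart policy).
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune                # illustrative name
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/llm-trainer:latest  # placeholder image
          resources:
            requests:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: 1     # classic device-plugin style GPU request
            limits:
              nvidia.com/gpu: 1
```

Nothing AI-specific in the manifest itself — that's the point.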
2. GPU Scheduling With DRA
- Dynamic Resource Allocation replaces static GPU assignments
- NVIDIA KAI Scheduler integrates directly with K8s APIs
- Prevents idle GPU time — turn CapEx into utilization
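With DRA, the static `nvidia.com/gpu: 1` count above becomes a structured claim the scheduler can reason about. A sketch assuming a Kubernetes 1.32+ cluster with the DRA feature enabled and an NVIDIA DRA driver publishing a `gpu.nvidia.com` DeviceClass:

```yaml
# ResourceClaimTemplate: each consuming pod gets its own claim for one GPU.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com   # provided by the NVIDIA DRA driver
---
# Pod references the claim instead of a static extended-resource count.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # illustrative name
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: worker
      image: registry.example.com/inference:latest  # placeholder image
      resources:
        claims:
          - name: gpu                 # container consumes the claimed device
```

The claim is allocated at scheduling time and released when the pod finishes, which is what lets the scheduler pack GPUs instead of leaving them pinned and idle.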
3. Use the Right Operators & Frameworks
- Operators handle provisioning + lifecycle for AI jobs
- Examples: KServe for inference, Kubeflow for pipelines

- Keeps the AI stack K8s-native
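Operators reduce serving to a declarative resource. A KServe sketch, assuming the Hugging Face serving runtime is installed (model URI and name are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-chat                      # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface             # assumes the HF runtime is available
      storageUri: hf://meta-llama/Llama-3.1-8B-Instruct   # placeholder model
      resources:
        limits:
          nvidia.com/gpu: 1
```

The operator handles provisioning, routing, and scale-to-zero; you never hand-roll the serving Deployment.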
4. Bring Models Close to Compute
- Host local models with Ollama for low-latency responses
- Reduce dependency on external APIs for inference
- Fit deployment footprint to your cost/perf sweet spot
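A minimal in-cluster Ollama sketch (one GPU, ephemeral model storage — swap the `emptyDir` for a PVC in anything real):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434    # Ollama's default API port
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          emptyDir: {}                # use a PVC to persist pulled models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector: { app: ollama }
  ports:
    - port: 11434
      targetPort: 11434
```

Agents in the same cluster now hit `http://ollama:11434` instead of an external API — no egress, no per-token bill.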
5. Orchestrate the Full Agentic Stack
- Data + MCP (Model Context Protocol) as the context backbone for agents
- Frameworks like LangChain and kagent handle orchestration
- Keep agents stateless — scale horizontally under load
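Stateless agents are just another horizontally scaled service. A sketch assuming a stateless agent Deployment named `agent-api` (an illustrative name) already exists:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-api                   # hypothetical stateless agent service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Keep conversation state in an external store, and the HPA can add or kill replicas freely under load.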
Remember, basics matter:
- Idle GPUs are the silent killer — DRA fixes that
- Local models + optimized scheduling cut inference delays
- Kubernetes gives you elasticity — let Agentic Apps just ride the wave
GenAI on K8s isn’t an experiment; it’s the next wave of platform engineering. Run it like you’d run any other production workload.