Quiet Clairvoyance

Foresight you earn in hindsight.

K8s + GenAI Agentic Workloads Are Just Another Deployment

GenAI isn’t magic — it’s a workload. Inference, training, serving — Kubernetes eats them for breakfast.

But Agentic Apps on K8s bring new scheduling and utilization challenges. Low GPU utilization? Wrong operators? Misaligned resource requests? That’s wasted money.

Here’s how I see it coming together:

1. Treat GenAI as a First-Class Workload

  • Training, inference, serving — schedule them like any microservice
  • Separate control-plane and data-plane for AI pipelines
  • Apply standard K8s patterns — don’t reinvent orchestration
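In practice, "first-class workload" means an ordinary Deployment with explicit resource requests. A minimal sketch — the image name is illustrative, and it assumes the NVIDIA device plugin is installed so `nvidia.com/gpu` is schedulable:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: server
        image: my-registry/llm-server:latest   # illustrative image
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
          limits:
            nvidia.com/gpu: 1   # extended resource from the device plugin
```

Nothing exotic: the scheduler treats it like any other microservice, which is exactly the point.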

2. GPU Scheduling With DRA

  • Dynamic Resource Allocation replaces static GPU assignments
  • NVIDIA KAI Scheduler integrates directly with K8s APIs
  • Cuts idle GPU time and turns CapEx into real utilization
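With DRA, the GPU becomes a claimable resource rather than a static count. A sketch of the shape, assuming a cluster with DRA enabled (`resource.k8s.io/v1beta1`, Kubernetes v1.32+) and a DRA-capable NVIDIA driver exposing a `gpu.nvidia.com` device class — names and the trainer image are illustrative:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed device class from the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # illustrative image
    resources:
      claims:
      - name: gpu   # container consumes the claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Because the claim is allocated at scheduling time, the GPU is bound only while a pod actually needs it — that's where the idle time goes away.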

3. Use the Right Operators & Frameworks

  • Operators handle provisioning + lifecycle for AI jobs
  • Examples: KServe for inference, Kubeflow for pipelines
  • Keeps the AI stack K8s-native
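With KServe, serving a model is one custom resource rather than a hand-rolled Deployment + Service + autoscaler. A minimal sketch using KServe's sklearn example model — the name and storage URI follow the upstream docs and are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-demo
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # illustrative model location; point at your own bucket in practice
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

The operator reconciles everything behind it — revisions, scaling, routing — which is what keeps the stack K8s-native.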

4. Bring Models Close to Compute

  • Host local models with Ollama for low-latency responses
  • Reduce dependency on external APIs for inference
  • Fit deployment footprint to your cost/perf sweet spot
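Running Ollama in-cluster is just another Deployment plus a Service. A sketch, assuming the NVIDIA device plugin for the GPU limit and a pre-created PVC named `ollama-models` for model storage:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434   # Ollama's default API port
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: models
          mountPath: /root/.ollama   # default model cache location
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models   # assumed pre-created PVC
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
```

Agents in the same cluster now hit `http://ollama:11434` instead of an external API — that's the latency and dependency win.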

5. Orchestrate the Full Agentic Stack

  • Data + MCP (Model Context Protocol) as the context backbone for agents
  • Frameworks like LangChain, KAgent handle orchestration
  • Keep agents stateless — scale horizontally under load
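If the agents are stateless, horizontal scaling is standard autoscaling. A sketch targeting a hypothetical stateless agent Deployment named `agent-runtime`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-runtime   # assumed stateless agent Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU passes 70%
```

Keep conversation state in an external store, and the agent pods themselves stay disposable.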

Remember, basics matter:

  • Idle GPUs are the silent killer — DRA fixes that
  • Local models + optimized scheduling cut inference delays
  • Kubernetes gives you elasticity — let Agentic Apps just ride the wave

GenAI on K8s isn’t an experiment; it’s the next wave of platform engineering. Run it like you’d run any other production workload.