6 min read
Self-Hosting an LLM on Kubernetes
Managed inference APIs are convenient until they are not. Here is the full picture of running your own LLM on Kubernetes: GPU scheduling, model storage, vLLM vs Ollama, and the operational tradeoffs.
Tag
3 posts on Kubernetes
Pinecone, Weaviate, Milvus, pgvector, Qdrant — five viable choices for a vector database. Here is why I picked Qdrant for production, how the 3-node cluster is laid out, and what the other options actually trade away.
Docker solves the packaging problem. Kubernetes solves the operational problem. Here is what K8s actually adds, how its core objects work, and why rolling updates change how you think about deployments.