Feature Flags Die in Production
Feature flags start as a deployment safety tool and end as permanent conditionals no one understands. Here is how to prevent the graveyard.
Tag
16 posts on Infrastructure
Paging fatigue isn't a staffing problem. It's a design problem. Systems that generate noise do so because they weren't designed for operability.
Every team believes their staging environment reflects production. Almost none of them do. Here is how to test in production safely instead.
Application code that breaks can be rolled back in seconds. A migration that breaks has already changed your data. Migrations deserve more caution than any other code in your pipeline — and usually get less.
Retries, timeouts, and health checks are supposed to make systems resilient. Configured naively, they turn a recoverable blip into a self-sustaining outage. The resilience code becomes the incident.
Staging exists to catch problems before production. Most staging environments catch the wrong problems and miss the real ones, because they differ from production in exactly the ways that matter.
Shared cluster, isolated tenants, write-through pipelines, and the index design choices that decide whether you scale or burn down.
Datadog at Series A is fine. Datadog at seed is malpractice. Here's a stack that gets you 80% of the value for 1% of the cost.
Most slow queries aren't about hardware. They're about three indexes you didn't add. Here's the playbook.
Your dashboards are slow. Engineers want ClickHouse. The CFO is nervous. Here's the real decision framework.
Three queueing options with very different cost, throughput, and operational profiles. Pick the wrong one early and you'll re-platform later.
Every HTTP request you make likely passes through a proxy you never configured. Here is the network-level mechanism — iptables NAT REDIRECT, TPROXY, Squid in action, and why HTTPS only partially protects you.
Managed inference APIs are convenient until they are not. Here is the full picture of running your own LLM on Kubernetes: GPU scheduling, model storage, vLLM vs Ollama, and the operational tradeoffs.
LLMs don't know your data. RAG fixes that by turning your documents into a searchable knowledge base. Here is the full pipeline: chunking strategies, dense vs hybrid retrieval, re-ranking, and when to reach for graph-based RAG with LightRAG.
Pinecone, Weaviate, Milvus, pgvector, Qdrant — five viable choices for a vector database. Here is why I picked Qdrant for production, how the 3-node cluster is laid out, and what the other options actually trade away.
Docker solves the packaging problem. Kubernetes solves the operational problem. Here is what K8s actually adds, how its core objects work, and why rolling updates change how you think about deployments.