Why I Run Qdrant in Production: A 3-Node Cluster vs the Alternatives

A vector database is a boring choice in the same way Postgres is a boring choice. You want the boring one. Once a RAG system goes to production, the database stops being the interesting part — it has to be fast, cheap to operate, and impossible to lose data on. Everything else is a feature you may or may not use.

I run Qdrant. Three nodes, self-hosted on Kubernetes, replication factor 2, around 40 million vectors. It was not the obvious choice when I started; the obvious choice was "use Pinecone, it's a managed service." This post is the long version of why I went the other way and what the cluster looks like.

The five candidates

In the order I evaluated them:

Pinecone — fully managed, proprietary, the safe corporate choice.
Weaviate — open-source, batteries-included (modules for embeddings, classification), GraphQL-first.
Milvus — open-source, very large scale, complicated operationally.
pgvector — Postgres extension, no new system to run if you already have Postgres.
Qdrant — open-source, written in Rust, HTTP/gRPC API, focused on doing one thing well.

These all do approximate nearest neighbor search over high-dimensional vectors. They all support HNSW. They all do filtering. The differences are in everything around the search.

What I actually needed

Before comparing, I wrote down the constraints. This is the step most people skip and regret later.

Scale: 40M vectors today, planning for 200M within a year. 1024-dim vectors from a multilingual embedding model.
Latency: p95 under 50ms for top-10 retrieval with metadata filters.
Filtering: heavy. Almost every query has filters on tenant ID, language, timestamp, and document type. The vector search alone is rarely useful.
Updates: continuous ingestion, not batch. Documents change, get reindexed, get deleted.
Operating cost: bounded. This is one part of a larger product, not the whole product.
Self-hosting: required. The data is sensitive enough that exporting it to a third-party SaaS was not an option.

That last bullet eliminated Pinecone immediately. I still evaluated it because the comparison is useful — and because "you should just use Pinecone" is the default advice on the internet, and that advice is wrong for plenty of teams.

Pinecone: the managed default

Pinecone is good. It is genuinely easy to operate, the latency is consistent, and you do not have to think about replication or sharding. If you are a small team with no infrastructure expertise and your data is not regulated, this is probably the right answer.

Why I did not pick it:

Cost at scale. The pricing model gets expensive fast once you cross a few tens of millions of vectors with high query volume. The serverless tier is cheap to start with, then jumps when you need consistent throughput.
Self-hosting was a hard requirement. Pinecone is closed-source and runs only as a SaaS.
Vendor lock-in. The query API is theirs. Migrating away later means rewriting every retrieval call site.
Filter performance. Pinecone's filtered search has historically been slower than its pure vector search. With my workload, where almost every query is filtered, this matters.

The right framing: Pinecone is the right call when "ease of operation" is worth more than the cost difference and the lock-in. For my situation it was not.

Weaviate: too much in one box

Weaviate is the most feature-dense of the open-source options. It bundles a vector store, embedding model integrations, a hybrid search engine, and a GraphQL query layer. You can hand it documents and have it embed them for you.

The features are real. The problem is that they are coupled. The same process that does ANN search also runs the embedding pipeline. In a production system I want those decoupled — embedding is a CPU/GPU-bound batch operation that should run on its own infrastructure, retrieval is a latency-sensitive read path. Bundling them means scaling decisions for one affect the other.

The other thing that bothered me: GraphQL as the primary API. Not because GraphQL is bad, but because it is a layer on top of what should be a simple query → top-K results call. Every retrieval call now goes through a GraphQL parser and resolver layer, and you end up debugging GraphQL field selection issues when what you wanted was a vector search.

Weaviate's clustering story is also less mature than Qdrant's. When I evaluated it, replication and sharding worked but had more sharp edges in the failure-recovery paths.

Milvus: too much complexity for my scale

Milvus is the option for billion-vector workloads. The architecture is impressive: separate components for query nodes, data nodes, index nodes, a root coordinator, an external object store for cold data, and an external metadata store. It scales to sizes I do not have.

It also requires you to operate all of those components. The minimum production deployment has, depending on how you count, six or seven separate services plus etcd plus MinIO or S3. For 40M vectors, this is overkill in the worst way: you pay the operational cost without getting the benefit.

If you have a billion vectors and a dedicated platform team, Milvus is great. I have neither.

pgvector: the seductive wrong answer

pgvector is the option that almost won, because the argument is so clean: "you already run Postgres, just add a column."

I ran a serious benchmark. Fed it 10M vectors, 1024 dimensions, HNSW index. Filtered queries on three columns. The numbers were OK at 10M — p95 around 80ms — and got worse predictably as I scaled toward 40M. Memory usage was higher than Qdrant for the same dataset because Postgres stores vectors as full-precision floats by default and the HNSW index is on top of the heap.

The real problem was not raw performance. It was operational impedance mismatch. Postgres is built around transactional row-by-row work. A vector workload is almost the opposite: huge index builds, occasional bulk reindex, ANN search that is fundamentally not a B-tree lookup. Running both on one Postgres instance means a long-running reindex starves your transactional queries; isolating them means running a separate Postgres just for vectors, at which point you have a dedicated vector database that happens to speak SQL.

There is also the upgrade story. Postgres major version upgrades are slow, careful, planned events. pgvector itself moves faster, with new index types and quantization features, and you cannot adopt them until the new extension version is available for your Postgres deployment and your DBA is comfortable upgrading.

pgvector is a great answer for "I have under 5M vectors and I want one less system to operate." Past that, the trade goes the wrong way.

Why Qdrant won

Qdrant is the option that keeps doing the right thing. It is one binary, written in Rust, that does vector search with metadata filtering and nothing else. The API is HTTP and gRPC, both straightforward.

Specific things that made me pick it:

Filtering is first-class. Qdrant has a payload index — separate from the vector index — for filter fields. When you do a filtered search, it intersects the payload index with the HNSW traversal. With my workload (almost every query filtered on tenant, language, timestamp), this is the single biggest performance lever, and Qdrant exploits it harder than any of the others.
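To make the filtered-search shape concrete, here is a minimal sketch that builds the JSON bodies for Qdrant's REST payload-index and search endpoints using only the standard library. The collection name `docs`, the payload field names (`tenant_id`, `language`, `doc_type`, `timestamp`) and the example values are assumptions for illustration, not my actual schema.

```python
import json

# One payload index per filter field, each sent as
# PUT /collections/docs/index  ("docs" is a hypothetical collection name).
PAYLOAD_INDEXES = [
    {"field_name": "tenant_id", "field_schema": "keyword"},
    {"field_name": "language", "field_schema": "keyword"},
    {"field_name": "doc_type", "field_schema": "keyword"},
    {"field_name": "timestamp", "field_schema": "integer"},  # epoch seconds (assumed)
]


def build_filtered_search(vector, tenant_id, language, doc_type, since_ts, limit=10):
    """Build a body for POST /collections/docs/points/search.

    Every condition under "must" has to hold for a point to be returned.
    Field names are hypothetical; adapt them to your own payload.
    """
    return {
        "vector": vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                {"key": "tenant_id", "match": {"value": tenant_id}},
                {"key": "language", "match": {"value": language}},
                {"key": "doc_type", "match": {"value": doc_type}},
                {"key": "timestamp", "range": {"gte": since_ts}},
            ]
        },
    }


if __name__ == "__main__":
    body = build_filtered_search([0.1, 0.2, 0.3], "acme", "en", "article", 1_700_000_000)
    print(json.dumps(body, indent=2))
```

The point of the sketch is that the filter travels with the vector in a single request, so the engine can apply it during the search rather than after it.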

Quantization without drama. Scalar quantization (int8) and binary quantization are flags on the collection config. You enable them, the recall drops a small amount, the memory footprint drops by 4x or 32x. I run scalar quantization in production — the recall hit at top-10 is under 1% on my data and the cluster fits comfortably on machines I would have needed three of otherwise.
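The 4x and 32x figures fall straight out of the bits stored per dimension. The sketch below shows that arithmetic next to the collection-level config fragments that switch each mode on. The `quantile` and `always_ram` values are illustrative assumptions, not my production settings.

```python
FLOAT32_BITS = 32  # unquantized vectors store one float32 per dimension


def compression_factor(bits_per_dim):
    """How many times smaller the raw vector data gets versus float32."""
    return FLOAT32_BITS / bits_per_dim


# Fragments for the "quantization_config" field of a collection definition.
SCALAR_INT8 = {"scalar": {"type": "int8", "quantile": 0.99, "always_ram": True}}
BINARY = {"binary": {"always_ram": True}}

if __name__ == "__main__":
    print("int8 scalar:", compression_factor(8))
    print("binary:", compression_factor(1))
```

These ratios cover the vector data only. The HNSW graph and payload indices are not shrunk by quantization.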

Replication and sharding are simple. You declare a collection with shard_number and replication_factor, the cluster handles the rest. Failover is automatic, recovery is observable, you do not need a separate coordinator service.
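As a sketch of what "declare a collection" means in practice, this builds the JSON body for a `PUT /collections/{name}` request. The vector size, shard count and replication factor come from this post. The `Cosine` distance is an assumption, since the right metric depends on your embedding model.

```python
import json


def create_collection_body(dim, shard_number, replication_factor, distance="Cosine"):
    """Body for PUT /collections/{name}; the distance default is an assumption."""
    return {
        "vectors": {"size": dim, "distance": distance},
        "shard_number": shard_number,
        "replication_factor": replication_factor,
    }


if __name__ == "__main__":
    print(json.dumps(create_collection_body(1024, 6, 2), indent=2))
```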

Single-binary operations. No external metadata store. No external object store. The data lives on local SSDs (or a CSI volume in Kubernetes), the cluster talks Raft for consensus, that is the entire operational picture.

Open source, permissive license. Apache 2.0. No risk of a relicense that locks me out of features.

It is fast. On the same 10M vector benchmark, Qdrant came in roughly 2-3x faster on filtered queries than pgvector, and used about 60% less memory. Against Weaviate it was closer, but the operational story still favored Qdrant.

The thing I will admit: Qdrant is a younger project than Pinecone or Weaviate. The bug-fix turnaround is fast, but you do hit the occasional rough edge. I have hit two in a year. Neither was unrecoverable.

How the 3-node cluster is laid out

[Figure: Qdrant 3-node cluster, 6 shards across 3 nodes, replication factor 2]

Three nodes. Replication factor 2. Six shards per collection. This is the smallest cluster that gives me both horizontal scale-out and survival of a single-node failure, and it is what I would recommend as a starting point for anyone running Qdrant in production.

The math: with 6 shards and replication factor 2, each shard has two copies that get distributed across different nodes. Each node holds 4 shard replicas (out of 12 total). Lose any one node and every shard still has at least one live replica, so the cluster keeps serving reads and writes. The remaining two nodes have to absorb the lost node's load, so I keep each node provisioned at around 50-60% capacity in steady state to leave headroom.
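The shard arithmetic is easy to check with a small simulation. The round-robin placement below is an illustrative assumption, not Qdrant's actual placement algorithm, and the node names are made up. It only demonstrates that an even layout with the stated properties exists.

```python
def place_replicas(num_shards, replication_factor, nodes):
    """Assign shard replicas to nodes round-robin (illustrative only)."""
    placement = {node: set() for node in nodes}
    slot = 0
    for shard in range(num_shards):
        for _ in range(replication_factor):
            placement[nodes[slot % len(nodes)]].add(shard)
            slot += 1
    return placement


def survives_node_loss(placement, lost_node, num_shards):
    """True if every shard still has a replica on some surviving node."""
    surviving = set()
    for node, shards in placement.items():
        if node != lost_node:
            surviving |= shards
    return surviving == set(range(num_shards))


if __name__ == "__main__":
    nodes = ["node-0", "node-1", "node-2"]  # hypothetical names
    layout = place_replicas(6, 2, nodes)
    for node, shards in layout.items():
        print(node, sorted(shards))
    print(all(survives_node_loss(layout, n, 6) for n in nodes))
```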

Node sizing

Each node is the same shape:

16 vCPU, 64GB RAM
1TB local NVMe SSD (this is the one I refuse to compromise on)
Kubernetes pod with local-path storage class on dedicated node-local disks, not networked storage

The RAM number is what it is because Qdrant keeps the HNSW graph in memory for fast search. With scalar quantization enabled, one copy of my 40M vectors at 1024 dimensions needs roughly 40GB of memory for the quantized vectors (40M × 1024 bytes), plus overhead for the graph and payload indices. 64GB per node gives me headroom and lets the OS page cache absorb cold reads.
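A back-of-the-envelope version of that sizing, as a sketch. It counts only the int8 quantized vectors, assumes replicas are spread evenly across nodes, and ignores the HNSW graph, payload indices and original float32 vectors, so treat the output as a lower bound rather than a measurement.

```python
def quantized_gb_per_node(num_vectors, dims, replication_factor, nodes, bytes_per_dim=1):
    """Quantized vector data per node in GB (10^9 bytes), assuming an even spread."""
    total_bytes = num_vectors * dims * bytes_per_dim * replication_factor
    return total_bytes / nodes / 1e9


if __name__ == "__main__":
    n, d = 40_000_000, 1024
    print("one copy, whole collection:", n * d / 1e9)
    print("per node, RF=2 on 3 nodes:", round(quantized_gb_per_node(n, d, 2, 3), 1))
    # If a node is lost and its replicas are rebuilt on the two survivors:
    print("per node, RF=2 on 2 nodes:", round(quantized_gb_per_node(n, d, 2, 2), 1))
```

Under these assumptions the steady-state share per node is well under the 64GB, and the two-node case is the one that eats into the headroom.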

Local NVMe matters because the segments on disk get read during cold start, during snapshot creation, and when a node rejoins after a failure and has to catch up. I tried networked block storage on a previous attempt — it added 40-80ms to recovery operations and made replica catch-up painful enough that I switched.

Sharding and replication

Six shards is more than three for a reason. With three shards at replication factor 2 there are only six shard replicas, two per node on three nodes, and a fourth node cannot take an even share of them. Six shards (twelve replicas) lets me scale to four, six, or twelve nodes later without redistributing data twice. It is cheap insurance.
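The rebalancing argument is a divisibility check: a node count splits the data evenly only if it divides the total number of shard replicas. A small sketch, assuming replication factor 2 as in this cluster:

```python
def even_node_counts(num_shards, replication_factor, max_nodes):
    """Node counts that give every node the same number of shard replicas."""
    total_replicas = num_shards * replication_factor
    return [
        n
        for n in range(replication_factor, max_nodes + 1)
        if total_replicas % n == 0
    ]


if __name__ == "__main__":
    print("3 shards:", even_node_counts(3, 2, 12))
    print("6 shards:", even_node_counts(6, 2, 12))
```

The lower bound of the range is the replication factor because two copies of a shard need at least two distinct nodes.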

Replication factor 2 is the minimum for fault tolerance. RF=3 would be nicer for read throughput (more replicas to serve reads from) but would also use 50% more disk and memory. At my scale and with quorum reads (which require RF=2 minimum to be meaningful), RF=2 is the right balance.

Quorum and consistency

Qdrant uses Raft for consensus on cluster metadata (collection definitions, shard assignments). Data writes go to the primary replica of each shard and are replicated asynchronously by default. You can request synchronous replication on a per-write basis if you need strong durability for that specific operation.

For my use case (a continuous ingestion pipeline where individual writes are not life-critical, but eventual consistency within a few seconds is required), async replication with a 2-second target lag is fine. Reads use consistency=majority for queries where freshness matters, and the default consistency=1 (any single replica) for queries where it does not.
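On the REST API, read consistency is a query-string parameter on the request, so choosing it per query is a matter of building the URL. A minimal sketch, where the base URL and the collection name `docs` are hypothetical:

```python
from urllib.parse import urlencode


def search_url(base_url, collection, consistency=None):
    """URL for a search request, optionally with a read-consistency level.

    Leaving consistency as None omits the parameter, so the server default applies.
    """
    url = f"{base_url}/collections/{collection}/points/search"
    if consistency is None:
        return url
    return f"{url}?{urlencode({'consistency': consistency})}"


if __name__ == "__main__":
    base = "http://qdrant.internal:6333"  # hypothetical service address
    print(search_url(base, "docs", "majority"))  # freshness matters
    print(search_url(base, "docs"))              # freshness does not matter
```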

Snapshots and backups

Qdrant snapshots are full per-collection dumps to local disk. I run a CronJob in Kubernetes that takes a snapshot every six hours, then uploads it to an S3-compatible object store with a 30-day retention. A full restore from snapshot has been tested and takes about 25 minutes for the 40M-vector collection.

This is separate from the cluster's own replication. Replication protects against node failure. Snapshots protect against operator error — the moment somebody runs DELETE on the wrong collection, the only thing between you and a very bad afternoon is a recent snapshot in object storage.
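The retention half of that CronJob is the part worth getting right, because it is the part that deletes things. Here is a sketch of the pruning decision as a pure function. The snapshot names and the mapping of names to creation times are assumptions about how you list your object store. Creating the snapshot itself is a `POST /collections/{name}/snapshots` call against each node.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, now, retention_days=30):
    """Return names of snapshots older than the retention window.

    snapshots: dict mapping snapshot name -> timezone-aware creation time.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in snapshots.items() if created < cutoff)


if __name__ == "__main__":
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    listing = {  # hypothetical object-store listing
        "docs-2024-05-01.snapshot": datetime(2024, 5, 1, tzinfo=timezone.utc),
        "docs-2024-06-20.snapshot": datetime(2024, 6, 20, tzinfo=timezone.utc),
    }
    print(expired_snapshots(listing, now))
```

Keeping the decision separate from the delete call makes it easy to dry-run before anything is removed.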

What I had to learn the hard way

A few things I would tell past-me if I could.

Do not run on networked storage. I covered this above. Use local NVMe or you will fight latency and recovery problems forever.

Set the HNSW m and ef_construct parameters deliberately. The defaults (m=16, ef_construct=100) are conservative. For high-recall workloads, bumping ef_construct to 200 during indexing improves recall at the cost of a one-time longer index build. m=16 is fine for most cases; bump to 24 if you need top-10 recall above 99%.
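Both parameters live under `hnsw_config` in the collection definition. The sketch below builds that fragment for the defaults and for the higher-recall values mentioned above. The helper function is just a convenience for illustration.

```python
def hnsw_config(m=16, ef_construct=100):
    """Fragment for a collection create or update body; defaults match Qdrant's."""
    return {"hnsw_config": {"m": m, "ef_construct": ef_construct}}


DEFAULTS = hnsw_config()
HIGH_RECALL = hnsw_config(m=24, ef_construct=200)

if __name__ == "__main__":
    print(DEFAULTS)
    print(HIGH_RECALL)
```

Higher values of either parameter mean a longer index build, and a larger `m` also means a larger graph in memory, so it is worth changing one at a time and measuring.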

Quantization is not free. It is mostly free, but for very low-dimensional vectors (under 256) the recall hit is more noticeable. Run a recall benchmark on your actual data before turning it on.
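A recall benchmark does not need much machinery. For each test query, compare the IDs returned by the quantized index against the IDs from an exact search and average the overlap. A minimal sketch, assuming you have already collected both ID lists per query:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids[:k])) / len(exact)


def mean_recall(query_results, k=10):
    """Average recall over (exact_ids, approx_ids) pairs, one pair per query."""
    scores = [recall_at_k(exact, approx, k) for exact, approx in query_results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    results = [
        ([1, 2, 3, 4], [1, 2, 9, 4]),  # one miss out of four
        ([5, 6, 7, 8], [8, 7, 6, 5]),  # same set, different order
    ]
    print(mean_recall(results, k=4))
```

Order within the top-k is deliberately ignored here; if ranking matters for your use case, measure that separately.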

Beware the payload size. Qdrant lets you store arbitrary JSON payloads alongside vectors. It is convenient, and it is also a footgun — large payloads slow down everything because they get fetched on every result. Store IDs in the payload, store the actual document text somewhere else.
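One way to keep payloads honest is to strip documents down to filter fields and an ID before upserting, and to check the serialized size against a budget. In this sketch the field names and the 1KB budget are assumptions chosen for illustration.

```python
import json

MAX_PAYLOAD_BYTES = 1024  # hypothetical per-point budget
PAYLOAD_FIELDS = ("doc_id", "tenant_id", "language", "doc_type", "timestamp")


def slim_payload(document):
    """Keep only the ID and filter fields; the text stays in the document store."""
    return {field: document[field] for field in PAYLOAD_FIELDS}


def payload_size_bytes(payload):
    """Size of the payload as compact UTF-8 JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    doc = {
        "doc_id": "d-123", "tenant_id": "acme", "language": "en",
        "doc_type": "article", "timestamp": 1_700_000_000,
        "text": "lorem ipsum " * 5000,  # the part that should not be in the payload
    }
    print(payload_size_bytes(doc), payload_size_bytes(slim_payload(doc)))
```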

Monitor the segment count. Qdrant's storage is segment-based and segments get merged in the background. If merge falls behind ingestion, segment count climbs, search latency climbs with it. There is a Prometheus metric for it. Alert on it.
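If you would rather not depend on a specific metric name, the same signal is available from the collection info endpoint: the response to `GET /collections/{name}` includes a `segments_count` field. The sketch below checks it against a threshold. The threshold of 50 is an arbitrary assumption, and in a cluster each node reports on its own local shards, so the check has to run per node.

```python
def too_many_segments(collection_info, max_segments=50):
    """True if the reported segment count exceeds the alert threshold.

    collection_info: parsed JSON from GET /collections/{name}.
    """
    return collection_info["result"]["segments_count"] > max_segments


if __name__ == "__main__":
    sample = {"result": {"status": "green", "segments_count": 72}, "status": "ok"}
    print(too_many_segments(sample))
```

Pick the threshold from your own steady-state numbers: what matters is the trend under ingestion load, not the absolute value.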

Plan for the upgrade path. Qdrant releases are reasonably frequent. The 0.x to 1.x transition was painful for early adopters. Now that it is on stable 1.x the upgrade story is much better, but I still test every minor on a staging cluster before production.

Would I make the same choice again?

Yes. The thing I weighted most heavily — operational simplicity, with full control of the data — keeps paying off. Six months in, I have spent essentially zero time on Qdrant itself; the cluster runs, ingestion runs, queries return in time, and the only adjustments have been re-tuning shard counts as data grew.

The honest version of the trade: Pinecone would have been less work to start. Three months in, the cost difference was already significant. Six months in, the freedom to tune quantization, sharding, and indexing parameters specifically for my data is worth more than I expected.

If you are building a vector workload right now, the decision tree I would use:

Under 5M vectors, no heavy filtering, you already run Postgres → pgvector.
Small team, no infra expertise, regulated data is not a concern → Pinecone.
Billion-scale, dedicated platform team → Milvus.
You want bundled embedding pipelines and don't mind GraphQL → Weaviate.
Self-hosted, filter-heavy, between 10M and a few hundred million vectors, want operational simplicity → Qdrant.

The last category is bigger than people realize. It is the one I was in. It is probably the one you are in too.
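For what it is worth, the decision tree above fits in a dozen lines. This is my reading of it as code: the rule order, the parameter names and the numeric cut-offs for "billion-scale" and "a few hundred million" are assumptions, and real decisions have more inputs than this.

```python
def recommend(num_vectors, *, self_hosted, filter_heavy, runs_postgres,
              platform_team=False, small_team_no_infra=False,
              wants_bundled_embeddings=False):
    """Return a suggested vector database, or None if no rule matches."""
    if num_vectors < 5_000_000 and not filter_heavy and runs_postgres:
        return "pgvector"
    if small_team_no_infra and not self_hosted:
        return "Pinecone"
    if num_vectors >= 1_000_000_000 and platform_team:
        return "Milvus"
    if wants_bundled_embeddings:
        return "Weaviate"
    # "a few hundred million" is taken as 500M here (assumption).
    if self_hosted and filter_heavy and 10_000_000 <= num_vectors <= 500_000_000:
        return "Qdrant"
    return None


if __name__ == "__main__":
    print(recommend(40_000_000, self_hosted=True, filter_heavy=True, runs_postgres=True))
```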