Elasticsearch Across Many Services: The Right Way

Elasticsearch in a small app: trivial. One index, one cluster, dump documents, query, ship.

Elasticsearch across ten services in a real company: a graveyard. Mapping conflicts. Noisy-neighbor outages. A 2 AM page because someone in fulfillment shipped a text field where the search team had a keyword. A reindex job that takes a week because nobody set index.lifecycle.name three years ago.

The mistakes are predictable. So are the fixes.

The first decision: one cluster or many

Most teams default to one shared cluster because it's cheaper and operationally simpler. Then one service writes 50k docs/sec of telemetry, and the cluster starts dropping search requests for the checkout team.

Use one cluster when: total data fits comfortably on one tier (say, under 10 TB hot), all services share the same SLO, and no tenant is bursty in a way that starves the others.

Use multiple clusters when: you have wildly different SLOs (search vs logs vs analytics), regulated data needing isolation (PII, payments, audit logs), or one tenant generates orders of magnitude more load than the rest.

A useful middle ground: one cluster per workload class. Hot search cluster, warm analytics cluster, dedicated logs cluster (or just use a logs-specific tool — see below). Three clusters, not ten. Each tuned for its access pattern.

The second decision: stop using Elasticsearch for logs

The single biggest reason Elasticsearch becomes a nightmare is logs. Logs grow without bound, have terrible query patterns (full-text scans across petabytes), and starve real search workloads.

If you're using Elasticsearch for application logs in 2026, look at:

OpenSearch with a dedicated logs cluster and ISM policies, if you need ES API compatibility.
Loki + Grafana for cheaper, less queryable logs.
ClickHouse for structured logs you actually query analytically.
Datadog/Honeycomb/etc. if you'd rather pay than operate.

Elasticsearch is a search engine. It's been bent into a logs and metrics tool because it could. That doesn't mean it should.

Index design: namespace by service, not by feature

The most common mistake: indices named after product features. products, orders, customers. Two years later you have products_v2, products_search, products_legacy, and three teams writing to the same index with conflicting mappings.

Better convention:

{service}-{entity}-{version}
catalog-products-v3
fulfillment-orders-v1
identity-customers-v2

The service name is the owner, written into the index name. When the cluster is on fire and you're looking at hot shards, you can immediately see who to call.
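
A naming convention only holds if something enforces it. The following is a minimal sketch of a check that could run in CI or in an index-creation script; the regex, the IndexName type, and parse_index_name are hypothetical names invented for this example, and the character rules (lowercase, no hyphens inside the service or entity part) are an assumption that keeps parsing unambiguous.

```python
import re
from typing import NamedTuple

# Assumed rule: {service}-{entity}-v{n}, lowercase, with no hyphens inside
# the service or entity part so the name splits unambiguously.
INDEX_NAME = re.compile(
    r"^(?P<service>[a-z][a-z0-9]*)-(?P<entity>[a-z][a-z0-9_]*)-v(?P<version>[1-9][0-9]*)$"
)


class IndexName(NamedTuple):
    service: str  # owning team, i.e. who to call
    entity: str
    version: int

    @property
    def alias(self) -> str:
        # The version-free name that consumers query.
        return f"{self.service}-{self.entity}"


def parse_index_name(name: str) -> IndexName:
    """Parse an index name, raising ValueError if it breaks the convention."""
    match = INDEX_NAME.match(name)
    if match is None:
        raise ValueError(f"index name {name!r} does not match service-entity-vN")
    return IndexName(match["service"], match["entity"], int(match["version"]))


if __name__ == "__main__":
    parsed = parse_index_name("catalog-products-v3")
    print(parsed.service, parsed.alias)
```

Rejecting names like products_legacy at creation time is what keeps the owner visible in the index name later.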

Pair this with index aliases so consumers query catalog-products (the alias) and never need to know about versions:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "catalog-products-v2", "alias": "catalog-products" } },
    { "add":    { "index": "catalog-products-v3", "alias": "catalog-products" } }
  ]
}

Now v3 rollouts are atomic. Consumers don't change. You can keep v2 around for a week as a rollback.
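
If the swap is scripted rather than typed by hand, the same request body can be generated. This is a small sketch with a hypothetical helper, alias_swap_actions; it only builds the JSON body for POST /_aliases and leaves sending it to whatever HTTP client you already use.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that moves an alias in one request.

    Both actions travel in the same request, so consumers never see a
    moment where the alias points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    rollout = alias_swap_actions("catalog-products", "catalog-products-v2", "catalog-products-v3")
    print(json.dumps(rollout, indent=2))
```

A rollback is the same call with the two index arguments reversed, which is one reason to keep v2 around.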

Mappings: explicit, versioned, owned

Never let dynamic mapping decide your schema in production. The first document with a malformed field locks you into the wrong type forever for that index.

Two non-negotiables:

dynamic: strict at the index level. Documents with unknown fields are rejected instead of being silently indexed.
Mapping templates checked into git, applied via component templates. Same as schema migrations for SQL.

PUT /_component_template/catalog-products-mapping
{
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "sku":         { "type": "keyword" },
        "title":       { "type": "text", "analyzer": "english", "fields": { "raw": { "type": "keyword" } } },
        "price_cents": { "type": "long" },
        "tags":        { "type": "keyword" },
        "created_at":  { "type": "date" }
      }
    }
  }
}

When fulfillment wants to add a warehouse_id field to their orders index, they update their mapping template, push their PR. They never touch catalog's templates. Index naming gives you that boundary for free.
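
One detail worth spelling out: a component template does nothing by itself. It applies only when a composable index template lists it in composed_of and a new index name matches that template's index_patterns. A sketch of generating that body for PUT /_index_template/<name> follows; the helper name and the priority value of 200 are assumptions for illustration.

```python
import json
from typing import Sequence


def index_template_body(
    service: str,
    entity: str,
    component_templates: Sequence[str],
    priority: int = 200,  # assumed value; it only needs to beat overlapping templates
) -> dict:
    """Build a composable index template that pulls in component templates.

    The pattern matches every version of one service's entity, for example
    catalog-products-v3 and catalog-products-v4, and nothing owned by
    another service.
    """
    return {
        "index_patterns": [f"{service}-{entity}-v*"],
        "composed_of": list(component_templates),
        "priority": priority,
    }


if __name__ == "__main__":
    body = index_template_body("catalog", "products", ["catalog-products-mapping"])
    print(json.dumps(body, indent=2))
```

Because the pattern starts with the service name, each team's template can only ever match that team's indices.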

Writes: never write directly from app services

The pattern that fails: every service has Elasticsearch as a dependency, writes to it synchronously inside the request path, and treats it like a primary store.

Now ES has a network blip. Every service times out. Your app is down because search is down.

The pattern that scales: the database is the source of truth, ES is a derived view.

[ App writes ] → [ Postgres / DynamoDB ]
                          ↓ CDC stream
                  [ Kafka / Kinesis / DynamoDB Streams ]
                          ↓ consumer
                  [ Indexer service ] → [ Elasticsearch ]

Benefits:

App services don't depend on ES at write time. ES being down means search is degraded, not the app.
Reindexing is a matter of replaying the stream. No backfill scripts hitting the primary DB.
Schema changes mean pointing the indexer at a new index version, replaying the stream into it, then swapping the alias over.
One central indexer (or one per service) owns the mapping, the bulk batching, the retry logic, the dead-letter queue. App developers don't need to learn the ES bulk API.

Tools that already do this: Debezium (Postgres/MySQL CDC → Kafka), DynamoDB Streams, Kafka Connect Elasticsearch Sink. You usually don't need to write the indexer from scratch.
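
If you do end up writing part of the indexer yourself, its core is small: turn a batch of change events into a bulk request body. The sketch below assumes a simplified, hypothetical event shape ({"op", "id", "doc"}), which is not the envelope Debezium or DynamoDB Streams actually emit; a real consumer would map its events into this shape first.

```python
import json
from typing import Iterable, Mapping


def to_bulk_ndjson(events: Iterable[Mapping], index: str) -> str:
    """Convert change events into an NDJSON body for POST /_bulk.

    Hypothetical event shape: {"op": "upsert" | "delete", "id": str, "doc": dict}.
    Using the source row's primary key as _id makes replays idempotent:
    indexing the same event twice overwrites rather than duplicates.
    """
    lines = []
    for event in events:
        meta = {"_index": index, "_id": event["id"]}
        if event["op"] == "delete":
            lines.append(json.dumps({"delete": meta}))
        else:
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(event["doc"]))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    batch = [
        {"op": "upsert", "id": "abc-123", "doc": {"sku": "X1", "price_cents": 1999}},
        {"op": "delete", "id": "abc-999"},
    ]
    print(to_bulk_ndjson(batch, "catalog-products-v3"), end="")
```

Retries, batching limits, and the dead-letter queue sit around this function; they are omitted here to keep the sketch short.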

Multi-tenant: routing, not separate indices

If you're SaaS with thousands of tenants, do not create one index per tenant. You'll hit shard limits within a year and your cluster master will spend more time on cluster state than on queries.

Use a single index with a tenant_id field and custom routing:

PUT /catalog-products/_doc/abc-123?routing=tenant-456
{ "tenant_id": "tenant-456", "title": "...", ... }

Then queries pin to one shard:

GET /catalog-products/_search?routing=tenant-456
{ "query": { "bool": { "filter": [ { "term": { "tenant_id": "tenant-456" } } ] } } }

Big tenants that genuinely need isolation: split them into their own index later. Small tenants share. This is essentially the same pattern Stripe and Algolia use.
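
Routing decides which shard is searched; it does not isolate tenants, because several tenants hash to the same shard. The tenant_id filter therefore still has to be on every query. A sketch of a helper that builds both from one value, so they cannot drift apart, follows; tenant_search and its return shape are hypothetical.

```python
from typing import Optional


def tenant_search(tenant_id: str, query: Optional[dict] = None) -> dict:
    """Build routing params and a tenant-filtered body for a _search call.

    The routing value narrows the search to one shard. The term filter is
    what actually keeps other tenants' documents out of the results.
    """
    bool_query: dict = {"filter": [{"term": {"tenant_id": tenant_id}}]}
    if query is not None:
        # The caller's own query is scored; the tenant filter is not.
        bool_query["must"] = [query]
    return {
        "params": {"routing": tenant_id},
        "body": {"query": {"bool": bool_query}},
    }


if __name__ == "__main__":
    request = tenant_search("tenant-456", {"match": {"title": "red shoes"}})
    print(request["params"], request["body"])
```

Funnelling every tenant query through one helper like this is a cheap guard against the missing-filter bug that leaks data across tenants.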

Capacity: shard count is the trap

The default of 1 primary shard per index is fine for a lot of workloads. Heavy write workloads benefit from more, but every shard costs cluster overhead, and over-sharding is a worse problem than under-sharding.

Rules of thumb that have aged well:

Aim for shards between 10 GB and 50 GB. Smaller wastes overhead, larger slows recovery.
Total shards per node: under 20 per GB of heap. A 31 GB heap node tops out around 600 shards.
Time-series data (orders, events): use data streams with ILM, not manually managed indices.

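If you're already over-sharded, the fix is _shrink for indices that no longer take writes (the API requires the source index to be read-only, with a copy of every shard on one node), then a reindex strategy with sane shard counts going forward. It's painful. Avoid it by starting with sane numbers.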
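
The rules of thumb above reduce to two lines of arithmetic. This sketch uses hypothetical helper names, and the 30 GB default target is an assumption picked from inside the 10-50 GB band, not a recommendation from any official source.

```python
import math


def primary_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primaries needed to keep each shard near the target size (at least 1)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_per_node(heap_gb: int, shards_per_gb_heap: int = 20) -> int:
    """Shard budget for one node under the 20-shards-per-GB-of-heap rule."""
    return heap_gb * shards_per_gb_heap


if __name__ == "__main__":
    print(primary_shards(600))       # 600 GB index at ~30 GB per shard -> 20
    print(primary_shards(5))         # small index stays at the default -> 1
    print(max_shards_per_node(31))   # 31 GB heap -> 620
```

Size the index for what it will hold at the end of its life, not for what it holds on day one.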

Observability: instrument the indexer, not just the cluster

Cluster health metrics are necessary but not sufficient. The first sign that your search infra is degrading is rarely a yellow cluster — it's the indexer falling behind.

Track per service:

Indexer lag (latest offset in the CDC stream vs the indexer's committed offset). If this grows, search is going stale.
Bulk reject rate. Non-zero means the write queue on some node is full: send smaller batches, lower bulk concurrency, or spread writes across more primary shards and nodes.
Per-index p99 query latency. Know which tenant's index is slow before they tell you.
Refresh interval per index. The default 1s refresh is expensive for write-heavy indices; raise index.refresh_interval to 5-30s for logs/analytics.
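
Indexer lag is the simplest of these to compute. A minimal sketch, assuming a partitioned stream such as Kafka where you can read both the latest offset per partition and the indexer's committed offset per partition; the function name and the dictionary inputs are hypothetical, and fetching the offsets is left to your stream client.

```python
from typing import Mapping


def indexer_lag(end_offsets: Mapping[int, int], committed: Mapping[int, int]) -> int:
    """Total events written to the stream but not yet indexed.

    end_offsets: latest offset per partition in the CDC stream.
    committed:   last offset per partition the indexer has committed.
    A partition with no committed offset counts as fully unread.
    """
    return sum(
        max(0, end - committed.get(partition, 0))
        for partition, end in end_offsets.items()
    )


if __name__ == "__main__":
    lag = indexer_lag({0: 100, 1: 250}, {0: 90, 1: 250})
    print(lag)  # partition 0 is 10 behind, partition 1 is caught up -> 10
```

Alert on the trend rather than the absolute value: a lag that keeps growing means the indexer cannot keep up, while a spike that drains is usually just a burst.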

What good looks like

A team running Elasticsearch well across many services usually has:

Two or three clusters max, segmented by workload class.
A platform team that owns the cluster, the indexer framework, and the template infrastructure. App teams own their indices.
Index naming, mapping templates, and ILM policies all in git, deployed via the same CI as the rest of their infra.
CDC-based indexing, never synchronous writes from app services.
A canary index per service that exercises the mapping in CI before deploy.
Logs and metrics elsewhere. Probably ClickHouse or a SaaS.

If you have most of those, you can add the eleventh service without anyone losing sleep. If you're missing more than two, you're one outage away from a re-platform conversation.

The good news: every one of these is a code change, not an architectural rewrite. Start with index naming and CDC-based indexing pipelines. The rest follows.