SQS vs Kafka vs Redis Streams: Choose Wrong, Pay for Years

Three queueing options with very different cost, throughput, and operational profiles. Pick the wrong one early and you'll re-platform later.

You need a queue. The team has opinions. Someone says Kafka. Someone says SQS. Someone says "we already have Redis, let's use Streams."

These are three radically different products. Picking the wrong one isn't a small mistake — it's a six-month migration two years from now.

Here's how to actually decide.

What you're picking between

SQS — fully managed AWS queue. Pay-per-message. Effectively infinite scale. Limited features.

Kafka — distributed log. High throughput, replay, event sourcing. Either run it yourself (operational burden) or pay Confluent/MSK (expensive at scale).

Redis Streams — append-only log inside Redis. Cheap, fast, simple. Limited durability and scale.

All three can fill the same box on an architecture diagram, but they solve different problems.

The decision tree

Question 1: Do you need to replay messages?

If yes (event sourcing, ML training pipelines, audit logs that downstream services consume) — Kafka or compatible (Redpanda, MSK).

If no (most CRUD work, background jobs) — keep going.

Question 2: Do you need >10k messages/second per topic?

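If yes: Kafka. SQS standard queues can scale this high, but per-request pricing adds up (a sustained 10k messages/second is about 26 billion sends a month, roughly $10k at $0.40 per million before receives and deletes) and the polling-based ergonomics get awkward.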

If no — keep going.

Question 3: Are you already on AWS and don't want to operate anything?

If yes: SQS. It's the right answer for the large majority of "we need a queue" use cases.

If no, keep going.

Question 4: Do you already run Redis, have low message volumes (<1k/sec), and want zero new infra?

If yes — Redis Streams. Good for short-term internal job queues.

That covers most cases. If you find yourself answering "yes" to multiple — pick the most expensive answer (Kafka). It's the most flexible.
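
The four questions can be written down as a short function. This is only a sketch of the tree as described above: the `QueueNeeds` fields and the `"undecided"` fallback are illustrative names, not part of any library, and the thresholds are the rough ones from the questions.

```python
from dataclasses import dataclass


@dataclass
class QueueNeeds:
    """Answers to the four questions (field names are illustrative)."""
    needs_replay: bool
    peak_msgs_per_sec: int
    on_aws_no_ops: bool
    has_redis: bool


def pick_queue(needs: QueueNeeds) -> str:
    """Walk the questions in order and return the first match."""
    if needs.needs_replay:                      # Question 1
        return "kafka"
    if needs.peak_msgs_per_sec > 10_000:        # Question 2
        return "kafka"
    if needs.on_aws_no_ops:                     # Question 3
        return "sqs"
    if needs.has_redis and needs.peak_msgs_per_sec < 1_000:  # Question 4
        return "redis-streams"
    return "undecided"  # none of the questions matched; needs a closer look
```

Because the checks run top to bottom, a team that needs replay gets Kafka even if it is also on AWS and already runs Redis, which matches the "pick Kafka when several answers are yes" rule.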

SQS: where it shines

  • Background jobs (email sending, image resizing, webhook delivery)
  • Decoupling services (producer doesn't care about consumer health)
  • Spike absorption (front-end can write fast, processing catches up)
  • Anything that doesn't need ordering across the whole queue (FIFO queues add complexity)

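Cost: $0.40 per million requests. A million jobs/day is 30 million sends a month, or $12. Counting the receive and delete call for each job as well, it is closer to $36 without batching. You will not beat this with self-hosted anything.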
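
The arithmetic is simple enough to check yourself. A minimal sketch, assuming the $0.40 per million requests quoted above and ignoring the free tier, batching, and data transfer; `requests_per_job` is a hypothetical knob for how many API calls each job costs you (send only, or send plus receive plus delete).

```python
PRICE_PER_MILLION_REQUESTS = 0.40  # USD, the figure quoted above


def sqs_monthly_cost(jobs_per_day: int, requests_per_job: int = 1, days: int = 30) -> float:
    """Estimate the monthly SQS request charge in USD."""
    requests = jobs_per_day * requests_per_job * days
    return requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


if __name__ == "__main__":
    print(round(sqs_monthly_cost(1_000_000), 2))                      # sends only
    print(round(sqs_monthly_cost(1_000_000, requests_per_job=3), 2))  # send + receive + delete
```

Even the pessimistic three-requests-per-job case stays in the tens of dollars a month at a million jobs a day.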

Limitations:

  • Max message size 256KB (use S3 for blob, send pointer)
  • Visibility timeout model: if your consumer takes longer than the timeout, the message is redelivered to another consumer, so handlers need to be idempotent
  • No replay — once consumed, gone (unless you wrote it to S3 yourself)
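  • FIFO mode is slower than standard: 300 API calls/sec per action without batching (up to 3,000 messages/sec with batches of 10) unless you enable high-throughput mode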
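
The "use S3 for blob, send pointer" workaround looks roughly like this. It is a sketch, not a drop-in implementation: `sqs` and `s3` are assumed to behave like boto3 clients (`send_message` and `put_object` are real boto3 calls), while the `queue-payloads/` key prefix and the JSON envelope are made up for illustration.

```python
import json
import uuid

MAX_INLINE_BYTES = 256 * 1024  # the message size limit cited above


def send_job(sqs, s3, queue_url: str, bucket: str, payload: str) -> dict:
    """Send small payloads inline; park large ones in S3 and send a pointer."""
    body = json.dumps({"inline": payload})
    # Measure the encoded envelope, not the raw payload: JSON escaping adds bytes.
    if len(body.encode("utf-8")) > MAX_INLINE_BYTES:
        key = f"queue-payloads/{uuid.uuid4()}"  # hypothetical key layout
        s3.put_object(Bucket=bucket, Key=key, Body=payload.encode("utf-8"))
        body = json.dumps({"s3_bucket": bucket, "s3_key": key})
    return sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```

The consumer has to do the reverse: check for `s3_key`, fetch the object, and clean it up once the job succeeds. That extra round trip is part of what the size limit costs you.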

Kafka: where it shines

  • Event sourcing, where new services want to replay history
  • High-throughput data pipelines (millions of msgs/sec)
  • Multi-consumer fanout (10 services consume the same topic, each at their own pace)
  • Stream processing (with Kafka Streams or Flink)
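
Replay and multi-consumer fanout both fall out of one idea: the log keeps every record, and each consumer group only tracks its own position in it. The toy class below illustrates that model in memory. It is not a Kafka client, and `ToyLog`, `poll`, and `seek` are illustrative names for a single partition.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for one topic partition. Not a Kafka client."""

    def __init__(self) -> None:
        self._records: list[str] = []
        self._offsets: dict[str, int] = defaultdict(int)  # group -> next offset

    def append(self, record: str) -> int:
        """Append a record and return its offset. Nothing is ever removed."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> list[str]:
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip ahead) one group without touching the others."""
        self._offsets[group] = offset
```

Two groups polling the same `ToyLog` each see every record at their own pace, and `seek(group, 0)` replays history for a new or rebuilt service. A queue that deletes on consume cannot offer either without extra machinery.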

Cost reality:

  • Self-hosted: at least 3 brokers plus a ZooKeeper ensemble (older versions) or KRaft controllers (current versions). ~$500/month minimum for a small cluster. Plus operational time.
  • Confluent Cloud: ~$1/GB-month for storage, $0.11/GB ingress. A modest pipeline runs $1-5k/month.
  • MSK: AWS-managed. Cheaper than Confluent, more operational overhead.
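
To get a feel for how the managed bill moves with volume, here is a back-of-envelope sketch using only the approximate storage and ingress figures quoted above. It deliberately leaves out base cluster charges, egress, and replication, so treat the result as a floor rather than a quote; the function and parameter names are illustrative.

```python
STORAGE_USD_PER_GB_MONTH = 1.00  # approximate figure quoted above
INGRESS_USD_PER_GB = 0.11        # approximate figure quoted above


def managed_kafka_monthly_usd(ingress_gb_per_day: float, retention_days: int, days: int = 30) -> float:
    """Rough monthly cost: ingress for the month plus steady-state retained data."""
    ingress_cost = ingress_gb_per_day * days * INGRESS_USD_PER_GB
    retained_gb = ingress_gb_per_day * retention_days  # data on disk at steady state
    storage_cost = retained_gb * STORAGE_USD_PER_GB_MONTH
    return ingress_cost + storage_cost


if __name__ == "__main__":
    # 100 GB/day with one week of retention
    print(round(managed_kafka_monthly_usd(100, 7), 2))
```

Under these assumptions, 100 GB/day with a week of retention already lands around the low end of the $1-5k range, and both terms grow linearly with volume. Longer retention, which is the whole point of replay, pushes the storage term up further.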

Limitations:

  • Operational complexity (partitions, rebalancing, schema management)
  • Painful cost curve once you scale
  • Easy to misuse — using Kafka for a simple job queue is over-engineering

Redis Streams: where it shines

  • Internal job queues at low volume
  • Real-time dashboards (consumer reads recent events)
  • Anything where you already pay for Redis and don't want to add a new service

Limitations:

  • Durability is "as good as your Redis backup strategy" — for many setups, that's "not great"
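  • No built-in partitioning. A stream is a single key on a single node, so throughput is capped by that node (~100k msgs/sec on paper, lower in practice) unless you shard across several streams yourself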
  • Consumer groups exist but the ergonomics are clunky compared to Kafka or SQS
  • Can grow your Redis memory unexpectedly if consumers fall behind
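
If you do go this route, capping the stream and acknowledging explicitly addresses the last two points. A minimal sketch, assuming `r` is a redis-py client and the consumer group already exists (for example via `xgroup_create`); the function names, batch size, and `max_len` default are illustrative.

```python
def enqueue(r, stream: str, job: dict, max_len: int = 100_000) -> str:
    """Append a job and let Redis trim old entries so memory stays bounded.

    Trade-off: trimming drops the oldest entries even if no consumer has
    processed them yet, so size max_len for your worst expected backlog.
    """
    return r.xadd(stream, job, maxlen=max_len, approximate=True)


def work_once(r, stream: str, group: str, consumer: str, handle) -> int:
    """Read one batch for this consumer, handle each entry, then ack it."""
    # ">" asks for entries never delivered to any consumer in this group.
    replies = r.xreadgroup(group, consumer, {stream: ">"}, count=10, block=1000)
    done = 0
    for _stream_name, entries in replies or []:
        for entry_id, fields in entries:
            handle(fields)
            r.xack(stream, group, entry_id)  # unacked entries stay pending
            done += 1
    return done
```

Entries that are read but never acknowledged sit in the group's pending list, and reclaiming them from a dead worker is on you. That is a large part of why the ergonomics feel clunkier than SQS's visibility timeout.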

For low-volume internal queues, this is genuinely fine. For anything customer-facing or load-bearing — pick differently.

Common wrong picks

"We chose Kafka for our background jobs." You set up a 5-broker cluster to deliver 100 emails/minute. You spent 3 weeks. You're now paying $2k/month plus an engineer's time. At about 4.3 million messages a month, SQS would have cost a few dollars.

"We chose SQS for event sourcing." No replay, no log compaction, and fanout only by bolting SNS in front. You'll re-implement Kafka inside SQS, badly.

"We chose Redis Streams for our durable order pipeline." Redis crashed. You lost a queue. You found out backups were the previous day's. The order pipeline is the last place to discover this.

The migration cost

Switching queue products later is expensive:

  • Producer code changes (different SDKs, different semantics)
  • Consumer code changes (different ack/visibility model)
  • Replay or migration of in-flight messages
  • Two systems running in parallel during cutover
  • Updated monitoring, alerting, runbooks

Estimate ~2 engineer-months per migration. Pick well now.
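
The "two systems in parallel" phase usually means dual-writing for a while. One way to sketch it, assuming each queue is wrapped behind a tiny `publish` interface of your own (the `Publisher` protocol and `DualWritePublisher` class here are hypothetical, not from any SDK):

```python
from typing import Protocol


class Publisher(Protocol):
    def publish(self, body: str) -> None: ...


class DualWritePublisher:
    """Publish to the old queue first, then mirror to the new one."""

    def __init__(self, old: Publisher, new: Publisher) -> None:
        self.old = old
        self.new = new
        self.mirror_failures = 0

    def publish(self, body: str) -> None:
        self.old.publish(body)  # still the source of truth; errors propagate
        try:
            self.new.publish(body)
        except Exception:
            # Count instead of failing the request while the new system is unproven.
            self.mirror_failures += 1
```

Consumers on the new side need to tolerate gaps and duplicates during this window, and someone has to watch `mirror_failures` before flipping the source of truth. Those are the hidden weeks inside the estimate.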

A reasonable default

Most teams need: SQS for background jobs, Kafka if/when they need event sourcing, Redis Streams almost nowhere.

If I'm being concrete: 90% of "we need a queue" requests are SQS. 8% are Kafka. 2% are Redis Streams (for narrow internal use).

Default to SQS. Only escalate to Kafka when you can articulate exactly why (and "we might need replay someday" doesn't count — wait until you actually do).

The takeaway

Queue products look similar in slides. They're not. Pick by the actual question: do you need replay (Kafka), high throughput (Kafka), AWS-native simplicity (SQS), or zero new infra at low volume (Redis Streams)? Default to SQS. Avoid Kafka until you genuinely need its specific properties.