Kafka Interview Questions

Can two consumer groups consume from a single partition?


The simple picture

  • In Kafka, topics have partitions (like buckets of messages).

  • Consumer Groups are like teams of people reading from these buckets.

  • Rule: Inside one consumer group, only one consumer can read from a partition at a time.

  • But different consumer groups are completely independent — they can both read the same messages from the same partition.


Your Question:

"Can two consumer groups consume from a single partition?"
YES — they can.
Because consumer groups are isolated — they don’t affect each other.
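
To make this concrete, here is a minimal sketch using the Java kafka-clients library (the broker address, the topic name orders, and both group ids are illustrative assumptions): two consumers with different group.id values subscribe to the same topic, and each independently receives every message, including messages from the same partition.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoGroupsDemo {

    // Build a consumer for the given group id; both groups subscribe to the same topic.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders")); // hypothetical topic
        return consumer;
    }

    public static void main(String[] args) {
        // Each group tracks its own offsets, so both groups independently
        // receive every message from every partition of the topic.
        try (KafkaConsumer<String, String> fraud = consumerFor("fraud-detection");
             KafkaConsumer<String, String> notify = consumerFor("notifications")) {
            ConsumerRecords<String, String> a = fraud.poll(Duration.ofSeconds(5));
            ConsumerRecords<String, String> b = notify.poll(Duration.ofSeconds(5));
            System.out.printf("fraud-detection saw %d records, notifications saw %d%n",
                    a.count(), b.count());
        }
    }
}
```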


What is Kafka?

  • Apache Kafka is a distributed messaging system.

  • It lets applications send (produce) and receive (consume) messages in real-time.

  • Think of it as a high-speed post office for data between different systems.


How Kafka Works (Easy Flow)

  1. Producer → sends messages to a topic in Kafka (a minimal producer sketch follows this list).

  2. Kafka stores messages in partitions inside that topic.

  3. Consumer Groups → read messages from those partitions.

  4. Messages are kept for a set time (retention), so consumers can re-read if needed.
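
Here is the producer sketch referenced above, using the Java kafka-clients library (the broker address and the transactions topic are illustrative assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (here a user id) decides the partition, so all events
            // for the same key land in the same partition, in order.
            producer.send(new ProducerRecord<>("transactions", "user-42", "txn:100.00"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("stored at partition %d, offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any pending sends
    }
}
```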


How Kafka Handles Real-Time Message Processing

  • Partitioning → Splits data into multiple partitions so many consumers can read in parallel (faster processing).

  • Consumer Groups → Distribute messages across multiple consumers (load balancing).

  • Offset tracking → Kafka remembers where each consumer left off, so it can continue from that point.

  • Replication → Messages are stored in multiple brokers for fault tolerance.

  • Low latency → Kafka writes to disk in a very efficient way (sequential I/O), making it fast enough for real-time.


Example in Real Life

  • Imagine UPI transactions:

    1. Bank app sends transaction event to Kafka (Producer → Topic: transactions).

    2. Fraud detection service reads the event in real-time (Consumer Group 1).

    3. Notification service reads the same event to send SMS (Consumer Group 2).

    • Both services process the same message at the same time without slowing each other down.


Interview-Friendly Quick Answer

"Kafka is a distributed messaging platform used for real-time data processing. Producers send messages to topics, which are split into partitions. Consumer groups read these messages in parallel for high throughput. Kafka tracks offsets so consumers can continue where they left off, and it replicates data for fault tolerance. This makes Kafka ideal for use cases like real-time analytics, streaming pipelines, and event-driven microservices."



Question:

"How do you troubleshoot an issue where a service is down and not responding?"


Easy Answer (5 Steps)

  1. Check if the service is actually running

  2. Check service logs

  3. Check infrastructure (CPU, Memory, Disk, Network)

    • Sometimes the service is alive but stuck because system resources are exhausted.

  4. Check dependencies

    • If the service depends on a database, API, or Kafka — see if those are up and reachable.

  5. Restart and monitor

    • Restart the service if needed, but also find the root cause so it doesn’t happen again.


Interview-Friendly Quick Answer

"First, I check if the service process is running and reachable. If it’s not, I look at application and system logs to see errors. I also check CPU, memory, and network to ensure it’s not a resource issue. Then I verify dependencies like databases or APIs are working. Once I fix or restart the service, I monitor it to confirm it’s stable and note the root cause."



How to Configure Kafka Consumer for High Throughput & Fault Tolerance

1. High Throughput

These settings help your consumer process messages faster:

  • fetch.min.bytes / fetch.max.wait.ms → let the broker fill bigger batches per fetch instead of returning many tiny responses.

  • max.poll.records → return more messages per poll() call for batch processing.

  • Multiple consumers in one group → partitions are read in parallel.


2. Fault Tolerance

These settings make sure the consumer can recover if something goes wrong:

  • enable.auto.commit=false → commit offsets only after successful processing, so a crash does not skip messages.

  • auto.offset.reset=earliest → a consumer with no valid offset starts from the oldest retained data instead of missing it.

  • session.timeout.ms → tune so a dead consumer is detected and its partitions are reassigned quickly.

  • Replication factor (broker side) → keeps the data itself safe if a broker fails.

(A combined config sketch for both follows.)
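
This sketch assumes the standard Java kafka-clients library; the group id and broker address are placeholders, and the right values depend on message size and processing time:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TunedConsumerConfig {
    public static Properties props() {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");        // hypothetical group
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // --- Throughput: bigger batches per fetch/poll ---
        p.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1_048_576);  // wait for ~1 MB per fetch...
        p.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);      // ...but at most 500 ms
        p.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);      // up to 1000 records per poll()

        // --- Fault tolerance: manual commits, sane restart behavior ---
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);   // commit after processing
        p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        p.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10_000);  // detect dead consumers quickly
        return p;
    }
}
```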


Interview-Friendly Quick Answer

"For high throughput, I increase fetch.min.bytes, fetch.max.wait.ms, and max.poll.records to batch more messages per fetch, and I use multiple consumers in a group for parallelism. For fault tolerance, I disable auto commit and commit offsets after processing, set auto.offset.reset to earliest, and tune session.timeout.ms so partitions are reassigned quickly if a consumer fails. I also ensure topics have a good replication factor on the broker side."

Kafka vs RabbitMQ


| Feature | Kafka | RabbitMQ |
| --- | --- | --- |
| Type | Distributed event streaming platform (good for real-time data pipelines) | Traditional message broker (good for task queues) |
| Message Model | Publish–subscribe (Producers → Topics → Consumers); consumers read at their own pace | Push-based queueing (Producers → Queues → Consumers); messages are removed once consumed |
| Storage | Stores messages for a set retention period (e.g., 7 days), even after they are read | Deletes messages immediately after delivery (unless explicitly persisted) |
| Order Guarantee | Order guaranteed within a partition | Order guaranteed per queue, but not across queues |
| Throughput | Extremely high throughput (millions of messages/sec) | Moderate throughput compared to Kafka |
| Use Case | Real-time streaming, log aggregation, analytics, event sourcing | Background job processing, RPC calls, reliable task queues |
| Scalability | Highly scalable: add brokers & partitions | Scales well, but not as horizontally as Kafka |
| Fault Tolerance | Built-in replication of partitions for durability | Supports clustering & mirrored queues for fault tolerance |

In Short (Interview One-Liner)

  • Kafka: "Best for real-time, high-volume event streaming. Data stays for a retention period and can be replayed."

  • RabbitMQ: "Best for reliable message delivery in task queues. Messages are removed after they’re consumed."

1️⃣ How do you ensure Kafka messages are getting consumed?

You can confirm message consumption using:

  • Consumer lag monitoring → Check the difference between the latest offset in the partition and the committed offset of the consumer.

    • If lag = 0 → Consumer is up-to-date.

    • Tools: Kafka’s kafka-consumer-groups.sh, Prometheus + Grafana, Confluent Control Center (a programmatic lag check is sketched after this list).

  • Application logs / metrics → Log every message consumed (or count) for debugging.
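
As referenced above, lag can also be computed with the Java AdminClient. A minimal sketch (the group id my-group and broker address are placeholder assumptions): for each partition the group has committed, compare the committed offset with the partition's latest offset.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group") // hypothetical group id
                         .partitionsToOffsetAndMetadata().get();
            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();
            // Lag = latest offset minus committed offset; 0 means fully caught up.
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```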


2️⃣ Suppose one of the consumers goes down — what happens?

  • Kafka consumer groups ensure load balancing.

  • If one consumer dies → Kafka rebalances and remaining consumers take over its partitions.

  • If no consumer is available for a partition → The messages still stay in Kafka (until retention time is over) and will be consumed once a consumer comes back.


3️⃣ What if one message got truncated (corrupted)?

  • Kafka has checksums (CRC32) for each message to detect corruption.

  • If corruption is detected →

    • Kafka will throw an error to the consumer.

    • Consumer can retry fetching the message from the broker.

  • If the producer sent a bad message (e.g., incomplete payload) → Application-level validation must handle this.


4️⃣ What is max.poll.interval.ms vs max.poll.records in Kafka?

| Config | Meaning | When to change |
| --- | --- | --- |
| max.poll.interval.ms | Maximum time between two calls to poll() before Kafka considers the consumer dead and triggers a rebalance | Increase if message processing takes longer (e.g., big batch processing) |
| max.poll.records | Maximum number of messages returned in one poll() call | Increase for higher throughput (batch processing), decrease for lower memory use |

💡 Analogy:

  • max.poll.records = How many food items you take in one plate.

  • max.poll.interval.ms = How long you can take to eat before the waiter thinks you left the restaurant. (A poll-loop sketch follows.)
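
Carrying the analogy into a hedged code sketch (the topic, group id, and concrete values are illustrative assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollTuning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-worker"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);         // the size of the plate
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // 10 minutes to finish it

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transactions"));
            while (true) {
                // Up to 200 records per poll; processing them all must finish
                // within max.poll.interval.ms, or the broker assumes this
                // consumer is dead and rebalances its partitions away.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    // application processing goes here
                }
            }
        }
    }
}
```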


5️⃣ What is the max number of records you can store in Kafka?

  • There’s no fixed “max number” — Kafka stores data until:

    • Retention time expires (log.retention.hours / log.retention.ms)

    • OR log size exceeds limit (log.retention.bytes)

  • In practice, you can store petabytes of data if your cluster is big enough.

  • Real limit = Your disk capacity + retention settings.



EASY LEVEL

1. What is Kafka and why is it used in microservices?
Kafka is a distributed, fault-tolerant messaging system used for real-time data streaming. In microservices, it’s used for decoupling services, event-driven communication, and handling high-throughput asynchronous data flow.

2. Difference between a Kafka topic and a partition?

  • Topic = Logical category of messages.

  • Partition = Physical division of a topic for parallelism and scalability. Messages in a partition are ordered.

3. Difference between producer and consumer in Kafka?

  • Producer sends messages to Kafka topics.

  • Consumer reads messages from Kafka topics.

4. Can multiple consumers read from the same Kafka partition?
Within the same consumer group, no: only one consumer reads a given partition at a time. Consumers in different groups, however, can read the same partition independently.

5. Difference between consumer group and individual consumer?

  • Consumer Group = Set of consumers working together to read from a topic in parallel.

  • Individual Consumer = Reads all assigned partitions alone.

6. How does Kafka ensure message ordering?
Messages are ordered within a partition, not across partitions.

7. What is offset in Kafka?
Offset is the unique ID of a message in a partition, used by consumers to track read position.

8. Difference between acks=0, acks=1, acks=all?

  • acks=0 → Producer doesn’t wait for any broker acknowledgment (fastest, but messages can be lost).

  • acks=1 → Leader acknowledges after writing to its own log (faster, but data is lost if the leader fails before followers copy it).

  • acks=all → All in-sync replicas must acknowledge (safest, highest latency). A producer config sketch follows.
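
A hedged sketch with the Java client (broker address is a placeholder); pairing acks=all with idempotence is a common combination so that retries do not create duplicates:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducerConfig {
    public static Properties props() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // retries won't duplicate records
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return p;
    }
}
```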

9. What is retention policy in Kafka?
Defines how long Kafka keeps messages (time-based or size-based), even after they’re read.

10. How does Kafka handle backpressure?
Kafka stores messages in a durable log so slow consumers can catch up later.


MEDIUM LEVEL

11. If a consumer is down for 2 hours, what happens to unconsumed messages?
Messages remain in the topic until retention expires; when the consumer restarts, it reads from the last committed offset.

12. Explain at-most-once, at-least-once, exactly-once delivery.

  • At-most-once: Messages may be lost, no duplicates.

  • At-least-once: No loss, but duplicates possible.

  • Exactly-once: No loss, no duplicates (requires an idempotent producer and the transactional API, with consumers reading read_committed).

13. How does Kafka achieve high throughput?

  • Sequential disk writes.

  • Partition-based parallelism.

  • Zero-copy via OS page cache.

  • Batching messages.

14. How to ensure fault tolerance in Kafka consumers?

  • Use consumer groups with multiple instances.

  • Commit offsets only after successful processing.

  • Use replication at broker level.

15. What happens if a message gets corrupted in Kafka?
Kafka attaches CRC checksums to records. If corruption is detected, the consumer receives an error instead of bad data and can re-fetch the batch from the broker; it does not silently consume the corrupt record.

16. What is the role of Zookeeper in Kafka?
Manages broker metadata, leader election, and configuration. (Newer Kafka uses KRaft mode without Zookeeper.)

17. How does rebalance work in consumer groups?
When consumers join/leave, Kafka reassigns partitions so that all are covered. During rebalance, consumption pauses.

18. How does Kafka replication work, and what is ISR?
Each partition has a leader and followers. ISR (In-Sync Replicas) are replicas fully caught up with the leader.

19. Difference between commitSync() and commitAsync()?

  • commitSync() → Blocks until offset is committed (safe, but slower).

  • commitAsync() → Returns immediately (faster, but may lose commits on crash). A combined pattern is sketched below.
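
This is a minimal sketch, assuming the Java client: commitAsync() on the hot path for speed, with one final commitSync() during shutdown so the last offsets are not lost.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitPattern {
    static void run(KafkaConsumer<String, String> consumer) {
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> { /* process each record */ });
                consumer.commitAsync(); // non-blocking; a lost commit is retried next round
            }
        } finally {
            try {
                consumer.commitSync();  // blocking; guarantees the final offsets are saved
            } finally {
                consumer.close();
            }
        }
    }
}
```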

20. How does Kafka handle schema evolution?
With Schema Registry, producers and consumers can handle backward/forward-compatible schema changes.


HARD / LOGICAL LEVEL

21. Large backlog in one partition — process faster without losing order?
Adding more consumers won’t help (only one consumer per partition within a group). You can:

  • Increase partition count & rebalance (future scaling).

  • Optimize consumer processing speed (batching, async I/O).

22. Consumers are slower than producers — fix?

  • Increase partitions for parallelism.

  • Tune fetch.min.bytes and max.poll.records.

  • Scale consumers horizontally.

  • Optimize processing logic.

23. Kafka design for real-time stock price updates?

  • Topic per stock or sector.

  • Multiple partitions for parallel reads.

  • Low retention (only last few mins/hours).

    • Use compacted topics for latest price only (topic-creation sketch below).
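
As referenced in the list, a hedged sketch of creating such a compacted topic with the Java AdminClient (the topic name stock-prices, 12 partitions, and replication factor 3 are illustrative assumptions; a replication factor of 3 needs at least 3 brokers):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Keyed by stock symbol; compaction keeps only the latest record per key.
            NewTopic topic = new NewTopic("stock-prices", 12, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```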

24. 10 partitions, 7 consumers in a group — partition assignment?
Some consumers get more than one partition: with the default assignors, 3 consumers receive 2 partitions each and the other 4 receive 1 each (10 = 3×2 + 4×1), so the load is slightly uneven.

25. Hot partition issue — solution?
Use better partitioning strategy (key hashing, random, round-robin) to avoid sending too many messages to the same partition.

26. How to monitor if Kafka is consuming correctly?

  • Track lag using Kafka metrics or tools (Burrow, Confluent Control Center).

  • Alert on high consumer lag.

27. Exactly-once processing example?
Enable idempotent producer, use transactional API so producer writes + offset commits happen atomically.
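
A simplified read-process-write sketch using the transactional API (the output topic and wiring are assumptions; the producer must have transactional.id configured, and the consumer should use isolation.level=read_committed with auto-commit disabled):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    // Output records and input offsets commit atomically, so a crash
    // never produces duplicates downstream.
    static void run(KafkaConsumer<String, String> consumer,
                    KafkaProducer<String, String> producer) {
        producer.initTransactions(); // requires transactional.id in the producer config
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("output-topic", r.key(), r.value()));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // Offsets ride in the same transaction as the output records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // sends + offsets roll back together
                // (a real app would also rewind the consumer before retrying)
            }
        }
    }
}
```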

28. Ensure data consistency between Kafka and DB?
Use the Outbox Pattern (write the event to an outbox table in the same DB transaction, then publish to Kafka from that table), since a true two-phase commit across Kafka and a DB is hard in practice.

29. Replay Kafka messages for debugging?
Reset consumer group offset using kafka-consumer-groups.sh --reset-offsets.
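
A programmatic alternative, as a sketch rather than a prescription: subscribe with a rebalance listener and seek newly assigned partitions back to the beginning.

```java
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromStart {
    // Re-read a topic from the beginning in code instead of the CLI:
    // once partitions are assigned, seek each one back to its first offset.
    static void subscribeFromBeginning(KafkaConsumer<String, String> consumer, String topic) {
        consumer.subscribe(List.of(topic), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                consumer.seekToBeginning(partitions);
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // nothing to clean up in this sketch
            }
        });
    }
}
```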

30. Kafka slow — first config changes to try?

  • Increase num.partitions.

  • Tune batch.size, linger.ms.

  • Increase replication factor & ISR settings for safety.

  • Allocate more heap/memory for brokers.




1. How do you handle high availability in a distributed system?

Easy Answer:

“I use replication and failover. For example, in a URL shortener, I can keep multiple copies of the same data in different servers or regions, so if one server fails, traffic switches to another without downtime.”

Example: Multi-region AWS RDS with read replicas.


2. How do you scale a system to handle millions of requests?

Easy Answer:

“I use horizontal scaling with load balancers. Instead of one powerful server, I add more servers to share the load. The load balancer directs traffic based on server health and capacity.”

Example: Adding more API servers behind an AWS ELB.


3. How do you ensure data consistency across multiple servers?

Easy Answer:

“I choose the right consistency model based on needs. For financial transactions, I use strong consistency (2-phase commit). For social feeds, I can use eventual consistency to improve performance.”

Example: UPI payments → strong consistency, Instagram likes → eventual consistency.


4. How do you handle database bottlenecks?

Easy Answer:

“I use caching, indexing, read replicas, and sharding to spread the load. For example, I store frequently accessed short URL data in Redis to avoid hitting the main DB.”

Example: Redis cache for short URL → long URL mapping.


5. How do you design for fault tolerance?

Easy Answer:

“I design the system so it can recover from failures. I use retry logic, circuit breakers, backup databases, and message queues to ensure no data is lost if a service is down.”

Example: Kafka queue for URL analytics so data is not lost if DB is down.


6. How do you handle large file storage?

Easy Answer:

“I store large files in object storage like Amazon S3 instead of the database. Then I store only the file link in the DB.”

Example: User profile picture stored in S3, link stored in DB.


7. How do you manage service discovery in microservices?

Easy Answer:

“I use a service registry like Eureka, Consul, or Zookeeper so services can find each other without hardcoding IPs.”

Example: User service calling Order service via service name instead of fixed IP.


8. How do you reduce latency in APIs?

Easy Answer:

“I use CDN for static content, caching for dynamic data, and optimize database queries. Also, I keep services close to users with geo-distribution.”

Example: Cloudflare CDN for images, Redis for API caching.


9. How do you handle sudden traffic spikes?

Easy Answer:

“I use auto-scaling and a message queue to smooth sudden bursts. This prevents overloading the database.”

Example: If 1M users hit the URL shortener at once, requests are queued in Kafka and processed gradually.


10. How do you ensure security in system design?

Easy Answer:

“I use HTTPS, authentication, authorization, input validation, and encryption at rest and in transit.”

Example: JWT for API auth, AES encryption for stored passwords.

