architecture – Idempotent Kafka consumer with an “eventually consistent” database

What are my options for making my Kafka consumer idempotent (given Kafka’s at-least-once semantics) when my project’s only database is DynamoDB, which is eventually consistent?

I’m afraid that setting the “consistent read” flag might slow reads down and overload the leader node.

I have neither control over the topics I consume nor over the Kafka cluster.
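
One option (a minimal sketch; the table and attribute names are assumptions) is to deduplicate on the write path instead of the read path: record each processed event ID with a DynamoDB conditional put. Conditional writes are evaluated against the latest version of the item, so the dedup check does not depend on consistent reads at all. The eventId here could be a natural business key on the event or the record’s topic-partition-offset.

import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}

// Returns true only for the first delivery of a given event ID.
def isFirstDelivery(dynamo: DynamoDbClient, eventId: String): Boolean =
  try {
    dynamo.putItem(
      PutItemRequest.builder()
        .tableName("processed_events")                                          // hypothetical table keyed by eventId
        .item(Map("eventId" -> AttributeValue.builder().s(eventId).build()).asJava)
        .conditionExpression("attribute_not_exists(eventId)")                   // only the first writer succeeds
        .build())
    true                                                                        // first delivery: apply side effects
  } catch {
    case _: ConditionalCheckFailedException => false                            // redelivery: skip
  }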

Debug Kafka pipeline

I have a Kafka topic that is streaming data in production. I want to use the same data stream for debugging without impacting the offsets of the existing pipeline.

I remember creating separate consumer groups for this purpose in earlier versions, but I am using Spark Structured Streaming to read data from Kafka, and it discourages setting a group ID when reading from Kafka.
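
For reference, a minimal sketch of a separate debugging query (servers, topic, and paths are placeholders, and an existing SparkSession named spark is assumed): each Structured Streaming query tracks its offsets in its own checkpoint directory rather than committing them to a shared consumer group, so a second query over the same topic does not affect the existing pipeline.

val debug = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest")                     // or "earliest" to replay history
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")                                       // debug sink only
  .option("checkpointLocation", "/tmp/debug-checkpoint")   // independent of the production checkpoint
  .start()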

What to process in a Kafka broker vs in a Kafka Streams client application?

Using Kafka, I understand that it makes little sense to simply pass events in and out of a Kafka cluster, and that the real benefit comes from doing some processing on the events received in the cluster. So there is processing that can be done in the cluster itself.

Using Kafka Streams we can do some processing too, but this time it is done in the client application itself, not in the cluster.

So what sort of processing should be done in the cluster, and what sort in the client application?
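
For illustration only (topic names, the application id, and the filter rule are assumptions), a minimal Kafka Streams sketch: the filtering and transformation run inside the client JVM, and only the result is written back to the cluster.

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._   // older Kafka versions: org.apache.kafka.streams.scala.Serdes._

object ClientSideProcessing extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-filter")   // hypothetical application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // hypothetical broker

  val builder = new StreamsBuilder()
  builder.stream[String, String]("orders")          // hypothetical input topic, read from the cluster
    .filter((_, v) => v.contains("PAID"))           // hypothetical rule, evaluated in the client application
    .mapValues(_.toUpperCase)                       // hypothetical transformation, also client-side
    .to("orders-filtered")                          // result is written back to the cluster

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}

The brokers themselves only store and replicate topics; aside from built-in facilities such as retention and log compaction, per-record processing happens in client applications like this one.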

scala – How to partition Spark dataset using a field from Kafka input

I have a Spark application that reads data from a Kafka topic. My code is –

val df = spark.readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "server1")
 .option("subscribe", "topic1")
 .load()
df.printSchema()

This yields

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)

I need to

  1. query for both the key and value
  2. deserialize the binary key to a string
  3. deserialize the binary value to an object (I have a deserializer for this)
  4. repartition the input tuples/DataSet based on the key string from step 2

Would the code below be the correct way to do it?

val dataset = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "server1")
     .option("subscribe", "topic1")
     .load()
     .select($"key", $"value")
     .selectExpr("CAST(key as STRING) AS partitionId", "value AS value")
     .repartition(col("partitionId"))
     .mapPartitions{row => createOutputMessages()}

It does run fine, but I have no way to know if the partitioning was done correctly.
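
One way to check it (a debugging sketch; the checkpoint path and console output are assumptions): inside foreachBatch each micro-batch is an ordinary DataFrame, so rows can be tagged with spark_partition_id() after the repartition to confirm that every partitionId value lands in exactly one Spark partition.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, spark_partition_id}

val verifyPartitioning: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.repartition(col("partitionId"))
    .groupBy(spark_partition_id().as("sparkPartition"), col("partitionId"))
    .count()
    .show(false)                                   // each partitionId should appear under exactly one sparkPartition

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING) AS partitionId", "value")
  .writeStream
  .foreachBatch(verifyPartitioning)
  .option("checkpointLocation", "/tmp/partition-check")   // hypothetical path, for the debug run only
  .start()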

apache kafka – How would a Search API microservice fit in an event driven architecture?

I apologize if this sounds really strange. I am very new to system design and have just been reading up on event-driven systems. To gain a better understanding and to refine my thought process, I decided to try my hand at designing a very simple version of a food ordering/delivery system (like Uber Eats). What I have explained below is just a subset of the whole system and covers just a handful of users.

My original design was to have a message broker like Kafka in the middle, with Order, Payments, Restaurant, and Notification microservices all consuming events from it. The Restaurant microservice would be responsible for all restaurant-related actions, such as accepting customer orders, updating customer orders, etc. I also wanted this microservice to be in charge of operations such as adding a new restaurant, updating a restaurant, etc. And here’s where I am facing a problem.

When the user decides to search for “Mexican”, I want this query to be directed to a SEARCH microservice which would query the Restaurant database and return the results. However, since the rest of the microservices use events to communicate with each other, can this SEARCH microservice bypass the Restaurant microservice and directly query the Restaurant database? Would this be an acceptable design? Or should I scrap the message broker design entirely and design something which communicates via REST calls?
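
For illustration, one common event-driven alternative (all names here are hypothetical): the SEARCH microservice consumes the restaurant events and maintains its own read model, so it can answer queries like “Mexican” without reaching into the Restaurant service’s database.

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object SearchIndexer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")               // hypothetical broker
  props.put("group.id", "search-service")
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("restaurant-events").asJava)        // hypothetical topic published by the Restaurant service

  val index = scala.collection.mutable.Map.empty[String, String]   // stand-in for a real search index
  while (true) {
    for (record <- consumer.poll(Duration.ofMillis(500)).asScala)
      index.put(record.key, record.value)                     // e.g. restaurantId -> restaurant document
  }
}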

Kafka consumer scalability and speed

I have a huge volume of data in a Kafka system and want to read it, apply a filter, and then transform it to JSON. The JSON would be exposed through a REST API call.

My producer is already sending data; I just connect to Kafka, poll the data, send it to the REST API, and then close the Kafka connection.

I looked at the Apache Flink framework, which is highly scalable, but got stuck on the implementation when it came to exposing the results through a REST endpoint.

Considering speed and scalability, do you have any suggestions for a framework for my use case?
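
For context, a rough sketch of the plain poll / filter / transform / POST loop described above (topic, filter rule, JSON shape, and endpoint are all placeholders); a framework such as Flink or Kafka Streams would essentially replace this loop with a managed, scalable version of it.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object FilterTransformPost extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")
  props.put("group.id", "filter-transform")
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("input-topic").asJava)
  val http = HttpClient.newHttpClient()

  while (true) {
    for (record <- consumer.poll(Duration.ofMillis(500)).asScala if record.value.contains("ERROR")) {   // placeholder filter
      val json = s"""{"key":"${record.key}","value":"${record.value}"}"""        // real code would use a JSON library
      val request = HttpRequest.newBuilder(URI.create("http://api.example.com/events"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build()
      http.send(request, HttpResponse.BodyHandlers.ofString())
    }
  }
}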

What is the difference between Kafka partitions and Kafka replicas?

I created a 3-broker Kafka setup with broker IDs 20, 21, and 22. Then I created this topic:

bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic zeta --partitions 4 --replication-factor 3

which resulted in:

(screenshot of the resulting topic/partition layout, not reproduced here)

When a producer sends the message “hello world” to topic zeta, to which partition does Kafka first write the message?

Does the “hello world” message get replicated across all 4 partitions?

Does each of the 3 brokers contain all 4 partitions? How is that related to the replication factor of 3 in this context?

If I have 8 consumers running in parallel in their own processes or threads, all subscribed to the zeta topic, how does Kafka assign partitions or brokers to serve them in parallel?
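
For illustration (the broker address and key are placeholders): each message is written to exactly one of the 4 partitions, and the replication factor of 3 means that one partition is copied to 3 brokers; it is not duplicated across the other partitions. With a key, the default partitioner chooses the partition from a hash of the key; without a key, records are spread across the partitions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("zeta", "greeting", "hello world"))   // keyed: partition chosen by hashing the key
producer.send(new ProducerRecord[String, String]("zeta", "hello world"))               // unkeyed: spread across the partitions
producer.close()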

java – Scaling in P2P Kafka vs JMS/MQ

Consider using Kafka as a traditional point-to-point queue (by using just one consumer group). If you have, say, 4 partitions in your Kafka topic, you can have a maximum of 4 consumers in your consumer group. Contrast this with a JMS implementation (ActiveMQ, etc.) of a P2P queue: you can add as many consumers as you want (although that will degrade performance).

I have worked on JMS applications with consumer counts in the 50–100 range, and no one advocates that many partitions for a Kafka topic. So, theoretically, can you not achieve the same degree of parallelism with P2P Kafka?

Given this, for a P2P queue, why would one prefer Kafka over JMS/MQ?

node.js – 10k partitions in Kafka?

I have 10k+ “devices” and require an independent “queue” for each one.
Each device writes the same type of data, for example:
{ deviceId, key, value, timestamp }

Then I have, say, 4 consumers. Each gets 2500 deviceIds that it is responsible for.
(A deviceId can only be consumed by one consumer at a time).

What’s the “best” way to implement this in Kafka?
Currently I’m thinking of a single topic with 10k partitions and all consumers in a single consumer group.
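
A minimal sketch of the keying idea (topic name and broker are assumptions): with deviceId as the record key, all records for one device stay in order inside a single partition, and the consumer group spreads the partitions over the 4 consumers, so a much smaller partition count than 10k can still behave like one queue per device.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
val deviceId = "device-0042"                                                            // hypothetical device
val payload  = """{"deviceId":"device-0042","key":"temperature","value":"21.5","timestamp":1700000000}"""
producer.send(new ProducerRecord[String, String]("device-data", deviceId, payload))    // key = deviceId keeps per-device order
producer.close()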

Kafka producer dealing with lost connection to broker

With a producer configuration like the one below, I am creating a singleton producer that is used throughout the application:

properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.consul1:9092,kafka.consul2:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.ACKS_CONFIG, "1");

I am connected to a Kubernetes-hosted Kafka cluster. The brokers’ advertised.listeners is configured to return IP addresses rather than host names. Normally everything works as expected, but the problem occurs when Kafka is restarted: sometimes the IP address changes. Since the producer only knows about the old IP, it keeps trying to connect to that host to send messages, and none of the messages go through.

I observe that an org.apache.kafka.common.errors.TimeoutException is thrown when the send fails. Currently the messages are sent asynchronously:

producer.send(data,
                (RecordMetadata recordMetadata, Exception e) -> {
                    if (e != null) {
                        LOGGER.error("Exception while sending message to kafka", e);
                    }
                });

How should the TimeoutException be handled now? Given that the producer is shared across the application, closing and recreating it inside the callback does not sound right.
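
One possible shape for this, sketched under assumptions (the holder class and the off-thread close are illustrations, not an established recipe): the callback only reports which producer instance failed, and a shared holder swaps in a fresh instance outside the callback, so the rest of the application keeps calling the same object.

import java.util.Properties
import java.util.concurrent.atomic.AtomicReference
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.errors.TimeoutException

class ProducerHolder(props: Properties) {
  private val ref = new AtomicReference(new KafkaProducer[String, String](props))

  def send(topic: String, value: String): Unit = {
    val current = ref.get()
    current.send(new ProducerRecord[String, String](topic, value),
      (_: RecordMetadata, e: Exception) =>
        if (e.isInstanceOf[TimeoutException]) replace(current))   // only flag the failed instance here
  }

  private def replace(failed: KafkaProducer[String, String]): Unit = synchronized {
    if (ref.get() eq failed) {                                    // first failing callback wins
      ref.set(new KafkaProducer[String, String](props))           // fresh instance re-resolves the bootstrap servers
      new Thread(() => failed.close()).start()                    // close the old producer off the sender thread
    }
  }
}

Whether a rebuilt producer actually picks up the new broker IPs still depends on the advertised listeners and DNS setup, and settings such as delivery.timeout.ms control how long a send is retried before the callback ever sees the TimeoutException.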