Streaming Data to RDBMS via Kafka (Part 1)

Ahmet GÜL
VakıfBank Teknoloji
4 min read · Apr 14, 2022


Photo by Manuel Geissinger: https://www.pexels.com/photo/black-server-racks-on-a-room-325229/

As people’s interaction with the digital world keeps increasing, the amount of data produced and consumed keeps growing. Almost every device people use is a data producer; in other words, every application produces data. Since different systems produce different types of data, integrating them is unavoidable, and that integration is still a challenge.

Kafka is a scalable, open-source, free messaging system that integrates applications through a publish/subscribe model. This makes Kafka a good fit for collecting and integrating data from different systems.

In this article, I will explain how to produce data in Avro format to Kafka. In the second article, my colleagues from the Kafka-Elastic group will explain how to stream data from Kafka to an RDBMS. In short, the first part is about collecting the data; the second part is about building the system and streaming it.

Let’s start with Avro.

What is Avro?

According to Wikipedia:

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages; one for human editing (Avro IDL) and another which is more machine-readable based on JSON.

In short, Avro schemas are defined in JSON, while the data itself is serialized in a compact binary format.

Why use Avro?

Since our aim is to stream data to an RDBMS through Kafka, the data has to be in a form that is compatible with Kafka connectors. In other words, producing data is easy, but collecting it in the right sequence is complex, so it is better to use a Kafka connector. Imagine a single row: adding it to a table is a very easy operation, yet if the same row is updated many times or deleted, the final form of the row that gets transferred to the RDBMS has to be worked out. With Avro-formatted data, the Kafka connector keeps records in order (based on execution order) without any extra work, and the final form of the row ends up in the RDBMS.

Of course, that is not the only advantage. Avro also works with schemas, so using it ensures that the columns of the table in the RDBMS and the properties of the object written to Kafka stay compatible. In other words, different components of the system can verify the data types.

What is a Schema?

A schema is a definition of the data types a record contains. In other words, it is a map that tells us how to fill an object for a specific table.

A Sample Key Schema
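The original key schema is embedded as a gist in the article; below is a minimal sketch of what such a key schema could look like. The record name, namespace, and field name are illustrative assumptions, not taken from the article.

```json
{
  "type": "record",
  "name": "CustomerKey",
  "namespace": "com.example.streaming",
  "fields": [
    { "name": "ID", "type": "long" }
  ]
}
```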
A Sample Value Schema
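Similarly, the value schema lists the remaining columns. Again, this is only an illustrative sketch with assumed field names; nullable columns are modelled as unions with null.

```json
{
  "type": "record",
  "name": "CustomerValue",
  "namespace": "com.example.streaming",
  "fields": [
    { "name": "ID",         "type": "long" },
    { "name": "NAME",       "type": ["null", "string"], "default": null },
    { "name": "BALANCE",    "type": ["null", "double"], "default": null },
    { "name": "UPDATED_AT", "type": ["null", "long"],   "default": null }
  ]
}
```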

How to create a connection with Kafka?

Here is our Kafka producer class.
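The class itself is embedded as a gist in the article; here is a minimal sketch of how such a producer class might be wired up with the Confluent .NET client. The broker address, schema registry URL, and class name are placeholders, not the original code.

```csharp
using System;
using Avro.Generic;
using Confluent.Kafka;
using Confluent.SchemaRegistry;
using Confluent.SchemaRegistry.Serdes;

public class AvroKafkaProducer : IDisposable
{
    private readonly CachedSchemaRegistryClient _schemaRegistry;
    private readonly IProducer<GenericRecord, GenericRecord> _producer;

    public AvroKafkaProducer(string bootstrapServers, string schemaRegistryUrl)
    {
        _schemaRegistry = new CachedSchemaRegistryClient(
            new SchemaRegistryConfig { Url = schemaRegistryUrl });

        var config = new ProducerConfig { BootstrapServers = bootstrapServers };

        // Key and value are both GenericRecord objects serialized as Avro,
        // so they can be validated against the schemas in the registry.
        _producer = new ProducerBuilder<GenericRecord, GenericRecord>(config)
            .SetKeySerializer(new AvroSerializer<GenericRecord>(_schemaRegistry))
            .SetValueSerializer(new AvroSerializer<GenericRecord>(_schemaRegistry))
            .Build();
    }

    public IProducer<GenericRecord, GenericRecord> Producer => _producer;

    public void Dispose()
    {
        _producer.Flush();
        _producer.Dispose();
        _schemaRegistry.Dispose();
    }
}
```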

How to read a Schema?

We use Confluent Schema Registry to store and serve our schemas; in other words, the registry holds the schema that describes the data format. For each table, we define two schemas: a key schema and a value schema.

The key schema holds the primary key of the table. Here is the code to get the key schema.
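The original snippet is not reproduced here; this is a minimal sketch of fetching the key schema from the registry, assuming the default subject naming convention ("&lt;topic&gt;-key") and a placeholder topic name.

```csharp
using Avro;
using Confluent.SchemaRegistry;

// Inside an async method: fetch the latest registered key schema for the topic
// and parse it into an Avro RecordSchema we can build GenericRecords against.
var registry = new CachedSchemaRegistryClient(
    new SchemaRegistryConfig { Url = "http://localhost:8081" });

var registeredKeySchema = await registry.GetLatestSchemaAsync("customer-topic-key");
var keySchema = (RecordSchema)Schema.Parse(registeredKeySchema.SchemaString);
```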

The value (record) schema covers the other columns of the table; basically, it is the data we want to insert or update. It is fetched the same way, as shown below.
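Continuing the sketch above, only the subject suffix changes ("-value"); the topic name is again a placeholder.

```csharp
// Same registry client as above; only the subject differs.
var registeredValueSchema = await registry.GetLatestSchemaAsync("customer-topic-value");
var valueSchema = (RecordSchema)Schema.Parse(registeredValueSchema.SchemaString);
```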

How to fill the object?

Our application receives a DTO as input and converts it into an Avro-formatted record using the schemas. We parsed the schemas in the previous section; now it is time to convert the C# object into an Avro-formatted object. Basically, we map each C# data type to the corresponding Avro type.

Here is the conversion code that populates the key record.
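The article's gist is not reproduced here; this is a minimal sketch that continues from the schema snippets above and assumes a hypothetical DTO with an Id property.

```csharp
using Avro.Generic;

// Build the key record: one field per primary-key column.
var keyRecord = new GenericRecord(keySchema);
keyRecord.Add("ID", dto.Id);   // a C# long maps directly to Avro "long"
```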

After this step, we do the same thing to populate the value record. As the code shows, we define a mapping for each data type.
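A sketch of the value-record population with a few example type mappings; the DTO properties and the chosen conversions (e.g. DateTime to epoch milliseconds, decimal to double) are assumptions for illustration, not the article's exact rules.

```csharp
// Build the value record and map each C# type to an Avro-friendly type.
var valueRecord = new GenericRecord(valueSchema);
valueRecord.Add("ID", dto.Id);
valueRecord.Add("NAME", dto.Name);                 // string -> Avro "string"
valueRecord.Add("BALANCE", (double?)dto.Balance);  // decimal -> "double" (illustrative choice)
valueRecord.Add("UPDATED_AT", dto.UpdatedAt.HasValue
    ? (long?)new DateTimeOffset(dto.UpdatedAt.Value).ToUnixTimeMilliseconds()
    : null);                                       // DateTime -> epoch millis ("long")
```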

How to produce the Avro object to Kafka?

There are two different strategies for producing.

The first is producing both a key and a value; this is used for insert and update operations.

The second is producing only a key, with a null value; this is used for delete operations. A key is enough because the row can be deleted by its key alone; we do not need the rest of the row. Conveniently, the connector recognizes this as a delete operation and handles the rest for us.

Finally, the code snippet below is enough to produce; with it, we send the Avro-formatted object to Kafka.
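A sketch of both produce strategies with the Confluent .NET client, continuing from the snippets above; the topic name is a placeholder, and the delete case relies on the standard Kafka tombstone convention (a key with a null value).

```csharp
// Insert / update: send both the key record and the value record.
await producer.ProduceAsync("customer-topic",
    new Message<GenericRecord, GenericRecord>
    {
        Key = keyRecord,
        Value = valueRecord
    });

// Delete: send only the key with a null value (a tombstone);
// the sink connector interprets this as a delete for that key.
await producer.ProduceAsync("customer-topic",
    new Message<GenericRecord, GenericRecord>
    {
        Key = keyRecord,
        Value = null
    });
```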

Anything else to say?

During development, we had memory leaks in the producer. To solve the problem, we edited the producer config.

We turned off delivery reports. Disabling this feature makes the application work as pure fire-and-forget, and the producer uses fewer resources on the server because it no longer waits for a delivery response from Kafka.

EnableDeliveryReports = false
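In the .NET client this corresponds to the ProducerConfig property shown below; the broker address is a placeholder.

```csharp
var config = new ProducerConfig
{
    BootstrapServers = "broker:9092",
    // Fire-and-forget: the producer no longer tracks per-message
    // delivery reports, which lowers memory and CPU usage.
    EnableDeliveryReports = false
};
```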

Another change that reduced CPU usage and memory consumption was removing the ContinueWith call from the asynchronous produce method. With ContinueWith, a continuation task is scheduled on every produce just to inspect the delivery or error report. We enable this continuation only while debugging.
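A sketch of what this could look like in code: in release builds the task returned by ProduceAsync is left alone (fire-and-forget), and the logging continuation is attached only in debug builds. The logging call is a placeholder.

```csharp
var produceTask = producer.ProduceAsync("customer-topic",
    new Message<GenericRecord, GenericRecord> { Key = keyRecord, Value = valueRecord });

#if DEBUG
// Only while debugging: attach a continuation to inspect failures.
// In release builds no continuation is scheduled, saving CPU and memory.
produceTask.ContinueWith(
    t => Console.WriteLine($"Produce failed: {t.Exception?.GetBaseException().Message}"),
    TaskContinuationOptions.OnlyOnFaulted);
#endif
```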

In addition, we use the singleton pattern for the producer builder. The Confluent documentation samples do not use a singleton; they wrap the producer in a using block, and that sample caused memory leaks on our servers. So instead we create a single connection and reuse the same object for a period of time.
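A minimal sketch of a singleton producer using Lazy&lt;T&gt;; this is one reading of the pattern, not a snippet from the article or the Confluent documentation, and the addresses are placeholders.

```csharp
using System;
using Avro.Generic;
using Confluent.Kafka;
using Confluent.SchemaRegistry;
using Confluent.SchemaRegistry.Serdes;

public static class ProducerHolder
{
    private static readonly CachedSchemaRegistryClient SchemaRegistry =
        new CachedSchemaRegistryClient(new SchemaRegistryConfig { Url = "http://localhost:8081" });

    // A single producer for the whole process: created lazily and
    // thread-safely on first use, then reused for every message.
    private static readonly Lazy<IProducer<GenericRecord, GenericRecord>> LazyProducer =
        new Lazy<IProducer<GenericRecord, GenericRecord>>(() =>
            new ProducerBuilder<GenericRecord, GenericRecord>(
                    new ProducerConfig
                    {
                        BootstrapServers = "broker:9092",
                        EnableDeliveryReports = false
                    })
                .SetKeySerializer(new AvroSerializer<GenericRecord>(SchemaRegistry))
                .SetValueSerializer(new AvroSerializer<GenericRecord>(SchemaRegistry))
                .Build());

    public static IProducer<GenericRecord, GenericRecord> Instance => LazyProducer.Value;
}
```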

Next Article?

In the next article, my colleagues from the Kafka-Elastic group will explain how to set up the Kafka and Confluent environments. After that, they will cover the streaming part.
