Using Python — Jupyter Notebook as Kafka Producer (Updated 28 Oct 21)

Ivan Muhammad Siegfried
3 min read · Apr 8, 2021
Figure: Python — Kafka Integration

Introduction

In this article, I will show how to use a Python script as a Kafka producer: we generate a fake pizza-order dataset and push it into a Kafka topic.

Some of the terms commonly used include the following:

· Apache Kafka: a platform for transferring data between processes, applications, and servers using a publish-subscribe messaging system.

· Topic: the storage medium for received data. A topic is similar to a table in a database.

· Kafka Producer: an application that publishes data to a topic.

Prerequisites

I will use Aiven’s project with a slight modification to the Jupyter Notebook. The prerequisites for this article are as follows:

· Apache Kafka

· Jupyter Notebook (Using Anaconda)

· Ubuntu 20.04

Java Installation

Apache Kafka runs on the JVM, so a Java installation is required. Open your terminal and install Java with the following commands.
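On Ubuntu 20.04 a typical installation looks like this (OpenJDK 11 is an assumption here; any Java version supported by your Kafka release will do):

```shell
# Refresh the package index and install OpenJDK 11
sudo apt update
sudo apt install -y openjdk-11-jdk

# Confirm that Java is available
java -version
```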

Download Apache Kafka

Download Apache Kafka from the official website.
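For example (Kafka 2.8.1 built for Scala 2.13 is assumed here, a current release at the time of writing; check the downloads page for the latest version):

```shell
# Download a Kafka release archive from the Apache archive mirror
wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.13-2.8.1.tgz
```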

Then, extract the downloaded archive.
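Assuming the archive name from the previous step and /usr/local/kafka as the install location (both are assumptions, adjust to taste):

```shell
# Unpack the archive and move it to a convenient location
tar -xzf kafka_2.13-2.8.1.tgz
sudo mv kafka_2.13-2.8.1 /usr/local/kafka
```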

Then, create a systemd unit file for ZooKeeper so it can be managed as a service (for example, by opening sudo nano /etc/systemd/system/zookeeper.service).

Then add the configuration below and save the file.
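A unit definition along these lines works; the /usr/local/kafka path is an assumption carried over from the extraction step:

```
[Unit]
Description=Apache Zookeeper server
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties
ExecStop=/usr/local/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
```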

Then, create a systemd unit file for the Kafka server as well (for example, sudo nano /etc/systemd/system/kafka.service):

Next, add the configuration below to the opened file and save it.
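Something like the following; the install path and the JAVA_HOME value are assumptions that depend on where you placed Kafka and which JDK you installed:

```
[Unit]
Description=Apache Kafka server
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
ExecStop=/usr/local/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
```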

Then, reload systemd and start Kafka with the following commands.
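Assuming the zookeeper.service and kafka.service unit names from the previous steps:

```shell
# Make systemd pick up the new unit files, then start both services
sudo systemctl daemon-reload
sudo systemctl start zookeeper
sudo systemctl start kafka

# Check that the broker came up
sudo systemctl status kafka
```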

Python Package Installation

The Python packages used are faker and kafka-python. Install them (inside Jupyter Notebook) with the following command:
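From a terminal:

```shell
pip install faker kafka-python
```

Inside a Jupyter Notebook cell, prefix the command with an exclamation mark (!pip install faker kafka-python) so it runs as a shell command.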

Then, create a topic on Kafka using the terminal of the Ubuntu machine where Apache Kafka is installed. In this example, I am using a topic called pizza-orders.
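Using the kafka-topics.sh script shipped with Kafka (the /usr/local/kafka path and the single-broker settings are assumptions for a local setup):

```shell
# Create the pizza-orders topic on the local broker
/usr/local/kafka/bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 1 \
  --topic pizza-orders
```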

Create Python Producer Program

Then, make a new notebook in Anaconda/Jupyter Notebook to act as the fake data provider. In other words, this notebook plays the role of the data source we would usually get from another system — an example is the Twitter API, which can be used to download tweets for a given keyword. Fake data is useful when you are learning to process data and build pipelines from one system to another.

The class above serves as the source data, from which values are selected at random.

Then, define the necessary basic parameters, such as the maximum number of pizzas per order and the maximum number of toppings sprinkled on each pizza. If you want each run to produce the same results, use the Faker.seed(4321) method; 4321 is just a constant used as the randomization seed and can be replaced by any other number. The add_provider method registers the custom data class with the Faker instance.

This function generates data in a form that is ready to be published to Kafka. The generated records could equally be inserted into a non-relational database such as MongoDB or Cassandra. The data contains id, a running order number, and shop, the shop where the pizza was purchased. Then, respectively, name, phone_number, and address are random names, phone numbers, and addresses drawn from the Faker packages. The last field is the list of pizzas ordered, built in the loop inside the function.

Then finally, this function calls produce_pizza_order once for each message you want to create and forwards the result to the Kafka topic.
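A sketch of that final step; the function name send_pizza_orders and the localhost:9092 broker address are assumptions, and a trivial stand-in for the order generator is included so the snippet stands alone:

```python
import json

def produce_pizza_order(order_id):
    # Stand-in for the order generator from the previous snippet
    return {"id": order_id, "shop": "Marios Pizza"}

def send_pizza_orders(number_of_messages,
                      topic_name="pizza-orders",
                      servers="localhost:9092"):
    """Generate `number_of_messages` fake orders and publish them to Kafka."""
    # kafka-python; imported here so the helper above also works
    # on a machine without a broker
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=servers,
        # serialize each dict to a UTF-8 encoded JSON payload
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(number_of_messages):
        producer.send(topic_name, value=produce_pizza_order(i))
    producer.flush()   # block until all messages are actually delivered
    producer.close()

# Example (requires a running broker):
# send_pizza_orders(100)
```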

Run Kafka Consumer

Run a Kafka consumer to watch the data arriving from the Jupyter Notebook into Apache Kafka, using the following command in the Ubuntu terminal.
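Using the console consumer shipped with Kafka (the /usr/local/kafka path is an assumption from the installation steps):

```shell
# Print every message published to the pizza-orders topic
/usr/local/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic pizza-orders \
  --from-beginning
```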

Then, run the Python/Jupyter Notebook program we discussed earlier. You will see the data being received into the pizza-orders topic in the Ubuntu terminal.

Figure: Kafka Consumer on pizza-orders Topic. Data Created from Python/Jupyter Notebook.

Full Code

Happy coding! :)
