Kafka connect is an open source component for easily integrate external systems with Kafka. It works with any Kafka product like IBM Event Streams. It uses the concepts of source and sink connectors to ingest or deliver data to / from Kafka topics.
- Connector represents a logical job to move data from / to kafka to / from external systems. A lot of existing connectors can be reused, or you can implement your own.
- Workers are JVM running the connector. For production deployment workers run in cluster or “distributed mode”, and leverage the group management protocol to scale task horizontally.
- Tasks: each worker coordinates a set of tasks to copy data. In distributed mode, task states are saved in kafka topics. They can be started, stopped at any time to support resilience, and scalable data pipeline.
- REST API to configure the connectors and monitors the tasks.
The following figure illustrates a classical ‘distributed’ deployment of a Kafka Connect cluster. Workers are the running process to execute connectors and tasks. Workers are JVMs. Tasks are threads in a JVM. For fault tolerance and offeset management, Kafka connect uses Kafka topics (
-offsets, -config, -status)
When a connector is first submitted to the cluster, the workers rebalance the full set of connectors in the cluster and their tasks so that each worker has approximately the same amount of work.
- Connector and tasks are not guaranteed to run on the same instance in the cluster, especially if you have multiple tasks and multiple instances in your cluster.
- Connector coordinates data streaming by managing tasks
- The connector may be configured to add
Converters(code used to translate data between Connect and the system sending or receiving data), and
Transforms: Simple logic to alter each message produced by or sent to a connector.
Connector keeps state into three topics, which may be created when the connectors start are:
- connect-configs: This topic stores the connector and task configurations.
- connect-offsets: This topic stores offsets for Kafka Connect.
- connect-status: This topic stores status updates of connectors and tasks.
- Copy vast quantity of data from source to kafka: work at the datasource level. So when the source is a database, it uses JDBC API for example.
- Support streaming and batch.
- Scale from standalone, mono connector approach to start small, to run in parallel on distributed cluster.
- Copy data, externalizing transformation in other framework.
- Kafka Connect defines three models: data model, worker model and connector model.
When a worker fails:
Tasks allocated in a worker that fails are reallocated to existing workers, and the task’s state, offset, source record mapping to offset are reloaded from the different topics.
If you are using IBM Event Streams 2020.x on Cloud Pak for Integration, the connectors setup is part of the user admin console toolbox:
Deploying connectors against an IBM Event Streams cluster, you need to have an API key with Manager role, to be able to create topic, produce and consume messages for all topics.
As an extendable framework, Kafka Connect, can have new connector plugins. To deploy new connector, you need to use the kafka docker image which needs to be updated with the connector jars and redeployed to kubernetes cluster or to other environment. With IBM Event Streams on Openshift, the toolbox includes a kafka connect environment packaging, that defines a Dockerfile and configuration files to build your own image with the connectors jar files you need. The configuration files defines the properties to connect to Event Streams kafka brokers using API keys and SASL.
The following public IBM messaging github account includes supported, open sourced, connectors (search for
Here is the list of supported connectors for IBM Event Streams.
We have developed different scenarios in this lab to remotely connect a Kafka Connect distributed cluster to Event Streams on cloud.