Kudu Sink

Download connector Kudu Connector for Kafka 2.1.0

A Connector and Sink to write events from Kafka to Kudu. The connector takes the value from the Kafka Connect SinkRecords and inserts a new entry to Kudu.

Prerequisites

  • Apache Kafka 0.11.x of above
  • Kafka Connect 2.x or above
  • Kudu 0.8 or above
  • Java 1.8

Features

  1. The KCQL routing querying - Kafka topic payload field selection is supported, allowing you to select fields written to Kudu.
  2. Error policies for handling failures.
  3. Auto creation of tables.
  4. Auto evolution of tables.

KCQL Support

{ INSERT | UPSERT } INTO kudu_table SELECT { FIELDS, ... }
FROM kafka_topic_name
[AUTOCREATE]
[DISTRIBUTE BY FIELD, ... INTO NBR_OF_BUCKETS BUCKETS]
[AUTOEVOLVE]

Tip

You can specify multiple KCQL statements separated by ; to have a the connector sink multiple topics.

The Kudu sink supports KCQL, Kafka Connect Query Language. The following KCQL support is available:

  1. Field selection.
  2. Target table selection.
  3. Insert and upset modes.
  4. Auto creation of tables.
  5. Auto evolution of tables.

Kudu at present does not have a concept of databases like Impala. If you want to insert into Impala you need to set the target table accordingly:

impala:database.impala_table

Example:

-- Insert mode, select all fields from topicA and write to tableA
INSERT INTO tableA SELECT * FROM topicA

-- Insert mode, select 3 fields and rename from topicB and write to tableB
INSERT INTO tableB SELECT x AS a, y AS b and z AS c FROM topicB

-- Upsert mode, select all fields from topicC, auto create tableC and auto evolve, use field1 and field2 as the primary keys
UPSERT INTO tableC SELECT * FROM topicC AUTOCREATE DISTRIBUTEBY field1, field2 INTO 10 BUCKETS AUTOEVOLVE

Write Modes

Insert Mode

Insert is the default write mode of the sink.

Kafka currently can provide exactly delivery semantics, however to ensure no errors are produced if unique constraints have been implemented on the target tables, the sink can run in UPSERT mode. If the error policy has been set to NOOP then the Sink will discard the error and continue to process, however, it currently makes no attempt to distinguish violation of integrity constraints from other exceptions such as casting issues.

Upsert Mode

The Sink supports Kudu upserts which replaces the existing row if a match is found on the primary keys. Kafka can provide exactly once delivery semantics and order is a guaranteed within partitions. This mode will, if the same record is delivered twice to the sink, result in an idempotent write. If records are delivered with the same field or group of fields that are used as the primary key on the target table, but different values, the existing record in the target table will be updated.

Since records are delivered in the order they were written per partition. The write is idempotent on failure or restart. Redelivery produces the same result.

Auto Create

If you set AutoCreate, the sink will use the schema attached to the topic to create a table in Kudu. The primary keys are set by the PK keyword. The field values will be concatenated and separated by a -. If no fields are set the topic name, partition and message offset are used.

Kudu has a number of partition strategies. The sink only supports the HASH partition strategy, you can control which fields are used to hash and the number of buckets which should be created. This behaviour is controlled via the DISTRIBUTEBY clause.

DISTRIBUTE BY id INTO 10 BUCKETS

Note

Tables that are created are not visible to Impala. You must map them in Impala yourself.

Auto Evolution

The Sink supports auto evolution of tables for each topic. This mapping is set in the connect.kudu.kcql option. When set the Sink will identify new schemas for each topic based on the schema version from the Schema registry. New columns will be identified and an alter table DDL statement issued against Kudu.

Note

Schema registry support is required for this feature.

Schema evolution can occur upstream, for example new fields added or change in data type in the schema of the topic.

Upstream changes must follow the schema evolution rules laid out in the Schema Registry. This Sink only supports BACKWARD and FULLY compatible schemas. If new fields are added the Sink will attempt to perform a ALTER table DDL statement against the target table to add columns. All columns added to the target table are set as nullable.

Fields cannot be deleted upstream. Fields should be of AVRO union type [null, <dataType>] with a default set. This allows the Sink to either retrieve the default value or null. The Sink is not aware that the field has been deleted as a value is always supplied to it.

If a upstream field is removed and the topic is not following the Schema Registry’s evolution rules, i.e. not full or backwards compatible, any errors will default to the error policy.

Note

The Sink will never remove columns on the target table or attempt to change data types.

Error Polices

Landoop sink connectors support error polices. These error polices allow you to control the behavior of the sink if it encounters an error when writing records to the target system. Since Kafka retains the records, subject to the configured retention policy of the topic, the sink can ignore the error, fail the connector or attempt redelivery.

Throw

Any error on write to the target system will be propagated up and processing is stopped. This is the default behavior.

Noop

Any error on write to the target database is ignored and processing continues.

Warning

This can lead to missed errors if you do not have adequate monitoring. Data is not lost as it’s still in Kafka subject to Kafka’s retention policy. The sink currently does not distinguish between integrity constraint violations and or other exceptions thrown by any drivers or target system.

Retry

Any error on write to the target system causes the RetryIterable exception to be thrown. This causes the Kafka Connect framework to pause and replay the message. Offsets are not committed. For example, if the table is offline it will cause a write failure, the message can be replayed. With the Retry policy, the issue can be fixed without stopping the sink.

Lenses QuickStart

The easiest way to try out this is using Lenses Box the pre-configured Docker image, that comes with this connector pre-installed and ready to use. You would need to go to Connectors –> New Connector –> Sink –> Kudu and paste your configuration

../../_images/lenses-create-kudu-sink-connector.png

Kudu Setup

Download and check Kudu QuickStart VM starts up (VM password is demo).

curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash

Kudu Table

Lets create a table in Kudu via Impala. The Sink does support auto creation of tables but they are not sync’d yet with Impala.

#demo/demo
ssh demo@quickstart -t impala-shell

CREATE TABLE default.kudu_test (id INT,random_field STRING, PRIMARY KEY(id))
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU

Note

The Sink will fail to start if the tables matching the topics do not already exist and the KQL statement is not in auto create mode.

When creating a new Kudu table using Impala, you can create the table as an internal table or an external table.

Internal

An internal table is managed by Impala, and when you drop it from Impala, the data and the table truly are dropped. When you create a new table using Impala, it is generally a internal table.

External

An external table is not managed by Impala and dropping such a table does not drop the table from its source location. Instead, it only removes the mapping between Impala and Kudu.

See the Impala documentation for more information about internal and external tables. Here’s an example:

CREATE EXTERNAL TABLE default.kudu_test
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='kudu_test');

Impala Databases and Kudu

Every Impala table is contained within a namespace called a database. The default database is called default, and users may create and drop additional databases as desired.

Note

When a managed Kudu table is created from within Impala, the corresponding Kudu table will be named impala::my_database.table_name

Tables created by the sink are not automatically visible to Impala. You must map the table in Impala.

CREATE EXTERNAL TABLE my_mapping_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);

Sink Connector QuickStart

Start Kafka Connect in distributed mode (see install). In this mode a Rest Endpoint on port 8083 is exposed to accept connector configurations. We developed Command Line Interface to make interacting with the Connect Rest API easier. The CLI can be found in the Stream Reactor download under the bin folder. Alternatively the Jar can be pulled from our GitHub releases page.

Installing the Connector

Connect, in production should be run in distributed mode

  1. Install and configure a Kafka Connect cluster
  2. Create a folder on each server called plugins/lib
  3. Copy into the above folder the required connector jars from the stream reactor download
  4. Edit connect-avro-distributed.properties in the etc/schema-registry folder and uncomment the plugin.path option. Set it to the root directory i.e. plugins you deployed the stream reactor connector jars in step 2.
  5. Start Connect, bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties

Connect Workers are long running processes so set an init.d or systemctl service accordingly.

Starting the Connector (Distributed)

Download, and install Stream Reactor. Follow the instructions here if you have not already done so. All paths in the quickstart are based on the location you installed the Stream Reactor.

Once the Connect has started we can now use the kafka-connect-tools cli to post in our distributed properties file for Kudu. For the CLI to work including when using the dockers you will have to set the following environment variable to point the Kafka Connect Rest API.

export KAFKA_CONNECT_REST="http://myserver:myport"
➜  bin/connect-cli create kudu-sink < conf/kudu-sink.properties

name=kudu-sink
connector.class=com.datamountaineer.streamreactor.connect.kudu.sink.KuduSinkConnector
tasks.max=1
topics=kudu-test
connect.kudu.master=quickstart
connect.kudu.kcql=INSERT INTO impala::default.kudu_test SELECT * FROM kudu-test

If you switch back to the terminal you started Kafka Connect in, you should see the Elastic Search Sink being accepted and the task starting.

We can use the CLI to check if the connector is up but you should be able to see this in logs as well.

#check for running connectors with the CLI
➜ bin/connect-cli ps
kudu-sink
   INFO
    __                    __
   / /   ____ _____  ____/ /___  ____  ____
  / /   / __ `/ __ \/ __  / __ \/ __ \/ __ \
 / /___/ /_/ / / / / /_/ / /_/ / /_/ / /_/ /
/_____/\__,_/_/ /_/\__,_/\____/\____/ .___/
                                   /_/
          __ __          __      _____ _       __
         / //_/_  ______/ /_  __/ ___/(_)___  / /__
        / ,< / / / / __  / / / /\__ \/ / __ \/ //_/
       / /| / /_/ / /_/ / /_/ /___/ / / / / / ,<
      /_/ |_\__,_/\__,_/\__,_//____/_/_/ /_/_/|_|

   by Andrew Stevenson
          (com.datamountaineer.streamreactor.connect.kudu.KuduSinkTask:37)
   INFO KuduSinkConfig values:
       connect.kudu.master = quickstart
    (com.datamountaineer.streamreactor.connect.kudu.KuduSinkConfig:165)
   INFO Connecting to Kudu Master at quickstart (com.datamountaineer.streamreactor.connect.kudu.KuduWriter$:33)
   INFO Initialising Kudu writer (com.datamountaineer.streamreactor.connect.kudu.KuduWriter:40)
   INFO Assigned topics  (com.datamountaineer.streamreactor.connect.kudu.KuduWriter:42)
   INFO Sink task org.apache.kafka.connect.runtime.WorkerSinkTask@b60ba7b finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:155)

Test Records

Tip

If your input topic does not match the target use Lenses SQL to transform in real-time the input, no Java or Scala required!

Now we need to put some records it to the kudu-test topics. We can use the kafka-avro-console-producer to do this. Start the producer and pass in a schema to register in the Schema Registry. The schema has a id field of type int and a random_field of type string.

bin/kafka-avro-console-producer \
 --broker-list localhost:9092 --topic kudu-test \
 --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"id","type":"int"},
{"name":"random_field", "type": "string"}]}'

Now the producer is waiting for input. Paste in the following, one record at a time:

{"id": 999, "random_field": "foo"}
{"id": 888, "random_field": "bar"}

Check for records in Kudu

Now check the logs of the connector you should see this:

INFO Sink task org.apache.kafka.connect.runtime.WorkerSinkTask@68496440 finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:155)
INFO Written 2 for kudu_test (com.datamountaineer.streamreactor.connect.kudu.KuduWriter:90)

In Kudu:

#demo/demo
ssh demo@quickstart -t impala-shell

SELECT * FROM kudu_test;

Query: select * FROM kudu_test
+-----+--------------+
| id  | random_field |
+-----+--------------+
| 888 | bar          |
| 999 | foo          |
+-----+--------------+
Fetched 2 row(s) in 0.14s

Configurations

The Kafka Connect framework requires the following in addition to any connectors specific configurations:

Config Description Type Value
name Name of the connector string This must be unique across the Connect cluster
topics
The topics to sink.
The connector will check that this matches the KCQL statement
string  
tasks.max The number of tasks to scale output int 1
connector.class Name of the connector class string com.datamountaineer.streamreactor.connect.kudu.sink.KuduSinkConnector

Connector Configurations

Config Description Type
connect.kudu.master Specifies a Kudu server or a comma separated list of endpoints string
connect.kudu.kcql Kafka connect query language expression string

Optional Configurations

Config Description Type  
connect.kudu.write.flush.mode
Flush mode on write.
1. SYNC - flush each sink record. Batching is disabled.
2. BATCH_BACKGROUND - flush batch of sink records in background thread.
3. BATCH_SYNC - flush batch of sink records
string SYNC
connect.kudu.mutation.buffer.space Kudu Session mutation buffer space (bytes) 10000  
connect.kudu.schema.registry.url The url for the Schema registry. This is used to | retrieve the latest schema for table creation string http://localhost:8081
connect.kudu.error.policy
Specifies the action to be
taken if an error occurs while inserting the data.
There are three available options, NOOP, the error
is swallowed, THROW, the error is allowed
to propagate and retry.
For RETRY the Kafka message is redelivered up
to a maximum number of times specified by the
connect.kudu.max.retries option
string THROW
connect.kudu.max.retries
The maximum number of times a message
is retried. Only valid when the
connect.kudu.error.policy is set to RETRY
string 10
connect.kudu.retry.interval
The interval, in milliseconds between retries,
if the sink is using
connect.kudu.error.policy set to RETRY
string 60000
connect.progress.enabled
Enables the output for how many
records have been processed
boolean false

Example

name=kudu-sink
connector.class=com.datamountaineer.streamreactor.connect.kudu.sink.KuduSinkConnector
tasks.max=1
topics=kudu-test
connect.kudu.master=quickstart
connect.kudu.schema.registry.url=http://myhost:8081
connect.kudu.kcql=INSERT INTO impala::default.kudu_test SELECT * FROM kudu_test AUTOCREATE DISTRIBUTEBY id INTO 5 BUCKETS

Data Type Mappings

Connect Type Kudu Data Type
INT8 INT8
INT16 INT16
INT32 INT32
INT64 INT64
BOOLEAN BOOLEAN
FLOAT32 FLOAT
FLOAT64 FLOAT
BYTES BINARY

Kubernetes

Helm Charts are provided at our repo, add the repo to your Helm instance and install. We recommend using the Landscaper to manage Helm Values since typically each Connector instance has its own deployment.

Add the Helm charts to your Helm instance:

helm repo add landoop https://lensesio.github.io/kafka-helm-charts/

TroubleShooting

Please review the FAQs and join our slack channel