This Kafka Connect sink connector facilitates the seamless transfer of records from Kafka to Azure Datalake Buckets. It offers robust support for various data formats, including AVRO, Parquet, JSON, CSV, and Text, making it a versatile choice for data storage. Additionally, it ensures the reliability of data transfer with built-in support for exactly-once semantics.
The connector uses KCQL to map topics to Datalake buckets and paths. The full KCQL syntax is:
INSERT INTO bucketAddress[:pathPrefix] SELECT * FROM kafka-topic [PARTITIONBY (partition[, partition] ...)] [STOREAS storage_format] [WITH_FLUSH_SIZE = flush_size] [WITH_FLUSH_INTERVAL = flush_interval] [WITH_FLUSH_COUNT = flush_count] [PROPERTIES( 'property.1'=x, 'property.2'=x, )]
Please note that you can employ escaping within KCQL for the INSERT INTO, SELECT * FROM, and PARTITIONBY clauses when necessary. For example, an incoming Kafka message stored as Json can use fields contaiing .:
.
{ ... "a.b": "value", ... }
In this case you can use the following KCQL statement:
INSERT INTO `container-name`:`prefix` SELECT * FROM `kafka-topic` PARTITIONBY `a.b`
The target bucket and path are specified in the INSERT INTO clause. The path is optional and if not specified, the connector will write to the root of the bucket and append the topic name to the path.
Here are a few examples:
INSERT INTO testcontainer:pathToWriteTo SELECT * FROM topicA; INSERT INTO testcontainer SELECT * FROM topicA; INSERT INTO testcontainer:path/To/Write/To SELECT * FROM topicA PARTITIONBY fieldA;
Currently, the connector does not offer support for SQL projection; consequently, anything other than a SELECT * query is disregarded. The connector will faithfully write all fields from Kafka exactly as they are.
The source topic is defined within the FROM clause. To avoid runtime errors, it’s crucial to configure either the topics or topics.regex property in the connector and ensure proper mapping to the KCQL statements.
topics
topics.regex
A notable exception is supported when the FROM clause is set to *. In this scenario, the connector will automatically map all topics to the KCQL statement and apply the same behavior to each of them.
The object key serves as the filename used to store data in Datalake. There are two options for configuring the object key:
PARTITIONBY
$container/[$prefix]/$topic/customKey1=customValue1/customKey2=customValue2/topic(partition_offset).extension
Custom keys and values can be extracted from the Kafka message key, message value, or message headers, as long as the headers are of types that can be converted to strings. There is no fixed limit to the number of elements that can form the object key, but you should be aware of Azure Datalake key length restrictions.
To extract fields from the message values, simply use the field names in the PARTITIONBY clause. For example:
PARTITIONBY fieldA, fieldB
However, note that the message fields must be of primitive types (e.g., string, int, long) to be used for partitioning.
You can also use the entire message key as long as it can be coerced into a primitive type:
PARTITIONBY _key
In cases where the Kafka message Key is not a primitive but a complex object, you can use individual fields within the message Key to create the Datalake object key name:
PARTITIONBY _key.fieldA, _key.fieldB
Kafka message headers can also be used in the Datalake object key definition, provided the header values are of primitive types easily convertible to strings:
PARTITIONBY _header.<header_key1>[, _header.<header_key2>]
Customizing the object key can leverage various components of the Kafka message. For example:
PARTITIONBY fieldA, _key.fieldB, _headers.fieldC
This flexibility allows you to tailor the object key to your specific needs, extracting meaningful information from Kafka messages to structure Datalake object keys effectively.
To enable Athena-like partitioning, use the following syntax:
INSERT INTO $container[:$prefix] SELECT * FROM $topic PARTITIONBY fieldA, _key.fieldB, _headers.fieldC STOREAS `AVRO` PROPERTIES ( 'partition.include.keys'=true, )
Storing data in Azure Datalake and partitioning it by time is a common practice in data management. For instance, you may want to organize your Datalake data in hourly intervals. This partitioning can be seamlessly achieved using the PARTITIONBY clause in combination with specifying the relevant time field. However, it’s worth noting that the time field typically doesn’t adjust automatically.
To address this, we offer a Kafka Connect Single Message Transformer (SMT) designed to streamline this process. You can find the transformer plugin and documentation here.
Let’s consider an example where you need the object key to include the wallclock time (the time when the message was processed) and create an hourly window based on a field called timestamp. Here’s the connector configuration to achieve this:
timestamp
connector.class=io.lenses.streamreactor.connect.datalake.sink.DatalakeSinkConnector connect.datalake.kcql=insert into lensesio:demo select * from demo PARTITIONBY _value.metadata_id, _value.customer_id, _header.ts, _header.wallclock STOREAS `JSON` WITH_FLUSH_SIZE=1000000 WITH_FLUSH_INTERVAL=30 WITH_FLUSH_COUNT=5000 topics=demo name=demo value.converter=org.apache.kafka.connect.json.JsonConverter key.converter=org.apache.kafka.connect.storage.StringConverter transforms=insertFormattedTs,insertWallclock transforms.insertFormattedTs.type=io.lenses.connect.smt.header.TimestampConverter transforms.insertFormattedTs.header.name=ts transforms.insertFormattedTs.field=timestamp transforms.insertFormattedTs.target.type=string transforms.insertFormattedTs.format.to.pattern=yyyy-MM-dd-HH transforms.insertWallclock.type=io.lenses.connect.smt.header.InsertWallclock transforms.insertWallclock.header.name=wallclock transforms.insertWallclock.value.type=format transforms.insertWallclock.format=yyyy-MM-dd-HH
In this example, the incoming Kafka message’s Value content includes a field called timestamp, represented as a long value indicating the epoch time in milliseconds. The TimestampConverter SMT will expertly convert this into a string value according to the format specified in the format.to.pattern property. Additionally, the insertWallclock SMT will incorporate the current wallclock time in the format you specify in the format property.
The PARTITIONBY clause then leverages both the timestamp field and the wallclock header to craft the object key, providing you with precise control over data partitioning.
While the STOREAS clause is optional, it plays a pivotal role in determining the storage format within Azure Datalake. It’s crucial to understand that this format is entirely independent of the data format stored in Kafka. The connector maintains its neutrality towards the storage format at the topic level and relies on the key.converter and value.converter settings to interpret the data.
STOREAS
key.converter
value.converter
Supported storage formats encompass:
Opting for BYTES ensures that each record is stored in its own separate file. This feature proves particularly valuable for scenarios involving the storage of images or other binary data in Datalake. For cases where you prefer to consolidate multiple records into a single binary file, AVRO or Parquet are the recommended choices.
By default, the connector exclusively stores the Kafka message value. However, you can expand storage to encompass the entire message, including the key, headers, and metadata, by configuring the store.envelope property as true. This property operates as a boolean switch, with the default value being false. When the envelope is enabled, the data structure follows this format:
store.envelope
{ "key": <the message Key, which can be a primitive or a complex object>, "value": <the message Key, which can be a primitive or a complex object>, "headers": { "header1": "value1", "header2": "value2" }, "metadata": { "offset": 0, "partition": 0, "timestamp": 0, "topic": "topic" } }
Utilizing the envelope is particularly advantageous in scenarios such as backup and restore or replication, where comprehensive storage of the entire message in Datalake is desired.
Storing the message Value Avro data as Parquet in Datalake:
... connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` value.converter=io.confluent.connect.avro.AvroConverter value.converter.schema.registry.url=http://localhost:8081 key.converter=org.apache.kafka.connect.storage.StringConverter ...
The converter also facilitates seamless JSON to AVRO/Parquet conversion, eliminating the need for an additional processing step before the data is stored in Datalake.
... connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `PARQUET` value.converter=org.apache.kafka.connect.json.JsonConverter key.converter=org.apache.kafka.connect.storage.StringConverter ...
Enabling the full message stored as JSON in Datalake:
... connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `JSON` PROPERTIES('store.envelope'=true) value.converter=org.apache.kafka.connect.json.JsonConverter key.converter=org.apache.kafka.connect.storage.StringConverter ...
Enabling the full message stored as AVRO in Datalake:
... connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true) value.converter=io.confluent.connect.avro.AvroConverter value.converter.schema.registry.url=http://localhost:8081 key.converter=org.apache.kafka.connect.storage.StringConverter ...
If the restore (see the Datalake Source documentation) happens on the same cluster, then the most performant way is to use the ByteConverter for both Key and Value and store as AVRO or Parquet:
... connect.datalake.kcql=INSERT INTO lensesioazure:car_speed SELECT * FROM car_speed_events STOREAS `AVRO` PROPERTIES('store.envelope'=true) value.converter=org.apache.kafka.connect.converters.ByteArrayConverter key.converter=org.apache.kafka.connect.converters.ByteArrayConverter ...
The connector offers three distinct flush options for data management:
It’s worth noting that the interval flush is a continuous process that acts as a fail-safe mechanism, ensuring that files are periodically flushed, even if the other flush options are not configured or haven’t reached their thresholds.
Consider a scenario where the flush size is set to 10MB, and only 9.8MB of data has been written to the file, with no new Kafka messages arriving for an extended period of 6 hours. To prevent undue delays, the interval flush guarantees that the file is flushed after the specified time interval has elapsed. This ensures the timely management of data even in situations where other flush conditions are not met.
The flush options are configured using the WITH_FLUSH_COUNT, WITH_FLUSH_SIZE, and WITH_FLUSH_INTERVAL clauses. The settings are optional and if not specified the defaults are:
The PROPERTIES clause is optional and adds a layer of configurability to the connector. It enhances versatility by permitting the application of multiple configurations (delimited by ‘,’). The following properties are supported:
The sink connector optimizes performance by padding the output files, a practice that proves beneficial when using the Datalake Source connector to restore data. This file padding ensures that files are ordered lexicographically, allowing the Datalake Source connector to skip the need for reading, sorting, and processing all files, thereby enhancing efficiency.
AVRO and Parquet offer the capability to compress files as they are written. The Datalake Sink connector provides advanced users with the flexibility to configure compression options. Here are the available options for the connect.datalake.compression.codec, along with indications of their support by Avro and Parquet writers:
connect.datalake.compression.codec
Please note that not all compression libraries are bundled with the Datalake connector. Therefore, you may need to manually add certain libraries to the classpath to ensure they function correctly.
The connector offers two distinct authentication modes:
When selecting the “Credentials” mode, it is essential to provide the necessary access key and secret key properties. Alternatively, if you prefer not to configure these properties explicitly, the connector will follow the credentials retrieval order as described here.
Here’s an example configuration for the “Credentials” mode:
... connect.datalake.azure.auth.mode=Credentials connect.datalake.azure.account.name=$AZURE_ACCOUNT_NAME connect.datalake.azure.account.key=$AZURE_ACCOUNT_KEY ...
And here is an example configuration using the “Connection String” mode:
... connect.datalake.azure.auth.mode=ConnectionString connect.datalake.azure.connection.string=$AZURE_CONNECTION_STRING ...
For enhanced security and flexibility when using either the “Credentials” or “Connection String” modes, it is highly advisable to utilize Connect Secret Providers. You can find detailed information on how to use the Connect Secret Providers here. This approach ensures robust security practices while handling access credentials.
Please refer to the information provided in the Error polices section for detailed guidance.
Depending on the storage format of Kafka topics’ messages, the need for replication to a different cluster, and the specific data analysis requirements, there exists a guideline on how to effectively utilize converters for both sink and source operations. This guidance aims to optimize performance and minimize unnecessary CPU and memory usage.
Adapt the key.converter and value.converter properties accordingly to the table above.
On this page