Release notes


6.0.0 

New Connectors introduced in 6.0.0: 

Deprecated Connectors removed in 6.0.0: 

  • Kudu
  • Hazelcast
  • Hbase
  • Hive
  • Pulsar

All Connectors: 

  • Standardising package names. Connector class names and converters will need to be renamed in configuration. Find out more.
  • Some clean up of unused dependencies.
  • Introducing cloud-common module to share code between cloud connectors.
  • Cloud sinks (AWS S3, Azure Data Lake and GCP Storage) now support BigDecimal and handle nullable keys.

5.0.1 

New Connectors introduced in 5.0.1: 

  • Consumer Group Offsets S3 Sink Connector

AWS S3 Connector - S3 Source & Sink 

  • Enhancement: BigDecimal Support

Redis Sink Connector 

  • Bug fix: Redis does not initialise the ErrorHandler

5.0.0 

All Connectors 

  • Test Fixes and E2E Test Clean-up: Improved testing with bug fixes and end-to-end test clean-up.
  • Code Optimization: Removed unused code and converted Java code and tests to Scala for enhanced stability and maintainability.
  • Ascii Art Loading Fix: Resolved issues related to ASCII art loading.
  • Build System Updates: Implemented build system updates and improvements.
  • Stream Reactor Integration: Integrated Kafka-connect-query-language inside of Stream Reactor for enhanced compatibility.
  • STOREAS Consistency: Ensured consistent handling of backticks with STOREAS.

AWS S3 Connector - S3 Source & Sink 

The source and sink has been the focus of this release.

  • Full message backup. The S3 sink and source now supports full message backup. This is enabled by adding in the KCQL PROPERTIES('store.envelope'=true)
  • Removed Bytes_*** storage format. For those users leveraging them there is a migration information below. Storing raw Kafka message the storage format should be AVRO/PARQUET/JSON(less ideal).
  • Introduced support for BYTES storing single message as raw binary. Typically, storing images or videos are the use case for this. This is enabled by adding in the KCQL STOREAS BYTES
  • Introduced support for PROPERTIES to drive new settings required to drive the connectors’ behaviour. The KCQL looks like this: INSERT INTO ... SELECT ... FROM ... PROPERTIES(property=key, ...)

Sink 

  • Enhanced PARTITIONBY Support: expanded support for PARTITIONBY fields, now accommodating fields containing dots. For instance, you can use PARTITIONBY a, `field1.field2` for enhanced partitioning control.
  • Advanced Padding Strategy: a more advanced padding strategy configuration. By default, padding is now enforced, significantly improving compatibility with S3 Source.
  • Improved Error Messaging: Enhancements have been made to error messaging, providing clearer guidance, especially in scenarios with misaligned topic configurations (#978).
  • Commit Logging Refactoring: Refactored and simplified the CommitPolicy for more efficient commit logging (#964).
  • Comprehensive Testing: Added additional unit testing around configuration settings, removed redundancy from property names, and enhanced KCQL properties parsing to support Map structures.
  • Consolidated Naming Strategies: Merged naming strategies to reduce code complexity and ensure consistency. This effort ensures that both hierarchical and custom partition modes share similar code paths, addressing issues related to padding and the inclusion of keys and values within the partition name.
  • Optimized S3 API Calls: Switched from using deleteObjects to deleteObject for S3 API client calls (#957), enhancing performance and efficiency.
  • JClouds Removal: The update removes the use of JClouds, streamlining the codebase.
  • Legacy Offset Seek Removal: The release eliminates legacy offset seek operations, simplifying the code and enhancing overall efficiency

Source 

  • Expanded Text Reader Support: new text readers to enhance data processing flexibility, including:
    • Regex-Driven Text Reader: Allows parsing based on regular expressions.
    • Multi-Line Text Reader: Handles multi-line data.
    • Start-End Tag Text Reader: Processes data enclosed by start and end tags, suitable for XML content.
  • Improved Parallelization: enhancements enable parallelization based on the number of connector tasks and available data partitions, optimizing data handling.
  • Data Consistency: Resolved data loss and duplication issues when the connector is restarted, ensuring reliable data transfer.
  • Dynamic Partition Discovery: No more need to restart the connector when new partitions are added; runtime partition discovery streamlines operations.
  • Efficient Storage Handling: The connector now ignores the .indexes directory, allowing data storage in an S3 bucket without a prefix.
  • Increased Default Records per Poll: the default limit on the number of records returned per poll was changed from 1024 to 10000, improving data retrieval efficiency and throughput.
  • Ordered File Processing: Added the ability to process files in date order. This feature is especially useful when S3 files lack lexicographical sorting, and S3 API optimisation cannot be leveraged. Please note that it involves reading and sorting files in memory.
  • Parquet INT96 Compatibility: The connector now allows Parquet INT96 to be read as a fixed array, preventing runtime failures.

Kudu and Hive 

  • The Kudu and Hive connectors are now deprecated and will be removed in a future release.

InfluxDB 

  • Fixed a memory issue with the InfluxDB writer.
  • Upgraded to Influxdb2 client (note: doesn’t yet support Influxdb2 connections).

S3 upgrade notes 

Upgrading from 5.0.0 (preview) to 5.0.0 

For installations that have been using the preview version of the S3 connector and are upgrading to the release, there are a few important considerations:

Previously, default padding was enabled for both “offset” and “partition” values starting in June.

However, in version 5.0, the decision to apply default padding to the “offset” value only, leaving the " partition" value without padding. This change was made to enhance compatibility with querying in Athena.

If you have been using a build from the master branch since June, your connectors might have been configured with a different default padding setting.

To maintain consistency and ensure your existing connector configuration remains valid, you will need to use KCQL configuration properties to customize the padding fields accordingly.

INSERT INTO $bucket[:$prefix]
SELECT *
FROM $topic
...
PROPERTIES(
  'padding.length.offset'=12,
  'padding.length.partition'=12
)

Upgrading from 4.x to 5.0.0 

Starting with version 5.0.0, the following configuration keys have been replaced.

FieldOld PropertyNew Property
AWS Secret Keyaws.secret.keyconnect.s3.aws.secret.key
Access Keyaws.access.keyconnect.s3.aws.access.key
Auth Modeaws.auth.modeconnect.s3.aws.auth.mode
Custom Endpointaws.custom.endpointconnect.s3.custom.endpoint
VHost Bucketaws.vhost.bucketconnect.s3.vhost.bucket

Upgrading from 4.1.* and 4.2.0 

In version 4.1, padding options were available but were not enabled by default. At that time, the default padding length, if not specified, was set to 8 characters.

However, starting from version 5.0, padding is now enabled by default, and the default padding length has been increased to 12 characters.

Enabling padding has a notable advantage: it ensures that the files written are fully compatible with the Lenses Stream Reactor S3 Source, enhancing interoperability and data integration.

Sinks created with 4.2.0 and 4.2.1 should retain the padding behaviour, and, therefore should disable padding:

INSERT INTO $bucket[:$prefix]
SELECT *
FROM $topic
...
PROPERTIES (
  'padding.type'=NoOp
)

If padding was enabled in 4.1, then the padding length should be specified in the KCQL statement:

INSERT INTO $bucket[:$prefix]
SELECT *
FROM $topic
...
PROPERTIES (
  'padding.length.offset'=12,
  'padding.length.partition'=12
)

Upgrading from 4.x to 5.0.0 only when STOREAS Bytes_*** is used 

The Bytes_*** storage format has been removed. If you are using this storage format, you will need to install the 5.0.0-deprecated connector and upgrade the connector instances by changing the class name:

Source Before:

class.name=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector 
...

Source After:

class.name=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnectorDeprecated
... 

Sink Before:

class.name=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector
...

Sink After:

class.name=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnectorDeprecated
connect.s3.padding.strategy=NoOp
...

The deprecated connector won’t be developed any further and will be removed in a future release. If you want to talk to us about a migration plan, please get in touch with us at sales@lenses.io.

Upgrade a connector configuration 

To migrate to the new configuration, please follow the following steps:

  • stop all running instances of the S3 connector
  • upgrade the connector to 5.0.0
  • update the configuration to use the new properties
  • resume the stopped connectors

4.2.0 

  • All

    • Ensure connector version is retained by connectors
    • Lenses branding ASCII art updates
  • AWS S3 Sink Connector

    • Improves the error in case the input on BytesFormatWriter is not binary
    • Support for ByteBuffer which may be presented by Connect as bytes

4.1.0 

  • All

    • Scala upgrade to 2.13.10
    • Dependency upgrades
    • Upgrade to Kafka 3.3.0
    • SimpleJsonConverter - Fixes mismatching schema error.
  • AWS S3 Sink Connector

    • Add connection pool config
    • Add Short type support
    • Support null values
    • Enabling Compression Codecs for Avro and Parquet
    • Switch to AWS client by default
    • Add option to add a padding when writing files, so that files can be restored in order by the source
    • Enable wildcard syntax to support multiple topics without additional configuration.
  • AWS S3 Source Connector

    • Add connection pool config
    • Retain partitions from filename or regex
    • Switch to AWS client by default
  • MQTT Source Connector

    • Allow toggling the skipping of MQTT Duplicates
  • MQTT Sink Connector

    • Functionality to ensure unique MQTT Client ID is used for MQTT sink
  • Elastic6 & Elastic7 Sink Connectors

    • Fixing issue with missing null values

4.0.0 

  • All

    • Scala 2.13 Upgrade
    • Gradle to SBT Migration
    • Producing multiple artifacts supporting both Kafka 2.8 and Kafka 3.1.
    • Upgrade to newer dependencies to reduce CVE count
    • Switch e2e tests from Java to Scala.
  • AWS S3 Sink Connector

    • Optimal seek algorithm - see documentation for more detail.
    • Parquet data size flushing fixes.
    • Adding date partitioning capability - see documentation for more detail.
    • Adding switch to use official AWS library - see documentation for more detail.
    • Add AWS STS dependency to ensure correct operation when assuming roles with a web identity token.
    • Provide better debugging in case of exceptions.
  • FTP Source Connector

    • Fixes to slice mode support.
  • Hazelcast Sink Connector

    • Upgrade to HazelCast 4.2.4. The configuration model has changed and now uses clusters instead of username and password configuration.
  • Hive Sink Connector

    • Update of parquet functionality to ensure operation with Parquet 1.12.2.
    • Support for Hive 3.1.3.
  • JMS Connector

    • Enable protobuf support.
  • Pulsar

    • Upgrade to Pulsar 2.10 and associated refactor to support new client API.

3.0.1 

  • All

    • Replace Log4j with Logback to overcome CVE-2021-44228
    • Bringing code from legacy dependencies inside of project
  • Cassandra Sink Connector

    • Ensuring the table name is logged on encountering an InvalidQueryException
  • HBase Sink Connector

    • Alleviate possible race condition

3.0.0 

  • All

    • Move to KCQL 2.8.9
    • Change sys.errors to ConnectExceptions
    • Additional testing with TestContainers
    • Licence scan report and status
  • AWS S3 Sink Connector

    • S3 Source Offset Fix
    • Fix JSON & Text newline detection when running in certain Docker images
    • Byte handling fixes
    • Partitioning of nested data
    • Error handling and retry logic
    • Handle preCommit with null currentOffsets
    • Remove bucket validation on startup
    • Enabled simpler management of default flush values.
    • Local write mode - build locally, then ship
    • Deprecating old properties, however rewriting them to the new properties to ensure backwards compatibility.
    • Adding the capability to specify properties in yaml configuration
    • Rework exception handling. Refactoring errors to use Either[X,Y] return types where possible instead of throwing exceptions.
    • Ensuring task can be stopped gracefully if it has not been started yet
    • ContextReader testing and refactor
    • Adding a simple state model to the S3Writer to ensure that states and transitions are kept consistent. This can be improved in time.
  • AWS S3 Source Connector

    • Change order of match to avoid scala.MatchError
    • S3 Source rewritten to be more efficient and use the natural ordering of S3 keys
    • Region is necessary when using the AWS client
  • Cassandra Sink & Source Connectors

    • Add connection and read client timeout
  • FTP Connector

    • Support for Secure File Transfer Protocol
  • Hive Sink Connector

    • Array Support
    • Kerberos debug flag added
  • Influx DB Sink

    • Bump influxdb-java from version 2.9 to 2.29
    • Added array handling support
  • MongoDB Sink Connector

    • Nested Fields Support
  • Redis Sink Connector

    • Fix Redis Pubsub Writer
    • Add support for json and json with schema

2.1.3 

  • Move to connect-common 2.0.5 that adds complex type support to KCQL

2.1.2 

  • AWS S3 Sink Connector
    • Prevent null pointer exception in converters when maps are presented will null values
    • Offset reader optimisation to reduce S3 load
    • Ensuring that commit only occurs after the preconfigured time interval when using WITH_FLUSH_INTERVAL
  • AWS S3 Source Connector (New Connector)
  • Cassandra Source Connector
    • Add Bucket Timeseries Mode
    • Reduction of logging noise
    • Proper handling of uninitialized connections on task stop()
  • Elasticsearch Sink Connector
    • Update default port
  • Hive Sink
    • Improve Orc format handling
    • Fixing issues with partitioning by non-string keys
  • Hive Source
    • Ensuring newly written files can be read by the hive connector by introduction of a refresh frequency configuration option.
  • Redis Sink
    • Correct Redis writer initialisation

2.1.0 

  • AWS S3 Sink Connector
  • Elasticsearch 7 Support

2.0.1 

  • Hive Source

    • Rename option connect.hive.hive.metastore to connect.hive.metastore
    • Rename option connect.hive.hive.metastore.uris to connect.hive.metastore.uris
  • Fix Elastic start up NPE

  • Fix to correct batch size extraction from KCQL on Pulsar


2.0.0 

  • Move to Scala 2.12
  • Move to Kafka 2.4.1 and Confluent 5.4
  • Deprecated:
    • Druid Sink (not scala 2.12 compatible)
    • Elastic Sink (not scala 2.12 compatible)
    • Elastic5 Sink(not scala 2.12 compatible)
  • Redis
    • Add support for Redis Streams
  • Cassandra
    • Add support for setting the LoadBalancer policy on the Cassandra Sink
  • ReThinkDB
    • Use SSL connection on Rethink initialize tables is ssl set
  • FTP Source
    • Respect connect.ftp.max.poll.records when reading slices
  • MQTT Source
    • Allow lookup of avro schema files with wildcard subscriptions