Learn in 10 minutes¶
Welcome to this Lenses SQL engine tutorial! This document takes the user through typical use cases for querying existing data.
Each little exercise first builds the topics and insert the data before proceeding to the actual SELECT
statement.
It is expected that the reader has an understanding of the concepts behind the SQL engine and that Lenses is running.
The most convenient option to run Lenses is using Lenses-Box, which is a self-contained Docker image packing Lenses and all its dependencies.
Assuming that the Docker image is up and running, the reader should navigate to http://localhost:3030, log in with the default username and password (admin/admin) and then navigate to the SQL
screen from the top menu.
For the SQL engine, the term table is equivalent to Kafka topic. The two are used interchangeably.
First query¶
Let us create an Apache Kafka topic which stores String values for both Key and Value and then inserts one entry. The CREATE TABLE is used in the SQL engine to create a Kafka topic.
-- Run
CREATE TABLE greetings(_key string, _value string) FORMAT (string, string);
-- And then
INSERT INTO greetings(_key, _value) VALUES("Hello", "World");
Go ahead and run your first SELECT
query now:
SELECT * FROM greetings;
It is always good practice to clean after finishing, so please go ahead and drop the greetings
table:
DROP TABLE greetings;
Record metadata¶
Each Kafka message contains information related to partition, offset, timestamp, and topic. On top of that, the engine adds the key and value raw byte size. You can easily find out how big or small your Kafka message is. Let us create a topic and insert a few entries.
-- Run
CREATE TABLE tutorial(_key string, name string, difficulty int) FORMAT (Avro, Avro);
-- And then
INSERT INTO tutorial(_key, name, difficulty)
VALUES
("1", "Learn Lenses SQL", 3),
("2", "Learn Quantum Physics", 10),
("3", "Learn French", 5);
With the data available, a query can ask for specific metadata related to the records.
To query for metadata such as the underlying Kafka topic offset, partition and timestamp prefix your desired fields with _meta
.
Run the following query to see each tutorial name along with its metadata information:
SELECT name, _meta.offset, _meta.timestamp, _meta.partition, _meta.__keysize, _meta.__valsize
FROM tutorial
/* The output is (timestamp will be different)
Learn Lenses SQL 0 1540575169198 0 7 23
Learn Quantum Physics 1 1540575169226 0 7 28
Learn French 2 1540575169247 0 7 19
*/
Performing simple arithmetic¶
Through the SQL expressions, you can perform basic arithmetic. Let us create a grocery table and insert data into it:
-- Run
CREATE TABLE groceries (_key int, id int, name string, quantity int, price double) FORMAT(INT, Avro);
-- And then
INSERT INTO groceries(_key, id, name, quantity , price)
VALUES
(1, 1, "Fairtrade Bananas", 1, 1.90),
(2, 2, "Meridian Crunchy Peanut Butter", 1, 2.50),
(3, 3, "Green & Black's organic 85% dark chocolate bar", 2, 1.99),
(4, 4, "Activia fat free cherry yogurts", 3, 2.00),
(5, 5, "Green & Blacks Organic Chocolate Ice Cream", 2, 4.20);
With the data available, let us calculate the price of each purchase:
SELECT name, quantity * price as total FROM groceries
/*The output:
Fairtrade Bananas 1.9
Meridian Crunchy Peanut Butter 2.5
Green & Black's organic 85% dark chocolate bar 3.98
Activia fat free cherry yogurts 6
Green & Blacks Organic Chocolate Ice Cream 8.4
*/
The common arithmetic operations +, -, *, / and % are supported. Try using a different arithmetic operation on your own.
Using functions¶
Lenses provides quite an array of functions out of the box – it even has support for user-defined functions. Built-in functions give flexibility when processing your data. The ROUND
function allows you to round numeric functions:
SELECT name , round(quantity * price) as rounded_total
FROM groceries
/*The output:
Fairtrade Bananas 2
Meridian Crunchy Peanut Butter 3
Green & Black's organic 85% dark chocolate bar 4
Activia fat free cherry yogurts 6
Green & Blacks Organic Chocolate Ice Cream 8
*/
Manipulating text columns¶
String functions allow you to perform different operations on text. Take CONCAT
for example it allows combining multiple values.
Let us first create a customer table:
-- Run
CREATE TABLE customers (_key string, first_name string, last_name string) FORMAT(string, Avro);
-- And then
INSERT INTO customers(_key, first_name, last_name)
VALUES
("mikejones", "Mike", "Jones"),
("anasmith", "Ana", "Smith"),
("shannonelliott", "Shannon","Elliott");
In the following example, the query combines first and last names into full names.
SELECT CONCAT(first_name, " ", last_name) as name
FROM customers;
/* Output
Mike Jones
Ana Smith
Shannon Elliott
*/
Filtering with WHERE¶
WHERE clause allows the user to define a set of logical predicates the data needs to match in order to be returned. All of the standard comparison operators are supported (>, >=, <, <=, =, and !=) as well as calling functions.
We are going to use the groceries table created earlier – please recreate it if you have dropped it. Let us select all items purchased where the prices are greater or equal to 2.00:
SELECT name
, price
FROM groceries
WHERE price >= 2.0
/* Output
Meridian Crunchy Peanut Butter 2.5
Activia fat free cherry yogurts 2
Green & Blacks Organic Chocolate Ice Cream 4.2
*/
Now let us select all customers whose last name length equals to 5:
SELECT *
FROM customers
WHERE LEN(last_name)=5
/* Output
key value.first_name value.last_name
mikejones Mike Jones
anasmith Ana Smith
*/
Try changing the comparison to something else, maybe use > or < instead of =.
Before moving to the next section, let us search all customers containing Ana in their first name:
SELECT *
FROM customers
WHERE first_name like '%Ana%'
Keep in mind that text search is case sensitive. To use case insensitive text search, you could write:
SELECT *
FROM customers
WHERE lowercase(first_name) like '%ana%';
-- And here is the negated version
SELECT *
FROM customers
WHERE lowercase(first_name) not like '%ana%';
Missing values¶
Sometimes the data can contain explicit NULL values, or it can omit fields entirely. Using IS [ NOT ] NULL, or EXISTS function allows you to check for these situations. Exists is a particular keyword in Lenses SQL grammar, as a result, when using it as a function, it needs to be escaped. You will see in the code example the verb is wrapped by the escaping character: `.
Since Lenses SQL is agnostic of the actual payload, let us rely on the JSON storage format to create the table. JSON is not enforcing a schema and therefore is not as restrictive when inserting the records.
Let us create the following table named customers_json
:
-- Run
CREATE TABLE customers_json (_key string, first_name string, last_name string, middle_name string) FORMAT(string, json);
-- And then
INSERT INTO customers_json(_key, first_name, last_name, middle_name) VALUES("mikejones", "Mike", "Jones", "Albert");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("anasmith", "Ana", "Smith");
INSERT INTO customers_json(_key, first_name, last_name) VALUES("shannonelliott", "Shannon","Elliott");
First, let us query this table for all its entries:
SELECT * FROM customers_json
/* The output
key value.first_name value.middle_name value.last_name
mikejones Mike Albert Jones
anasmith Ana Smith
shannonelliott Shannon Elliott
*/
As you can see the middle_name is only present on the mikejones record. Let us write a query which filters out records where middle_name is not present:
SELECT *
FROM customers_json
WHERE `exists`(middle_name)
/* The output
key value.first_name value.middle_name value.last_name
mikejones Mike Albert Jones
*/
A similar outcome can be achieved with the following query:
SELECT *
FROM customers_json
WHERE middle_name is NULL
When a field is actually NULL
or is missing, checking like in the above query has the same outcome.
Multiple WHERE conditions¶
Scenarios, where one condition is required, are not covering the entire domain space. To cover them, you can use AND/OR to specify complex conditions for filtering your data. Let us use the groceries table again. Please go ahead and recreate it if it is not there. Here is how you could filter the purchased items where more than one item has been bought for a given product, and the unit price is greater than 2
SELECT *
FROM groceries
WHERE quantity > 1 AND price > 2
Now try changing the AND logical operand to OR and see the differences in output.
Limit the output¶
Sometimes, you would want to return just a few records - you are just sampling the data. Also, returning tens of thousands of records, or millions will probably be something that you do not need. It is unlikely you going through each record one by one.
In this scenario, there are a few things you can do to limit the data returned.
The first one is delivered via the LIMIT N
syntax which specifies the maximum number of records that will be returned when querying the data:
SELECT *
FROM groceries
LIMIT 1
An advanced user can also set the maximum data size to be returned:
-- set the data size returned to be 1 megabyte.
SET max.size="1m";
-- on the small dataset we have here 1 MB will accommodate all records added and more
SELECT *
FROM groceries
One other scenario you might encounter is sampling data but discarding a few records before the actual sample ones. Run this query and see the difference between this and the first one in this section.
SELECT *
FROM groceries
LIMIT 1,2
What the syntax above does is instruct the engine to skip the first record matched and then sample the next two. Run the query to see the results.
Read a table partition only¶
By now you know that a Kafka topic maps to a table. Since a Kafka topic is partitioned, the table itself is partitioned. A SELECT
statement targets all the table partitions at once. However, you have the option to look at specific partitions.
Let us create a partitioned table first. We are going to recreate the customers
table used earlier. And we will set the partitions count to 3. A record is sent to a specific partition based on the Key.
The Key is hashed and then the remainder from doing HashValue % partitions will be the table partition the record is sent to.
-- Run
CREATE TABLE customers_partitioned (_key string, first_name string, last_name string) FORMAT(string, Avro) properties(partitions = 3);
-- And then
INSERT INTO customers_partitioned(_key, first_name, last_name)
VALUES
("mikejones", "Mike", "Jones"),
("anasmith", "Ana", "Smith"),
("shannonelliott", "Shannon","Elliott"),
("tomwood", "Tom","Wood"),
("adelewatson", "Adele","Watson"),
("mariasanchez", "Maria", "Sanchez");
Next, run the following query:
SELECT *
FROM customers_partitioned
/* The output
offset partition timestamp key value.first_name value.last_name
0 0 1540830780401 mikejones Mike Jones
1 0 1540830780441 anasmith Ana Smith
2 0 1540830780464 shannonelliott Shannon Elliott
0 2 1540831270170 mariasanchez Maria Sanchez
0 1 1540830984698 tomwood Tom Wood
1 1 1540831183308 adelewatson Adele Watson
*/
As you can see from the results (your timestamps will be different) the records span over the three partitions. Now let us query specific partitions:
-- selects only records from partition = 0
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0
-- selects only records from partition 0 and 2
SELECT *
FROM customers_partitioned
WHERE _meta.partition = 0
OR _meta.partition = 2
Count the records¶
Looking and manipulating single entries can only take you so far. There are scenarios where you want information about groups of data instead of individual rows.
In this exercise, we are going to use the COUNT
aggregate to count the records on a table. Run the following SQL to see how many records we have on the customers_partitioned:
SELECT count(*) as total
FROM customers_partitioned
Use SUM to aggregate your amounts¶
Earlier on this page, the groceries table was used. You can find out the amount spent with a very simple query using the SUM
function. Please execute the following query:
SELECT SUM(quantity * price) as amount
FROM groceries
Group data with GROUP BY¶
There are scenarios where you want to group the data of a table based on some of the fields of the table. Imagine a customer table where you store the customer country information. Let us create a new table customers with a bit more information on each row. Let us name it customer:
CREATE TABLE customer (firstName string, lastName string, city string, country string, phone string) FORMAT(string, avro) properties(compact=true);
INSERT INTO customer (_key,firstName, lastName, city, country, phone)
VALUES
('1','Craig', 'Smith', 'New York', 'USA', '1-01-993 2800'),
('2','William', 'Maugham','Toronto','Canada','+1-613-555-0110'),
('3','William', 'Anthony','Leeds','UK', '+44 1632 960427'),
('4','S.B.','Swaminathan','Bangalore','India','+91 7513200000'),
('5','Thomas','Morgan','Arnsberg','Germany','+49-89-636-48018'),
('6','Thomas', 'Merton','New York','USA', '+1-617-555-0147'),
('7','Piers','Gibson','London','UK', '+44 1632 960269'),
('8','Nikolai','Dewey','Atlanta','USA','+1-404-555-0178'),
('9','Marquis', 'de Ellis','Rio De Janeiro','Brazil','+55 21 5555 5555'),
('10','Joseph', 'Milton','Houston','USA','+1-202-555-0153'),
('11','John','Betjeman Hunter','Sydney','Australia','+61 1900 654 321'),
('12','Evan', 'Hayek','Vancouver','Canada','+1-613-555-0130'),
('13','E.','Howard','Adelaide','Australia','+61 491 570 157'),
('14','C. J.', 'Wilde','London','UK','+44 1632 960111'),
('15','Butler', 'Andre','Florida','USA','+1-202-555-0107');
Now let us see how many customers there are from each country. Here is the code which computes that:
SELECT COUNT(*) as count
, country
FROM customer
GROUP BY country
/* Output
2 Canada
3 UK
2 Australia
1 India
1 Brazil
5 USA
1 Germany
*/
Nested queries¶
The tabled query allows you to nest queries. Let us take the query in the previous section and say we are only interested in those entries where there exist more than 1 customer per country.
SELECT *
FROM (
SELECT COUNT(*) as count
, country
FROM customer
GROUP BY country)
WHERE count > 1
Run the query, and you will only see those entries for which there is more than one person registered per country.
Filtering grouped data with HAVING¶
As you saw from the previous query, you might want to filter the entries returned from a grouping query. Just like the WHERE
statement, you can use HAVING
syntax to achieve the same result when it comes to grouped queries.
You will most likely agree that this is a much easier syntax than nesting two queries. Let us rewrite the query from the previous section and make it as follows:
SELECT COUNT(*) as count
, country
FROM customer
GROUP BY country
HAVING count > 1
Just run the code, and you will get the same output you got with the nested query.
Joins¶
Lenses Tabled-query allows you to combine records from two tables. A query can contain zero, one or multiple JOIN operations. Let us create an orders table and insert some data into it:
CREATE TABLE orders(_key int, orderDate string, customerId string, amount double) format(int, avro);
INSERT INTO orders (_key, orderDate, customerId, amount)
VALUES
(1, '2018-10-01', '1', 200.50),
(2, '2018-10-11', '1', 813.00),
(3, '2018-10-11', '3', 625.20),
(4, '2018-10-11', '14', 730.00),
(5, '2018-10-11', '10', 440.00),
(6, '2018-10-11', '9', 444.80);
With these tables in place, let us join them to get more information about an order by combining it with the customer information found in the customer table:
SELECT o._key as orderNumber
, o.amount as totalAmount
, c.firstName
, c.lastName
, c.city
, c.country
FROM orders o INNER JOIN customer c
ON o.customerId = c._key
/*
city orderNumber country totalAmount lastName firstName
New York 1 USA 200.5 Smith Craig
New York 2 USA 813 Smith Craig
Leeds 3 UK 625.2 Anthony William
Rio De Janeiro 6 Brazil 444.8 de Ellis Marquis
Houston 5 USA 440 Milton Joseph
London 4 UK 730 Wilde C. J.
*/