REGEXΒΆ
Processing application logs as they arrive is quite a common scenario. Looking at the log entries as a stream of data, which can be aggregated or even joined in real-time, can add a lot of value when it comes to monitoring what your applications are doing.
The Lenses SQL engine supports the projection of an unstructured text line to a well defined structure based on a regular expression used to extract the information. This way turning your log data into searchable data can be achieved in a matter of minutes.
Each log line is published to Kafka as a message of type STRING. To be able to project from the primitive type to a structured type, in this case JSON, the following decoder has been introduced STRING_TO_JSON. When used, it wraps the text in a simple structure and the message value is referenced via the content field.
SELECT content
FROM `topic`
WHERE _vtype=STRING_TO_JSON
Imagine there is a requirement to process the logs from a set of running web application and extract the information like incoming IP address, the API call, response code, etc. Here is an example of such a log entry:
127.0.0.1.800.00, 127.0.0.1.800.00 - - [08/Feb/2017:16:33:27 +0100] "GET /api/house/get_for_compare?idn=33&code=99992&type= HTTP/1.1<em>" 404 19636 "mywebsitelocation.com" "Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505 Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36"</em>
One possible regular expression to apply to the given input can be seen below:
(.+?)\s-\s-\s\[(.+?)\]\s\"(.+?)\s(.+?)\s(.+?)\"\s(.+?)\s(.+?)\s\"(.+?)\"\s\"(.+?)\".*
Based on the above the output produced should be the following:
{
"ip":"127.0.0.1.800.00, 127.0.0.1.800.00",
"timestamp":"08/Feb/2017:16:33:27 +0100",
"type":"GET",
"request":"/api/house/get_for_compare?idn=33&code=99992&type=",
"status":"404",
"response_size":"19636",
"website":"mywebsitelocation.com"
}
To achieve the above, the next SQL code should be used:
INSERT INTO $toTopic
WITH log_regex as
(
SELECT STREAM regexp(content, '(.+?)\\s-\\s-\\s\\[(.+?)\\]\\s\\\"(.+?)\\s(.+?)\\s(.+?)\\\"\\s(.+?)\\s(.+?)\\s\\\"(.+?)\\\"\\s\\\"(.+?)\\\".*') as regx
FROM $fromTopic
WHERE _vtype='STRING_AS_JSON'
)
SELECT STREAM
regx[1] as ip,
regx[2] as `timestamp`,
regx[3] as `type`,
regx[4] as request,
regx[6] as status,
regx[7] as response_size,
regx[8] as website
FROM log_regex
WHERE regx is not null
Above, is just a stream transformation. log_regex stream can be grouped or joined further if the requirements dictate to.