Working With Dates And Times In Logstash

Mike Hillwig continues his Logstash series:

So far, I’ve done a decent job getting the data into shape. My biggest challenge, though, was the dates and times. Dates are in one field, and the times are in another. Dates look like 2014-02-26 and times look like 0852. Using a traditional datetime datatype would be nice to have, so I’ll have to do it myself. In order to turn a date and time into a datetime, I need to abut the two fields and then convert the result.

I accomplished this by using a mutate filter, employing several add_field commands. Notice how I simply abut the two times.

Read on to see how Mike does it.
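
As a rough illustration of the idea in Python rather than Logstash configuration, the sketch below joins a date such as 2014-02-26 and a time such as 0852 and parses the pair into a single datetime. The function name combine_date_time is hypothetical, and zero-padding the time to four digits is an assumption about how the data might arrive; this is not Mike's mutate filter.

```python
from datetime import datetime


def combine_date_time(flight_date: str, hhmm: str) -> datetime:
    """Join a YYYY-MM-DD date and an HHMM time, then parse the pair."""
    # Assumption: a time that lost its leading zero (e.g. "852") means 08:52.
    padded = hhmm.zfill(4)
    # A single space between the two parts keeps the format string readable.
    return datetime.strptime(f"{flight_date} {padded}", "%Y-%m-%d %H%M")


stamp = combine_date_time("2014-02-26", "0852")
print(stamp.isoformat())
```

The result is a naive datetime with no time zone attached, so any conversion to UTC would still be a separate step.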

Dropping Columns With Logstash

Mike Hillwig shows how to ignore columns with Logstash:

Like I said earlier, we have some data that I know I’ll never use. This is flight performance data. The dataset contains diversion information. If a flight gets diverted more than once, it’s tracked here. I don’t care about that, so I’m dropping the diversion information for the second through fifth diversions. I’m also dropping some information about the airports that I believe I won’t need. This is the tricky part. Somewhere down the road, I’m going to need to enhance this data by converting all of the times to UTC.

Mike’s slowly building up to a complete, working example and it’s interesting to watch the progress along the way.
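
The same idea of discarding unneeded columns can be sketched in Python. The column names below (Div2Airport and so on) are hypothetical stand-ins for the second through fifth diversion fields, and drop_fields is an illustrative helper, not part of Logstash.

```python
def drop_fields(event: dict, prefixes: tuple) -> dict:
    """Return a copy of the event without keys that start with any prefix."""
    return {key: value for key, value in event.items()
            if not key.startswith(prefixes)}


# Hypothetical prefixes for the second through fifth diversions.
DIVERSION_PREFIXES = tuple(f"Div{n}" for n in range(2, 6))

event = {
    "FlightDate": "2014-02-26",
    "Div1Airport": "BOS",
    "Div2Airport": "",
    "Div5TailNum": "",
}
slim = drop_fields(event, DIVERSION_PREFIXES)
print(sorted(slim))
```

Filtering by prefix keeps the first diversion while removing the rest, which mirrors the choice described in the excerpt.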

Using Kafka And Elasticsearch For IoT Data

Angelos Petheriotis talks about building an IoT structure which handles ten billion messages per day:

We split the pipeline into 2 main units: the aggregator job and the persisting job. The aggregator has one and only one responsibility: to read from the input kafka topic, process the messages and finally emit them to a new kafka topic. The persisting job then takes over, and whenever a message is received from topic temperatures.aggregated it persists it to elasticsearch.

The above approach might seem to be overkill at first but it provides a lot of benefits (but also some drawbacks). Having two units means that each unit’s health won’t directly affect the other. If the processing job fails due to OOM, the persisting job will still be healthy.

One major benefit we’ve seen is the replay capability this approach offers. For example, if at some point we need to persist the messages from temperatures.aggregated to Cassandra, it’s just a matter of wiring a new pipeline and starting to consume the kafka topic. If we had one job for processing and persisting, we would have to reprocess every record, which comes with a great computational and time cost.

Angelos also discusses some issues he and his team had with Spark Streaming on this data set, so it’s an interesting comparison.
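
The replay benefit of splitting aggregation from persistence can be sketched with an in-memory stand-in for a topic. Everything here is an assumption made for illustration: the Topic class is a plain append-only list rather than Kafka, averaging per sensor is an invented aggregation, and the two sinks are ordinary lists standing in for Elasticsearch and Cassandra.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a topic: an append-only log of messages."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def read_from(self, offset=0):
        return list(self.log[offset:])


def run_aggregator(raw, aggregated):
    """Aggregator job: average readings per sensor, emit one message each."""
    temps_by_sensor = defaultdict(list)
    for msg in raw.read_from(0):
        temps_by_sensor[msg["sensor"]].append(msg["temp"])
    for sensor, temps in temps_by_sensor.items():
        aggregated.publish({"sensor": sensor,
                            "avg_temp": sum(temps) / len(temps)})


def run_persister(aggregated, sink):
    """Persisting job: copy already-aggregated messages into a sink."""
    for msg in aggregated.read_from(0):
        sink.append(msg)


raw = Topic()
aggregated = Topic()
for sensor, temp in [("a", 20.0), ("a", 22.0), ("b", 18.0)]:
    raw.publish({"sensor": sensor, "temp": temp})

run_aggregator(raw, aggregated)       # the expensive step runs once

elasticsearch_sink = []
run_persister(aggregated, elasticsearch_sink)

cassandra_sink = []                   # a sink added later
run_persister(aggregated, cassandra_sink)  # replay without re-aggregating
```

Because the aggregated messages stay in their own log, the second sink is filled by reading that log again instead of recomputing anything from the raw readings.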

Parsing CSVs With Logstash

Mike Hillwig continues his Logstash series by reading in a CSV:

As I was writing this, I thought I’d play with the autodetect_column_names setting. Unfortunately, it wasn’t an option for this particular file. Logstash threw an error :exception=>java.lang.ArrayIndexOutOfBoundsException: -1 which leads me to guess that my file is too wide for this setting. This file is staggeringly wide with 75 columns. If you have a more narrow file, this could be a really cool option. If your file format changes by someone adding or removing a column from the CSV, it’ll be a lot easier to maintain. Alas, it’s not an option in this situation.

Check out the script.
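
To show why detecting column names from the header row is easier to maintain, here is a small Python sketch using the standard csv module. It is an analogy to the Logstash setting, not its implementation, and the sample columns are made up.

```python
import csv
import io

# Made-up sample data; the first row supplies the column names.
SAMPLE = (
    "FlightDate,Carrier,DepTime\n"
    "2014-02-26,AA,0852\n"
    "2014-02-26,DL,0910\n"
)


def read_rows(text: str) -> list:
    """Parse CSV text, taking field names from the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = read_rows(SAMPLE)
print(rows[0]["DepTime"])
```

If someone adds or removes a column, the header row changes with it and the code that looks fields up by name keeps working, which is the maintenance benefit the excerpt mentions.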

Configuring Logstash

Mike Hillwig gets us started on Logstash:

Logstash is an incredibly powerful tool. If you can put data into a text file, Logstash can parse it. It works well with a lot of data, but I’m finding myself using it more for event data. When I say event data, I mean that if something triggers a log event and writes to a log, it’s an event. For the purposes of my demos, I’m using data from the Bureau of Transportation Statistics. They track flight performance data, which works perfectly for my uses. It’s a great example dataset without using anything related to my real job.

Logstash configuration files typically have three sections, INPUT, FILTER, and OUTPUT. However, FILTER is optional.

This is the first part in a series, so stay tuned.

Querying Elasticsearch

Swatee Chand has a tutorial on querying Elasticsearch:

In Elasticsearch, the aggregations framework is responsible for providing aggregated data based on a search query. Aggregations can be composed together in order to build complex summaries of the data. For a better understanding, consider an aggregation as a unit of work that builds analytic information over a set of documents available in Elasticsearch. Various types of aggregations are available, each of them having its own purpose and output. For simplification, they are generalized into 4 major families:

  1. Bucketing

    Here each bucket is associated with a key and a document criterion. Whenever the aggregation is executed, all the bucket criteria are evaluated on every document. Each time a criterion matches, the document is considered to “fall in” the relevant bucket.

  2. Metric

    Metric aggregations keep track of and compute metrics over a set of documents.

  3. Matrix

    Matrix aggregations operate on multiple fields. They produce a matrix result out of the values extracted from the requested document fields. Matrix aggregations do not support scripting.

  4. Pipeline

    Pipeline aggregations aggregate the output of other aggregations and their associated metrics.

If you deal with Elasticsearch (or have log data that you want to query through), this tutorial will give you an idea of what you can do.
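
To make the four families concrete, the sketch below builds a search request body as a Python dictionary. The aggregation types (terms, avg, matrix_stats, max_bucket) are standard Elasticsearch aggregations, but the field names and aggregation names are hypothetical, and the body is only printed here, not sent to a cluster.

```python
import json

query = {
    "size": 0,  # return aggregation results only, no individual hits
    "aggs": {
        # Bucketing: one bucket per distinct carrier value.
        "by_carrier": {
            "terms": {"field": "carrier"},
            "aggs": {
                # Metric: average delay computed inside each bucket.
                "avg_delay": {"avg": {"field": "arrival_delay"}},
            },
        },
        # Matrix: statistics across several numeric fields at once.
        "delay_stats": {
            "matrix_stats": {"fields": ["departure_delay", "arrival_delay"]},
        },
        # Pipeline: works on the output of the aggregations above.
        "worst_carrier": {
            "max_bucket": {"buckets_path": "by_carrier>avg_delay"},
        },
    },
}

print(json.dumps(query, indent=2))
```

The nesting shows how aggregations compose: the metric sits inside the bucket aggregation, and the pipeline aggregation refers to both through its buckets_path.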

Basics Of Elasticsearch In .NET

Ivan Cesar gives us a brief tutorial of the Elasticsearch .NET API:

To be able to search something, we must store some data into ES. The term used is “indexing.”

The term “mapping” is used for mapping our data in the database to objects which will be serialized and stored in Elasticsearch. We will be using Entity Framework (EF) in this tutorial.

Generally, when using Elasticsearch, you are probably looking for a site-wide search engine solution. You will either use some sort of feed or digest, or a Google-like search which returns all the results from various entities, such as users, blog entries, products, categories, events, etc.

These will probably not just be one table or entity in your database, but rather, you will want to aggregate diverse data and maybe extract or derive some common properties like title, description, date, author/owner, photo, and so on. Another thing is, you probably won’t do it in one query, but if you are using an ORM, you will have to write a separate query for each of those blog entries, users, products, categories, events, or something else.

Check out Ivan’s tutorial for several examples. Elasticsearch is really good for text-based search and simple aggregations, but it probably shouldn’t be a primary data store for any data you really care about.
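
The idea of deriving common properties from diverse entities can be sketched as follows. This is Python rather than the .NET API covered in the tutorial, and the SearchDocument class, the converter functions, and the input field names are all hypothetical.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class SearchDocument:
    """Common shape shared by every entity that should be searchable."""
    kind: str
    title: str
    description: str
    published: date


def from_user(user: dict) -> SearchDocument:
    # Hypothetical user record with name, bio, and joined fields.
    return SearchDocument("user", user["name"], user["bio"], user["joined"])


def from_blog_entry(entry: dict) -> SearchDocument:
    # Hypothetical blog record with headline, summary, and posted fields.
    return SearchDocument("blog", entry["headline"], entry["summary"],
                          entry["posted"])


documents = [
    from_user({"name": "Ada", "bio": "Engineer", "joined": date(2018, 1, 5)}),
    from_blog_entry({"headline": "Hello", "summary": "First post",
                     "posted": date(2018, 4, 2)}),
]
print([asdict(doc)["title"] for doc in documents])
```

Each entity type gets its own query and converter, but everything lands in one shape that a single site-wide search can cover.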

Kafka Connect To Elasticsearch

Robin Moffatt shows how to take data from Kafka Connect and feed it into Elasticsearch:

Whilst Kafka Connect is part of Apache Kafka itself, if you want to stream data from Kafka to Elasticsearch you’ll want the Confluent Open Source distribution (or at least, the Elasticsearch connector).

The configuration is pretty simple. As before, see inline comments for details.

It’s worth noting that if you’re using the same converter throughout your pipelines (Avro, in this case) you’d actually put this in the Connect worker config itself rather than repeating it for each connector configuration.

This is a simple example which shows just how easy it can be.
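
For a sense of what such a connector configuration might contain, here is a sketch that builds one as a Python dictionary and serializes it to JSON. This is not Robin's configuration: the connector name, topic, and URL are placeholders, and the property names are given as recalled from the Confluent Elasticsearch sink connector of that era, so check them against the documentation for your installed version.

```python
import json

connector = {
    "name": "es-sink-example",  # placeholder connector name
    "config": {
        # Connector class from the Confluent Elasticsearch connector.
        "connector.class":
            "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "example-topic",                  # placeholder topic
        "connection.url": "http://localhost:9200",  # placeholder URL
        # Assumption: older connector versions expect a document type name.
        "type.name": "kafka-connect",
        # Ignore the message key and let the connector build document IDs.
        "key.ignore": "true",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Converter settings are left out on purpose, following the point in the excerpt that a converter shared across pipelines belongs in the Connect worker config.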

Grafana On Elasticsearch

Daniel Berman shows how to replace Kibana with Grafana:

While the two tools are very similar in terms of what can be done with the data itself, the main differences between Kibana and Grafana lie in configuring how the data is displayed. Grafana has richer display features and more options for playing around with how the data is represented in the graphs.

While it takes some time getting accustomed to building graphs in Grafana — especially if you’re coming from Kibana — the data displayed in Grafana dashboards can be read and analyzed more easily.

I prefer Grafana over Kibana for a few reasons, so I’m happy to see Grafana articles popping up.

Logstash Filters

Nicolas Frankel explains how the grok and dissect filters work in Logstash:

The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

Unfortunately, there’s no app for that – but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
%{pid} %{}[%{thread}] %{class}:%{log}
(broken on 2 lines for better readability)

One of the big secrets to effective debugging of code is having good logging mechanisms in place.
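
To illustrate why a separator-based filter is simpler than a regex-based one, here is a minimal Python sketch of dissect-style parsing. It is a simplified imitation, not the Logstash plugin: the dissect function is hypothetical, appended fields are joined with a space, and the pattern and log line are shortened, made-up versions of the mapping quoted above.

```python
import re

FIELD = re.compile(r"%\{([^}]*)\}")


def dissect(pattern: str, line: str):
    """Split line on the literal text found between %{field} markers."""
    parts = FIELD.split(pattern)  # literal, name, literal, name, ...
    literals, names = parts[0::2], parts[1::2]
    if not line.startswith(literals[0]):
        return None
    pos = len(literals[0])
    result = {}
    for name, delimiter in zip(names, literals[1:]):
        if delimiter:
            end = line.find(delimiter, pos)
            if end == -1:
                return None  # expected separator is missing
        else:
            end = len(line)  # no trailing separator: take the rest
        value = line[pos:end]
        pos = end + len(delimiter)
        if name.startswith("+"):  # append to an earlier field
            key = name[1:]
            result[key] = f"{result[key]} {value}" if key in result else value
        elif name:  # an empty name, %{}, skips the value
            result[name] = value
    return result


PATTERN = "%{timestamp} %{+timestamp} %{level} [%{application},%{traceId}] %{log}"
LINE = "2018-04-02 08:52:00 INFO [shop,abc123] order created"
print(dissect(PATTERN, LINE))
```

The parser only searches for literal separators from left to right, so a line that does not fit fails at the first missing separator instead of triggering regex backtracking.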

