

Enriching Syslog Data In A Kafka Pipeline

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams.  If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.
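The series does its enrichment in KSQL, but if you want a feel for the same stream-table join pattern in code, here’s a minimal, hypothetical sketch in the Kafka Streams Scala DSL. The topic names, the choice of host name as the key, and the plain string values are all made up for illustration, and it assumes the Kafka 2.6+ Scala serde imports; it is not the pipeline from the article.

import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.kstream.{KStream, KTable}
import org.apache.kafka.streams.scala.serialization.Serdes._

object SyslogEnrichment extends App {
  val builder = new StreamsBuilder()

  // Raw syslog events, keyed by host name (hypothetical topic).
  val syslog: KStream[String, String] = builder.stream[String, String]("syslog-raw")

  // Reference data about each device, also keyed by host name (hypothetical topic).
  val devices: KTable[String, String] = builder.table[String, String]("device-info")

  // Stream-table join: each syslog event picks up its device's reference record.
  val enriched: KStream[String, String] =
    syslog.join(devices)((event, device) => s"$device | $event")

  enriched.to("syslog-enriched")

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "syslog-enrichment-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  new KafkaStreams(builder.build(), props).start()
}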


Mapping In Scala When Dealing With Futures & Options

Shubham Verma shows us what happens when we use the map() and flatMap() functions in Scala on values wrapped in Option or Future:

Now let’s move towards the interesting part, flatMap(), and what it is supposed to do in the case of Option. flatMap() gives us the liberty to return the type of value that we want after the transformation, unlike map(), where if the parameter has a Some value, the result will be of type Some no matter what; that’s not the case with flatMap().

scala> option.flatMap(x => None)
res13: Option[Nothing] = None

scala> option.map(x => None)
res14: Option[None.type] = Some(None)

The code snippet above clearly shows it. So is that it? Not yet; let’s look at one more feature of flatMap() on Option[+A] that comes in really handy when we need to extract values out of Options. Suppose we have a list of type List[Option[Int]] and I am only interested in the elements that have some value, which seems to be an obvious use case most of the time; we can do that simply with a flatMap() operation.
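For that List[Option[Int]] case, here’s a quick REPL sketch (mine, not from the post) showing how flatMap() drops the None elements and unwraps the rest:

scala> val list: List[Option[Int]] = List(Some(1), None, Some(3))
list: List[Option[Int]] = List(Some(1), None, Some(3))

scala> list.flatMap(x => x)
res0: List[Int] = List(1, 3)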

In short, it’s a little more complex, but you can still get useful information.


Taxi Cab Data On SQLite And Parquet

Mark Litwintschik loads the 1.1 billion rows of New York City taxi data into a SQLite database, using data stored in Parquet-format files living on HDFS:

The dataset used in this benchmark has 1.1 billion records, 51 columns and is 500 GB in size when in uncompressed CSV format. Instructions on producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive and Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.

I’ve downloaded the Parquet files to my local file system and imported them onto HDFS. Since this is all running on a single SSD drive I’ve set the HDFS replication factor to 1.

It’s not the fastest result I’ve seen from Mark’s work, but I was impressed that SQLite could take that abuse.


Studying Performance Problems With Web Applications

Adrian Colyer reviews a paper on fixing performance problems with ORMs:

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al. are able to quantify how many lines of code it takes to address the issue, and what performance improvement the fix delivers.

Much of this is straightforward, but worth the read.


Graphics In R

David Smith is following the kerfuffle that Edward Tufte unleashed on Twitter recently:

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additional tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

I think Tufte’s off base here.


Just Update Those Servers Already

Randolph West wants none of your excuses:

Folks, we all like to make sure we’re doing our level best to make things work smoothly.

So why am I staring at someone’s server that has never been updated since it was first set up almost three years ago?

Do better, so that I don’t have to yell at you. Seriously.

When we ignore updates, we are ignoring preventable catastrophic problems; we are ignoring fixes to security bugs, performance bugs, and data corruption bugs. Each one of these things could give you a really bad day. In two out of three cases it might even be a career-limiting move.

There are risks to patching servers, but the downside risk of not patching tends to be much larger, and administrators can mitigate update risks through redundancy, automation, and a rollback plan in case something goes wrong. It’s more work than not patching, but the outcome is much better.


Scaling Azure Analysis Services

Chris Seferlis helps us make the decision between scaling up or out with Azure Analysis Services:

Some of you may not know when or how to scale up your queries or scale out your processing. Today I’d like to help with understanding when and how using Azure Analysis Services. First, you need to decide which tier you should be using. You can do that by looking at the QPUs (Query Processing Units) of each tier on Azure. Here’s a quick breakdown:

  • Developer Tier – gives you up to 20 QPUs

  • Basic Tier – is a mid-scale tier, not meant for heavy loads

  • Standard Tier (currently the highest available) – allows you more capability and flexibility

Read on for some pointers.


OFFSET-FETCH Versus ROWNUM In Oracle

Lukas Eder compares OFFSET-FETCH against ROWNUM for grabbing an ordered sub-selection of rows in Oracle:

Now, while the SQL transformation from FETCH FIRST to ROW_NUMBER() filtering is certainly correct, the execution plan doesn’t really make me happy. Consider the ROWNUM based query:

---------------------------------------------------------
| Id  | Operation                     | Name    | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT              |         |       |
|*  1 |  COUNT STOPKEY                |         |       |
|   2 |   VIEW                        |         |     1 |
|   3 |    TABLE ACCESS BY INDEX ROWID| FILM    |  1000 |
|   4 |     INDEX FULL SCAN           | PK_FILM |     1 |
---------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM=1)

And compare that to the FETCH FIRST query:

-------------------------------------------------
| Id  | Operation                | Name | Rows  |
-------------------------------------------------
|   0 | SELECT STATEMENT         |      |       |
|*  1 |  VIEW                    |      |     1 |
|*  2 |   WINDOW SORT PUSHED RANK|      |  1000 |
|   3 |    TABLE ACCESS FULL     | FILM |  1000 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "FILM"."FILM_ID")<=1)

Lukas digs into this and is not the biggest fan of OFFSET-FETCH.  On the SQL Server side, my anecdotal experience has been that it doesn’t perform nearly as well as you’d like either.
