Investigating London Crime Data

Carl Goodwin digs into London crime data by borough and sees if he can predict crime rates:

Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) measures the average differences, so should be as small as possible. And R-squared measures the correlation between prediction and actual, where 0 reflects no correlation, and 1 perfect positive correlation.

Our supervised machine learning outcomes from the CART and GLMmodels have weaker RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better as we see from the visually tighter clustering.

It was Random Forest that produced marginally the smallest prediction error. And it was a parameter unique to the Random Forest model which almost tripped me up as discussed in the supporting documentation.

Also be sure to read his notebook to get the full story.  H/T R-Bloggers

The Optimal Kafka Message Size

Guy Shilo wants to figure out the right chunk size for a Kafka message:

I wrote a python program that runs a producer and a consumer for 30 minutes with different message sizes and measures how many messages per second it can deliver, or the Kafka cluster throughput.

I did not care about the message content, so the consumer only reads the messages from the topic and then discards them. I used a Three partition topic. I guess that on larger clusters with more partitions the performance will be better, but the message size – throughput ratio will remain roughly the same.

So I wrote a small python program that generates a dummy message in the desired size, then spawns two threads, one is a producer and the other is a consumer. The producer send the same message over and over and the consumer reads the messages from the topic and count how many messages it has read. The main program stops after 30 minutes but before it stops it prints how many messages were consumed and how many messages were consumed per second.

Read on for the results.  More importantly, test in your own environment with your own equipment, as that value’s likely to differ a bit.

Using Kafka And Elasticsearch For IoT Data

Angelos Petheriotis talks about building an IoT structure which handles ten billion messages per day:

We splitted the pipeline into 2 main units: The aggregator job and the persisting job. The aggregator has one and only one responsibility. To read from the input kafka topic, process the messages and finally emit them to a new kafka topic. The persisting job then takes over and whenever a message is received from topic temperatures.aggregated it persists to elasticsearch.

The above approach might seem to be an overkill at first but it provides a lot of benefits (but also some drawbacks). Having two units means that each unit’s health won’t directly affect each other. If the processing job fails due OOM, the persisting job will still be healthy.

One major benefit we’ve seen using this approach is the replay capabilities this approach offers. For example, if at some point we need to persist the messages from temperatures.aggregated to Cassandra, it’s just a matter of wiring a new pipeline and start consuming the kafka topic. If we had one job for processing and persisting, we would have to reprocess every record from the, which comes with a great computational and time cost.

Angelos also discusses some issues he and his team had with Spark Streaming on this data set, so it’s an interesting comparison.

Running Cassandra On EC2

Prasad Alle and Provanshu Dey share some tips if you’re running Cassandra on Amazon’s EC2:

Apache Cassandra is a commonly used, high performance NoSQL database. AWS customers that currently maintain Cassandra on-premises may want to take advantage of the scalability, reliability, security, and economic benefits of running Cassandra on Amazon EC2.

Amazon EC2 and Amazon Elastic Block Store (Amazon EBS) provide secure, resizable compute capacity and storage in the AWS Cloud. When combined, you can deploy Cassandra, allowing you to scale capacity according to your requirements. Given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.

In this post, we outline three Cassandra deployment options, as well as provide guidance about determining the best practices for your use case in the following areas:

  • Cassandra resource overview

  • Deployment considerations

  • Storage options

  • Networking

  • High availability and resiliency

  • Maintenance

  • Security

Click through to see these tips.

Gartner’s BI Magic Quadrant For 2018

Bruno Aziza looks at the new Gartner magic quadrant for business intelligence solutions:

For the first time in 3 years, Gartner dropped a significant amount of vendors off its quadrant.  There were 24 vendors in the firm’s quadrant in 2016 and 2017.  This year, the Magic Quadrant only lists 20 vendors…that’s a 16% quadrant reduction.  Has the market shrunk?!

Not exactly: the market has evolved….and in a pretty predictable way actually.  Take a look at our 3-year-movement analysis table below: we see a pretty consistent story, e.g. the big are getting bigger, some of the visionaries got absorbed (or disappeared) and few ‘trend-setters’ graduated up.

Read on for more.  The leader quadrant pretty much fits my expectations in terms of the major vendors and their rank ordering.

Checking Plan Compilation Time

Eric Blinn looks at plan compilation time in SQL Server:

The query returns 4 rows. By including STATISTICS TIME we get extra information on the Messages output tab.  We can see from the execution on my laptop that the optimizer took 6ms to compile a query plan and the actual query executed in only 1ms.

Run the query batch a few more times and notice that the parse and compile time drops to zero.  This is because SQL Server keeps a list of compiled plans and tries to reuse them without having to recompile.  In this case the optimizer has recognized that this query is exactly identical to one it has previously executed and it reuses the previously compiled plan.  That list of plans is called the Plan Cache and will be covered in much more detail in a subsequent post.

This cost is something we tend to forget about, but can make a big difference for a user’s experience.

Using Power BI For Hockey Stats

Stacia Varga continues her Power BI + hockey series:

My last data acquisition step is to get statistics data for each player. I just need to build a function to dynamically get data by team like I did above using this endpoint as my base structure:

It turns out there are many different kinds of statistics that I can get in addition to these statistics by season. I’ll probably get them all added into my model eventually, but the process is the same. For a list of other available statistics to use instead of yearByYear, see

It’s another nice use of Power BI to read from a web-based API.

Allowing Azure Service Access

Arun Sirpal points out the importance of a tiny checkbox:

When you create a “logical” Azure SQL Server (I say logical because we are not really physically creating anything) there is a setting that is ticked ON by default which is called “Allow Azure services to access server”.

The question is, what does it mean? (See the highlighted section below)

Read on to see what this does and why Arun doesn’t like the default.


March 2018
« Feb