March 2018

Order Of Operations With Logical Types

Thomas Rushton explains the order of operations, particularly around boolean operators:

The order in which calculations are done – not just reading from left to right, but remembering that things like multiplication and division happen before addition and subtraction. My son tells me that kids nowadays are being taught something called “BIDMAS” – which stands for “Brackets, Indices, Division, Multiplication, Addition, Subtraction”. Or it can be BODMAS – Brackets, Operations, Division… (Operation is a fancy new way of describing indices – ie x^y)

Unsurprisingly, there are similar rules for Boolean operators.

It’s a valuable lesson oft learned.
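
To make the point concrete, here is a hypothetical example (the dbo.Products table and its columns are invented for illustration): in T-SQL, AND binds more tightly than OR, so these two queries are not equivalent.

-- Parsed as: Color = 'Red' OR (Color = 'Blue' AND Size = 'L'),
-- so red products of any size slip into the results.
SELECT ProductID, Color, Size
FROM dbo.Products
WHERE Color = 'Red' OR Color = 'Blue' AND Size = 'L';

-- Parentheses force the intended grouping: large products that are red or blue.
SELECT ProductID, Color, Size
FROM dbo.Products
WHERE (Color = 'Red' OR Color = 'Blue') AND Size = 'L';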

Using Extended Properties For Documentation

Phil Factor shows us how we can use Extended Properties to build database documentation:

Once you’ve got into the habit of using Extended Properties to document your database, there are obvious benefits:

  • You can explain why you added that index or modified that constraint.
  • You can describe exactly what that rather obscure column does.
  • You can add a reasoned explanation to the use of a table.

You will often need these explanations because, sadly, DDL code isn’t ‘self-documenting’, and human memory is fallible. Extended Properties are easily searched because they are all exposed in one system view.

It is great to add explanations to lists of procedures, functions and views once the database becomes sizeable. Extended Properties are useful when exploring the metadata, but the requirement isn’t quite so essential because comments are preserved along with the source code. Tables, however, are a big problem because SQL Server throws away the script that produces the table, along with all the comments. The reason that this happens is that there are many ways you can alter parts of a table without scripting the entire table. How could one infallibly preserve all these ALTER statements in the preserved script? It’s tricky. Table scripts that you get from SSMS or script via SMO are therefore synthesised from the system tables but without those comments or even Extended Properties.

Extended Properties are useful, but I think the lack of tooling around them prevented widespread adoption.  Now that there are a few tools which support them (including SSMS’s data classification mechanism), I wonder if these will get a second look.
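
As a minimal sketch of the habit (the table and description below are invented for illustration), you attach a property with sp_addextendedproperty and can then find it, along with every other property in the database, through the sys.extended_properties view:

-- Attach a description to a hypothetical table.
EXEC sys.sp_addextendedproperty
    @name = N'MS_Description',
    @value = N'One row per customer who has opted out of marketing e-mail.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'CustomerOptOut';

-- All Extended Properties are exposed in one system view, so they are easy to search.
SELECT OBJECT_NAME(ep.major_id) AS object_name,
       ep.name AS property_name,
       ep.value
FROM sys.extended_properties AS ep
WHERE ep.class = 1;  -- object- and column-level properties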

Query Store Indexes

Arthur Daniels shows what you can learn from the indexes on Query Store tables:

It looks like internally Query Store is referred to as plan_persist. That makes sense, thinking about how the Query Store persists query plans to your database’s storage. Let’s take a look at those catalog views vs their clustered and nonclustered indexes. I’ve modified the query a bit at this point, to group together the key columns.

This lets you see how the Query Store authors expected us to use these tables.  Which isn’t always how people use them…
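
If you want to poke around yourself, here is a simplified sketch along the lines of Arthur’s query (my own approximation, not his exact code), listing the plan_persist tables with their indexes and key columns:

-- List Query Store internal tables and the key columns of their indexes.
SELECT t.name AS table_name,
       i.name AS index_name,
       i.type_desc AS index_type,
       ic.key_ordinal,
       c.name AS key_column
FROM sys.all_objects AS t
     JOIN sys.indexes AS i
         ON i.object_id = t.object_id
     JOIN sys.index_columns AS ic
         ON ic.object_id = i.object_id
        AND ic.index_id = i.index_id
     JOIN sys.all_columns AS c
         ON c.object_id = ic.object_id
        AND c.column_id = ic.column_id
WHERE t.name LIKE 'plan_persist%'
  AND ic.key_ordinal > 0          -- key columns only, not included columns
ORDER BY t.name, i.index_id, ic.key_ordinal;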

Error Running Analysis Services Processing Task

Angela Henry ran into a problem with the SSIS Analysis Services processing task:

In both of these scenarios you will not be able to save the package.  So what the heck are you supposed to do?!  Here’s where my tunnel vision (and panic) sets in.  How was I supposed to get my SSAS objects processed?

I could always script out my processing tasks using SSMS and drop them in a SQL Agent job step.  But I have multiple environments and multiple cubes so each one would have to be hard coded.  Not a great idea, so scratch that.

Click through to learn the best way to fix this.

Viewing N Months In Power BI

Jason Thomas shows a nice method to combine data for the last N months along with a month-by-month breakdown using a single date dimension:

Create 3 measures as shown below, and then add those 3 measures in the report along with a month slicer as shown below. You can change the month in the slicer and verify that the measure values change for the selected month.


Sales (Selected Month) =
SUM ( Sales[Sales] )

Sales Last Year =
CALCULATE ( SUM ( Sales[Sales] ), SAMEPERIODLASTYEAR ( 'Date'[Date] ) )

Sales YTD =
TOTALYTD ( SUM ( Sales[Sales] ), 'Date'[Date] )

This is the first time I’ve seen a What If parameter in use.  Very interesting.

Investigating London Crime Data

Carl Goodwin digs into London crime data by borough and sees if he can predict crime rates:

Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) measures the average differences, so should be as small as possible. And R-squared measures the correlation between prediction and actual, where 0 reflects no correlation, and 1 perfect positive correlation.

Our supervised machine learning outcomes from the CART and GLM models have weaker RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better as we see from the visually tighter clustering.

It was Random Forest that produced marginally the smallest prediction error. And it was a parameter unique to the Random Forest model which almost tripped me up as discussed in the supporting documentation.

Also be sure to read his notebook to get the full story.  H/T R-Bloggers
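
For reference, the two metrics he describes, written out (with $y_i$ the actual value and $\hat{y}_i$ the prediction):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$R^2 = \left(\operatorname{cor}\left(\hat{y}, y\right)\right)^2$$

The second formula reads R-squared as the squared correlation between predictions and actuals, a common convention for regression models that matches the 0-to-1 interpretation in the quote.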

The Optimal Kafka Message Size

Guy Shilo wants to figure out the right chunk size for a Kafka message:

I wrote a Python program that runs a producer and a consumer for 30 minutes with different message sizes and measures how many messages per second it can deliver, or the Kafka cluster throughput.

I did not care about the message content, so the consumer only reads the messages from the topic and then discards them. I used a three-partition topic. I guess that on larger clusters with more partitions the performance will be better, but the message size – throughput ratio will remain roughly the same.

So I wrote a small Python program that generates a dummy message of the desired size, then spawns two threads: one is a producer and the other is a consumer. The producer sends the same message over and over, and the consumer reads the messages from the topic and counts how many messages it has read. The main program stops after 30 minutes, but before it stops it prints how many messages were consumed in total and how many were consumed per second.

Read on for the results.  More importantly, test in your own environment with your own equipment, as that value’s likely to differ a bit.

Using Kafka And Elasticsearch For IoT Data

Angelos Petheriotis talks about building an IoT structure which handles ten billion messages per day:

We split the pipeline into 2 main units: the aggregator job and the persisting job. The aggregator has one and only one responsibility: to read from the input Kafka topic, process the messages, and finally emit them to a new Kafka topic. The persisting job then takes over and, whenever a message is received from the topic temperatures.aggregated, persists it to Elasticsearch.

The above approach might seem like overkill at first, but it provides a lot of benefits (along with some drawbacks). Having two units means that one unit’s health won’t directly affect the other’s. If the processing job fails due to OOM, the persisting job will still be healthy.

One major benefit we’ve seen using this approach is the replay capability it offers. For example, if at some point we need to persist the messages from temperatures.aggregated to Cassandra, it’s just a matter of wiring up a new pipeline and starting to consume the Kafka topic. If we had one job for processing and persisting, we would have to reprocess every record from the thermostat.data topic, which comes with a great cost in computation and time.

Angelos also discusses some issues he and his team had with Spark Streaming on this data set, so it’s an interesting comparison.

Running Cassandra On EC2

Prasad Alle and Provanshu Dey share some tips if you’re running Cassandra on Amazon’s EC2:

Apache Cassandra is a commonly used, high performance NoSQL database. AWS customers that currently maintain Cassandra on-premises may want to take advantage of the scalability, reliability, security, and economic benefits of running Cassandra on Amazon EC2.

Amazon EC2 and Amazon Elastic Block Store (Amazon EBS) provide secure, resizable compute capacity and storage in the AWS Cloud. When combined, you can deploy Cassandra, allowing you to scale capacity according to your requirements. Given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.

In this post, we outline three Cassandra deployment options, as well as provide guidance about determining the best practices for your use case in the following areas:

  • Cassandra resource overview
  • Deployment considerations
  • Storage options
  • Networking
  • High availability and resiliency
  • Maintenance
  • Security

Click through to see these tips.

Gartner’s BI Magic Quadrant For 2018

Bruno Aziza looks at the new Gartner magic quadrant for business intelligence solutions:

For the first time in 3 years, Gartner dropped a significant amount of vendors off its quadrant.  There were 24 vendors in the firm’s quadrant in 2016 and 2017.  This year, the Magic Quadrant only lists 20 vendors…that’s a 16% quadrant reduction.  Has the market shrunk?!

Not exactly: the market has evolved… and in a pretty predictable way, actually. Take a look at our 3-year-movement analysis table below: we see a pretty consistent story, e.g. the big are getting bigger, some of the visionaries got absorbed (or disappeared), and a few ‘trend-setters’ graduated up.

Read on for more.  The leader quadrant pretty much fits my expectations in terms of the major vendors and their rank ordering.
