Lessons From A Data Analysis Exercise

Bill Schmarzo has an interesting post summarizing the results of an MBA class exercise involving data analysis:

Lesson #2:  Quick and dirty visualizations are critical in understanding what is happening in the data and establishing hypotheses to be tested. For example, the data visualization in Figure 1 quickly highlighted the importance of offensive rebounds and three-point shooting percentage in the Warriors’ overtime losses.

Read the whole thing.

Understanding Bootstrap Aggregating (Bagging)

Gabriel Vasconcelos explains the bagging technique:

The name bagging comes from boostrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations on the data may drastically change the results.

Read on to see how bootstrap aggregation works and how it solves this solution instability problem.

Presto On HDInsight

Ashish Thapliyal shows how to install Presto on an HDInsight cluster:

What is Presto?

Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto is becoming popular SQL interactive query engine that has grabbed the attention and mind-share in Big data communities.

What are the key advantages of Presto?

1- It’s very fast – Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses.

2- Presto can query data where it lives – Presto supports many data sources via the number of connectors that community has built. You can query HDFS , Hive, Azure Storage or data stored in SQL Server , My SQL , CosmosDB or Cassandra etc.

You can install Presto in one simple step with HDInsight Script Action feature

Read on for instructions and showing how to connect this to other Azure products like CosmosDB and Azure SQL Database.

Star-Schema Benchmark With Hive + Druid

Carter Shanklin and Slim Bouguerra run a Hadoop OLAP system running Hive and Druid against the Star-Schema Benchmark battery of queries:

How did we arrive at the query used to build the OLAP index? There is a systematic procedure:

  1. The union of all dimensions used by the SSB queries is included in the index.
  2. The union of all measures is included in the index. Notice that we pre-compute some products in the index.
  3. Druid requires a timestamp, so the date of the transaction is used as the timestamp.

You can see that building the index requires knowledge of the query patterns. Either an expert in the query patterns architects the index, or a tool is needed to analyze queries or to dynamically build indexes on the fly. A lot of time can be spent in this architecture phase, gathering requirements, designing measures and so on, because changing your mind after-the-fact can be very difficult.

One thing I don’t like so much is that they removed the ORDER BY clauses from some of the queries, as making this change makes it more difficult to use these results for “it’s totally not a comparison so don’t sue us Oracle” purposes.

SQL In Spaaaaaaacccce!

Drew Furgiuele knows that in space, no-one can hear your Sev 18 alerts:

Over the last few months, I’ve had a new itch: I wanted to get into the world of high altitude ballooning. The concept is pretty simple: get a balloon and some helium, tie it to a payload, and let it go. The balloon travels a certain height and distance, then bursts, and your payload falls back to earth. That in itself is pretty interesting to me, and it’s not prohibitively expensive: students have done it for a couple hundred dollars. For a few dollars more, you can put a camera on it and take pictures as it travels.

The thing is, I wanted to do more than that. The maker in me wanted to do something special, something no one (to my knowledge) has done before. I not only wanted to launch a balloon and a camera, I wanted to put SQL Server up there, too. So that’s why we’re announcing the High Altitude SQL Server Project (HASSP).

I love it.

Figuring Out If That Powershell Variable Exists

Richie Lee has a method for checking whether a variable exists in Powershell:

Recently I needed to check that a variable exists in the PowerShell session currently running. This is actually far easier than it sounds. So here is a simple demo for how it works. The magic here is the “Test-Path variable:my_variable” on lines 4. It tests that a variable of that name exists. If it does, great, let’s print out the value. If not, let’s alert that it doesn’t. The second example of this on line 11 will do exactly that.

Read on to see Test-Path in action.

T-SQL FizzBuzz Golf

Kenneth Fisher plays the FizzBuzz game with T-SQL and his chat piles on:

Now normally FizzBuzz is done with a loop, but as Russ said, we are using T-SQL so batch code is always the goal. That said, what table should I query to get the numbers 1-100? Well, I decided I’d just do something simple here and use a system view that’s more than 100 rows, the ROW_NUMBER function, and restrict it using TOP.

Read on to see several answers to this problem, some better than others.

Hierarchy Slicer Custom Visual

Devin Knight continues his Power BI custom visuals series:

Key Takeaways

  • Let’s you expand and dig into each level of you hierarchy data.

  • Allows you to display a measure value at the lowest level.

  • Replaces the idea using multiple regular slicers to represent what the Hierarchy Slicer does.

This is a great visual if your dimensional data fits a good natural hierarchy.

SQL On Docker Environment Variables

James Anderson lists the current environment variables that you can set for a SQL Server on Linux Docker container:

As of CTP 2.1 for SQL Server 2017 a set of new environment variables are available. These variables allow us to configure each SQL Server container as we spin them up. The first version of SQL Server on Linux came with:


These had to be set for the container to start. The SA_PASSWORD has be a complex password or the container will not start. CTP 2.1 introduced:

Read on for the new variables and an example on how to use them.

Three-Way Variance Analysis

Bogdan Anastasiei shows how to perform a three-way variance analysis when the third-order and second-order effects are both statistically significant:

In the formula above the interaction effect is, of course, dosegendertype. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type   2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

This is definitely not a trivial article, but there are useful techniques in it.


May 2017
« Apr