Data Manipulation In R

Kevin Feasel



Casimir Saternos has an article on matrix operations and other data transformations in R:

Operations that are conceptually simple can be difficult to perform using SQL.  Consider the common requirements to pivot or transpose a dataset.   Each of these actions are conceptually straightforward but are complex to implement using SQL.  The examples that follow are somewhat verbose, but the details are not significant. The main point is to illustrate is that, by using specialized functions outside of SQL,  R makes trivial some of those operations that would otherwise require complex SQL statements.  The contrast in the amount of code required is striking.  The simpler approach allows you to focus attention on the scientific or business problem at hand, rather than expending energy reading documentation or laboriously testing complex statements.

I consider this to be where the second-order value of R comes in.  The initial “wow” factor is in how easily you can plot things, and this ease of data cleansing is the next big time-saver.
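
To make "conceptually simple" concrete, here is a minimal sketch of a pivot and a transpose. It is in Python rather than R, it is not taken from Casimir's article, and the sales rows and the function names `pivot` and `transpose` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical long-format rows: (year, quarter, amount).
rows = [
    (2014, "Q1", 10), (2014, "Q2", 12),
    (2015, "Q1", 11), (2015, "Q2", 15),
]

def pivot(records):
    """Turn (row_key, column_key, value) records into {row_key: {column_key: value}}."""
    table = defaultdict(dict)
    for row_key, col_key, value in records:
        table[row_key][col_key] = value
    return dict(table)

def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*matrix)]

wide = pivot(rows)                      # one entry per year, one key per quarter
matrix = [[wide[y][q] for q in ("Q1", "Q2")] for y in sorted(wide)]
flipped = transpose(matrix)             # quarters become rows, years become columns
print(wide)
print(flipped)
```

Each operation is a few lines in a general-purpose language, which is the kind of contrast with SQL that the quoted passage describes.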

Monitoring MDX Query Result Serialization

Chris Webb digs into MDX query serialization.

Part 1:

The Serialize Results Begin event marks the point where SSAS starts to construct the cellset returned. The Serialize Results Current events that immediately follow it, with EventSubclass “1 – Serialize Axes” show SSAS serialising the tuples that are present on the Columns axis (listed as Axis 0 in the TextData column), the Rows axis (Axis 1) and the Where clause (Slicer Axis). The numeric values in the ProgressTotal column for the Serialize Results Current events shows the number of tuples on each axis: the two tuples on columns are the two measures, the six tuples on rows are the six years, and there’s one tuple on the slicer. After that SSAS gets the data for each of the cell values (as shown by the Query Subcube Verbose event – note that this query is running on a warm cache) and there is then a Serialize Results Current event with EventSubclass “2 – Serialize Cells”; the ProgressTotal column shows that twelve cells (2 columns * 6 rows) in total were returned.  The Serialize Results End event shows that SSAS has finished constructing the cellset and not surprisingly it’s followed immediately by the Query Cube End and Query End events.
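
The cell count in that trace follows directly from the axis tuple counts. A tiny sketch of the arithmetic, using only the numbers quoted above (the dictionary keys are just labels I picked):

```python
from math import prod

# Tuple counts per query axis, as reported in ProgressTotal in the quoted trace.
axis_tuple_counts = {"Axis 0 (columns)": 2, "Axis 1 (rows)": 6}

# The slicer axis filters the cellset rather than adding a dimension to it,
# so only the query axes are multiplied together.
expected_cells = prod(axis_tuple_counts.values())
print(expected_cells)
```

This comes out to the twelve cells reported on the "2 – Serialize Cells" event.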

Part 2:

There are a couple of interesting things to note about this query. First, SQL Server Management Studio on my laptop says that it takes nine seconds to run, even on a warm cache; the Duration column for the Query End event in Profiler, however, shows a value of around six seconds. The three second difference must be the time it takes for SSAS to return the cellset to SQL Server Management Studio, and for SQL Server Management Studio to render the results (my guess is that it’s the second operation that takes the majority of this time – other client tools may be more efficient at rendering large resultsets).

I have so little experience with MDX that this is a wide world of amazement to me.  So it’s very nice that Chris is digging into this and peeling away some of the mystery.

Finding A Query In Query Store

Grant Fritchey shows us how to find a query in the Query Store:

The primary views you’ll want are sys.query_store_query and sys.query_store_query_text. They join together based on the query_text_id. Let’s take four scenarios and see if we can retrieve the correct query.

This is a little more complicated than I would have hoped, but as Grant notes, this is the CTP, so maybe there will be some tooling added to make life a little easier.
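
The shape of that join is easy to sketch. The example below is a stand-in only: it uses Python's built-in sqlite3 with two small tables that borrow the view and column names (`query_id`, `query_text_id`, `query_sql_text`), so it runs without a SQL Server instance. The sample rows and the `find_queries` helper are hypothetical, and Grant's post covers the real scenarios.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in tables: only the columns needed for the join are modelled.
conn.executescript("""
CREATE TABLE query_store_query (query_id INTEGER, query_text_id INTEGER);
CREATE TABLE query_store_query_text (query_text_id INTEGER, query_sql_text TEXT);
INSERT INTO query_store_query VALUES (1, 10), (2, 20);
INSERT INTO query_store_query_text VALUES
    (10, 'SELECT * FROM dbo.Orders WHERE OrderID = 42'),
    (20, 'SELECT * FROM dbo.Customers');
""")

def find_queries(connection, fragment):
    """Return (query_id, query_sql_text) pairs whose text contains fragment."""
    sql = """
        SELECT q.query_id, t.query_sql_text
        FROM query_store_query AS q
        JOIN query_store_query_text AS t
          ON q.query_text_id = t.query_text_id
        WHERE t.query_sql_text LIKE ?
        ORDER BY q.query_id
    """
    return connection.execute(sql, (f"%{fragment}%",)).fetchall()

matches = find_queries(conn, "dbo.Orders")
print(matches)
```

Searching on a text fragment is only one way in, and probably the bluntest one.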

Index Scans

Gail Shaw proves that sometimes, an index scan isn’t a full index scan:

A scan of the entire index is over 1600 pages. This query read three. It’s a scan, but it’s a scan which stopped after reading one page of the leaf (the other two are likely the root and intermediate pages, used to locate the first page in the leaf).

The scan read one row and then stopped, because that’s all that was needed. It did that, because there was a 1 row row-goal added to the query. For more details on row goals, see Paul White’s article on the subject.

Read the whole thing.

Corollary:  sometimes a seek isn’t really a seek; sometimes it’s a scan even when the icon says “seek” because that’s more efficient.
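
A row goal is easy to picture as a consumer that stops pulling rows. Here is a toy model, not SQL Server internals: the page layout (1,600 leaf pages of 100 keys each) and the `index_scan` name are invented, and only leaf pages are counted, so the root and intermediate reads from Gail's example are left out.

```python
from itertools import islice

# Hypothetical leaf level: 1,600 pages of 100 ordered keys each.
leaf_pages = [list(range(p * 100, (p + 1) * 100)) for p in range(1600)]
stats = {"pages_read": 0}

def index_scan(pages):
    """Yield rows in index order, counting each leaf page as it is touched."""
    for page in pages:
        stats["pages_read"] += 1
        for row in page:
            yield row

# A row goal of 1: the consumer asks for one row and never pulls again,
# so the scan never advances past the first leaf page.
first_row = list(islice(index_scan(leaf_pages), 1))
print(first_row, stats["pages_read"])
```

The operator is still a scan. It just never gets asked for a second row.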

Managing SQL Server Error Logs

Mike Fal shows us how to manage SQL Server error logs using PowerShell:

Reading SQL Server error logs is pretty simple, but requires us to make use of the SMO objects and the .ReadErrorLog() method to get the data from the log. Let’s start by simply reading the SQL Server error log and see what the output looks like (I’ll be using the SQL Server provider for this)

The big benefit, as Mike points out, is that the log rows are considered objects, meaning you don’t need to try to parse out things like the log date.

Fixing Partition Table Boundary Points

Kendra Little shows us how to fix a common partitioning issue:

Everything was fine, but we forgot to put a boundary point in for Jan 1, 2016… and now it’s well into January. All our data for January 2016 is in the partition with the December data.

The more time passes, the more data is going to go into that partition, and it’ll get bigger and more lopsided. If we’re switching out old partitions by month, eventually that’s not going to work. And partition elimination won’t work for anything after Dec 1, 2015, either!

Be sure to keep those empty partitions around, just in case.
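
To see why the missing boundary lumps January in with December, here is a small sketch of how a value maps to a partition number. It assumes a RANGE RIGHT partition function with monthly boundaries, which is my assumption rather than something the excerpt states. The boundary list and the `partition_number` helper are illustrative.

```python
from bisect import bisect_right
from datetime import date

def partition_number(boundaries, value):
    """Partition for value under RANGE RIGHT: each boundary starts a new partition."""
    # bisect_right counts the boundaries that are <= value.
    return bisect_right(boundaries, value) + 1

# Assumed boundaries, with Jan 1, 2016 forgotten.
boundaries = [date(2015, 10, 1), date(2015, 11, 1), date(2015, 12, 1)]
december = partition_number(boundaries, date(2015, 12, 15))
january = partition_number(boundaries, date(2016, 1, 15))
print(december, january)    # same partition number for both months

# With the missing boundary in place, January gets its own partition.
fixed = sorted(boundaries + [date(2016, 1, 1)])
december_fixed = partition_number(fixed, date(2015, 12, 15))
january_fixed = partition_number(fixed, date(2016, 1, 15))
print(december_fixed, january_fixed)
```

This only models the boundary arithmetic. It says nothing about what it takes to split a partition that already holds data, which is the part Kendra's post walks through.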


January 2016