Press "Enter" to skip to content

Author: Kevin Feasel

Polybase: Inserting Into Azure Blob Storage

I have a post which uses Polybase to insert into Azure Blob Storage:

One additional question I have involves whether the process for loading data is round-robin on a row-by-row basis.  My conjecture is that it is not (particularly given that our first example had 4 files with zero records in them!), but I figured I’d create a new table and test.  In this case, I’m using three fixed-width data types and loading 10 million identical records.  I chose to use identical record values to make sure that the text length of the columns in this line were exactly the same; the reason is that we’re taking data out of SQL Server (where an int is stored in a 4-byte block) and converting that int to a string (where each numeric value in the int is stored as a one-byte character).  I chose 10 million because I now that’s well above the cutoff point for data to go into each of the eight files, so if there’s special logic to handle tiny row counts, I’d get past it.

Read on for the exciting(?) conclusion.

Comments closed

Replication Subscriber To Azure SQL DB

Jes Borland continues her series on setting up transactional replication between an on-prem SQL Server Availability Group and Azure SQL Database:

This subscription is going to use an Azure SQL Database.

Go to the AG primary replica. (In this demo, this is SQL2014AG2.)

Expand Replication. Expand Local Publications. Right-click the publication and select New Subscription.

It turns out that this is a basic push subscription.  Jes’s post is full of screenshots, making it even easier to follow.

Comments closed

Computed Columns And Columnstore

Kendra Little exposes a gotcha with non-clustered columnstore indexes and computed columns:

Looking at the execution plan, SQL Server decided to scan the non-clustered columnstore index, even though it doesn’t contain the computed column BirthYear! This surprised me, because I have a plain old non-clustered index on BirthYear which covers the query as well. I guess the optimizer is really excited about that nonclustered columnstore.

Kendra links to a Connect item from Niko Neugebauer to add persisted computed columns to columnstore indexes.

Comments closed

Horizontal Funnel

Devin Knight shows off the horizontal funnel Power BI custom visual:

In this module you will learn how to use the Horizontal Funnel Power BI Custom Visual.  The Horizontal Funnel functions somewhat similar to the traditional funnel but it allows you to display a secondary measure and has a few more customizations than you would normally get. You’ll find that the Horizontal Funnel is great for displaying a flow of data.

One of the better non-sales uses of funnels I’ve seen is tracking completion rates on multi-page forms or multi-step processes.  If you see a huge drop-off at one step in the process, it might indicate a bug in the form or some incongruity with the end user’s expectation.

Comments closed

A T-SQL Date Dimension

Vladimir Oselsky builds a date dimension in T-SQL:

Before we get into discussing how to create it date dimension and how to use it, first let’s talk about what it is and why do we need it. Depending on who you talk to, people can refer to this concept as “Calendar table” or “Date Dimension,” which is usually found in Data Warehouse. No matter how it is called, at the end of the day, it is a table in SQL Server which is populated with different date/calendar related information to help speed up SQL queries which require specific parts of dates.

In my case, I have created it to be able to aggregate data by quarters, years and month. Depending on how large your requirements are it will add additional complexity to building it. Since I don’t care about holidays (for now at least), I will not be creating holiday schedule which can be complicated to populate.

I love date dimensions, even on non-warehouse databases, because it’s an easy way of providing additional context to time series data.  Think about graphing orders per day in an industry with weekday-versus-weekend trends; a date dimension lets you strip out weekends (maybe plotting them separately) or even lets you build day-of-week analysis for each day, or looking at week of the month, etc.  You might also be interested in computing holidays.

Comments closed

Basic Non-Linear Regression In R

Renata Ghisloti Duarte de Souza gives an example of running a non-linear regression in R:

Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a similar dataframe with the same column name and type as the one you used in your model.

Click through for several examples.

Comments closed

Logstash Filters

Nicolas Frankel explains how the grok and dissect filters work in Logstash:

The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

Unfortunately, there’s no app for that – but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
%{pid} %{}[%{thread}] %{class}:%{log}
(broken on 2 lines for better readability)

One of the big secrets to effective debugging of code is having good logging mechanisms in place.

Comments closed

HDInsight With Hive LLAP

Rashin Gupta explains some performance benefits of using Hive 2.0 (LLAP) on HDInsight:

With LLAP, we allow data scientists to query data interactively in the same storage location where data is prepared. This means that customers do not have to move their data from a Hadoop cluster to another analytic engine for data warehousing scenarios. Using ORC file format, queries can use advanced joins, aggregations and other advanced Hive optimizations against the same data that was created in the data preparation phase.

In addition, LLAP can also cache this data in its containers so that future queries can be queried from in-memory rather than from on-disk. Using caching brings Hadoop closer to other in-memory analytic engines and opens Hadoop up to many new scenarios where interactive is a must like BI reporting and data analysis.

Even with this, Hive is still more of a “warehousing” technology, but this moves it closer to real-time (or at least “not slow”) warehousing.

Comments closed

Polybase Execution Plan With Blob Storage

I look at an execution plan and packet capture of a Polybase query which reads from Azure Blob Storage:

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

The upshot is that Polybase behaves very similarly on Azure Blob Storage as it does with on-prem Hadoop for non-MapReduce queries.

Comments closed