
Category: Self-Promotion

RetainSameConnection

I explain what the RetainSameConnection property on an Integration Services connection does:

My co-worker had set up a dynamic connection (see Rafael Salas or Hari Bagra for details on how to do this), but something weird was happening:  the package was trying to push everything to the same server.  I confirmed that if all relevant customers loaded were for the same server, the process would work correctly, and that I could run each server’s load one at a time, so there weren’t any problems connecting to particular servers or with parameters overriding this choice.  It’s like the connection was “sticky,” connecting successfully to the first server and then ignoring the later changes.

RetainSameConnection allows for certain benefits, but has specific limitations.  Click through to see those details.

Comments closed

SQL Saturday Attendee Distances

I have a long, long post on figuring out how far SQL Saturday attendees tend to drive:

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

Despite my protestations and complaints, I think there are some reasonable conclusions.  If you need to look like you’re working for a couple of hours (or at least want to play around a bit with SQL and R), this is the post for you.

Comments closed

The Benefits Of Polybase

I take a look at running a Hadoop query against a big(gish) data set:

Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop.  That’s hardly a fair setup for a distributed processing system.  Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible.  This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important.  I’m using the same hardware in all three cases, so in that sense this is a fair comp.

Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face.  I have several ideas for how to tune and want to continue down this track, showing various ways to optimize Polybase and Hive queries.
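To give a sense of the shapes being compared, here is a rough sketch of the two SQL Server-side queries; the table name, linked server name, and predicate are hypothetical stand-ins rather than the actual test queries:

-- PolyBase: query the external table as though it were a local table;
-- eligible predicates can be pushed down to the Hadoop cluster.
SELECT COUNT(*)
FROM dbo.Flights
WHERE dest = 'ORD';

-- Linked server: one plausible shape, pushing a HiveQL query through an
-- ODBC-based linked server via OPENQUERY.
SELECT lsq.flight_count
FROM OPENQUERY
(
    HIVE_LINK,
    'SELECT COUNT(*) AS flight_count FROM flights WHERE dest = ''ORD'''
) lsq;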

Comments closed

Unpivoting With APPLY

I have a post on using the APPLY operator to unpivot data:

This code works, but if you have dozens of years, it gets messy writing those case statements and you’re a bit more likely to make a mistake when refactoring code. Here’s a simpler version using CROSS APPLY:

SELECT
    s.Product,
    y.[Year],
    y.Quantity,
    y.[Value]
FROM #Sales s
    CROSS APPLY (VALUES
        (2013, [Qty2013], [Val2013]),
        (2014, [Qty2014], [Val2014]),
        (2015, [Qty2015], [Val2015])
    ) y([Year], Quantity, [Value]);

It’s a little easier to read than the other version, and adding additional years is pretty straightforward.  That makes this a great technique when you’re refactoring poorly-thought-out tables or bringing potentially well-thought-out flat files into your system.
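If you want to play along at home, here is a minimal sketch of the denormalized #Sales table the query assumes; the column names come from the query itself, but the data types are my guesses:

-- Hypothetical structure and sample rows for the denormalized table.
CREATE TABLE #Sales
(
    Product VARCHAR(50) NOT NULL,
    Qty2013 INT NOT NULL,
    Val2013 DECIMAL(10, 2) NOT NULL,
    Qty2014 INT NOT NULL,
    Val2014 DECIMAL(10, 2) NOT NULL,
    Qty2015 INT NOT NULL,
    Val2015 DECIMAL(10, 2) NOT NULL
);

INSERT INTO #Sales
(
    Product, Qty2013, Val2013, Qty2014, Val2014, Qty2015, Val2015
)
VALUES
    ('Widget', 10, 100.00, 12, 126.00, 9, 99.00),
    ('Gadget', 5, 250.00, 8, 400.00, 11, 550.00);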

APPLY is an elegant solution to so many different classes of problem.

Comments closed

Getting Started With Spark

I discuss getting up and running with Databricks Community Edition:

There are a couple of notes with these clusters:

  1. These are not powerful clusters.  Don’t expect to crunch huge data sets with them.  Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.

  2. The cluster will automatically terminate after one hour without activity.  The paid version does not have this limitation.

  3. You interact with the cluster using notebooks rather than opening a command prompt.  In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
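Regarding that third point, here is a minimal sketch of what a notebook cell can look like if you prefer SQL: put the statement in a cell that starts with the %sql magic command.  This assumes a Spark 2.x cluster, the dataset path is a guess based on the databricks-datasets mount that ships with the workspace, and the two statements would normally go into separate cells:

-- Expose a built-in sample CSV file as a temporary view.
CREATE TEMPORARY VIEW diamonds
USING csv
OPTIONS
(
    path "/databricks-datasets/Rdatasets/data-csv/ggplot2/diamonds.csv",
    header "true",
    inferSchema "true"
);

-- Run a simple aggregation against the view.
SELECT
    cut,
    COUNT(*) AS diamond_count
FROM diamonds
GROUP BY cut
ORDER BY diamond_count DESC;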

Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free.  Read the whole thing.

Comments closed

Polybase DMVs

I look at the DMVs associated with Polybase and external table creation:

Let’s walk through this one step at a time and understand what the DMV is telling us.  Unfortunately, the DMV documentation is a little sparse, so some of this is guesswork on my part.

  1. A RandomIDOperation appears to create a temporary table.  In this case, the table (whose name is randomly generated) is named TEMP_ID_53.  I’m not sure where that name comes from; the session I ran this from was 54, so it wasn’t a session ID.

  2. After the table gets created, each Compute node gets told to create a table called TEMP_ID_53 in tempdb whose structure matches our external table’s structure.  One thing you can’t see from the screenshot is that this table is created with DATA_COMPRESSION = PAGE.  I have to wonder if that’d be the same if my Compute node were on Standard edition.

  3. We add an extended property on the table, flagging it as IS_EXTERNAL_STREAMING_TABLE.

  4. We then update the statistics on that temp table based on expected values.  629 rows are expected here.

  5. Then, we create the dest stat, meaning that the temp table now has exactly the same statistics as our external table.

  6. The next step is that the Head node begins a MultiStreamOperation, which tells the Compute nodes to begin working.  This operator does not show up in the documentation, but we can see that the elapsed time is 58.8 seconds, which is just about as long as my query took.  My guess is that this is where the Head node passes code to the Compute nodes and tells them what to do.

  7. We have a HadoopRoundRobinOperation on DMS, which stands for “Data Movement Step” according to the location_type documentation.  What’s interesting is that according to the DMV, that operation is still going.  Even after I checked it 40 minutes later, it still claimed to be running.  If you check the full query, it’s basically a SELECT * from our external table.

  8. Next is a StreamingReturnOperation, which includes our predicate WHERE dest = 'ORD'.  This is a Data Movement Step and includes all of the Compute nodes (all one of them, that is) sending data back to the Head node so that I can see the results.

  9. Finally, we drop TEMP_ID_53 because we’re done with the table.
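If you want to trace this yourself, here is a sketch of the sort of queries I am running; I believe sys.dm_exec_distributed_requests and sys.dm_exec_distributed_request_steps are the DMVs in play, and the execution_id filter below is a placeholder you would swap for your own request’s ID:

-- Find recent PolyBase distributed requests.
SELECT TOP(10)
    r.execution_id,
    r.status,
    r.total_elapsed_time
FROM sys.dm_exec_distributed_requests r
ORDER BY r.end_time DESC;

-- Break a single request down into its steps, including the SQL and DMS
-- operations described above.
SELECT
    s.step_index,
    s.operation_type,
    s.location_type,
    s.status,
    s.total_elapsed_time,
    s.command
FROM sys.dm_exec_distributed_request_steps s
WHERE s.execution_id = 'QID1234'  -- Placeholder; use your own execution_id.
ORDER BY s.step_index;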

This post was about 70% legwork and 30% guesswork.  That’s a bit higher a percentage than I’d ideally like, but there isn’t that much information readily available yet, so I’m trying (in my own small way) to fix that.

Comments closed

Where Polybase Stats Live

I dig into where the statistics against a Polybase table actually live:

Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table).  We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.

Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats.  The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.
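To make that last point concrete, here is a sketch of the drop-and-recreate cycle; the external table, column, and statistic names are hypothetical:

-- Drop the existing statistic on the external table.
DROP STATISTICS dbo.Flights.st_Flights_dest;

-- Re-create it, sampling rather than scanning the full table; behind the
-- scenes, the sampled rows get loaded into a temp table and the statistics
-- get built from that temp table.
CREATE STATISTICS st_Flights_dest
ON dbo.Flights (dest)
WITH SAMPLE 25 PERCENT;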

The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.

Comments closed

Configuring Polybase

I have a blog post up on configuring Polybase:

Microsoft’s next recommendation is to make sure that predicate pushdown is enabled.  To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there:  cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text reader you want).  Copy the value for yarn.application.classpath, which should be a pretty long string.  Mine looks like:

<value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>

Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml.  Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.
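If you want to confirm that pushdown actually works after this change, one option is the FORCE EXTERNALPUSHDOWN query hint.  Here is a sketch with hypothetical table and column names; if the configuration is wrong, forcing pushdown should fail with an error rather than quietly streaming all of the rows back:

-- Force the predicate to be evaluated as a MapReduce job on the Hadoop side.
SELECT COUNT(*)
FROM dbo.Flights
WHERE dest = 'ORD'
OPTION (FORCE EXTERNALPUSHDOWN);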

This is part one of a series on using Polybase.

Comments closed

SQL Data Partners Podcast: The Wide World Of Data

Carlos L. Chacon was nice enough to interview me on his podcast:

The expansion of data sets and increased expectations of businesses for analysis and modeling of data has led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use–before somebody else puts them to sub-optimal use.  I am joined by Kevin Feasel who walks us through some of the technologies available and sorts out under what circumstances we want to consider using each one.

Show notes are on the SQL Data Partners podcast site.  My presentation slides are available online.  And if I get just a few more people to dig Aphyr as much as I do, the world will be a better place.

Comments closed

Normalize

I walk through a scenario which underscores the importance of normalization:

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.
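For anyone who hasn’t seen the FOR XML PATH concatenation trick before, here is a minimal sketch of the idea; the table and column names are hypothetical stand-ins for what tallyjoin does:

-- Glue each customer + campaign's four-digit values back together into one
-- string, then swap the commas for CHAR(30) record separators.
SELECT
    sv.CustomerID,
    sv.CampaignID,
    REPLACE(
        STUFF(
            (
                SELECT ',' + sv2.SplitVersion
                FROM #SplitVersions sv2
                WHERE sv2.CustomerID = sv.CustomerID
                    AND sv2.CampaignID = sv.CampaignID
                ORDER BY sv2.SplitVersion
                FOR XML PATH('')
            ), 1, 1, ''),
        ',', CHAR(30)) AS RemovedValue
FROM #SplitVersions sv
GROUP BY sv.CustomerID, sv.CampaignID;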

The implicit story here is that you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), and how to tie the various pieces of the puzzle together…or you can design the database the right way and avoid this pain later.

Comments closed