BigQuery Versus Redshift

Kiyoto Tamura compares Google’s BigQuery versus Amazon’s Redshift for cloud-based warehousing:

Neither service is truly “set and forget” and requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain automation scripts and workflows.

That said, here are things that I’ve heard first-hand from talking to users

The bottom line there is that Redshift is a bit more mature than BigQuery today, but keep an eye on both of them.

LAST_VALUE

Kevin Feasel

2016-06-22

T-SQL

Steve Jones plays with a window function new to SQL Server 2012:

The important thing to understand with window functions is that there is a frame at any point in time when the data is being scanned or processed. I’m not sure what the best term to use is.

Let’s look at the same data set Kathi used. For simplicity, I’ll use a few images of her dataset, but I’ll examine the SalesOrderID. I think that can be easier than looking at the amounts.

Here’s the base dataset for two customers, separated by CustomerID and ordered by the OrderDate. I’ve included amount, but it’s really not important.

Steve goes into detail and explains what’s going on each step of the way.  Window functions are extremely useful; check them out if you’re not already familiar with them.

Connecting SQL Server To Hadoop Using Polybase

I have a post up on using Polybase to create an external table which points to Hadoop:

An interesting thing about FIELD_TERMINATOR is that it can be multi-character.  MSDN uses ~|~ as a potential delimiter.  The reason you’d look at a multi-character delimiter is that not all file formats handle quoted identifiers—for example, putting quotation marks around strings that have commas in them to indicate that commas inside quotation marks are punctuation marks rather than field separators—very well.  For example, the default Hive SerDe (Serializer and Deserializer) does not handle quoted identifiers; you can easily grab a different SerDe which does offer quoted identifiers and use it instead, or you can make your delimiter something which is guaranteed not to show up in the file.

You can also set some defaults such as date format, string format, and data compression codec you’re using, but we don’t need those here.  Read the MSDN doc above if you’re interested in digging into that a bit further.

It’s a bit of a read, but the end result is that we can retrieve data from a Hadoop cluster as though it were coming from a standard SQL Server table.  This is easily my favorite feature in SQL Server 2016.

Blitz Scripts Open Sourced

Kevin Feasel

2016-06-22

Tools

Brent Ozar announces that the sp_Blitz series of scripts is now open source:

Our prior copyright license said you couldn’t install this on servers you don’t own. We’d had a ton of problems with consultants and software vendors handing out outdated or broken versions of our scripts, and then coming to us for support.

Now, it’s a free-for-all! If you find the scripts useful, go ahead and use ’em. Include sp_Blitz, sp_BlitzCache, sp_BlitzIndex, etc as part of your deployments for easier troubleshooting.

This is very good news.

Data Science At Stack Overflow

David Robinson discusses his role as a data scientist at Stack Overflow:

The most prominent example of where machine learning is used in our product is Providence; our system for matching users to jobs they’ll be interested in. (For example, if you visit mostly Python and Javascript questions on Stack Overflow, you’ll end up getting Python web development jobs as advertisements). I work with engineers on the Data team (Kevin Montrose,Jason Punyon, and Nick Larsen) to design, improve and implement these machine learning algorithms. (Here’s some more about the architecture of the system, built before I joined). For example, we’ve worked to get the balance right between jobs that are close to a user geographically and jobs that are well-matched in terms of technology, and ensuring that users get a variety of jobs rather than seeing the same ones over and over.

A lot of this process involves designing and analyzing A/B tests, particularly about changing our targeting algorithms, ad design, and other factors to improve clickthrough rate (CTR). This process is more statistically interesting than I’d expected, in some cases letting me find new uses for methods I’d used to analyze biological experiments, and in other cases encouraging me to learn new statistical tools. In fact, much of my series on applying Bayesian methods to baseball batting statistics is actually a thinly-veiled version of methods I’ve used to analyze CTR across ad campaigns.

Sounds like a fun place to be.

Configuring Polybase

I have a blog post up on configuring Polybase:

Microsoft’s next recommendation is to make sure that predicate pushdown is enabled.  To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there.  To do that, cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text reader you want).  Copy the value for yarn.application.classpath, which should be a pretty long string.  Mine looks like:

1
<value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>

Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml.  Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.

This is part one of a series on using Polybase.

Reshaping Data With R

Kevin Feasel

2016-06-21

R

Alberto Giudici compares tidyr to reshape2 for data cleansing in R:

We see a different behaviour: gather() has brought messy into a long data format with a warning by treating all columns as variable, while melt() has treated trt as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if melt() does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather() requires the columns that needs to be treated as ids, all the other columns are going to be used as key-value pairs.

Despite those last different results, we have seen that the two functions can be used to perform the exactly same operations on data frames, and only on data frames! Indeed, gather() cannot handle matrices or arrays, while melt() can as shown below.

It seems that these two tools have some overlap, but each has its own point of focus:  tidyr is simpler for data tidying, whereas reshape2 has functionality (like data aggregation) which tidyr does not include.

Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Sparse Columns

Slava Murygin discusses sparse columns:

“Sparsing” is the way SQL Server optimizes spacing for NULL values at the cost of overhead for non-NULL values.
In other words, if you expect having in your column more nulls than not nulls you can SPARSE that column to optimize the space.

I’ve seen the situations when a lot of columns in Data Mart tables were almost completely filled with NULLS and I started wondering if “SPARSE” can be a good tool to gain some space.

Read the whole thing.  I am not a fan of sparse columns because they prohibit things like page-level compression.  Be sure to read the restrictions on using sparse columns before you give them a try; on net, I think they’re more trouble than they’re worth except in edge cases like extremely denormalized tables collecting thousands of data points from sensors.

Handling Division Failure

Reza Rad looks at how division can go wrong in Power Query:

In this example I just returned zero if I find error. but you can return error message if you like with [Revenue Per Item][ErrorMessage]. This method is great error handling method when an error out of blue happens in your data set. I always recommend using TRY method to get rid of errors that might stop the whole solution to work properly.

I have to mention that steps above are separated to show you how the output of try expression looks like. In fact you can combine both steps above in single step with TRY OTHERWISE as below (Thanks to Maxim Zelensky for pointing this out);

The end result is code which is a bit more complex, but safely handles a number of edge cases.

Categories

September 2018
MTWTFSS
« Aug  
 12
3456789
10111213141516
17181920212223
24252627282930