Configuring Polybase

I have a blog post up on configuring Polybase:

Microsoft’s next recommendation is to make sure that predicate pushdown is enabled.  To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there.  To do that, cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text reader you want).  Copy the value for yarn.application.classpath, which should be a pretty long string.  Mine looks like:


Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml.  Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.

This is part one of a series on using Polybase.

Reshaping Data With R

Kevin Feasel



Alberto Giudici compares tidyr to reshape2 for data cleansing in R:

We see a different behaviour: gather() has brought messy into a long data format with a warning by treating all columns as variable, while melt() has treated trt as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if melt() does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather() requires the columns that needs to be treated as ids, all the other columns are going to be used as key-value pairs.

Despite those last different results, we have seen that the two functions can be used to perform the exactly same operations on data frames, and only on data frames! Indeed, gather() cannot handle matrices or arrays, while melt() can as shown below.

It seems that these two tools have some overlap, but each has its own point of focus:  tidyr is simpler for data tidying, whereas reshape2 has functionality (like data aggregation) which tidyr does not include.

Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Sparse Columns

Slava Murygin discusses sparse columns:

“Sparsing” is the way SQL Server optimizes spacing for NULL values at the cost of overhead for non-NULL values.
In other words, if you expect having in your column more nulls than not nulls you can SPARSE that column to optimize the space.

I’ve seen the situations when a lot of columns in Data Mart tables were almost completely filled with NULLS and I started wondering if “SPARSE” can be a good tool to gain some space.

Read the whole thing.  I am not a fan of sparse columns because they prohibit things like page-level compression.  Be sure to read the restrictions on using sparse columns before you give them a try; on net, I think they’re more trouble than they’re worth except in edge cases like extremely denormalized tables collecting thousands of data points from sensors.

Handling Division Failure

Reza Rad looks at how division can go wrong in Power Query:

In this example I just returned zero if I find error. but you can return error message if you like with [Revenue Per Item][ErrorMessage]. This method is great error handling method when an error out of blue happens in your data set. I always recommend using TRY method to get rid of errors that might stop the whole solution to work properly.

I have to mention that steps above are separated to show you how the output of try expression looks like. In fact you can combine both steps above in single step with TRY OTHERWISE as below (Thanks to Maxim Zelensky for pointing this out);

The end result is code which is a bit more complex, but safely handles a number of edge cases.

SSIS Method Not Found

Regis Baccaro ran into a rather lengthy error when trying to create an SSISDB catalog:

The error I got was :

Method not found: ‘Void Microsoft.SqlServer.Management.IntegrationServices.EnableSsisSupportAlwaysOnSqmHelper.Initialize()’. (Microsoft.SqlServer.IntegrationServices.UITasks)

Looking at the documentation for the namespaceMicrosoft.SqlServer.Management.IntegrationServices I quickly figured out that I would be able to create the SSIS Catalog manually using PowerShell.

But then I couldn’t locate the Microsoft.SqlServer.Management.IntegrationServices dll anywhere except from in the GAC so I had to load it a somewhat cumbersome way (with help from Remo). Below is the script I used for doing that.

It’s a strange error, but Regis does provide a workaround.

UPDATE() Inside Triggers

James Anderson shows how to use the UPDATE() function inside a trigger to operate selectively on data:

This use of the UPDATE function for selective logging can be very useful when used on tables with columns such as: LastOrderDate, LastLoginDate, etc as these columns are often updated but those changes are probably not required to be logged.

One interesting point is that even if our trigger was configured to fire on DELETEs, the UPDATE function would not return true and therefore the change would not be logged. This makes sense as a DELETE affects all columns, so checking for a particular column is not required. If we wanted to log DELETEs to our ProductPriceLog table, we would use a trigger that fired on DELETEs.

But check the comments to make sure you know when UPDATE() fires—it’s not just when a particular column changes values.

Cardinality Estimator Trace Flags

Kim Tripp shows how to set which cardinality estimator you want to use on a per-query basis:

However, the bad news is the QUERYTRACEON is limited to SysAdmin only (be sure to read the UPDATEs at the end of this post). Jack Li (Microsoft CSS) wrote a great article about a problem they solved by using a logon trigger to change the CE for an entire session: Wanting your non-sysadmin users to enable certain trace flags without changing your app? Now, I do want to caution you that setting master to trustworthy is not something you should take lightly. But, you should NOT let anyone other than SysAdmin have any other rights in master (other than the occasional EXEC on an added user-defined SP). Here are a couple of posts to help warn you of the danger:

A warning about the TRUSTWORTHY database option
Guidelines for using the TRUSTWORTHY database setting in SQL Server

Read on for a couple of options.

Columnstore With Integer Sequences

Niko Neugebauer talks about handling sequences and default values within columnstore indexes:

There are still no dictionaries – and trying to rebuild this table will not bring any effect at all, but take a look at the size of the segments – their size was lowered for almost 40% to ~1.6 MB!

This technic is very effective if you are compressing the columns that you do access rarely – it should be considered for the log tables for example.
Also notice that Columnstore Archival compression will not bring any significant changes – the original 2.6 MB will lower to 2.42 while the variable char column will not get any further improvements, making the improvement difference around 32%.

Warning: Do NOT use this technic without understanding the consequences – the processing of such columns will lower their effectiveness, since Predicate Pushdown will work in a very limited way, plus the Segment Elimination will not work at all.



June 2016
« May Jul »