Press "Enter" to skip to content

Author: Kevin Feasel

Configuring Polybase

I have a blog post up on configuring Polybase:

Microsoft’s next recommendation is to make sure that predicate pushdown is enabled.  To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there: cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text editor you want).  Copy the value for yarn.application.classpath, which should be a pretty long string.  Mine looks like:

<value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>

Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml.  Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.

This is part one of a series on using Polybase.
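
Note that the classpath change only matters once the Hadoop connectivity level is set; as a refresher (this bit isn’t from the post), that’s an sp_configure setting, and the value depends on your distribution (7 is one of the Hortonworks-on-Linux options, for example):

-- Sketch: set the Polybase Hadoop connectivity level (pick the value for your distribution).
EXEC sp_configure 'hadoop connectivity', 7;
RECONFIGURE;
-- Then restart SQL Server and the Polybase services so the new settings take effect.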

Comments closed

Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case, we will configure a Time-partitioned File Set that stores data in Avro format by using TPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Comments closed

Sparse Columns

Slava Murygin discusses sparse columns:

“Sparsing” is the way SQL Server optimizes storage for NULL values at the cost of overhead for non-NULL values.
In other words, if you expect a column to contain more NULLs than non-NULLs, you can mark that column as SPARSE to optimize the space.

I’ve seen situations where a lot of columns in Data Mart tables were almost completely filled with NULLs, and I started wondering whether “SPARSE” could be a good tool to gain some space.

Read the whole thing.  I am not a fan of sparse columns because they prohibit things like page-level compression.  Be sure to read the restrictions on using sparse columns before you give them a try; on net, I think they’re more trouble than they’re worth except in edge cases like extremely denormalized tables collecting thousands of data points from sensors.
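
If you want to try them out, the declaration itself is just a column-level keyword; here is a minimal sketch with hypothetical table and column names:

-- Sparse columns must be nullable; they pay off when the vast majority of values are NULL.
CREATE TABLE dbo.SensorReading
(
    SensorReadingID bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CollectionTime datetime2(0) NOT NULL,
    Temperature decimal(9,2) SPARSE NULL,
    Humidity decimal(9,2) SPARSE NULL,
    Voltage decimal(9,2) SPARSE NULL
);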

Comments closed

Handling Division Failure

Reza Rad looks at how division can go wrong in Power Query:

In this example I just returned zero if I find an error, but you can return the error message if you like with [Revenue Per Item][ErrorMessage]. This is a great error handling method when an error out of the blue happens in your data set. I always recommend using the TRY method to get rid of errors that might stop the whole solution from working properly.

I have to mention that the steps above are separated to show you what the output of the try expression looks like. In fact, you can combine both steps above into a single step with TRY OTHERWISE as below (thanks to Maxim Zelensky for pointing this out):

The end result is code which is a bit more complex, but safely handles a number of edge cases.
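
For comparison rather than as a substitute for Reza’s M code, the closest T-SQL analogue to try ... otherwise for this specific problem (hypothetical column names) is the NULLIF-plus-ISNULL pattern:

-- NULLIF turns a zero denominator into NULL, avoiding the divide-by-zero error;
-- ISNULL then supplies the fallback value of zero.
SELECT ISNULL(s.Revenue / NULLIF(s.ItemCount, 0), 0) AS RevenuePerItem
FROM dbo.Sales AS s;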

Comments closed

SSIS Method Not Found

Regis Baccaro ran into a rather lengthy error when trying to create an SSISDB catalog:

The error I got was :

Method not found: ‘Void Microsoft.SqlServer.Management.IntegrationServices.EnableSsisSupportAlwaysOnSqmHelper.Initialize()’. (Microsoft.SqlServer.IntegrationServices.UITasks)

Looking at the documentation for the namespace Microsoft.SqlServer.Management.IntegrationServices, I quickly figured out that I would be able to create the SSIS Catalog manually using PowerShell.

But then I couldn’t locate the Microsoft.SqlServer.Management.IntegrationServices DLL anywhere except in the GAC, so I had to load it in a somewhat cumbersome way (with help from Remo). Below is the script I used for doing that.

It’s a strange error, but Regis does provide a workaround.

Comments closed

UPDATE() Inside Triggers

James Anderson shows how to use the UPDATE() function inside a trigger to operate selectively on data:

This use of the UPDATE function for selective logging can be very useful when used on tables with columns such as LastOrderDate, LastLoginDate, etc., as these columns are often updated but those changes probably do not need to be logged.

One interesting point is that even if our trigger was configured to fire on DELETEs, the UPDATE function would not return true and therefore the change would not be logged. This makes sense as a DELETE affects all columns, so checking for a particular column is not required. If we wanted to log DELETEs to our ProductPriceLog table, we would use a trigger that fired on DELETEs.

But check the comments to make sure you know when UPDATE() fires—it’s not just when a particular column changes values.
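
Here is a minimal sketch of the pattern, using hypothetical table and column names aside from the ProductPriceLog table James mentions:

CREATE TRIGGER dbo.LogProductPriceChange
ON dbo.Product
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- UPDATE(Price) is true whenever Price appears in the SET clause of the firing
    -- statement, even if the new value happens to equal the old one.
    IF UPDATE(Price)
    BEGIN
        INSERT INTO dbo.ProductPriceLog (ProductID, Price, ChangeDate)
        SELECT i.ProductID, i.Price, SYSDATETIME()
        FROM inserted AS i;
    END;
END;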

Comments closed

Cardinality Estimator Trace Flags

Kim Tripp shows how to set which cardinality estimator you want to use on a per-query basis:

However, the bad news is that QUERYTRACEON is limited to SysAdmin only (be sure to read the UPDATEs at the end of this post). Jack Li (Microsoft CSS) wrote a great article about a problem they solved by using a logon trigger to change the CE for an entire session: Wanting your non-sysadmin users to enable certain trace flags without changing your app? Now, I do want to caution you that setting master to trustworthy is not something you should take lightly. But, you should NOT let anyone other than SysAdmin have any other rights in master (other than the occasional EXEC on an added user-defined SP). Here are a couple of posts to help warn you of the danger:

A warning about the TRUSTWORTHY database option
Guidelines for using the TRUSTWORTHY database setting in SQL Server

Read on for a couple of options.
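
To make the syntax concrete, here is a minimal sketch with a hypothetical query: trace flag 9481 reverts a query running under the 2014+ model to the legacy cardinality estimator, while 2312 forces the new one.

SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20160101'
OPTION (QUERYTRACEON 9481);  -- per-query hint; requires sysadmin unless you use a workaround above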

Comments closed

Columnstore With Integer Sequences

Niko Neugebauer talks about handling sequences and default values within columnstore indexes:

There are still no dictionaries – and trying to rebuild this table will not bring any effect at all, but take a look at the size of the segments – their size was lowered by almost 40% to ~1.6 MB!

This technique is very effective if you are compressing columns that you access rarely – it should be considered for log tables, for example.
Also notice that Columnstore Archival compression will not bring any significant changes – the original 2.6 MB will drop to 2.42 MB while the variable char column will not get any further improvements, making the improvement difference around 32%.

Warning: Do NOT use this technique without understanding the consequences – the processing of such columns will lower their effectiveness, since Predicate Pushdown will work in a very limited way, plus Segment Elimination will not work at all.


Comments closed

SOS_RWLock

Ewald Cress looks at SOS_RWLock, a reader-writer lock (at least the pre-2016 version):

This lock class can best be appreciated by comparing it to a mutex. Like the mutex, a reader-writer lock can only be acquired in exclusive mode by one requestor at a time, but instead of only exposing this exclusive-acquire (Writer) option, it alternatively allows acquisition in shared (Reader) mode. This stuff is completely natural to us database folks of course, because the semantics is a subset of the behaviours we get from familiar database locks.

Basic rules of the road:

  • Any number of simultaneous “clients” can share ownership of the lock in Read mode.

  • Readers block writers.

  • Writers block readers and other writers.

  • Blocking means that the requesting worker gets suspended (scheduled off the processor) and accrues a wait of a type specified in the lock acquisition request.

There’s a huge amount of detail here, and I for one am glad that there isn’t a quiz later.
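
If you want the database-lock analogy in runnable form, here is a quick sketch against a hypothetical table; it demonstrates shared/exclusive compatibility with ordinary SQL Server locks, not SQLOS internals:

-- Session 1: hold a shared table lock until commit.
BEGIN TRAN;
SELECT COUNT(*) FROM dbo.SomeTable WITH (TABLOCK, HOLDLOCK);

-- Session 2: another shared reader succeeds immediately...
SELECT COUNT(*) FROM dbo.SomeTable WITH (TABLOCK, HOLDLOCK);

-- ...but a writer blocks until session 1 commits, mirroring the Reader/Writer rules above.
UPDATE dbo.SomeTable SET SomeColumn = SomeColumn;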

Comments closed

U-SQL Parallelism

Saveen Reddy discusses parallelism with U-SQL jobs:

Previously, a Data Lake Analytics account was limited to a total of 60 units of parallelism available to 3 concurrent U-SQL jobs. And each U-SQL job was limited to using a maximum of 20 units of parallelism. However, sometimes developers want to run a U-SQL job that uses more parallelism. For example: they might want two concurrently running jobs that each use 30 units. Today, we are removing the per-job limit. Now you can concurrently run jobs that use any amount of parallelism as long as the total for the running jobs doesn’t go beyond the maximum for your account (currently 60 units of parallelism).

If you are already running U-SQL jobs, there’s no need to worry. This change doesn’t require any action on your part. Your jobs run just as they did before. But now, if you want to, you can take advantage of all the parallelism in your account.

This doesn’t change the available performance units, so there’s no billing change.

Comments closed