Gigantic Row Custom U-SQL Extractor

Phillip Seamark has created a custom U-SQL extractor which handles rows larger than 4 MB:

It seemed some of the rows in my CSV files exceeded an upper limit on how much the Extractor.Csv function can handle and adding the silent:true  parameter didn’t solve the issue.

I dug a bit deeper and found rows in some of the files that are long –  really long.  One in particular was 47MB long just for the row and this was valid data.  I could have manually edited these outs by hand but thought I’d see if I could solve another way.

After some internet research and a couple of helpful tweets to and from Michael Rys, I decided to have a go at making my own custom U-SQL extractor.

Phillip has included the custom extractor code, so if you find yourself needing to parse very large rows of data in U-SQL, you’ll definitely be interested in this.

Counting Rows In Spark With Dplyr

John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.

  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).

  • Confirming data size is the same as reported in other sources (Sparkdatabase, and so on).

  • Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.

The Power Of Out-GridView

Mike F. Robbins shows off Out-GridView in the context of working with Azure:

Although the Out-GridView cmdlet existed in PowerShell version 2.0, you’ll need PowerShell version 3.0 to use it as shown in this blog article. The OutputMode parameter of Out-GridView which is used in this blog article was added in PowerShell version 3.0. Also, Out-GridView can only be used on operating systems with a GUI (it cannot be used on server core).

As far as I know, there’s no way to set a default region in Azure like there is with the AWS Initialize-AWSDefaultConfiguration cmdlet. I guess if you really wanted to set a default, you could always use $PSDefaultParameterValues to set a default value for the Location parameter for the cmdlets that require it.

Out-GridView is great for what it is:  an easy UI within Powershell, with sorting and filtering built in.

Creating Minidumps In SQL Server

Joe Obbish shows how to read the call stack by creating a minidump:

We can use minidumps to generate small files that contain SQL Server call stacks. Note that if you aren’t careful you can end up writing the full contents of memory to a dump file. I believe that writing the file is a single-threaded process, so this can take hours and hundreds of GBs on a real server. Also SQL Server is frozen while the dump is happening, so don’t do it in production.

Dumps are most useful when you want to examine a long running, very simple query or a query that seems to be stuck at a certain point. They aren’t useful to see a full call stack of a non-simple query or to see something that happens just once during execution, like a specific task during query compilation. All that you can get is a snapshot and the snapshot may not contain the information that you’re looking for.

There are multiple ways to generate and read dump files. For this blog post I’ll be using sqldumper.exe and WinDbg.

Click through to read the whole thing.

Working With AT TIME ZONE In SQL Server

Louis Davidson has a post up showing how to use AT TIME ZONE:

I have a series of blogs about temporal tables in SQL Server 2016, which you can see here. However, one thing I never had mentioned was that the Temporal Row Start and End Times are always stored in UTC times (which is generally inconvenient to work with unless you are from the Great Britain area).  However, when you are looking at your data and want to see what the data looked like at ‘Yesterday at 12:00 at the computer you are working from, this can be a bit more work.

In 2016 SQL Server, a new feature was added that totally passed me by. AT TIME ZONE. It is used to manipulate the time zone offset of a point in time value,either to append the time zone to a value, or to shift the time zone value of a DATETIMEOFFSET value. The most common practical use that I will show is converting a value in your local time to UTC (and vice versa).

This is something I tend to forget about, but can be useful when building quick reports from UTC.  I’d store all data in UTC, just to make sure dates and times are consistent across servers in different locations, but you don’t have to do those calculations in your head all the time.

Join Elimination

Lukas Eder has a nice post explaining different forms of automatic join elimination:

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID, which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate, nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY

Read on for a comparison across different products as well.

Categories

September 2017
MTWTFSS
« Aug  
 123
45678910
11121314151617
18192021222324
252627282930