Press "Enter" to skip to content

Curated SQL Posts

Buck Woody On R & Python

Buck Woody’s back to blogging, and his focus is data science.  Over the past month, he’s looked at R and Python.

First, on installing R:

In future notebook entries we’ll explore working with R, but for now, we need to install it. That really isn’t that difficult, but it does bring up something we need to deal with first. While the R environment is truly amazing, it has some limitations. It’s most glaring issue is that the data you want to work with is loaded into memory as a frame, which of course limits the amount of data you can process for a given task. It’s also not terribly suited for parallelism – many things are handled as in-line tasks. And if you use a package in your script, you have to ensure others load that script, and at the right version.

Enter Revolution Analytics – a company that changed R to include more features and capabilities to correct these issues, along with a few others. They have a great name in the industry, bright people, and great products – so Microsoft bought them. That means the “RRE” engine they created is going to start popping up in all sorts of places, like SQL Server 2016, Azure Machine Learning, and many others. But the “stand-alone” RRE products are still available, and at the current version. So that’s what we’ll install.

Also on installing and getting started with Python:

Python has some distinct differences that make it attractive for working in data analytics. It scales well, is fairly easy to learn and use, has an extensible framework, has support for almost every platform around, and you can use it to write extensive programs that work with almost any other system and platform.

R and Python are the two biggest languages in this slice of the field, and you’ll gain a lot from learning at least one of these languages.

Comments closed

Hash Join Performance

Paul White wrote something.  That’s good enough to get tagged here:

It might be tempting to ask why the optimizer does not routinely add null-rejecting filters prior to equality joins. One can only suppose that this would not be beneficial in enough common cases. Most joins are not expected to encounter many null = null rejections, and adding predicates routinely could quickly become counter-productive, particularly if many join columns are present. For most joins, rejecting nulls inside the join operator is probably a better option (from a cost model perspective) than introducing an explicit Filter.

It does seem that there is an effort to prevent the very worst cases from manifesting through the post-optimization rewrite designed to reject null join rows before they reach the build input of a hash join. It seems that an unfortunate interaction exists between the effect of optimized bitmap filters and the application of this rewrite. It is also unfortunate that when this performance problem does occur, it is very difficult to diagnose from the execution plan alone.

I’m only on reading #2 of the post, so I’ve got a few more things yet to learn from it.

Comments closed

Get Started With U-SQL

Microsoft is pushing U-SQL pretty hard.  Here’s a tutorial by Jonathan Gao to whet your appetite:

U-SQL is a language that unifies the benefits of SQL with the expressive power of your own code to process all data at any scale. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database. It enables you to process unstructured data by applying schema on read, insert custom logic and UDF’s, and includes extensibility to enable fine grained control over how to execute at scale. To learn more about the design philosophy behind U-SQL, please refer to this Visual Studio blog post.

You do need Data Lake Tools for Visual Studio, but it looks like you can run it locally.

The VS blog had something a month ago on the topic.  I’m not saying get on it…yet…

Comments closed

Drop IF EXISTS

It’s syntactic sugar, but oh, is it tasty.  From Manoj Pandey:

To make sure I checked the MSDN BOL and found that this is a new feature added to the SQL Server 2016 version. And as per this msdn article this enhancement has been add with theCTP 3 release.

For IF EXISTS option/syntax the MSDN BoL mentions: Conditionally drops the [object] only if it already exists.
This is a very helpful enhancement added to these DDL statements and would reduce a lot of effort and coding lines.

Create IF NOT EXISTS is also a thing.  A glorious thing.

Comments closed

Grouping Clauses

William Brewer goes into nice detail on the grouping clauses ROLLUP, CUBE, and GROUPING SETS.

ROLLUP and CUBE had their heyday before SSAS. They were useful for providing the same sort of facilities offered by the cube in OLAP. It still has its uses though. In AdventureWorks, it is overkill, but if you are handling large volumes of data you need to pass over your data only once, and do as much as possible on data that has been aggregated. Events that happened in the past can’t be changed, so it is seldom necessary to retain historic data on an active OLTP system. Instead, you only need to retain the aggregated data at the level of detail (‘granularity’) required for all foreseeable reports.

Imagine you are responsible for reporting on a telephone switch that has two million or so calls a day. If you retain all these calls on your OLTP server, you are soon going to find the SQL Server labouring over usage reports. You have to retain the original call information for a statutory time period, but you determine from the business that they are, at most, only interested in the number of calls in a minute. Then you have reduced your storage requirement on the OLTP server to 1.4% of what it was, and the call records can be archived off to another SQL Server for ad-hoc queries and customer statements. That’s likely to be a saving worth making. The CUBE and ROLLUP clauses allow you to even store the row totals, column totals and grand totals without having to do a table, or clustered index, scan of the summary table.

As long as changes aren’t made retrospectively to this data, and all time periods are complete, you never have to repeat or alter the aggregations based on past time-periods, though grand totals will need to be over-written!

I’ve used ROLLUP and GROUPING SETS fairly regularly but not so much CUBE.  Read the whole thing and figure out that the aggregation & summarization monster you have to maintain can maybe be re-written in a much simpler way.

Via Database Weekly.

Comments closed

Session Wait Stats

SQL Server 2016 has a per-session wait stats DMV:  sys.dm_exec_session_wait_stats.  That’s exciting; wait stats are extremely interesting, but until now, impossible to use on a per-item level in a busy production system (where you’d most want to use them).

Daniel Farina looks at how the new DMV relates to sys.dm_exec_wait_stats (via Database Weekly):

[R]esetting the data of sys.dm_os_wait_stats operating system view doesn’t affect the values of sys.dm_exec_session_wait_stats view.

Based on my MSDN reading, the sys.dm_exec_session_wait_stats DMV resets if the connection pool context is re-used or if the session closes.  This is why DBCC SQLPERF doesn’t include a reset option for session-specific wait stats.

Comments closed

Microsoft & Red Hat Sittin’ In A Tree

Microsoft and Red Hat have joined together to support Linux in Azure.

Customers can already run Linux on Azure, but the new partnership will expand support for running so-called “hybrid clouds,” in which applications may exist in both private data centers and on public cloud services. More significantly, Microsoft and Red Hat support teams will work together from the same facilities to support Red Hat customers using Azure. Microsoft vice president of cloud and enterprise Scott Guthrie said during a webcast today that this is the first time that he knows of that Microsoft has “co-located” support teams with another company.

The deal is the latest example of Microsoft playing nice with a former rival. “When we started [Red Hat Enterprise Linux] I never would have thought we’d do this,” Red Hat president of product and technology Paul Cormier said during the webcast.

Free speculation with no evidence:  at some point, Microsoft will offer SQL Server on Linux.  My guess is 3-5 years from now, but other co-speculators have suggested maybe even as soon as 18 months.  Whatever the case, I’ll be a happy man when I can run SSMS in Linux.

Comments closed

Nested Transactions Aren’t

Friends don’t let friends nest transactions:

Before getting into the details, I need to make one thing clear. Nested transactions are a lie. They do not exist in SQL Server.

This is part 1 of a three-part series by Gail Shaw.  Read the whole thing.  Also read Paul Randal:

Nested transactions do not actually behave the way the syntax would have you believe. I have no idea why they were coded this way in SQL Server – all I can think of is someone from the dim and distant past is continually thumbing their nose at the SQL Server community and going “ha – fooled you!!”.

Nested transactions are somebody’s attempt at trolling.  They succeeded.

Comments closed

Goodbye 32-Bit SQL

32-bit SQL Server is gone; long live 64-bit SQL Server:

Does this mean there will be no 32-bit version of SQL Server 2016? They may make some desktop version; I don’t know nor have I been following. But as a Server product? RIP, and good riddance.

So you can thank (or damn) me for this one. Me, I’m going to celebrate. Where’s my bottle of Coca Cola with real sugar?

Allan Hirt, I thank you.  I have one 32-bit device left:  it’s a cheap tablet.  Let’s not wait until 2038 to get rid of x86.

Comments closed