Curated SQL is on semi-holiday for Memorial Day. Instead of posting new links to items of interest in the wide world of data, I want to point out a couple longer works that I normally would not be able to talk about given the site’s motif. If you’re in the office on a slow day, here are a few items that will get you through.
Comments closedMonth: May 2016
Steve Jones gives a simple recursive CTE example and an important lesson on compounding:
Recursion is an interesting computer science technique that stumps lots of people. When I was learning programming, it seemed that recursion (in Pascal) and pointers (in C), were the weed out topics.
However, they aren’t that bad, and with CTEs, we can write recursion in T-SQL. I won’t cover where this might be used in this post, though I will give you a simple CTE to view.
There are two parts you need: the anchor and the recursive member. These are connected with a UNION ALL. There can be multiple items, but we’ll keep things simple.
You can play query golf and find a way to remove the recursion, but it’s good to know how to create a recursive CTE. It’s also good to know that you typically do not want recursion in a database process…
Comments closedEwald Cress looks at SystemThread:
SystemThread, a class within sqldk.dll, can be considered to be at the root of SQLOS’s scheduling capabilities. While it doesn’t expose much obviously exciting functionality, it encapsulates a lot of the state that is necessary to give a thread a sense of self in SQLOS, serving as the beacon for any code to find its way to an associated SQLOS scheduler etc. I won’t go into much of the SQLOS object hierarchy here, but suffice it to say that everything becomes derivable by knowing one’s SystemThread. As such, this class jumps the gap between a Windows thread and the object-oriented SQLOS.
The simplest way to conceptualise a thread in SQL Server is to think of a spid or connection busy executing a simple query, old skool sysprocesses style. It’s not hip, but it’s close enough to the truth to be useful. This conflates a few things that are separate entities, but it is a good starting point for teasing things apart as needed
This is another dive into internals; prepare your thinking caps.
Comments closedBen Snively discusses using Spark SQL as part of an ETL process:
Now interact with SparkSQL through a Zeppelin UI, but re-use the table definitions you created in the Hive metadata store. You’ll create another table in SparkSQL later in this post to show how that would have been done there.
Connect to the Zeppelin UI and create a new notebook under the Notebook tab. Query to show the tables. You can see that the two tables you created in Hive are also available in SparkSQL.
There are a bunch of tools in here, but for me, the moral of the story is that SQL is a great language for data processing. Spark SQL has gaps, but has filled many of those gaps over the past year or so, and I recommend giving it a shot.
Comments closedTwitter is open sourcing Heron, a competitor to Apache Storm:
Apache Storm was the original solution to Twitter’s problems. It was created by a marketing intelligence company called BackType, and Twitter bought the company in 2011 and eventually open-sourced Storm, providing it to the Apache Foundation.
There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or systems for receiving data from established sources. But it was reputedly also hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its own revised streaming framework.
It’s good to see this competition in the streaming space.
Comments closedSlava Murygin shows how to create a box and whisker plot in SSMS using spatial data types:
If you have no idea what Box-and-Whisker Plot is, please visit following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
At first, I will show how to do it based on AdventureWorks database in SQL Server 2014.
We will analyze amounts of Individual lines of Sales Orders within each month.
The first step is to create a Data Set to process. That Data Set will contain a Month, Single Line amount and order number of that record within a month.
This is really cool…but I wonder if it wouldn’t be better to do this in R, where it’d take a lot less code. If you can’t reach out to R, though, this is a good way of visualizing results.
Comments closedSQLWayne has a script to help find unused indexes:
Here’s some code that can show you what indexes are unused or empty. An empty index just means that there’s no data in that table right now, it may always be populated later, so I would not drop an empty index. Besides, how much space would an empty index take?
For my personal preferences, I order the output by table then index name, also I put a u.* at the end of the select statement so the more interesting usage stat columns can be seen.
If an index truly is unused, it’s a waste of resources. The problem is, sometimes you’ll think an index is unused but it’s really a vital part of month-end reporting or used for the CEO’s favorite dashboard.
Comments closedKendra Little answers a question on version novelty:
Dear SQL DBA,
What are your thoughts on the early adoption of new SQL Server versions? Specifically, if the business is willing to assume the risk of early adoption just to get one new feature that they can probably live without, should DBAs be happy and willing to assume that risk too? Or, is it our responsibility to “just say no” until it has been tested? I would like to hear about any experience you have with this. Thanks.
Bleeding in Edgeville
Go read Kendra’s answer because it’s a good one. My answer is, I want to be on the edge. I’ve run into V1 bugs and had to spike projects before they made it to production, but if there’s a good benefit to moving, and if your business side is supportive, I’d lean heavily toward fast upgrades.
Comments closedBrent Ozar reports on a NOLOCK bug in SQL Server 2014 SP1 CU6:
-
While one transaction is holding an exclusive lock on an object (Ex. ongoing table update), another transaction is executing parallelized SELECT (…) FROM SourceTable, using the NOLOCK hint, under the default SQL Server lock-based isolation level or higher. In this scenario, the SELECT query trying to access SourceTable will be blocked.
-
Executing a parallelized SELECT (…) INTO Table FROM SourceTable statement, specifically using the NOLOCK hint, under the default SQL Server lock-based isolation level or higher. In this scenario, other queries trying to access SourceTable will be blocked.
The current recommendation is not to install CU6 until the issue is fixed.
Comments closedAdnan Masood and David Lazar have a list of fallacies in the world of data science:
Extrapolating beyond the range of training data, especially in the case of time series data, is fine providing the data-set is large enough.
Strong Evidence is same as a Proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
These are some good things to think about if you’re getting into analytics.
Comments closed