Month: March 2019

Finding Singleton Tables

Published 2019-03-13 by Kevin Feasel

Michael J. Swart wants to help you find single tables in your area:

What’s achievable? I want to identify tables to extract from the database that won’t take years. Large monolithic systems can have a lot of dependencies to unravel.
So what tables in the database have the least dependencies? How do I tell without a trustworthy data model? Is it the ones with the fewest foreign keys (in or out)? Maybe, but foreign keys aren’t always defined properly or they can be missing all together.
My thought is that if two tables are joined together in some query, then they’re related or connected in some fashion. So that’s my idea. I can look at the procedure cache of a database in production to see where the connections are. And when I know that, I can figure out what tables are not connected.

Click through for the script to help you do it.

Comments closed

Changing Constraints in Near-Zero Downtime Situations

Published 2019-03-13 by Kevin Feasel

I have part six of my interminable series on near-zero downtime deployments:

The locking story is not the same as with the primary and unique key constraints. First, there’s one extra piece: the transition will block access to dbo.LookupTable as well as the table we create the constraint on. That’s to keep us from deleting rows in our lookup table before the key is in place.
Second, the locks begin as soon as we hit F5. Even SELECT statements get blocked requesting a LCK_M_SCH_S lock. Bad news, people.
So what can we do to get around this problem? Two routes: the ineffectual way and the ugly way.

Despite my being a ray of sunshine here, you should still check this out. It’s shorter than the average Russian novel, at least.

Comments closed

Complexities with Binary Collations

Published 2019-03-13 by Kevin Feasel

Solomon Rutzky takes us through the nuances of binary collations:

Still, there are some complexities related to binary collations that you might not be aware of. To figure out what they are, we need to look at why there are so many binary collations in the first place. I mean, binary collations work on the underlying values of the characters, and comparing numbers doesn’t change between cultures or versions: 12 = 12, 12 > 11, and 12 <13, always. So, then what is the difference between:
– Latin1_General_100_BIN2 and Hebrew_100_BIN2 (only the culture is different), or
– Latin1_General_100_BIN2 and Latin1_General_BIN2 (only the version is different), or
–Latin1_General_100_BIN2 and Latin1_General_100_BIN (only the binary comparison type is different)

Read on to find out.

Comments closed

Finding Missing Index Hints in Query Store

Published 2019-03-13 by Kevin Feasel

Grant Fritchey shows us another place where we can find missing index hints:

A couple of notes on the query. I cast the query_plan as xml so that I can use the XQuery to pull out the information. It is possible that the plan might be so large that you get an error because of the limit on nesting levels within XML. Also, I aggregate the information from the sys.query_store_runttime_stats. You may want to modify this to only look at limited ranges. I’ll leave that to you as an exercise.

Do read Grant’s warning in the conclusion.

Comments closed

Gaps and Islands with Dates

Published 2019-03-13 by Kevin Feasel

Bert Wagner hits one of my favorite topics:

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).
While many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too but with an interesting twist:
How do you determine gaps and islands of data that has overlapping date ranges?

Check out Bert’s explanation of the solution; it’s a good one.

Comments closed

Forcing Job Order In Azure DevOps

Published 2019-03-13 by Kevin Feasel

Chris Taylor shows us how to set up job dependencies in Azure DevOps:

When I first started with VSTS and ultimately Azure DevOps, I went through many failed builds because the order of the jobs in your pipeline don’t run in the order that you’ve built them and how you would logically believe them to run. The image below shows two Build Pipeline jobs but when the build is queued, whether this be manual or via CI, the second job is running before job #1. In this example the build will fail because Job #2 is to deploy a dacpac to a SQL Server on Linux Docker Container (Using Ubuntu Agent Host) but obviously this cannot be done until the dacpac has been created in Job #1 which is running on a VS2017 Agent Host:

Click through to see how it’s done.

Comments closed

Working with Hive in HDInsight

Published 2019-03-12 by Kevin Feasel

Brad Llewellyn takes us through building an HDInsight cluster and writing Hive queries against it:

Hive is a “SQL on Hadoop” technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL. Hive is very useful for performant batch processing on relational data, as it leverages all of the skills that most organizations already possess. Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data. One of this coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format. That’s exactly what we’re going to do in this post. You can read more about Hive here and here and Hive LLAP here.

We understand that SQL queries don’t typically constitute traditional data science functionality. However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore. Hive happens to be one of the best starting points on that journey.

Click through for the screenshot-laden demonstration.

Comments closed

The Costs of Specialization within Data Science

Published 2019-03-12 by Kevin Feasel

Eric Colson argues in favor of data science generalists rather than specialists:

But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendations systems, client engagement bandits, style preference classification, size matching, fashion design systems, logistics optimizers, seasonal trend detection, and more can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With pins, the learning and design are done up-front, before you produce them. With data science, you learn as you go, not before you go.
In the pin factory, when learning comes first we do not expect, nor do we want, the workers to improvise on any aspect the product, except to produce it more efficiently. Organizing by function makes sense since task specialization leads to process efficiencies and production consistency (no variations in the end product).

I think this article captures the downside risk of specialization, but not the downside risks of generalization: some people simply aren’t very good at some things, leading to huge amounts of technical debt down the road or failing a project due to the lack of requisite knowledge or skills. To give a personal example, I have a generalist team, but I still control the data flows (at the very least doing thorough code reviews of any database changes), my application specialist controls app architecture, my statistician reviews algorithms, etc. I don’t claim that this is the best strategy, but a group of pure generalists will have their own set of problems too.

Comments closed

Deploying SQL Server Versions with Kubernetes

Published 2019-03-12 by Kevin Feasel

Anthony Nocentino shows how we can upgrade or even downgrade our SQL Server containers using Kubernetes deployment scripts:

There’s a few things I want to point out in our YAML file. First, we’re using a Deployment Controller. This will implement a Replica Set of the desired number of replicas using the container imaged defined. In this case, we’ll have 1 replica using the SQL Server 2017 CU11 Image. A Replica Set will guarantee that a defined set of Pods are running at any given time, here we’ll have exactly one Pod. We’re using a Deployment Controller, which gives us move between versions of Replica Sets based off different container images in a controlled fashion…more on that in a second.

Read the whole thing.

Comments closed

Parameterized Direct Query In Power BI

Published 2019-03-12 by Kevin Feasel

Gerhard Brueckl shows how you can parameterize your SQL queries in Power BI, letting you switch server or database names and easily target a different location altogether:

I frequently work on projects where we have multiple tiers on which our solution is deployed to using continuous integration / continuous deployment (CI / CD) pipelines in Azure DevOps. Once everything is deployed, you also need to monitor these different environments and check the status of the data or ETL pipelines. My tool of choice is usually Power BI desktop as it allows me to connect to e.g. SQL databases very easily. However, I always ended up creating a multiple Power BI files – one for each environment.
Having multiple files results in a lot of overhead when it comes to maintenance and also managing these files. Fortunately, I came across this little trick when I was investigating in composite models and aggregations that I am going to explain in this blog post.

I have had to do exactly this same thing, so I’m going to have to try it out myself.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31