Month: August 2017

K Nearest Cliques

Published 2017-08-18 by Kevin Feasel

Vincent Granville explains an algorithm built around finding cliques of data points:

The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.

Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density. Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.

It’s an interesting approach, and I can see how it’d be faster than K Nearest Neighbors, but I do wonder how accurate the results would be in comparison to KNN.

Comments closed

Your Reminder Not To MERGE

Published 2017-08-18 by Kevin Feasel

Kevin Wilkie points out the numerous problems with the MERGE operator:

Now, when I last posted, I’m sure you thought I was done talking about the MERGE statement. You are so wrong, compadre! One more post is absolutely needed!

There are a few issues with the MERGE statement. Well, as of this writing, there are 361 possible issues according to Microsoft Connect – the actual website where Microsoft checks to see what issues exist!

So, if you want to use the MERGE statement, please read through every issue listed on the link above and make sure that none of those scenarios could exist for you. If they don’t, great. Knock yourself out and use it.

But wait, there’s more! Read on to see what else could be a problem.

Comments closed

OS Threads DMV

Published 2017-08-18 by Kevin Feasel

Ewald Cress moves up the internals stack a little further and looks at a DMV:

Broadly speaking, a DMV presents just another iterator that can be plugged into a query plan. The execution engine calls GetRow() repeatedly until it reaches the end, and the iterator emits rows. The only unusual thing is that the ultimate source of the data may have nothing to do with the storage engine.

Now if you asked me to guess where in the world we’d find a list of all threads to iterate over, I would have expected that we’d start with the NodeManager, iterating over all SOS_Nodes, and then for each of them iterating over its collection of associated SystemThreads. After all, we have a guaranteed 1:1 correspondence between threads and SystemThreads, and I figured that all SystemThreads enlist themselves into a parent SOS_Node upon creation. No-brainer, right?

Turns out that this guess would have been completely wrong, and the reason it would have been a broken implementation will become apparent when we look at the started_by_sqlservr column.

Definitely read the whole thing, but you may need to top off your caffeinated beverage of choice first. Also, I’m pretty sure Ewald promised to hammer one out once per week; that’s how I read it, at least…

Comments closed

One CLR Solution

Published 2017-08-18 by Kevin Feasel

Solomon Rutzky continues his SQL Server 2017 CLR security series:

This new requirement prevents the technique described towards the end of Part 1 from working. That technique uses a SAFE Assembly as an indirect means of creating the Asymmetric Key to create the Login from. That worked perfectly prior to SQL Server 2017, but now even SAFE Assemblies require that the signature-based Login be created first, which now puts us in a whole chicken-egg paradox.

Before proceeding to the solution, it should be noted that yes, Microsoft has, as of RC2 (released on 2017-08-02), provided a kinda/sorta “fix” for this that allows for creating an Assembly without having the signature-based Login. HOWEVER, that “fix” is absolutely horrible, convoluted, and unnecessary. It should not be used by anyone. Ever! In fact, it should be completely removed and forgotten about. In no uncertain terms: it is not an option! To help clarify, I am being intentionally vague about that new feature here (and in Part 1) so as not to distract from these two solutions (this post and Part 3) that do not promote bad practices; it will be covered starting in Part 4.

Solomon outlines one approach to dealing with CLR security changes, though it’s a bit lengthy.

Comments closed

Availability Group DMVs

Published 2017-08-18 by Kevin Feasel

Adrian Buckman shows off a few Availability Group DMVs:

Whilst I was combing through various combinations of sys.dm_hadr and sys.Availability I stumbled across a couple of gems that I thought would be nice to share.

sys.dm_hadr_cluster
sys.dm_hadr_cluster_members
sys.dm_hadr_cluster_networks

Read on for usage examples.

Comments closed

Building An Image Recognizer With R

Published 2017-08-17 by Kevin Feasel

David Smith has a post showing how to build an image recognizer with R and Microsoft’s Cognitive Services Library:

The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.

But if you don’t have millions of images, it’s still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the rxFeaturize function in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.

Read the whole thing and follow David’s link to the Microsoft Cognitive blog for more details.

Comments closed

MERGE In Hive

Published 2017-08-17 by Kevin Feasel

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse. This is one additional step in that direction.

Comments closed

Performance Problems Due To Readable Secondaries

Published 2017-08-17 by Kevin Feasel

Paul Randal describes a problem when you create a readable secondary on an Availability Group:

Yesterday I blogged about log shipping performance issues and mentioned a performance problem that can be caused by using availability group readable secondaries, and then realized I hadn’t blogged about the problem, only described it in our Insider newsletter. So here’s a post about it!

Availability groups (AGs) are pretty cool, and one of the most useful features of them is the ability to read directly from one of the secondary replicas. Before, with database mirroring, the only way to access the mirror database was through the creation of a database snapshot, which only gave a single, static view of the data. Readable secondaries are constantly updated from the primary so are far more versatile as a reporting or non-production querying platform.

But I bet you didn’t know that using this feature can cause performance problems on your primary replica?

Definitely read the whole thing.

Comments closed

Dealing With The Registry From SQL Server

Published 2017-08-17 by Kevin Feasel

Wayne Sheffield shows how to read and modify registry entries using SQL Server:

xp_instance_regread

In this example, I used xp_regread to read the direct registry path. If you remember from earlier, there are SQL Server instance-aware versions of each registry procedure. A comparable statement using the instance-aware procedure would be:

1

2

3

4

EXECUTE master.sys.xp_instance_regread

    ‘HKEY_LOCAL_MACHINE’,

    ‘Software\Microsoft\MSSQLSERVER\SQLServerAgent’,

    ‘WorkingDirectory’;

This statement returns the exact same information. Let’s look at the difference between these – in the first query, the registry path is the exact registry path needed, and it includes “\Microsoft SQL Server\MSSQL12.SQL2014\”. In the latter query, this string is replaced with “\MSSQLSERVER\”. Since the latter function is instance aware, it replaces the “MSSQLSERVER” with the exact registry path necessary for this instance of SQL Server. Pretty neat, isn’t it? This allows you to have a script that will run properly regardless of the instance that it is being run on. The rest of the examples in this post will utilize the instance-aware procedures to make it easier for you to follow along and run these yourself.

Sometimes you just have to change something in the registry from SQL Server. Hopefully that “sometimes” is rare.

Comments closed

Attaching Databases To Docker

Published 2017-08-17 by Kevin Feasel

Andrew Pruski shows one scenario where Docker on Windows is better than Docker on Linux:

One of the (if not the) main benefits of working with SQL in a container is that you can create a custom image to build container from that has all of your development databases available as soon as the container comes online.

This is really simple to do with Windows containers. Say I want to attach DatabaseA that has one data file (DatabaseA.mdf) and a log file (DatabaseA_log.ldf): –
ENV attach_dbs="[{'dbName':'DatabaseA','dbFiles':['C:\\SQLServer\\DatabaseA.mdf','C:\\SQLServer\\DatabaseA_log.ldf']}]"
Nice and simple! One line of code and any containers spun up from the image this dockerfile creates will have DatabaseA ready to go.

However this functionality is not available when working with Linux containers. Currently you cannot use an environment variable to attach a database to a SQL instance running in a Linux container.

Read on to see what you can do if you’re using a Linux container.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31