Unsupervised Decision Trees

William Vorhies describes what unsupervised decision trees are:

In anomaly detection we are attempting to identify items or events that don’t match the expected pattern in the data set and are by definition rare.  The traditional ‘signature based’ approach widely used in intrusion detection systems creates training data that can be used in normal supervised techniques.  When an attack is detected the associated traffic pattern is recorded and marked and classified as an intrusion by humans.  That data then combined with normal data creates the supervised training set.

In both supervised and unsupervised cases decision trees, now in the form of random forests are the weapon of choice.  Decision trees are nonparametric; they don’t make an assumption about the distribution of the data.  They’re great at combining numeric and categoricals, and handle missing data like a champ.  All types of anomaly data tend to be highly dimensional and decision trees can take it all in and offer a reasonably clear guide for pruning back to just what’s important.

To be complete, there is also category of Semi-Supervised anomaly detection in which the training data consists only of normal transactions without any anomalies.  This is also known as ‘One Class Classification’ and uses one class SVMs or autoencoders in a slightly different way not discussed here.

Interesting reading.  I’d had no idea that unsupervised decision trees were even a thing.

Stop Using word2vec

Chris Moody wants you to stop using word2vec:

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them. So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.

Chris has a follow-up post on word tensors as well:

There’s only three steps to computing word tensors. Counting word-word-document skipgrams, normalizing those counts to form the PMI-like M tensor and then factorizing M into smaller matrices.

But to actually perform the factorization we’ll need to generalize the SVD to higher rank tensors 1. Unfortunately, tensor algebra libraries aren’t very common 2. We’ve written one for non-negative sparse tensor factorization, but because the PMI can be both positive and negative it isn’t applicable here. Instead, for this application I’d recommend HOSVD as implemented in scikit-tensor. I’ve also heard good things about tensorly.

I’m going to keep using word2vec for now, but it’s a good pair of posts.

Tracking Long-Running Queries

Ryan Booz walks us through tracking long-running queries with sp_whoisactive:

This solution runs sp_WhoIsActive every minute and saves the output into a global temp table. From there, I look for any processes that have been running for more than the low threshold setting. Any of the processes that have not been identified and stored previously get logged, output to an HTML table, and an email alert sent.

Next, I take a second look at the table for anything that’s been running longer than the high threshold.  If a second email alert has not been sent for these processes, we output the same data and send the email. If two alerts have already been sent for these processes, I don’t do anything else at the moment. One of the next updates to this script will send an alert to our DevOps notification system for anything running longer than some final threshold (or maybe just the high threshold).

I particularly like this part about not re-alerting over and over for a long-running query.  It’s a relatively minor part of the whole solution, but it gets annoying watching the same e-mail come in every 5 minutes, especially if there’s nothing you can (or at least want to) do about the cause.

The OOM Killer Cometh

Venu Cherukupalli shows how to keep Linux’s Out of Memory Killer from taking down SQL Server:

When an index rebuild was kicked off on a large table (around 25GB), the reindex operation terminated, and the availability group had failed over to the other replica.

Upon further investigation, we discovered that the SQL Server process terminated at the time reindex operation was run and this resulted in the failover.

To determine the reason for the unexpected shutdown, we reviewed the Linux System Logs (/var/log/messages on RHEL) & pacemaker logs. From the pacemaker logs and system logs, we saw entries indicating that oom-killer was invoked, and as a result SQL Server process was terminated.

Read on for the two solutions.  I was hoping for a solution that involved making the SQL Server executable immune from oom-killer’s wily ways, but not so much in this post.

Memory-Optimized Table Maintenance

Ned Otter has a great post looking at what you can and cannot do with memory-optimized tables containing certain types of indexes:

Now, let’s attempt to create a NONCLUSTERED COLUMNSTORE INDEX:

ALTER TABLE dbo.InMemADD
ADD INDEX NCCI_InMem NONCLUSTERED COLUMNSTORE (col1);

Msg 10794, Level 16, State 76, Line 76
The feature ‘NONCLUSTERED COLUMNSTORE’ is not supported with memory optimized tables.

It fails because we can only create a CLUSTERED columnstore index (CCI). For 25 years, Microsoft SQL Server differentiated between indexes that physically ordered data on storage (CLUSTERED) and those that did not (NONCLUSTERED). Unfortunately, they chose to ignore that pattern when creating the syntax for memory-optimized tables; using the word CLUSTERED is required when creating a columnstore index on memory-optimized tables. 

It’s a great post with plenty of trial and error.

Data Lakes Aren’t New

Shannon Lowder reveals one of the deep, dark data lake secrets:

Turns out there are three basic zones or areas to a data lake. Raw, Managed, and Presentation.

The raw zone should be optimized for fast storage.  The goal is to get the data in as quickly as possible.  Don’t make any changes to this data.  You want it stored as close to the original format as possible.  It sounds just like staged data to me.  Data you’d build an extract package to get from source to your staging environment, right?

Maybe you’re thinking this is just a coincidence…let’s move on.

Spoilers:  it’s not a coincidence.

Linked Servers And Inaccessible Statistics

Jason Brimhall troubleshoots an error message involving linked servers and statistics:

On more than one occasion I have had an emergency request because everything was broken. The everything in almost every incident is an SSIS package that is failing with error messages. The error message will typically have text similar to the following:

Could not locate statistics ‘_WA_Sys_00000015_346C780E’ in the system catalogs.

Due to the error, the package fails processing and grinds to a halt. When diving into the package it is discovered that the missing stats happen to be coming from a linked server query. This raises a big bright blaring alarm for me. Why is the SSIS package accessing the data via a linked server? This is rather counter-productive and definitely contrary to what is desired from a performance perspective.

Jason methodically walks us through the troubleshooting process and provides the solution at the end.

Modifying Availability Group Endpoint URLs

Ronald Dameron shows how to change the endpoint URLs for Availability Group replicas:

I recently had to modify the Endpoint URLs on our SQL Server Availability Group replicas.

The reason for this blog post is that I could not answer the following questions:

Do I need to suspend data movement prior to making this change? 

Would this change require a restart of the database instance?

Now Ronald can, and after reading his post, you’ll be able to as well.

Reasons For Using Docker With R

Jeroen Ooms gives us a few reasons why we might want to containerize our R-based products:

The flagship of the OpenCPU system is the OpenCPU server: a mature and powerful Linux stack for embedding R in systems and applications. Because OpenCPU is completely open source we can build and ship on DockerHub. A ready-to-go linux server with both OpenCPU and RStudio can be started using the following (use port 8004 or 80):

docker run -t -p 8004:8004 opencpu/rstudio

Now simply open http://localhost:8004/ocpu/ and http://localhost:8004/rstudio/ in your browser! Login via rstudio with user: opencpu (passwd: opencpu) to build or install apps. See the readme for more info.

This is in the context of one particular product, but the reasons fit other scenarios too.  H/T R-Bloggers

Temporal Table Permissions

Kenneth Fisher shows us the permissions needed to create temporal tables:

Msg 13538, Level 16, State 3, Line 6
You do not have the required permissions to complete the operation.

Well, that’s not good. What permissions do I need exactly? Well, again, according to BOL I need CONTROL on the table and its history table. For those that don’t know CONTROL is the top level permission for any object. You can do anything at all with it.

Read the whole thing.  I gather the reason for requiring this level of access is that you don’t want people to go monkeying with data collected for auditing purposes.

Categories

October 2017
MTWTFSS
« Sep  
 1
2345678
9101112131415
16171819202122
23242526272829
3031