2017-08-25 – Curated SQL

Classifying Time Series Data With TensorFlow

In this blog post, I will discuss the use of deep learning methods to classify time-series data, without the need to manually engineer features. The example I will consider is the classic Human Activity Recognition (HAR) dataset from the UCI repository. The dataset contains the raw time-series data, as well as a pre-processed version with 561 engineered features. I will compare the performance of typical machine learning algorithms that use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former.

I have used TensorFlow for the implementation and training of the models discussed in this post. In the discussion below, code snippets are provided to explain the implementation. For the complete code, please see my GitHub repository.

Click through for the samples, or check out the repo, linked above.
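
To make "no manually engineered features" a little more concrete, here is a minimal sketch in plain Python of what a single convolutional filter does to a raw time series: slide a small kernel over the signal, apply a ReLU, and pool the result into one feature. This is not the TensorFlow code from the linked post; the toy signal and the kernel values are made up for illustration, and in a real network the kernel weights would be learned during training.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation of signal with kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Collapse a feature map into its single strongest response."""
    return max(values)


# Hypothetical sensor trace: flat, then a sharp step up.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
# Hypothetical edge-detecting kernel (a trained network would learn this).
edge_kernel = [-1.0, 1.0]

feature_map = relu(conv1d(signal, edge_kernel))  # [0.0, 0.0, 1.0, 0.0, 0.0]
feature = global_max_pool(feature_map)           # 1.0
```

The filter responds only where the signal steps up, which is the kind of feature a network can discover on its own instead of having it hand-built.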

Evaluating A Data Science Project

Published 2017-08-25 by Kevin Feasel

Tom Fawcett gives us an interesting evaluation of a data science case study:

The model is a fully connected neural network with three hidden layers, with a ReLU as the activation function. They state that data from Google Compute Engine was used to train the model (implemented in TensorFlow), and Cloud Machine Learning Engine’s HyperTune feature was used to tune hyperparameters.

I have no reason to doubt their representation choices or network design, but one thing looks odd. Their output is two ReLU (rectifier) units, each emitting the network’s accuracy (technically: recall) on that class. I would’ve chosen a single Softmax unit representing the probability of Large Loss driver, from which I could get a ROC or Precision-Recall curve. I could then threshold the output to get any achievable performance on the curve. (I explain the advantages of scoring over hard classification in this post.)

But I’m not a neural network expert, and the purpose here isn’t to critique their network design, just their general approach. I assume they experimented and are reporting the best performance they found.

Read the whole thing.
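
The point about scoring versus hard classification can be sketched in a few lines of Python. This is an illustration only, with made-up scores and labels rather than anything from the case study: when the model emits a score per example, each threshold yields a different precision/recall trade-off, whereas a hard classifier gives you just one point.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = {t: precision_recall(scores, labels, t) for t in (0.25, 0.5, 0.75)}
# curve[0.25] is (0.6, 1.0): every positive is caught, at the cost of false alarms.
# curve[0.75] is (1.0, 2/3): no false alarms, but one positive is missed.
```

Sweeping the threshold over all distinct scores traces out the full precision-recall curve.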

Reversing Dynamic Data Masking

Published 2017-08-25 by Kevin Feasel

Joe Obbish shows how easy it is to reverse Dynamic Data Masking:

Armed with our new knowledge, we can create a single SQL query that decodes all of the SSNs. The strategy is to define a single CTE with all ten digits and to use one CROSS APPLY for each digit in the SSN. Each CROSS APPLY only references the SSN column in the WHERE clause and returns the matching prefix of the SSN that we’ve found so far. Here’s a snippet of the code:

Click through for progressively faster solutions. This is the main reason I do not care for DDM as a feature. Its main benefit seems to be preventing shoulder-surfing on reports; any concerted attacker with a little bit of access to writing queries can subvert it.
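
The digit-by-digit idea can be simulated in Python. This is a sketch of the general technique, not Joe's SQL: `MaskedColumn` is a hypothetical stand-in for a masked column where reads come back masked but predicates (the equivalent of a WHERE clause) are evaluated against the real value. The sample value is made up.

```python
class MaskedColumn:
    """Hypothetical masked column: reads are masked, predicates see the real value."""

    def __init__(self, value):
        self._value = value
        self.predicate_calls = 0

    def select(self):
        # What an unprivileged reader sees when selecting the column.
        return "X" * len(self._value)

    def starts_with(self, prefix):
        # Stand-in for a WHERE-clause prefix comparison against the column.
        self.predicate_calls += 1
        return self._value.startswith(prefix)


def unmask(column, length, alphabet="0123456789"):
    """Recover the value one character at a time using only predicates."""
    prefix = ""
    for _ in range(length):
        for digit in alphabet:
            if column.starts_with(prefix + digit):
                prefix += digit
                break
        else:
            raise ValueError("no digit matched; value is not all digits")
    return prefix


ssn = MaskedColumn("123456789")     # made-up value
masked = ssn.select()               # "XXXXXXXXX"
recovered = unmask(ssn, length=9)   # "123456789"
```

Each position needs at most ten predicate checks, so a nine-digit value falls in at most 90 checks rather than a billion guesses.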

Substrings In SQL Server Versus Oracle

Published 2017-08-25 by Kevin Feasel

Daniel Janik continues his SQL Server versus Oracle syntax comparison series:

Parsing strings is a feature that is often needed in the database world and SUBSTRING/SUBSTR are designed to do just that. I find it interesting how these two platforms approached the functions differently, and that definitely shows how you can do many things to get to the same answer.

It’s a short post, but Daniel does show one big difference between the Oracle and SQL Server substring functions.
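
As a rough illustration of how the two functions can diverge, here is a Python emulation of their documented behavior as I understand it: SQL Server's SUBSTRING requires a length and treats a start below 1 as positions before the string, while Oracle's SUBSTR makes the length optional and counts a negative position from the end. The function names are hypothetical, edge cases are simplified (Python's None stands in for NULL), and I am not claiming this is the specific difference Daniel highlights.

```python
def sqlserver_substring(s, start, length):
    """Approximate T-SQL SUBSTRING(s, start, length); positions are 1-based."""
    if length < 0:
        raise ValueError("negative length is an error in SQL Server")
    end = start + length      # 1-based position just past the last character
    begin = max(start, 1)     # positions before the string return nothing
    if end <= begin:
        return ""
    return s[begin - 1:end - 1]


def oracle_substr(s, position, length=None):
    """Approximate Oracle SUBSTR(s, position [, length]); None models NULL."""
    n = len(s)
    if position == 0:
        position = 1          # Oracle treats position 0 as 1
    start = position - 1 if position > 0 else n + position
    if start < 0 or start >= n:
        return None
    if length is None:
        return s[start:]      # omitted length means "to the end"
    if length < 1:
        return None
    return s[start:start + length]


print(sqlserver_substring("abcdef", 2, 3))  # bcd
print(sqlserver_substring("abcdef", 0, 3))  # ab
print(oracle_substr("abcdef", 0, 3))        # abc
print(oracle_substr("abcdef", -3))          # def
```

Note how the same arguments `("abcdef", 0, 3)` give `ab` on one platform and `abc` on the other.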

When To Use Always Encrypted

Published 2017-08-25 by Kevin Feasel

Brent Ozar gives us some good pointers on when to use Always Encrypted:

But that comes with a few big drawbacks. They’re really well-documented, but here are the highlights:

Do you need to query that data from other apps? Do you have a data warehouse, reporting tools, PowerBI, Analysis Services cubes, etc? If so, those apps will also need to be equipped with the latest database drivers and your decryption certificates. For example, here’s how you access Always Encrypted data with PowerBI. Any app that expects to read the encrypted data is going to need work, and that’s especially problematic if you’re replicating the data to other SQL Servers.

Click through to read the rest. Always Encrypted was designed to encrypt a few columns, not everything in a database.

Investigating The OS Workers DMV

Published 2017-08-25 by Kevin Feasel

Ewald Cress continues his DMV internals series:

wait_started_ms_ticks is set in SOS_Task::PreWait(), i.e. just before actually suspending, and again cleared in SOS_Task::PostWait(). For more about the choreography of suspending, see here.

wait_resumed_ms_ticks is set in SOS_Scheduler::PrepareWorkerForResume(), itself called by the mysteriously named but highly popular SOS_Scheduler::ResumeNoCuzz().

start_quantum is set for the Resuming and InstantResuming case within SOS_Scheduler::TaskTransition(), called by SOS_Scheduler::Switch() as the worker is woken up after a wait.

Ewald intends this post as an extension of the official documentation, so it’s best to read that documentation in conjunction with this post.

Storing Sensitive Information In SSIS

Published 2017-08-25 by Kevin Feasel

Shannon Lowder shows the complex interplay between Biml and SSIS when it comes to handling credentials:

One of the questions I get when teaching others how to use Biml is how to deal with sensitive information like usernames and passwords in your Biml solution. No one wants to leave this information in plain text in a solution. You need access to it while interrogating your source and destination connections for metadata. You also need it while Biml creates your SSIS packages, since SSIS uses SELECT at design time to gather its metadata. If you lock away that sensitive information too tightly, you won’t be effective while building your solutions.

In the end, you’ll have to compromise between security and efficacy.

Read on for more.

AllDefinedSuccessors In Biml

Published 2017-08-25 by Kevin Feasel

Ben Weissman shows how to push a common value to all children which share a certain property:

One great way to introduce default values in Biml would be variables in include files or code files for example. But depending on what you’re trying to achieve or at what point you realize it, it may already be causing some extra work.

For example: You have a couple of different ways to create a dataflow task but in the end, they should all share a property like DefaultBufferMaxRows.

In BimlStudio, you could make use of a transformer, but these are not available in BimlExpress.

As a bonus, this is a bilingual post on two fronts, so you can pick up a little English-German translation as well as a little VB.Net-C# translation.
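
The underlying idea, walking every descendant of a root node and setting a shared property on the ones of a given type, can be sketched in Python. This is an analogy only, not Biml or Ben's code: the `Node` class, the `all_successors` helper, the tree, and the value 50000 are all hypothetical.

```python
class Node:
    """Hypothetical tree node standing in for an object in a package model."""

    def __init__(self, kind, name, children=None):
        self.kind = kind
        self.name = name
        self.children = children or []
        self.props = {}


def all_successors(node):
    """Yield every descendant of node, depth-first, parents before children."""
    for child in node.children:
        yield child
        yield from all_successors(child)


def apply_default(root, kind, prop, value):
    """Set prop=value on every descendant of the given kind; return their names."""
    changed = []
    for node in all_successors(root):
        if node.kind == kind:
            node.props[prop] = value
            changed.append(node.name)
    return changed


# Two dataflow tasks created in different places, one nested in a container.
load1 = Node("Dataflow", "Load1")
load2 = Node("Dataflow", "Load2")
root = Node("Project", "Root", [
    Node("Package", "A", [load1, Node("ExecuteSQL", "Truncate")]),
    Node("Package", "B", [Node("Container", "Seq", [load2])]),
])

changed = apply_default(root, "Dataflow", "DefaultBufferMaxRows", 50000)
# changed is ["Load1", "Load2"]; the ExecuteSQL task is left untouched.
```

However the dataflow tasks were created, a single pass at the end gives them all the same setting.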
