Press "Enter" to skip to content

Author: Kevin Feasel

Evaluating A Data Science Project

Tom Fawcett gives us an interesting evaluation of a data science case study:

The model is a fully connected neural network with three hidden layers, with a ReLU as the activation function. They state that data from Google Compute Engine was used to train the model (implemented in TensorFlow), and Cloud Machine Learning Engine’s HyperTune feature was used to tune hyperparameters.

I have no reason to doubt their representation choices or network design, but one thing looks odd. Their output is two ReLU (rectifier) units, each emitting the network’s accuracy (technically: recall) on that class. I would’ve chosen a single Softmax unit representing the probability of Large Loss driver, from which I could get a ROC or Precision-Recall curve. I could then threshold the output to get any achievable performance on the curve. (I explain the advantages of scoring over hard classification in this post.)

But I’m not a neural network expert, and the purpose here isn’t to critique their network design, just their general approach. I assume they experimented and are reporting the best performance they found.

Read the whole thing.

Reversing Dynamic Data Masking

Joe Obbish shows how easy it is to reverse Dynamic Data Masking:

Armed with our new knowledge, we can create a single SQL query that decodes all of the SSNs. The strategy is to define a single CTE with all ten digits and to use one CROSS APPLY for each digit in the SSN. Each CROSS APPLY only references the SSN column in the WHERE clause and returns the matching prefix of the SSN that we’ve found so far. Here’s a snippet of the code:

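The snippet itself didn’t survive this excerpt, but a minimal sketch of the technique might look like the following (the table and column names are hypothetical, and this decodes only the first two digits; Joe’s post builds out the full solution):

    -- Each CROSS APPLY references the masked column only in its WHERE clause,
    -- where Dynamic Data Masking still evaluates predicates against the real data.
    WITH digits AS (
        SELECT d FROM (VALUES ('0'),('1'),('2'),('3'),('4'),
                              ('5'),('6'),('7'),('8'),('9')) AS v(d)
    )
    SELECT p.id, d2.prefix AS first_two_digits
    FROM dbo.People AS p
    CROSS APPLY (
        SELECT x.d AS prefix
        FROM digits AS x
        WHERE p.SSN LIKE x.d + '%'
    ) AS d1
    CROSS APPLY (
        SELECT d1.prefix + x.d AS prefix
        FROM digits AS x
        WHERE p.SSN LIKE d1.prefix + x.d + '%'
    ) AS d2;
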
Click through for progressively faster solutions.  This is the main reason I do not care for DDM as a feature.  Its main benefit seems to be preventing shoulder-surfing on reports; any determined attacker with even a little query-writing access can subvert it.

Substrings In SQL Server Versus Oracle

Daniel Janik continues his SQL Server versus Oracle syntax comparison series:

Parsing strings is a feature that is often needed in the database world, and SUBSTRING/SUBSTR are designed to do just that. I find it interesting how these two platforms approached the functions differently, and that definitely shows how you can do many things to get to the same answer.

It’s a short post, but Daniel does show one big difference between the Oracle and SQL Server substring functions.
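
That difference is easy to demo (my own sketch, not Daniel’s code): SQL Server’s SUBSTRING requires all three arguments, while Oracle’s SUBSTR makes the length optional and even accepts a negative start position, which counts back from the end of the string.

    -- SQL Server: start and length are both mandatory
    SELECT SUBSTRING('Curated SQL', 9, 3);       -- 'SQL'

    -- Oracle: length is optional; a negative start counts from the end
    SELECT SUBSTR('Curated SQL', 9) FROM dual;   -- 'SQL'
    SELECT SUBSTR('Curated SQL', -3) FROM dual;  -- 'SQL'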

When To Use Always Encrypted

Brent Ozar gives us some good pointers on when to use Always Encrypted:

But that comes with a few big drawbacks. They’re really well-documented, but here are the highlights:

Do you need to query that data from other apps? Do you have a data warehouse, reporting tools, PowerBI, Analysis Services cubes, etc? If so, those apps will also need to be equipped with the latest database drivers and your decryption certificates. For example, here’s how you access Always Encrypted data with PowerBI. Any app that expects to read the encrypted data is going to need work, and that’s especially problematic if you’re replicating the data to other SQL Servers.

Click through to read the rest.  Always Encrypted was designed to encrypt a few columns, not everything in a database.
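
If you haven’t seen the syntax, encryption is declared per column.  Here is a minimal sketch (the key name is hypothetical, and the column master key and column encryption key must already exist):

    CREATE TABLE dbo.Customers
    (
        CustomerID int IDENTITY(1,1) PRIMARY KEY,
        -- Deterministic encryption permits equality comparisons and joins;
        -- it requires a BIN2 collation on character columns.
        SSN char(11) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH
            (
                COLUMN_ENCRYPTION_KEY = CEK_Customers,
                ENCRYPTION_TYPE = DETERMINISTIC,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL
    );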

Investigating The OS Workers DMV

Ewald Cress continues his DMV internals series:

wait_started_ms_ticks is set in SOS_Task::PreWait(), i.e. just before actually suspending, and again cleared in SOS_Task::PostWait(). For more about the choreography of suspending, see here.

wait_resumed_ms_ticks is set in SOS_Scheduler::PrepareWorkerForResume(), itself called by the mysteriously named but highly popular SOS_Scheduler::ResumeNoCuzz().

start_quantum is set for the Resuming and InstantResuming case within SOS_Scheduler::TaskTransition(), called by SOS_Scheduler::Switch() as the worker is woken up after a wait.

Ewald intends this post as an extension of the official documentation, so it’s best to read that documentation in conjunction with this post.
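
To watch these columns in action, a simple query against the DMV (mine, not Ewald’s) will do:

    -- Wait/resume timestamps and quantum accounting for suspended workers
    SELECT w.worker_address,
           w.state,
           w.last_wait_type,
           w.wait_started_ms_ticks,
           w.wait_resumed_ms_ticks,
           w.start_quantum,
           w.end_quantum
    FROM sys.dm_os_workers AS w
    WHERE w.state = N'SUSPENDED';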

Storing Sensitive Information In SSIS

Shannon Lowder shows the complex interplay between Biml and SSIS when it comes to handling credentials:

One of the questions I get when teaching others how to use Biml is how to deal with sensitive information like usernames and passwords in your Biml solution. No one wants to leave this information in plain text in a solution.  You need access to it while interrogating your sources and destination connections for metadata.  You also need it while Biml creates your SSIS packages, since SSIS uses SELECT statements at design time to gather its metadata.  If you lock away that sensitive information too tightly, you won’t be effective while building your solutions.

In the end, you’ll have to compromise between security and efficacy.

Read on for more.

AllDefinedSuccessors In Biml

Ben Weissman shows how to push a common value to all children which share a certain property:

One great way to introduce default values in Biml would be variables in include files or code files, for example. But depending on what you’re trying to achieve, or at what point you realize you need them, this may already be causing some extra work.

For example: you have a couple of different ways to create a dataflow task, but in the end, they should all share a property like DefaultBufferMaxRows.

In BimlStudio, you could make use of a transformer, but these are not available in BimlExpress.

As a bonus, this is a bilingual post on two fronts, so you can pick up a little English-German translation as well as a little VB.Net-C# translation.

Regression Trees And Double Seasonal Time Series Trends

Peter Laurinec walks us through an example of using regression trees to solve a problem with double-seasonal time series data in R:

Classification and regression trees (or decision trees) are a broadly used machine learning method for modeling. They are favorites because of these factors:

  • simple to understand (white box)
  • from a tree, we can extract interpretable results and make simple decisions
  • they are helpful for exploratory analysis, as the binary structure of a tree is simple to visualize
  • very good prediction accuracy
  • very fast
  • they can easily be tuned with ensemble learning techniques

But! There is always some “but”: they adapt poorly when new, unexpected situations (values) appear. In other words, they cannot detect and adapt to change or concept drift well (absolutely not). This is because, during learning, a tree creates just simple rules based on the training data. A simple decision tree does not compute regression coefficients the way linear regression does, so trend modeling is not possible. You might ask, then, why we are talking about time series forecasting with regression trees at all, right? I will explain how to deal with this in more detail further in this post.

This was a very interesting article.  Absolutely worth reading.  H/T R-Bloggers

Managing Hive Slowly Changing Dimensions

Carter Shanklin shows how to manage Type 1, 2, and 3 slowly changing dimensions in Hive:

The most common SCD update strategies are:

  • Type 1: Overwrite old data with new data. The advantage of this approach is that it is extremely simple, and it is used any time you want an easy way to synchronize reporting systems with operational systems. The disadvantage is that you lose history any time you do an update.

  • Type 2: Add new rows with version history. The advantage of this approach is that it allows you to track full history. The disadvantage is that your dimension tables grow without limit and may become very large. When you use Type 2 SCD you will also usually need to create additional reporting views to simplify the process of seeing only the latest dimension values.

  • Type 3: Add new columns to manage limited version history. The advantage of Type 3 is that you get some version history, but the dimension tables remain at the same size as the source system. You also won’t need to create additional reporting views. The disadvantage is that you get limited version history, usually only covering the most recent 2 or 3 changes.

The Hive solution is getting closer and closer to a traditional relational warehouse solution.  And on the whole, that’s a good thing.
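
To see the shape of this in code: Hive’s ANSI MERGE statement (available on ACID transactional tables) handles the Type 1 case in a single statement. This is a minimal sketch with hypothetical table and column names, not code from Carter’s post:

    -- Type 1 SCD: overwrite matching rows in place (history is lost).
    -- Requires the target to be a transactional (ACID) table.
    MERGE INTO customer_dim AS t
    USING customer_updates AS s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
        UPDATE SET city = s.city, state = s.state
    WHEN NOT MATCHED THEN
        INSERT VALUES (s.customer_id, s.city, s.state);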

Kafka Connect To Elasticsearch

Robin Moffatt shows how to take data from Kafka Connect and feed it into Elasticsearch:

Whilst Kafka Connect is part of Apache Kafka itself, if you want to stream data from Kafka to Elasticsearch you’ll want the Confluent Open Source distribution (or at least, the Elasticsearch connector).

The configuration is pretty simple. As before, see inline comments for details.

It’s worth noting that if you’re using the same converter throughout your pipelines (Avro, in this case), you’d actually put this in the Connect worker config itself rather than repeating it for each connector configuration.

This is a simple example which shows just how easy it can be.
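
For a sense of just how little configuration is involved, here is a rough sketch of an Elasticsearch sink connector config in properties format (the name, topic, and URL are placeholders; Robin’s post has the real, commented version):

    name=elasticsearch-sink
    connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
    topics=orders
    connection.url=http://localhost:9200
    type.name=kafka-connect
    key.ignore=true
    schema.ignore=false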
