Methods To Improve Model Accuracy

Tristan Robinson shows how to go back to the drawing board when your model’s accuracy isn’t cutting it:

One of the reoccurring principles that appears with machine learning is that of Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. Therefore there is a need of balance between accuracy and simplicity to limit the feature set which tends to lead to better predictions. Simpler models are also more interpretable to humans which also helps. While the data I was working with was limited to around 35 features, there are many data science problems which have thousands of features and so this technique is even more crucial.

There are multiple methods to perform feature selection, of which a few will be covered here. The first method is greedy backward selection which starts with all the features and then finds the feature that hurts predictive power the least when removed, and you remove it. This is done iteratively until a point is met (which will be discussed later). Its known as greedy since it never looks back after removing the feature each time.

An alternative method is greedy forward selection which is basically the inverse, starts with no features, and looks for the feature that by itself is the best model. This then carries on in a similar vein to the backward selection but adding features. The point at which you stop with forward selection is that of diminishing returns for your accuracy.

Read the whole thing.  This is explanation rather than demonstration, but the explanation applies to pretty much any implementation you’re using.

JupyterLab Now Available

Project Jupyter announces the general availability of JupyterLab:

JupyterLab is an interactive development environment for working with notebooks, code and data. Most importantly, JupyterLab has full support for Jupyter notebooks. Additionally, JupyterLab enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area.

JupyterLab provides a high level of integration between notebooks, documents, and activities:

  • Drag-and-drop to reorder notebook cells and copy them between notebooks.

  • Run code blocks interactively from text files (.py, .R, .md, .tex, etc.).

  • Link a code console to a notebook kernel to explore code interactively without cluttering up the notebook with temporary scratch work.

  • Edit popular file formats with live preview, such as Markdown, JSON, CSV, Vega, VegaLite, and more.

I like this, as I’m a big fan of notebooks but sometimes you just want to write some diagnostic queries and an IDE is way better for that. H/T Giovanni Lanzani

Normalizing To Boyce-Codd Normal Form

I am a big fan of Boyce-Codd Normal Form:

Boyce-Codd Normal Form is a generalization of Second and Third Normal Forms.  There are a couple of requirements to be in Boyce-Codd Normal Form.  First, your table must be in First Normal Form.  This means that:

  • Every entity (row) has a consistent shape.  This is something relational databases do for you automatically:  you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
  • Every entity has a unique value.  You can uniquely identify any particular row.
  • Every attribute is atomic:  you don’t try to pack more than one value into a single attribute.
  • There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.

The other half of BCNF is that every determinant on an entity is a key.

Also click through for an iterative, easy-to-follow process to get to BCNF.

Using psake To Load FunctionsToExport

Cody Konior has a good post on using psake to populate the FunctionsToExport section of a module definition:

The “best practice” and proper way of handling this then is to add an entry to the array in this file every time you create a new function. Manually. What a pain in the ass right? But there’s a better way!

If you haven’t seen psake before you can take a look later. Start up PowerShell as Administrator and install psake from the PowerShell Gallery using Install-Module psake and then you’re ready to go.

How does psake work? It’s the PowerShell equivalent of a Makefile. Basically:

  • You add a file in the root of your module called psakefile.ps1

  • When you run Invoke-psake from that location, it will load and execute the file

For a one-off module with one or two functions, it’s probably not worth it, but once you get to several functions (or if you’re building modules regularly), this looks like a time-saver.

See The Pernicious Effects Of Your UDFs

Pedro Lopes announces an improvement to SQL Server execution plan results in 2017 CU3:

As I mentioned on yesterday’s post, with the recent release of SQL Server 2017 CU3, we released yet more showplan enhancements: you can see other posts related to showplan enhancements here.

In this article I’ll talk about the second showplan improvement we worked on, to assist in the discoverability of UDF usage impact on query execution.

The scenario is that if a query uses one or more user-defined scalar functions (such as T-SQL functions or CLR functions), a significant portion of query execution time may be spent inside those functions, depending on the function definition. And this may be not immediately obvious by looking at the query execution plan.

Recently, we added information on overall query CPU and elapsed time tracking for statistics showplan xml (both in ms), found in the root node of an actual plan (on which I blogged about here). We now added two new attributes: UdfCpuTime and UdfElapsedTime. These provide the total CPU and elapsed time (again, both in ms) that is spent inside all scalar user-defined functions, during the execution of a query.

I love it.  UDFs have historically been silent query killers, as the execution plan would gleefully think that the function call is practically free because it’d only show a single iteration.

When UNION ALL Can Beat OR

Bert Wagner compares a couple methods for writing a query:

Suddenly those key-lookups become too expensive for SQL Server and the query optimizer thinks it’ll be faster to just scan the entire clustered index.

In general this makes sense; SQL Server tries to pick plans that are good enough in most scenarios, and in general I think it chooses wisely.

However, sometimes SQL Server doesn’t pick great plans. Sometimes the plans it picks are downright terrible.

If that particular topic is interesting, I’ve a blog post from a few years back on a similar vein.

Data Discovery And Classification In SQL Server

Gilad Mittelman explains how the SQL Information Protection (aka Data Discovery and Classification) process works in SQL Server and Azure SQL Database:

SQL Information Protection (SQL IP) introduces a set of advanced services and new SQL capabilities, forming a new information protection paradigm in SQL aimed at protecting the data, not just the database:

  • Discovery & recommendations – The classification engine scans your database and identifies columns containing potentially sensitive data. It then provides you an easy way to review and apply the appropriate classification recommendations via the Azure portal.

  • Labeling – Sensitivity classification labels can be persistently tagged on columns using new classification metadata attributes introduced into the SQL Engine. This metadata can then be utilized for advanced sensitivity-based auditing and protection scenarios.

  • Monitoring/Auditing – Sensitivity of the query result set is calculated in real time and used for auditing access to sensitive data (currently in Azure SQL DB only).

  • Visibility – The database classification state can be viewed in a detailed dashboard in the portal. Additionally, you can download a report (in Excel format) to be used for compliance & auditing purposes, as well as other needs.

Check it out, especially with GDPR breathing down our necks.

SSMS 17.5 Released

Alan Yu announces SQL Server Management Studio 17.5:

SSMS 17.5 provides support for almost all feature areas on SQL Server 2008 through the latest SQL Server 2017, which is now generally available.

In addition to enhancements and bug fixes, SSMS 17.5 comes with several exciting new features:

  • Data Discovery and Classification is a new feature for discovering, classifying, labeling and reporting sensitive data in your databases.
  • Query Editor now has support for a SkipsRows option for the Delimited Text External File Format for Azure SQL Data Warehouse.
  • Showplan has enabled the display of estimated plan button for SQL Data Warehouse.

SSMS 17.5 also includes key bug fixes to Showplan, Query Editor, SMO, and Templates, which can be found in the Release Notes.

Aside from data discovery and classification, it’s probably not a major update for most people, but this does look like an interesting tool.

Visual Studio Code In Anaconda 5.1

George Leopold reports that Anaconda 5.1 will now include Visual Studio Code as an optional IDE:

Microsoft and Python data science platform vendor Anaconda have extended their partnership by adding the software giant’s code editor to the latest Anaconda distribution.

The addition of Microsoft’s Visual Studio Code (VS Code) expands its support for the latest release of the Python data science platform, Anaconda 5.1. The Python platform has attracted more than 4.5 million users running the programming language on Windows, Mac and Linux.

Along with editing and debugging features, the partners said the cross-platform code editor includes custom features for Anaconda users. For example, a Python extension customizes VS Code for the Python development environment.

Read on for more information.

Streaming ETL In Practice Using KSQL

Robin Moffatt builds an example of streaming ETL using Oracle, GoldenGate, and Kafka:

So in this post I’m going to show an example of what streaming ETL looks like in practice. I’m replacing batch extracts with event streams, and batch transformation with in-flight transformation of these event streams. We’ll take a stream of data from a transactional system built on Oracle, transform it, and stream it into Elasticsearch to land the results to, but your choice of datastore is up to you—with Kafka’s Connect API you can stream the data to almost anywhere! Using KSQL we’ll see how to filter streams of events in real-time from a database, how to join between events from two database tables, and how to create rolling aggregates on this data.

It’s a very useful example.


February 2018
« Jan