Press "Enter" to skip to content

Month: April 2019

Learning with Limited Data

Shioulin Sam and Nisha Muktewar have new research on machine learning when getting labeled data is time-consuming or difficult:

We are excited to release Learning with Limited Labeled Data, the latest report and prototype from Cloudera Fast Forward Labs.

Being able to learn with limited labeled data relaxes the stringent labeled data requirement for supervised machine learning. Our report focuses on active learning, a technique that relies on collaboration between machines and humans to label smartly.

Active learning makes it possible to build applications using a small set of labeled data, and enables enterprises to leverage their large pools of unlabeled data. In this blog post, we explore how active learning works. (For a higher-level introduction, please see our previous blog post.)

The research itself is behind a paywall, but you can see their write-up to get an idea of the topic.
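If you want a feel for the mechanics before committing to the full report, here is a minimal sketch of the classic active learning loop using uncertainty sampling. The synthetic data, model choice, and query budget are all my own illustrative picks, not anything from the Cloudera report.

```python
# A minimal active learning loop with uncertainty sampling (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Start with a tiny labeled seed; everything else is treated as "unlabeled".
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X_pool), size=20, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(10):
    model.fit(X_pool[labeled], y_pool[labeled])
    # Score the unlabeled pool and pick the examples the model is least sure about.
    probs = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1 - probs.max(axis=1)
    query = np.argsort(uncertainty)[-10:]          # the 10 most uncertain examples
    picked = [unlabeled[i] for i in query]
    # In a real system a human would label these; here the labels already exist.
    labeled.extend(picked)
    unlabeled = [i for i in unlabeled if i not in picked]
    print(f"round {round_}: {len(labeled)} labels, "
          f"test accuracy {model.score(X_test, y_test):.3f}")
```

Each round the model asks for labels only where it is least confident, which is the whole premise behind labeling smartly.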


Understanding the Page Resource Cracker

John Morehouse looks at the new sys.fn_PageResCracker() function in SQL Server 2019:

In a previous blog post, I discussed two new methods in SQL Server 2019 to determine exactly which page a request might be waiting for when there is contention.  One of these new methods involves a new function, fn_pagerescracker.   Naturally, I wanted to see how this function operates.  Let’s look at the Master database to investigate how it works!

Click through for the function definition and what it all means.
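For reference, the typical pattern is to crack the page_resource column from sys.dm_exec_requests and hand the results to sys.dm_db_page_info. Here is a rough sketch of that driven from Python with pyodbc; the connection details are placeholders and it assumes a SQL Server 2019 instance.

```python
# Sketch: crack page-wait resources into database/file/page IDs (SQL Server 2019+).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;"  # placeholder connection details
)

sql = """
SELECT
    der.session_id,
    der.wait_type,
    pc.[db_id],
    pc.[file_id],
    pc.page_id,
    pi.page_type_desc
FROM sys.dm_exec_requests AS der
CROSS APPLY sys.fn_PageResCracker(der.page_resource) AS pc
CROSS APPLY sys.dm_db_page_info(pc.[db_id], pc.[file_id], pc.page_id, 'DETAILED') AS pi
WHERE der.page_resource IS NOT NULL;
"""

for row in conn.cursor().execute(sql):
    print(row.session_id, row.wait_type, row.page_type_desc)
```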


Desired State Configuration Resources

Jess Pomfret takes us through resources in Desired State Configuration:

A critical part of our DSC configuration is made up of resources. These are the building blocks we need to define our desired state.  There are two kinds of resources that we can use: class-based and MOF-based (most common). We are going to focus our efforts today on looking at MOF-based resources.

Resources come packaged up as modules and our servers, which use at least WMF 4.0, come with several built-in. We have two main options for additional resources; we can find DSC resource modules in the PowerShell Gallery or we can write our own.

Jess wraps up the post with five useful resources for database administrators.


Limitations on Inlining Functions

Erik Darling takes us through a limitation to SQL Server 2019 scalar function inlining:

There’s a lot of excitement (alright, maybe I’m sort of in a bubble with these things) about SQL Server 2019 being able to inline most scalar UDFs.

But there’s a sort of weird catch with them. It’s documented, but still.

If you use GETDATE in the function, it can’t be inlined.

GETDATE() and its brethren are non-deterministic, so I figured that would be an issue. Check out the documentation for the other limitations.
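If you want to see the limitation for yourself, sys.sql_modules gained an is_inlineable column in SQL Server 2019. Here is a quick sketch, again assuming a 2019 instance and pyodbc; the function itself is a throwaway example of my own.

```python
# Sketch: show that a GETDATE() call keeps a scalar UDF from being inlined.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=tempdb;Trusted_Connection=yes;",  # placeholder connection details
    autocommit=True,
)
cur = conn.cursor()

cur.execute("""
CREATE OR ALTER FUNCTION dbo.age_in_days (@d datetime2)
RETURNS int
AS
BEGIN
    RETURN DATEDIFF(DAY, @d, GETDATE());  -- the GETDATE() call is the catch
END
""")

cur.execute("""
SELECT OBJECT_NAME(object_id) AS fn, is_inlineable
FROM sys.sql_modules
WHERE object_id = OBJECT_ID('dbo.age_in_days');
""")
print(cur.fetchone())   # expect is_inlineable = 0 because of GETDATE()
```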


Getting Started with Azure Databricks

Brad Llewellyn has a tutorial for Azure Databricks:

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a “first party offering” in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks here, here, and here.

Click through for a demonstration of the product.
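If you have never used it, part of the draw is that a Databricks notebook hands you a ready-made SparkSession (exposed as spark), so "running our code" can be as small as the snippet below. This is just an illustrative stand-alone sketch, not part of Brad's tutorial, and it assumes pyspark is installed if you run it outside a notebook.

```python
# In a Databricks notebook `spark` is already defined; the builder call below
# simply makes this sketch runnable locally too.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# A tiny, self-contained job: bucket a million integers and count each bucket.
df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().orderBy("bucket").show()
```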


Solving Logistic Regression Problems with Python

Hardik Jaroli shows how we can solve logistic regression problems using Python, using the Titanic data set as an example:

We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification: survival or deceased.

Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the Titanic data set; if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.

Click through for the demo.
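If you would like the gist without leaving this page, the core of the workflow looks something like this. The column names match the Kaggle Titanic training file, but the cleanup here is a bare-bones stand-in for the feature engineering in the full demo.

```python
# Bare-bones logistic regression on the Kaggle Titanic training file.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")  # the Kaggle Titanic training set

# Minimal cleanup: fill missing ages, encode sex, keep a few numeric features.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = (df["Sex"] == "male").astype(int)
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```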


Deploying and Executing Containerized Packages

Andy Leonard continues a series on Integration Services in Docker. Part 5 shows how you can deploy a package to a containerized SSIS instance:

Returning to Matt Masson’s PowerShell script – combined with the docker volume added earlier – I have a means to deploy an SSIS Project to the SSIS Catalog in the container.

Part 6 shows how we can run those packages:

An aside regarding attempting SSIS package execution from SSMS connected to an instance of SQL Server in a container (using the runas /netonly trick shared earlier): it appears to work, but doesn’t. The package execution is created but “hangs” in Pending Execution status.

Read both to learn more about Andy’s travails in getting this working.


Scripting Database Restores

Max Vernon helps us out with a query to generate a database restore command:

Just point the script at an existing SQL Server Backup File, and give the new database a name, along with a target folder for the data and log files, and press F5. This script is compatible with SQL Server 2005 and higher, and has been tested on a case-sensitive-collation server.

I think building these out by hand is good practice and helps you learn, but when it’s crunch time, you really want to have a script do the work for you.
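Max's script does the work in pure T-SQL; the same idea driven from the client side, reading the file list out of the backup and building the MOVE clauses, looks roughly like this. The server name, paths, and pyodbc dependency are my assumptions, not part of his script.

```python
# Sketch: build a RESTORE DATABASE command from a backup file's file list.
import pyodbc

backup_file = r"C:\Backups\MyDatabase.bak"   # placeholder paths
new_db_name = "MyDatabase_Restored"
data_folder = r"C:\SQLData"
log_folder = r"C:\SQLLogs"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Ask SQL Server what logical files live inside the backup.
cur.execute(f"RESTORE FILELISTONLY FROM DISK = N'{backup_file}';")
files = cur.fetchall()

moves = []
for bf in files:
    folder = log_folder if bf.Type == "L" else data_folder
    ext = ".ldf" if bf.Type == "L" else ".mdf" if bf.Type == "D" else ".ndf"
    moves.append(
        f"    MOVE N'{bf.LogicalName}' TO N'{folder}\\{new_db_name}_{bf.LogicalName}{ext}'"
    )

restore_sql = (
    f"RESTORE DATABASE [{new_db_name}]\n"
    f"FROM DISK = N'{backup_file}'\n"
    "WITH\n" + ",\n".join(moves) + ",\n    RECOVERY, STATS = 5;"
)
print(restore_sql)   # review the generated command, then run it yourself
```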


Finding High-Cardinality Columns

Constantine Kokkinos shows how you can find the cardinality of each column on a SQL table:

Today I was diving into some extremely wide tables, and I wanted to take a quick look at things like “How many unique values does this table have in every column?”

This can be super useful if you have a spreadsheet of results or a schema without effective normalization and you want to determine which columns are the “most unique” – or have high cardinality.

The GitHub gist is embedded at the bottom of the page, but I will run you through the code in case you want an explanation of how it works.

Click through for the script.
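Constantine's version is dynamic T-SQL; the equivalent idea from the client side, looping over a table's columns and running COUNT(DISTINCT ...) against each, looks roughly like this. The table name, connection string, and pyodbc dependency are my own placeholders, not his script.

```python
# Sketch: distinct-value counts for every column of one table.
import pyodbc

table_schema, table_name = "dbo", "MyWideTable"   # placeholder table

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDatabase;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Enumerate the table's columns in ordinal order.
cur.execute(
    """
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?
    ORDER BY ORDINAL_POSITION;
    """,
    table_schema, table_name,
)
columns = [row.COLUMN_NAME for row in cur.fetchall()]

results = []
for col in columns:
    cur.execute(
        f"SELECT COUNT(DISTINCT [{col}]) FROM [{table_schema}].[{table_name}];"
    )
    results.append((col, cur.fetchone()[0]))

# Highest-cardinality columns first.
for col, distinct_count in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{col}: {distinct_count}")
```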
