Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post. The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform. In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.
Azure Databricks is not a Microsoft product. It is owned and managed by the company Databricks and available in Azure and AWS. However, Databricks is a “first party offering” in Azure. This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products. You can read more about Azure Databricks here, hereand here.
Click through for a demonstration of the product.
We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification- survival or deceased.
Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the titanic data set, if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.
Click through for the demo.
Andy Leonard continues a series on Integration Services in Docker. Part 5 shows how you can deploy a package to a containerized SSIS instance:
Returning to Matt Masson’s PowerShell script – combined with the docker volume added earlier – I have a means to deploy an SSIS Project to the SSIS Catalog in the container.
An aside regarding attempting SSIS package execution from SSMS connected to an instance of SQL Server in a container (using the runas /netonly trick shared earlier: It appears to work, but doesn’t. The package execution is created but “hangs” in Pending Execution status:
Read both to learn more about Andy’s travails in getting this working.
Just point the script at an existing SQL Server Backup File, and give the new database a name, along with a target folder for the data and log files, and press F5. This script is compatible with SQL Server 2005 and higher, and has been tested on a case-sensitive-collation server.
I think building these out by hand is good practice and helps you learn, but when it’s crunch time, you really want to have a script do the work for you.
Today I was diving into some extremely wide tables, I wanted to take a quick look at things like “How many unique values does this table have in every column?”.
This can be super useful if you have a spreadsheet of results or a schema without effective normalization and you want to determine which rows are the “most unique” – or have high cardinality.
The Github gist is embedded at the bottom of the page, but I will run you through the code in case you want an explanation of how it works
Click through for the script.
IGNORE_DUP_KEYindex option can be specified for both clustered and nonclustered unique indexes. Using it on a clustered index can result in much poorer performance than for a nonclustered unique index.
The size of the performance difference depends on how many uniqueness violations are encountered during the
INSERToperation. The more violations, the worse the clustered unique index performs by comparison. If there are no violations at all, the clustered index insert may even perform better.
IGNORE_DUP_KEY primarily in cases like queue tables where I might be queuing up changes to migrate to a warehouse and where the chance of collision is low but non-zero. It looks like pushing much beyond that pattern can be devastating for performance.