Kevin Feasel – Page 1177

We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification: survival or deceased.
Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.

Click through for the demo.
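
For a feel of the mechanics before clicking through, here is a minimal sketch of logistic regression fitted by plain gradient descent. It does not use the Titanic data or the linked post's code: the toy feature values, the labels, and the helper names fit_logistic and predict_proba are all made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b for one feature by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Probability that the label is 1 for feature value x."""
    return sigmoid(w * x + b)

# Made-up toy data (NOT Titanic): one numeric feature, binary label,
# with two overlapping points so the classes are not perfectly separable.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, -0.3, 0.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

w, b = fit_logistic(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
print(f"P(y=1 | x=2.0)  = {predict_proba(2.0, w, b):.3f}")
print(f"P(y=1 | x=-2.0) = {predict_proba(-2.0, w, b):.3f}")
```

In practice you would likely reach for a library implementation rather than hand-rolled gradient descent; the point here is only to show what the model is fitting.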

Deploying and Executing Containerized Packages

Published 2019-04-08 by Kevin Feasel

Andy Leonard continues a series on Integration Services in Docker. Part 5 shows how you can deploy a package to a containerized SSIS instance:

Returning to Matt Masson’s PowerShell script – combined with the docker volume added earlier – I have a means to deploy an SSIS Project to the SSIS Catalog in the container.

Part 6 shows how we can run those packages:

An aside regarding attempting SSIS package execution from SSMS connected to an instance of SQL Server in a container (using the runas /netonly trick shared earlier): It appears to work, but doesn’t. The package execution is created but “hangs” in Pending Execution status:

Read both to learn more about Andy’s travails in getting this working.

Scripting Database Restores

Published 2019-04-08 by Kevin Feasel

Max Vernon helps us out with a query to generate a database restore command:

Just point the script at an existing SQL Server Backup File, and give the new database a name, along with a target folder for the data and log files, and press F5. This script is compatible with SQL Server 2005 and higher, and has been tested on a case-sensitive-collation server.

I think building these out by hand is good practice and helps you learn, but when it’s crunch time, you really want to have a script do the work for you.
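
To illustrate the general idea (not Max's actual T-SQL script), here is a rough Python sketch that assembles a RESTORE DATABASE ... WITH MOVE command as text. It assumes you already know the logical file names inside the backup, which on SQL Server you would normally get from RESTORE FILELISTONLY. The function name build_restore_command, the paths, and the database and file names are hypothetical.

```python
import ntpath  # Windows-style path joining, regardless of the OS running this

def quote_literal(value):
    """Wrap a value as a T-SQL Unicode string literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"

def build_restore_command(backup_path, new_db, target_dir, logical_files):
    """Build a RESTORE DATABASE command that relocates every file.

    logical_files is a list of (logical_name, file_type) pairs, where
    file_type is "D" for data or "L" for log, mirroring the Type column
    that RESTORE FILELISTONLY returns.
    """
    moves = []
    for logical_name, file_type in logical_files:
        # Simplification: every data file gets .mdf (secondary files are
        # conventionally .ndf, but the extension is only a convention).
        ext = ".ldf" if file_type == "L" else ".mdf"
        physical = ntpath.join(target_dir, f"{new_db}_{logical_name}{ext}")
        moves.append(f"    MOVE {quote_literal(logical_name)} TO {quote_literal(physical)}")
    db_ident = "[" + new_db.replace("]", "]]") + "]"
    lines = [
        f"RESTORE DATABASE {db_ident}",
        f"FROM DISK = {quote_literal(backup_path)}",
        "WITH",
        ",\n".join(moves + ["    RECOVERY"]) + ";",
    ]
    return "\n".join(lines)

# Hypothetical inputs for illustration only.
cmd = build_restore_command(
    backup_path=r"D:\Backups\Sales.bak",
    new_db="Sales_Copy",
    target_dir=r"D:\Data",
    logical_files=[("Sales", "D"), ("Sales_log", "L")],
)
print(cmd)
```

The sketch only generates text; review the command before running it against a real server.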

Finding High-Cardinality Columns

Published 2019-04-08 by Kevin Feasel

Constantine Kokkinos shows how you can find the cardinality of each column on a SQL table:

Today I was diving into some extremely wide tables, and I wanted to take a quick look at things like “How many unique values does this table have in every column?”.
This can be super useful if you have a spreadsheet of results or a schema without effective normalization and you want to determine which columns are the “most unique” – or have high cardinality.
The Github gist is embedded at the bottom of the page, but I will run you through the code in case you want an explanation of how it works.

Click through for the script.
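
The same idea can be sketched in a few lines of Python against SQLite, which ships with the standard library. This is an analog for illustration, not Constantine's SQL Server script: the table, its rows, and the helper name column_cardinalities are made up.

```python
import sqlite3

def column_cardinalities(conn, table):
    """Return {column_name: count of distinct non-NULL values} for a table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    counts = {}
    for col in cols:
        sql = f'SELECT COUNT(DISTINCT "{col}") FROM "{table}"'
        (n,) = conn.execute(sql).fetchone()
        counts[col] = n
    return counts

# Hypothetical sample table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "open", "east"), (2, "open", "west"), (3, "closed", "east"), (4, "open", "east")],
)

cards = column_cardinalities(conn, "orders")
# Print columns from highest to lowest cardinality.
for col, n in sorted(cards.items(), key=lambda kv: kv[1], reverse=True):
    print(col, n)
```

Note that COUNT(DISTINCT ...) ignores NULLs, and that issuing one query per column means one table scan per column, which may be slow on very wide or very large tables.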

The Performance Hit From Ignoring Duplicate Keys

Published 2019-04-08 by Kevin Feasel

Paul White explains why there is a big performance hit when using IGNORE_DUP_KEY on clustered indexes:

The IGNORE_DUP_KEY index option can be specified for both clustered and nonclustered unique indexes. Using it on a clustered index can result in much poorer performance than for a nonclustered unique index.
The size of the performance difference depends on how many uniqueness violations are encountered during the INSERT operation. The more violations, the worse the clustered unique index performs by comparison. If there are no violations at all, the clustered index insert may even perform better.

I use IGNORE_DUP_KEY primarily in cases like queue tables where I might be queuing up changes to migrate to a warehouse and where the chance of collision is low but non-zero. It looks like pushing much beyond that pattern can be devastating for performance.

What’s New With KSQL

Published 2019-04-05 by Kevin Feasel

Robin Moffatt looks into additions to KSQL with Confluent Platform 5.2:

PRINT is one of those features you may not quite grok until you start using it…and then you’ll wonder how you lived without it. It provides a simple way of displaying the contents of a Kafka topic and figures out itself which deserialiser to use. Avro? No problem! JSON? Bring it on!
In KSQL 5.2, the PRINT feature gets even better as you can specify how many records you’d like to see from the topic using the LIMIT clause.

These are some good additions.

Finding an Unfair Coin with R

Published 2019-04-05 by Kevin Feasel

Sebastian Sauer works out a coin flip problem:

A stochastic problem, with application to financial theory. Some say it goes back to Warren Buffett. I owe it to my colleague Norman Markgraf, who pointed it out to me.
Assume there are two coins. One is fair, one is loaded. The loaded coin has a bias of 60-40. Now, the question is: How many coin flips do you need to be “sure enough” (say, 95%) that you found the loaded coin?
Let’s simulate la chose.

It took a few more flips than I had expected but the number is not outlandish.
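
Sebastian works in R and simulates; as a complementary sketch, here is one way to frame the question in Python using exact binomial probabilities instead. The decision rule is my assumption, not necessarily his: flip a single coin n times, call it loaded when the observed heads are more likely under the 60-40 coin than under the fair one, and look for the smallest n at which that call is right at least 95% of the time whichever coin you actually hold. The function names are made up for illustration.

```python
from math import comb, log

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def accuracy(n, p_loaded=0.6, p_fair=0.5):
    """Return (P(call loaded | loaded), P(call fair | fair)) after n flips.

    Assumed rule: call the coin loaded when the log-likelihood ratio of
    loaded versus fair is positive; ties go to fair.
    """
    llr_heads = log(p_loaded / p_fair)
    llr_tails = log((1 - p_loaded) / (1 - p_fair))
    hit_loaded = 0.0
    hit_fair = 0.0
    for k in range(n + 1):
        llr = k * llr_heads + (n - k) * llr_tails
        if llr > 0:
            hit_loaded += binom_pmf(k, n, p_loaded)
        else:
            hit_fair += binom_pmf(k, n, p_fair)
    return hit_loaded, hit_fair

def min_flips(confidence=0.95, max_n=1000):
    """Smallest n at which both kinds of call reach the confidence level."""
    for n in range(1, max_n + 1):
        if min(accuracy(n)) >= confidence:
            return n
    return None  # not reached within max_n flips

flips_needed = min_flips()
print("Flips needed under this rule:", flips_needed)
```

A different definition of “sure enough” (for example, a one-sided test, or flipping both coins side by side) would give a different count, so treat the number this prints as specific to the rule assumed here.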

Generating Workloads with Powershell

Published 2019-04-05 by Kevin Feasel

Rob Sewell wants to generate a workload against AdventureWorks using PowerShell:

For a later blog post I have been trying to generate some workload against an AdventureWorks database.
I found this excellent blog post by Pieter Vanhove at https://blogs.technet.microsoft.com/msftpietervanhove/2016/01/08/generate-workload-on-your-azure-sql-database/ which references this 2011 post by Jonathan Kehayias at https://www.sqlskills.com/blogs/jonathan/the-adventureworks2008r2-books-online-random-workload-generator/

Rob turns these into multi-threaded workload generators. If you’re looking at generating stress on servers, you might also look at PigDog, developed by Mark Wilkinson (one of my co-workers, so I have seen the look of joy on his face when he brings SQL Server to its knees).

A Forensic Accounting Case Study

Published 2019-04-05 by Kevin Feasel

I have a new series I’ve started on applying forensic accounting techniques as a data platform specialist:

Before I dig into my case study, I want to make it absolutely clear that these techniques will help you do a lot more than uncover fraud in your environment. My hope is that there is no fraud going on in your environment and you never need to use these tools for that purpose.
Even with no fraud, there is an excellent reason to learn and use these tools: they help you better understand your data. A common refrain from data platform presenters is “Know your data.” I say it myself. Then we do some hand-waving stuff, give a few examples of what that entails, and go on to the main point of whatever talks we’re giving. Well, this series is dedicated to knowing your data and giving you the right tools to learn and know your data.

This first post sets the scene, with subsequent posts getting into detail on the technical aspects.

Understanding Key Lookups

Published 2019-04-05 by Kevin Feasel

Monica Rathbun explains what a key lookup is in SQL Server:

One of the easiest things to fix when performance tuning queries is Key Lookups or RID Lookups. The key lookup operator occurs when the query optimizer performs an index seek against a specific table and that index does not have all of the columns needed to fulfill the result set. SQL Server is forced to go back to the clustered index using the Primary Key and retrieve the remaining columns it needs to satisfy the request. A RID lookup is the same operation but is performed on a table with no clustered index, otherwise known as a heap. It uses a row id instead of a primary key to do the lookup.
As you can see, these can be very expensive and can result in substantial performance hits in both I/O and CPU. Imagine a query that runs thousands of times per minute that includes one or more key lookups. This can result in tremendous overhead generated by these extra reads, which affects overall engine performance.

Monica’s absolutely right: key lookups can take a decent query and make it into a performance hog.
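
The same pattern is easy to see in miniature with SQLite from Python. This is an analog, not SQL Server: SQLite reports a plain index search (which must then visit the table row for any missing columns) versus a covering index search, whereas in SQL Server the usual fix is to add the missing columns to the index, often as included columns. The table, index names, and query below are made up for illustration.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in rows)  # column 3 holds the description

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
query = "SELECT total FROM orders WHERE customer_id = 42"

# Narrow index: finds matching rows, but 'total' is not in the index,
# so each match needs an extra trip to the table (the lookup).
conn.execute("CREATE INDEX ix_customer ON orders (customer_id)")
narrow_plan = plan(conn, query)

# Wider index containing every column the query needs: no extra trip.
conn.execute("DROP INDEX ix_customer")
conn.execute("CREATE INDEX ix_customer_total ON orders (customer_id, total)")
covering_plan = plan(conn, query)

print(narrow_plan)
print(covering_plan)
```

The exact wording of the plan text varies between SQLite versions, but the second plan should mention a covering index while the first does not.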

Author: Kevin Feasel
