April 2017

How Query Store And Plan Guides Interact

Grant Fritchey shows that query metadata gets a little weird when you have a plan guide applied to one particular query and Query Store forcing a different plan:

If we rerun the query and then take a look at the first operator in the execution plan, we can see that the Plan Guide is in use… and that the query hash has changed. It no longer matches the original query. Now it matches the query that included the query hint. This actually makes perfect sense. The Plan Guide is basically changing the query from the first example above, into the second.

Now, what happens when we toss in the Query Store?

The query behavior is exactly what you want, but some of the metadata is no longer correct.
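
To see the hash change for yourself, you can compare what Query Store recorded for the original and hinted forms of the query. Here is a minimal Python sketch using pyodbc; the connection string, database name, and the LIKE filter are all placeholders for your environment and the query under test:

    import pyodbc

    # Placeholder connection string; point it at a database with Query Store enabled.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=localhost;DATABASE=AdventureWorks2014;Trusted_Connection=yes"
    )

    # Query Store keeps one row per distinct query hash, so the plan-guide-modified
    # query shows up with a different query_hash than the original text.
    sql = """
        SELECT q.query_id, q.query_hash, t.query_sql_text
        FROM sys.query_store_query AS q
        JOIN sys.query_store_query_text AS t
            ON q.query_text_id = t.query_text_id
        WHERE t.query_sql_text LIKE '%Person.Person%';  -- narrow to the query under test
    """
    for row in conn.cursor().execute(sql):
        print(row.query_id, row.query_hash, row.query_sql_text[:60])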

Statistics For Programmers

Julia Evans shares some good resources for developers interested in statistics.

There are a lot of good links in Julia’s post.  I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July.  Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 with Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its cluster section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefore my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.
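
For reference, the save itself is nearly a one-liner from PySpark. Here is a minimal sketch; the ADLS account and folder names are made up, and the adl:// URI assumes the cluster is wired up to Azure Data Lake Store the way an HDInsight cluster would be:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

    # Hypothetical path: on HDInsight, a cluster typically sees only its own
    # branch of the storage account's folder structure.
    path = "adl://mydatalake.azuredatalakestore.net/clusters/dasciencecluster/data/open_data/demo_out"
    df.write.mode("overwrite").csv(path)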

Time Brush Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Time Brush Power BI Custom Visual.  The Time Brush gives you the ability to both filter your report and see a graphical representation of your data at the same time. The name Time Brush comes from the behavior used when you select the values you’d like to filter.

The use of color is an interesting take on combining continuous data points with categorical representations of those points.

CI With SQL Server And Jenkins

Chris Adkin shows how to auto-deploy SQL Server Data Tools projects to a SQL Server instance using Jenkins:

The aim of this blog post is twofold; it is to explain how:

  • A “Self building pipeline” for the deployment of a SQL Server Data Tools project can be implemented using open source tools
  • A build pipeline can be augmented using PowerShell

What You Will Need

  • Jenkins automation server

  • cURL

  • SQL Server 2016 (any edition will suffice)

  • Visual Studio 2015 community edition

  • A Windows server, physical or virtual, to install all of the above on; I will be using Windows Server 2012 R2 as the operating system

Automated integration via CI is extremely helpful, and Chris makes it look easy in this post.
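
As a rough sketch of what the core build-and-deploy steps do under the hood, here they are expressed in Python; every path, server name, and database name below is hypothetical, and in Chris's pipeline these calls live in Jenkins build steps rather than in a script:

    import subprocess

    # Hypothetical install paths for the Visual Studio 2015 / SQL Server 2016 toolchain.
    MSBUILD = r"C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe"
    SQLPACKAGE = r"C:\Program Files (x86)\Microsoft SQL Server\130\DAC\bin\SqlPackage.exe"

    # Step 1: build the SQL Server Data Tools project into a .dacpac.
    subprocess.run([MSBUILD, r"C:\src\MyDatabase\MyDatabase.sqlproj",
                    "/p:Configuration=Release"], check=True)

    # Step 2: publish the .dacpac to the target instance.
    subprocess.run([SQLPACKAGE, "/Action:Publish",
                    r"/SourceFile:C:\src\MyDatabase\bin\Release\MyDatabase.dacpac",
                    "/TargetServerName:localhost",
                    "/TargetDatabaseName:MyDatabase"], check=True)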

Benefits Of Deprecated Data Types

Raul Gonzalez shows how to get one of the benefits of the older, deprecated LOB data types while using the newer (MAX) data types:

We can see that our table is managed by two different allocation units, IN_ROW_DATA and LOB_DATA, which means that all data within columns of the data types above, will end up in different pages by default, regardless of the size of the data.

Storing the data separately is the default behaviour for the old LOB types, but the new (MAX) LOB types will by default try to keep the data In-Row if it is small enough to fit.

Having some of those documents In-Row will result in a serious increase in the number of pages to scan, therefore affecting performance.

Note that for the table scan we have used only the IN_ROW_DATA pages, making it much lighter than if we had to scan the sum of all pages.

This might be helpful for some situations, like where you rarely need to get to the LOB data.
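
If you want to see the split for one of your own tables, the allocation unit catalog views spell it out. Here is a small Python sketch via pyodbc; the table name and connection string are placeholders:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes"
    )

    # Page counts per allocation unit type (IN_ROW_DATA vs. LOB_DATA) for a
    # hypothetical dbo.Documents table.  container_id maps to hobt_id for
    # in-row pages and to partition_id for LOB pages, hence the IN clause.
    sql = """
        SELECT au.type_desc, SUM(au.total_pages) AS total_pages
        FROM sys.allocation_units AS au
        JOIN sys.partitions AS p
            ON au.container_id IN (p.hobt_id, p.partition_id)
        WHERE p.object_id = OBJECT_ID('dbo.Documents')
        GROUP BY au.type_desc;
    """
    for row in conn.cursor().execute(sql):
        print(row.type_desc, row.total_pages)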

Killing SPIDs

Garland MacNeill is all out of bubble gum:

Recently came across a situation where reporting logins were interfering with nightly jobs due to blocking. After a number of attempts to resolve the blocking, it was decided that a stored procedure that disabled the login and killed the user sessions was the most pragmatic solution. This is the code I came up with to resolve the issue.

Click through for the script.  This is definitely a last-ditch option, but it’s good to have in your bag of tricks.
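
This isn't Garland's script, but a minimal Python sketch of the same general shape; the login name is a placeholder, and disabling the login comes first so no new sessions sneak in while you kill the existing ones:

    import pyodbc

    LOGIN = "reporting_user"  # placeholder login name

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=localhost;DATABASE=master;Trusted_Connection=yes",
        autocommit=True,
    )
    cur = conn.cursor()

    # Disable the login first so it cannot open new sessions.
    cur.execute(f"ALTER LOGIN [{LOGIN}] DISABLE")

    # KILL cannot be parameterized, but session_id comes from the DMV,
    # not from user input.
    sessions = cur.execute(
        "SELECT session_id FROM sys.dm_exec_sessions WHERE login_name = ?", LOGIN
    ).fetchall()
    for (session_id,) in sessions:
        cur.execute(f"KILL {session_id}")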

Choosing A Hadoop Data Format

Silvia Oliveros has a set of considerations to help you choose a file format for your data in Hadoop:

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

At first, I’d suggest just using delimited files, as it’s easiest that way.  Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Avro) or columnstore formats (like Parquet and ORC) make sense for a particular data set.
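
To make the trade-off concrete, here is a small PySpark sketch writing the same data as delimited text and as Parquet; the output paths are placeholders, and ORC works the same way through .orc() on clusters with Hive support:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-demo").getOrCreate()
    df = spark.range(0, 1000).withColumnRenamed("id", "event_id")

    # Delimited text: simple and human-readable, a fine starting point.
    df.write.mode("overwrite").option("header", "true").csv("/tmp/events_csv")

    # Parquet: columnar and compressed, much faster for analytic queries
    # that touch only a subset of columns.
    df.write.mode("overwrite").parquet("/tmp/events_parquet")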

OCR With Tesseract

Amuda Adelou shows how to use Tesseract’s Java API to perform character recognition in images:

Extracting text from an image means processing the flowchart imagery to extract both its text components and its geometrical shape components. The internal relationships between the components are set up by tracing the flow lines that connect them. The extracted components are output to metadata (in XML format), which is machine-readable. This metadata can be archived, stored in a knowledge base, or shared with others.

Click through for a demo app and code.
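
The post works through Tesseract's Java API; for comparison, here is the same basic extraction step using the pytesseract Python wrapper instead (the image path is a placeholder, and the tesseract executable must be installed and on the PATH):

    from PIL import Image
    import pytesseract

    # Placeholder path to a flowchart or document image.
    image = Image.open("flowchart.png")

    # OCR the image into plain text; pytesseract shells out to the
    # tesseract executable under the covers.
    text = pytesseract.image_to_string(image)
    print(text)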

Spark Deep Learning On AWS

Joseph Spisak, et al, show how to configure and use BigDL in Amazon Web Services’ Elastic MapReduce:

Classify text using BigDL

In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.

We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.

There’s a lot to this tutorial.
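
This isn't the tutorial's convolutional network, just a tiny sketch of what defining a BigDL model looks like from PySpark, assuming BigDL's Python package is available on the cluster; the layer sizes are arbitrary stand-ins for 100 input features and 20 newsgroup classes:

    from bigdl.util.common import init_engine
    from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax

    # Must run after the SparkContext has been created with BigDL's conf.
    init_engine()

    # Arbitrary sizes: 100 input features -> 20 output classes.
    model = Sequential()
    model.add(Linear(100, 32))
    model.add(ReLU())
    model.add(Linear(32, 20))
    model.add(LogSoftMax())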
