Curated SQL – Page 1466 – A Fine Slice Of SQL Server

That’s because, out of the box, Server 2012 R2 is running PowerShell 4.0. These Gallery cmdlets require PowerShell 5. To upgrade, you either need to upgrade to PowerShell 5.0 and that means installing Windows Management Framework 5.0. This is compatible with versions of Windows as far back as Windows 7, and Windows Server as far back as 2008 R2. Anything earlier, and you’re out of luck. This also requires the .NET framework 4.5 (or above). That means system updates, which could (potentially) lead to system reboots. Plan (and for the love God test) accordingly!

There’s a couple other hitches as well. One, and this sort of goes without saying, you need internet access for this to work. If your machines are behind any kind of filtering or firewall restrictions that prevent them from talking out to the internet, you’ll need to either open them up or use the Save-Module feature to download the bits and install them yourself. Secondly, you need Administrator access for this to work. And three, if you do install them manually, you might have different versions installed for different users (or service accounts).

They’ve made it nice and easy, so read Drew’s post and give it a try.

Comments closed

Partitioned Columnstore Tables

Published 2017-04-27 by Kevin Feasel

Denny Cherry makes an important point about dealing with columnstore tables:

ColumnStore indexes are all the rage with data warehouses. They’re fast, they’re new(ish) and they solve all sorts of problems when dealing with massive amounts of data. However they can cause some issues as well if you aren’t very careful about how you setup your partitions on the ColumnStore index. This is because, you can’t split a ColumnStore partition once it contains data.

Now, if everything is going according to plan you create your partitions well in advance and there’s no issues.

However, if everything hasn’t gone according to plan and someone forgets to create the partitions and you end up with rows in the final partition, you can’t create any more partitions because you can’t split the partition.

Ideally, you get those ducks in a row first. Keep reading for a repro script and a couple potential workarounds.

Comments closed

Real-Time Weather With HDF

Published 2017-04-26 by Kevin Feasel

Balaji Kandregula shows how to use Hortonworks Data Flow components to process weather events in real time:

It’s live weather reporting using HDF, Kafka, and Solr.

Here are the environment requirements for implementing:

HDF (for HDF 2.0, you need Java 1.8).

Kafka.

Spark.

Solr.

Banana.

Now let’s get on to the steps!

There are a lot of moving parts there, but the pieces do plug in well enough and there are a lot of screen shots to guide you along the way.

Comments closed

Data Lake Zoning

Published 2017-04-26 by Kevin Feasel

Parth Patel, et al, explain that there ought to be several zones of data within a data lake:

Within a Data Lake, zones allow the logical and/or physical separation of data that keeps the environment secure, organized, and Agile. Typically, the use of 3 or 4 zones is encouraged, but fewer or more may be leveraged. A generic 4-zone system might include the following:

Transient Zone — Used to hold ephemeral data, such as temporary copies, streaming spools, or other short-lived data before being ingested.

Raw Zone – The zone in which raw data will be maintained. This is also the zone where sensitive data must be encrypted, tokenized, or otherwise secured.

Trusted Zone – After Data Quality, Validation, or other processing is performed on data in the Raw Zone, it becomes the “source of truth” in this zone for downstream systems.

Refined Zone – Manipulated and enriched data is kept in this zone. This is used to store the output from tools like Hive or external tools that will write into to the Data Lake.

Your particular situation may differ but I’d consider this to be good advice no matter where or how you’re storing data, such as a classical data warehouse or an ODS.

Comments closed

The Birthday Problem

Published 2017-04-26 by Kevin Feasel

Mala Mahadevan explains the Birthday problem and demonstrates it with SQL and R:

Given a room of 23 random people, what are chances that two or more of them have the same birthday?

This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.

What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.

The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom. In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.

Comments closed

How Query Store And Plan Guides Interact

Published 2017-04-26 by Kevin Feasel

Grant Fritchey shows that query metadata gets a little weird when you have a plan guide trying to use one particular query and Query Store is forcing a different plan:

If we rerun the query and then take a look at the first operator in the execution plan, we can see that the Plan Guide is in use… and that the query hash has changed. It no longer matches the original query. Now it matches the query that included the query hint. This actually makes perfect sense. The Plan Guide is basically changing the query from the first example above, into the second.

Now, what happens when we toss in the Query Store

The query behavior is exactly what you want, but some of the metadata is no longer correct.

Comments closed

Statistics For Programmers

Published 2017-04-26 by Kevin Feasel

Julia Evans shares some good resources for developers interested in statistics:

even more links

a paper someone said was good (by Efron): Bootstrap Methods: another look at the jackknife
this book by 5 people named lock
this blog post has an overview of different nonparametric tests
this podcast with Philip Guo and John DeNero where they talk about teaching stats to programmers
nonparametric statistical methods
openintro has free some statistics books

There are a lot of good links in Julia’s post. I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July. Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

Comments closed

PySpark Persistence

Published 2017-04-26 by Kevin Feasel

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 w/Spark 2.0 and Azure Data Lake Storage as the underlying storage system. What is nice about this is that my cluster only has access to its cluster section of the folder structure. I have the structure root/clusters/dasciencecluster. This particular cluster starts at dasciencecluster, while other clusters may start somewhere else. Therefor my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar. The Java version of the code would be seven pages long.

Comments closed

Time Brush Custom Visual

Published 2017-04-26 by Kevin Feasel

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Time Brush Power BI Custom Visual. The Time Brush gives you the ability both filter your report and see a graphics representation of your data at the same time. The name Time Brush comes from the behavior used when you select the values you’d like to filter.

The use of color is an interesting take on combining continuous data points with categorical representations of those points.

Comments closed

CI With SQL Server And Jenkins

Published 2017-04-26 by Kevin Feasel

Chris Adkin shows how to auto-deploy SQL Server Data Tools projects to a SQL Server instance using Jenkins:

The aim of this blog post is twofold, it is to explain how:

A “Self building pipeline” for the deployment of a SQL Server Data Tools project can be implemented using open source tools

A build pipeline can be augmented using PowerShell

What You Will Need

Jenkins automation server
cURL
SQL Server 2016 (any edition will suffice)
Visual Studio 2015 community edition
A windows server, physical or virtual to install all of the above on, I will be using Windows Server 2012 R2 as the operating system

Automated integration via CI is extremely helpful, and Chris makes it look easy in this post.

Comments closed

Curated SQL Posts

even more links

What You Will Need