Press "Enter" to skip to content

Author: Kevin Feasel

SQL Order Of Operations

Lukas Eder explains order of operations in a SQL query:

If you’re not a frequent SQL writer, the syntax can indeed be confusing. Especially GROUP BY and aggregations “infect” the rest of the entire SELECT clause, and things get really weird. When confronted with this weirdness, we have two options:

  • Get mad and scream at the SQL language designers
  • Accept our fate, close our eyes, forget about the syntax and remember the logical operations order

I generally recommend the latter, because then things start making a lot more sense, including the beautiful cumulative daily revenue calculation below, which nests the daily revenue (SUM(amount) aggregate function) inside of the cumulative revenue (SUM(...) OVER (...) window function):

Lukas explains things from an Oracle perspective, so not all of this matches T-SQL, but it’s close enough for comparison.
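
To make the nesting concrete, here’s a minimal sketch of the pattern Lukas describes (the sales table and its columns are my own invention):

    -- Daily revenue comes from the GROUP BY aggregate; because GROUP BY runs
    -- logically before SELECT, the window function can nest that aggregate.
    SELECT
        sale_date,
        SUM(amount) AS daily_revenue,
        SUM(SUM(amount)) OVER (ORDER BY sale_date) AS cumulative_revenue
    FROM sales
    GROUP BY sale_date
    ORDER BY sale_date;

Because the GROUP BY step has already happened by the time the SELECT clause is evaluated, the outer SUM(...) OVER (...) sees the inner SUM(amount) as just another value to accumulate.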

Comments closed

NOLOCK, No Problem?

Arun Sirpal explains that NOLOCK not only takes locks, but also lets you read invalid data:

A Sch-S (schema stability) lock is taken.  This is a lightweight lock; the only lock that can conflict with it is a Sch-m (schema modification) lock.  This means that a NOLOCK query can actually block against, for example, an ALTER TABLE command.

I would lean heavily toward turning on Read Committed Snapshot Isolation instead of using NOLOCK in most environments.  It’s something you’d need to test, but it does come with fewer bad ramifications.
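
If you go that route, the switch itself is a one-liner, though it needs exclusive access to the database to complete and adds tempdb version store overhead (the database name is illustrative):

    -- Requires no other active connections to finish; run it during a
    -- maintenance window, or append WITH ROLLBACK IMMEDIATE with care.
    ALTER DATABASE YourDatabase
    SET READ_COMMITTED_SNAPSHOT ON;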

Comments closed

Parameterized Visibility In SSRS Reports

Monica Rathbun shows how to show or hide clusters of columns in Reporting Services reports:

Ever had users come to you and request another version of a report just to add another field and group data differently? Today was such a day for me. I really don’t like having multiple versions of the same report out there. So, I got a little fancy with the current version of the report and added a parameter, then used expressions to group the data differently and hide columns. For those new to SSRS, I’ve embedded some links to MSDN to help you along the way.

This is an easy-to-follow, step-by-step guide.

Comments closed

Synchronous Or Asynchronous Stats Updates

SQL Scotsman explains synchronous versus asynchronous stats updates:

With that said, it would seem that asynchronous statistics are better suited to OLTP environments and synchronous statistics are better suited to OLAP environments.  As synchronous statistics are the default though, people are reluctant to change this setting without good reason to.  I’ve worked exclusively in OLTP environments over the last few years and have never seen asynchronous statistics rolled out as the default.  I have personally been bitten by synchronous statistics updates on large tables causing query timeouts, which I resolved by switching to asynchronous statistics updates.

This is an interesting, nuanced take on the issue.  My bias is toward asynchronous stats updates because I have been burned, but it’s interesting to read someone thinking through the implications of this seemingly simple choice.
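
For reference, the change he describes is a database-level setting (the database name is illustrative):

    ALTER DATABASE YourDatabase
    SET AUTO_UPDATE_STATISTICS_ASYNC ON;

    -- Confirm the setting took:
    SELECT name, is_auto_update_stats_async_on
    FROM sys.databases
    WHERE name = 'YourDatabase';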

Comments closed

Passing Parameters To SQL Queries Via Power BI

Chris Webb shows how to use the Value.NativeQuery() function to pass parameters to SQL Server queries:

It looks like, eventually, this will be the way that any type of ‘native’ query (ie a query that you write and give to Power Query, rather than a query that is generated for you) is run against any kind of data source – instead of the situation we have today where different M functions are needed to run queries against different types of data source. I guess at some point the UI will be updated to use this function. I don’t think it’s ‘finished’ yet either, because it doesn’t work on Analysis Services data sources, although it may work with other relational data sources – I haven’t tested it on anything other than SQL Server and SSAS. There’s also a fourth parameter for Value.NativeQuery() that can be used to pass data source specific options, but I have no idea what these could be and I don’t think there are any supported for SQL Server. It will be interesting to see how it develops over the next few releases.

It’s good to know that you can parameterize queries now.
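
Chris’s examples are in M; as a rough mental model, the parameters end up bound server-side much like a plain sp_executesql call. A hypothetical T-SQL rendering (object and parameter names are mine):

    -- What a parameterized native query resolves to, conceptually:
    EXEC sp_executesql
        N'SELECT * FROM dbo.SomeTable WHERE SomeColumn = @param1',
        N'@param1 int',
        @param1 = 42;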

Comments closed

The Limits Of SP1

Parikshit Savjani explains limitations in SQL Server 2016 SP1:

With the recent announcement of SQL Server 2016 SP1, we announced the consistent programmability experience for developers and ISVs, who can now maintain a single code base and build intelligent database applications which scale across all the editions of SQL Server. The processor, memory, and database size limits do not change and remain as-is in all editions, as documented in the SQL Server editions page. We have made the following changes in our documentation to accurately reflect the memory limits on lower editions of SQL Server. This blog post is intended to clarify and provide more information on the memory limits starting with SQL Server 2016 SP1 on Standard, Web and Express Editions of SQL Server.

The development space has been expanded, but there’s still good reason for enterprises to use Enterprise Edition.
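
If you’re unsure what you’re running, SERVERPROPERTY will tell you the edition and servicing level:

    SELECT
        SERVERPROPERTY('Edition')        AS Edition,
        SERVERPROPERTY('ProductLevel')   AS ProductLevel,    -- e.g., SP1
        SERVERPROPERTY('ProductVersion') AS ProductVersion;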

Comments closed

ETL With Spark

Eric Maynard demonstrates that moving data across Hadoop clusters can be sped up by using Spark:

By leveraging Spark for distribution, we can achieve the same results much more quickly and with the same amount of code. By keeping data in HDFS throughout the process, we were able to ingest the same data as before in about 36 seconds. Let’s take a look at the Spark code which produced results equivalent to the bash script shown above — note that a more parameterized version of this code and of all code referenced in this article can be found down below in the Resources section.

Read the whole thing.

Comments closed

SSAS Tabular RAM Requirements

Bill Anton looks at ensuring your Tabular server has enough RAM:

In addition to being an “in-memory” technology, Analysis Services Tabular is also a “column-store” technology which means all the values in a table for a single column are stored together. As a result – and this is especially true for dimensional models – we are able to achieve very high compression ratios. On average, you can expect to see compression ratios anywhere from 10x-100x depending on the model, data types, and value distribution.

What this ultimately means is that your 2 TB data mart will likely require only between 20 GB (low-end) and 200 GB (high-end) of memory. That’s pretty amazing – but still leaves us with a fairly wide margin of uncertainty. In order to further reduce the level of uncertainty, you will want to take a representative sample from your source database, load it into a model on a server in your DEV environment, and calculate the compression factor.

Read the whole thing; Bill has several factors he considers when sizing a machine.
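
To make the arithmetic concrete, here’s a back-of-the-envelope version of that sampling approach with made-up numbers:

    DECLARE @SourceSizeGB   decimal(10,2) = 2048;  -- the 2 TB data mart
    DECLARE @SampleSourceGB decimal(10,2) = 20;    -- representative sample, size at the source
    DECLARE @SampleModelGB  decimal(10,2) = 0.5;   -- same sample loaded into a Tabular model

    -- Compression factor observed on the sample (40x in this example)
    DECLARE @CompressionFactor decimal(10,2) = @SampleSourceGB / @SampleModelGB;

    SELECT
        @CompressionFactor                 AS CompressionFactor,
        @SourceSizeGB / @CompressionFactor AS EstimatedModelGB;  -- ~51 GB here

Remember to leave headroom beyond that estimate for processing and query workspace.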

Comments closed

SQL Server Port Changes

Steve Jones shows how to change the port of your SQL Server instance:

Notice that I have multiple instances here, so I need to choose one. Once I do, I see the protocols on the right. In this case, I want to look at the properties of TCP/IP, which is where I’ll get the port.

If I look at properties, I’ll start with the Protocol tab, but I want to switch to the IP Addresses tab. In here, I’ll see an entry for each of the IPs my instance is listening on. I can see which ones are Active as well as the port. In my case, I have these set to dynamic ports.

My rules of thumb, which might differ from your rules of thumb:  disable the Browser, don’t change off of 1433 for a single instance, and hard-code ports if you happen to be using named instances.  There’s a small argument in favor of “hiding” your instance by putting it onto a higher port (i.e., 50000+), but that’s not a great way of protecting a system, as an attacker can run nmap (or any other port scanner) and find your instance.  The major exception to this is if you also have something like honeyports set up.  In that case, changing the port number can increase security, and will almost definitely increase the number of developers who accidentally get blackholed from the server.
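
Relatedly, if you just want to confirm which port your current connection came in on, a quick T-SQL check works (the port shows as NULL over shared memory):

    SELECT local_net_address, local_tcp_port
    FROM sys.dm_exec_connections
    WHERE session_id = @@SPID;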

Comments closed