Press "Enter" to skip to content

Author: Kevin Feasel

Limiting Docker Container Resources

Andrew Pruski shows how to cap the resources available to a container:

What I’ve done here is use the cpus and memory switches to limit that container to a maximum of 2 CPUs and 2GB of RAM. There are other options available; more info is available here.

Simple, eh? But it does show something interesting.

I’m running Docker on my Windows 10 machine, using Linux containers. The way this works is by spinning up a Hyper-V Linux VM to run the containers (you can read more about this here).
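
For reference, those two switches map onto docker run like this. This is a minimal sketch; the image name and password placeholder are my assumptions, not from Andrew's post:

    # Cap the container at a maximum of 2 CPUs and 2GB of RAM
    docker run -d \
        --cpus="2" \
        --memory="2g" \
        -e "ACCEPT_EULA=Y" \
        -e "SA_PASSWORD=<YourStrongPassword>" \
        --name sql2017 \
        microsoft/mssql-server-linux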

Read on to learn more.

Finding Adaptive Join Inefficiencies

Joe Obbish walks us through a scenario with adaptive joins in SQL Server 2017:

The estimated costs for the two queries are very close to each other: 74.6842 and 74.6839 optimizer units. However, we saw earlier that the tipping point for an adaptive join on this query can vary between 22680 and 80388.3 rows. This inconsistency means that we can find a query that performs worse with adaptive joins enabled.

Click through to see the queries Joe is using.  Based on this, I’d guess that this is probably a knife-edge problem:  most of the time, adaptive join processing is better, but if you hit the wrong query, it’s worse.
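
If you do run into one of those regressions, SQL Server 2017 has a documented query hint to switch the behavior off for a single statement. A minimal sketch, with hypothetical tables:

    -- Disable adaptive joins for just this query
    SELECT c.CustomerID, COUNT(*) AS OrderCount
    FROM dbo.Orders AS o
        INNER JOIN dbo.Customers AS c
            ON c.CustomerID = o.CustomerID
    GROUP BY c.CustomerID
    OPTION (USE HINT('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));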

Running Out Of Ints

Paul Randal explains the unlikelihood that you’d run out of bigints in a table:

So with 1 million rows per second, you’ll be generating 1 million x 3,600 (seconds in an hour) x 24 (hours in a day) = 86.4 billion rows per day, so you’ll need about 1.4 terabytes of new storage per day. If you’re using the bigint identity as a cluster key, each row needs new space, so you’ll need almost exactly 0.5 petabytes of new storage every year.

At that rate, actually running out of bigint values AND storing them would take roughly 150 thousand petabytes. This is clearly impractical – especially when you consider that storing *just* a bigint is pretty pointless – you’d be storing a bigint and some other data too – probably doubling the storage necessary, at least.

By contrast, if you have a staging table that flows 10 million rows a day (meaning 10 million leave and 10 million new ones enter), you’ll overflow an int column in less than a year.  It’s worth thinking about data sizes before deciding on the type of a surrogate key.  Bigint is the safest, and if you think you’ll need it, go with it.  But there is that storage overhead.
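
The int arithmetic is easy to verify; a quick back-of-the-envelope check in T-SQL, using the 10-million-rows-per-day churn rate from above:

    -- Days until a fresh int identity exhausts its positive range at 10M rows/day
    SELECT 2147483647 / 10000000 AS DaysToOverflow;  -- 214 days, well under a year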

Variables In DAX

Matt Allington shows us how to use variables in DAX:

Variables in DAX is a relatively new feature and is available in

  • Power BI Desktop
  • Excel 2016
  • SSAS Tabular 2016

Variables are not available in Excel 2013 or Excel 2010.

Click through to see how to assign and use variables.  It’s interesting to see how they’re local to a measure, so at this point at least, you can’t share variables between measures.  Given what DAX is supposed to be, that’s probably the right choice.
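
As a taste of the syntax, here's a minimal sketch of a measure using VAR and RETURN; the Sales table and Date column are hypothetical, not from Matt's post:

    Sales YoY % :=
    VAR CurrentSales = SUM ( Sales[Amount] )
    VAR PriorSales =
        CALCULATE ( SUM ( Sales[Amount] ), SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
    RETURN
        DIVIDE ( CurrentSales - PriorSales, PriorSales )

Each variable is evaluated once and reused, which is a big part of the feature's appeal.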

What’s New In Analysis Services

Christian Wade explains what’s new in SQL Server Analysis Services 2017:

SSAS 2017 introduces the 1400 compatibility level. Here are just some highlights of the new features:

  • New infrastructure for data connectivity and ingestion into tabular models with support for TOM APIs and TMSL scripting. This enables support for a range of additional data sources, and data transformation and mashup capabilities.

  • Support for BI tools such as Microsoft Excel enables drill-down to detailed data from an aggregated report. For example, when end-users view total sales for a region and month, they can view the associated order details.

  • Object-level security to secure table and column names in addition to the data within them.

  • Enhanced support for ragged hierarchies such as organizational charts and chart of accounts.

  • Various other improvements for performance, monitoring, and consistency with the Power BI modeling experience.
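
To make the object-level security bullet concrete, a role in a 1400 model can hide a table or column outright. This is a sketch from memory of the TMSL role schema, with invented names, so treat the exact property names as approximate:

    {
      "name": "Analysts",
      "modelPermission": "read",
      "tablePermissions": [
        {
          "name": "Employee",
          "columnPermissions": [
            { "name": "BaseSalary", "metadataPermission": "none" }
          ]
        }
      ]
    }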

There’s plenty more where that came from (unless you’re a Multidimensional fan…), so click through for the details.

Why Window Functions Can’t Appear In The WHERE Clause

Doug Lane explains why window functions like ROW_NUMBER() cannot appear in the WHERE clause (or FROM, GROUP BY, or HAVING):

SQL Server doesn’t process parts of a query in the same order they’re written. Rather than start with SELECT the way we read and write it, here’s the order SQL Server progresses through:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY
  7. TOP

The first four steps are all about getting the source data and reducing the result set down. Steps 5 & 6 determine which columns are presented and in which order. Step 7 (TOP) is only applied at the end because you can’t say which rows are in the top n rows until the set has been sorted. (You can read Itzik Ben-Gan’s explanation of this process in way more detail here.)

Definitely worth reading.  Doug also shows how to get around this fact of life and get the equivalent of a window function inside a WHERE clause, at least in terms of function if not necessarily performance.
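
The standard workaround (which may or may not be exactly what Doug shows) is to compute the window function in a CTE or derived table, so that by the time the outer WHERE runs, the value already exists as an ordinary column. Table names here are invented:

    -- Latest order per customer: ROW_NUMBER() is evaluated inside the CTE,
    -- so the outer query can legally filter on it.
    WITH NumberedOrders AS
    (
        SELECT o.OrderID, o.CustomerID,
               ROW_NUMBER() OVER (PARTITION BY o.CustomerID
                                  ORDER BY o.OrderDate DESC) AS rn
        FROM dbo.Orders AS o
    )
    SELECT OrderID, CustomerID
    FROM NumberedOrders
    WHERE rn = 1;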

Connecting To Named Instances Sans Name

Derek Colley shows what to do when trying to connect to a named instance without specifying its name:

This is a good question.  There are a few ways to address it:

  • We could configure two different connection strings and on failover, the alternative is used (by any arbitrary application mechanism)
  • For mirroring, we can use the Failover Partner= element in the connection string – but this is transactional replication being used for DR, and failover should be manual
  • We could set the instance names to be the same – however in this case it was too late since changing the instance name cleanly in SQL Server is practically impossible and breaks all sorts of things

None of these solutions was suitable.  However, on each server, there was only one named instance and no default instance.  What can we do?

Read on for a fourth solution.
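
As a side note, the Failover Partner= element from the second bullet is a standard ADO.NET connection string keyword for database mirroring; it looks roughly like this (server and database names invented):

    Server=SQLPROD01\Instance1;Failover Partner=SQLDR01\Instance1;Database=Sales;Integrated Security=True;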

Pausing Versus Stopping SQL Servers

Arun Sirpal demonstrates what the Pause option does on a SQL Server:

As stated in the official Microsoft documentation, “Pausing the Database Engine service prevents new users from connecting to the Database Engine, but users who are already connected can continue to work until their connections are broken. Use pause when you want to wait for users to complete work before you stop the service. This enables them to complete transactions that are in progress”.

Very handy! Let’s see it in action and compare it to a STOP.
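
For reference, pausing, resuming, and stopping the Database Engine from an elevated command prompt looks like this, assuming the default instance's service name of MSSQLSERVER:

    REM Pause: existing connections keep working, new ones are refused
    net pause MSSQLSERVER

    REM Resume accepting new connections
    net continue MSSQLSERVER

    REM Stop the service (and any remaining connections) outright
    net stop MSSQLSERVER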

Read on for more.

Regular Expressions With R

Dave Mason looks at using SQL Server R Services to execute regular expressions against a T-SQL data set:

Have you ever had the need to use Regular Expressions directly in SQL Server? I sometimes hear or see others refer to using RegEx in TSQL. But I always assume they’re talking about the TSQL LIKE operator, because RegEx isn’t natively supported. In TSQL’s defence, you can get a lot of mileage out of LIKE and some clever pattern matching strings, even though it’s not authentic RegEx. You can leverage RegEx libraries in the .NET Framework via a CLR stored procedure. You should also be able to do something similar with an old-school extended stored procedure.

I discussed all of this during a recent interview. It was a day or two afterwards (of course) when it dawned on me that there’s another way to leverage RegEx from TSQL: the R language. Prior to this mini-revelation, I had always thought of R (and Python) as strictly a means to an end for Data Science and related disciplines. Now I am thinking I’ve been looking at R and Python through too narrow of a lens and I should take a larger view.

I think I’d prefer CLR for this because there’s additional overhead to making R Services calls, but it’s a clever use of R Services.
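
The shape of the R Services approach looks something like this; a minimal sketch rather than Dave's actual code, with an invented table, column, and pattern:

    -- Filter rows with an R regular expression via sp_execute_external_script
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            # grepl() returns a logical vector; keep only the matching rows
            OutputDataSet <- InputDataSet[grepl("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+$",
                                                InputDataSet$Email), , drop = FALSE]
        ',
        @input_data_1 = N'SELECT Email FROM dbo.Contacts'
    WITH RESULT SETS ((Email NVARCHAR(256)));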

Changes To SQL Server’s Servicing Model

Pedro Lopes announces changes to SQL Server’s servicing model:

Starting with SQL Server 2017, we are adopting a simplified, predictable mainstream servicing lifecycle:

  • SPs will no longer be made available. Only CUs, and GDRs when needed.
  • CUs will now accommodate localized content, allowing new feature completeness and supportability enhancements to be delivered faster.
  • CUs will be delivered more often at first and then less frequently. Every month for the first 12 months, and every quarter for the remaining 4 years of the full 5-year mainstream lifecycle.
  • CUs are delivered in the same week of the month: the week of the third Tuesday.

Note: the Modern Servicing Model (MSM) only applies to SQL Server 2017 and future versions.

If you’re the type who waits for SP1 to drop, you’ll be waiting for Godot.  Who should be here any minute now.
