
Author: Kevin Feasel

What To Watch When Using VSS Snapshots

Erik Darling shows us the wait stats associated with the Volume Shadow Copy Service (VSS):

A while back I wrote about the Perils of VSS Snaps.

After working with several more clients having similar issues, I decided it was time to look at things again. This time, I wanted blood. I wanted to simulate a slow VSS Snap and see what kind of wait stats I’d have to look out for.

Getting software and rigging stuff up to be slow would have been difficult.

Instead, we’re going to cheat and use some old DBCC commands.

This one almost got the “Wacky Ideas” tag but I’m grading on a curve for that category.


Your Data’s Not That Big

Larry White throws a bit of cold water on the distributed computing movement:

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of “big data” is “Data that is too big to fit on one machine.” By that definition, what is “big data” for one language is plain-old “data” for another. Java, with its efficient memory management, high performance, and multi-threading can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.

But you don’t have to take my word for that.

There are some interesting thoughts in this post, but there are limits to what a single machine can do.


Including R Visuals In Power BI Dashboards

Parker Stevens shows how to include R visuals in a Power BI dashboard:

Let’s finish up this post with a quick example of how to code the elusive line chart with two y-axes. This always seems to be asked in the forums and it’s pretty easy to implement.

Follow the same steps as shown above to bring in a new R visual. Since we need a column to pass into the visual and open up the editor, let’s just throw in the Angle field that we made previously. With the code editor available we can start writing the R script. In this example, we are going to need some data that is available in a specific R package, called “ggplot2.” Go ahead and install the package by typing the following code the same way we installed scatterplot3d:

install.packages("ggplot2")

There are two interesting examples here, including one which accepts an external parameter.
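
Parker’s exact script is in the post; as a rough sketch of the dual-axis idea, here is a minimal ggplot2 example using sec_axis(). The data frame and scaling factor are invented for illustration; inside a Power BI R visual you would start from the dataset data frame that Power BI passes in.

library(ggplot2)

# Invented sample data; in a Power BI R visual, replace this with the supplied 'dataset'.
df <- data.frame(
  month   = 1:12,
  sales   = c(120, 135, 150, 160, 155, 170, 180, 175, 190, 200, 210, 220),
  percent = c(0.52, 0.55, 0.53, 0.58, 0.60, 0.61, 0.59, 0.63, 0.64, 0.66, 0.65, 0.68)
)

# Scale the secondary series onto the primary axis, then undo the scaling on the second axis.
scale_factor <- max(df$sales) / max(df$percent)

ggplot(df, aes(x = month)) +
  geom_line(aes(y = sales), colour = "steelblue") +
  geom_line(aes(y = percent * scale_factor), colour = "darkred") +
  scale_y_continuous(
    name = "Sales",
    sec.axis = sec_axis(~ . / scale_factor, name = "Percent")
  )

ggplot2 deliberately ties the secondary axis to a transformation of the primary one, which is why the scaling factor shows up twice.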


Deploying To Power BI Report Server Using PowerShell

Rob Sewell shows us how to automate Power BI Report Server deployments:

But I don’t want to have to do this each time, and there will be multiple pbix files, so I wanted to automate the solution. The end result was a VSTS or TFS release process so that I could simply drop the pbix into a git repository, commit my changes, sync them, and have the system deploy them automatically.

As with all good ideas, I started with a Google search and found this post by Bill Anton which gave me a good start (I could not get the connection string change to work in my test environment, but this was not required so I didn’t really examine why).

I wrote a function that I can use via TFS or VSTS by embedding it in a PowerShell script.

Click through for the script.


Anti-Joins In Power BI

Reza Rad explains when you might want to use anti-joins in Power BI:

Finding rows that are in one table but not the other is one of the most common scenarios in any data-related application. You may have customer records coming from two sources and want to find data rows that exist in one, but not the other. In Power Query, you can use Merge to combine data tables together. Merge can also be used for finding mismatched records. In this blog post, you will learn how to find out which records are missing with Merge in Power Query, and then report on them in Power BI. To learn more about Power BI, read the Power BI book from Rookie to Rock Star.

Read on for a demo of how to use anti-joins to solve this problem.
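
Reza does the work with Merge in Power Query; as a point of comparison (not from the post), the same idea in R looks roughly like this with dplyr’s anti_join(), using made-up customer tables:

library(dplyr)

# Two hypothetical customer sources; the names and IDs are invented for illustration.
source_a <- data.frame(customer_id = c(1, 2, 3, 4),
                       name        = c("Ann", "Bob", "Cat", "Dan"))
source_b <- data.frame(customer_id = c(2, 4, 5),
                       name        = c("Bob", "Dan", "Eve"))

# Rows in source_a with no matching customer_id in source_b (a left anti-join).
anti_join(source_a, source_b, by = "customer_id")
#   customer_id name
# 1           1  Ann
# 2           3  Cat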


Azure SQL Database SLAs

Arun Sirpal ponders the Azure SQL Database service level agreement:

Let’s just get straight to the point: Azure SQL Database across all service tiers gives you, the customer, an SLA of 99.99% up-time. This means the potential unavailability periods shown below.

Good, bad, you decide. The point is that even in the cloud we “could” potentially encounter downtime. Can you improve on 99.99%? That was the question I asked Microsoft; I was given a “wishy-washy” answer that yes, you can, by using failover groups (I’m guessing the read/write endpoint is key here) to improve the up-time. I then pressed on what sort of figure in terms of nines this provides, to no avail.

So what happens if up-time is less than 99.99%, or even worse, 99% (ouch)? Service credits are available, as shown below.
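
Arun’s post has the actual figures; as a quick back-of-the-envelope check (a sketch, using a 30-day month as an approximation of the billing-month definition), the arithmetic looks like this in R:

# Maximum allowed downtime implied by a given up-time SLA.
sla <- c(0.99, 0.999, 0.9999)

data.frame(
  sla               = sla,
  minutes_per_month = round((1 - sla) * 30 * 24 * 60, 1),     # ~30-day month
  minutes_per_year  = round((1 - sla) * 365.25 * 24 * 60, 1)
)
#      sla minutes_per_month minutes_per_year
# 1 0.9900             432.0           5259.6
# 2 0.9990              43.2            526.0
# 3 0.9999               4.3             52.6

Four nines works out to a little over four minutes of allowable downtime per month, which is the figure to keep in mind when weighing whether failover groups are worth the extra complexity.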

Arun also includes some of the exceptions Microsoft has.  Most of these are “you messed up” types of exceptions, but not all of them.


The Basic Paradigms Of Functional Programming

Ayush Hooda explains a couple core principles behind functional programming:

A pure function can be defined like this:

  • The output of a pure function depends only on (a) its input parameters and (b) its internal algorithm, which is unlike an OOP method, which can depend on other fields in the same class as the method.

  • A pure function has no side effects, i.e., it does not read anything from the outside world or write anything to the outside world. For example, it does not read from a file, web service, UI, or database, and does not write anything either.

  • As a result of those first two statements, if a pure function is called with an input parameter x an infinite number of times, it will always return the same result y. For instance, any time a “string length” function is called with the string “Ayush”, the result will always be 5.

If I were to add one more thing, it’d be the idea that functions are first-class data types.  In other words, a function can be an input to another function, the same as any other data type like int, string, etc.  It takes some time to get used to that concept, but once you do, these types of languages become quite powerful.
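
The quoted definitions are language-agnostic; a rough R illustration of both ideas (a pure function, and a function passed as an argument to another function) might look like this:

# A pure function: the output depends only on the input, and there are no side effects.
string_length <- function(s) {
  nchar(s)
}

string_length("Ayush")    # always 5, no matter how many times you call it

# Functions are first-class values, so they can be passed to other functions.
apply_twice <- function(f, x) {
  f(f(x))
}

apply_twice(function(x) x + 3, 10)    # (10 + 3) + 3 = 16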


Using Notebooks At Netflix

Michelle Ufford, et al, explain why and how they use Jupyter Notebooks at Netflix:

Notebooks were first introduced at Netflix to support data science workflows. As their adoption grew among data scientists, we saw an opportunity to scale our tooling efforts. We realized we could leverage the versatility and architecture of Jupyter notebooks and extend it for general data access. In Q3 2017 we began this work in earnest, elevating notebooks from a niche tool to a first-class citizen of the data platform.

From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data — all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook. Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across the entire Data Platform.

Today, notebooks are the most popular tool for working with data at Netflix.

Good article.  I love notebooks for two reasons:  pedagogical purposes (it’s easier to show a demo in a notebook) and forcing you to work linearly.


Faster User-Defined Functions In SparkR

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R on a SparkDataFrame:

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect() are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck, which is row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from the vectorization is more drastic for larger datasets.
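
For a sense of what calling these APIs looks like, here is a rough SparkR sketch (not from the post; it assumes an available Spark session, and the columns and schemas are purely illustrative):

library(SparkR)
sparkR.session()   # assumes a local or cluster Spark environment is available

sdf <- createDataFrame(mtcars)

# dapply(): apply an R function to each partition; the output schema must be declared.
schema <- structType(structField("mpg", "double"),
                     structField("mpg_per_cyl", "double"))
per_row <- dapply(sdf,
                  function(df) data.frame(mpg = df$mpg, mpg_per_cyl = df$mpg / df$cyl),
                  schema)
head(per_row)

# gapply(): apply an R function to each group (here, grouped by cylinder count).
agg_schema <- structType(structField("cyl", "double"),
                         structField("avg_mpg", "double"))
avg_by_cyl <- gapply(sdf, "cyl",
                     function(key, df) data.frame(cyl = key[[1]], avg_mpg = mean(df$mpg)),
                     agg_schema)
head(avg_by_cyl)

dapplyCollect() and gapplyCollect() take the same function arguments but skip the schema and return a local R data frame instead of a SparkDataFrame.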

It looks like they get some pretty serious gains from this change.


Subsetting Matrices In R

Dave Mason continues his look at matrices in R:

We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty.

> #Points scored by Kendrick Perkins.
> points_scored_by_quarter[1,]
1st 2nd 3rd 4th 
  2   2   6   0 
> points_scored_by_quarter["Perkins",]
1st 2nd 3rd 4th 
  2   2   6   0 

Conversely, we can extract a column from a matrix. Specify the column within the square brackets [ ] and omit the row. The result is a vector, thus the pivot effect: the row names are displayed in the output (not the column name).
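
To mirror the quoted example, here is a small sketch with a made-up second row (only Perkins’s points appear above):

# A small matrix in the same shape as the example; the second row is invented.
points_scored_by_quarter <- matrix(c(2, 2, 6, 0,
                                     7, 3, 0, 4),
                                   nrow = 2, byrow = TRUE,
                                   dimnames = list(c("Perkins", "Allen"),
                                                   c("1st", "2nd", "3rd", "4th")))

# Leave the row position empty to take an entire column, by index or by name.
points_scored_by_quarter[, 3]        # named vector: Perkins 6, Allen 0
points_scored_by_quarter[, "3rd"]    # same result, selecting the column by name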

Dave points out that working with matrices is basically an extension of working with vectors.
