Press "Enter" to skip to content

Month: August 2018

Dealing With Multicollinearity With R

Chaitanya Sagar explains the concept of multicollinearity in linear regressions and how we can mitigate this issue in R:

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. Now, this new variable, Z, does not add any significant or different value than provided by X or Y. The model can adjust itself to set the parameters that this combination is taken care of while determining the coefficients.

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. The other reason could be the usage of derived variables, i.e., one variable is computed from other variables in the system. This is similar to the example we took at the beginning of the article. The other reason could be taking variables which are similar in nature or which provide similar information or the variables which have very high correlation among each other.

Multicollinearity can make regression analysis trickier, and it’s worth knowing about.  H/T R-bloggers.

Comments closed

When Cassandra Makes Sense

Anmol Sarna explains the pros and cons of using Apache Cassandra:

But as we know nothing is perfect. So is the Cassandra Database. What I mean by this is that you cannot have a perfect package. If you wish for one brilliant feature then you might have to compromise on the other features. In today’s blog, we will be going through some of the benefits of selecting Cassandra as your database as well as the problems/drawbacks that one might face if he/she chooses Cassandra for his/her application.
I have also written some blogs earlier which you can go through for reference if you want to know What Cassandra isHow to set it up and how it performs its Reads and Writes.

The only question we have is that should we or should we not pick Cassandra over the other databases that are available. So let’s start by having a quick look at when to use the Cassandra Database. This will give a clear picture to all those who are confused in decided whether to give Cassandra a try or not.

This is a level-headed analysis of Cassandra, so check it out.

Comments closed

What To Watch When Using VSS Snapshots

Erik Darling shows us the wait stats associated with the Volume Shadow Copy Service (VSS):

A while back I wrote about the Perils of VSS Snaps.

After working with several more clients having similar issues, I decided it was time to look at things again. This time, I wanted blood. I wanted to simulate a slow VSS Snap and see what kind of waits stats I’d have to look out for.

Getting software and rigging stuff up to be slow would have been difficult.

Instead, we’re going to cheat and use some old DBCC commands.

This one almost got the “Wacky Ideas” tag but I’m grading on a curve for that category.

Comments closed

Your Data’s Not That Big

Larry White throws a bit of cold water on the distributed computing movement:

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of “big data” is “Data that is too big to fit on one machine.” By that definition what is “big data” for one language is plain-old “data” for another. Java, with it’s efficient memory management, high performance, and multi-threading can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.

But you don’t have to take my word for that.

There are some interesting thoughts in this post, but there are limits to what a single machine can do.

Comments closed

Including R Visuals In Power BI Dashboards

Parker Stevens shows how to include R visuals in a Power BI dashboard:

Let’s finish up this post with a quick example of how to code the elusive line chart with two y-axes. This always seems to be asked in the forums and it’s pretty easy to implement.

Follow the same steps as shown above to bring in a new R visual. Since we need a column to pass into the visual and open up the editor, let’s just throw in the Angle field that we made previously. With the code editor available we can start writing the R script. In this example, we are going to need some data that is available in a specific R package, called “ggplot2.” Go ahead and install the package by typing the following code the same way we installed scatterplot3d:

install.packages(“ggplot2”)

There are two interesting examples here, including one which accepts an external parameter.

Comments closed

Deploying To Power BI Report Server Using Powershell

Rob Sewell shows us how to automate Power BI Report Server deployments:

But I dont want to have to do this each time and there will be multiple pbix files, so I wanted to automate the solution. The end result was a VSTS or TFS release process so that I could simply drop the pbix into a git repository, commit my changes, sync them and have the system deploy them automatically.

As with all good ideas, I started with a google and found this post by Bill Anton which gave me a good start ( I could not get the connection string change to work in my test environment but this was not required so I didnt really examine why)

I wrote a function that I can use via TFS or VSTS by embedding it in a PowerShell script.

Click through for the script.

Comments closed

Anti-Joins In Power BI

Reza Rad explains when you might want to use anti-joins in Power BI:

Finding rows that are in one table, but not the other is one of the most common scenarios happening in any data related applications. You may have customer records coming from two sources, and want to find data rows that exist in one, but not the other. In Power Query, you can use Merge to combine data tables together. Merge can be also used for finding mismatch records. You will learn through this blog post, how in Power Query you can find out which records are missing with Merge, and then report it in Power BI. To learn more about Power BI, read Power BI book from Rookie to Rock Star.

Read on for a demo of how to use anti-joins to solve this problem.

Comments closed

Azure SQL Database SLAs

Arun Sirpal ponders the Azure SQL Database service level agreement:

Lets just get straight to the point, Azure SQL Database across all service tiers gives you the customer a SLA of 99.99% up-time. This means potential unavailability periods shown below.

Good, bad, you decide. The point is that even in the cloud we “could” potentially encounter downtime. Can you improve on 99.99%? Well that was the question I asked Microsoft, I was given a “wishy-washy” answer that yes you can by using failover groups ( I’m guessing the read/write endpoint is key here ) to improve the up time. I then pressed on what sort of figure in terms of nines does this provide, to no avail.

So what happens if up time is less than 99.99% or even worse 99% (ouch). Service credits are available as shown below.

Arun also includes some of the exceptions Microsoft has.  Most of these are “you messed up” types of exceptions, but not all of them.

Comments closed

The Basic Paradigms Of Functional Programming

Ayush Hooda explains a couple core principles behind functional programming:

A pure function can be defined like this:

  • The output of a pure function depends only on(a) its input parameters and(b) its internal algorithm,which is unlike an OOP method, which can depend on other fields in the same class as the method.

  • A pure function has no side effects, i.e., that it does not read anything from the outside world or write anything to the outside world. – For example, It does not read from a file, web service, UI, or database, and does not write anything either.

  • As a result of those first two statements, if a pure function is called with an input parameter x an infinite number of times, it will always return the same result y. – For instance, any time a “string length” function is called with the string “Ayush”, the result will always be 5.

If I got to add one more thing, it’d be the idea that functions are first-class data types.  In other words, a function can be an input to another function, the same as any other data type like int, string, etc.  It takes some time to get used to that concept, but once you do, these types of languages become quite powerful.

Comments closed

Using Notebooks At Netflix

Michelle Ufford, et al, explain why and how they use Jupyter Notebooks at Netflix:

Notebooks were first introduced at Netflix to support data science workflows. As their adoption grew among data scientists, we saw an opportunity to scale our tooling efforts. We realized we could leverage the versatility and architecture of Jupyter notebooks and extend it for general data access. In Q3 2017 we began this work in earnest, elevating notebooks from a niche tool to a first-class citizen of the data platform.

From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data — all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook.Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across the entire Data Platform.

Today, notebooks are the most popular tool for working with data at Netflix.

Good article.  I love notebooks for two reasons:  pedagogical purposes (it’s easier to show a demo in a notebook) and forcing you to work linearly.

Comments closed