Press "Enter" to skip to content

Category: Data Science

Cortana Intelligence Suite

Buck Woody discusses various components of the Cortana Intelligence Suite:

It’s not a simple matter of “choose one from column B and two from column A” – you have to learn the processes, and then the tools, and then think about your situation. In other words, some things are complicated because they are…complicated. However:

There are some things you can consider out of the box. So I spoke with my friend Romit Girdhar while we were co-teaching in London last week, and he put together a great visualization. You can see it here, and download the PDF below. Thanks, Romit!

And of course they had to change the name—it wouldn’t be a Microsoft product if the name didn’t change every six months…

Comments closed

Exploring Spark

Adnan Masood has photos of slides from a Spark-related meetup:

Apache Spark is a general purpose cluster computing platform which extends map-reduce to support multiple computation types including but not limited to stream processing and interactive queries. Last week IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group Joint meetup on the topic of exploring Apache Spark.

Apache Spark for Azure HD-Insight

Following are some of the slides discussed in the meetup. To play with the ALS Recommendation engine notebook, please register at www.datascientistworkbench.com, which is a free notebook platform for Apache Spark, intended for educational purposes.

Check out the links.
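
If you want to try the ALS recommendation engine idea outside of a hosted notebook, Spark MLlib ships a collaborative filtering implementation. Here's a minimal PySpark sketch with made-up ratings data; the column names and values are chosen for illustration and don't come from the meetup notebook itself:

```python
# Minimal sketch of ALS-based recommendations in PySpark.
# The ratings data and column names below are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 1.0)],
    ["userId", "movieId", "rating"],
)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 item recommendations per user.
model.recommendForAllUsers(3).show(truncate=False)
```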

Comments closed

Key Components For A Successful Project

Ginger Grant lists five key components for a successful data analysis project:

Security is an obvious consideration which needs to be addressed up front. Data is a very valuable commodity and only people with appropriate access should be allowed to see it. What steps are going to be employed to ensure that happens? How much administration is going to be required to implement it? These questions need to be answered up front.

I want to extend special thanks to Ginger for putting security as the top item on the list.  Also, this seems like a pretty good set of criteria for most projects, so definitely check it out.

Comments closed

Data Science Notebooks

Dan Osipov discusses data science notebooks:

Even though they’ve become prominent in the past few years, they have a long history. The first notebooks were available in packages like Mathematica and Matlab, used primarily in academia. More recently they’ve started getting traction in the Python community with the iPython Notebook. Today there are many notebooks to choose from: Jupyter (successor to the iPython Notebook), R Markdown, Apache Zeppelin, Spark Notebook, Databricks Cloud, and more. There are kernels/backends for multiple languages, such as Python, Julia, Scala, SQL, and others.

Traditionally, notebooks have been used to document research and make results reproducible, simply by rerunning the notebook on source data. But why would one want to choose to use a notebook instead of a favorite IDE or command line? There are many limitations in the current browser based notebook implementations that prevent them from offering a comfortable environment to develop code, but what they do offer is an environment for exploration, collaboration, and visualization.

Back In The Day, developers and infrastructure staff used runbooks to make sure they listed and hit all of the steps in an operation.  I don’t really know of a notebook which integrates directly with SQL Server, but Jupyter is probably the best-known cross-platform notebook.
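
For what it's worth, you can get a notebook most of the way toward SQL Server by pointing a Jupyter Python kernel at the database. A rough sketch, assuming pyodbc and pandas are installed; the server, database, and table names are placeholders, not anything from Dan's post:

```python
# Sketch of querying SQL Server from a Jupyter cell via pyodbc + pandas.
# The driver, server, database, and table names are placeholders.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorks;Trusted_Connection=yes;"
)

df = pd.read_sql("SELECT TOP (100) * FROM Sales.SalesOrderHeader;", conn)
df.describe()  # quick summary stats, the kind of exploration notebooks are good for
```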

Comments closed

Linear Regression In Azure ML

Ginger Grant gives a brief discussion of linear regression:

There are two types of indicators for linear correlation, positive and negative, as shown on the following charts. The Y axis represents grades, and the X axis is changed to show the positive and negative effect of X on grades. When X is the number of study hours, there is a positive correlation and the line goes up. When X is changed to watching cat videos, there is a negative correlation. If you can’t draw a line around the points, there is no correlation. If I were to create a graph where X indicated the quantity of bags of Cheese Doodles consumed, it would not be possible to draw a straight line around which the data points cluster. Since this is Line-ar regression, if that line doesn’t exist there is no correlation. Knowing there is no correlation is also useful.

Simple linear regression is a powerful tool and gets you to “good enough” more frequently than you’d think.
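
To make the study-hours example concrete, here is a short sketch using scipy.stats.linregress on invented numbers; the data is made up purely to show a positive slope and a correlation coefficient near 1:

```python
# Simple linear regression on invented study-hours vs. grade data.
from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]
grades = [52, 58, 61, 67, 70, 78, 83, 88]

fit = stats.linregress(hours, grades)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.3f}")
# A positive slope and r close to +1 match the "study hours" chart above;
# r near 0 would mean no linear relationship (the Cheese Doodles case).
```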

Comments closed

Sloan Digital Sky Survey

Joseph Sirosh and Rimma Nehme show a SQL Server use case, walking us through the Sloan Digital Sky Survey:

Astronomers wanted a tool that would be able to quickly answer questions like: “find asteroid candidates” or “find other objects like this one”, which was the original motivation for building the SQL-based backend. Indeed, right from the beginning Jim Gray asked Alex Szalay to define 20 typical queries astronomers might want to ask and then together they designed the SkyServer database to answer those queries. The anecdote is that the conversation went as follows:

Jim: What are the 20 questions you want to ask?
Alex: Astronomers want to ask anything! Not just 20 queries.
Jim: Ok, start with 5 queries.
[it took Alex 30 minutes to write them all down]
Jim: Ok, add another 5 queries.
[it took Alex 1 hour to write them all down]
Jim: Ok, now add another 5 queries.
[Alex gave up and went home to think about them]

Alex (said later): In 1.5 hours, Jim taught me a lot of humility!

Alex (said later): It also taught us the importance of long-tail distribution and how to prioritize.

This is my favorite part of the article.

Comments closed

Asking The Right Question

Buck Woody argues that the hardest thing about data science is asking the right question:

When I started down the path of learning Data Science, I was nervous. I have to work hard at math – it’s a skill I love but one that does not come naturally to me. I was nervous because I thought the most daunting task I would face in Data Science was learning all the algebra, statistics, and other maths I would need to do the job.

But I was wrong.

Math isn’t the hardest thing in Data Science. Actually, since it’s so mature, and documented, and well-known, it’s quite possibly the easiest thing to conquer in the skillset. No, the hardest thing about Data Science is asking the right question.

I’ll lodge a bit of a disagreement here.  I’m okay with the argument that asking the right question is the toughest part, but the math’s not particularly easy either: knowing when to use which distribution, which model, and which parameters requires a definite amount of skill.

Comments closed

Sentiment Analysis

Dustin Ryan and Patrick Leblanc used Azure ML and Power BI to do sentiment analysis:

Using Azure ML and a free subscription to the Text Analytics API, I’m going to show you how to perform sentiment analysis and key phrase extraction on tweets with the hashtag #Colts (after this past Sunday’s 51-16 beat down of the Colts at the hands of the Jacksonville Jaguars, I’m bathing in the tears of Colts fans. Watch the highlights!). Although my example here is somewhat humorous, the steps can be used to perform sentiment analysis and key phrase extraction on any text data as long as you can get the data into Power Query.

This is a fantastic example of how Azure ML can be used.  Read the whole thing.
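
The heart of the technique is a REST call that scores a batch of text. A rough Python sketch of the same idea, assuming the Text Analytics sentiment endpoint and payload shape of the v2.0 REST API; treat the URL, region, and field names as assumptions to check against the current documentation rather than as the exact calls from the post:

```python
# Rough sketch: scoring tweet sentiment with the Azure Text Analytics REST API.
# The endpoint URL, API version, and payload shape are assumptions to verify;
# the subscription key and tweets are placeholders.
import requests

ENDPOINT = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-key>",
           "Content-Type": "application/json"}

tweets = ["Great win for the Jaguars! #Colts",
          "That was a painful game to watch. #Colts"]

body = {"documents": [{"id": str(i), "language": "en", "text": t}
                      for i, t in enumerate(tweets)]}

resp = requests.post(ENDPOINT, headers=HEADERS, json=body)
resp.raise_for_status()

for doc in resp.json()["documents"]:
    print(doc["id"], doc["score"])  # score near 1 = positive, near 0 = negative
```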

Comments closed

Goodness Of Fit

Steve Bolton is one of my favorite long-form analytics bloggers, and his ongoing goodness of fit series is a testament to why.

Part 1:

Goodness-of-fit tests are also sometimes applicable to regression models, which I introduced in posts like A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression and A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression. I won’t rehash the explanations here for the sake of brevity; suffice it to say that regressions can be differentiated from probability distributions by looking at them as line charts which point towards the predicted values of one or more variables, whereas distributions are more often represented as histograms representing the full range of a variable’s actual or potential values. I will deal with methods more applicable to regression later in this series, but in this article I’ll explain some simple methods for implementing the more difficult concept of a probability distribution.

Part 2:

As I found out the hard way, the difficult part with implementing these visual aids is not in representing the data in Reporting Services, but in calculating the deceptively short formulas in T-SQL. For P-P Plots, we need to compare two cumulative distribution functions (CDFs). That may be a mouthful, but one that is not particularly difficult to swallow once we understand how to calculate probability distribution functions. PDFs[2] are easily depicted in histograms, where we can plot the probability of the occurrence of each particular value in a distribution from left to right to derive such familiar shapes as the bell curve. Since probabilities in stochastic theory always start at 0 and sum to 1, we can plot them a different way, by summing them in succession for each associated value until we reach that ceiling. Q-Q Plots are a tad more difficult because they involve comparing the inverse of the CDFs, using what is alternately known as quantile or percent point functions[3], but not terribly so. Apparently the raison d’etre for these operations is to distill distributions like the Gaussian down to the uniform distribution, i.e. a flat line in which all outcomes are equally likely, for easier comparison.[4]
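
To see the two curves a P-P plot compares without wading into the T-SQL, here is a quick Python sketch on generated data: an empirical CDF next to the CDF of a normal distribution fitted to the same sample. The data and distribution choice are illustrative only, not taken from Steve's implementation:

```python
# Sketch of the data behind a P-P plot: empirical CDF vs. fitted normal CDF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.sort(rng.normal(loc=50, scale=10, size=1_000))    # made-up sample

empirical_cdf = np.arange(1, len(x) + 1) / len(x)         # rank / n
theoretical_cdf = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))

# A P-P plot charts these two series against each other; points hugging the
# 45-degree line suggest the sample fits the fitted normal distribution well.
max_gap = np.max(np.abs(empirical_cdf - theoretical_cdf))
print(f"largest CDF gap: {max_gap:.4f}")
```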

Part 3:

The most well-known extension of these somewhat forgotten stats is the Jarque-Bera Test, which only dates back to the 1970s despite being one of the earliest examples of normality testing. All of these measures have fallen out of favor with statisticians to some extent, for reasons that will be apparent shortly, but one of the side effects of this is that it is a little more difficult to find variations on them that are more suited to the unique needs of the SQL Server community. One of the strengths of data mining on database servers like SQL Server is that you typically have such an enormous number of records to draw from that you can actually perform calculations on the full population, or a proportion close to it. In ordinary statistics, however, you’re often limited to making inferences based on small samples of just a few dozen or a few hundred rows, out of a much larger population that is often of unknown size; the results can still be logically valid, but often only if other preconditions are met on the data (including normality tests, which are often not performed). For that reason, I usually prefer to leverage SQL Server’s fast set-based retrieval methods to quickly calculate statistics on full populations whenever possible, especially when there are simpler versions of the mathematical formulas available for the full dataset.
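
For reference, the Jarque-Bera statistic is simple enough to compute by hand from skewness and kurtosis. A small Python sketch on generated data, checked against scipy's built-in version; this illustrates the formula itself, not Steve's T-SQL implementation:

```python
# The Jarque-Bera statistic computed two ways on made-up data:
# by the formula n/6 * (S^2 + (K - 3)^2 / 4) and via scipy's built-in test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=5_000)            # a sample that should look normal

n = len(x)
s = stats.skew(x)                     # sample skewness
k = stats.kurtosis(x, fisher=False)   # "plain" kurtosis (normal distribution = 3)
jb_manual = n / 6 * (s**2 + (k - 3)**2 / 4)

jb_scipy, p_value = stats.jarque_bera(x)
print(f"manual JB = {jb_manual:.3f}, scipy JB = {jb_scipy:.3f}, p = {p_value:.3f}")
# Small JB (and a large p-value) is consistent with normality.
```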

Steve doesn’t post very frequently, but if you have a few hours on a lazy Friday, check him out.

1 Comment