Julia Evans shares some good resources for developers interested in statistics:

## even more links

a paper someone said was good (by Efron): Bootstrap Methods: another look at the jackknife

this blog post has an overview of different nonparametric tests

this podcast with Philip Guo and John DeNero where they talk about teaching stats to programmers

openintro has some free statistics books

There are a lot of good links in Julia’s post. I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July. Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

As you can see from the animation, the algorithm is quite simple:

- First, we identify the user and ‘current’ song to start with (red line)
- Next, we identify the other users who have also listened to this song (green line)
- Then we find the other songs which those other users have also listened to (blue, dotted line)
- Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)

The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers: columnstore all the things). They also link to the GitHub project where you can try it out yourself.
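
The four steps quoted above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the SQL Server 2017 graph query from the linked post; the `plays` dictionary, the user and song names, and the `recommend` function are all made up for the example.

```python
from collections import Counter

# Hypothetical listening history: user -> {song: play count}.
plays = {
    "ann": {"song_a": 3, "song_b": 5},
    "bob": {"song_a": 1, "song_c": 4, "song_d": 2},
    "cat": {"song_a": 2, "song_c": 1},
    "dan": {"song_e": 9},
}


def recommend(user, current_song, plays, top_n=3):
    """Return up to top_n songs played by other listeners of current_song."""
    # Step 1: the user and the 'current' song are the starting point.
    # Step 2: find the other users who have also listened to this song.
    peers = [u for u, songs in plays.items()
             if u != user and current_song in songs]

    # Step 3: collect the other songs those users listened to,
    # adding up how many times each one was played.
    tally = Counter()
    for peer in peers:
        for song, count in plays[peer].items():
            if song != current_song:
                tally[song] += count

    # Step 4: return the top songs, ordered by total play count.
    return [song for song, _ in tally.most_common(top_n)]


print(recommend("ann", "song_a", plays))
```

With this toy data, `bob` and `cat` are the other listeners of `song_a`, so the candidates are `song_c` (5 plays in total) and `song_d` (2 plays).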

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted; however, it would be really nice to have that as a selection to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.

Leila Etaati notes the Scylla and Charybdis of models:

However, in most machine learning experiences, we will face two risks:

Overfitting and underfitting.

I will explain these two concepts via an example below.

Imagine that we have collected information about the number of coffees that have been purchased in a café from 8am to 5pm.

Overfitting tends to be a bigger problem in my experience, but they’re both dangerous.
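
To make the two risks concrete, here is a small sketch using the café setting from the quote. The sales figures are invented for illustration, and the three models (a constant average, a straight line, and a lookup table that memorizes the training day) are my own choices rather than anything from the linked post.

```python
# Hypothetical coffee sales per hour (8am to 5pm) on two different days.
train = {8: 30, 9: 42, 10: 25, 11: 18, 12: 22,
         13: 15, 14: 9, 15: 16, 16: 7, 17: 12}
test = {8: 36, 9: 33, 10: 28, 11: 24, 12: 15,
        13: 13, 14: 14, 15: 8, 16: 13, 17: 6}


def mse(predict, data):
    """Mean squared error of a prediction function over {hour: sales}."""
    return sum((predict(h) - y) ** 2 for h, y in data.items()) / len(data)


n = len(train)
mean_sales = sum(train.values()) / n
mean_hour = sum(train) / n  # iterating a dict yields its keys (the hours)


def underfit(hour):
    # Too simple: ignores the hour and always predicts the average.
    return mean_sales


def overfit(hour):
    # Too flexible: memorizes the training day exactly, noise included.
    return train[hour]


# Ordinary least squares line fitted to the training day.
slope = (sum((h - mean_hour) * (y - mean_sales) for h, y in train.items())
         / sum((h - mean_hour) ** 2 for h in train))


def line(hour):
    return mean_sales + slope * (hour - mean_hour)


for name, model in [("underfit", underfit), ("line", line), ("overfit", overfit)]:
    print(f"{name:9s} train MSE={mse(model, train):7.2f}  test MSE={mse(model, test):7.2f}")
```

On this made-up data the memorizing model has zero error on the training day but does worse than the straight line on the second day, while the constant average does poorly on both.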

Steph Locke has a presentation on performing logistic regression using R:

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems.

Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for things like prioritising interventions based on likelihood to default on a loan.

It’s a good introduction to an important statistical method.
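
As a rough illustration of the two uses described in the quote, the sketch below pushes a linear score through the logistic (sigmoid) function to get a probability, then uses that probability both as a yes/no classification and as a ranking. The presentation itself uses R; the coefficients, customer data, and 0.5 cut-off here are hypothetical values chosen for the example, not fitted from any data.

```python
import math


def sigmoid(z):
    """Map a linear score onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; a real model would estimate these from data.
INTERCEPT = -3.0
COEF_MISSED_PAYMENTS = 0.8


def default_probability(missed_payments):
    # Linear predictor first, then the logistic transformation.
    return sigmoid(INTERCEPT + COEF_MISSED_PAYMENTS * missed_payments)


# Hypothetical customers: name -> number of missed payments.
customers = {"c1": 0, "c2": 3, "c3": 6}
probs = {name: default_probability(m) for name, m in customers.items()}

# Use 1: classification with an (assumed) 0.5 threshold.
flagged = [name for name, p in probs.items() if p >= 0.5]

# Use 2: prioritise interventions by likelihood, highest first.
priority = sorted(probs, key=probs.get, reverse=True)

print(flagged)
print(priority)
```

The coefficients stay on the linear scale, which is what makes the model easy to interpret and to reimplement in another system.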

Let’s say we have made our design, constructed a model and obtained good accuracy. However, our model predicts (even with 95% accuracy) the customers who are going to churn the next day! That means our business department has to prevent (somehow, as explained before) those customers from churning in “one day”, because the next day they will no longer be our customers. Taking action on, let’s say, “3000” customers in one day is impossible. So even if our project predicts with very high accuracy, it will not be useful.

This approach also creates another problem. Consider that N months ago, customer “A” was a happy customer working with us (let’s say a customer with 100% efficiency and happiness), and tomorrow it will be a customer who is not working with us (a customer with 0% efficiency and happiness). We can predict the result today, but by the last day the customer has most probably already made up their mind to leave our company. This is a dead end: we cannot prevent the customer from churning at this point, because it is already too late.

## WHEN

So we need to have a certain time limit, such that we are able to warn the business department “M months” before the customer churns, so they can take action before the customers leave. Here comes another problem: what is the time limit? 2 months, 2.5 months, 3 months? How do we determine how far ahead of the churn we need to make the prediction?

There’s a lot more to a good solution than “I ran a regression against a data set.”
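
One way to build the lead time from the quote into a model is to bake it into the target label, so that the model learns to predict churn far enough ahead for someone to act. The sketch below is my own illustration of that idea; the function name and the 60-day lead and 30-day window are assumptions, and the quote deliberately leaves open how to pick the right horizon.

```python
from datetime import date, timedelta


def churn_label(snapshot, churn_date, lead_days=60, window_days=30):
    """Return 1 if the customer churns inside the target window, else 0.

    The window starts lead_days after the snapshot date, which leaves the
    business that much time to react. Both defaults are hypothetical.
    """
    if churn_date is None:  # customer has not churned
        return 0
    start = snapshot + timedelta(days=lead_days)
    end = start + timedelta(days=window_days)
    # Churn earlier than `start` is too late to act on; in a real project
    # those rows would more likely be excluded than labelled 0.
    return int(start <= churn_date < end)


snapshot = date(2017, 1, 1)
print(churn_label(snapshot, date(2017, 3, 15)))  # inside the window
print(churn_label(snapshot, date(2017, 1, 2)))   # next-day churn
print(churn_label(snapshot, None))               # still a customer
```

Features would then be computed only from data available on the snapshot date, so the model never sees information from inside the lead period.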

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow a normal distribution if the sample size is large enough. We will use that, as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.

Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are the chances that a maximum of 35 people wind up with lung disease?

Click through for the example.
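
For readers who want to check the arithmetic, here is a small sketch of the normal approximation for this problem using only the Python standard library. The linked post may work it through differently; the continuity correction and the exact binomial comparison are additions of mine.

```python
import math
from statistics import NormalDist

n, p, k = 100, 0.3, 35  # 100 smokers, P(lung disease) = 0.3, at most 35 cases

# The binomial count has mean n*p and standard deviation sqrt(n*p*(1-p)).
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
normal = NormalDist(mu, sigma)

# Plain normal approximation to P(X <= 35).
approx = normal.cdf(k)

# Same approximation with a continuity correction (evaluate at 35.5),
# which usually tracks a discrete count more closely.
approx_cc = normal.cdf(k + 0.5)

# Exact binomial probability for comparison.
exact = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

print(f"normal approximation:       {approx:.4f}")
print(f"with continuity correction: {approx_cc:.4f}")
print(f"exact binomial:             {exact:.4f}")
```

Here the mean is 30 and the standard deviation is about 4.58, so 35 sits a little over one standard deviation above the mean. The plain approximation comes out at roughly 0.86 and the corrected one at roughly 0.88.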

Raghavan Madabusi runs through a sample logistic regression:

Input Variables: These variables are called predictors or independent variables.

- Customer Demographics (Gender and Senior citizenship)
- Billing Information (Monthly and Annual charges, Payment method)
- Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
- Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are called response or dependent variables. Since the output variable (Churn value) takes the binary form “0” or “1”, this is categorized as a classification problem in supervised machine learning.

One of the interesting things in this post was the use of missmap, which is part of Amelia.

Phil Factor shows how to generate a quick linear regression using SQL, PowerShell, and Gnuplot:

It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.

Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.

It’s good to know that this is possible, but I’d switch to R or Python long before.
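
For comparison, the slope, intercept, and correlation mentioned in the quote take only a few lines of plain Python. The data points below are made up for illustration (they are not Phil's data), and the formulas are the standard least squares ones.

```python
import math

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                        # change in y per unit of x
intercept = mean_y - slope * mean_x      # value of y where x = 0
correlation = sxy / math.sqrt(sxx * syy) # Pearson's r

print(f"y = {slope:.2f}x + {intercept:.2f}, r = {correlation:.3f}")
```

For this toy data the trendline is y = 0.60x + 2.20 with a correlation of about 0.775.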

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark. First, they walk through normalizing the text:

We have used the Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset, but we have used it in our use case to make use of that feature.

What are Broadcast Variables? Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. However, with broadcast variables, they are shipped once to all executors and are cached for future reference.

From there, they dig into details on what the Spark engine did and why we see what we do:

Note: Stage 2 has both reduceByKey() and sortByKey() operations and, as indicated in the job summary, the “saveAsTextFile()” action triggered Job 2.

Do you have any guess whether Stage 2 will be further divided into other stages in Job 2? The answer is: yes.

Job 2 DAG: This job is triggered by the saveAsTextFile() action. The job DAG clearly indicates the list of operations used before the saveAsTextFile() operation.

Stage 2 in Job 1 is further divided into separate stages in Job 2. Stage 2 has both reduceByKey() and sortByKey() operations, and both operations can shuffle the data, so Stage 2 in Job 1 is broken down into Stage 4 and Stage 5 in Job 2. There are three stages in this job, but Stage 3 is skipped. The reason for the skipped stage is given in the “What does “Skipped Stages” mean in Spark?” section.

There’s some good information here if you want to become more familiar with how Spark works.

Kevin Feasel

2017-04-26

Data Science, Learning