In the case of telco customer churn, we collected a combination of call detail record data and customer profile data from a mobile carrier, and then followed the data science process (data exploration and visualization, data pre-processing and feature engineering, model training, scoring, and evaluation) to build the churn prediction. With a churn indicator in the dataset taking the value 1 when the customer has churned and 0 when the customer has not, we treated the problem as binary classification and tried various tree-based models along with methods like bagging, random forests, and boosting. Because the number of churned customers is much smaller than the number of non-churned customers (making the data set quite unbalanced), we applied SMOTE (Synthetic Minority Oversampling Technique) to adjust the proportion of the majority class to the minority class in the training data set, further improving model performance, especially precision and recall.
All of the above data science procedures could be implemented with base R. Rather than moving the data out of the database to an external machine running R, we instead ran the R scripts directly against SQL Server data by leveraging the in-database analytics capability of SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external-memory algorithms in the RevoScaleR library. In what follows, we describe the specific R packages and algorithms we used to implement the data science solution for predicting telco customer churn.
They have provided the relevant materials on GitHub as well.
I wanted to research a memory problem on a server, so I started to dig deeper into the “DBCC MEMORYSTATUS” command. Microsoft provides some very useful links for understanding that command:
During the research I ran into two problems:
1. I had to wait several seconds to get the full result set.
2. I had to scroll down and shuffle through 115 different data sets to find the counter I wanted.
To eliminate both of these problems, all of these different counters have to be in the same table/data set.
That will make research easier, and the data can be stored or compared against a baseline.
Read on for the T-SQL script.
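The linked script is the real solution; as a rough flavor of the single-result-set idea, here is a minimal sketch (not the author's script) that pulls memory usage per clerk from sys.dm_os_memory_clerks, covering some of the same ground in one queryable set:

```sql
-- A minimal sketch, not the linked script: one queryable result set
-- of memory usage by clerk type, instead of scrolling through the
-- 115 result sets that DBCC MEMORYSTATUS returns.
SELECT
    [type]                           AS clerk_type,
    SUM(pages_kb)                    AS pages_kb,   -- pages_kb exists on SQL Server 2012+
    SUM(virtual_memory_committed_kb) AS virtual_committed_kb
FROM sys.dm_os_memory_clerks
GROUP BY [type]
ORDER BY SUM(pages_kb) DESC;
```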
The idea of using key-value pairs to store data isn’t new, but with the rapid development of cloud solutions like Azure and the hype around NoSQL databases, using key-value pairs to store data got a big boost. Developers especially (in my experience) love using key-value pairs to store their data, because it’s easy for them to consume the data in an application. But it gives the database professional an extra challenge, because we’re used to retrieving columns with values instead of a record per value. So how can we turn those key-value pairs into rows?
This is a good example of using PIVOT. I’m not a big fan of storing data in key-value pairs and using pivoting operators because you’re burning CPU on that very expensive SQL Server instance (and you’re not taking advantage of what relational databases do well); if you really need to store data as key-value, I’d recommend doing the pivot in cheaper application servers.
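As a minimal sketch of the technique (the table and attribute names here are hypothetical, for illustration only), PIVOT turns each key into a column:

```sql
-- Hypothetical key-value table.
CREATE TABLE #CustomerAttributes
(
    CustomerID int          NOT NULL,
    AttrKey    varchar(50)  NOT NULL,
    AttrValue  varchar(100) NOT NULL
);

INSERT INTO #CustomerAttributes (CustomerID, AttrKey, AttrValue)
VALUES (1, 'FirstName', 'Ada'),
       (1, 'LastName',  'Lovelace'),
       (1, 'Country',   'UK');

-- PIVOT spreads each AttrKey into its own column, one row per customer.
SELECT CustomerID, [FirstName], [LastName], [Country]
FROM #CustomerAttributes
PIVOT (MAX(AttrValue) FOR AttrKey IN ([FirstName], [LastName], [Country])) AS p;
```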
I often see people struggling to use triggers for auditing, or having issues building them to handle multi-row updates. However, there’s another choice: the OUTPUT clause.
Not many people use this clause, but it’s a great way to access the virtual inserted and deleted tables in your code.
My favorite post on the topic is still Adam Machanic’s.
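As a minimal sketch of the idea (the table names are hypothetical), OUTPUT lets you capture the before and after images of an UPDATE inline, without any trigger:

```sql
-- OUTPUT exposes the same virtual inserted and deleted tables a
-- trigger would see, but inline with the statement itself.
UPDATE dbo.Accounts
SET    Balance = Balance - 100.00
OUTPUT inserted.AccountID,
       deleted.Balance   AS OldBalance,
       inserted.Balance  AS NewBalance,
       SYSUTCDATETIME() AS AuditTime
INTO   dbo.AccountAudit (AccountID, OldBalance, NewBalance, AuditTime)
WHERE  AccountID = 42;
```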
Alright, so we’ve learned that I can’t return more than 65,535 columns, AND I can only use 4,096 elements in my SELECT. I can think of several workarounds there – 65,535 / 4,096 = about 16, which means I could create a few tables or CTEs and do SELECT *’s from them, thereby returning all 65,535 columns with fewer than 4,096 things in my SELECT. But for now, we’ll just start with 4,096 things in my SELECT:
If you think you need 65K columns returned, I refer you to Swart’s Ten Percent Rule.
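For what it's worth, here is a toy sketch of the shape of that CTE workaround, scaled down to a handful of columns; whether star expansion actually evades the 4,096-element limit at scale is exactly the kind of experiment the post runs:

```sql
-- Toy version of the workaround shape: a few CTEs, each contributing
-- columns via star expansion, joined back into one wide result.
WITH c1 AS (SELECT 1 AS col1, 2 AS col2, 3 AS col3),
     c2 AS (SELECT 4 AS col4, 5 AS col5)
SELECT c1.*, c2.*   -- two items written in the SELECT list, five columns out
FROM c1 CROSS JOIN c2;
```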
The IF statement asks: “is 1 in the collection of values (2, NULL)?” And we must say, “ah, no idea. That can’t be proven. Therefore we can’t return true.”
The value 1 cannot be determined not to be in the set (2, NULL), because that NULL could be anything, including 1.
NULL is a strange bird.
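A quick way to see the three-valued logic at work: 1 IN (2, NULL) expands to 1 = 2 OR 1 = NULL, which evaluates to UNKNOWN, and IF only takes the branch when the predicate is TRUE.

```sql
-- 1 = 2 is FALSE; 1 = NULL is UNKNOWN; FALSE OR UNKNOWN is UNKNOWN,
-- which IF does not treat as TRUE.
IF 1 IN (2, NULL)
    PRINT 'in the set';
ELSE
    PRINT 'cannot be proven';           -- this branch runs

-- NOT IN fares no better: NOT UNKNOWN is still UNKNOWN.
IF 1 NOT IN (2, NULL)
    PRINT 'definitely not in the set';
ELSE
    PRINT 'cannot be proven either';    -- this branch runs too
```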
Say we need to extract the information associated with an “UPDATE” for LSNs starting at “0000004f:00000087:0001”. You can just specify the Starting and Ending LSNs as “fn_dblog” parameters:
That portion of code will return ONLY the log records between LSNs “0000004f:00000087:0001” and “0000004f:00000088:0001”.
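Based on the excerpt, a hedged sketch of that call, filtering down to the log operation that backs an UPDATE:

```sql
-- Restrict fn_dblog() to the LSN range from the excerpt instead of
-- scanning the whole log, then keep only UPDATE-related records.
SELECT [Current LSN], Operation, [Transaction ID], AllocUnitName
FROM   fn_dblog('0000004f:00000087:0001', '0000004f:00000088:0001')
WHERE  Operation = 'LOP_MODIFY_ROW';   -- the log operation behind an UPDATE
```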
Slava’s post uses fn_dblog() as an example but the techniques are applicable across the board, and in practice sum up to “get the fewest number of rows and fewest number of columns you need to solve the problem at hand.”
Basic stuff, right? Both will return 951 records (books) that I do not own. And, very quickly…because the tables are tiny. Sub-1 second is fast.
The issue here is HOW the rows are compared.
English version now, techy stuff later:
In the first query, this is equivalent to you standing at the bookstore and calling home to have someone check to see if the book in your hand is already in your collection. EVERY time. One by one.
In the second, you got really smart and brought a list with you, which you are comparing to the books on the shelf at the store. You’ve got both “lists” in one place, so it is far more efficient.
Even with just a few hundred records, you can see why there’d be a performance difference.
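The post's actual queries aren't shown here, but as a hypothetical sketch of the two shapes (table and function names invented for illustration), "books I don't own" can be asked one row at a time or as a single set comparison:

```sql
-- Hypothetical tables: dbo.StoreBooks (the shelf) and dbo.MyBooks
-- (my collection at home).

-- "Calling home for every book": a scalar function runs once per row,
-- which the optimizer cannot turn into a set operation.
CREATE FUNCTION dbo.IOwnIt (@Title nvarchar(200))
RETURNS bit
AS
BEGIN
    RETURN (SELECT CASE WHEN EXISTS
        (SELECT 1 FROM dbo.MyBooks WHERE Title = @Title)
        THEN 1 ELSE 0 END);
END;
GO

SELECT s.Title
FROM   dbo.StoreBooks AS s
WHERE  dbo.IOwnIt(s.Title) = 0;

-- "Bringing the list with me": both sets compared in one pass.
SELECT s.Title
FROM      dbo.StoreBooks AS s
LEFT JOIN dbo.MyBooks    AS m ON m.Title = s.Title
WHERE  m.Title IS NULL;
```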
Julie Koesmarno wrote a great post on installing R packages; please follow it. Microsoft also suggests a way to install R packages on MSDN.
Since I wanted to be able to install packages directly from SQL Server Management Studio (SSMS), here is yet another way to do it. I used xp_cmdshell to install any additional package for my R installation (optionally, you can set EXECUTE AS USER).
This is a bit of a backdoor method, but it does work.
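A hedged sketch of the approach: the R_SERVICES path below is the SQL Server 2016 default and the package name is just an example, so adjust both for your instance (and note xp_cmdshell must be enabled first):

```sql
-- Assumes xp_cmdshell is enabled and the default SQL Server 2016
-- R Services install path; 'forecast' is an example package.
-- The lib parameter targets the instance library so the package is
-- visible to in-database R scripts.
EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\Rscript.exe" -e "install.packages(''forecast'', repos=''https://cran.r-project.org'', lib=''C:/Program Files/Microsoft SQL Server/MSSQL13.MSSQLSERVER/R_SERVICES/library'')"';
```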
This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.
So here’s a powerful pattern to solve that task.
I really like this. It takes the gaps & islands problem and goes one step further.
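Without the original post's code at hand, here's a hedged sketch of the general pattern: mark islands of consecutive values with the value-minus-ROW_NUMBER trick, collapse each island to a range, then stitch the ranges together (STRING_AGG assumes SQL Server 2017 or later; older versions would use FOR XML PATH):

```sql
-- Sample values with gaps; the output reads '1-5, 7, 9-11'.
WITH Vals AS
(
    SELECT n FROM (VALUES (1),(2),(3),(4),(5),(7),(9),(10),(11)) AS v(n)
),
Islands AS
(
    -- Consecutive values share the same (n - ROW_NUMBER) group key.
    SELECT n, n - ROW_NUMBER() OVER (ORDER BY n) AS grp
    FROM Vals
),
Ranges AS
(
    SELECT MIN(n) AS range_start, MAX(n) AS range_end
    FROM Islands
    GROUP BY grp
)
SELECT STRING_AGG(
           CASE WHEN range_start = range_end
                THEN CAST(range_start AS varchar(12))
                ELSE CONCAT(range_start, '-', range_end)
           END, ', ') WITHIN GROUP (ORDER BY range_start) AS readable_list
FROM Ranges;
```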