Press "Enter" to skip to content

Author: Kevin Feasel

Understanding The Problem: Churn Edition

Emre Yazici points out the importance and difficulty of nailing down good definitions, using bookings churn as an example:

WHEN: Let’s say we have made our design, constructed a model, and obtained good accuracy. However, our model predicts (even with 95% accuracy) the customers who are going to churn the next day! That means our business department has to prevent those customers from churning (somehow, as explained before) within “one day”, because the next day they will no longer be our customers. Taking action on, say, 3,000 customers in a single day is impossible. So even if our project predicts with very high accuracy, it will not be useful. This approach also creates another problem: consider that N months ago, customer “A” was a happy customer working with (and providing to) us (let’s say a customer with 100% efficiency and happiness), and tomorrow it will be a customer who is no longer working with us (a customer with 0% efficiency and happiness). And we can predict that result today. So most probably, the customer already made up their mind to leave by that last day. This is a dead end, and we cannot prevent the customer from churning at this point, because it is already too late.

So we need a certain time limit, such that we can warn the business department “M months” before the customer churns, and they can take action before the customers leave. Here comes another problem: what is the time limit? 2 months, 2.5 months, 3 months? How do we determine how far in advance we need to predict that customers will churn?

There’s a lot more to a good solution than “I ran a regression against a data set.”
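Emre’s point about picking a lead time M is, in practice, a labeling decision. As a minimal sketch of the idea (the column names, dates, and M value here are hypothetical, not from his post), a customer only gets a positive label if they churn far enough in the future for the business to react:

import pandas as pd

# Hypothetical data: the date (if any) on which each customer churned.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "churn_date": pd.to_datetime(["2017-05-10", "2017-09-01", None, "2017-06-20"]),
})

AS_OF = pd.Timestamp("2017-04-30")  # the date the model scores customers
M_MONTHS = 3                        # lead time the business needs to react

# Positive label: the customer churns within the next M months, so a warning
# raised today still leaves time to intervene; churn beyond the horizon (or
# no churn at all) is a negative example for this training snapshot.
horizon = AS_OF + pd.DateOffset(months=M_MONTHS)
customers["label"] = customers["churn_date"].between(AS_OF, horizon).astype(int)
print(customers)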


Generating Homoglyphs In R

Bob Rudis shows how to create homoglyphs (character sequences which look similar to other character sequences) using a few R packages:

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐|.com"

Now, that’s using all possible homoglyphs, and it might not look like google.com to you, but imagine whittling the list down to ones that are really close to Latin character set matches. Or imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not give it a second thought if the page looked fine (or if you were on a mobile browser without a location bar showing).

Click through for more details, as well as information on punycode.
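If you want to poke at the punycode side yourself, Python’s standard-library idna codec is enough for a quick look. This is a toy lookalike I made up (not Bob’s R pipeline), with the substituted characters spelled out as escapes so they are visible:

# A made-up lookalike of google.com: the middle letters are Cyrillic U+043E,
# which renders almost identically to the Latin "o".
lookalike = "g" + "\u043e\u043e" + "gle.com"

# The idna codec (IDNA 2003 rules) converts each non-ASCII label to its
# punycode form, prefixed with "xn--", which is what DNS actually resolves.
print(lookalike)                 # displays like google.com
print(lookalike.encode("idna"))  # the xn-- punycode form of the domain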


When Binomials Converge

Mala Mahadevan shows an example of the central limit theorem in action, as a large enough sample from a binomial distribution approximates the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are the chances that a maximum of 35 people wind up with lung disease?

Click through for the example.
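As a rough sketch of the arithmetic (in Python rather than Mala’s approach): with n = 100 and p = 0.3, the binomial has mean np = 30 and standard deviation sqrt(np(1-p)), about 4.58, so P(X <= 35) can be read off the normal CDF and compared against the exact binomial CDF:

from math import sqrt
from scipy import stats

n, p, k = 100, 0.3, 35
mu = n * p                      # 30
sigma = sqrt(n * p * (1 - p))   # about 4.58

# Normal (central limit theorem) approximation of P(X <= 35),
# with a continuity correction of 0.5.
approx = stats.norm.cdf(k + 0.5, loc=mu, scale=sigma)

# Exact binomial probability, for comparison.
exact = stats.binom.cdf(k, n, p)

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")

The two values should land close together, which is exactly the point of the approximation.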


Understanding Probe Residuals

Daniel Janik explains what a probe residual is in an execution plan:

A probe residual is important because it can indicate key performance problems that might not otherwise be brought to your attention.

What is a probe residual?

Simply put, a probe residual is an extra operation that must be performed to complete the matching process: the “extra” is the leftover work still to be done.

Click through for an example brought about by implicit conversion.
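To make “extra work after the match” concrete, here is a toy illustration (not how SQL Server actually implements it): in a hash match, rows are first paired up by hash bucket, and the residual predicate is the comparison still evaluated on every candidate pair, for example re-checking the join condition after an implicit conversion:

# Toy hash join: hashing narrows down candidates, but each bucket hit still
# needs the real comparison (the residual predicate) to confirm the match.
build_rows = [("123", "Alice"), ("456", "Bob")]    # join key stored as strings
probe_rows = [(123, "Order 1"), (456, "Order 2")]  # join key stored as integers

# Build phase: bucket the build side by a hash of the converted key.
buckets = {}
for key, name in build_rows:
    buckets.setdefault(hash(int(key)), []).append((key, name))

# Probe phase: a matching bucket is not enough; the residual predicate
# (int(build_key) == probe_key) runs for every candidate in the bucket.
for key, order in probe_rows:
    for build_key, name in buckets.get(hash(key), []):
        if int(build_key) == key:   # residual predicate (the implicit CONVERT)
            print(name, "<->", order)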


The Story Of Nick

Kenneth Fisher tells the story of where the optimizer’s cost value comes from:

Obviously, it’s an important subject, right? And yet we keep seeing comments about how the cost is in seconds.

And to be fair, it is. It’s an estimate of how many seconds a query would take, if it were running on a developer’s workstation from back in the ’90s. At least that’s the story. In fact, Dave Dustin (t) posted this interesting story today:

The best way to think of cost is as a probabilistic, ordinal, unitless value:  3 might be greater than 2; 1000 is almost certainly greater than 2; and “2 what?” is undefined.


Case-Insensitive Power Query Sorts

Cedric Charlier points out the comparisonCriteria parameter on Table.Sort in Power Query:

Have you already tried to sort a table based on a text field? The result is usually a surprise for most people. The M language has a specific implementation of the sort engine for text where upper case letters are always ordered before lower case letters. It means that Z is always before a. In the example (below), Fishing Rod is sorted before Fishing net.

The classical trick to escape from this weird behavior is to create a new column containing the upper case version of the text that will be used to sort your table, then configure the sort operation on this newly created column. This is a two-step approach (three steps, if you take into account the need to remove the new column afterwards). There’s nothing wrong with this, except that it obfuscates the code, and I hate that.

Click through to learn a more elegant way of sorting.
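The shape of the fix is easy to see in any language, even if the M syntax is what you’re after from Cedric’s post: sort on a case-insensitive key instead of materializing an uppercase helper column. A quick Python analogue of the two approaches:

products = ["Fishing net", "Fishing Rod", "anchor", "Buoy"]

# Default ordinal sort: upper case sorts before lower case, so
# "Fishing Rod" lands before "Fishing net", the same surprise as in M.
print(sorted(products))

# The one-step equivalent of the "add an upper-case column" trick:
# hand the sort a case-folded key instead of a helper column.
print(sorted(products, key=str.casefold))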


T-SQL Tuesday Roundup

Koen Verbeeck has his roundup of this month’s T-SQL Tuesday:

I asked the SQL Server community to write about their experience/opinion about the changing world we live in and how it impacts their daily job. The response was overwhelming: we had 30 participants in this month’s blog party!

Here’s an overview of everyone who participated. Take your time to read their stories, as they are very insightful, interesting or just plain fun to read.

There’s a lot of reading this month, with a well above-average turnout.


CPU Hot-Add And NUMA

Frank Denneman discusses VMware NUMA behavior when you hot-add more CPUs:

But what happens when the VM is configured with fewer vCPUs than the core count of the physical CPU package and CPU Hot-Add is enabled? Will there be a performance impact? The answer is no. The VPD configured for the VM fits inside a NUMA node, and thus the CPU scheduler and the NUMA scheduler optimize memory operations. It’s all about memory locality. Let’s use some application workload tests to determine the behavior of the VMkernel CPU scheduling.

For this test, I installed DVD Store 3.0 and ran some test loads on the MS-SQL server. To determine the baseline, I logged in to the ESXi host via an SSH session and executed the command sched-stats -t numa-pnode. This command shows the CPU and memory configuration of each NUMA node in the system. This screenshot shows that the system is only running the ESXi operating system; hardly any memory is consumed. TotalMem indicates the total amount of physical memory in the NUMA node in KB. FreeMem indicates the amount of free physical memory in the NUMA node in KB.

Interesting reading.
