Press "Enter" to skip to content

Month: January 2020

Solving the Gaps and Islands Set of Problems

Ed Pollack continues a series on gap and island analysis:

Gaps and islands analysis supplies a mechanism to group data organically in ways that a standard GROUP BY cannot provide. Once we know how to perform an analysis and group data into islands, we can extend this into the realm of real data.

For all code examples in this article, we will use a set of baseball data that I’ve created and maintained over the years. This data is ideal for analytics as it is large and contains data quality that varies between very accurate and very sloppy. As a result, we are forced to consider data quality in our work, as well as scrutinize boundary conditions for correctness. This data will be used without much introduction as we will only reference two tables, and each is relatively straightforward.

The code in this article gets a bit complex, but Ed shows off some powerful techniques.

Comments closed

Copying Measure Definitions in Power BI

Erik Svensen takes us through an oddity in Power BI’s user interface:

Here is an idea you can vote for if you would find it useful as well –

So we end up copying the formula from text in the formula bar

And click new measure and Paste it into the formula bar

But 8 of 10 times nothing is pasted (at least when I select) – WHY ???

This is a strange user experience. But regardless, I find it odd that you can’t copy a measure definition. If this is odd to you as well, upvote the Power BI suggestion.

Comments closed

Fraud Detection with Flink

Alexander Fedulov gives us a case study of using Apache Flink for fraud detection:

In this blog post, we have discussed the motivation behind supporting dynamic, runtime changes to a Flink application by looking at a sample use case – a Fraud Detection engine. We have described the overall architecture and interactions between its components as well as provided references for building and running a demo Fraud Detection application in a dockerized setup. We then showed the details of implementing a dynamic data partitioning pattern as the first underlying building block to enable flexible runtime configurations.

To remain focused on describing the core mechanics of the pattern, we kept the complexity of the DSL and the underlying rules engine to a minimum. Going forward, it is easy to imagine adding extensions such as allowing more sophisticated rule definitions, including filtering of certain events, logical rules chaining, and other more advanced functionality.

It was an interesting discussion and you can grab the code as well.

Comments closed

Customizing Your Rprofile

Colin Gillespie shows how you can customize R via the .Rprofile file:

Every time R starts, it runs through a couple of R scripts. One of these scripts is the .Rprofile. This allows users to customise their particular set-up. However, some care has to be taken, as if this script is broken, this can cause R to break. If this happens, just delete the script!

Full details of how the .Rprofile works can be found in my book with Robin on Efficient R programming. However, roughly R will look for a file called .Rprofile first in your current working directory, then in your home area. Crucially, it will only load the first file found. This means you can have per project Rprofile.

Click through for a sample R profile which has a lot going on.

Comments closed

Simulating Feller’s Coin-Tossing Puzzle in R

David Robinson has another fun puzzle:

Mathematician William Feller posed the following problem:

If you flip a coin times, what is the probability there are no streaks of heads in a row?

Note that while the number of heads in a sequence is governed by the binomial distribution, the presence of consecutive heads is a bit more complicated, because the presence of a streak at various points in the sequence isn’t independent

Click through for a solution in R.

Comments closed

Against Citizen Data Scientists

Bill Schmarzo doesn’t like the idea of “citizen data scientists” very much:

“Hello,” he says. “My name is Dr. Payne and I am your Citizen Dentist for today.”

Citizen Dentist?! You repeat the question out loud for him to hear, want an answer to this looney statement. “What is a Citizen Dentist?”

Get this. He replies, “I’m a person who performs dental work, but my proficiency and expertise is outside of the field of dentistry.”

Bill’s alternative is “Citizens of Data Science.” Click through to see what that means and how it differs.

Comments closed

Strongly Type Table-Valued Parameters

Jonathan Kehayias shows the benefits of using the MaxLength parameter when calling a table-valued parameter from .NET code:

We can see that the MaxLength for the string columns is set at -1, meaning they are being passed over TDS to SQL Server as LOBs (Large Objects) or essentially as MAX datatyped columns, and this can impact performance in a negative manner. If we change the .NET DataTable definition to be strongly-typed to the schema definition of the user-defined table type as follows and look at the MaxLength of the same column using a debug break:

This can be important, especially if you make a lot of calls or use fairly large TVP sizes.

Comments closed

Calculating Compound Interest in DAX

Marco Russo and Alberto Ferrari want you to watch your money grow:

Coincidentally, both debt instrument examples are what is known as “bullet” loans, where the entire principal amount ($100) is repaid in one lump sum at maturity (at the end of Year 5). In the first example the interest income payments are deferred until maturity, thereby allowing the interest to compound over the holding period.  In the second example, the interest income payments are made at the end of each year, which means that the amount of debt accruing interest each year is always the same ($100).

Now let us consider a slightly more complex investment with compounding interest where the interest rate differs year-to-year. Because the interest rate varies, you can’t use the simple formula above (or its FV function equivalent in Excel). Rather, you must effectively stack each year on top of the preceding year and calculate year-by-year.

And that’s something you can do with DAX.

Comments closed

Contrasting TVPs and Memory-Optimized TVPs

Denis Gobo wants to see what memory-optimized table-valued parameters are good for:

The other day I was thinking about the blog post Faster temp table and table variable by using memory optimization I read a while back. Since you can’t believe anything on the internets (no disrespect to whoever wrote that post) , I decided to take this for a test

In this post I will be creating 2 databases, one is a plain vanilla database and the other, a database that also has a file group that contains memory optimized data

I will also be creating a table type in each database, a plain one and a memory optimized one in the memory optimized database

Read on for Denis’s findings.

Comments closed