Curated SQL Posts

Survival Analysis in Spark

Rob Saker and Bryan Smith hit on a topic close to my heart:

These patterns seem to indicate that KKBox could actually differentiate between customers based on their lifetime potential using information known at the time of acquisition. This information might help inform or steer specific discounts or promotions to customers as they register for a trial. This information might also inform KKBox of which offerings or capabilities to discontinue as some, e.g. Initial Payment Method 35 or the 7-day payment plan as shown in Figure 3, align with exceptionally high churn rates in the first 30 days with little long-term survivorship.

Of course, there are relationships between these factors so that we should be careful in viewing them in isolation. By deriving a baseline risk (hazard) of customer churn (Figure 4), we can calculate the influence of different factors on the baseline in such a manner that each factor may be considered an independent hazard multiplier. When combined (through simple multiplication) against the baseline, we can plot a specific customer’s chances of abandoning a subscription by a given point in time (Table 1).

Click through for the story as well as a set of notebooks.
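
To make the hazard-multiplier idea in the excerpt concrete, here is a minimal sketch assuming a proportional hazards setup, where churn probability by time t is 1 - exp(-H0(t) * m), H0 is the cumulative baseline hazard, and m is the product of the per-factor multipliers. The function name and every number are made up for illustration and do not come from the KKBox notebooks.

```python
import math

def churn_probability(baseline_cumulative_hazard, multipliers):
    """Probability of having churned by each time point.

    Assumes proportional hazards: the customer's cumulative hazard is the
    baseline cumulative hazard scaled by the product of the multipliers.
    """
    combined = math.prod(multipliers)  # "simple multiplication" of factors
    return [1.0 - math.exp(-h0 * combined) for h0 in baseline_cumulative_hazard]

# Hypothetical cumulative baseline hazard at days 30, 60, and 90.
baseline = [0.05, 0.10, 0.20]
# Hypothetical multipliers, e.g. payment method, plan length, signup channel.
factors = [2.0, 0.5, 1.5]

probs = churn_probability(baseline, factors)
for day, p in zip((30, 60, 90), probs):
    print(f"day {day}: {p:.3f}")
```

A multiplier above 1 raises the customer's risk relative to the baseline and a multiplier below 1 lowers it, which is why the factors can be read one at a time even though they are applied together.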

Minimum Permissions Required for Get-DbaDbUser

Shane O’Neill wants to figure out the minimum permissions required for the Get-DbaDbUser cmdlet in dbatools:

I’m not going to sugarcoat things – the person that sent me the request has more access than they rightly need. The “public” access worker did not need any of that access so I wasn’t going to just give her the same level.

Plus, we’re supposed to be a workforce that has embraced the DevOps spirit and DevOps is nothing if it doesn’t include Security in it.

So, if I could find a way to give the user enough permission to run the command and not a lot more, then the happier I would be.

Shane takes us through the process so we don’t have to.

A Warning on Power BI Custom Visuals

Martin Schoombee gives us a warning around relying upon free custom visuals:

Before you get the impression that I’m against custom visuals, let me say this: I love custom visuals! I myself have used many custom visuals in the past and have been very quick to look for a custom visual when I couldn’t get something to display or work the way I needed it to in Power BI.

Custom visuals fill an important gap where the base product is not yet where it needs to be, and what better way for Microsoft to see what people need and where they need to invest more time from a visualization standpoint? It’s an awesome concept and I like it.

Unfortunately there are a few BUT’s to follow, but let me first tell you my story…

Read the whole thing. I like custom visuals a lot, but there are risks in a corporate world, and I don’t necessarily mean security.

Creating a Power BI Streaming Dataset

Rob Farley takes us through the process of creating and using a Power BI streaming dataset:

Real-time Power BI sets are a really useful feature, and there’s a good description of them at https://docs.microsoft.com/en-us/power-bi/connect-data/service-real-time-streaming. I thought I’d do a quick walkthrough specifically around the Push side, and show you – including the odd gotcha that you might not have noticed.

To create a dataset that you want to push data into, you need to go to the Power BI service, go to your Workspace, and create a Streaming dataset. Even if you’re not wanting to use it with a streaming service, this is the one you need.

Rob has plenty of animated GIFs to walk you through the process, as well as a couple of caveats if you want to play along at home.
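
For a sense of what the push side looks like from code, here is a rough sketch that builds (but does not send) an HTTP POST to a streaming dataset's push URL. The URL below is a placeholder, the column names are hypothetical, and the payload shape (a JSON array of row objects) is my assumption; check the sample the Power BI service shows for your own dataset before relying on it.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_push_request(push_url, rows):
    """Build a POST request carrying rows as a JSON array (assumed shape)."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        push_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL: copy the real push URL from your dataset's API info.
PUSH_URL = "https://example.invalid/datasets/placeholder/rows?key=placeholder"

# Hypothetical columns; they must match the ones defined on the dataset.
rows = [{"timestamp": datetime.now(timezone.utc).isoformat(), "reading": 42.0}]

request = build_push_request(PUSH_URL, rows)
# urllib.request.urlopen(request)  # uncomment to actually send the rows
```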

When Batch Mode on Rowstore Hurts Performance

Erik Darling walks us through a scenario where batch mode on rowstore can make performance of a query worse:

I’m not mad at 2019 or Batch Mode On Rowstore (BMOR) or anything.

But if I’m gonna get into it, I’m gonna document issues I run into so that hopefully they help you out, too.

One thing I ran into recently was where BMOR kicked in for a query and made it slow down.

Click through for the scenario, why it’s slower when using batch mode, and two ways you can improve the query.

Learning About Index Utilization with dbatools

Ben Miller takes us through a way to know your data:

You have many tables in your databases and you want to know how they are used. There are DMVs for index usage stats, like sys.dm_db_index_usage_stats, and querying them is insightful, but how do the stats change over time? These stats are reset when the instance is restarted, and it is good to know that you have 2000 seeks and 500 scans of the index, but when did they happen? Was it on a common day? Common hour?

Ben has a way to help you figure that out.
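
The general idea of tracking cumulative counters over time can be sketched as taking periodic snapshots and diffing them. This is not Ben's dbatools approach, just an illustration; the function, the index name, and the numbers are hypothetical, while user_seeks and user_scans are columns of sys.dm_db_index_usage_stats.

```python
def usage_delta(previous, current):
    """Activity per index between two snapshots of cumulative counters."""
    deltas = {}
    for index_name, now in current.items():
        before = previous.get(index_name, {})
        delta = {}
        for counter, value in now.items():
            prior = before.get(counter, 0)
            # A lower value than before means the counters were reset
            # (for example by an instance restart), so keep the new value.
            delta[counter] = value - prior if value >= prior else value
        deltas[index_name] = delta
    return deltas

# Hypothetical snapshots taken an hour apart.
snap_9am = {"IX_Orders_Date": {"user_seeks": 1500, "user_scans": 480}}
snap_10am = {"IX_Orders_Date": {"user_seeks": 2000, "user_scans": 500}}

hourly = usage_delta(snap_9am, snap_10am)
print(hourly)
```

Storing each delta with its snapshot time is what lets you answer the "common day, common hour" questions later.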

Calculating Spark Application Resource Allocations

The Hadoop in Real World team walks us through resource allocation for Spark applications:

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of resource allocation properties in YARN and also resource related properties in Spark. Let’s look at both.

This post covers the properties you want to keep an eye on when running Spark applications.
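
As a worked example of the arithmetic involved, here is a sketch of one commonly cited rule of thumb, which may differ from the linked post's method: reserve a core and a gigabyte per node for the OS, use about five cores per executor, leave one executor slot for the YARN application master, and leave room for executor memory overhead (by default the larger of 384 MB and 10% of executor memory). The cluster sizes are hypothetical.

```python
def size_executors(nodes, cores_per_node, memory_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10,
                   min_overhead_gb=0.384):
    """Rule-of-thumb executor sizing; a starting point, not a guarantee."""
    usable_cores = cores_per_node - 1          # one core for OS and daemons
    usable_memory_gb = memory_gb_per_node - 1  # one GB for OS and daemons
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")
    # Leave one executor's worth of resources for the application master.
    total_executors = executors_per_node * nodes - 1
    container_gb = usable_memory_gb / executors_per_node
    # Heap plus overhead must fit inside the container.
    heap_gb = int(min(container_gb / (1 + overhead_fraction),
                      container_gb - min_overhead_gb))
    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
    }

# Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
settings = size_executors(nodes=10, cores_per_node=16, memory_gb_per_node=64)
print(settings)
```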

Comparing Gradient Descent to the Normal Equation for Small Data Sets

Pushkara Sharma compares two techniques for regression:

In this article, we will see the actual difference between gradient descent and the normal equation in a practical approach. Most of the newbie machine learning enthusiasts learn about gradient descent during the linear regression and move further without even knowing about the most underestimated Normal Equation that is far less complex and provides very good results for small to medium size datasets.

If you are new to machine learning, or not familiar with a normal equation or gradient descent, don’t worry I’ll try my best to explain these in layman’s terms. So, I will start by explaining a little about the regression problem.

I was surprised by the results.
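
If you want to see the two techniques side by side, here is a small pure-Python sketch for one-feature linear regression. It is my own toy example rather than the article's code: the normal equation solves the 2x2 system directly, while gradient descent iterates toward the same answer. The data, learning rate, and step count are arbitrary choices.

```python
def normal_equation(xs, ys):
    """Closed-form least squares: solve (X^T X) theta = X^T y for a line."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    slope = (n * sxy - sx * sy) / det
    intercept = (sy * sxx - sx * sxy) / det
    return intercept, slope

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Batch gradient descent on mean squared error."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 / n * sum(errors)
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

exact = normal_equation(xs, ys)
approx = gradient_descent(xs, ys)
print(exact, approx)
```

The closed form needs no learning rate or iteration count, which is a big part of its appeal on small data sets; the cost is solving a linear system whose size grows with the number of features.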

More Scraping Web Pages

Dave Mason continues scraping web pages for fun and profit:

In the last post, we looked at a way to scrape HTML table data from web pages, and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped–you need a SQL Server table to store the data, after all. Another shortcoming is if there are multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we’ll revisit web scraping with Machine Learning Services and R. This time, we’ll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.

Click through to see how Dave manages this feat.
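
Dave's solution uses R inside Machine Learning Services. To illustrate the schema-less idea in isolation, here is a sketch in Python using only the standard library: every HTML table becomes a list of objects keyed by its header row, with no target table schema needed up front. The class and function names and the sample HTML are made up, and nested tables are not handled.

```python
import json
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> on a page."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows (lists of strings)
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def tables_to_json(html):
    """Return a JSON array with one array of row objects per table."""
    parser = TableExtractor()
    parser.feed(html)
    result = []
    for rows in parser.tables:
        if rows:
            header, *body = rows  # first row supplies the keys
            result.append([dict(zip(header, row)) for row in body])
    return json.dumps(result)

# Made-up sample with two grids, loosely mirroring a roster and a game list.
sample = """
<table><tr><th>Player</th><th>No</th></tr><tr><td>A. Example</td><td>7</td></tr></table>
<table><tr><th>Date</th><th>Result</th></tr><tr><td>Oct 26</td><td>W</td></tr></table>
"""
print(tables_to_json(sample))
```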