
Month: December 2020

Bayesian Modeling of Holiday Behavior

Daniel Marthaler and Brian Coffey have an interesting post:

As the year unfolds, our demand fluctuates. Two big drivers of that fluctuation are seasonality and holidays. With the holiday season upon us, it’s a great time to describe how both seasonality and holiday effects can be estimated, and how you can use this formulation in a predictive time series model.

In this post, we describe the difference between seasonality and holiday effects, posit a general Bayesian Holiday Model, and show how that model performs on some Google Trends data.

Read the whole thing.


Deleting Messages and Topics in Kafka

The Hadoop in Real World team has a pair of related posts. The first is on how to remove messages in a Kafka topic:

The easiest way to purge or delete messages in a Kafka topic is by setting retention.ms to a low value. The retention.ms configuration controls how long messages are kept in a topic. Once the age of a message in the topic reaches the retention time, the message is removed from the topic.

Note that the steps below delete or purge messages in your topic. Use caution when executing them.

Because Kafka is an immutable log rather than “final” storage, the ideal scenario has you never deleting data. But sometimes you just run low on disk space. Setting a maximum retention size is another option. Note that neither setting lets you delete a single message, which is not a good thing to do with a log anyway; instead, you submit a new message that offsets or cancels out the original.
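
To make the retention idea concrete, here is a minimal pure-Python sketch of time-based retention. It is a toy model, not Kafka's implementation: the `Message` class, the `purge_expired` function, and the sample timestamps are all hypothetical, and it filters individual messages for simplicity, whereas a real broker removes whole log segments once they age out.

```python
from dataclasses import dataclass


@dataclass
class Message:
    offset: int
    timestamp_ms: int
    value: str


def purge_expired(log, now_ms, retention_ms):
    """Keep only the messages whose age is within the retention window."""
    return [m for m in log if now_ms - m.timestamp_ms <= retention_ms]


# Hypothetical topic contents: three messages written at different times.
log = [
    Message(offset=0, timestamp_ms=1_000, value="a"),
    Message(offset=1, timestamp_ms=5_000, value="b"),
    Message(offset=2, timestamp_ms=9_500, value="c"),
]
now_ms = 10_000

# With a long retention (seven days), nothing is old enough to remove.
kept_default = purge_expired(log, now_ms, retention_ms=7 * 24 * 60 * 60 * 1000)

# Lowering retention to one second makes the older messages eligible for removal.
kept_short = purge_expired(log, now_ms, retention_ms=1_000)

print([m.offset for m in kept_default])
print([m.offset for m in kept_short])
```

The sketch also shows why this approach cannot target one specific message: the only lever is age, so everything older than the cutoff goes together.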

The second post covers deletion of a Kafka topic:

In this post we will see how to delete a Kafka topic and get the details of the topic before deleting it.


Chaining with DirectQuery for Power BI Datasets

Wolfgang Strasser explains the notion of chaining when working with Power BI datasets:

In my last blog post I introduced the new concept of DirectQuery for Power BI datasets. This feature allows you to extend and modify a (remote) published Power BI dataset with the help of a local model.

The local model does not contain a copy of the remote dataset but a reference to it. You, as a Power BI developer, are able to extend the referenced model with new data sources (like the Excel file I used in my previous example) and/or extend the model with new measures, columns and so on. For a new data model, relationships between the two data islands can be created.

Read on for examples of how this can be useful and what the current limitations look like.


Using the Open Source R or Python Runtime with Machine Learning Services

Niels Berglund walks us through using the open source extensibility framework to install R or Python:

When Java became a supported language in SQL Server 2019, Microsoft mentioned that communication between ExternalHost and the language extension should be based on an API, regardless of the external language. The API is the Extensibility Framework API for SQL Server. Having an API ensures simplicity and ease of use for the extension developer.

From the paragraph above, one can assume that Microsoft would like to see 3rd party development of language extensions. That assumption turned out to be accurate as, as mentioned above, Microsoft open-sourced the Java language extension, together with the include files for the extension API, in September 2020! This means that anyone interested can now create a language extension for their own favorite language!

However, open sourcing the Java extension was not the only thing Microsoft did. They also created and open-sourced language extensions for R and Python!

Click through for more detail and a walkthrough on installation of Python.


External Table Not Accessible because Content of Directory Cannot be Listed

Liliam Leme troubleshoots an error when working with a serverless SQL pool in Azure Synapse Analytics:

Following this lab: Lab: Serverless Synapse – From Spark to SQL On Demand – Microsoft Tech Community

You may experience this message: 

Failed to execute the query because content of directory cannot be listed

This is due to an extra step required to enable the AAD to pass through the firewall on the storage.

Click through for the solution.


Adding a Database Project to GitHub

Elizabeth Noble shows how you can get your brand new Azure Data Studio project into GitHub:

Once you have the database project created, you’ll want to get your database project added to source control so that you (and others) can modify and manage your database code. This next step is the beginning of allowing you and others to work on the same databases and minimize the risk of overwriting someone else’s work or deploying the wrong code to Production.

Tools like GitHub Desktop and SourceTree have definitely made things easier, especially for the happy path scenarios.


Apache Spark Performance Tuning

Tomaz Kastrun provides a few hints when performance tuning Apache Spark code:

DataFrames versus Datasets versus SQL versus RDDs is another choice, yet it is fairly easy. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spark 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects or functions (UDFs), there will be some performance degradation with both R and Python, so switching to Scala or Java might be an optimisation.

Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.


The Intuition Behind Averaging

The Stats Guy takes a look at averages:

In this diagram, there are a bunch of numbers and a single question mark. Behind the question mark is also a number. The known numbers are the same as in our friend v above.

Our task is as follows:

– Make a guess on what that mystery number could be. And,
– If we can’t get it right, then reduce, as much as possible, the error we incur on our guess.

This is a well-written explanation of an important concept. H/T R-Bloggers
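
As a rough illustration of that guessing game, the sketch below tries many candidate guesses against a small set of numbers and keeps the one with the lowest total error. The data values, the candidate grid, and the two error measures (squared and absolute) are assumptions made for this example, not taken from the linked post. Under squared error the best guess lands on the mean; under absolute error it lands on the median.

```python
from statistics import mean, median


def sum_squared_error(guess, values):
    """Total squared distance between one guess and every value."""
    return sum((v - guess) ** 2 for v in values)


def sum_absolute_error(guess, values):
    """Total absolute distance between one guess and every value."""
    return sum(abs(v - guess) for v in values)


# Hypothetical known numbers; the mystery number is assumed to resemble them.
values = [2, 3, 5, 7, 13]

# Candidate guesses from 0.0 to 15.0 in steps of 0.1.
candidates = [x / 10 for x in range(0, 151)]

best_squared = min(candidates, key=lambda g: sum_squared_error(g, values))
best_absolute = min(candidates, key=lambda g: sum_absolute_error(g, values))

print(best_squared, mean(values))
print(best_absolute, median(values))
```

Which average is the "right" guess therefore depends on how you choose to measure the error.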


Naive Bayes and Continuous Predictor Variables

Akhila takes us through the intuition of how Naive Bayes works:

Usually we use the e1071 package to build a Naive Bayes classifier in R. And then using this classifier, we make some predictions on the training data.

So the probabilities for these predictions can be directly calculated based on frequency of occurrences if the features are categorical.
But what if there are features with continuous values? What is the Naive Bayes classifier actually doing behind the scenes to predict the probabilities of continuous data?

Click through for the answer. Also, Naive Bayes isn’t Bayesian, but that’s not important.
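
One common answer, and the one e1071 documents for numeric predictors, is to assume each continuous feature follows a Gaussian distribution within each class. The sketch below is a minimal pure-Python version of that idea for a single feature. The function names and the toy data are hypothetical, and this is not the e1071 implementation.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def fit(samples):
    """samples: list of (feature_value, label). Returns per-class (prior, mean, sd)."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    n = len(samples)
    return {
        label: (len(xs) / n, mean(xs), stdev(xs))
        for label, xs in by_class.items()
    }


def predict_proba(model, x):
    """Posterior per class: prior times Gaussian likelihood, then normalized."""
    scores = {
        label: prior * gaussian_pdf(x, mu, sigma)
        for label, (prior, mu, sigma) in model.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical training data: one continuous feature, two classes.
samples = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.3, "b"), (2.7, "b")]
model = fit(samples)
probs = predict_proba(model, 1.1)
print(probs)
```

With several features, the "naive" independence assumption means multiplying one such density per feature into each class score before normalizing.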


Power BI Composite Model V2 Demo

Wolfgang Strasser gives us a walkthrough of DirectQuery for Power BI datasets:

With the December 2020 release of Power BI Desktop, this approach changed. You are now able to change a live connection to a Power BI dataset (or an Azure Analysis Services connection) to DirectQuery mode. This allows us to enhance the remote model with new columns, tables, and additional data sources, and to create relationships between the data sources.

Let’s dive deeper into this and look at the story together with a sample.

I’ve seen and linked to several posts talking about the idea, but Wolfgang has a demo going, which makes it easier to follow.
