Press "Enter" to skip to content

Day: February 1, 2021

Determining a Good Test Set Size

John Mount thinks about test set size:

In this note we will answer “what is a good test set size?” three ways.

– The usual practical answer.
– A decision theory answer.
– A novel variational answer.

Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.

My rule of thumb is that I want it to be as small as possible while containing the highest likelihood of hitting all real-world scenarios enough times to provide a valid comparison. This conversely maximizes the size of the training data set, giving us the best chance of seeing the widest variety of scenarios we can during the formative phase.

And as usual, John goes way deeper than my rules of thumb. I like this post a lot.

Comments closed

Model Post-Processing with insight

The easystats team talks about the insight package in R:

We are talking about the insight package. It is what allows other packages, like easystats (parameterseffectsizeperformancereport, …) or ggstatsplotsjstats or modelsummary to be as powerful as they are, supporting tons of different R models. So why make you life hard when you can be like them, and rely on insight?

It is made for developers (and users) that do some postprocessing of different models (e.g., extracting stuff like parameters, values, data, names, specifications, predictions, priors, etc.), whether it is to nicely display their results or to do further computation.

Click through for an example of what it does and how it works. H/T R-bloggers

Comments closed

Using Active Directory Authentication for SQL Server on Linux

Jamie Wick takes us through a lengthy process:

SQL Server has been supported on several Linux distributions for a couple of years now. For some people, the primary stumbling block to implementing SQL Server on Linux is the need to retain Active Directory (ie Windows-based) authentication for their database users and applications. Below we’ll go over how to join a Linux server (Ubuntu release 20.04) with SQL Server 2019 to an Active Directory domain, and then configure SQL Server to allow Windows-based logins.

There are quite a few steps here and I appreciate Jamie providing us an image-filled, step-by-step process.

Comments closed

Checking for Missing Failover Cluster Dependencies

Chad Callihan ran into an error creating a new database:

A tool that restores a model type database and does a bit of configuration work was failing. I took a look at the stores procedures and started to go step by step. It didn’t take long before getting this error message when attempting to restore/create a database:

Msg 5184, Level 16, State 2, Line 3
Cannot use file ‘D:\sql_log\CC_Test_name_4.ldf’ for clustered server. Only formatted files on which the cluster resource of the server has a dependency can be used. Either the disk resource containing the file is not present in the cluster group or the cluster resource of the Sql Server does not have a dependency on it.

Click through for the solution.

Comments closed

Performance Comparison of ISNULL and COALESCE

Erik Darling notes the edge cases where ISNULL() can be faster than COALESCE():

Sometimes there are very good reasons to use either coalesce or isnull, owing to them having different capabilities, behaviors, and support across databases.

But isnull has some particular capabilities that are interesting, despite its limitations: only two arguments, specific to SQL Server, and uh… well, we can’t always get three reasons, as a wise man once said.

There is one thing that makes isnull interesting in certain scenarios. Let’s look at a couple.

Read the whole thing. You (hopefully) won’t be in this situation often, but if you do happen to find yourself in it for whatever reason (and you can’t re-write the query to something better), it’s good to keep this in mind.

Comments closed

Optimizing a SQL Server 2019 Project for a Dedicated SQL Pool

Kevin Chant shows us how we can modify a database schema intended for SQL Server 2019 to work best with an Azure Synapse Analytics dedicated SQL pool:

In this post I want to cover how you can transform your SQL Server database schema for a dedicated SQL Pool if you are using Azure DevOps. Because I covered it at Data Toboggan over the weekend and it can be very useful.

By the end of this post, you will know one way you can transform the schema of a database project for SQL Server 2019 if you are using Azure DevOps. So that you can make it optimal for dedicated SQL Pools.

Click through for the process and an example. Note that this isn’t a quick “check this box and you’re done” type of solution, but if you already have a proper star schema, this will help you think through some of the things you’ll need to do.

Comments closed

Reading Delta Lake Tables from Power BI

Gerhard Brueckl checks out the Apache Parquet connector in Power BI, reading from a Delta Lake:

“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

However, Parquet is just a file format and does not really support you when it comes to data management. Common data manipulation operations (DML)  like updates and deletes still need to be handled manually by the data pipeline. This was one of the reasons why Delta Lake (delta.io) was developed besides a lot of other features like ACID transactions, proper meta data handling and a lot more. If you are interested in the details, please follow the link above.

Click through for a demo.

Comments closed

Default Format String for Power BI Calculation Groups

Gilbert Quevauvilliers figures out how to display a calculation group’s default format string in Power BI:

Recently I was working on a customer’s data and one thing that was missed was when something was not selected the Calculation Group did not return any default Format String.

Below is how I solved this, with the tricky part being formatting of the measure to be displayed correctly.

This follows on from my previous blog posts with regards to Calculation Groups:

Power BI Visuals now support Custom Formatting for measures using Calculation Groups!

Create Currency Formatting Strings using Calculation Groups in Power BI Pro & Premium / Azure Analysis Services / SQL Server Analysis Services 2019

Click through for the process.

Comments closed