Press "Enter" to skip to content

Month: September 2020

The Power of AUC

John Mount takes a deeper look at Area Under the Curve:

I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think.

I think I am far enough along to share the consequences here. This started as some, now reappraised, thoughts on the fallacy of thinking that knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot). I now think for many practical applications the AUC number carries a lot more information about the ROC shape than one might expect.

Read on for the explanation.
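
For a quick refresher on the quantities involved, here is a minimal sketch (in Python with scikit-learn, not drawn from Mount's note) of computing an ROC curve and the single AUC number that summarizes it, using made-up labels and scores:

```python
# A minimal sketch: compute an ROC curve and its AUC from labels and scores.
# The labels and scores are made up purely for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])          # ground-truth classes
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6,
                    0.65, 0.7, 0.8, 0.9])                   # model scores

# The AUC is one number summarizing the whole curve...
auc = roc_auc_score(y_true, y_score)

# ...while the ROC curve is the full set of (FPR, TPR) points traced out
# as the decision threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"AUC = {auc:.3f}")
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:5.2f}: FPR {f:.2f}, TPR {t:.2f}")
```

Mount's argument concerns how much that single number constrains the shape traced out by the second call.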


Connecting to an API with Username and Password in Power Query

Gilbert Quevauvilliers has a challenge:

In the blog post I am going to show you the steps that I took to get data from the XE.COM API, which uses a username and password to log into the API.

You might be thinking that I could put in the username and password when I used the Web Connector.

My challenge is that I wanted to create a function that I could pass through multiple currencies to the API. And in order to do that I wanted to store the details within the function.

Read on to see how Gilbert solves this.
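
Gilbert's function is written in Power Query's M language. As a point of comparison only, here is a minimal Python sketch of the same shape of solution: a reusable function that carries the credentials and can be called once per currency. The endpoint URL, parameter name, and credentials are placeholders, not the actual XE.COM API.

```python
# A sketch of wrapping a basic-auth (username + password) API call in a
# reusable function that can be called once per currency.
# The URL, parameter name, and credentials below are placeholders.
import requests

API_USER = "my-account-id"       # placeholder
API_PASSWORD = "my-api-key"      # placeholder
BASE_URL = "https://api.example.com/v1/rates"   # hypothetical endpoint

def get_rates(currency: str) -> dict:
    """Fetch rates for one currency, sending the credentials on every call."""
    response = requests.get(
        BASE_URL,
        params={"from": currency},
        auth=(API_USER, API_PASSWORD),   # sent as HTTP basic authentication
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# The same function can then be applied to a list of currencies,
# much as Gilbert invokes his Power Query function per currency.
for currency in ["USD", "EUR", "GBP"]:
    print(currency, get_rates(currency))
```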


Working with Keys and Certificates using dbatools

Mikey Bronowski concludes a series on using dbatools to replace SQL Server Management Studio functionality:

There are multiple security-related objects that are not easily accessible via SQL Server Management Studio. The first one would be the Service Master Key, which, if it exists, can be seen under the master database. Luckily, dbatools can help us take a backup.

Click through for the details. If you’ve missed any of Mikey’s posts, here is the full list.


Optimize for Unknown with Inline Table-Valued Functions

Koen Verbeeck hits on a strange case:

Turns out SQL Server used a plan with a hash join in the fast query, and a nested loop in the slow query. Because SQL Server was also using wildly incorrect estimates, the nested loop join performs really poorly. Quite similar to parameter sniffing with stored procedures. Erik Darling has written a great article about it: Inline Table Valued Functions: Parameter Snorting.

The thing is, in contrast to scalar functions or multi-statement table-valued functions, the iTVF should have better performance because it will be expanded into the calling query. This way, SQL Server can use “more correct” estimates and create a plan for each different parameter. Well, today was not that day.

Read on for details on how Koen troubleshot the issue and arrived at the solution.


Understanding the Size of a Power BI File

Lazaros Viastikopoulos explains that finding the size of a Power BI file isn’t quite as easy as you’d think:

As you can see from the above screenshot, the size of the PBIX file is 1.2 MB (1,238 KB) by looking at the size in File Explorer. Also, when we look at all the files we are ingesting into Power BI, the total size comes to 2.8 MB. However, when we ingest data into Power BI, which is processed to memory, it can be bigger than the PBIX file size due to some compression that occurs and various other elements which, again, are not displayed on disk.

So, when we are getting various errors notifying us that our Power BI file is too large and we see that it is under the 1GB restriction on disk, it can leave us scratching our heads. But, as I mentioned above, due to compression and various other elements, once data is processed to memory it can be larger than what we see on disk, but still below the total size of the data source we are ingesting due to the compression.

Click through for information on two tools you can use to determine the in-use size.
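
The underlying point, that the size of data in memory is a different measurement from the size of a file on disk, is easy to demonstrate outside of Power BI. A rough pandas analogy follows; pandas is not the VertiPaq engine and the file name is a placeholder, so treat this as an illustration of the gap rather than a way to measure a Power BI model:

```python
# Illustration only: compare a file's size on disk with the memory the
# loaded data occupies. Not Power BI's VertiPaq engine; the file name is
# a placeholder.
import os
import pandas as pd

path = "sales.csv"  # placeholder data file

on_disk_bytes = os.path.getsize(path)
df = pd.read_csv(path)
in_memory_bytes = df.memory_usage(deep=True).sum()

print(f"Size on disk:   {on_disk_bytes / 1024:,.0f} KB")
print(f"Size in memory: {in_memory_bytes / 1024:,.0f} KB")
```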


Finding Skew in a Spark DataFrame

Landon Robinson walks us through skew in Spark DataFrames:

Ignoring issues caused by skew can be worth it sometimes, especially if the skew is not too severe, or isn’t worth the time spent for the performance gained. This is particularly true with one-off or ad-hoc analysis that isn’t likely to be repeated, and simply needs to get done.

However, the rest of the time, we need to find out where the skew is occurring, and take steps to dissolve it and get back to processing our big data. This post will show you one way to help find the source of skew in a Spark DataFrame. It won’t delve into the handful of ways to mitigate it (repartitioning, distributing/clustering, isolation, etc.) (but our new book will), but this will certainly help pinpoint where the issue may be.

Click through to learn more.
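
Landon's post has the full walkthrough. As a simple starting point, one common way to eyeball skew in PySpark is to count rows per key and per physical partition and look for values that dwarf the rest; the input path and the customer_id column below are assumptions for illustration, not Landon's exact code:

```python
# A common skew check: count rows per key and per physical partition,
# then look for counts that dwarf the rest. The input path and the
# customer_id column are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
df = spark.read.parquet("/data/orders")  # placeholder path

# Rows per join/grouping key: a few keys with huge counts suggests key skew.
(df.groupBy("customer_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))

# Rows per physical partition: a large imbalance here suggests partition skew.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))
```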


Cloning Delta Lakes

Burak Yavuz and Pranav Anand show us how to clone Delta Lakes:

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.

Read on to learn the differences, as well as a few useful scenarios.
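
For reference, creating a clone is a single SQL statement in Databricks Delta Lake. A minimal sketch issued from PySpark, with placeholder table names, might look like this:

```python
# A minimal sketch of shallow and deep clones of a Delta table on Databricks.
# Table names are placeholders; this assumes a Databricks/Delta Lake session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-clone").getOrCreate()

# Shallow clone: copies only metadata and references the source's data files,
# so it is cheap to create (useful for short-lived experiments).
spark.sql("CREATE TABLE IF NOT EXISTS sales_dev SHALLOW CLONE sales")

# Deep clone: also copies the data files, producing a fully independent table
# (useful for archiving or disaster recovery).
spark.sql("CREATE TABLE IF NOT EXISTS sales_archive DEEP CLONE sales")

# Per the snapshot isolation noted above, changes to a clone never flow back
# to the source, and later changes to the source do not appear in the clone.
# (The amount column here is a placeholder.)
spark.sql("UPDATE sales_dev SET amount = 0 WHERE amount < 0")
```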


The Merry-Go-Round Scan

David Fowler covers one of the best ways of optimizing frequent scans of large amounts of data:

As we all know, full table scans can be very expensive, poor old SQL is forced to read every single row in a table (of course that doesn’t always mean that it’s a bad choice for SQL).

Let’s assume we’ve got a table scan happening that results in 1,000,000 page reads; that’s quite a bit of work for SQL to do. Now imagine another query comes in and needs to scan the same table; that’s also going to need to do 1,000,000 reads to get the data that it needs. If this table happens to be frequently accessed, this is soon going to add up.

There’s a clever solution which tends to work better and better as you have more and more queries scanning the table.


The Value of FileTable

Louis Davidson shows us a good scenario for SQL Server’s FileTable:

Now, my pictures are very manageable and searchable using SQL Server. The fact that the names are very standardized also helps me with searching the names in a file folder (I copy all of the files from the share into a Dropbox folder so I have access to the files everywhere I am as well.)

An additional cool thing is that even on my Express Edition local server, the FileTable data is backed up with the database, so when I back up the database, it backs up my pictures in addition to the textual data. That extra backup, along with the copy of the files in Dropbox, gives me extra relief that I have a decent backup if something gets corrupted.

With just a little bit of configuration, and a little bit of code, I have a simple database of my images that I can search using T-SQL in a manner that is a lot more powerful than what I can reasonably do using File Explorer, along with the ability to keep history of how the pictures are used. This would be very useful for any photographer with a stock of pictures that they want to search and keep history of, even if it is just baby pictures…of the puppy.

Click through for the example. I think FileTable is one of the most underutilized features from SQL Server 2012.
