
Curated SQL Posts

Monitoring Blocked Processes in Azure SQL DB

Etienne Lopes wants to see what the hold-up is:

Blocked processes (and deadlocks) are often among the main factors responsible for performance issues in databases; as such, it’s really important to monitor them effectively and, if they exist, understand where, what, why, how often, for how long, and so on. Having that information will greatly help in the subsequent fine-tuning process.

In this post I’ll show one way to easily monitor blocked processes in an Azure SQL database.

Click through to learn how.
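In the meantime, if you just need a quick point-in-time look at blocking, a DMV query along these lines works in Azure SQL DB. This is a minimal sketch, not the monitoring setup from the post:

    -- Point-in-time view of blocked requests and who is blocking them.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time AS wait_time_ms,
           t.text AS running_sql
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0;

Note that this only catches blocking at the moment you run it; persistent monitoring, the subject of the post, needs something event-based.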


Incremental Refresh in Power BI

Reza Rad doesn’t have time to wait for a full dataset reload:

The default configuration for the Power BI dataset is to wipe out the entire data and reload it again. This can be a long process if you have a big dataset. Hybrid tables in Power BI keep part of the data in DirectQuery, and the rest is imported for data freshness and performance.

In this article, I explain how you can set up an incremental refresh in Power BI and its requirements. You will also learn about Hybrid tables in Power BI. Incremental Refresh is not just in Power BI datasets but also in Dataflows and Datamarts. In this article, you learn to load only part of the changed data instead of loading the entire data each time. To learn more about Power BI, read the Power BI book from Rookie to Rock Star.

Click through for the article.


Architecting a Data Lake

James Serra provides some guidance:

I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog post Data lake details, but that was written a long time ago, so I wanted to update it. I often find customers do not spend enough time in designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

The concepts are simple but there are some interesting implications to what James includes as well as additional resources, so check it out.


Finding the Resource Database

Chad Callihan confirms that you can see the resource database but only in your peripheral vision:

Can you name each of the system databases in SQL Server?

You might be able to name master, msdb, model, and tempdb. What about the fifth system database? Don’t feel bad if you’re drawing a blank.

The fifth system database is the Resource database. Even if you’ve never heard of the Resource database, chances are that you’ve used it. Let’s discuss how that can be.

Click through to learn what the resource DB is and why you typically won’t see it.
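As a quick taste: you can’t connect to the Resource database, but SERVERPROPERTY will happily report on it:

    -- The Resource database is hidden, but these documented server
    -- properties expose its version and when it was last updated.
    SELECT SERVERPROPERTY('ResourceVersion') AS resource_db_version,
           SERVERPROPERTY('ResourceLastUpdateDateTime') AS resource_db_last_updated;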


Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy, and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome much faster, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this, and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.
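Estelle’s solutions are in SAS, but the exact-duplicate case from the quote translates to most platforms. In T-SQL, for example, you could group documents by hash; the table and column names here are hypothetical:

    -- Hypothetical table: dbo.documents(doc_id INT, body NVARCHAR(MAX)).
    -- Identical text produces identical hashes, so grouping finds exact duplicates.
    -- (Near-duplicates, the harder problem in the post, will not match.)
    SELECT HASHBYTES('SHA1', body) AS body_hash,
           COUNT(*) AS copies,
           MIN(doc_id) AS keep_id
    FROM dbo.documents
    GROUP BY HASHBYTES('SHA1', body)
    HAVING COUNT(*) > 1;

One caveat: SQL Server 2016 removed the old 8,000-byte input limit on HASHBYTES, so this approach assumes a reasonably recent version.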


English and Japanese Vowels

John Cook plots vowels:

Vowel sounds can be visualized in a two-dimensional space according to tongue position. The vertical axis runs from open down to closed, and the horizontal runs from front to back. See a linguistics textbook for far more detail.

English has five vowel letters, but a lot more than five vowel sounds. Scholars argue about how many vowel sounds English and other languages have because there’s room for disagreement on how much two sounds can differ and still be considered variations on the same sound. The IPA Handbook [1] lists 11 vowel sounds in American English, not counting diphthongs.

Click through for comparative charts and some bonus expertise in the comments.


Streaming Datasets in Power BI

Reza Rad needs data in real time:

Datasets in Power BI can have connection types such as Import, DirectQuery or Live Connection. However, there is also one specific type of dataset which is different. This type of dataset is called Streaming Dataset. A streaming dataset is for a real-time dashboard and comes with various setups and configurations. In this video and article, we’ll talk about this type of dataset.

Reza includes a video as well as a very helpful walkthrough.


sqlpackage and Managed Identities

Nora Yang provides a guide:

– Enable AAD auth on the Azure SQL server

– Connect to the Azure SQL database as the AAD admin

– Create a contained user for the managed identity (using the Azure VM name as the contained username)

        create user <vmname> from external provider;

        alter role db_owner add member <vmname>;

Read on for the full set of steps.
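Putting the quoted steps together, a commented version of the contained-user setup might look like the following, run in the target database while connected as the AAD admin; the VM name is a placeholder:

    -- Run in the target user database, connected as the AAD admin.
    -- [my-vm-name] stands in for the Azure VM whose managed identity
    -- sqlpackage will authenticate as.
    CREATE USER [my-vm-name] FROM EXTERNAL PROVIDER;
    ALTER ROLE db_owner ADD MEMBER [my-vm-name];

    -- Sanity check: the new principal should appear as an external user.
    SELECT name, type_desc, authentication_type_desc
    FROM sys.database_principals
    WHERE name = 'my-vm-name';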


Tuning a Range Query

Grant Fritchey gets forum-sniped:

Recently I was looking through DBA.StackExchange when I saw a pretty simple question that I decided to answer. I went off, set up a test database and some test tables and quickly wrote a query to answer the question. I got it all formatted pretty and was on my way to post it when I saw that another answer was already there.

Yeah. Identical to mine. Almost line for line.

Well, nuts.

I know. I’ll write a blog post.

In thinking about the problem, the thing that caught my attention was Grant’s comment about poor design. This got me thinking about one of my favorite topics: orthogonal design for relational excellence. The idea of a BETWEEN table of [ MinValue : MaxValue ] is the first thing people think of, but it’s also the worst because you have two big problems: gaps and overlap.
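To make those failure modes concrete, a hypothetical version of this design might look like:

    -- Hypothetical range table: nothing prevents gaps or overlap.
    CREATE TABLE dbo.ValueRank
    (
        MinValue int NOT NULL,
        MaxValue int NOT NULL,
        RankDesc varchar(20) NOT NULL
    );

    INSERT INTO dbo.ValueRank (MinValue, MaxValue, RankDesc)
    VALUES (1, 100, 'Low'),
           (102, 500, 'Medium'),  -- gap: a Value of 101 matches no row
           (400, 1000, 'High');   -- overlap: a Value of 450 matches two rows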

The second solution is to use MinValue alone and calculate MaxValue (if we actually need it) as LEAD(MinValue) OVER (ORDER BY MinValue) - e, where e represents the smallest reasonable increment we’d need. Queries would find, for each Value in the main table, the largest MinValue at or below that Value. That removes gaps and overlap but might become a performance concern as the main table grows.
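A sketch of that design, with hypothetical table names and e = 1 for an integer domain:

    -- MinValue-only design: MaxValue is derived, so gaps and overlap
    -- cannot exist by construction.
    SELECT MinValue,
           LEAD(MinValue) OVER (ORDER BY MinValue) - 1 AS MaxValue,
           RankDesc
    FROM dbo.ValueRankMin;

    -- Lookup: for each Value, take the largest MinValue at or below it.
    SELECT m.Value, x.RankDesc
    FROM dbo.MainTable AS m
    CROSS APPLY (SELECT TOP (1) v.RankDesc
                 FROM dbo.ValueRankMin AS v
                 WHERE v.MinValue <= m.Value
                 ORDER BY v.MinValue DESC) AS x;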

The big-brain solution, which generally works best when you have a discrete number of options, could be a tally table. In Grant’s example, we see values from 1 to 1000, with a rank for each. If it’s really as simple as that, we would create a new lookup table with Value + RankDesc and simply join the main table’s Value to the lookup table’s Value to get the appropriate RankDesc. Yeah, you have 1000 rows instead of 3, but queries are trivial at that point. The downside is that this approach doesn’t work for continuous variables (e.g., give me the exact amount of your household income for the prior tax year), and the utility of this solution probably breaks down once you get past tens of thousands of rows.
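In sketch form, again with hypothetical names:

    -- Exploded lookup: one row per possible Value, no range logic at all.
    CREATE TABLE dbo.ValueRankLookup
    (
        Value int NOT NULL PRIMARY KEY,
        RankDesc varchar(20) NOT NULL
    );

    -- The range query becomes a trivial equijoin.
    SELECT m.Value, l.RankDesc
    FROM dbo.MainTable AS m
    INNER JOIN dbo.ValueRankLookup AS l
        ON l.Value = m.Value;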

In the case of a continuous variable or an enormous discrete variable, we have the simplest option of all: ignore something. If you care about the range, use the table from the second solution and use that ID on the main table. If you care about the value but not the range, just have the value and no lookup table.
