Press "Enter" to skip to content

Curated SQL Posts

What Good Data Governance Means

Paul Andrew shares some thoughts on the true meaning of data governance:

… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance is was a very broad question.

To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principals and technical implications.

Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governanace into the following areas. What I also find with data governanace that the more I experience in the industry the deeper my perspective on the subject goes.

Paul has put together a lengthy answer on the topic, well worth the read.

Comments closed

Batch Mode and Window Functions

I wind down a series on window functions:

SQL Server typically operates in row mode, which means that an operator processes one row at a time. This sounds inefficient, but tends to work out pretty well in practice. However, something which may work out even better is to process more than one row at a time, especially when the number of rows gets to be fairly large. Enter batch mode.

Batch mode was introduced in SQL Server 2012 alongside non-clustered columnstore indexes. It became interesting in SQL Server 2016 and very interesting in SQL Server 2019. That’s because 2016 introduced writable clustered columnstore indexes and 2019 gives us batch mode outside of columnstore indexes.

There are some nice potential performance gains for queries involving window functions.

Comments closed

tempdb Usage and Recommendations

Chad Callihan has a two-parter for us. First up is a look at tempdb usage:

The tempdb database in SQL Server holds a little bit of everything over its existence. There can be the temporary tables that you created, work tables like spools that SQL Server creates, version store data, and much more in tempdb at any time.

Is your tempdb having a hard time keeping up with the workload on your SQL Server? How can you find out what is using tempdb the most?

Chad also recommends picking the right number of files for tempdb:

We just made it through Thanksgiving dinner and there’s another big Christmas dinner coming up in few weeks. Sometimes you might find that one plate is not enough to hold all of that food. Just as multiple plates can make big dinners easier to manage, multiple tempdb data files can make the SQL Server workload easier to manage.

Read on to see how to track and modify this setting.

Comments closed

Modeling Many-to-Many Relationships in Power BI

Marco Russo and Alberto Ferrari show off two techniques:

Our readers know SQLBI position regarding bidirectional relationships: they are a powerful tool that should be used with great care and avoided in most scenarios. There actually is one scenario where bidirectional relationships are useful: when you need to create a model involving a many-to-many relationship between dimensions. In this scenario, using a bidirectional filter relationship is the suggested solution. Nonetheless, there may be reasons why the bidirectional relationship cannot be created, because of ambiguity. If you face this situation, you can use a different modeling technique based on a limited many-to-many cardinality relationship, which would work even when it is set as a unidirectional relationship. The choice between the two models is not an easy one. Both come with advantages and disadvantages that need to be deeply understood in order to make the right choice.

In this article, we first provide a description of the two techniques, and then we proceed with the performance analysis of both solutions, so to provide information about which technique to use and when.

Read on for the analysis.

Comments closed

Getting Around in Spark

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.

We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:

/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Indexing and Window Functions

I continue a series on window functions in SQL Server:

If you’ve been around the block with window functions, you’ve probably heard of the POC indexing strategy: Partition by, Order by, Covering. In other words, with a query, focus on the columns in the PARTITION BY clause (in order!), then the ORDER BY clause (again, in order!), and finally other columns in the SELECT clause to make the index covering (not in order! though it doesn’t hurt!).

But do read on to understand why this is not sufficient.

Comments closed

Troubleshooting Timeouts with Import Refresh

Chris Webb begins a series on troubleshooting timeouts:

If you’re working with a large Power BI dataset and/or a slow data source in Import mode it can be very frustrating to run into timeout errors after you have already waited a long time for a refresh to finish. There are a number of different types of timeout that you might run into, and in this series I’ll look at a few of them and discuss some of the ways you can work around them.

In this post I’ll look at one of the most commonly-encountered timeouts: the limit on the maximum length of time an Import mode dataset refresh can take. 

Click through to see the limits and ways to (sort of) get around them.

Comments closed

Tracking Value Changes in SQL Server

Tomas Zika wants to track a change:

This time a colleague from work asked how to best find a culprit that has been changing a specific cell in a table. It could be an automated process, application logic, application user or even an ad-hoc statement – we didn’t know. The table has many different access patterns, some of which are frequent. Ideally, we don’t want to monitor everything and sift through it.

We wanted to learn the who, the how and then ask why? If you like to know the whole journey, read on. Otherwise, you can skip to section Eureka moment

Click through for the Eureka moment. It is important to embrace the power of “and.”

Comments closed

The User-Assigned Managed Identity in ADF

Asanka Padmakumara takes a look at the user defined managed identity:

If you are familiar with Managed Identity concepts in ADF, each ADF instance comes with own System Assigned Managed Identity (MI). We can use that MI to control ADF’s access to any data sources which support Azure AD based authentication. This is considered to be the most secured and recommended way of authenticating ADF with cloud systems. If not, you can use Azure Key vault to store credentials. Let’s take an example on to discuss how User Assigned Managed Identity helps for manage access within multiple ADF environment.

Click through to see how the user assigned managed identity makes life better.

Comments closed