Press "Enter" to skip to content

Curated SQL Posts

tempdb Usage and Recommendations

Chad Callihan has a two-parter for us. First up is a look at tempdb usage:

The tempdb database in SQL Server holds a little bit of everything over its existence. There can be the temporary tables that you created, work tables like spools that SQL Server creates, version store data, and much more in tempdb at any time.

Is your tempdb having a hard time keeping up with the workload on your SQL Server? How can you find out what is using tempdb the most?

Chad also recommends picking the right number of files for tempdb:

We just made it through Thanksgiving dinner and there’s another big Christmas dinner coming up in few weeks. Sometimes you might find that one plate is not enough to hold all of that food. Just as multiple plates can make big dinners easier to manage, multiple tempdb data files can make the SQL Server workload easier to manage.

Read on to see how to track and modify this setting.

Comments closed

Modeling Many-to-Many Relationships in Power BI

Marco Russo and Alberto Ferrari show off two techniques:

Our readers know SQLBI position regarding bidirectional relationships: they are a powerful tool that should be used with great care and avoided in most scenarios. There actually is one scenario where bidirectional relationships are useful: when you need to create a model involving a many-to-many relationship between dimensions. In this scenario, using a bidirectional filter relationship is the suggested solution. Nonetheless, there may be reasons why the bidirectional relationship cannot be created, because of ambiguity. If you face this situation, you can use a different modeling technique based on a limited many-to-many cardinality relationship, which would work even when it is set as a unidirectional relationship. The choice between the two models is not an easy one. Both come with advantages and disadvantages that need to be deeply understood in order to make the right choice.

In this article, we first provide a description of the two techniques, and then we proceed with the performance analysis of both solutions, so to provide information about which technique to use and when.

Read on for the analysis.

Comments closed

Getting Around in Spark

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.

We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:

/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Indexing and Window Functions

I continue a series on window functions in SQL Server:

If you’ve been around the block with window functions, you’ve probably heard of the POC indexing strategy: Partition by, Order by, Covering. In other words, with a query, focus on the columns in the PARTITION BY clause (in order!), then the ORDER BY clause (again, in order!), and finally other columns in the SELECT clause to make the index covering (not in order! though it doesn’t hurt!).

But do read on to understand why this is not sufficient.

Comments closed

Troubleshooting Timeouts with Import Refresh

Chris Webb begins a series on troubleshooting timeouts:

If you’re working with a large Power BI dataset and/or a slow data source in Import mode it can be very frustrating to run into timeout errors after you have already waited a long time for a refresh to finish. There are a number of different types of timeout that you might run into, and in this series I’ll look at a few of them and discuss some of the ways you can work around them.

In this post I’ll look at one of the most commonly-encountered timeouts: the limit on the maximum length of time an Import mode dataset refresh can take. 

Click through to see the limits and ways to (sort of) get around them.

Comments closed

Tracking Value Changes in SQL Server

Tomas Zika wants to track a change:

This time a colleague from work asked how to best find a culprit that has been changing a specific cell in a table. It could be an automated process, application logic, application user or even an ad-hoc statement – we didn’t know. The table has many different access patterns, some of which are frequent. Ideally, we don’t want to monitor everything and sift through it.

We wanted to learn the who, the how and then ask why? If you like to know the whole journey, read on. Otherwise, you can skip to section Eureka moment

Click through for the Eureka moment. It is important to embrace the power of “and.”

Comments closed

The User-Assigned Managed Identity in ADF

Asanka Padmakumara takes a look at the user defined managed identity:

If you are familiar with Managed Identity concepts in ADF, each ADF instance comes with own System Assigned Managed Identity (MI). We can use that MI to control ADF’s access to any data sources which support Azure AD based authentication. This is considered to be the most secured and recommended way of authenticating ADF with cloud systems. If not, you can use Azure Key vault to store credentials. Let’s take an example on to discuss how User Assigned Managed Identity helps for manage access within multiple ADF environment.

Click through to see how the user assigned managed identity makes life better.

Comments closed

SQL Server Backend for Django

Warren Chu announces a new version of the SQL Server 3rd Party Backend for Django:

We have released version 1.1 of the SQL Server 3rd Party Backend for Django. This release contains support for the upcoming release of Django 4.0, as well as a number of issue fixes.

Our plan is to time releases to coincide with major releases of Django and SQL Server, to ensure users of this project can keep up to date with Django while continuing to use SQL Server as a backend.

Read on to see what this entails.

Comments closed

Choosing a Statistical Test

Antoine Soetewey has a handy chart for us:

Being a teaching assistant in statistics for students with diverse backgrounds, I have the chance to see what is globally not well understood by students.

I have realized that it is usually not a problem for students to do a specific statistical test when they are told which one to use (as long as they have good resources and they have been attentive during classes, of course). However, it appears that the task is much more difficult for them when they need to choose what test to do.

Click through for the chart, as well as a PDF version. H/T R-Bloggers.

Comments closed

Installing Apache Spark

Tomaz Kastrun continues a series on Apache Spark:

Installing Apache Spark on Windows computer will require preinstalled Java JDK (Java Development Kit). Java 8 or later version, with current version 17. On Oracle website, download the Java and install it on your system. Easiest way is to download the x64 MSI Installer. Install the file and follow the instructions. Installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”.

Read on for instructions for both Windows and MacOS. You can also create a container running Spark, which is another helpful method.

Comments closed