Press "Enter" to skip to content

Author: Kevin Feasel

Troubleshooting High Threadpool Waits and Deadlocked Schedulers

Eitan Blumin takes us through a troubleshooting scenario:

In short, high THREADPOOL waits can happen when SQL Server doesn’t have enough “worker threads” to handle new tasks, which could cause SQL Server to hang and refuse connections. When a task is waiting for a worker thread to become available, that wait type is called THREADPOOL wait.

A background process, called “Scheduler Monitor”, will identify when the same worker threads are “stuck” in the same state for 60 seconds or more, in which case it will resolve the issue as a Deadlocked Scheduler, causing dropped connections, rollbacks, and even failovers.

When a Deadlocked Scheduler event happens, SQL Server will automatically generate a memory dump file (SQLDump#####.mdmp), and log the incident in the SQL Server Error Log.
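
As a rough way to see whether tasks are already queuing for worker threads, a query along these lines against the waiting-tasks DMV can help. This is a minimal sketch, not from the linked post; on a server that is badly thread-starved you may need the Dedicated Admin Connection to run it at all.

```sql
-- Count tasks currently waiting on THREADPOOL (i.e., waiting for a worker thread).
-- If the server is unresponsive, connect via the DAC (ADMIN:servername) to run this.
SELECT wait_type, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks
WHERE wait_type = N'THREADPOOL'
GROUP BY wait_type;
```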

Read on to understand what causes this as well as why we always fumble our keys under the car as the scary monster approaches.

Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset could be a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number for your whole or partitioned data recordset.

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Read on for a couple of attempts which didn’t work, followed by two that do, including an assist from Joseph Edwards.
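
For reference, the row_number() function the quote mentions looks like this in plain SQL (the table and column names here are invented for illustration); the linked post covers how to get the equivalent behavior inside an ADF data flow, where no such function is directly available.

```sql
-- Assign a unique, ascending number to each row, ordered by OrderDate;
-- add PARTITION BY to restart the numbering within each group.
SELECT
    OrderID,
    OrderDate,
    ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNum
FROM dbo.Orders;
```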

Stream Processing with ksqlDB

Michael Drogalis takes us through how stream processing works with ksqlDB:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.

Click through for a demo with animations.
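
To make the idea concrete, here is a minimal ksqlDB sketch (the topic, stream, and column names are hypothetical): a stream is declared over a Kafka topic, and a persistent query materializes a continuously updated table from it.

```sql
-- Declare a stream over an existing Kafka topic.
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- Persistent query: maintain a running count of views per user as new events arrive.
CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS view_count
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;
```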

Reasons Data Science Projects Fail

Ryohei Fujimaki summarizes some of the reasons why data science projects can fail:

According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to a production environment.

Even more critical, the economic downturn caused by the COVID-19 pandemic has placed increased pressure on data science and BI teams to deliver more with less. In this down market, organizations are reassessing which AI/ML models they should develop, how to optimize resources and how to best use valuable budget dollars for maximum impact. In this type of environment, AI/ML project failure is simply not acceptable.

That 85% sounds suspiciously like the percentage of failed business intelligence and data warehouse projects, as well as the percentage of failed big data projects. It’s close enough that it makes me want to come up with some overarching idea that projects based on the consolidation of multiple independent data systems across several business units are liable to fail about 5/6 of the time.

Delta Lake DML Internals

Tathagata Das, et al, take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.

Click through for a video as well as a blog post with the details.
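
As a sketch of the behavior described above, in Delta Lake’s SQL dialect (the table name and predicate are invented for illustration):

```sql
-- Rewrites the affected files; the old files are tombstoned in the transaction
-- log rather than physically removed, so time travel still works.
DELETE FROM events WHERE event_date < '2020-01-01';

-- Physically remove tombstoned files older than the retention threshold
-- (the default is 7 days / 168 hours), giving up time travel to those versions.
VACUUM events RETAIN 168 HOURS;
```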

View Native Analysis Services Queries in Power Query

Chris Webb gives us an update on Power Query:

If you’re familiar with the topic of query folding in Power Query, you’ll know that the View Native Query right-click option in the Applied Steps pane of the Power Query Editor can be used to show the native query that is run against the data source. You may also know that there are some data sources where query folding does take place but where View Native Query remains greyed out.

Read on to see which sources are now available and to see an example of this in action.

ADF and Self-Hosted Integration Runtime Config Errors

Teo Lachev points out a common issue with using the Azure Data Factory self-hosted integration runtime:

You’ve set up the Azure Data Factory self-hosted integration runtime to access on-prem data sources. You create a linked service, click Test Connection, and then get greeted with an error saying the security context can’t be passed. On the on-prem VM, you use the Integration Runtime Configuration Manager and get a similar error, or something to the effect that JSON can’t be parsed. You spend a few hours trying everything that comes to mind, such as checking firewalls and connectivity from SSMS, but nothing helps.

Read on for the solution.

Time Series Forecasting in R

Selcuk Disci contrasts a couple of methods for time series forecasting:

It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data are often exposed to serial correlation. In this article, we will compare k-nearest neighbors (KNN) regression, which is a supervised machine learning method, with a more classical stochastic process, autoregressive integrated moving average (ARIMA).

We will use the monthly prices of refined gold futures (XAUTRY) for one gram in Turkish Lira, traded on BIST (Istanbul Stock Exchange), for forecasting. We created the data frame starting from 2013. You can download the relevant Excel file from here.

Click through for the demonstration. H/T R-Bloggers.

Delayed Durability in SQL Server

Esat Erkec walks us through Delayed Durability in SQL Server:

In this article, we will learn the Delayed Durability feature that helps to improve transaction log file write throughput in SQL Server.

OLTP (Online Transaction Processing) databases should process a huge number of transactions within the shortest time and concurrently. Therefore, the transaction completion time becomes more important for the performance of the OLTP databases. Particularly for SQL Server, the transaction log (T-log) file configuration will play a key role in the performance of the transaction completion times because the write throughput to the log file directly affects the application response times.

This is a feature which might be useful in specific scenarios, but I’m always concerned about that risk of data loss.
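
For reference, delayed durability can be allowed or forced at the database level and then opted into per commit; a minimal sketch (the database, table, and column names are placeholders):

```sql
-- Permit delayed durability for the database without forcing it on every transaction.
ALTER DATABASE [SalesDB] SET DELAYED_DURABILITY = ALLOWED;

-- Opt in for a single transaction: the commit returns before the log records
-- are hardened to disk, trading a small window of potential data loss for speed.
BEGIN TRANSACTION;
    UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderID = 42;
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```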
