Press "Enter" to skip to content

Day: December 5, 2019

The A* Search Algorithm

Akash Kumar takes us through the A* algorithm and how it works for search:

Path Finding has been one of the oldest and most popular applications in computer programming. You could virtually find the most optimal path from a source to a destination by adding costs which would represent time, money etc. A* is one of the most popular algorithms for all the right reasons. In this article, let’s find out just why.

Click through for an explanation of what the algorithm does and pseudocode to implement it.

Comments closed

JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview how using a remote Databricks cluster from your local JupyterLab would look like. It then provides an end to end example of working with JupyterLab Integration followed by explaining the differences to Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.

Comments closed

Fastest Way to Delete Lots of Rows in SQL Server

Bertrand tries out a few methods to delete data and what SQL Server configuration settings do to this calculus:

That took far longer than I’m comfortable admitting. Part of that was because I had originally included a 0.1% test for rowperloop which, in some cases, took several hours. So I removed those from the table a few days in, and can easily say: if you are removing 1,000,000 rows, deleting 1,000 rows at a time is highly unlikely to be an optimal choice, regardless of any other variables

I think Aaron lays out the caveats pretty well, but I’d reiterate that the main benefit behind chunking delete operations is not so much to make things faster, but to reduce the amount of time you spend blocking more important work, like user queries. And reducing the risk of blowing out the transaction log file (and maybe running out of disk space too).

1 Comment

Azure Data Factory Components and Copy Data Wizard

Cathrine Wilhelmsen continues a series on Azure Data Factory. First, we get an overview of the available components:

Pipelines are the things you execute or run in Azure Data Factory, similar to packages in SQL Server Integration Services (SSIS). This is where you define your workflow: what you want to do and in which order. For example, a pipeline can first copy data from an on-premises data center to Azure Data Lake Storage, and then transform the data from Azure Data Lake Storage into Azure Synapse Analytics (previously Azure SQL Data Warehouse).

Then, Cathrine looks at the Copy Data wizard:

LEGO! Yay! I love LEGO. Rebrickable is an online service that will show you which LEGO sets you can build from the sets and parts you already own. Fun! 🙂

They also have a database of all official LEGO sets and parts (including themes and colors) that you can download for free as CSV files or JSON files.

The CSV files are automatically generated at the start of each month and can be found on rebrickable.com/downloads

Cathrine takes this LEGO data and feeds it into Azure Data Lake Storage.

Comments closed

An Ode to the SQLCallstackResolver

Jonathan Kehayias really likes the SQLCallstackResolver project:

Lately, I have been doing a lot of work troubleshooting certain behaviors in SQL Server for workloads that are, to put it simply, designed horribly. As a part of this, I have found it necessary to collect callstacks with Extended Events and to materialize them using the debugger symbols for SQL Server to see where exactly certain types of issues are being encountered to better understand some of the internals of newer features in SQL Server 2017 and 2019. Years ago I blogged about how to use the package0.callstack action in Extended Events for this type of thing, and Paul also has a blog post that talks about how to download the PDB symbols for SQL Server as well as a post that also demonstrates using the package0.callstack action to determine what causes a particular wait type. Using the debugging tools to get the symbols is somewhat clunky and tedious, so when I happened on this amazingly simple method of getting symbol files I had to share it.

The SQLCallstackResolver on Github has to be one of the greatest things since sliced bread if you want to materialize callstacks from SQL Server. 

Arvind Shyamsundar deserves a lot of credit for putting it together; he did a great job with the project.

Comments closed

Power BI Aggregation Precedence

Shabnam Watson asks and answers a question of importance:

Precedence is one of the aggregation properties that you can define on an aggregation table in Power BI. In a model with multiple aggregation tables, you can use Precedence to define the order in which aggregation tables will be considered by Power BI to answer queries. The higher the Precedence number, the sooner the aggregation table will be considered. Very simple and easy to understand. but the question is:

What does Power BI do when there are multiple aggregation tables configured with the same Precedence value and they can all answer the same query? Which one is considered first? Does it choose the smallest one? or is there another rule in place?

Click through to find out.

Comments closed

Persistent Memory for SQL Server on Linux

The SQL Server team shows how you can configure persistent memory for SQL Server on Linux:

With the release of SQL Server 2019 on Linux, Microsoft introduced persistent memory (PMEM) support on Linux. This is an exciting development, as previous versions of SQL Server on Linux didn’t support PMEM. Let’s look at how to configure the PMEM for SQL Server on Linux.

SQL Server 2016 introduced support for non-volatile DIMMs and an optimization called Tail of the Log Caching on NVDIMM. These leveraged Windows Server direct access to a persistent memory device in DAX mode to reduce the number of operations needed to harden a log buffer to persistent storage.

SQL Server 2019 extends the support for PMEM devices to Linux, providing full enlightenment of data and transaction logs placed on PMEM. Enlightenment is a way to access the storage device using efficient user-space memcpy() operations. Rather than going through the file system and storage stack, SQL Server leverages DAX support on Linux to place data directly into the device. This helps to reduce latency.

Click through for the configuration steps.

Comments closed

Troubleshooting RESOURCE_SEMAPHORE Waits

David Fowler takes a look at the RESOURCE_SEMAPHORE wait type:

These aren’t something that I ever want to see and if I do I am straight away going to go looking for the cause. Basically what’s that’s telling you is that you’ve got processes waiting on a memory allocation. This is going to be because SQL Server hasn’t got enough memory to dish out to that particular process.

What’s going on here is that every time a query runs, it’ll ask for a certain amount of memory. A happy SQL Server will serve up a tasty slice of memory for that query to run in. The problem comes when the query is asking for a bigger slice of the pie than SQL has available. In that case the query will need to wait until SQL has enough free to give the query what it wants, that’s when the RESOURCE_SEMAPHORE wait starts ticking up.

Read on for a second look at this, as well as what you can do to help. Sometimes the answer is “add more memory” but this isn’t necessarily the case.

Comments closed