Press "Enter" to skip to content

Author: Kevin Feasel

JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview how using a remote Databricks cluster from your local JupyterLab would look like. It then provides an end to end example of working with JupyterLab Integration followed by explaining the differences to Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.

Comments closed

Power BI Aggregation Precedence

Shabnam Watson asks and answers a question of importance:

Precedence is one of the aggregation properties that you can define on an aggregation table in Power BI. In a model with multiple aggregation tables, you can use Precedence to define the order in which aggregation tables will be considered by Power BI to answer queries. The higher the Precedence number, the sooner the aggregation table will be considered. Very simple and easy to understand. but the question is:

What does Power BI do when there are multiple aggregation tables configured with the same Precedence value and they can all answer the same query? Which one is considered first? Does it choose the smallest one? or is there another rule in place?

Click through to find out.

Comments closed

An Ode to the SQLCallstackResolver

Jonathan Kehayias really likes the SQLCallstackResolver project:

Lately, I have been doing a lot of work troubleshooting certain behaviors in SQL Server for workloads that are, to put it simply, designed horribly. As a part of this, I have found it necessary to collect callstacks with Extended Events and to materialize them using the debugger symbols for SQL Server to see where exactly certain types of issues are being encountered to better understand some of the internals of newer features in SQL Server 2017 and 2019. Years ago I blogged about how to use the package0.callstack action in Extended Events for this type of thing, and Paul also has a blog post that talks about how to download the PDB symbols for SQL Server as well as a post that also demonstrates using the package0.callstack action to determine what causes a particular wait type. Using the debugging tools to get the symbols is somewhat clunky and tedious, so when I happened on this amazingly simple method of getting symbol files I had to share it.

The SQLCallstackResolver on Github has to be one of the greatest things since sliced bread if you want to materialize callstacks from SQL Server. 

Arvind Shyamsundar deserves a lot of credit for putting it together; he did a great job with the project.

Comments closed

Persistent Memory for SQL Server on Linux

The SQL Server team shows how you can configure persistent memory for SQL Server on Linux:

With the release of SQL Server 2019 on Linux, Microsoft introduced persistent memory (PMEM) support on Linux. This is an exciting development, as previous versions of SQL Server on Linux didn’t support PMEM. Let’s look at how to configure the PMEM for SQL Server on Linux.

SQL Server 2016 introduced support for non-volatile DIMMs and an optimization called Tail of the Log Caching on NVDIMM. These leveraged Windows Server direct access to a persistent memory device in DAX mode to reduce the number of operations needed to harden a log buffer to persistent storage.

SQL Server 2019 extends the support for PMEM devices to Linux, providing full enlightenment of data and transaction logs placed on PMEM. Enlightenment is a way to access the storage device using efficient user-space memcpy() operations. Rather than going through the file system and storage stack, SQL Server leverages DAX support on Linux to place data directly into the device. This helps to reduce latency.

Click through for the configuration steps.

Comments closed

Troubleshooting RESOURCE_SEMAPHORE Waits

David Fowler takes a look at the RESOURCE_SEMAPHORE wait type:

These aren’t something that I ever want to see and if I do I am straight away going to go looking for the cause. Basically what’s that’s telling you is that you’ve got processes waiting on a memory allocation. This is going to be because SQL Server hasn’t got enough memory to dish out to that particular process.

What’s going on here is that every time a query runs, it’ll ask for a certain amount of memory. A happy SQL Server will serve up a tasty slice of memory for that query to run in. The problem comes when the query is asking for a bigger slice of the pie than SQL has available. In that case the query will need to wait until SQL has enough free to give the query what it wants, that’s when the RESOURCE_SEMAPHORE wait starts ticking up.

Read on for a second look at this, as well as what you can do to help. Sometimes the answer is “add more memory” but this isn’t necessarily the case.

Comments closed

Realistic-Looking Islands with R

Holger K. von Jouanne-Diedrich uses fractal math to create realistic-looking artificial islands:

Here we will turn this principle on its head and use it to actually create realistic-looking landmasses with R. The inspiration for this came from chapter 4 “Infinite Detail” of the book “Math Bytes” by my colleague Professor T. Chartier from Davidson College in North Carolina.

The idea is to start with some very simple form, like a square, and add more detail step-by-step. Concretely, we go through every midpoint of our ever more complex polygon and shift it by a random amount. Because the polygon will be getting more and more intricate we have to adjust the absolute amount by which we shift the respective midpoints. 

Click through for the code.

Comments closed

Automated ML Pipelines with SAS

Sophia Rowland shows off SAS’s auto-ML action:

The dsAutoMl action does it all. It will explore your data, generate features, select features, create models, and autotune the hyper-parameters of those models. This action includes the four policies we have seen in my first two blogs: explorationPolicy, screenPolicy, transformationPolicy, and selectionPolicy. Please review my previous blogs if you need a refresher on the data exploration and cleaning process or feature generation and selection process. The dsAutoMl action builds on our prior discussions through model generation and autotuning. A data scientist can choose to build several models such as decision trees, random forests, gradient boosting models, and neural networks. In addition, the data scientist can control which objective function to optimize for and the number of K-folds to use. The output of the dsAutoMl action includes information about the features generated, information on the model pipelines generated, and an analytic store file for generating the features with new data.

This is an area where several companies are investing a lot of money, trying to simplify the process of training models.

Comments closed

Managing SQL Server Documentation with JSON

Phil Factor gives us the gloop:

Metadata extract files are handy for documentation, study, cataloguing and change-tracking. This type of file supplements source because it can record configuration, permissions, dependencies and documentation much more clearly. It is a good way of making a start with documenting your database.

Here is a sample of a json metadata file (from AdventureWorks 2016). It was generated using GloopCollectionOfObjects.sql that is here in Github, and is being viewed in JSONBuddy. I use this format of JSON, a collection of documents representing SQL Server base objects (no parent objects) when I need to read the contents into MongoDB. The term ‘Gloop’ refers to a large query that, you’d have thought, would be better off as a procedure. Here is a typical sample of the output.

This is an interesting approach to documentation. I’m not totally buying into it, but that might just be due to my not having tried it.

Comments closed

Incremental Data Moves to Azure Blob Storage

Ginger Daniel continues a series on moving data incrementally from SQL Server to Azure Blob Storage:

In Part 1 of this series, we demonstrated how to copy a full SQL database table from a SQL Server database into an Azure Blob Storage account as a csv file.  My client needed data moved from their on premise SQL Server database to Azure, and then needed the daily incremental data changes uploaded as well.  This article will discuss how to upload the incremental data changes to Azure after the initial data load.

Click through for the process.

Comments closed

Eager Spooling Against Indexes

Erik Darling finds an eager spool even when there is a good index to use:

But he did write about Eager Index Spools recently, and the post ended with the following statement:

Eager index spools are often a sign that a useful permanent index is missing from the database schema.

I’d like to show you a case where you may see an Eager Index Spool even when you have the index being spooled.

Click through for Erik’s demonstration.

Comments closed