Press "Enter" to skip to content

Curated SQL Posts

Things To Know About Databricks UAP

Kara Annanie has five things you should know about the Databricks Unified Analytics Platform:

4. A Spark Dataframe is not the same as a Pandas/R Dataframe
Spark Dataframes are specifically designed to use distributed memory to perform operations across a cluster whereas Pandas/R Dataframes can only run on one computer. This means that you need to use a Spark Dataframe to realize the benefits of the cluster when coding in Python or R within Databricks.

This is a nice set of tips for people getting started with Spark and Databricks.


When A Procedure Has Multiple Plan Cache Entries

Arthur Daniels shows that multi-statement stored procedures can have multiple entries in the plan cache:

So we have two entries for this stored procedure. I included the statement sql handle to show that each statement handle has its own text. Let’s parse that text to see each statement. I copied the parsing SQL from this Plan Cache article.

This is a good thing to keep in mind if you’re trying to figure out how often a procedure gets called: SUM on the execution counts grouped only by text might not give you the results you expect.
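
If you want to see the difference for yourself, here's a minimal sketch (my own, not from Arthur's post) that contrasts the per-statement rows in sys.dm_exec_query_stats with the object-level counts in sys.dm_exec_procedure_stats:

-- Per-statement entries: one row per statement of the procedure, each with its own handle and text.
SELECT
    OBJECT_NAME(st.objectid, st.dbid) AS procedure_name,
    COUNT(*)                          AS cached_statement_entries,
    SUM(qs.execution_count)           AS summed_statement_executions
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.objectid IS NOT NULL
GROUP BY st.dbid, st.objectid;

-- Object-level counts: if the question is "how often does the procedure get called?",
-- sys.dm_exec_procedure_stats already tracks executions per procedure.
SELECT
    OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
    ps.execution_count
FROM sys.dm_exec_procedure_stats AS ps;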


Azure Data Studio Code Snippets

Rich Brenner shows us how we can create code snippets in Azure Data Studio:

A great thing about these snippets is that you can add your own and they can be exactly how you want them.

To get started with this, open the Command Palette with Ctrl+Shift+P and type in ‘snippets’.

Scroll down and find the SQL option. Open it and it will bring you to the SQL.json file in which we’ll be storing our SQL Snippets.

I had to migrate a bunch of SSMS snippets to Azure Data Studio and was not that happy with the experience, especially for some of the more complicated snippets.
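
If you're curious what goes into that file, entries in SQL.json use the standard VS Code snippet schema (Azure Data Studio is built on the same editor). The name, prefix, and body below are made-up placeholders rather than anything from Rich's post:

{
    "Select Top 100 Rows": {
        "prefix": "sqlSelectTop100",
        "body": [
            "SELECT TOP 100 *",
            "FROM ${1:SchemaName}.${2:TableName};"
        ],
        "description": "Quick look at the first 100 rows of a table"
    }
}

Typing the prefix in a query window and accepting the suggestion expands the body, with tab stops on the placeholders.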


Digging Into DBCC CHECKIDENT

Solomon Rutzky covers the four primary scenarios when running DBCC CHECKIDENT and specifying a new reseed value:

So, when specifying a “new_reseed_value”, the possible scenarios covered are:
1. Rows exist
2. No rows due to none inserted since the table was created
3. No rows due to TRUNCATE TABLE operation

What’s missing? The following scenario:
No rows due to DELETE operation!!

Click through to see how DBCC CHECKIDENT behaves differently depending upon the scenario.
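
If you want to reproduce the missing scenario yourself, a quick sketch along these lines should do it (the table and values are throwaway placeholders; Solomon's post has the full matrix of outcomes):

CREATE TABLE dbo.ReseedTest (ID INT IDENTITY(1, 1) NOT NULL, Val CHAR(1) NOT NULL);
INSERT INTO dbo.ReseedTest (Val) VALUES ('a');

-- Empty the table via DELETE (the scenario the documentation skips), then reseed.
DELETE FROM dbo.ReseedTest;
DBCC CHECKIDENT (N'dbo.ReseedTest', RESEED, 100);
INSERT INTO dbo.ReseedTest (Val) VALUES ('b');
SELECT ID FROM dbo.ReseedTest;

-- Now empty it via TRUNCATE TABLE and reseed to the same value.
TRUNCATE TABLE dbo.ReseedTest;
DBCC CHECKIDENT (N'dbo.ReseedTest', RESEED, 100);
INSERT INTO dbo.ReseedTest (Val) VALUES ('c');
SELECT ID FROM dbo.ReseedTest;

Comparing the two IDs shows whether the DELETE case lines up with the documented scenarios.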


Database Integration Testing With Pester

Ust Oldfield shows us an example of how we can perform database integration testing using Pester:

In a previous post, I gave an overview to integration tests and documenting integration points. In this post, I will give a practical example of developing and performing integration tests with the Pester framework for PowerShell. With a data platform, especially one hosted in Azure, it’s important to test that the Azure resources in your environment have been deployed and configured correctly. After we’ve done this, we can test the integration points on the platform, confident that all the components have been deployed.

The code for performing integration tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test documented integration points. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database.

Click through for the script.


Orphaned Workspaces In Power BI

David Eldersveld explains what orphaned workspaces are in Power BI:

One of the newer features in the Power BI Admin Portal is the ability to view all of a tenant’s Workspaces. As I was browsing through the collection of workspaces, I noticed several marked as Orphaned. What is an orphaned workspace, and how does it occur?

I was expecting orphaned workspaces to be a new thing where you pay for an Azure service using a distributed blockchain technology called Gruel (or maybe Grool).


The Costs Of Rebuilding Indexes Online

Tibor Karaszi explains that TANSTAAFL (There Ain't No Such Thing As A Free Lunch) applies to online index rebuilds:

The time it takes to rebuild the index can be substantially longer for ONLINE. Many of us have other things to do with the database during night-time and/or weekends. Yes, these are the typical windows in time where we try to fit things such as index rebuilds. Say that you do it night-time and it currently takes 4 hours. Wouldn’t it be nice if you could cut that time down to 1.5 hours? That would leave more time for imports, massaging of data, CHECKDB and other things you want to do. Sure, you can do it ONLINE, but it will slow down access during the rebuild. Also, the more data you modify during the rebuild, the more space you need in tempdb.

Betteridge’s Law of Headlines applies too, so that’s two important principles in one post.

As far as the post goes, Tibor makes a fair point: there is a trade-off between availability and efficiency with index rebuilds. But having worked with clustered columnstore indexes in SQL Server 2014, I can say this: you’ll pry the online operations in subsequent versions out of my cold, dead hands.
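
For reference, the trade-off comes down to a single option on the rebuild statement (index and table names here are placeholders):

-- Offline: typically the faster rebuild, but the table is blocked for the duration.
ALTER INDEX IX_Example ON dbo.Example REBUILD WITH (ONLINE = OFF);

-- Online: the table stays available during the rebuild (an Enterprise Edition feature),
-- at the cost of a longer rebuild and, as Tibor notes, more tempdb space as
-- concurrent modifications pile up.
ALTER INDEX IX_Example ON dbo.Example REBUILD WITH (ONLINE = ON);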


Working With WebHDFS From Node.js

Somanth Veettil shows us how to use Node.js to work with the WebHDFS REST API:

There is an npm module, “node-webhdfs,” with a wrapper that allows you to access Hadoop WebHDFS APIs. You can install the node-webhdfs package using npm:
npm install webhdfs 
After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.

Click through for examples on how the package works.


Impala Improvements in CDH 5.15.0

Michael Ho, et al., share some improvements in Apache Impala’s scalability in the Cloudera Distribution of Hadoop:

Kudu RPC (KRPC) supports asynchronous RPCs. This removes the need to have a single thread per connection. Connections between hosts are long-lived. All RPCs between two hosts multiplex on the same established connection. This drastically cuts down the number of TCP connections between hosts and decouples the number of connections from the number of query fragments.

The error handling semantics are much cleaner and the RPC library transparently re-establishes broken connections. Support for SASL and TLS are built-in. KRPC uses protocol buffers for payload serialization. In addition to structured data, KRPC also supports attaching binary data payloads to RPCs, which removes the cost of data serialization and is used for large data objects like Impala’s intermediate row batches. There is also support for RPC cancellation which comes in handy when a query is cancelled because it allows query teardown to happen sooner.

Looks like there were some pretty nice gains out of this project.
