Author: Kevin Feasel

Removing Redundant Indexes

Guy Glantser helps find and remove redundant indexes in SQL Server:

For some reason, which I have never understood, SQL Server allows you to create duplicate indexes on the same object (table or view). You can create as many non-clustered indexes as you like with the exact same index keys and included columns as well as the exact same index properties. The only difference between the indexes would be the index ID and the index name. This is a very undesirable situation, because there is clearly no benefit from having the same index more than once, but on the other hand there is quite a lot of overhead that each index incurs. The overhead includes the storage consumed by the index, the on-going maintenance during DML statements, the periodical rebuild and/or reorganize operations, and the additional complexity that the optimizer has to deal with when evaluating possible access methods to the relevant table or view.

I don’t fully agree with Guy’s definition of redundancy, but it’s more a quibble than anything else. If I have an index on (A,B,C) and an index on (A,B), the second might seem redundant, but there are cases when it isn’t. For example, suppose C is a large NVARCHAR column, such that (A,B,C) barely fits under the index key size limit (1,700 bytes for a nonclustered index in SQL Server 2016 and later, 900 bytes in prior versions), while A and B are INT columns. If we have a lot of queries that need (A,B) but not C, that second index is definitely not redundant.
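To put a number on that, here is a minimal sketch (the table and column names are hypothetical) of a case where the narrower index earns its keep:

-- Minimal sketch: the narrower index is not redundant here.
CREATE TABLE dbo.Example
(
    A int NOT NULL,
    B int NOT NULL,
    C nvarchar(800) NOT NULL  -- up to 1,600 bytes, close to the 1,700-byte cap
);

-- The wide index serves queries that need C as a key column.
CREATE INDEX IX_Example_ABC ON dbo.Example (A, B, C);

-- The seemingly redundant index: 8 bytes of key instead of up to 1,608,
-- so far more rows per page for queries that only touch A and B.
CREATE INDEX IX_Example_AB ON dbo.Example (A, B);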

Regardless, click through for Guy’s argument and a script to help you find potentially redundant indexes.

Security Practices for Azure Databricks

Abhinav Garg and Anna Shrestinian walk us through good security practices when using Azure Databricks:

Azure Databricks is a Unified Data Analytics Platform that is a part of the Microsoft Azure Cloud. Built upon the foundations of Delta Lake, MLflow, Koalas, and Apache Spark™, Azure Databricks is a first party PaaS on Microsoft Azure cloud that provides one-click setup, native integrations with other Azure cloud services, interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Business Analysts and SecOps / Cloud Engineering.

In this article, we will share a list of cloud security features and capabilities that an enterprise data team could utilize to bake their Azure Databricks environment as per their governance policy.

Much of this is fairly straightforward, but it is nice to have it all in one place.

Accessing Azure Queue Storage from R

Hong Ooi announces a new package for R:

This post is to announce that the AzureQstor package is now on GitHub. AzureQstor provides an R interface to Azure queue storage, building on the facilities provided by AzureStor.

Queue Storage is a service for storing large numbers of messages, for example from automated sensors, that can be accessed remotely via authenticated calls using HTTP or HTTPS. A single queue message can be up to 64 KB in size, and a queue can contain millions of messages, up to the total capacity limit of a storage account. Queue storage is often used to create a backlog of work to process asynchronously.

Hong includes a couple of demos as well, so check it out.

XML Storage in SQL Server

Vladimir Klimov walks us through what happens when we save data to an XML data type:

When working on the release of dbForge Transaction Log, among other tasks, our team had to puzzle out how to properly store typed XML data.

To start with, it is worth mentioning that SQL Server does not store XML in the format it was entered. An XML string is parsed, split into tags, and thus stored in a compressed format. Description elements that the server considers unnecessary are discarded.

It also should be kept in mind that, if the data type of a column is specified as simple XML, the server will store this data as Unicode strings.

Click through for a couple of examples and their data storage requirements.
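If you want to see the normalization in action, here is a minimal sketch:

-- Minimal sketch: SQL Server parses and normalizes XML rather than
-- storing the original string verbatim.
DECLARE @x xml = N'<root>
    <item  id = "1">value</item>
</root>';

SELECT @x;                             -- <root><item id="1">value</item></root>
SELECT DATALENGTH(@x) AS BytesStored;  -- size of the internal binary representation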

Displaying Cosmos DB Spatial Data with .NET Core

Hasan Savran builds up a quick .NET Core app to retrieve spatial data from Cosmos DB and display it:

Cosmos DB stores geospatial data in GeoJSON format. You can not tell what raw GeoJSON represents because usually all it has is a type and a bunch of coordinates. Azure Cosmos DB does not have any UI to show you what GeoJSON data looks like on a map either. The only option you have is a third-party tool which might display data on a map, or Azure Cosmos DB Jupyter Notebooks.

I want to run a query in Azure Cosmos DB and see the results on a map. I decided to create a simple UI which displays spatial data on a map. I will show you how to do this step by step. I will use LeafLetJs as a map. It is open source and free! Also, I need to create a .NET Core 3.1 web application and use the Azure Cosmos DB Emulator for data.

Hasan walks us through the demo and promises to put the code in GitHub later.
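For a taste of what such a query looks like, here is a minimal sketch in the Cosmos DB SQL API (the container shape and property names are hypothetical):

-- Minimal sketch: documents within 3 km of a point. ST_DISTANCE works in
-- metres, and the results come back as GeoJSON for LeafletJS to draw.
SELECT c.id, c.location
FROM c
WHERE ST_DISTANCE(c.location,
                  { "type": "Point", "coordinates": [-122.12, 47.66] }) < 3000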

The Value of Negative Identity Values

Randolph West explains why you might start at MIN(INT) for an identity integer column:

A quick(er) post this week, in response to Greg Low’s blog post from a few weeks ago titled “Don’t start identity columns or sequences with large negative values.”

Greg writes that you shouldn’t use large negative values in a table, because… it’s hard to read them, I guess? And also they don’t compress well.

I disagree … to a degree. Dang, words are hard. Anyway, when I design a table I create what’s called a surrogate key as my primary key, which is a value that is intended for the table to uniquely identify a row so that it participates in relational activities like joins and foreign keys in an efficient way. In other words the identity column is not for me, it’s for the database engine. I don’t need to worry about what row a value has. I choose the data type for that identity column based on the estimated number of rows, not whether I can memorize that a [StatusID] of 5 means something. Magic numbers are bad, mmmkay?

I don’t mind using negative values, especially for things like queue tables where the rows are ephemeral. The identity values may be harder to read, but as Randolph points out, in those types of cases, you aren’t really reading the values anyhow.
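If you want to try it, seeding at MIN(INT) is a one-liner in the table definition; a minimal sketch with a hypothetical queue table:

-- Minimal sketch: start the surrogate key at the smallest INT value,
-- roughly doubling the keyspace available to a 4-byte identity column.
CREATE TABLE dbo.WorkQueue
(
    WorkQueueID int IDENTITY(-2147483648, 1) NOT NULL
        CONSTRAINT PK_WorkQueue PRIMARY KEY CLUSTERED,
    Payload nvarchar(4000) NOT NULL
);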

XML and JSON Creation in SQL Server vs Elsewhere

Lukas Eder walks us through converting result sets to XML and JSON using different platforms:

SQL Server supports transforming flat tabular SQL result sets into hierarchical structures by convention using the convenient FOR XML or FOR JSON syntaxes. This is really convenient and less verbose than the standard SQL/XML or SQL/JSON APIs – although the standard ones are more powerful.

In this blog post, I’d like to show a few core features of the SQL Server syntax, and what they correspond to in standard SQL. jOOQ 3.14 will support both SQL Server’s syntax and the standard syntax, and will be able to translate from one to the other, such that you can use SQL Server syntax also on Db2, MariaDB, MySQL, Oracle, PostgreSQL. You can play around with the current state of development on our website here.

Click through for a series of examples.
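For reference, the two SQL Server conveniences under discussion look like this (a minimal sketch; the table is hypothetical):

-- FOR JSON PATH turns the flat result set into a JSON array of objects.
SELECT o.OrderID, o.CustomerID
FROM dbo.Orders AS o
FOR JSON PATH, ROOT('orders');

-- FOR XML PATH does the same for XML, emitting one <order> element per row.
SELECT o.OrderID, o.CustomerID
FROM dbo.Orders AS o
FOR XML PATH('order'), ROOT('orders');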

Filtered Indexes and the Optimizer

Paul White covers a couple of issues around filtered indexes:

This is a many-to-many merge join, where the execution engine must keep track of duplicates from the outer input in a worktable, and rewind as necessary. Duplicates? We are scanning a unique index!

It turns out the optimizer does not know that a filtered unique index produces unique values. This is a one-to-one join, but the optimizer costs it as if it were many-to-many. The higher estimated cost of a many-to-many merge join explains why a hash join plan is chosen.

Read the whole thing.
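For context, a minimal sketch of the kind of index in play (not Paul’s exact repro):

-- Minimal sketch: a filtered unique index. It is provably unique over the
-- rows it covers, but the optimizer does not carry that guarantee into
-- merge join costing, so it assumes many-to-many and may pick a hash join.
CREATE TABLE dbo.Tickets
(
    TicketID int NOT NULL,
    IsOpen   bit NOT NULL
);

CREATE UNIQUE NONCLUSTERED INDEX UQ_Tickets_Open
    ON dbo.Tickets (TicketID)
    WHERE IsOpen = 1;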

WAITFOR TIME Instead of DELAY

Denis Gobo points out that there is a second option when using WAITFOR:

I was looking at some code I wrote the other day and noticed the WAITFOR command. This got me thinking. How many times have I used WAITFOR in code? Probably as much as I have used NTILE 🙂

I looked at the documentation for WAITFOR and noticed there is TIME in addition to DELAY. Oh, that is handy; I always rolled my own ghetto-style version by calculating how long it would be until a specific time, and then I would use that in the WAITFOR DELAY command.

Read on to see why you might use WAITFOR and how to use each option. I use it a bit more frequently than Denis, though I still need to run a test each time to get the syntax right…
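For the record, the two forms side by side:

-- DELAY waits for a duration; TIME waits until a specific clock time.
WAITFOR DELAY '00:00:05';   -- pause this batch for five seconds
WAITFOR TIME  '23:30:00';   -- pause until the next occurrence of 23:30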

Dynamic File Pruning on Delta Lake

Ali Afroozeh, et al., take us through Dynamic File Pruning in Databricks Runtime 6.1:

In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible. This can be achieved because Delta Lake automatically collects metadata about data files managed by Delta Lake and so, data can be skipped without data file access. Prior to Dynamic File Pruning, file pruning only took place when queries contained a literal value in the predicate but now this works for both literal filters as well as join filters. This means that Dynamic File Pruning now allows star schema queries to take advantage of data skipping at file granularity.

There are some interesting performance results here. I’d also be curious to see how robust the results are as queries get more complicated.
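To make the literal-versus-join-filter distinction concrete, a minimal sketch against a hypothetical star schema:

-- Before Dynamic File Pruning, only a literal predicate on the fact table
-- could skip files:
SELECT SUM(f.sales_amount) FROM fact_sales AS f WHERE f.item_id = 42;

-- With Dynamic File Pruning, a filter arriving through the join can prune
-- fact-table files as well:
SELECT SUM(f.sales_amount)
FROM fact_sales AS f
JOIN dim_item AS i ON f.item_id = i.item_id
WHERE i.category = 'Bikes';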
