Press "Enter" to skip to content

Day: January 30, 2020

What’s New with Delta Lake

Denny Lee and Tathagata Das announce Delta Lake 0.5.0:

With the following pull requests, you can now run even more Delta Lake operations concurrently. With finer grain conflict detection, these updates make it easier to run complex workflows on Delta tables such as:

– Running deletes (e.g. for GDPR compliance) concurrently on older partitions while newer partitions are being appended.
– Running file compactions concurrently with appends.
– Running updates and merges concurrently on disjoint sets of partitions.

Click through for the full changelog.

Comments closed

Cloudera and R

Ian Cook shows us how Cloudera works for R users:

Our customers love it when they can use familiar syntax to work with data regardless of its size or its source. The popularity of sparklyr is a case in point: it enables R users to use either SQL or dplyr—both familiar to most R users—to work with large-scale data using Apache Spark. Two R packages developed at Cloudera—implyr and tidyquery—aim to provide this same choice of either SQL or dplyr when querying tables with Apache Impala and when manipulating R data frames.

The implyr package is new to me, but looks interesting.

Comments closed

The Cost of Sorting in Stored Procedures

Monica Rathbun wants us to think about whether we really need that ORDER BY clause:

We know that sorting can be one of the most expensive things in an execution plan as shown below. However,  we continue to do ORDER BYs repeatedly. Yes, I 100% agree that there is a need to sort a results set and that this should be done in the procedure for good reason, but my concern is having  multiple sorts, erroneous sorts, and the sorts that can be done elsewhere. These are the ones that waste resources and can stifle performance.

Click through for a demo showing that this does make a difference.

Comments closed

Chaos Engineering with SQL Server

Andrew Pruski is excited about Chaos Engineering:

Chaos Engineering is a practice in which we run experiments against a system to see if it reacts the way we expect it to when it encounters a failure.

We’re not trying to break things here…Chaos Engineering is not breaking things in production.

If I said to my boss that we’re now going to be experiencing an increased amount of outages because, “I AM AN ENGINEER OF CHAOS”, I’d be marched out the front door pretty quickly.

On the plus side, we will know Andrew’s supervillain origin story.


What Uses tempdb?

Jason Hall takes us through several SQL Server features which use tempdb behind the scenes:

Since SQL Server 2005, triggers use the version store, which is kept in tempdb. So, if you’re using triggers, they are implicitly using tempdb. Remember that triggers are set-based, and you’ll get version data for every affected row. For really big batches, this can turn out to be quite a bit of version data.

Click through for several more examples.

Comments closed

An Intro to Power BI Premium

Gilbert Quevauvilliers gives us an overview of what Power BI Premium is and what you need to know before using it:

A great place to start is to first explain “What is Power BI Premium?”

At the very basic level Power BI Premium allows you to buy dedicated capacity with additional features. 

Read on to see what that means, what you get with Premium, and how it can benefit your organization. Gilbert also covers the pricing model, which is important because this isn’t cheap.

Comments closed

A Script for Peer Reviews of Code

Paul Andrew shares with us a script which is useful for peer reviewing code before check-in:

Does the code include good comments? Things that explain the reason why logic has been implemented to assist future developers looking at the same code.

All to often I see code comments written that just translate from code to English and tell me what is happening. What should be fairly obvious to the reviewer as they read the code. Why is so much more important.

It’s a 20-point checklist, but worth reviewing and adapting for your own purposes.

Comments closed

Ghost Records and Availability Groups

Aaron Bertrand ran into an interesting problem:

I recently came across an issue in a SQL Server Availability Group scenario where queries against a heavily-used queue table were taking longer and longer over time. The symptoms of the query were that logical reads were increasing rapidly, and we ultimately tracked it down to ghost records (and version ghost records) that were being created as rows from the queue table were consumed and deleted. Because the database was being used in a readable secondary, the ghost cleanup process simply wasn’t able to keep up with the volume of deletes against the table. The first workaround implemented was to suspend the queue consumers and rebuild the table.

Read on to see how Aaron used filtered indexes and soft deletes to mitigate some of the pain.

Comments closed