Author: Kevin Feasel

Automating Historical Partition Processing in PBI Per User

Gilbert Quevauvilliers runs into a timing issue:

I recently had a big challenge with one of my customers where, due to the sheer volume of data and network connectivity speed, I was hitting the 5-hour limit for processing data into my Premium Per User dataset.

My solution was to change the partitions from monthly to daily, and then, once all the daily partitions were processed, merge them back into monthly partitions.

The challenge I had was I now had to process daily partitions from 2019-01-01 to 2021-11-30. This was a LOT of partitions and I had to find a way to automate the processing of partitions.

Not only that, but I had to ensure that I did not overload the source system too!

Read on to see what Gilbert did to solve this problem.

Comments closed

Determining the Right Batch Size for Deletes

Jess Pomfret breaks out the lab coat and safety goggles:

I found myself needing to clear out a large amount of data from a table this week as part of a clean up job.  In order to avoid the transaction log catching fire from a long running, massive delete, I wrote the following T-SQL to chunk through the rows that needed to be deleted in batches. The question is though, what’s the optimal batch size?

I usually go with a rule of thumb: 1K for wide tables (in terms of columns and row size) or when there are foreign key constraints, 10K for medium-width tables, and about 25K for narrow tables. But if this is an operation you run frequently, it’s worth experimenting a bit.
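Jess’s post has the T-SQL she actually used; as a rough illustration of the pattern (not her exact code), here is a minimal sketch of a chunked delete loop, with a hypothetical dbo.AuditLog table, LogDate column, and one-year cutoff standing in for the real ones:

-- Minimal sketch of a batched delete; table, column, and cutoff are hypothetical.
DECLARE @BatchSize int = 10000;   -- tune this per the rule of thumb above
DECLARE @RowsDeleted int = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.AuditLog
    WHERE LogDate < DATEADD(YEAR, -1, GETDATE());

    SET @RowsDeleted = @@ROWCOUNT;   -- the loop ends once a pass deletes nothing
END;

Timing a few runs at different values of @BatchSize against a non-production copy is a quick way to find the sweet spot for your own hardware and workload.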

Comments closed

Tracking SQL Server Uptime

Garry Bargsley has a cmdlet for us:

This week’s blog post will help you check your SQL Servers’ uptime. There are numerous reasons I can think of that you would want to know how long your SQL Server has been online. Was the server recently patched, did it crash and come back online, or did someone restart it by mistake? These are all valid questions about a single SQL Server or your entire estate. I will show you how you can easily and quickly check the uptime of one to many servers.

We will start by using every DBA’s favorite PowerShell module…  dbatools

Admittedly, I’d just check the start time for the tempdb database, but this cmdlet does give more info.
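If all you want is the raw start time and dbatools isn’t handy, a couple of plain T-SQL checks work as well, since tempdb is recreated at every startup and sys.dm_os_sys_info exposes the start time directly:

-- Option 1: tempdb's creation date equals the last instance start time
SELECT create_date AS sql_server_start_time
FROM sys.databases
WHERE name = N'tempdb';

-- Option 2: ask the engine directly
SELECT sqlserver_start_time
FROM sys.dm_os_sys_info;

The cmdlet route is nicer for an entire estate, though, since it handles connecting to each instance for you.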

Comments closed

Building a Pipeline for External Data Sharing

Hope Foley has data to share:

I worked with a customer recently who had a need to share CSVs for an auditing situation. They had a lot of external customers that they needed to collect CSVs from for the audit process. There were a lot of discussions happening on how best to do it, whether we’d pull data from their environment or have them push it into theirs. Folks weren’t sure on that, so I tried to come up with something that would work for both.

Read on for Hope’s solution to the problem.

Comments closed

Reasons for Partitioning in SQL Server

Erik Darling has opinions:

When I work with clients, nearly every single one has this burning question about partitioning.

“We’ve got this huge table, should we partition it?”

“Do you need to insert or delete data in big chunks?”

“No, it’s all transactional.”

“Do you have last page contention problems?”

“No, but won’t it help performance?”

“No, not unless you’re using clustered column store.”

“…”

Read on to unpack Erik’s argument. I do wish that there were more good cases for partitioning in SQL Server, but they’re almost all in the analytics space—which is part of why partitioning is a lot more useful in Azure Synapse Analytics dedicated SQL pools.
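To make the “big chunks” case concrete, here is a hedged sketch against a hypothetical dbo.Sales table partitioned by month: removing an old month becomes a metadata operation instead of a long, log-heavy delete.

-- SQL Server 2016+ syntax; the table and partition number are hypothetical.
TRUNCATE TABLE dbo.Sales
WITH (PARTITIONS (1));   -- empties only the oldest partition

On older versions, an ALTER TABLE ... SWITCH of that partition into a staging table accomplishes the same thing. If your workload never does this kind of bulk load or removal, Erik’s point stands: partitioning is unlikely to buy you much.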

Comments closed

Testing IOPS, Latency, and Throughput: an Analogy

Brent Ozar has a timely analogy for us:

You’re trying to decide whether to use DHL, FedEx, UPS, or your local postal service.

You could measure them by sending me a gift – after all, it is the holidays, and I do a lot of stuff for you year round, and it’s probably the least you could do.

– You place one box outside

– You call the shipping company to come pick it up, and start the clock

– When I get it, I’ll call you to confirm receipt, and you stop the clock

Click through for the rest of the story.

Comments closed

Using Scala at Databricks

Li Haoyi gives us a peek behind the curtain:

With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes around managing our large Scala codebase. From this post, you’ll learn about everything big and small that goes into making Scala at Databricks work, a useful case study for anyone supporting the use of Scala in a growing organization.

It’s always interesting to see how the largest companies handle certain classes of problems, and this post gives a good sense of the high-level requirements and usage involved, making it well worth the read.

Comments closed

What Good Data Governance Means

Paul Andrew shares some thoughts on the true meaning of data governance:

… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance; it was a very broad question.

To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principles and technical implications.

Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governance into the following areas. What I also find with data governance is that the more I experience in the industry, the deeper my perspective on the subject goes.

Paul has put together a lengthy answer on the topic, well worth the read.

Comments closed

Batch Mode and Window Functions

I wind down a series on window functions:

SQL Server typically operates in row mode, which means that an operator processes one row at a time. This sounds inefficient, but tends to work out pretty well in practice. However, something which may work out even better is to process more than one row at a time, especially when the number of rows gets to be fairly large. Enter batch mode.

Batch mode was introduced in SQL Server 2012 alongside non-clustered columnstore indexes. It became interesting in SQL Server 2016 and very interesting in SQL Server 2019. That’s because 2016 introduced writable nonclustered columnstore indexes and 2019 gives us batch mode outside of columnstore indexes.

There are some nice potential performance gains for queries involving window functions.
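As a hedged illustration (the table and columns here are hypothetical): on SQL Server 2019 at compatibility level 150, a running-total query like the one below can get batch mode on rowstore on its own, while on 2016/2017 a common workaround is an empty filtered nonclustered columnstore index, which makes the batch mode Window Aggregate operator available.

-- The contradictory filter keeps this index empty while still
-- letting the optimizer consider batch mode (2016/2017 trick).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sales_BatchMode
    ON dbo.Sales (CustomerID)
    WHERE CustomerID = -1 AND CustomerID = -2;

SELECT
    CustomerID,
    OrderDate,
    Amount,
    SUM(Amount) OVER (PARTITION BY CustomerID
                      ORDER BY OrderDate
                      ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM dbo.Sales;

Checking the execution plan before and after shows whether the Window Aggregate operator is running in batch mode.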

Comments closed