
Curated SQL Posts

Snapshot Creation in Azure Data Studio

Dave Bland checks out an extension to Azure Data Studio to manage snapshots:

Like many Azure Data Studio extensions, DB Snapshot Creator is designed to bring functionality into ADS that is not present by default.  This extension was developed by Sean Price. As the name suggests, this extension can be used to easily create database snapshots.  Before going too deep into this extension, let’s take a quick moment to go over what a snapshot is.

Back in the day, I created a WPF tool for a company to manage snapshots for manual testing: take a snapshot, perform whatever destructive testing you needed to do, and revert back to a known good state. In a world with good CI/CD tooling and Docker containers, that’s not nearly as important anymore, but sometimes you just need to run a quick test, so I’m glad the functionality is still around.

Comments closed

Removing an Extra Transaction Log File

Jeff Iannucci shows how to remove an unwanted guest from your database:

True, there’s no advantage to having more than one log file, but sometimes that one file grows suddenly and fills up the drive in the middle of a transaction and you’re stuck with those dreaded “THE DATABASE IS DOWN!!!” tickets until that transaction finishes. So, in the heat of the moment, you hit the panic button and create ANOTHER log file on a different drive.

Then, minutes, hours, or even weeks later, you want to put the universe back in order by resizing the original log file and removing the extra one. But what if you find you can’t remove that extra one, no matter what you try to do?

This is a legitimate case. Hopefully you plan ahead and never hit it, but stuff happens.


Making a Better Pie Chart

Elizabeth Ricks tries the impossible:

A friend called me recently and started our conversation with: “I know you dislike pie charts, but…can you help me create one?”

Spoiler alert: I don’t hate pie charts. They’ve received a bad rap over the years and with good reason: they are very commonly used when another chart type would be better suited. The appropriate use case for a pie chart is expressing a part-to-whole relationship. Their limitation is that it can be difficult to accurately judge the relative sizes of the segments and compare them. Here are some related articles on our blog: the great pie debate and an updated post on pies.

Elizabeth does put together the best possible case, but I’m still in favor of burning pie charts to the ground.


Hive: Shuffle Failed with Too Many Fetch Failures

Dmitry Tolpeko takes us through an ugly error:

On one of the clusters I noticed an increased rate of shuffle errors, and restarting the job did not help; it still failed with the same error.

The error was as follows:

Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal (Shuffle.java:301)

Caused by: java.io.IOException: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
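When comparing many failed tasks, it can help to pull the `key=value` diagnostics out of that message. Here is a minimal sketch, assuming the message keeps the shape shown above; the function name is hypothetical and this is not part of Tez or Hive.

```python
import re


def parse_shuffle_diagnostics(message: str) -> dict:
    """Extract key=value diagnostics from a shuffle failure message.

    Values that look like booleans or integers are converted;
    anything else is kept as a string.
    """
    diagnostics = {}
    for key, raw in re.findall(r"(\w+)=(\w+)", message):
        if raw in ("true", "false"):
            diagnostics[key] = raw == "true"
        elif raw.isdigit():
            diagnostics[key] = int(raw)
        else:
            diagnostics[key] = raw
    return diagnostics


if __name__ == "__main__":
    message = (
        "java.io.IOException: Shuffle failed with too many fetch failures "
        "and insufficient progress!failureCounts=1, pendingInputs=1, "
        "fetcherHealthy=false, reducerProgressedEnough=true, "
        "reducerStalled=true"
    )
    for key, value in parse_shuffle_diagnostics(message).items():
        print(f"{key}: {value}")
```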

Click through to understand what this error means and what you can do about it.


Distinct Counts in Power Query

Reza Rad shows how you can get a distinct count in Power Query:

You can have a distinct count calculation in multiple places in Power BI: through DAX code, using the Visual’s aggregation on a field, or even in Power Query. If you are doing the distinct count in Power Query as part of a group by operation, however, the existing distinct count is for all columns in the table, not for a particular column. In this article, I’ll show you a method you can use to get the distinct count of a particular column through the Group By transformation in the Power Query component of Power BI.
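To make the distinction concrete, here is a small sketch in Python (not Power Query M, and not Reza's method) contrasting the two calculations: counting distinct rows per group versus counting distinct values of one column per group. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def distinct_rows_by_group(rows, group_key):
    """Count distinct whole rows per group (all columns considered)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_key]].add(tuple(sorted(row.items())))
    return {group: len(values) for group, values in seen.items()}


def distinct_column_by_group(rows, group_key, count_key):
    """Count distinct values of a single column per group."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_key]].add(row[count_key])
    return {group: len(values) for group, values in seen.items()}


if __name__ == "__main__":
    sales = [
        {"region": "East", "customer": "A", "product": "X"},
        {"region": "East", "customer": "A", "product": "Y"},
        {"region": "East", "customer": "B", "product": "X"},
        {"region": "West", "customer": "C", "product": "X"},
        {"region": "West", "customer": "C", "product": "X"},
    ]
    print(distinct_rows_by_group(sales, "region"))
    print(distinct_column_by_group(sales, "region", "customer"))
```

In the sample data, East has three distinct rows but only two distinct customers, which is the gap the article addresses.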

Click through to learn how.


Corruption and Secondary Databases

Paul Randal shares some wisdom on corruption:

We’ve had a few new clients come to us recently after experiencing corruption, and they’ve been worried about whether physical corruption can propagate to secondary databases (like an availability group secondary or log shipping secondary) through the mechanism used to maintain the secondary database in sync with the primary database. I explained how it’s very rare for that to happen, but sometimes it does, and not in a way you’d think. Read on…

I don’t even have to ask you to read on; Paul has even done that. And do read the comments as well.


Preventing Query Timeouts with Power BI Incremental Refresh

Gilbert Quevauvilliers shows how to set the default timeout for a query against SQL Server from Power BI:

This was because on the first refresh it has to process all the data before it can incrementally refresh the dataset.

As per the documentation, the default timeout for a SQL Server database is set to 10 minutes, and when I am processing a lot of data it can easily take longer than 10 minutes to return all the data.

Read on to see how you can change that if you need to.


Goodbye, MCSE

John Deardurff helps break the news:

Major Announcement from Microsoft Learning today. As Microsoft continues to invest in role-based learning offerings, the Microsoft Certified Solutions Associate (MCSA), Microsoft Certified Solutions Developer (MCSD), and Microsoft Certified Solutions Expert (MCSE) certifications will be phased out with a final retirement date of June 30th, 2020. Find the entire list of retired certifications here.

On the plus side, at least people who hold the next iteration of the MCSE won’t be confused with people who worked with NT4 anymore…


Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.
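As a rough sketch of what "provide a source directory path and start a streaming job" looks like in PySpark, the helper below wires up a `cloudFiles` reader. It assumes a Databricks runtime where the `cloudFiles` source is available and an existing `spark` session passed in by the caller; the function names, the JSON default, and the explicit schema argument are assumptions for illustration, not code from the article.

```python
def auto_loader_options(file_format: str, include_existing: bool = True) -> dict:
    """Build reader options for the cloudFiles source.

    cloudFiles.includeExistingFiles controls whether files already in
    the directory are processed along with newly arriving ones.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.includeExistingFiles": "true" if include_existing else "false",
    }


def read_with_auto_loader(spark, source_dir, schema,
                          file_format="json", include_existing=True):
    """Return a streaming DataFrame over files landing in source_dir.

    `spark` is an existing SparkSession on a runtime that ships the
    cloudFiles source (an assumption; this will not work on plain
    open-source Spark).
    """
    reader = spark.readStream.format("cloudFiles").schema(schema)
    for key, value in auto_loader_options(file_format, include_existing).items():
        reader = reader.option(key, value)
    return reader.load(source_dir)
```

The returned streaming DataFrame would then be written out with `writeStream`, typically to a Delta table with a checkpoint location.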

This does look interesting.


How Apache Beam Runs on Top of Apache Flink

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.
