Press "Enter" to skip to content

Day: June 22, 2022

Improving Join Performance with Skewed Datasets in Spark

Ajay Gupta gets into the topic of join performance:

Performing Joins on Skewed Datasets: A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys. For example, when 80% of records in the dataset contribute to only 20% of the Join keys.

Implications of Skewed Datasets for Join: Skewed Datasets, if not handled appropriately, can lead to stragglers in the Join stage (Read this linked story to know more about Stragglers). This brings down the overall execution efficiency of the Spark job. Also, skewed datasets can cause memory overruns on certain executors leading to the failure of the Spark job. Therefore, it is important to identify and address Join-based stages where large skewed datasets are involved.

Read on for five techniques which may help you out.
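As a companion to those techniques (and not necessarily one of the five in the linked post), here is a minimal sketch of letting Spark 3's Adaptive Query Execution split skewed partitions at join time. The table and column names are hypothetical, and the threshold values are only there to show which knobs exist:

-- Sketch only: Spark SQL settings for AQE skew-join handling (Spark 3.x).
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
-- A partition counts as skewed when it is this many times larger than the
-- median partition size...
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5;
-- ...and also larger than this absolute threshold.
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB;

-- Hypothetical join that would otherwise straggle on a handful of hot keys.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;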

Comments closed

Example Data Pre-Processing Activities

Aayush Srivastava takes us through some pre-processing activities in machine learning:

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing will convert the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it matches the expectations of the machine learning algorithm.

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.
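For a flavor of what those steps look like, here is a small, purely illustrative sketch of two common ones (imputing missing values and rescaling a numeric feature), written in SQL only because that is the lingua franca around here; the table and columns are made up, and in practice you would normally do this inside the ML pipeline itself:

-- Hypothetical example: basic imputation, scaling, and dummy encoding.
SELECT
    CustomerID,
    -- impute missing ages with the overall average
    COALESCE(Age, AVG(Age) OVER ()) AS AgeImputed,
    -- min-max scale income into the [0, 1] range so features share a common scale
    (Income - MIN(Income) OVER ()) * 1.0
        / NULLIF(MAX(Income) OVER () - MIN(Income) OVER (), 0) AS IncomeScaled,
    -- turn a categorical column into simple indicator (dummy) columns
    CASE WHEN Region = 'North' THEN 1 ELSE 0 END AS IsNorth,
    CASE WHEN Region = 'South' THEN 1 ELSE 0 END AS IsSouth
FROM dbo.Customers;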

Comments closed

The Cost of Empty Partitions in SQL Server

Aaron Bertrand reminds us that TANSTAAFL:

Not too long ago, I came across a table that had 15,000 partitions—all but 4 of them empty. I bet when you have implemented partitioning you, too, have wondered: “Why shouldn’t I create all future partitions now?”

The question is valid: wouldn’t maintenance be easier if you only had to phase out old partitions, without ever worrying about adding partitions to accommodate new data?

Here’s the thing. Microsoft will never tell you this, but empty partitions are not free. I don’t mean you will get an invoice for creating too many, but you will pay for them in other ways.

I think the reason people do this sort of thing is that partition management is harder than it really needs to be in SQL Server. Adopting the partitioning process from dedicated SQL pools in Azure Synapse Analytics would be a good start. Beyond that, consider that partitions are almost always date or numeric intervals (often pseudo-date numbers like 20220601 to represent June 1, 2022), so let people create partitions based on the key and have the system manage them. I'm being vague and hand-wavey and not talking about all of the edge cases (like if some dope puts in a date of 22220601 instead of 20220601), but I think something like PARTITION RANGE RIGHT ON [DateWK] INTERVAL MONTHLY WITH MAX_FUTURE_PARTITIONS=2, MAX_MAINTAINED_PARTITIONS=48, PARTITION_ARCHIVAL_TABLE=History.MyTable would cut to the core of what partition management does without writing thousand-line scripts full of dynamic SQL trying to manage these things.
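In the meantime, it is at least easy to see whether you have this problem today. This is a generic sketch against a hypothetical table name, not something from Aaron's post:

-- Rows per partition of the heap or clustered index; empty ones stand out.
SELECT p.partition_number,
       p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.MyTable')
  AND p.index_id IN (0, 1)
ORDER BY p.partition_number;

-- Or just count how many partitions hold no rows at all.
SELECT COUNT(*) AS empty_partitions
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.MyTable')
  AND p.index_id IN (0, 1)
  AND p.rows = 0;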

Comments closed

Distributed Replay Deprecated in SQL Server 2022

Brent Ozar starts the wake:

For SQL Server 2022, Microsoft deprecated Distributed Replay.

The idea behind the feature was that you’d capture a trace against your production environment, set up another environment for load testing or QA testing, and then replay that exact same workload against it. You’d be able to measure which queries got better or worse, and how.

The reality was a complete mess. It was a giant pain in the rear to set up and use, to the point where I got frustrated with it within a few hours and asked my peers about their experiences with it. I got back a string of four-letter words – everybody really struggled to get it across the finish line. Over subsequent versions, Microsoft made token efforts to improve it, but never really gave it the love it required.

Yep, I can concur. What we wanted was a simple button-click (or easy-to-navigate UI) that let you capture “What does a real production workload look like?” and then the ability to re-run it elsewhere, like on new hardware. What we got was indeed a mess.

I don’t fully agree with Brent’s argument that the right answer is to build app-level testing. If everything was architected and developed for this, then yeah, that might be a better answer. But unless you’ve built all relevant applications around APIs (so they can be programmatically invoked rather than trying to do everything via Selenium) and have put in the legwork necessary to track and re-run calls, I think you end up with an even bigger mess—especially if there are multiple applications working with the same database. I do agree that this is a hard problem regardless of the path you choose.

Comments closed

Exporting Power BI Row-Level Security Details

Gilbert Quevauvilliers needs a report, stat!:

In a previous tweet, I elaborated on how I extracted the RLS Roles with their details and then exported them into a CSV file, which allowed the organization to keep an audit of the RLS for the dataset.

In the steps below I will show you how I did this.

In my previous blog post I explained how to export data from a Power BI report to a CSV file here: Exporting a Power BI Visual data to a CSV File in SharePoint

Read on to see how, as well as a few notes on what it takes to get this report.

Comments closed

DOP Feedback in SQL Server 2022

Erik Darling talks about a potentially exciting feature:

I’m not going to demo DOP feedback in this post, I’m just going to show you the situation that it hopes to improve upon.

To do that, I’m going to run a simple aggregation query at different degrees of parallelism, and show you the changes in query timing.

Figuring out where that elbow is (in other words, when you move from approximately-linear gains to sub-linear gains) can be extremely helpful. Of course, this is like solving a partial equilibrium problem: it’s part of the problem but there’s a whole separate general equilibrium problem from there—what’s the best number of cores for this query with the constraint that I have all of these other queries running on a busy server? But before I make it seem like I’m minimizing the value of this, the partial answer will, in many circumstances, be good enough.
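If you want to see that elbow on your own hardware, the experiment is easy to reproduce in spirit: run the same aggregation at a few fixed degrees of parallelism and compare the timings. The table and column below are hypothetical stand-ins, not Erik's demo:

SET STATISTICS TIME ON;

SELECT COUNT_BIG(*) AS row_count, SUM(Amount) AS total_amount
FROM dbo.BigTable
OPTION (MAXDOP 1);   -- serial baseline

SELECT COUNT_BIG(*) AS row_count, SUM(Amount) AS total_amount
FROM dbo.BigTable
OPTION (MAXDOP 4);   -- usually a large improvement over serial

SELECT COUNT_BIG(*) AS row_count, SUM(Amount) AS total_amount
FROM dbo.BigTable
OPTION (MAXDOP 8);   -- gains often start flattening out around here

SET STATISTICS TIME OFF;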

Comments closed

Operating Power BI Desktop as a B2B User

Meagan Longoria shares some notes:

I noticed Adam Saxton post a tip on the Guy in a Cube YouTube channel about publishing reports from Power BI Desktop for external users. According to Microsoft Docs (as of June 21, 2022), you can’t publish directly from Power BI Desktop to an external tenant. But Adam shows how that is now possible thanks to an update in Azure Active Directory.

Click through for the sign-in process as well as what you can do and the pitfalls you might run into along the way.

Comments closed

Lack of Fun with Scalar Functions

Tom Zika takes away the scalars:

I’m still surprised many people don’t realise how lousy Scalar functions are. So because it’s my current focus at work, and because of this Stack Overflow question, I’ll be revisiting this topic.

The focus of part one is parallelism. Unfortunately, parallelism often gets a bad rep because of the prominent wait stats. Also, if there is a skew, it can run slow. But for the most part, it’s advantageous.

Whether or not you want parallelism should be an informed choice. But Scalar functions will force the query to run serially, even if you are unaware of it. That’s why I want to shine a light on this.

Read on for a demo of how even a no-op scalar function can affect query performance. Given the mess we normally see in scalar functions, it’s all downhill from there.
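If you want a quick repro of the parallelism point, a do-nothing scalar function is enough. This is a generic sketch rather than Tom's demo, and on SQL Server 2019+ you may need to keep scalar UDF inlining from kicking in (for example with the INLINE = OFF function option) to see the serial plan:

-- A scalar UDF that does no real work at all.
CREATE OR ALTER FUNCTION dbo.NoOp (@i int)
RETURNS int
AS
BEGIN
    RETURN @i;
END;
GO

-- Without the function: the optimizer is free to pick a parallel plan.
SELECT SUM(Amount)
FROM dbo.BigTable;

-- With the function: the query runs serially (pre-2019 behavior, or on
-- 2019+ whenever the UDF is not inlined).
SELECT SUM(dbo.NoOp(Amount))
FROM dbo.BigTable;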

Comments closed