Curated SQL – Page 495 – A Fine Slice Of SQL Server

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data pre-processing converts the selected data into a form we can work with or feed to ML algorithms. We always need to pre-process our data so that it meets the expectations of the machine learning algorithm.

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.
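
As a rough illustration of what "a form we can feed to ML algorithms" can mean, here is a minimal sketch in plain Python of two common pre-processing steps: rescaling a numeric column and one-hot encoding a categorical one. The function names and sample data are hypothetical and are not taken from the linked article.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode categories as 0/1 indicator rows, with columns in sorted order."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical sample data
ages = [20, 30, 40]
colors = ["red", "blue", "red"]

scaled_ages = min_max_scale(ages)
color_columns, encoded_colors = one_hot(colors)
```

Libraries such as scikit-learn offer production-grade versions of both steps; the point here is only the shape of the transformation.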

The Cost of Empty Partitions in SQL Server

Published 2022-06-22 by Kevin Feasel

Aaron Bertrand reminds us that TANSTAAFL:

Not too long ago, I came across a table that had 15,000 partitions—all but 4 of them empty. I bet when you have implemented partitioning you, too, have wondered: “Why shouldn’t I create all future partitions now?”
The question is valid: wouldn’t maintenance be easier if you only had to phase out old partitions, without ever worrying about adding partitions to accommodate new data?
Here’s the thing. Microsoft will never tell you this, but empty partitions are not free. I don’t mean you will get an invoice for creating too many, but you will pay for them in other ways.

I think the reason people do this sort of thing is that partition management is harder than it really needs to be in SQL Server. Adopting the partitioning process for dedicated SQL pools in Azure Synapse Analytics would be a good start. Beyond that, partitions are almost always date or numeric intervals (often pseudo-date numbers like 20220601 to represent June 1, 2022), so let people create partitions based on the key and have the system manage them. I’m being vague and hand-wavy and not talking about all of the edge cases (like if some dope puts in a date of 22220601 instead of 20220601), but I think something like PARTITION RANGE RIGHT ON [DateWK] INTERVAL MONTHLY WITH MAX_FUTURE_PARTITIONS=2, MAX_MAINTAINED_PARTITIONS=48, PARTITION_ARCHIVAL_TABLE=History.MyTable would cut to the core of what partition management does without writing thousand-line scripts full of dynamic SQL trying to manage these things.
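
To make the wish concrete, here is a small Python sketch of the boundary list such a feature might compute for monthly partitions on a pseudo-date key. It is not a SQL Server feature or API: the function names are hypothetical, and the parameters simply mirror the made-up MAX_FUTURE_PARTITIONS and MAX_MAINTAINED_PARTITIONS options above, under the assumption that the maintained window includes the future partitions.

```python
from datetime import date


def month_start(year, month, offset):
    """Return the first day of the month `offset` months away from (year, month)."""
    index = year * 12 + (month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def monthly_boundaries(today, max_future=2, max_maintained=48):
    """List RANGE RIGHT boundary keys (YYYYMMDD integers), oldest first.

    The window ends `max_future` months after the current month and holds
    `max_maintained` boundaries in total (an assumption of this sketch).
    """
    first_offset = max_future - max_maintained + 1
    months = [
        month_start(today.year, today.month, offset)
        for offset in range(first_offset, max_future + 1)
    ]
    return [d.year * 10000 + d.month * 100 + d.day for d in months]


# With a four-month window as of 2022-06-22, the boundaries run May to August.
boundaries = monthly_boundaries(date(2022, 6, 22), max_future=2, max_maintained=4)
```

Anything older than the first boundary would be a candidate for the archival table, and nothing is created beyond the two future months, which avoids the thousands of empty partitions Aaron describes.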

Distributed Replay Deprecated in SQL Server 2022

Published 2022-06-22 by Kevin Feasel

Brent Ozar starts the wake:

For SQL Server 2022, Microsoft deprecated Distributed Replay.
The idea behind the feature was that you’d capture a trace against your production environment, set up another environment for load testing or QA testing, and then replay that exact same workload against it. You’d be able to measure which queries got better or worse, and how.
The reality was a complete mess. It was a giant pain in the rear to set up and use, to the point where I got frustrated with it within a few hours and asked my peers about their experiences with it. I got back a string of four-letter words – everybody really struggled to get it across the finish line. Over subsequent versions, Microsoft made token efforts to improve it, but never really gave it the love it required.

Yep, I can concur. What we wanted was a simple button-click (or easy-to-navigate UI) that let you capture “What does a real production workload look like?” and then the ability to re-run it elsewhere, like on new hardware. What we got was indeed a mess.

I don’t fully agree with Brent’s argument that the right answer is to build app-level testing. If everything was architected and developed for this, then yeah, that might be a better answer. But unless you’ve built all relevant applications around APIs (so they can be programmatically invoked rather than trying to do everything via Selenium) and have put in the legwork necessary to track and re-run calls, I think you end up with an even bigger mess—especially if there are multiple applications working with the same database. I do agree that this is a hard problem regardless of the path you choose.

Exporting Power BI Row-Level Security Details

Published 2022-06-22 by Kevin Feasel

Gilbert Quevauvilliers needs a report, stat!:

In a previous tweet on twitter, I had elaborated on how I had extracted the RLS Roles with the details and then exported it into a CSV file which then allowed the organization to keep an audit of the RLS for the dataset.
In the steps below I will show you how I did this.
In my previous blog post I explained how to export data from a Power BI report to a CSV file here: Exporting a Power BI Visual data to a CSV File in SharePoint

Read on to see how, as well as a few notes on what it takes to get this report.

DOP Feedback in SQL Server 2022

Published 2022-06-22 by Kevin Feasel

Erik Darling talks about a potentially exciting feature:

I’m not going to demo DOP feedback in this post, I’m just going to show you the situation that it hopes to improve upon.
To do that, I’m going to run a simple aggregation query at different degrees of parallelism, and show you the changes in query timing.

Figuring out where that elbow is (in other words, when you move from approximately-linear gains to sub-linear gains) can be extremely helpful. Of course, this is like solving a partial equilibrium problem: it’s part of the problem but there’s a whole separate general equilibrium problem from there—what’s the best number of cores for this query with the constraint that I have all of these other queries running on a busy server? But before I make it seem like I’m minimizing the value of this, the partial answer will, in many circumstances, be good enough.
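
One simple way to locate that elbow from a set of timings is to compare each step's actual speedup with the ideal linear speedup. The Python sketch below does this; the function name, the 75% threshold, and the timing numbers are all hypothetical and do not come from Erik's post or from how DOP feedback works internally.

```python
def find_elbow(timings, min_gain=0.75):
    """Return the highest DOP reached before gains turn clearly sub-linear.

    `timings` maps DOP -> elapsed seconds. Moving from DOP a to DOP b is
    ideally (b / a) times faster; a step counts as near-linear when it
    achieves at least `min_gain` of that ideal.
    """
    dops = sorted(timings)
    elbow = dops[0]
    for prev, cur in zip(dops, dops[1:]):
        ideal = cur / prev
        actual = timings[prev] / timings[cur]
        if actual < min_gain * ideal:
            break  # this step fell short, so the previous DOP is the elbow
        elbow = cur
    return elbow


# Hypothetical elapsed times (seconds) for one query at each DOP
timings = {1: 64.0, 2: 33.0, 4: 18.0, 8: 14.0, 16: 13.0}
elbow = find_elbow(timings)
```

With these made-up numbers, the step from DOP 4 to DOP 8 delivers well under the required share of the ideal 2x speedup, so the function reports DOP 4. This only answers the partial equilibrium question for a single query in isolation.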

Operating Power BI Desktop as a B2B User

Published 2022-06-22 by Kevin Feasel

Meagan Longoria shares some notes:

I noticed Adam Saxton post a tip on the Guy in a Cube YouTube channel about publishing reports from Power BI Desktop for external users. According to Microsoft Docs (as of June 21, 2022), you can’t publish directly from Power BI Desktop to an external tenant. But Adam shows how that is now possible thanks to an update in Azure Active Directory.

Click through for the sign-in process as well as what you can do and the pitfalls you might run into along the way.

Lack of Fun with Scalar Functions

Published 2022-06-22 by Kevin Feasel

Tom Zika takes away the scalars:

I’m still surprised many people don’t realise how lousy Scalar functions are. So because it’s my current focus in work and this Stack Overflow question, I’ll be revisiting this topic.
The focus of part one is parallelism. Unfortunately, parallelism often gets a bad rep because of the prominent wait stats. Also, if there is a skew, it can run slow. But for the most part, it’s advantageous.
Whether or not you want parallelism should be an informed choice. But Scalar functions will force the query to run serially, even if you are unaware of it. That’s why I want to shine a light on this.

Read on for a demo of how even a no-op scalar function can affect query performance. Given the mess we normally see in scalar functions, it’s all downhill from there.

Delta Live Tables and Power BI Data Modeling

Published 2022-06-21 by Kevin Feasel

Tahir Fayyaz goes from Delta Lake to Power BI:

To get the optimal performance from Power BI it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality as you need to consider the following:
– How to easily develop and manage the data model’s transformation code.
– How to run and scale data pipelines for the model as data volumes grow.
– How to keep all the Delta Lake tables updated as new data arrives.
– How to view the lineage for all tables as the model gets more complex.
– How to actively stop data quality issues that result in incorrect reports.

Read on for recommendations, a couple of architectural diagrams, and some sample code.

The top Operator in KQL

Published 2022-06-21 by Kevin Feasel

Robert Cain has top men working on this. Top. Men:

Top 10 lists are all the rage on the internet. Everywhere you look you see “Top 10 Cute Kitten Videos” or “Top 10 Pluralsight Videos by ArcaneCode”.
KQL includes a top operator so you can generate your own top lists. Even better, you are not limited to just ten items either.

Read on to see how you can use the top operator in KQL.
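
For readers who think in general-purpose languages, the top operator behaves like "sort by a column, then keep the first N rows." Here is a rough Python analogue, not KQL itself; the helper name and the sample rows are hypothetical.

```python
import heapq


def top(rows, n, by, desc=True):
    """Return the n rows with the largest (or smallest) value in column `by`."""
    pick = heapq.nlargest if desc else heapq.nsmallest
    return pick(n, rows, key=lambda row: row[by])


# Hypothetical sample data
videos = [
    {"title": "Kittens", "views": 900},
    {"title": "Puppies", "views": 750},
    {"title": "KQL Basics", "views": 1200},
    {"title": "Goats", "views": 300},
]

top_two = top(videos, 2, by="views")
bottom_one = top(videos, 1, by="views", desc=False)
```

As in KQL, N is not limited to ten, and descending order is the default in this sketch.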

Constructing JSON Objects in SQL Server

Published 2022-06-21 by Kevin Feasel

Hasan Savran checks out a couple of functions new to SQL Server 2022:

JSON functions were introduced to SQL Server in version 2016. Saving JSON documents and retrieving documents using JSON functions brings many possibilities to SQL Server. It is great to see that Microsoft continues to add different functions to the original JSON function set.
Today, I will explain two new JSON functions which are available in SQL Server 2022 and Azure SQL Database.

Read on to learn more about these functions.
