Curated SQL – Page 842 – A Fine Slice Of SQL Server

Embracing the XML

Published 2021-04-06 by Kevin Feasel

While XML is, without a doubt, a giant pain in the bottom, sometimes, the best way to deal with Extended Events is to simply embrace the XML.
Now, I know, just last week, I suggested ways to avoid the XML. I will freely admit, that is my default position. If I can avoid the XML, I will certainly do it. However, there are times where just embracing the XML works out nicely. Let’s talk about it a little.

Just need to do a little victory dance here. I didn’t explicitly say “embrace the XML” but close enough…

I think the biggest problem DBAs have with XML is that they end up treating it like a dreadful task: I need to shred XML for an extended event. But to do that, I have to learn how to query it using this quasi-language, and so they get stuck trying to fuss with something somebody else did, moving symbols around in the hopes that they get the right incantation. By contrast, a day or two really focusing in on how XQuery and XPath work would clarify a lot and make the process much simpler.

There is a fair counter-point in asking how often you’ll use this, and if the answer is “probably never,” then poke through and just try to get it working. But I’ve got a bit of bad news: “probably never” is probably wrong.

Comments closed

Deploying an Azure Arc Enabled Data Services Controller

Published 2021-04-06 by Kevin Feasel

Chris Adkin continues a series:

If you have been following this series, you should have:
– a basic understanding of Terraform
– a Kubernetes cluster that you can connect to using kubectl
– a basic understanding of Kubernetes services
– a working metalLB load balancer
– a basic understanding of how storage works in the world of Kubernetes
– a Kubernetes storage solution in the form of PX Store, alternatively you can use any solution (for the purposes of this series) which supports persistent volumes, however to use the backup solution in part 9 of the series you will need to use something that supports CSI

From here, Chris explains the importance of the data controller and then deploys one.

Comments closed

Columnstore, Strings, and Windowing Functions

Published 2021-04-06 by Kevin Feasel

Erik Darling has a tale to tell:

The only columns that we were really selecting from the Comments table were UserId and CreationDate, which are an integer and a datetime.
Those are relatively easy columns to deal with, both from the perspective of reading and sorting.
In order to show you how column selection can muck things up, we need to create a more appropriate column store index, add columns to the select list, and use a where clause to restrict the number of rows we’re sorting. Otherwise, we’ll get a 16GB memory grant for every query.

Read on to see how one little (or, well, big) string column can foul up the whole works.

Comments closed

Indexing for Physical Join Operators

Published 2021-04-06 by Kevin Feasel

Deepthi Goguri continues a series on physical join operators:

In the Part1 of decoding the physical join operators, we learned about the different types of physical operators: Nested loops, Merge joins and Hash joins. We have seen when they are useful and how to take advantage of each for the performance of our queries. We have also seen when they are useful and when they needs to be avoided.
In this part, we will know more about these operators and how the indexes really help these operator to perform better so the queries can execute faster.

Read on to see how to define indexes for each of the three physical operators.

Comments closed

Ways to Insert Data into a Hive Table

Published 2021-04-05 by Kevin Feasel

The Hadoop in Real World team has ways to insert data into Hive tables:

There are several different variations and ways when it comes to inserting or loading data into Hive tables. This post will cover 3 broad ways to insert or load data into Hive tables.

Click through for those methods.

Comments closed

Displaying Metrics on Graphite Dashboards

Published 2021-04-05 by Kevin Feasel

Nick Campion takes us through working with Graphite:

Graphite is a free and open-source software. It is used as a time-series database monitoring tool, where you can collect, store and display time-series data in real-time. As you can monitor certain metrics of this data using Graphite, it has a very useful and simple dashboard used to visualize these metrics.
This article will show you how to display a metric on your Graphite dashboard.

Click through for more information.

Comments closed

Kafka Sans ZooKeeper

Published 2021-04-05 by Kevin Feasel

Ben Stopford and Ismael Juma give us a preview:

So we’re very pleased to say that the early access of the KIP-500 code has been committed to trunk and is expected to be included in the upcoming 2.8 release. For the first time, you can run Kafka without ZooKeeper. We call this the Kafka Raft Metadata mode, typically shortened to KRaft (pronounced like craft) mode.
Beware, there are some features that are not available in this early-access release. We do not yet support the use of ACLs and other security features or transactions. Also, both partition reassignment and JBOD are unsupported in KRaft mode (these are anticipated to be available in an Apache Kafka release later in the year). Hence, consider the quorum controller experimental software—we don’t advise subjecting it to production workloads. If you do try out the software, however, you’ll find a host of new advantages: It’s simpler to deploy and operate, you can run Kafka in its entirety as a single process, and it can accommodate significantly more partitions per cluster (see measurements below).

Read on for more information. This is a big deal for Kafka.

Comments closed

Describing the Physical Join Operators

Published 2021-04-05 by Kevin Feasel

Deepthi Goguri explains the three physical join operators in SQL Server:

Merge join is not a bad thing and it may be efficient already in your execution plan. What you have to observe when you see the merge joins and performance slow on that plan is to focus on the upstream operations that are going into the merge join. Whether the data is presorted as you already have an index or whether the data is presorted in SQL Server own way then in that case, you can simply check if you can just add that missing column in the index and place in the last key column in the index or use a different join algorithm will be better. The other scenario might be you have lots of duplicate values in your data. If that is the case SQL Server will be using the work tables to handle how the duplicate values can be joined on. So, if you see the duplicate values or using tempdb, then finding the better options will be good.

Click through for more detail. Each physical operator has its place and does quite well within it, but the challenge comes when the optimizer thinks a particular route is better than the one you had in mind.

Comments closed

Comparing CSV to Parquet File Loading Performance in Power BI

Published 2021-04-05 by Kevin Feasel

Chris Webb has a comparison for us:

Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

Click through for the experiment and its results.

Comments closed

Query Plans and Window Functions

Published 2021-04-05 by Kevin Feasel

Erik Darling has a two-fer here. First, window functions and parallelism:

When windowing functions don’t have a Partition By, the parallel zone ends much earlier on than it does with one.
That doesn’t mean it’s always slower, though. My general experience is the opposite, unless you have a good supporting index.
But “good supporting index” is for tomorrow. You’re just going to have to deal with that.

Second, columnstore behavior with respect to window functions:

Not only is the parallel version of the row mode plan a full second slower, but… look at that batch mode plan.
Look at it real close. There’s a sort before the Window Aggregate, despite reading from the same nonclustered index that the row mode plan uses.
But the row mode plan doesn’t have a Sort in it. Why?

Check out both posts.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Curated SQL Posts