
May 25, 2023

Adding Count to a Grouped DataFrame in Spark

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see each employee and the number of projects they are assigned to. In addition to that per-employee count, we also want to add a column with the total count, like below.

One important thing to remember about Spark transformations is that they’re lazy: running df.groupBy(...).agg(...) doesn’t materialize a new DataFrame, and nothing executes until you call an action such as show(). Because DataFrames are also immutable, the original data is still there for the taking, which is how you can reference it again later in the chained statement.
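To make that concrete, here is a minimal sketch in Spark SQL (the original post works with the DataFrame API; the employees table and name column here are assumptions):

    -- Count projects per employee, then use a window function over the
    -- grouped result to put the grand total on every row.
    SELECT
        name,
        COUNT(*) AS project_count,
        SUM(COUNT(*)) OVER () AS total_count
    FROM employees
    GROUP BY name;

The window function evaluates after the GROUP BY, so SUM(COUNT(*)) OVER () adds up the per-employee counts without a second pass over the raw data.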


Contrasting Kafka and Pulsar

Tessa Burk performs a comparison:

Apache Kafka® and Apache Pulsar™ are two popular message brokers. Although they share certain similarities, there are big differences between them that impact their suitability for various projects.

In this comparison guide, we will explore the functionality of Kafka and Pulsar and explain the differences between them, who would use each, and why.

Click through for that comparison. I haven’t used Pulsar before, so it’s interesting to get this sort of functionality and community comparison.


Holding and Pruning WhoIsActive Results

Andrea Allred has a script for us:

Last month I talked about how I started collecting data from sp_whoisactive. The long term goal was to analyze and tune those long running processes that are sometimes hard to quantify in Query Store. We had started collecting the data in a table (whoisactiveLog), but wanted to make a simple table that our Engineers could refer to and find the long running processes. We also wanted to archive off the whoisactiveLog into another table and save that for 70 days in case it was needed for further research on the tuning of processes. Each night, we have a break in our processes where we can do this maintenance so we decided we would empty the whoisactiveLog table completely.

There’s also a check for session length in there.
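For flavor, a minimal sketch of that nightly archive-and-purge pattern (the table and column names here, whoisactiveLog, whoisactiveArchive, and collection_time, are assumptions; Andrea’s actual script is in the post):

    -- Copy the current log into the archive table.
    INSERT INTO dbo.whoisactiveArchive
    SELECT *
    FROM dbo.whoisactiveLog;

    -- Keep 70 days of archive in case further tuning research is needed.
    DELETE FROM dbo.whoisactiveArchive
    WHERE collection_time < DATEADD(DAY, -70, GETDATE());

    -- Empty the log completely during the nightly maintenance break.
    TRUNCATE TABLE dbo.whoisactiveLog;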


Microsoft Fabric for the Power BI Practitioner

Kurt Buhler provides a nice graphic:

I’m just writing this quick article to share a visual overview I made of the newly announced Power BI-related features. I hope it helps you get an at-a-glance overview of some of the big changes relevant to Power BI. More importantly, I hope it doesn’t make you feel overwhelmed! For more information, check out the documentation and learning paths.

There is a lot in store for the platform, but you can already see a slew of new changes and opportunities for Power BI developers.


Trying out Parameter Sensitive Plan Optimization

Jared Poche gives it a try:

Parameter Sensitive Plan Optimization attempts to solve the parameter sniffing issue by allowing a query to have different plans that are used based on the cardinality of one important parameter. The parent query has a plan that is really just a stub; there can be up to three variant queries that have full execution plans. When the query is executed, the optimizer chooses which variant query and plan to use based on the cardinality of the parameter’s value.
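As a point of reference, Parameter Sensitive Plan Optimization is a SQL Server 2022 feature tied to compatibility level 160. A minimal sketch of enabling it (the database name is an assumption):

    -- PSP optimization requires compatibility level 160 (SQL Server 2022).
    ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 160;

    -- It is on by default at that level but can be toggled explicitly.
    ALTER DATABASE SCOPED CONFIGURATION
        SET PARAMETER_SENSITIVE_PLAN_OPTIMIZATION = ON;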

Click through for examples. I share Jared’s thoughts about row counts, though I’m not a huge fan of just using those. The rule of thumb I like to apply (though I certainly didn’t make it up myself!) is that, if you’re retrieving 0.5% or less of a table, a seek is the best option, assuming there is an appropriate index. If you’re retrieving 20% or more of a table, a scan is the best option, regardless of index quality.

Going a little further, somewhere in that 0.5-20% range there’s an in-between zone where you should be indifferent between seek and scan, as they’ll both perform about equally well. But if the optimizer chooses “seek” for the in-between zone and you nudge the returned row counts higher and higher, the seek becomes less and less viable, and there may be a zone somewhere between X% (that midpoint of indifference) and 20% where you haven’t yet crossed the row count threshold for another plan but should already have switched over to the scan.

Coming up with the right solution to this problem would be pretty hard, and I’m not paid to solve problems. I’m not-paid to come up with problems, however.


Building a Lakehouse in Microsoft Fabric

Reza Rad builds a warehouse down by the river:

The term Lakehouse is derived from two other words: Data Lake and Data Warehouse. A Lakehouse is a place to store structured data (as in a data warehouse) and unstructured data (as in a data lake) in a single location. A Lakehouse is capable of scaling up to handle large amounts of data, and other tools and services can interact with it, for example to load data into it or read data from it.

Click through for instructions on how to build one and how to access it from SSMS and Power BI.


Combining Backup Encryption and Compression

Matthew McGiffen joins two great flavors:

In SQL Server you can also compress your encrypted backups. Unlike TDE, this has been possible with Backup Encryption since the feature was first made available, and there have been no issues that required fixing – though, as always, you should still test that restores work correctly. As mentioned in my post about compression with TDE, compressing backups has benefits not just in terms of file size but potentially also in reduced backup times, as the time taken to write to disk is smaller.

Read on for more information. Microsoft did the right thing: they compress first and then encrypt; otherwise, you’re not getting any benefit from the compression.
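As a quick sketch, combining the two in a single BACKUP statement looks like this (the database name, path, and certificate name are assumptions; the certificate must already exist in master):

    -- COMPRESSION and ENCRYPTION combine cleanly because SQL Server
    -- compresses the backup data before encrypting it.
    BACKUP DATABASE [YourDatabase]
    TO DISK = N'D:\Backups\YourDatabase.bak'
    WITH COMPRESSION,
         ENCRYPTION (
             ALGORITHM = AES_256,
             SERVER CERTIFICATE = BackupCert
         );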
