# Day: October 27, 2020

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use a SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for median we can provide ‘array(0.5)’ because we want the 50% value ie median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers it needs to be a certain number of MAD’s away. This number is also referred to as the threshold and is defaulted to 3.

Read on for three measures and their implementations in PySpark.

With Spark 3.0 release (on June 2020) there are some major improvements over the previous releases, some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning and other performance optimization and enhancements.

Below I’ve listed out these new features and enhancements all together in one page for better understanding and future reference.

Click through for the summary.

I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort which Pablo Lozano and Efraim Sharon pushed hard internally and several database administrators and database engineers contributed to (though I wasn’t one of them).

Pablo, Efraim, &co have done a great job with this, so check out the repo.

In this puzzle you have to check whether an array is Monotonic or not

Though it looks like the solution actually shows whether the array is strictly monotonic rather than monotonic (i.e., weakly monotonic), as it allows for ties.

In this blog post, I discussed some facts about Azure SQL Database Serverless. If someone wants to avoid those limitations and disable ‘Auto-pause,’ how costly is it?

Click through for a sample calculation to understand the trade-off between faster performance and lower bills.

1. Open the “New Session…” GUI for extended events in SSMS
2. Give it a name, it’s not important
3. Add an event with a Duration Field, I’ve added sp_statement_completed and configure it to be higher than an arbitrary value

And then watch it go boom in SSMS 18.7. Click through for a demo of the issue and a workaround.

Lock waits are particularly annoying. Imagine (I know, this might be difficult) that you have a friend who is puzzled by why a query is sometimes slow.

They send you an actual plan for when it’s fast, and an actual plan for when it’s slow. You compare them in every which way, and everything except duration is identical.

It’d be a whole lot easier to answer them if LCK waits were collected, but hey. Let’s just make them jump through another hoop to figure out what’s going on.

CXCONSUMER has a similar problem — and here’s the thing — if people are going through the trouble of collecting this information, give’em what they ask for. Don’t just give them what you think is a good idea.

Click through to see the issue and what you can do to work around this limitation.