Press "Enter" to skip to content

Day: October 27, 2020

Outlier Identification Using Spark 3.0

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use a SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for median we can provide ‘array(0.5)’ because we want the 50% value ie median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers it needs to be a certain number of MAD’s away. This number is also referred to as the threshold and is defaulted to 3.

Read on for three measures and their implementations in PySpark.

Comments closed

Features and Improvements in Spark 3.0

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With Spark 3.0 release (on June 2020) there are some major improvements over the previous releases, some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning and other performance optimization and enhancements.

Below I’ve listed out these new features and enhancements all together in one page for better understanding and future reference.

Click through for the summary.

Comments closed

SSMS and Ignoring Certain Waits

Erik Darling has a plea to the SQL Server Management Studio team:

Lock waits are particularly annoying. Imagine (I know, this might be difficult) that you have a friend who is puzzled by why a query is sometimes slow.

They send you an actual plan for when it’s fast, and an actual plan for when it’s slow. You compare them in every which way, and everything except duration is identical.

It’d be a whole lot easier to answer them if LCK waits were collected, but hey. Let’s just make them jump through another hoop to figure out what’s going on.

CXCONSUMER has a similar problem — and here’s the thing — if people are going through the trouble of collecting this information, give’em what they ask for. Don’t just give them what you think is a good idea.

Click through to see the issue and what you can do to work around this limitation.

Comments closed

The Benefits of User Research

Kendra Little explains why data platform professionals should care about user research:

It’s easy to be a bit cynical about this and think, “That’s not the way it is in my job.” If you feel that way, I challenge you to take the initiative to change your own perspective! There is no need for database specialists to wait around for someone to give them permission to start thinking about their organization’s customers and to start being a creative collaborator with others.

Give yourself permission, and very likely it will have a positive impact on your career.

It’s easy for developers and DBAs to forget that they’re in the business of providing services. If you can’t name your consumers and their interests, that opens up concerns about whether you’re satisfying their needs.

Comments closed