Press "Enter" to skip to content

Day: November 29, 2021

Managing Spark on Kubernetes instead of YARN

Rohit Choudhary argues that it’s a good idea to move from YARN to Kubernetes for your Spark clusters:

When it comes to data operations, Spark provides a tremendous advantage as a resource for data operations because it aligns with the things that make data ops valuable. It is optimized for machine learning and AI, which are used for batch processing (in real-time and at scale), and it is adept at operating within different types of environments.

Spark doesn’t completely manage these clusters of machines but instead uses a cluster manager (known as a scheduler). Most companies have traditionally used the Java Virtual Machine (JVM)-based Hadoop YARN to manage their clusters. But with the dramatic rise of Kubernetes and cloud-native computing, many organizations are moving away from YARN to Kubernetes to manage their Spark clusters. Spark on Kubernetes is even now generally available since the Apache Spark 3.1 release in March 2021.

I see some of the benefits there but am not totally sold, especially given the complexity of Kubernetes and its own lack of built-in security measures.

Comments closed

Ordered Set Functions in SQL Server

I continue a series on window functions in SQL Server:

As of SQL Server 2019, there is only one ordered set function: STRING_AGG(). I like STRING_AGG() a lot, especially because it means my days of needing to explain the STUFF() + FOR XML PATH trick to concatenate values together in SQL Server are numbered.

STRING_AGG() is interesting in that we categorize it as a window function and yet it violates my first rule of window functions: there isn’t an OVER() clause. Instead, it accepts but does not require a WITHIN GROUP() clause. Let’s see it in action.

Click through for a look at that, as well as a little hint that maybe we’ve seen ordered set functions before in a different guise.

Comments closed

FAST_FORWARD and Cursors

Joe Obbish skips past the commercials:

If you’re like me, you started your database journey by defining cursors with the default options. This went on until a senior developer or DBA kindly pointed out that you can get better performance by using the FAST_FORWARD option. Or maybe you were a real go-getter and found Aaron Bertrand’s performance benchmarking blog post on different cursor options. I admit that for many years I didn’t care to know why FAST_FORWARD sometimes made my queries faster. It had “FAST” in the name and that was good enough for me.

Recently I saw a production issue where using the right cursor options led to a 1000X performance improvement. I decided that ten years of ignorance was enough and finally did some research on different cursor options. This post contains a reproduction and discussion of the production issue.

I thought everybody knew how this works: the database streams the data tape from the supply pully to the play shaft by using a sprocket to rotate the gear in the cassette at a fixed speed. The FAST_FORWARD cursor option engages the fast forward idler in the VCR database and causes rotation to occur more rapidly than normal.

Comments closed