2023-03-10 – Curated SQL

Spark Application Dependency Caching

Published 2023-03-10 by Kevin Feasel

Shu Wang, Biao He, and Minchu Yang talk turkey about dependencies:

In this blog post, we will share our analysis of Spark Dependency Management at LinkedIn, highlight interesting findings, and present our design choice of using a simple user-level cache over more complex alternatives. We will also discuss our rollout experience and lessons learned. Finally, we will demonstrate the impact of accelerating all Spark applications at LinkedIn at the cluster level. Faster Spark jobs translate to increased data freshness, leading to an enhanced member experience by way of more relevant recommendations, timely insights, effective abuse protection, and other similar improvements.

If you work with Spark to any serious extent, you’ll want to read this post.

Comments closed

Type 2 Dimension Loading in Redshift

Published 2023-03-10 by Kevin Feasel

Vaidy Kalpathy takes us through a bit of dimensional modeling in AWS Redshift:

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed or were deleted or are new records coming in.

Click through for the article.

Comments closed

Architectural Erosion and Technical Debt

Published 2023-03-10 by Kevin Feasel

Uli Homann and Eric Charran (via Ben Brauer) talk about the concept of architectural erosion:

The way Eric thinks about architectural erosion is when architects and engineers work together, they construct a system or solution. They have launched that solution into production. It’s performing well for a period of time and then change happens. Whether it’s a change in requirements, a change in infrastructure, or a change in customer habits, DevOps signals saying that people are using a certain feature versus they’re not using a certain feature. What ended up happening is there’s a moment in time in which we look at velocity around how do we implement this change and make the applications experience, do what the customer or the end user wants as quickly as possible. Then there’s the strategic picture of managing things like technical debt, which is if I do something tactical, I’m probably going to do it fast and cheap and not necessarily the “right way.” This then accrues to the architectural patterns, longevity and scalability and all those other types of things, and then that goes into my pile of technical debt.

Read on to learn more about the topic and what we, as technical professionals, can do to mitigate this risk.

Comments closed

Try Purview (Almost) for Free

Published 2023-03-10 by Kevin Feasel

Wolfgang Strasser wants to try Microsoft Purview but doesn’t want to break the bank:

And my reaction was – Nice, very nice.. I can try and create Microsoft Purview instances for free and test new features..

BUT: I wanted to be sure and check, how much metadata (sources, scan results, data assets, classifications) can fit into 1 MB of metadata.

Read on for Wolfgang’s test, as well as the full set of costs around trying out Purview.

Comments closed

Building Your Own TRY_PARSE Function

Published 2023-03-10 by Kevin Feasel

Aaron Bertrand gives it a go:

A while back, I wrote a couple of tips about simulating TRY_CONVERT – a function added in SQL Server 2012 – in earlier versions (see Part 1 and Part 2). Recently, someone in the community had a similar requirement: they wanted to make TRY_PARSE work for dates in SQL Server 2008. Now, I’m not one to help people stay on versions of SQL Server that are now 15 years old, but I do acknowledge that hands may be tied, and they can’t always control the version they’re stuck with.

Aaron does this community member a solid, though it’s probably something you’d never want to use if you’re on SQL Server 2012 or later.

Comments closed

The Importance of Trace Flag 460

Published 2023-03-10 by Kevin Feasel

Jonathan Kehayias helps me understand that Trace Flag 460 is one of the best trace flags ever:

For most people that are reading this post, I’d venture to guess that you have no idea what Trace Flag 460 is or when you would use it. Well first off, let me tell you it’s a fully documented and supported trace flag, and it’s totally safe. In fact, on SQL Server 2019 and higher it is the default behavior when you create a new database. What does it do? It makes troubleshooting string or binary truncation issues easier by changing the error message that is returned from message ID 8152 and replaces it instead with message ID 2628. The trace flag is also available in SQL Server 2016 SP2 CU6+ and SQL Server 2017 CU12.

Read on to learn more about it and to remove a potential bit of confusion in the documentation.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Day: March 10, 2023

Spark Application Dependency Caching

Type 2 Dimension Loading in Redshift

Architectural Erosion and Technical Debt

Try Purview (Almost) for Free

Building Your Own TRY_PARSE Function

The Importance of Trace Flag 460