Press "Enter" to skip to content

Curated SQL Posts

magrittr’s Four Pipes

Gregory Janesch shows off the various pipes in the magrittr R package:

The magrittr package is a part of the extended tidyverse – i.e., not one of the ones normally loaded. It is the one that supplies the pipe operator (%>%), but it turns out that the package actually contains four pipe operators in total. All are intended to streamline and improve the readability of code, though the three non-basic ones are a bit more situational, and I’ve rarely seen them used, so I thought I would go into them a bit.

The CRAN page for magrittr is here; much of this post is based off of the package’s vignettes and documentation.

Click through for demonstrations of each. I’ve only seen the basic pipe in use as well, but the others look quite interesting and I can see use cases where knowing about them would be helpful. Also, note in the comments about the secret 5th pipe. H/T R-Bloggers

Comments closed

Databricks Serverless SQL

Nikhil Jethava and Kevin Clugage announce serverless SQL on Databricks:

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution.

Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries.

Things are getting interesting between Databricks and Azure Synapse Analytics, as both now have serverless SQL and Spark offerings. Synapse Analytics has the better implementation for serverless SQL and Databricks the superior Spark implementation, so it becomes a question of which weakness you take in order to gain the strength.

1 Comment

Interchangability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse‘ I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure based data platform solution so to answer it directly:

Read on to learn the answer.

Comments closed

Avoiding WHILE 1=1 Loops

Aaron Bertrand does not believe in the power of the infinite loop:

A short time ago a colleague had an issue with a Microsoft SQL Server stored procedure. They were using our recommended approach for batching updates, but there was a small problem with their code that led to the procedure “running forever.” I think we’ve all made a mistake like this at one point or another; here’s how I try to avoid the situation altogether.

The argument isn’t “don’t use WHILE loops” or “don’t use batching logic,” but instead to ensure that you have a break condition somewhere. It’s reasonable to ask for an end state before you begin processing something, after all.

Comments closed

.NET 5 C# Language Extension for SQL Server

Nikita Takru makes an announcement:

SQL Server 2019 supports the R, Python, and Java Language Extensions. These language extensions provide many benefits to developers. They provide data security, rapid speed for deployment, and ease of integration.

Previously, we announced the release of JavaR, and Python extensions. Today we are thrilled to share that we are open sourcing the .NET 5 C# Language Extension for SQL Server on GitHub.

My response to this is roughly the same as what Panagiotis Kanavos put in a GitHub issue. F#, being a functional programming language intended for things like data analysis, would be a no-brainer. And the choice of Windows-only support probably helped get it out the door faster, but can be limiting over time.

I guess the short version is, we’ll see what the community does with this. I’d also like to see how it compares to CLR on actions. My guess is that it’ll be slower based on the way the R and Python engines work, but that’s just a guess. Overall, I’m happy this exists and want it to be more useful.

Comments closed

Wait Stats on Sort Spills

Erik Darling starts a new series on wait stats, starting with one particular topic:

The point is not that spills are the sole things that cause these waits, it’s just to give you some things to potentially watch out for if you see these waits piling up and can’t pin down where they’re coming from.

In all the queries, I’m going to be using the MAX_GRANT_PERCENT hint to set the memory grant ridiculously low to make the waits I care about stick out.

Click through for the first of several demonstrations.

Comments closed

Attributing Redshift Costs to Users

Jason Pedreza, et al, show how you can break down query utilization by user in an Amazon Redshift database:

At its simplest form, cost attribution can be determined using the amount of the storage assigned to the individual objects using the ownership of the objects to the groups. But the downside of this approach is it doesn’t provide a true translation of the resource usage. For example, let’s say Team 1 has total object size of 1 TB, whereas Team 2 has 100 GB in total size. Team 1 member runs 10 queries daily, and Team 2 runs 1,000 queries per day. Of course, Team 2 uses more resources than Team 1.

The Amazon Redshift RA3 architecture allows you to pay for the compute and data warehouse storage capacity separately, therefore storage doesn’t reflect the resources used by the teams for the cost attribution.

Click through to see how.

Comments closed

LATERAL VIEW in Hive

The Hadoop in Real World team provides a quick example of a powerful feature in Apache Hive:

Lateral view is used in conjunction with user-defined table generating functions such as explode(). A UDTF generates zero or more output rows for each input row. 

Click here if you like to know the difference between UDF, UDAF and UDTF

A lateral view first applies the UDTF to each row of the base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

In other words, LATERAL joins are the SQL standard for Microsoft’s CROSS APPLY operator. I normally dislike having different names for the same thing due to the risk of confusion, but in fairness to Microsoft on this one, my recollection is that the common name came after SQL Server 2005, which already had CROSS APPLY and OUTER APPLY.

Comments closed

Learn, Test, Forget

Erik Darling shows us the bright side of a memory equipped to forget:

I often have to talk clients down from odd ledges. At this point, there’s about 20 years of advice about SQL Server out there, and for various reasons received wisdom often gets migrated en masse, whether it’s still applicable or not.

Outdated trace flags and settings are pretty common, and don’t get me started on coding practices.

But I get it — that stuff came along with advice that still makes sense — like parallelism and memory settings, separating off tempdb, etc.

Unfortunately, it often leads people to the “it used to be fast so it should always be fast” trap based on what used to work, rather than what makes sense currently. Or even what the current problem is.

Read the whole thing. Also, in a rare double-quotation, I want to cite this line from Erik:

“I don’t know, xxVanWilderFan420xx said they were bad and we should avoid them.”

Never fully trust a person going by the pseudonym xxVanWilderFan420xx. However, if it were xxSolomonKaneFan420xx, 100% believe.

Comments closed

LAG() vs All

Kathi Kellenberger shows the power of the LAG() function in T-SQL:

Microsoft introduced the first window (aka, windowing or windowed) functions with SQL Server 2005. These functions were ROW_NUMBERRANKDENSE_RANKNTILE, and the window aggregates. Many folks, including myself, used these functions without realizing they were part of a special group. In 2012, Microsoft added several more: LAG and LEADFIRST_VALUE and LAST_VALUEPERCENT_RANK and CUME_DISTPERCENTILE_CONT, and PERCENTILE_DISC. They also added the ability to do running totals and moving calculations.

These functions were promoted as improving performance over older techniques, but that isn’t always the case. There were still performance problems with the aggregate functions introduced in 2005 and the four of the functions introduced in 2012. In 2019, Microsoft introduced Batch Mode on Row Store, available on Enterprise and Developer Editions, that can improve the performance of window aggregates and the four statistical functions from 2012.

I started writing this article to compare some window function solutions to traditional solutions. I found that there were so many ways to write a query that includes a column from another row that this article is dedicated to the window functions LAG and LEAD.

In these sorts of circumstances, LAG() is extremely efficient at its job. Click through to see just how efficient.

Comments closed