Press "Enter" to skip to content

Day: September 9, 2021

magrittr’s Four Pipes

Gregory Janesch shows off the various pipes in the magrittr R package:

The magrittr package is a part of the extended tidyverse – i.e., not one of the ones normally loaded. It is the one that supplies the pipe operator (%>%), but it turns out that the package actually contains four pipe operators in total. All are intended to streamline and improve the readability of code, though the three non-basic ones are a bit more situational, and I’ve rarely seen them used, so I thought I would go into them a bit.

The CRAN page for magrittr is here; much of this post is based off of the package’s vignettes and documentation.

Click through for demonstrations of each. I’ve only seen the basic pipe in use as well, but the others look quite interesting and I can see use cases where knowing about them would be helpful. Also, note in the comments about the secret 5th pipe. H/T R-Bloggers

Comments closed

Databricks Serverless SQL

Nikhil Jethava and Kevin Clugage announce serverless SQL on Databricks:

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution.

Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries.

Things are getting interesting between Databricks and Azure Synapse Analytics, as both now have serverless SQL and Spark offerings. Synapse Analytics has the better implementation for serverless SQL and Databricks the superior Spark implementation, so it becomes a question of which weakness you take in order to gain the strength.

1 Comment

Interchangability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse‘ I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure based data platform solution so to answer it directly:

Read on to learn the answer.

Comments closed

Avoiding WHILE 1=1 Loops

Aaron Bertrand does not believe in the power of the infinite loop:

A short time ago a colleague had an issue with a Microsoft SQL Server stored procedure. They were using our recommended approach for batching updates, but there was a small problem with their code that led to the procedure “running forever.” I think we’ve all made a mistake like this at one point or another; here’s how I try to avoid the situation altogether.

The argument isn’t “don’t use WHILE loops” or “don’t use batching logic,” but instead to ensure that you have a break condition somewhere. It’s reasonable to ask for an end state before you begin processing something, after all.

Comments closed

.NET 5 C# Language Extension for SQL Server

Nikita Takru makes an announcement:

SQL Server 2019 supports the R, Python, and Java Language Extensions. These language extensions provide many benefits to developers. They provide data security, rapid speed for deployment, and ease of integration.

Previously, we announced the release of JavaR, and Python extensions. Today we are thrilled to share that we are open sourcing the .NET 5 C# Language Extension for SQL Server on GitHub.

My response to this is roughly the same as what Panagiotis Kanavos put in a GitHub issue. F#, being a functional programming language intended for things like data analysis, would be a no-brainer. And the choice of Windows-only support probably helped get it out the door faster, but can be limiting over time.

I guess the short version is, we’ll see what the community does with this. I’d also like to see how it compares to CLR on actions. My guess is that it’ll be slower based on the way the R and Python engines work, but that’s just a guess. Overall, I’m happy this exists and want it to be more useful.

Comments closed

Wait Stats on Sort Spills

Erik Darling starts a new series on wait stats, starting with one particular topic:

The point is not that spills are the sole things that cause these waits, it’s just to give you some things to potentially watch out for if you see these waits piling up and can’t pin down where they’re coming from.

In all the queries, I’m going to be using the MAX_GRANT_PERCENT hint to set the memory grant ridiculously low to make the waits I care about stick out.

Click through for the first of several demonstrations.

Comments closed