Trials and Tribulations of Maintaining CRAN Packages

John Mount explains the downside of CRAN being so useful:

If this automated email from a bulk sender bounces, goes to SPAM, or isn’t responded to quickly: your package will be archived or removed from CRAN. We’ve received these emails, and always acted on them quickly, out of fear.

The referred-to check results are often not reproducible. For example, our most recent scare (which hasn’t yet triggered the email, and we have submitted a work-around before complaining here) was just “SUMMARY: processing the following file failed”, without details beyond the name of the failing file.

This is a tricky problem. On the one hand, as an end user of packages, I want packages playing nicely with each other. This is a lot better than Pip’s “Oh, sorry, you need version X but to install version X, it’ll break package Y as it needs < X” nightmare.

On the other hand, as a maintainer of a package, there’s a lot of added effort on a tight timeline for what is usually a volunteer effort.

I don’t maintain any CRAN packages myself, so I tend to be on the beneficiary side of things. But it’s important to keep those package maintainers in mind, and one of the easiest ways to do that is to file explicit, reproducible bug reports. That may not make the deadlines any more lax, but it does make maintainers’ lives easier.
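
As a sketch of what that looks like in practice, a good report is a script a maintainer can run top to bottom with nothing external required. The package and function names below are hypothetical stand-ins:

```r
# Hypothetical minimal, self-contained bug report script.
# somepkg and fit_model() are stand-ins, not real names.
library(somepkg)

# Build the input inline; no references to files on my machine
dat <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# The exact call that fails, with the observed error captured as a comment:
# Error in fit_model(dat, y ~ x) : object of type 'closure' is not subsettable
fit_model(dat, y ~ x)

# Version information so the maintainer can match the environment
sessionInfo()
```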

Apply Functions in R

Selina Cheng explains how the various apply() functions work:

Today I’m going to talk about a useful family of functions that allows you to repetitively perform a specified function (e.g., sum(), mean()) across a vector, list, matrix, or data frame. For those of you familiar with ‘for’ loops, the apply() family often allows you to avoid constructing those and instead wrap the loop into one simple function.

I’m going to discuss the functions apply(), lapply(), sapply(), and tapply() in this blog post (as well as using the dplyr library for similar tasks). These functions all end in apply() because you apply the function you want across all the specified elements.

Read on to see how these functions work. H/T R-Bloggers.
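
As a quick taste in the meantime, here’s a minimal sketch of the four functions against R’s built-in mtcars data set:

```r
# apply(): run a function over the rows (MARGIN = 1) or columns (MARGIN = 2)
# of a matrix or data frame
apply(mtcars[, c("mpg", "hp")], 2, mean)   # column means

# lapply(): run a function over each element of a list/vector; always returns a list
lapply(mtcars[, c("mpg", "hp")], mean)

# sapply(): like lapply(), but simplifies the result to a vector where possible
sapply(mtcars[, c("mpg", "hp")], mean)

# tapply(): run a function within groups defined by a factor
tapply(mtcars$mpg, mtcars$cyl, mean)       # mean mpg by cylinder count
```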

Summarizing Data Mesh in Azure

Paul Andrew wraps up a series:

Consider this in the context of what I’ve already established in part 1 of the series, where I focused on our data products and ownership. Now I want to re-introduce our data domains as a level above our data products. We can even consider this a hierarchy.

– Data Domains

– Data Products

Why?

Read on for that answer.

Spools, Plus Memory Grants

Erik Darling continues looking at plan operators, starting with spools:

Spools are temporary structures that get stuck over in tempdb. They’re a bit like temp tables, though they don’t have any of the optimizations and enhancements. For example, loading data into a spool is a row-by-row operation.

The structure that spools use varies a bit. Table spools use a “clustered index”, but it’s not built on any of the columns in your data. Index spools use the same thing, but it’s defined on columns in your data that the optimizer thinks would make some facet of the query faster.

Definitely a must-read, and a good way of explaining things. In my own words: spools aren’t necessarily a problem, but if you have a problem, spools are often at the root of it.
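
As a contrived illustration (the tables here are hypothetical), a correlated subquery against a column with no supporting index is a classic way to meet an index spool, since the optimizer builds its own temporary index in tempdb rather than re-scanning the inner table for every outer row:

```sql
-- Hypothetical schema: dbo.Orders has no index on CustomerID.
-- The optimizer may add an eager index spool on Orders (CustomerID)
-- instead of scanning dbo.Orders once per row of dbo.Customers.
SELECT c.CustomerID
FROM dbo.Customers AS c
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Orders AS o
    WHERE o.CustomerID = c.CustomerID
);
```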

Erik Darling is also Overdrawn at the Memory Bank:

Whoever called memory a “bank” was a smart cookie. Everything you get from RAM is a loan.

In SQL Server, queries can get memory loaned to them while they execute. The most common reasons for memory grants are Sorts and Hashes. You may also see them for an Optimized Nested Loops Join, but whatever.

Memory is such an important aspect of query and overall server performance that it really helps to understand when there’s pressure on it, and where it’s coming from.

Check out both.
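
If you want to watch grants on a live system, one starting point (a sketch; trim the columns to taste) is the sys.dm_exec_query_memory_grants DMV:

```sql
-- Queries that currently hold a memory grant, or are waiting for one
SELECT
    mg.session_id,
    mg.requested_memory_kb,
    mg.granted_memory_kb,   -- NULL while the query is still waiting
    mg.used_memory_kb,
    mg.wait_time_ms,        -- non-zero means waiting on RESOURCE_SEMAPHORE
    t.text AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t;
```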

Matching Supply and Demand

Itzik Ben-Gan continues looking at interesting solutions to a tricky problem:

Last month I covered solutions based on a revised interval intersections approach compared to the classic one. The fastest of those solutions combined ideas from Kamil, Luca, and Daniel. It unified two queries with disjoint sargable predicates. It took the solution 1.34 seconds to complete against a 400K-row input. That’s not too shabby considering the solution based on the classic interval intersections approach took 931 seconds to complete against the same input. Also recall Joe came up with a brilliant solution that relies on the classic interval intersection approach but optimizes the matching logic by bucketizing intervals based on the largest interval length. With the same 400K-row input, it took Joe’s solution 0.9 seconds to complete. The tricky part about this solution is its performance degrades as the largest interval length increases.

This month I explore fascinating solutions that are faster than the Kamil/Luca/Daniel Revised Intersections solution and are neutral to interval length. The solutions in this article were created by Brian Walker, Peter Larsson, Paul White, and me.

These are some of the best solutions but the whole series has been quite interesting.

Data Virtualization with Azure SQL Managed Instance

Mladen Andzic announces data virtualization in Azure SQL Managed Instance:

Data virtualization capabilities, now in preview in Azure SQL Managed Instance, enable you to execute Transact-SQL (T-SQL) queries against data from files stored in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. Currently supported file formats are Parquet, CSV, and JSON.

I’m going to start calling it PolyBase Duck Typing: it’s not actually PolyBase but the syntax is the same and the outcome is the same and the method to enable it is the same and “PolyBase” is a lot easier to say than “data virtualization.” So even though it’s not PolyBase, I’m going to call it PolyBase until there’s a meaningful split.
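
The duck-typing quip holds in code, too. Here’s a rough sketch of what using the feature looks like; the storage URL is a placeholder, and a database scoped credential for the storage account is assumed to be in place already:

```sql
-- Enabling the feature uses the same knob PolyBase does,
-- which is part of why the nickname fits.
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;

-- Query a Parquet file sitting in Azure Data Lake Storage Gen2, in place.
SELECT TOP (100) *
FROM OPENROWSET(
    BULK 'https://yourstorageaccount.blob.core.windows.net/yourcontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS source_data;
```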

Watching and (Not) Messing with Optimization Phases

David Alcock giveth:

The full optimisation stage is where the optimiser uses a bag of tricks to optimise our query (surprise, surprise); well, technically it has three bags of tricks, named optimisation phases, that each contain a collection of transformation rules (which I cover in this post, and which you should never mess with). The optimiser is not limited to using just one of the phases, and each has a set of criteria that determines if the optimiser can use that particular phase.

In order to see how the optimiser is using these phases, we need to enable Trace Flag 8675 as well as Trace Flag 3604, which redirects the output to the query messages tab in Management Studio:

And David Alcock taketh away:

Now it has to be said it’s undocumented for a reason, the reason is that it’s really not a good idea to do this. In fact enabling this trace flag is such a bad idea that it will probably cause no end of issues with query performance…so let’s do it, but before we do let me add yet again that please don’t do this! Disabling optimisation features is a really bad idea, just like we did in this post – the purpose for this demo is just to show that we can, and how dangerous it can get.

This is fun to learn and interesting when doing advanced troubleshooting, but maybe not something you want to do very often.
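
For the curious (and careful), here’s a sketch of that diagnostic session, with the undocumented flag scoped to a single query via QUERYTRACEON so nothing else on the instance is affected:

```sql
-- Redirect trace output to the SSMS Messages tab
DBCC TRACEON(3604);

-- Trace Flag 8675 prints the optimization phases this query goes through;
-- QUERYTRACEON limits the flag to this one statement.
SELECT o.name, c.name
FROM sys.objects AS o
JOIN sys.columns AS c
    ON o.object_id = c.object_id
OPTION (QUERYTRACEON 8675);
```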

Preventing Triggers from Firing for a Single Process

Andy Mallon builds a trigger guard:

I recently saw a question on DBA Stack Exchange (it has since been deleted by the author) from someone who had a “special process” that ran regularly; as part of that process, they disabled the trigger, did some stuff, and re-enabled it. The step that disables the trigger would deadlock and cause problems, so the asker was wondering how to catch & handle the deadlock during the DISABLE TRIGGER step.

Yeah, disabling the trigger: not so great. Read on for one interesting way to avoid doing that, as well as a few other methods in the comments.
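
One common pattern in this spirit (a sketch, not necessarily Andy’s exact implementation; the key name is arbitrary) uses SESSION_CONTEXT so the special process can flag itself and the trigger can bail out early, with no DISABLE TRIGGER in sight:

```sql
-- In the special process's session, before it does its work:
EXEC sys.sp_set_session_context @key = N'SkipAuditTrigger', @value = 1;

-- Inside the trigger, check the flag and exit early when it's set.
CREATE OR ALTER TRIGGER dbo.MyTable_Audit
ON dbo.MyTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    IF CAST(SESSION_CONTEXT(N'SkipAuditTrigger') AS int) = 1
        RETURN;   -- flagged sessions skip the trigger body entirely

    SET NOCOUNT ON;
    -- ... normal trigger logic goes here ...
END;
```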
