Press "Enter" to skip to content

September 4, 2020

Filtering with dplyr Using Strings as Expressions

Kenneth Tay shows how to build arbitrary expressions to execute in dplyr’s filter function:

This took me a while to figure out, and so I thought I would post this for future reference. Let’s say I have the mtcars data and I want to filter for just the rows with cyl == 6. I would do something like this:

library(tidyverse)
data(mtcars)
mtcars %>% filter(cyl == 6)

What if I had the filter condition as a string instead?

Read on to see how you can do this. Given the answer, I wonder if there’s any chance it could be turned into an injection vulnerability. H/T R-Bloggers
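
For reference, one common way to do it (not necessarily the exact approach in the linked post) is to parse the string with rlang and unquote the result inside filter():

library(tidyverse)
library(rlang)

data(mtcars)

# The filter condition, held as a string
cond <- "cyl == 6"

# parse_expr() turns the string into an unevaluated expression;
# !! unquotes it so filter() evaluates it against mtcars
mtcars %>% filter(!!parse_expr(cond))

Anything that evaluates an arbitrary string this way deserves the same suspicion as dynamic SQL, which is why the injection question is worth asking.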


The State of R Packages Locally

Maëlle Salmon and Gábor Csárdi walk us through some details about installed packages on a system:

Now how do you know where any of your installed packages was installed? You can use find.package() and path.package()!

To check whether a package is installed, it is better to use find.package() than installed.packages() because the latter, as its docs state, can be slow on some systems. In both cases, a hit does not mean the package is usable; for that, you’d need to use library() or require().

Read on for several tips around where packages are located, what their contents look like, and learning a bit more about the actual code in packages.
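
As a quick sketch of that advice (the package name here is just an example), find.package() with quiet = TRUE avoids the error thrown for a missing package, and requireNamespace(), a quieter cousin of the require() mentioned in the post, confirms the package actually loads:

# Returns character(0) instead of erroring when the package is absent
is_installed <- function(pkg) {
  length(find.package(pkg, quiet = TRUE)) > 0
}

is_installed("dplyr")      # TRUE if dplyr is on the library path
find.package("dplyr")      # the directory it was installed into

# Installed does not mean usable; this checks that it actually loads
requireNamespace("dplyr", quietly = TRUE)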


Triggers Should Handle Multi-Row Scenarios

Kenneth Fisher has a public service announcement:

While I was helping him with part of it I noticed that the trigger was only set up to handle single row updates. When I told him it needed to be able to handle multiple row updates he asked me in all seriousness, “Why? Since only one value can be changed at a time why should I worry about multiple rows being changed at once?”

Read on for Kenneth’s response, which approximates to “Au contraire, mon frère.” Unhandled assumptions often become bugs.


Messy Code and Reasonable Expectations

Rachel by the Bay has a doozy of a story:

One day not so long ago, I was in a meeting listening to a team explain why their service had gone down and taken out a big chunk of a business. They were one of those things that has to exist and work in order for the actual “thing that makes money” to go. Think of delivering pizzas, connecting dog walkers with dogs who need to be walked, that kind of thing.

It turned out they had been crashing every time a request came through for a certain part of the country. That is, not all pizzas, dog walkers, or whatever it was were handled identically, so they had their own city or region configurations. Think of differences in pricing, taxes, features, or whatever. Trying to process a request for this one particular region had caused the entire process to die when it hit a new config that was “bad” somehow.

Read on for the story. This sounds like a boundary issue. Boundaries are messy and need thorough examination to handle as many points of failure as is reasonable. Taking seriously the objection that such checks make the code messy, the answer is not “Don’t do the checks” but rather “Put the checks where their messiness has minimal impact on the rest of my beautiful code while they still do the important work we need them to do.” Failing that, live with the mess and have a working process.


Building dacpac Files on Non-Windows Machines

Erik Ejlskov Jensen points out another advantage of Azure SQL Database’s database projects:

For a while now, it has been possible to publish a .dacpac file (meaning apply it to a new or existing database) using the cross-platform version of sqlpackage.

But authoring and building a database project (sqlproj) was only possible on Windows, as the .sqlproj project type is based on the classic .NET Framework .csproj project type.

Now, thanks to the new Database Project extension in the Azure Data Studio Insiders build, it is possible to author, build, and manually publish a SQL Server Database project.

And by using the new MsBuild.Sdk.SqlProj SDK and project type, it is also possible to build and publish a Database Project from a build agent (CI pipeline), without having to install the sqlpackage tool. Read on!

You heard Erik.


Data Lineage and SSIS

Aveek Das has a two-parter. First up is a discussion of data lineage:

In this article, I am going to explain what Data Lineage in ETL is and how to implement it. In this modern world, where companies are dealing with a humongous amount of data every day, there also lies a challenge to efficiently manage and monitor this data. There are systems that generate data every second, and that data is processed through to a final reporting or monitoring tool for analysis. In order to process this data, we use a variety of ETL tools, which in turn make the data transformation possible in a managed way.

While the data is being transformed in the ETL pipeline, it has to go through multiple transformation steps in order to achieve the final result. For example, when the ETL receives the raw data from the source, operations such as filtering, sorting, merging, or splitting two columns may be applied to it. There can also be aggregations or other calculations made on this raw data before it finally moves into a data warehouse or is prepared for reporting. In order to be able to detect what the source of a particular record is, we need to implement something known as Data Lineage. It is a simple piece of metadata that helps us detect gaps in the data processing pipeline and enables us to fix issues later.

Part two covers data lineage with SQL Server Integration Services:

In this article, I am going to discuss SSIS data lineage concepts, which are often used while designing ETL workloads on a data warehouse. Although this article is focused on implementing data lineage using SSIS, the concepts are not confined to SSIS; they apply to any ETL tool on the market that moves data from a source to a destination. In my previous article, Understanding Data Lineage in ETL, I have already discussed the general importance of data lineage concepts for any ETL tool. I would definitely suggest you have a look at it if you want to understand in general how data lineage helps to track the source of a single record in the warehouse.

If you’re fairly new to this world, it’s a good introduction to an important topic.
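
The core idea in part one is tool-agnostic: stamp each record with metadata saying where and when it entered the pipeline. A minimal sketch in R, with made-up column and source names for illustration:

library(dplyr)

# Pretend this batch just arrived from an upstream system
raw_orders <- tibble(order_id = 1:3, amount = c(120, 80, 45))

# Stamp each row with lineage metadata on its way into staging
staged_orders <- raw_orders %>%
  mutate(
    lineage_source    = "sales_system_a",       # hypothetical source system
    lineage_batch_id  = "batch_20200904_001",   # hypothetical batch identifier
    lineage_loaded_at = Sys.time()              # load timestamp
  )

# Any record can now be traced back to its source system and batch
staged_orders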
