Press "Enter" to skip to content

Author: Kevin Feasel

Road Construction Incentive Contracts And R

Sebastian Kranz promotes an interesting RTutor project:

Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article "Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement," published in the Review of Economic Studies in 2014.
They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and considerably increase total welfare at only minor increases in the risk that road construction firms face.
As part of his master's thesis at Ulm University, Claudius Schmid has generated a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design, and how the structural model used to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.

Click through for a couple of ways to get to this RTutor project and to learn a bit about building incentive contracts to modify behavior.  H/T R-Bloggers


Analyzing Customer Churn With Keras And H2O

Shirin Glander has released code pertaining to a forthcoming book chapter:

This is code that accompanies a book chapter on customer churn that I have written for the German dpunkt Verlag. The book is in German and will probably appear in February: https://www.dpunkt.de/buecher/13208/9783864906107-data-science.html.
The code you find below can be used to recreate all figures and analyses from this book chapter. Because the content is exclusively for the book, my descriptions around the code had to be minimal. But I’m sure you can get the gist, even without the book. 😉

Click through for the code.  This is using the venerable AT&T customer churn data set.
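
If you just want a feel for the Keras side of things before the book arrives, here is a minimal sketch of a binary churn classifier in R.  The feature matrix x_train, the 0/1 label vector y_train, and all hyperparameters are placeholders of my own, not anything from the chapter:

    library(keras)

    # x_train: numeric matrix of customer features; y_train: 0/1 churn flag
    model <- keras_model_sequential() %>%
      layer_dense(units = 16, activation = "relu",
                  input_shape = ncol(x_train)) %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 1, activation = "sigmoid")

    model %>% compile(
      optimizer = "adam",
      loss = "binary_crossentropy",
      metrics = "accuracy"
    )

    # hold out 20% of the training data for validation
    history <- model %>% fit(x_train, y_train,
                             epochs = 20, batch_size = 32,
                             validation_split = 0.2)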


Tips On Running SQL Server In RDS

Matthew McGiffen shares some tips on running SQL Server in Amazon RDS:

Or you can go with Amazon RDS (Relational Database Service).  This is more of a managed service where Amazon looks after some aspects of your database server for you. In return you give up some of the control you would have with your own server or VM. You can still pick the version of SQL Server you want installed, usually down to which cumulative update you want – though note that RDS normally lags behind the latest box version of SQL by 3 months or so. RDS is what’s known as a PaaS offering (Platform as a Service).
So, what do you give up and what do you gain? Here’s a quick summary of a few things I’ve noticed. This is not intended to be comprehensive, and please bear in mind that AWS is a fast-moving beast – changes happen regularly.

There are some good tips here, so check them out.
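
To make one of those trade-offs concrete: native BACKUP DATABASE ... TO DISK isn't available on RDS.  Instead, assuming the native backup/restore option is enabled on your instance, you back up to S3 through Amazon's stored procedures.  A quick sketch, with placeholder database and bucket names:

    -- back the database up to an S3 bucket (names are placeholders)
    EXEC msdb.dbo.rds_backup_database
        @source_db_name = N'MyDatabase',
        @s3_arn_to_backup_to = N'arn:aws:s3:::my-backup-bucket/MyDatabase.bak';

    -- check on the status of the backup task
    EXEC msdb.dbo.rds_task_status @db_name = N'MyDatabase';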


Window Functions And Default Frames

Steve Jones elaborates on the default frame that a window function has:

There is a framing clause that I can use after the ORDER BY in the OVER clause. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. At least, this is what you get when you include an ORDER BY clause. Many of us do this, but still get confused by the LAST_VALUE() and FIRST_VALUE() functions.
What I really want is a complete set of data, which either starts from the current row and runs to the end of the partition or includes all values. If I modify my framing clause, I’ll get what I expect.

Click through for a demonstration.
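
For a quick taste of the difference, here's a sketch against a hypothetical dbo.Sales table.  With the default frame, LAST_VALUE() simply returns the current row's value; an explicit frame running to the end of the window returns the true last value:

    SELECT
        SaleDate,
        Amount,
        -- default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
        -- the "last" value the frame can see is the current row itself
        LAST_VALUE(Amount) OVER (ORDER BY SaleDate) AS LastValueDefault,
        -- explicit frame through the end of the window gets the real last value
        LAST_VALUE(Amount) OVER (ORDER BY SaleDate
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS LastValueExplicit
    FROM dbo.Sales;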


Finding The Closest Numeric Match

Itzik Ben-Gan has a T-SQL puzzle for us:

As you can see, both T1 and T2 have a numeric column (INT type in this example) called val. The challenge is to match to each row from T1 the row from T2 where the absolute difference between T2.val and T1.val is the lowest. In case of ties (multiple matching rows in T2), match the top row based on val ascending, keycol ascending order. That is, the row with the lowest value in the val column, and if you still have ties, the row with the lowest keycol value. The tiebreaker is used to guarantee determinism.

Click through for the details as well as several solutions.
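
One straightforward (if not necessarily fastest) approach, sketched against the T1 and T2 tables as Itzik defines them, is CROSS APPLY with a TOP (1) subquery encoding the tiebreakers exactly as stated:

    SELECT T1.keycol, T1.val, A.keycol AS matchkeycol, A.val AS matchval
    FROM dbo.T1
    CROSS APPLY
    (
        -- closest T2.val; ties broken by val ascending, then keycol ascending
        SELECT TOP (1) T2.keycol, T2.val
        FROM dbo.T2
        ORDER BY ABS(T2.val - T1.val), T2.val, T2.keycol
    ) AS A;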


Lessons From Being Self-Employed

Eugene Meidinger shares some hard-learned lessons from being self-employed:

I’m naturally an introvert. If you and I have a conversation, it’s like a little taxi meter starts running. I may deeply, deeply enjoy the conversation and find it incredibly exciting, but it still taxes my energy levels. Small talk even more so. Imagine that every time someone chatted about the weather, you had to pay the same price as a Lyft ride to go 4 blocks. That’s how I feel about small talk.
That being said, we are still social creatures, and even introverts need human interaction. Especially so when you need to think through new situations, new problems. One of the things I realized attending PASS Summit is that I need social interaction to thrive. So now I spend a lot more time on Twitter and am part of a peer group of authors. I work down at the library whenever I have the chance.

When I did the work-from-home full-time thing, I sought out user groups to build up some technical skills and, more importantly, to get out of the house and talk to a group of people a couple times a week.  That paid off really well in the long run.

Speaking of paying off in the long run, check out Eugene’s BI newsletter.


Naming Is Hard

Tim Mitchell calls out some bad naming practices:

Spaces in object names. The bane of my existence. Most relational databases will allow you to use spaces in object names, requiring anyone accessing that object to put brackets around the name. My rule of thumb is that if it’s something that I will interact with programmatically, it doesn’t get a space in the name. Spaces in object names tend to break things, so please stop doing this.

My favorite naming convention oddity is the idea that if something has “Number” in the name, it’s never a number.
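
A hypothetical example which commits both sins at once: the space forces brackets on everyone downstream, and the "number" is, naturally, a string:

    CREATE TABLE dbo.[Sales Detail]
    (
        [Order Number] VARCHAR(20) NOT NULL  -- a "Number" you'll never do math on
    );

    -- every query now pays the bracket tax
    SELECT [Order Number]
    FROM dbo.[Sales Detail];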


Take The Data Professional Salary Survey

Brent Ozar has a new edition of the Data Professional Salary Survey:

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:

Take the Data Professional Salary Survey now.

The anonymous survey closes Sunday, January 6, 2019. The results will be completely open source, and shared with the community for your analysis.

I like this survey so much that I delivered a talk at PASS Summit making heavy use of it.


Donating To The R Foundation

Mark Niemann-Ross explains how you can donate to the R Foundation:

I benefit from the work of the R Foundation. They oversee the language, but also encourage a healthy ecosystem. CRAN happens because of them. Updates to R happen because of them. useR! happens because of them. Every day, you and I are the recipients of some part of their time.
The least we can do is show them some appreciation. If you point your web browser at https://www.r-project.org/foundation/donations.html you’ll find a convenient (and surprisingly inexpensive) place to express your appreciation. As an individual, you can send these kind folks twenty-five euros to tell them you’re in favor of what they do.

But be sure to read the whole thing, especially if you are an American who wants the donation to be tax-deductible.  I believe that earmarking in this case means adding special instructions on SIAA’s PayPal page.


Timing Means Of Groups With R

John Mount shares some performance measures pitting data.table against various dplyr methods for calculating grouped means:

In this reproduction attempt we see:
– The dplyr time is around 0.05 seconds. This is about 5 times slower than claimed.
– The dplyr sum()/n() time is about 0.2 seconds, about 5 times faster than claimed.
– The data.table time is around 0.004 seconds. This is about three times as fast as the dplyr claims, and over ten times as fast as the actual observed dplyr behavior.

Read the whole thing.  If you want to replicate it yourself, check out the RMarkdown file.
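
If you'd like to run a quick version of the comparison yourself, here's a minimal sketch; the data sizes and the use of microbenchmark are my choices, not John's setup:

    library(dplyr)
    library(data.table)
    library(microbenchmark)

    # simulated data: one grouping column, one value column
    n <- 1e6
    d <- data.frame(g = sample(1e5, n, replace = TRUE), v = runif(n))
    dt <- as.data.table(d)

    microbenchmark(
      dplyr_mean  = d %>% group_by(g) %>% summarize(m = mean(v)),
      dplyr_sum_n = d %>% group_by(g) %>% summarize(m = sum(v) / n()),
      data_table  = dt[, .(m = mean(v)), by = g],
      times = 10
    )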
