Author: Kevin Feasel

Economic Articles With Data Included

Sebastian Kranz has a Shiny app to help you find economic papers with included data:

One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a research article.

I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers

Giving A Name To The R Pipe

John Mount noodles on an idea from Hadley Wickham:

I’d say this fails on at least two counts, the first “%then%” doesn’t seem grammatical (as d is a noun), and magrittr pipes can’t be associated with a new name (as they are implemented by looking for themselves by name in captured unevaluated code).

However, the wrapr dot arrow pipe can take on new names.

Let’s try a variation, using a traditional pronunciation: “to”.

I don’t like “then” very much. I definitely prefer the C# lambda pronunciation of “goes to” for this.

Click through for John’s thoughts on right assignment as well, something I almost categorically dislike.

The Zen Of Airflow

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow:

Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.

My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.

The Steps Of A Database Deployment

I have a series on near-zero downtime. In this post, I cover some of the key assumptions in the series as well as the steps in a database deployment:

Database Release
The database release phase is the first “primary” phase. It usually starts on a schedule, maybe 2 PM on a Wednesday or maybe “every day at 9 AM, 1 PM, 6 PM, and 10 PM” for more mature shops. Depending upon how much of an effect our release process normally has on end users, we might alert them that we expect to see a degradation in services starting at this point.

This phase of the release has us push out our database changes. This can involve creating or altering database objects but will not involve dropping existing objects.

Our database changes should support the blue-green deployment model. At this point in the process, all of the application code is “blue”—that is, the current production code. Our procedure changes need to be able to support that code without breaking. If we need to drop a column from a stored procedure, for example, we would not want to do it here. If we need to add a column to a stored procedure, we might do it here as long as it doesn’t break the calling code.

This is two topics smashed together into one post, but it gives you an idea of a mental model around database deployments.
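To make the "create or alter, but don't drop" rule concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is an assumption for illustration: the `customer` table, the `email` column, and the `blue_insert` and `green_insert` functions are hypothetical stand-ins for the current and next versions of the application code, and SQLite stands in for whatever database engine is actually being deployed to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def blue_insert(conn, name):
    # Hypothetical "blue" (current production) code: it names its columns
    # explicitly, so it does not care about columns added later.
    conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))


def green_insert(conn, name, email):
    # Hypothetical "green" (next release) code that uses the new column.
    conn.execute(
        "INSERT INTO customer (name, email) VALUES (?, ?)", (name, email)
    )


blue_insert(conn, "before release")

# Database release phase: an additive change only. The new column is
# nullable, so existing callers that never mention it keep working.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")

blue_insert(conn, "after release")           # blue code is unaffected
green_insert(conn, "green", "g@example.com")  # green code can use the column

rows = conn.execute("SELECT name, email FROM customer ORDER BY id").fetchall()
print(rows)
```

The blue code keeps working across the schema change because the change only adds something optional. Dropping or renaming a column at this point would break it, which is why that kind of change waits for a later phase.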

Azure Data Lake Store Gen2

James Serra gives us the low-down on Azure Data Lake Store Gen2 now that it is generally available:

When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (e.g. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.

Regression Testing With Pester

Ust Oldfield continues a series on Pester testing:

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.

There’s a hefty test script here too, so check it out.

.NET Core On Docker Connecting Via AD To SQL Server

Michal Poreba shows us how to connect Windows Docker containers running .NET Core to SQL Server via Active Directory when the containers are not connected to the domain:

The good news is that it is not an unreasonable requirement and it has been done before. The solution is to use Group Managed Service Accounts (gMSA) and Credential Spec Files. A number of people have already documented their efforts. Some were more successful than others.

Click through for a detailed guide to getting this working.

Optimizer Imperfections With Complex Filters

Erik Darling shows a couple of examples of how the optimizer will sometimes, but not always, pick a superior plan when dealing with complicated filters:

Sometimes, the optimizer can take a query with a complex where clause, and turn it into two queries.

This only happens up to a certain point in complexity, and only if you have really specific indexes to allow these kinds of plan choices.

This is a case where I generally don’t trust the optimizer to get it right. Even when it does, I’d be concerned that this won’t be a stable solution and a minor change somewhere could result in regression to a bad plan.
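For a sense of what "turn it into two queries" can mean, here is a rough sketch of the logical rewrite, assuming a hypothetical `posts` table with one index per filtered column. It runs on SQLite through Python rather than SQL Server, and it only shows that the single-query and split forms return the same rows. It says nothing about which plan any optimizer would actually choose, and these are not the queries from Erik's post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, owner_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO posts (owner_id, score) VALUES (?, ?)",
    [(1, 5), (1, 50), (2, 5), (2, 50), (3, 1)],
)
# One narrow index per predicate, so each half of the filter can seek on its own.
conn.execute("CREATE INDEX ix_posts_owner ON posts (owner_id)")
conn.execute("CREATE INDEX ix_posts_score ON posts (score)")

# The filter written as one query with an OR.
single_query = """
SELECT id FROM posts
WHERE owner_id = 1 OR score >= 50
ORDER BY id
"""

# The same filter written by hand as two queries; UNION removes the
# rows that satisfy both predicates.
split_query = """
SELECT id FROM posts WHERE owner_id = 1
UNION
SELECT id FROM posts WHERE score >= 50
ORDER BY id
"""

single_ids = [row[0] for row in conn.execute(single_query)]
split_ids = [row[0] for row in conn.execute(split_query)]
print(single_ids, split_ids)
```

Writing the split form by hand is one way to avoid depending on the optimizer to find it, at the cost of a longer query to maintain.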

Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R:

We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).

This is where survey design comes in, where, as a market researcher, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.

I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.
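A small made-up example shows the risk. The ratings below are hypothetical, not from Abhijit's post. If the scores are only ordinal, then any order-preserving rescaling of them is equally valid, yet averaging the scores treats them as cardinal, and the "winning" product can flip under such a rescaling.

```python
from statistics import mean

# Hypothetical 1-10 ratings from three respondents for two products.
ratings = {
    "r1": {"A": 10, "B": 7},  # prefers A
    "r2": {"A": 4, "B": 6},   # prefers B
    "r3": {"A": 4, "B": 6},   # prefers B
}


def winner(scores, transform=lambda x: x):
    """Average each product's (transformed) scores and return the top product."""
    averages = {
        product: mean(transform(person[product]) for person in scores.values())
        for product in ("A", "B")
    }
    return max(averages, key=averages.get)


def square(x):
    # Squaring is strictly increasing on positive numbers, so every
    # respondent's own ordering of A and B stays exactly the same.
    return x ** 2


raw_winner = winner(ratings)              # averages the scores as given
squared_winner = winner(ratings, square)  # averages the rescaled scores
print(raw_winner, squared_winner)
```

On the raw scores, B has the higher average (about 6.33 against 6.0). After squaring, A does (44 against about 40.33), even though no respondent changed their mind about which product they prefer.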

Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase:

Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini.

Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.

I’m used to the Hortonworks Data Platform, so this is a quick compendium of things I noticed were different.

