One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.
The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.
I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers
I’d say this fails on at least two counts, the first “
%then%” doesn’t seem grammatical (as
dis a noun), and
magrittrpipes can’t be associated with a new name (as they are implemented by looking for theirselves by name in captured unevaluated code).
wraprdot arrow pipe can take on new names.
Let’s try a variation, using a traditional pronunciation: “to”.
I don’t like “then” very much. I definitely prefer the C# lambda pronunciation of “goes to” for this.
Click through for John’s thoughts on right assignment as well, something I almost categorically dislike.
Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.
My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.
The database release phase is the first “primary” phase. It usually starts on a schedule, maybe 2 PM on a Wednesday or maybe “every day at 9 AM, 1 PM, 6 PM, and 10 PM” for more mature shops. Depending upon how much of an effect our release process normally has on end users, we might alert them that we expect to see a degradation in services starting at this point.
This phase of the release has us push out our database changes. This can involve creating or altering database objects but will not involve dropping existing objects.
Our database changes should support the blue-green deployment model. At this point in the process, all of the application code is “blue”—that is, the current production code. Our procedure changes need to be able to support that code without breaking. If we need to drop a column from a stored procedure, for example, we would not want to do it here. If we need to add a column to a stored procedure, we might do it here as long as it doesn’t break the calling code.
This is two topics smashed together into one post, but gives you an idea of a mental model around database deployments.
When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).
Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.
In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.
There’s a hefty test script here too, so check it out.
The good news is that it is not an unreasonable requirement and it has been done before. The solution is to use Group Managed Service Accounts (gMSA) and Credential Spec Files. A number of people have already documented their efforts. Some were more successful than others.
Click through for a detailed guide to getting this working.
Sometimes, the optimizer can take a query with a complex where clause, and turn it into two queries.
This only happens up to a certain point in complexity, and only if you have really specific indexes to allow these kinds of plan choices.
This is a case where I generally don’t trust the optimizer to get it right. Even when it does, I’d be concerned that this won’t be a stable solution and a minor change somewhere could result in regression to a bad plan.
We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).
This is where survey design comes in, where, as a market researcher, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.
I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.
Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being
quickstart.clouderalike the host name, is actually
localhost. You can change this in
Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.
I’m used to the Hortonworks Data Platform, so this is a quick compendium of things I noticed to were different.