Shared Database Privacy

Duncan Greaves has some thoughts about safeguarding privacy in shared databases:

The difficulty with privacy (or more correctly, information confidentiality) in database terms is that databases are supposed to maintain huge amounts of information, and processing and recording data is difficult, if not impossible without them. Public bodies especially, have difficulty in defining and maintaining the boundaries of information disclosure that they should provide, whilst maintaining the utility of the information for the improvement of welfare and services.
  Privacy is contingent on first having a correctly secured database. Additional privacy controls are required when sensitive data leaves the protected trust perimeter of the database to be utilised by third parties.

Click through for more detail.

A Central Repository for Query Store

Tracy Boggiano shares work on centralizing Query Store results across a number of databases:

I’ve worked for SaaS companies for the last 6 years or so.  So our queries are largely the same across our system and by default Query Store is per database.  So it would be handy to have a central repository to help you determine which queries across your whole server are your worse performing queries.  Hence comes my idea to build a central repository.  I believe I put in connect item before it got moved to the new platform for this but never put a new ticket.  So this is the beginning of building something along those lines.  So it will be a work in progress so to speak.  My current company I care about queries that are taking a long time to run.  So I’m going to store the top 50 queries in total duration into a database handily called DBA because that’s where I store all the DBA stuff.  To do this, I have some none client related databases I don’t care about so I create a table to tell which databases to collect the data from.  Then a table to put the information into and job to run every day at midnight and sum up the data.  Now the data is stored in UTC time so the data will be off by whatever timezone difference you are in but with most people being 24×7 shops as SaaS companies that shouldn’t matter and if it does you can edit the query.

This helps to resolve a necessary pain point in Query Store: all of that data is per-database, so if you have a federated system with a large number of equivalent databases, getting cross-system stats is painful.

The Anatomy of a Pester Test

Shane O’Neill takes us through using Pester to test self-contained scripts:

Where things differ…
…could be when you try to accommodate different people and create a .ps1 file that both defines and calls a function. Self Contained scripts, if you would call them that.
Normally the reason that I’ve heard from this is you’re trying to help a non-technical minded person and they just want a file that they can open, hit “run”, and everything is done for them.
Have you ever tried to Pester test those files though? It’s not recommended, especially if your function removes or modifies objects.

Click through for a solution and read Shane’s update as well for a scenario where it doesn’t quite work as hoped.

Generating Reference Numbers With Sequences

Kevin Feasel

2019-02-25

T-SQL

Matthew McGiffen shares one technique to generate reference numbers using a sequence and the FORMAT function:

One thing to note is that, while the sequence will generally produce unique number, it is still worth enforcing that in your table definition with a unique constraint i.e.

ALTER TABLE dbo.Orders ADD CONSTRAINT UQ_Orders_OrderReference UNIQUE(OrderReference);

This prevents someone from issuing an UPDATE command that might create a duplicate reference. 

As long as you can live with the occasional gap in your reference number, sequences are a good solution to the problem.

Economic Articles With Data Included

Kevin Feasel

2019-02-22

Data, R

Sebastian Kranz has a Shiny app to help you find economic papers with included data:

One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.

I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers

Giving A Name To The R Pipe

Kevin Feasel

2019-02-22

R

John Mount noodles an idea from Hadley Wickham:

I’d say this fails on at least two counts, the first “%then%” doesn’t seem grammatical (as d is a noun), and magrittr pipes can’t be associated with a new name (as they are implemented by looking for theirselves by name in captured unevaluated code).

However, the wrapr dot arrow pipe can take on new names.

Let’s try a variation, using a traditional pronunciation: “to”.

I don’t like “then” very much. I definitely prefer the C# lambda pronunciation of “goes to” for this.

Click through for John’s thoughts on right assignment as well, something I almost categorically dislike.

The Zen Of Airflow

Kevin Feasel

2019-02-22

ETL, Python

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow:

Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.

My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.

The Steps Of A Database Deployment

I have a series on near-zero downtime. In this post, I cover some of the key assumptions in the series as well as the steps in a database deployment:

Database Release
The database release phase is the first “primary” phase. It usually starts on a schedule, maybe 2 PM on a Wednesday or maybe “every day at 9 AM, 1 PM, 6 PM, and 10 PM” for more mature shops. Depending upon how much of an effect our release process normally has on end users, we might alert them that we expect to see a degradation in services starting at this point.

This phase of the release has us push out our database changes. This can involve creating or altering database objects but will not involve dropping existing objects.

Our database changes should support the blue-green deployment model. At this point in the process, all of the application code is “blue”—that is, the current production code. Our procedure changes need to be able to support that code without breaking. If we need to drop a column from a stored procedure, for example, we would not want to do it here. If we need to add a column to a stored procedure, we might do it here as long as it doesn’t break the calling code.

This is two topics smashed together into one post, but gives you an idea of a mental model around database deployments.

Azure Data Lake Store Gen2

James Serra gives us the low-down on Azure Data Lake Store Gen2 now that it is generally available:

When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.

Regression Testing With Pester

Ust Oldfield continues a series on Pester testing:

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.

There’s a hefty test script here too, so check it out.

Categories

February 2019
MTWTFSS
« Jan Mar »
 123
45678910
11121314151617
18192021222324
25262728