Author: Kevin Feasel

The following was run on a fresh Ubuntu 16.04.2 LTS installation. The machine I’m using has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of RAM and 1 TB of mechanical storage capacity.
First I’ve setup a standalone Hadoop environment following the instructions from my Hadoop 3 installation guide. Below I’ve installed Kafkacat for feeding and reading off of Kafka, libsnappy as I’ll be using Snappy compression on the Kafka topics, Python, Screen for running applications in the background and Zookeeper which is used by Kafka for coordination.

From there, Mark has the configuration scripts and processes to get the entire pipeline built.

Comments closed

Parsing HL7 Messages With Python

Published 2019-02-25 by Kevin Feasel

Cristian Satnic has HL7 formatted messages in SQL Server and wishes to parse them using Python:

Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.
For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.

Cristian has a Jupyter Notebook which takes us through the solution. With SQL Server 2017, there’s the possibility of solving this in a stored procedure using Machine Learning Services.

Comments closed

Testing Cosmos DB’s REST API

Published 2019-02-25 by Kevin Feasel

Hasan Savran shows how we can test Cosmos DB’s REST API using Postman:

You have many options to access to CosmosDB. Rest API is one of these options and it is the low level access way to Cosmos DB. You can customize all options of CosmosDB by using REST API. To customize the calls, and pass the required authorization information, you need to use http headers. There are many headers you can set depending on the operation you want to run in CosmosDB. I am going to cover only the required headers here.

In the following example, I am going to try to create a database in CosmosDB emulator by using the REST API. First let’s look at the required header fields for this request. These requirement applies to all other REST API calls too.

It’s a little more complicated than just posting to a URL and Hasan has you covered.

Comments closed

Shared Database Privacy

Published 2019-02-25 by Kevin Feasel

Duncan Greaves has some thoughts about safeguarding privacy in shared databases:

The difficulty with privacy (or more correctly, information confidentiality) in database terms is that databases are supposed to maintain huge amounts of information, and processing and recording data is difficult, if not impossible without them. Public bodies especially, have difficulty in defining and maintaining the boundaries of information disclosure that they should provide, whilst maintaining the utility of the information for the improvement of welfare and services.
Privacy is contingent on first having a correctly secured database. Additional privacy controls are required when sensitive data leaves the protected trust perimeter of the database to be utilised by third parties.

Click through for more detail.

Comments closed

A Central Repository for Query Store

Published 2019-02-25 by Kevin Feasel

Tracy Boggiano shares work on centralizing Query Store results across a number of databases:

I’ve worked for SaaS companies for the last 6 years or so. So our queries are largely the same across our system and by default Query Store is per database. So it would be handy to have a central repository to help you determine which queries across your whole server are your worse performing queries. Hence comes my idea to build a central repository. I believe I put in connect item before it got moved to the new platform for this but never put a new ticket. So this is the beginning of building something along those lines. So it will be a work in progress so to speak. My current company I care about queries that are taking a long time to run. So I’m going to store the top 50 queries in total duration into a database handily called DBA because that’s where I store all the DBA stuff. To do this, I have some none client related databases I don’t care about so I create a table to tell which databases to collect the data from. Then a table to put the information into and job to run every day at midnight and sum up the data. Now the data is stored in UTC time so the data will be off by whatever timezone difference you are in but with most people being 24×7 shops as SaaS companies that shouldn’t matter and if it does you can edit the query.

This helps to resolve a necessary pain point in Query Store: all of that data is per-database, so if you have a federated system with a large number of equivalent databases, getting cross-system stats is painful.

Comments closed

The Anatomy of a Pester Test

Published 2019-02-25 by Kevin Feasel

Shane O’Neill takes us through using Pester to test self-contained scripts:

Where things differ…
…could be when you try to accommodate different people and create a .ps1 file that both defines and calls a function. Self Contained scripts, if you would call them that.
Normally the reason that I’ve heard from this is you’re trying to help a non-technical minded person and they just want a file that they can open, hit “run”, and everything is done for them.
Have you ever tried to Pester test those files though? It’s not recommended, especially if your function removes or modifies objects.

Click through for a solution and read Shane’s update as well for a scenario where it doesn’t quite work as hoped.

Comments closed

Generating Reference Numbers With Sequences

Published 2019-02-25 by Kevin Feasel

Matthew McGiffen shares one technique to generate reference numbers using a sequence and the FORMAT function:

One thing to note is that, while the sequence will generally produce unique number, it is still worth enforcing that in your table definition with a unique constraint i.e.

ALTER TABLE dbo.Orders ADD CONSTRAINT UQ_Orders_OrderReference UNIQUE(OrderReference);
This prevents someone from issuing an UPDATE command that might create a duplicate reference.

As long as you can live with the occasional gap in your reference number, sequences are a good solution to the problem.

Comments closed

Economic Articles With Data Included

Published 2019-02-22 by Kevin Feasel

Sebastian Kranz has a Shiny app to help you find economic papers with included data:

One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.
The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.

I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers

Comments closed

Giving A Name To The R Pipe

Published 2019-02-22 by Kevin Feasel

John Mount noodles an idea from Hadley Wickham:

I’d say this fails on at least two counts, the first “%then%” doesn’t seem grammatical (as d is a noun), and magrittr pipes can’t be associated with a new name (as they are implemented by looking for theirselves by name in captured unevaluated code).
However, the wrapr dot arrow pipe can take on new names.
Let’s try a variation, using a traditional pronunciation: “to”.

I don’t like “then” very much. I definitely prefer the C# lambda pronunciation of “goes to” for this.

Click through for John’s thoughts on right assignment as well, something I almost categorically dislike.

Comments closed

The Zen Of Airflow

Published 2019-02-22 by Kevin Feasel

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow:

Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.

My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31