Where Hadoop Is Going

Kevin Feasel

2019-02-25

Hadoop

Erik Krogen summarizes a recent Hadoop developer gathering at LinkedIn:

The day started with LinkedIn’s very own Jonathan Hung (left) and Anthony Hsu (right) discussing TensorFlow on YARN, or TonY, our home-grown and recently open-sourced solution for distributed deep learning via TensorFlow on top of YARN. They discussed its architecture and implementation, as well as future goals, such as support for additional runtimes like PyTorch. You can view their slides here and a recording of their presentation here.

Looks like there were several interesting talks and a lot of content showing where Hadoop will go over the next year or so.

Getting Started With Apache Flume

Kevin Feasel

2019-02-25

Hadoop

Mark Litwintschik takes us through installation and configuration of Apache Flume:

The following was run on a fresh Ubuntu 16.04.2 LTS installation. The machine I’m using has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of RAM and 1 TB of mechanical storage capacity.

First I’ve set up a standalone Hadoop environment following the instructions from my Hadoop 3 installation guide. Below I’ve installed Kafkacat for feeding and reading off of Kafka, libsnappy as I’ll be using Snappy compression on the Kafka topics, Python, Screen for running applications in the background and Zookeeper, which is used by Kafka for coordination.

From there, Mark has the configuration scripts and processes to get the entire pipeline built.

Parsing HL7 Messages With Python

Kevin Feasel

2019-02-25

Python

Cristian Satnic has HL7 formatted messages in SQL Server and wishes to parse them using Python:

Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.

For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.

Cristian has a Jupyter Notebook which takes us through the solution. With SQL Server 2017, there’s the possibility of solving this in a stored procedure using Machine Learning Services.
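As a rough illustration of the field numbering Cristian describes (this is not his notebook's code), here is a minimal Python sketch. The sample message and the `get_field` helper are hypothetical, and the sketch assumes the usual pipe separator and ignores the special numbering of the MSH segment.

```python
# Hypothetical HL7 2.x message; segments are conventionally separated by carriage returns.
SAMPLE_MESSAGE = "\r".join([
    "MSH|^~\\&|SENDER|FACILITY|RECEIVER|DEST|20190225120000||ADT^A01|MSG0001|P|2.3",
    "PID|1||12001||DOE^JOHN",
    "PV1|1|I",
])


def get_field(message, segment_id, field_number):
    """Return field N of the first segment named segment_id, or None if absent.

    Assumes '|' is the field separator. MSH is not handled correctly here,
    because MSH-1 is the separator character itself.
    """
    # Tolerate newline-delimited messages by normalising to carriage returns.
    for segment in message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] == segment_id:
            # fields[0] holds the segment name, so PID-3 is simply fields[3].
            if field_number < len(fields):
                return fields[field_number]
            return None
    return None


patient_id = get_field(SAMPLE_MESSAGE, "PID", 3)
print(patient_id)
```

A filter over rows pulled from SQL Server could then call `get_field` on each raw message instead of relying on LIKE comparisons.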

Testing Cosmos DB’s REST API

Hasan Savran shows how we can test Cosmos DB’s REST API using Postman:

You have many options to access CosmosDB. REST API is one of these options and it is the low-level way to access Cosmos DB. You can customize all options of CosmosDB by using REST API. To customize the calls, and pass the required authorization information, you need to use HTTP headers. There are many headers you can set depending on the operation you want to run in CosmosDB. I am going to cover only the required headers here.

In the following example, I am going to try to create a database in CosmosDB emulator by using the REST API. First let’s look at the required header fields for this request. These requirements apply to all other REST API calls too.

It’s a little more complicated than just posting to a URL and Hasan has you covered.
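To give a sense of why it is more than a plain POST, here is a hedged Python sketch of building the commonly required headers (`Authorization`, `x-ms-date`, `x-ms-version`) with a master-key token. It is not Hasan's Postman script; the `build_headers` helper, the placeholder key, and the default `api_version` value are assumptions for illustration, and the signing format follows my reading of the Cosmos DB REST documentation.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def build_headers(verb, resource_type, resource_link, date, master_key,
                  api_version="2018-12-31"):
    """Build request headers for a master-key-authenticated REST call.

    `date` is an RFC 1123 timestamp; `api_version` is an assumed default.
    """
    key = base64.b64decode(master_key)
    # The signed payload: lower-cased verb, resource type and date,
    # the resource link as-is, and an empty final line.
    payload = (
        f"{verb.lower()}\n{resource_type.lower()}\n"
        f"{resource_link}\n{date.lower()}\n\n"
    )
    digest = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("ascii")
    # The whole token is URL-encoded before going into the header.
    token = quote(f"type=master&ver=1.0&sig={signature}", safe="")
    return {
        "Authorization": token,
        "x-ms-date": date,
        "x-ms-version": api_version,
    }


# Placeholder key for illustration only; substitute your emulator or account key.
DUMMY_KEY = base64.b64encode(b"not-a-real-key").decode("ascii")

# Creating a database targets the "dbs" resource type with an empty resource link.
headers = build_headers("POST", "dbs", "", "Mon, 25 Feb 2019 12:00:00 GMT", DUMMY_KEY)
print(sorted(headers))
```

Because the date is part of the signed payload, the token has to be regenerated for each request, which is the part that takes some scripting in a tool like Postman.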

Shared Database Privacy

Duncan Greaves has some thoughts about safeguarding privacy in shared databases:

The difficulty with privacy (or more correctly, information confidentiality) in database terms is that databases are supposed to maintain huge amounts of information, and processing and recording data is difficult, if not impossible, without them. Public bodies especially have difficulty in defining and maintaining the boundaries of information disclosure that they should provide, whilst maintaining the utility of the information for the improvement of welfare and services.

Privacy is contingent on first having a correctly secured database. Additional privacy controls are required when sensitive data leaves the protected trust perimeter of the database to be utilised by third parties.

Click through for more detail.

A Central Repository for Query Store

Tracy Boggiano shares work on centralizing Query Store results across a number of databases:

I’ve worked for SaaS companies for the last 6 years or so. So our queries are largely the same across our system and by default Query Store is per database. So it would be handy to have a central repository to help you determine which queries across your whole server are your worst performing queries. Hence comes my idea to build a central repository. I believe I put in a Connect item before it got moved to the new platform for this but never put in a new ticket. So this is the beginning of building something along those lines. So it will be a work in progress so to speak.

At my current company I care about queries that are taking a long time to run. So I’m going to store the top 50 queries in total duration into a database handily called DBA because that’s where I store all the DBA stuff. To do this, I have some non-client-related databases I don’t care about so I create a table to tell which databases to collect the data from. Then a table to put the information into and a job to run every day at midnight and sum up the data. Now the data is stored in UTC time so the data will be off by whatever timezone difference you are in but with most people being 24×7 shops as SaaS companies that shouldn’t matter and if it does you can edit the query.

This helps to ease a pain point in Query Store: all of that data is per-database, so if you have a federated system with a large number of equivalent databases, getting cross-system stats is painful.
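The roll-up idea can be sketched in a few lines of Python. This is only an illustration of the aggregation, not Tracy's T-SQL: the `top_queries` helper, the sample database names, and the assumption that equivalent queries share a hash across databases are all hypothetical.

```python
from collections import defaultdict


def top_queries(per_database_stats, limit=50):
    """Sum total duration per query hash across databases and rank descending.

    per_database_stats maps a database name to (query_hash, total_duration) rows.
    """
    totals = defaultdict(int)
    for rows in per_database_stats.values():
        for query_hash, total_duration in rows:
            totals[query_hash] += total_duration
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical per-database figures, as if read from each database's Query Store.
sample = {
    "ClientA": [("0xA1", 900), ("0xB2", 150)],
    "ClientB": [("0xA1", 600), ("0xC3", 400)],
}

ranking = top_queries(sample, limit=2)
print(ranking)
```

A query that is only moderately slow in each database can still top the server-wide list once its durations are added together, which is the view a per-database Query Store does not give you.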

The Anatomy of a Pester Test

Shane O’Neill takes us through using Pester to test self-contained scripts:

Where things differ…
…could be when you try to accommodate different people and create a .ps1 file that both defines and calls a function. Self Contained scripts, if you would call them that.
Normally the reason that I’ve heard for this is you’re trying to help a non-technical minded person and they just want a file that they can open, hit “run”, and everything is done for them.
Have you ever tried to Pester test those files though? It’s not recommended, especially if your function removes or modifies objects.

Click through for a solution and read Shane’s update as well for a scenario where it doesn’t quite work as hoped.

Generating Reference Numbers With Sequences

Kevin Feasel

2019-02-25

T-SQL

Matthew McGiffen shares one technique to generate reference numbers using a sequence and the FORMAT function:

One thing to note is that, while the sequence will generally produce unique numbers, it is still worth enforcing that in your table definition with a unique constraint, i.e.

ALTER TABLE dbo.Orders ADD CONSTRAINT UQ_Orders_OrderReference UNIQUE(OrderReference);

This prevents someone from issuing an UPDATE command that might create a duplicate reference. 

As long as you can live with the occasional gap in your reference number, sequences are a good solution to the problem.
