Real-Time Weather With HDF

Balaji Kandregula shows how to use Hortonworks DataFlow components to process weather events in real time:

It’s live weather reporting using HDF, Kafka, and Solr.

Here are the environment requirements for implementing:

  • HDF (for HDF 2.0, you need Java 1.8).
  • Kafka.
  • Spark.
  • Solr.
  • Banana.

Now let’s get on to the steps!

There are a lot of moving parts here, but the pieces plug together well enough, and plenty of screenshots guide you along the way.

Pipeline Architecture With Kafka

Alexandra Wang describes how Pandora Media uses Apache Kafka and Kafka Connect for real-time ad serving:

Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring exactly-once-delivery semantics. We found that the Kafka Connect API paired with the HDFS connector developed by Confluent would be perfect for our use case.

We’ve also found it painful not having a central authority on data structures that can share their respective schemas across all services and applications. Without a central registry for message schemas, data serialization and deserialization for a variety of applications are troublesome and the pipeline is fragile when schema evolution happens. We found Schema Registry is a great solution for this problem.

To address the above two problems, we integrated the Kafka Connect API and Schema Registry into our Kafka-centered data pipeline.

Well worth reading, especially for the difficulties they’ve had during maintenance periods and in lower environments.
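
To make the schema evolution point concrete, here is a minimal sketch in Python. It is a toy, not the Confluent Schema Registry API: the ToySchemaRegistry class, the read_with_schema helper, the REQUIRED marker, and the "ad-events" subject are all hypothetical names invented for illustration. A real registry works with Avro schemas and runs compatibility checks on registration; the sketch only shows why a central, versioned schema with defaults lets a reader on a newer schema still handle older messages.

```python
# Toy illustration only: these names are hypothetical, not a real library API.
REQUIRED = object()  # marker for fields that have no default value


class ToySchemaRegistry:
    """In-memory registry mapping a subject to its list of schema versions."""

    def __init__(self):
        self._subjects = {}

    def register(self, subject, schema):
        """Store a schema (field name -> default or REQUIRED); return its version."""
        versions = self._subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)  # versions start at 1

    def latest(self, subject):
        """Return the most recently registered schema for a subject."""
        return self._subjects[subject][-1]


def read_with_schema(record, schema):
    """Project a record onto a schema, filling defaults for missing fields."""
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {field}")
        else:
            out[field] = default
    return out


registry = ToySchemaRegistry()
registry.register("ad-events", {"ad_id": REQUIRED})
# Version 2 adds a field with a default, so records written under version 1 stay readable.
registry.register("ad-events", {"ad_id": REQUIRED, "region": "unknown"})

old_record = {"ad_id": 7}  # written before the region field existed
print(read_with_schema(old_record, registry.latest("ad-events")))
```

Adding a field without a default would break every consumer still reading old messages, which is the kind of fragility the quoted passage describes.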

Thinking About NULL

Duncan Greaves on NULL:

NULL exists because the following general conditions apply:

Existence – The attribute does not exist in the domain, or domain understanding is wrong. This means there is a missing entity in our domain model or entities are mixed in a table. E.g. a table contains hair colour for a car entity, or number of pregnancies for male patients.

Missing – The information has not been given at the time a row was created. E.g. A customer may decline to give their age.

Not Yet – Data is contingent upon an unknown event in the future. E.g. termination date or date of death.

Does not apply – Is not applicable for this instance of a record. E.g. hair colour for bald people.

Placeholders – Indicates that we know that a bit of data exists, but we don’t know what it is, in this case keeping a NULL is useful for CUBE or ROLLUP queries.

In real-world applications of data structures, NULLs are often unavoidable. However, NULL confuses users, and designers and DBAs (generally) hate it. It complicates Reporting, ETL, Business Intelligence and Data Science initiatives. As such, users need to be aware of the design and query compromises they need to use.

I think there’s significance in what NULL represents, but it’s a concept with its fair share of complexity. Read the whole thing.
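
As a small illustration of the query compromises mentioned above, here is a sketch using Python’s built-in sqlite3 module. The customer table and its rows are made up for the example, and SQLite is only standing in for whichever database you use; the comparison and aggregate behavior shown follows standard SQL three-valued logic.

```python
import sqlite3


def null_demo():
    """Run a few queries against a tiny table where one age is missing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [("Ann", 30), ("Bob", None), ("Cy", 50)],  # Bob declined to give his age
    )

    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    results = {
        # A comparison with NULL is never true, so this matches nothing.
        "equals_null": scalar("SELECT COUNT(*) FROM customer WHERE age = NULL"),
        # IS NULL is the correct test for a missing value.
        "is_null": scalar("SELECT COUNT(*) FROM customer WHERE age IS NULL"),
        # Bob is excluded here too: NULL <> 30 evaluates to unknown, not true.
        "not_30": scalar("SELECT COUNT(*) FROM customer WHERE age <> 30"),
        # COUNT(*) counts rows, while COUNT(age) and AVG(age) skip NULLs.
        "count_rows": scalar("SELECT COUNT(*) FROM customer"),
        "count_age": scalar("SELECT COUNT(age) FROM customer"),
        "avg_age": scalar("SELECT AVG(age) FROM customer"),
    }
    conn.close()
    return results


print(null_demo())
```

Note that the rows matching age = 30 and the rows matching age <> 30 together do not add up to the whole table, which is exactly the sort of surprise that trips up report writers.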

Taking Advantage Of Azure Elasticity

Arun Sirpal migrated a number of Azure SQL Databases into an elastic pool and configured a series of elastic jobs to support them:

I want to show you how I went from having multiple single SQL databases in Azure to a database elastic pool within a new dedicated SQL Server. Once set up, I create and use elastic jobs. This post is long but I am sure you will find it useful.


  • Create a new “logical” SQL Server.
  • Create a new elastic pool within this logical SQL Server.
  • Move the data from the old single SQL databases to the above elastic pool (couple of ways to do this but I used built-in backups).
  • Confirm application connection.
  • Decommission single SQL databases.
  • Create / setup an elastic job.
  • Check the controller database.

Definitely worth reading if you are looking at hosting multiple databases in Azure.

Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available. It definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.
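
The asynchronous send, optional callback, and “chain a get to the send call” ideas in the quote can be sketched with nothing but the Python standard library. This is a toy simulation, not a Kafka client: ToyLog and its methods are hypothetical names, and a single background thread stands in for the broker writing to its log.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyLog:
    """Hypothetical stand-in for a broker: an append-only list with one writer thread."""

    def __init__(self):
        self._records = []
        # A single worker keeps appends, and therefore offsets, in send order.
        self._writer = ThreadPoolExecutor(max_workers=1)

    def _append(self, message):
        self._records.append(message)
        return len(self._records) - 1  # the offset acts as the acknowledgement

    def send(self, message, callback=None):
        """Queue a message and return a future; optionally call back with the offset."""
        future = self._writer.submit(self._append, message)
        if callback is not None:
            future.add_done_callback(lambda f: callback(f.result()))
        return future

    def close(self):
        self._writer.shutdown(wait=True)


def demo():
    log = ToyLog()
    acked = []
    # Asynchronous: hand the message over and move on; the callback records the ack.
    log.send("first", callback=acked.append)
    # Forced synchronicity: block on the returned future until the write is acknowledged.
    offset = log.send("second").result(timeout=5)
    log.close()
    return offset, acked


print(demo())
```

Blocking on every send, as the second call does, caps throughput at one round trip per message, which is why the quote calls it cheating Kafka’s intention.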

Splitting A Small Database

Brent Ozar explains why he recommended a client break out a small database:

Listen, I can explain. Really.

We had a client with a 5GB database, and they wanted it to be highly available. The data powered their web site, and that site needed to be up and running in short order even if they lost the server – or an entire data center – or a region of servers.

The first challenge: they didn’t want to pay a lot for this muffler database. They didn’t have a full time DBA, and they only had licensing for a small SQL Server Standard Edition.

Read on for the full explanation. Given the constraints and expectations, it makes sense, and this is a good example of figuring out how expected future growth can change the bottom line for a DBA.

Azure Networking

Joshua Feierman explains how Azure networking works, particularly from a DBA’s viewpoint:

The connecting thread between an Azure virtual machine and a virtual network is a Virtual Network Interface Card, or VNic for short. These are resources that are separate and distinct from the virtual machine and network itself, which can be assigned to a given virtual machine.

If you go to the “All Resources” screen and sort by the “Type” column, you will find a number of network interface resources.

There’s good information in here on how virtual machines, network interfaces, and virtual networks fit together.

SQL Server On VMware Guide

David Klee announces an update to VMware’s SQL Server best practices guide:

I am proud to announce that we contributed to the latest revision of the Microsoft SQL Server on VMware best practices guide, freely available at this address. This document outlines some of the common VM-level tweaks and adjustments that are made when running enterprise SQL Server VMs on VMware platforms. This guide is considered a must-read if you manage these sorts of SQL Servers, which cannot be treated as general purpose virtual machines.

This guide was recently updated for vSphere 6.5, and we consider it an absolute must for your enterprise management library!

If you manage SQL Server instances on VMware, it’s definitely worth the read.

Entity Framework Slow, News At 11

Jovan Popovic shows that Entity Framework is slow and Dapper is fast:

To set up the test, you can go to the StackExchange/Dapper GitHub repository and download the source code. The tests are created as a C# solution (Dapper.sln). When you open this solution, you can find the Dapper.Tests project. You might need to change two things:

  1. Connection strings are hardcoded in the Tests.cs file with values like “Server=(local)\SQL2014;Database=tempdb;User ID=sa;Password=Password12!”. You might need to change this and put in your own connection info.
  2. The project is compiled using dotnet sdk 1.0.0-preview2-003121, so you might get compilation errors if you don’t have a matching framework. I have removed the line "sdk": { "version": "1.0.0-preview2-003121" } from global.json to fix this.

Now you will be able to build the project and run the tests.

Nothing’s going to be faster than hand-crafted, well-tuned statements from people who know what they’re doing. Micro-ORMs like Dapper and FSharp.Data.SqlClient take a small speed hit in exchange for developer niceties. Heavier frameworks like Entity Framework and NHibernate add a lot more, but tend to be significantly slower.

Supersized Tables

Deborah Melkin tells a story of a design battle she lost:

The programmers came to me and said we need to add a large number of columns to this table for one piece of functionality. It would more than double the total number of columns on the table. Oh, and all of the new columns would be NULL since we would only need to populate them if they were using that functionality and even then, not all of them would require data. The final result would be that 65-75% of the table would end up having nullable fields with the majority of those having NULL for the value.

I said what I think any sane DBA would say to this request: No.

Click through for the rest of the tale.


April 2017