Lambda architecture – developed by Nathan Marz – provides a clear set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability and recomputation into the system. Batch processes high volumes of data where a group of transactions is collected over a period of time. Data is collected, entered, processed and then batch results produced. Batch processing requires separate programs for input, process and output. An example is payroll and billing systems. In contrast, real-time data processing involves a continual input, process and output of data. Data must be processed in a small time period (or near real-time). Customer services and bank ATMs are examples.
Lambda architecture has three (3) layers:
I haven’t heard much about the Lambda and Kappa architectures lately, so when I saw this, I figured it was time for a refresher.
Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.
Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).
For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.
It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.
Do staging, data refinement and reporting all from SQL DW. You can scale compute power up when needed (i.e. during staging, data refinement, or large number of users doing reporting) or down to save costs (i.e. nights and weekends when user reporting is low). The pros of this option are by reducing the number of technologies you are building a simpler solution and reducing the number of copies of the data. The cons are since everything is done on SQL DW you can have performance issues (i.e. doing data refinement while users are reporting), can hit the SQL DW concurrent query limit, and can have a higher cost since SQL DW is the highest-cost product, especially if you are not able to pause it. Pausing it reduces your cost to zero for compute, only having to pay for storage (see Azure SQL Data Warehouse pricing), but no one can use SQL DW when paused
Click through for three other architecture ideas.
SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.
PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).
If you’re curious about Azure SQL Data Warehouse, this is a must-read.
Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn’t always a problem!
If I wanted to perform a count of records, grouped by a particular field, I can perform this on a round-robin table. Each distribution will run the query in parallel and return it’s grouped results. The results can be simply added together as a second part of the query, and adding together 60 smaller datasets shouldn’t be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
However, the issues arise when we have multiple tables in our query. In order to join two tables. Let’s take a very simple join between a fact table and a dimension. I’ve shown 6 distributions for simplicity, but this would be happening across all 60.
Figuring out which distribution key to use can make a huge difference in performance.
It’s live weather reporting using HDF, Kafka, and Solr.
Here are the environment requirements for implementing:
- HDF (for HDF 2.0, you need Java 1.8).
Now let’s get on to the steps!
There are a lot of moving parts there, but the pieces do plug in well enough and there are a lot of screen shots to guide you along the way.
Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring exactly-once-delivery semantics. We found that the Kafka Connect API paired with the HDFS connector developed by Confluent would be perfect for our use case.
We’ve also found it painful not having a central authority on data structures that can share their respective schemas across all services and applications. Without a central registry for message schemas, data serialization and deserialization for a variety of applications are troublesome and the pipeline is fragile when schema evolution happens. We found Schema Registry is a great solution for this problem.
To address the above two problems, we integrated the Kafka Connect API and Schema Registry into our Kafka-centered data pipeline.
Well worth reading, especially the difficulties that they’ve had during maintenance periods and in lower environments.
NULL exists because the following general conditions apply:
Existence –The attribute does not exist in the domain, or domain understanding is wrong. This means there is a missing entity in our domain model or entites are mixed in a table. E.g table contains hair colour for a car entity, Number of pregnancies for male patients.
Missing – The information has not been given at the time a row was created. E.g. A customer may decline to give their age.
Not Yet – Data is contingent upon an unknown event in the future, E.g. Termination date or Date of death.
Does not apply– Is not applicable for this instance of a record. E.g. Hair colour for bald people.
Placeholders – Indicates that we know that a bit of data exists, but we don’t know what it is, in this case keeping a NULL is useful for CUBE or ROLLUP queries.
In the real world applications of data structures NULLs are often unavoidable. However, it confuses users, designers and DBA’s (generally) hate it. It complicates Reporting, ETL, Business Intelligence and Data Science initiatives. As such, users need to be aware of the design and query compromises they need to use.
I think there’s significance in what NULL represents, but it’s a concept with its fair share of complexity. Read the whole thing.
I want to show you how I went from having multiple single SQL databases in Azure to a database elastic pool within a new dedicated SQL Server. Once setup I create and use elastic jobs. This post is long but I am sure you will find it useful.
Create a new “logical” SQL Server.
Create a new elastic pool within this logical SQL Server.
Move the data from the old single SQL databases to the above elastic pool (couple of ways to do this but I used built-in backups).
Confirm application connection.
Decommission single SQL databases.
Create / setup an elastic job.
Check the controller database.
Definitely worth reading if you are looking at hosting multiple databases in Azure.
Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.
Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.
I like this as a high-level overview of the different options available. Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.