Press "Enter" to skip to content

Category: Architecture

Lambda Architecture

Sebastiao Correia discusses Lambda architecture:

The batch layer stores all the data with no constraint on the schema. The schema-on-read is built into the batch views in the serving layer. Creating schema-on-read views requires algorithms to parse the data from the batch layer and convert it into a readable form. This allows input data to evolve freely, as there is no constraint on its structure. But then the algorithm that builds the view is responsible for managing the structural change in order to still deliver the same view as expected.

This shows a coupling between the data and the algorithms used to serve the data. Focusing on data quality is therefore not enough, and we should also ask about algorithm quality. As the system lives and evolves, the algorithms may become more and more complex. These algorithms must not be regarded as black boxes; a clear understanding of what they are doing is important if we want good data governance. Moreover, during batch view creation, data quality transformations could be applied so as to provide better-quality data to the consumers of the views.

Lambda is an interesting architectural concept, as it tries to solve the age-old “fast or accurate?” problem with “both”:  fast estimates streamed through the speed layer, and accurate but slow calculations computed in the batch layer and exposed through the serving layer.  Definitely check out this article.
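To make the layering a bit more concrete, here’s a minimal Python sketch of that serving pattern: an accurate-but-stale batch view merged at query time with the speed layer’s incremental counts.  The view names and numbers are purely illustrative, not from the article.

```python
# A minimal sketch of the Lambda serving pattern: the batch view holds accurate (but
# stale) totals recomputed from the master dataset, the speed layer counts events that
# arrived after the last batch run, and a query merges the two.
from collections import defaultdict

# Batch view: rebuilt from the full master dataset on each batch run (accurate, stale).
batch_view = {"page_a": 10_000, "page_b": 4_200}

# Speed layer: incremental counts for events since that batch run (fast, approximate).
realtime_view = defaultdict(int)

def on_event(page):
    """Speed layer: update the real-time view as each event streams in."""
    realtime_view[page] += 1

def query(page):
    """Serving side: merge the batch view with the real-time delta."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

on_event("page_a")
on_event("page_a")
print(query("page_a"))  # 10002: the batch total plus the streamed delta
```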


Data Warehouse Design Tips

Dustin Ryan has part one of a two-part series on data warehouse design best practices:

2. Store additive measures in the data warehouse.

The best type of measures to store in the data warehouse are those measures that can be fully aggregated. A measure that can be fully aggregated is a measure that can be summarized by any dimension or all dimensions and still remain meaningful. For instance, a Sales Amount measure can be summarized by Product, Date, Geography, etc. and still provide valuable insight for the customer.

Measures that cannot be fully aggregated, such as ratios or other percentage type calculations should be handled in the semantic model or the reporting tool. For example, a measure such as Percentage Profit Margin stored in a table cannot be properly aggregated. A better option would be to store the additive measures that are the base for the Percentage Profit Margin, such as Revenue, Cost, Margin, etc. These base measures can be used to calculate the ratio in a query, semantic model, or reporting tool.

The first five tips are non-controversial and act as a good baseline for understanding warehousing with SQL Server.  Do check it out.
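As a quick illustration of the ratio point (my example, not Dustin’s), here’s what happens if you treat Percentage Profit Margin as if it were additive instead of deriving it from the additive base measures:

```python
# Why Percentage Profit Margin is not additive: averaging a stored per-row ratio gives a
# misleading answer, while summing the additive base measures (Revenue, Cost) and then
# deriving the ratio gives the true figure.
rows = [
    {"product": "A", "revenue": 1000.0, "cost": 900.0},  # 10% margin on a big sale
    {"product": "B", "revenue": 10.0,   "cost": 5.0},    # 50% margin on a tiny sale
]

# Wrong: average the pre-computed per-row margins (treats the ratio as additive).
naive = sum((r["revenue"] - r["cost"]) / r["revenue"] for r in rows) / len(rows)

# Right: aggregate the base measures first, then compute the ratio.
total_revenue = sum(r["revenue"] for r in rows)
total_cost = sum(r["cost"] for r in rows)
correct = (total_revenue - total_cost) / total_revenue

print(f"naive average of margins:  {naive:.1%}")    # 30.0% -- misleading
print(f"margin from base measures: {correct:.1%}")  # 10.4% -- the real figure
```

The naive average weights a tiny order the same as a large one; aggregating Revenue and Cost first keeps the ratio honest at any grain.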


In-Memory OLTP Using Ignite

Babu Elumalai explains how to use Apache Ignite to build an in-memory OLTP system on top of Amazon’s DynamoDB:

Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.

In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.

In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.

There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
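If you want a feel for the windowed-statistics piece without standing up the whole pipeline, here’s a rough PySpark Streaming sketch of the idea.  The post reads from DynamoDB Streams on EMR and lands results in S3 for SparkSQL and Zeppelin; the socket source, paths, and window sizes below are stand-ins of my own.

```python
# Rough sketch of windowed statistics with Spark Streaming: count events per key over a
# sliding window and write each window out for later SQL access. The source, paths, and
# window sizes are assumptions; the original pipeline reads DynamoDB Streams and writes S3.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-stats-sketch")
ssc = StreamingContext(sc, batchDuration=10)    # 10-second micro-batches
ssc.checkpoint("/tmp/window-stats-checkpoint")  # required for windowed state

# Each input line is assumed to be a key, e.g. the store id on an incoming order.
events = ssc.socketTextStream("localhost", 9999)

# Count events per key over a 10-minute window that slides every minute.
windowed_counts = (
    events.map(lambda key: (key, 1))
          .reduceByKeyAndWindow(lambda a, b: a + b,   # add newly arrived counts
                                lambda a, b: a - b,   # remove counts leaving the window
                                windowDuration=600,
                                slideDuration=60)
)

# Persist each window; in the post this lands in S3 and is queried via SparkSQL in Zeppelin.
windowed_counts.foreachRDD(
    lambda time, rdd: rdd.saveAsTextFile("/tmp/window-stats/" + time.strftime("%Y%m%d-%H%M%S"))
)

ssc.start()
ssc.awaitTermination()
```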


Building A Prediction Engine

Richard Williamson explains how to build a prediction engine using technologies such as Spark, Kudu, Impala, and Kafka:

We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.

Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.

This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
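For a flavor of the streaming-regression part, here’s a hedged PySpark sketch of continuously training a linear model on a stream and emitting predictions as new micro-batches arrive.  Richard does this in Scala against the Meetup RSVP feed with Kafka, Kudu, and Impala in the mix; the queue-based input and the single lag feature here are simplifications of my own.

```python
# Sketch of streaming regression: keep updating a linear model as labeled batches arrive
# and predict on each batch. The queue-based input and single feature are stand-ins.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="streaming-regression-sketch")
ssc = StreamingContext(sc, batchDuration=60)  # one-minute micro-batches

# Stand-in stream: label = RSVPs in the next window, feature = RSVPs in the current window.
training_batches = [sc.parallelize([LabeledPoint(52.0, [48.0]), LabeledPoint(61.0, [55.0])])]
training_stream = ssc.queueStream(training_batches)

model = StreamingLinearRegressionWithSGD(stepSize=0.0001, numIterations=50)
model.setInitialWeights([0.0])

# Train on each labeled micro-batch, then predict on the same features.
model.trainOn(training_stream)
predictions = model.predictOnValues(training_stream.map(lambda lp: (lp.label, lp.features)))
predictions.pprint()  # prints (actual, predicted) pairs per micro-batch

ssc.start()
ssc.awaitTermination()
```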


Integrating Lambda With Relational Databases

Bob Strahan shows how to integrate AWS Lambda with relational databases running on EC2:

Here are a few reasons why you might find this capability useful:

  • Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.

  • Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.

  • Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.

  • Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.

His specific example is around Oracle/Postgres, but I’d imagine you could do the same on SQL Server with the CLR.
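Here’s a minimal sketch of the outbound-streaming case: a Python Lambda handler that takes the row payload a database trigger sends it and republishes it to an SNS topic.  The event shape and topic ARN are assumptions on my part; Bob’s post covers the trigger-side wiring for Oracle and PostgreSQL.

```python
# Minimal Lambda handler for the "outbound streaming" case: a database trigger invokes this
# function with details of a changed row, and the handler forwards them to SNS. The event
# fields and topic ARN below are illustrative assumptions.
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ.get("TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:table-changes")

def lambda_handler(event, context):
    """Forward a changed-row notification from a database trigger to an SNS topic."""
    message = {
        "table": event.get("table"),          # e.g. "dbo.Orders"
        "operation": event.get("operation"),  # e.g. "UPDATE"
        "key": event.get("key"),              # primary key of the changed row
    }
    response = sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"{message['table']} {message['operation']}",
        Message=json.dumps(message),
    )
    return {"message_id": response["MessageId"]}
```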


Business Logic

Ed Elliott takes on the classic architectural argument of whether business logic should live in stored procedures:

Stackoverflow is a specific use case and they decided to use .Net, so they have a specific set of problems to deal with in terms of performance. They deploy (as I understand it) 10 times a day, so if they need to change a query then they can do so quickly and easily – how quickly can you modify code and get it to production to fix a problem causing downtime on your mission-critical app written in PowerBuilder 20 years ago? (I jest, but you get the point.)

I like Ed’s back-and-forth argument, as there are legitimate cases for both sides, and for line-of-business apps the best answer is almost always somewhere in between.  I have three points that I tend to mention whenever this discussion comes up.

First, a lot of “business logic” is actually data logic.  Check constraints, foreign key constraints, unique key constraints, and even primary key constraints (for non-surrogate primary keys) are business rules, but they’re business rules around how the data is shaped, and it’s a lot better to use your database system to maintain those rules.

Second, validation rules should be everywhere.  The fancy JavaScript library should do validation, the server-side business logic should do validation, and the database should do validation.  You don’t know what’s going to skip one or more of these layers, and your database is the final gatekeeper preventing bad data from sneaking into your system.

Third, at the margin, go where your maintenance developers are most comfortable.  If they’re really good with C# but not good with SQL, the marginal business logic (the stuff you could really go either way on) should stay in the app tier; if your maintainers have really strong SQL skills but are lagging on the .NET side, I’d stick the marginal logic in stored procedures.


Unsharing The Database

Randy Shoup talks about scaling up through breaking out a shared database:

For an early- and mid-stage startup, a monolithic database is absolutely the appropriate architecture choice. With a small team and a small company, a single shared database made it simple to get started. Moving fast meant being able to make rapid changes across the entire system. A shared database made it very easy to join data between different tables, and it made transactions across multiple tables possible. These are pretty convenient.

As we have gotten larger, those benefits have become liabilities. It has become a single point of failure, where issues with the shared database can bring down nearly all of our applications. It has become a performance bottleneck, where long-running operations from one application can slow down others. Finally, and most importantly, the shared database has become a coupling point between teams, slowing down our ability to make changes.

I have my misgivings (as you’d expect from a database snob), particularly because I highly value the benefits of normalization and see sharded systems as a step backwards in that regard.  But even with that said, there are absolutely benefits to slicing out orthogonal sections of data; the point of disagreement is in those places where two teams’ entities and attributes overlap.


Architecting Semi-Structured Data Solutions

James Serra gives four architectural scenarios for handling large quantities of semi-structured data:

An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.

Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.

Check it out for a high-level architectural view of contemporary warehousing choices.  I prefer having both systems in play: the EDW answers known business questions and gives you back report information relatively quickly, whereas the Hadoop cluster lets you go spelunking, clean data, and answer unanticipated business questions.


The Importance Of Integration Testing

Michael Bourgon shows an example of why integration testing is important:

We are in process of doing a migration from an ancient creaky server to a shiny new VM.  Rather than just rebuild it and restore everything, we’re taking the (painful) opportunity to clean things up and improve several systems.

As part of this, we’re replicating data from the old server to the new server, so that we can migrate processes piecemeal, so that rollback is not “OH CRAP TURN IT OFF TURN IT OFF ROLL BACK TO THE OLD SERVER”.

But we ran into a weird problem.  On the target server, we had a many-to-many table that sits between, let’s say, stores and orders.  We have a stores table, we have an orders table, and this one (call it STORE_ORDERS for simplicity) is just a linking table between the two.  ID, stores_id, orders_id.  Everything scripted identically between the two databases (aside from the NOT FOR REPLICATION flag).

This is a case where action A works fine and action B works fine, but the combination of actions A and B leads to sadness.


Data Lakes

Jen Stirrup has a great primer on data lakes and factors to consider before you jump into the idea:

The organization will need to take a step back to understand their existing status better. Are they just starting out? Are there other departments doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.

As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.

I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.”  It’s the place where you promise management that of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes.  The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant.  It will also, by my estimate, be the more common version.
