The expansion of data sets and increased expectations of businesses for analysis and modeling of data has led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use–before somebody else puts them to sub-optimal use. I am joined by Kevin Feasel who walks us through some of the technologies available and sorts out under what circumstances we want to consider using each one.
Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent, was one of the first big data architects to espouse an alternative to the Lambda architecture, which he did with his 2014 O’Reilly story “Questioning the Lambda Architecture.” While Kreps appreciated some aspects of the Lambda architecture—in particular how it deals with reprocessing data—he stated that the downside was just too great.
“The Lambda architecture says I have to have Hadoop and I have to have Storm and I’m going to implement everything in both places and keep them in sync. “I think that’s extremely hard to do,” Kreps tells Datanami. “I think one of the biggest things hurting stream processing is the amount of complexity that you have to incur to build something. That makes it slow to build applications that way, hard to roll them out, and hard to make them reliable enough to be a key part of the business.
I wonder if we’re seeing the next generation of Kimball v Inmon here, or if one will absolutely dominate.
The batch layer stores all the data with no constraint on the schema. The schema-on-read is built in the batch views in the serving layer. Creating schema-on-read views requires algorithms to parse the data from the batch layer and convert them in a readable way. This allows input data to freely evolve as there is no constraint on their structure. But then, the algorithm that builds the view is responsible to manage the structural change in order to still deliver the same view as expected.
This shows a coupling between the data and the algorithms used for serving the data. Focusing on data quality is therefore not enough and we may ask the question of the algorithm quality. As the system lives and evolves, the algorithms may become more and more complex. These algorithms must not be regarded as black boxes, but a clear understanding of what they are doing is important if we want to have a good data governance. Moreover, during the batch view creation, data quality transformations could be done so as to provide data of better quality to the consumer of the views.
Lambda is an interesting architectural concept, as it tries to solve the age-old “fast or accurate?” problem with “both.” Get your fast estimates streamed through a speed layer, but your accurate, slow calculations handled through the serving layer. Definitely check out this article.
2. Store additive measures in the data warehouse.
The best type of measures to store in the data warehouse are those measures that can be fully aggregated. A measure that can be fully aggregated is a measure that can be summarized by any dimension or all dimensions and still remain meaningful. For instance, a Sales Amount measure can be summarized by Product, Date, Geography, etc. and still provide valuable insight for the customer.
Measures that cannot be fully aggregated, such as ratios or other percentage type calculations should be handled in the semantic model or the reporting tool. For example, a measure such as Percentage Profit Margin stored in a table cannot be properly aggregated. A better option would be to store the additive measures that are the base for the Percentage Profit Margin, such as Revenue, Cost, Margin, etc. These base measures can be used to calculate the ratio in a query, semantic model, or reporting tool.
The first five tips are non-controverisal and act as a good baseline for understanding warehousing with SQL Server. Do check it out.
Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.
In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.
In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.
There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.
Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.
This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
Here are a few reasons why you might find this capability useful:
Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.
Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.
Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.
Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.
His specific example is around Oracle/Postgres, but I’d imagine you could do the same on SQL Server with the CLR.
Stackoverflow is a specific use case and they decided to use .Net so they have a specific set of problems to deal with in terms of performance. They deploy (as I understand it) 10 times a day so if they need to change a query then they can quickly and easily – how quickly can you modify code and get it to production to fix a problem causing downtime on your mission critical app written in powerbuilder 20 years ago? (I jest but you get the point)
I like Ed’s back-and-forth arguing, as there are legitimate cases for both sides and the best answer almost always is somewhere in between for line of business apps. I have three points that I tend to mention whenever this discussion comes up.
First, a lot of “business logic” is actually data logic. Check constraints, foreign key constraints, unique key constraints, and even primary key constraints (for non-surrogate primary keys) are business rules, but they’re business rules around how the data is shaped and it’s a lot better to use your database system to maintain those rules.
Third, at the margin, go where your maintenance developers are most comfortable. If they’re really good with C# but not good with SQL, the marginal business logic (the stuff you could really go either way on) should stay in the app tier; if your maintainers have really strong SQL skills but are lagging on the .NET side, I’d stick the marginal logic in stored procedures.
For an early- and mid-stage startup, a monolithic database is absolutely the appropriate architecture choice. With a small team and a small company, a single shared database made it simple to get started. Moving fast meant being able to make rapid changes across the entire system. A shared database made it very easy to join data between different tables, and it made transactions across multiple tables possible. These are pretty convenient.
As we have gotten larger, those benefits have become liabilities. It has become a single point of failure, where issues with the shared database can bring down nearly all of our applications. It has become a performance bottleneck, where long-running operations from one application can slow down others. Finally, and most importantly, the shared database has become a coupling point between teams, slowing down our ability to make changes.
I have my misgivings (as you’d expect from a database snob), particularly because I value highly the benefits of normalization and see sharded systems as a step backwards in that regard. But even with that said, there are absolutely benefits to slicing out orthogonal sections of data; the point of disagreement is in those places in which two teams’ entities and attributes overlap.
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
Check it out for a high-level architectural view of contemporary warehousing choices. I prefer having both systems in play: the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.