
Category: Architecture

BigQuery Versus Redshift

Kiyoto Tamura compares Google’s BigQuery and Amazon’s Redshift for cloud-based warehousing:

Neither service is truly “set and forget” and requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain automation scripts and workflows.

That said, here are things that I’ve heard first-hand from talking to users

The bottom line there is that Redshift is a bit more mature than BigQuery today, but keep an eye on both of them.

Comments closed

Warehouse History

Kennie Pontoppidan delves into various aspects of collecting and storing history in warehouses:

In T2 history we have the two attributes ValidFromDate and ValidToDate. We can choose two different strategies for updating the values of these: using system time (load time) or business time. If we use system time for the T2 splits, the data warehouse history is dependent on when we load data. This makes it impossible to reload data in the data warehouse without messing up the data history. If we allow our load ETL procedures to use timestamps for business time (when data was really valid) for T2 history, we get the opportunity to reload data. But the cost of this flexibility is a much more complicated design for T2 splits. We also need to keep track of this metadata on the source system attributes.

Part of a warehouse’s value is its ability to replay historical data, but you can only do that if you store the data correctly (and query it correctly!).
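
As a rough sketch of the idea (the key, attribute, and date names below are purely illustrative, not Kennie’s design), a business-time T2 split closes the current row at the business-effective timestamp and appends the new version, so the split points come from when the data was really valid rather than from when a particular load happened to run:

```python
from datetime import datetime

# Illustrative only: a Type 2 split keyed on business time rather than load time.
# Each row carries ValidFromDate / ValidToDate, mirroring the attributes in the post.
OPEN_END = datetime(9999, 12, 31)

def t2_split(history, business_key, new_attributes, business_valid_from):
    """Close the current row at the business-time boundary and append the new version."""
    for row in history:
        if row["Key"] == business_key and row["ValidToDate"] == OPEN_END:
            row["ValidToDate"] = business_valid_from  # close out the old version
    history.append({
        "Key": business_key,
        **new_attributes,
        "ValidFromDate": business_valid_from,
        "ValidToDate": OPEN_END,
    })

# The split boundary is business time (when the data was really valid), not the time
# this load happened to run, which is the property that makes reloads feasible.
history = [{"Key": 42, "City": "Aarhus",
            "ValidFromDate": datetime(2015, 1, 1), "ValidToDate": OPEN_END}]
t2_split(history, 42, {"City": "Copenhagen"}, datetime(2016, 3, 1))
```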

Comments closed

SQL Data Partners Podcast: The Wide World Of Data

Carlos L. Chacon was nice enough to interview me on his podcast:

The expansion of data sets and increased expectations of businesses for analysis and modeling of data has led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use–before somebody else puts them to sub-optimal use.  I am joined by Kevin Feasel who walks us through some of the technologies available and sorts out under what circumstances we want to consider using each one.

Show notes are on the SQL Data Partners podcast site.  My presentation slides are available online.  And if I get just a few more people to dig Aphyr as much as I do, the world will be a better place.

Comments closed

Lambda And Kappa

Alex Woodie has a story on two competing data architectures:

Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent, was one of the first big data architects to espouse an alternative to the Lambda architecture, which he did with his 2014 O’Reilly story “Questioning the Lambda Architecture.” While Kreps appreciated some aspects of the Lambda architecture—in particular how it deals with reprocessing data—he stated that the downside was just too great.

“The Lambda architecture says I have to have Hadoop and I have to have Storm and I’m going to implement everything in both places and keep them in sync. I think that’s extremely hard to do,” Kreps tells Datanami. “I think one of the biggest things hurting stream processing is the amount of complexity that you have to incur to build something. That makes it slow to build applications that way, hard to roll them out, and hard to make them reliable enough to be a key part of the business.”

I wonder if we’re seeing the next generation of Kimball v Inmon here, or if one will absolutely dominate.

Comments closed

Lambda Architecture

Sebastiao Correia discusses Lambda architecture:

The batch layer stores all the data with no constraint on the schema. The schema-on-read is built in the batch views in the serving layer. Creating schema-on-read views requires algorithms to parse the data from the batch layer and convert them in a readable way. This allows input data to freely evolve as there is no constraint on their structure. But then, the algorithm that builds the view is responsible to manage the structural change in order to still deliver the same view as expected. 

This shows a coupling between the data and the algorithms used for serving the data. Focusing on data quality is therefore not enough and we may ask the question of the algorithm quality. As the system lives and evolves, the algorithms may become more and more complex. These algorithms must not be regarded as black boxes, but a clear understanding of what they are doing is important if we want to have a good data governance. Moreover, during the batch view creation, data quality transformations could be done so as to provide data of better quality to the consumer of the views.

Lambda is an interesting architectural concept, as it tries to solve the age-old “fast or accurate?” problem with “both.”  Get your fast estimates streamed through a speed layer, but your accurate, slow calculations handled through the serving layer.  Definitely check out this article.
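
To make the serving-layer side of that concrete, here is a minimal sketch (all names and numbers are invented) of merging the two layers: trust the batch views up to their high-water mark and fall back to the speed layer’s approximate figures for anything newer.

```python
def merge_views(batch_view, speed_view, batch_high_water_mark):
    """Combine batch and speed layers into one answer, illustrating the Lambda idea.

    batch_view and speed_view map an hourly bucket to an event count; the batch layer
    is authoritative up to its high-water mark, and the speed layer fills in whatever
    has arrived since the last batch run.
    """
    merged = dict(batch_view)
    for bucket, approx_count in speed_view.items():
        if bucket > batch_high_water_mark:   # only trust the speed layer for fresh data
            merged[bucket] = approx_count
    return merged

# Batch layer recomputed through hour 10; speed layer covers hours 10 through 12.
batch = {9: 1_204, 10: 1_310}
speed = {10: 1_297, 11: 1_350, 12: 980}
print(merge_views(batch, speed, batch_high_water_mark=10))
# {9: 1204, 10: 1310, 11: 1350, 12: 980}
```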

Comments closed

Data Warehouse Design Tips

Dustin Ryan has part one of a two-part series on data warehouse design best practices:

2. Store additive measures in the data warehouse.

The best type of measures to store in the data warehouse are those measures that can be fully aggregated. A measure that can be fully aggregated is a measure that can be summarized by any dimension or all dimensions and still remain meaningful. For instance, a Sales Amount measure can be summarized by Product, Date, Geography, etc. and still provide valuable insight for the customer.

Measures that cannot be fully aggregated, such as ratios or other percentage type calculations should be handled in the semantic model or the reporting tool. For example, a measure such as Percentage Profit Margin stored in a table cannot be properly aggregated. A better option would be to store the additive measures that are the base for the Percentage Profit Margin, such as Revenue, Cost, Margin, etc. These base measures can be used to calculate the ratio in a query, semantic model, or reporting tool.

The first five tips are non-controversial and act as a good baseline for understanding warehousing with SQL Server.  Do check it out.
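
To see why tip #2 matters, here is a small worked example with made-up numbers: averaging stored Percentage Profit Margin values gives a different answer than recomputing the ratio from the additive Revenue and Cost base measures.

```python
# Two products whose profit margins we might be tempted to store as percentages.
rows = [
    {"Product": "A", "Revenue": 1000.0, "Cost": 900.0},   # 10% margin
    {"Product": "B", "Revenue": 100.0,  "Cost": 50.0},    # 50% margin
]

# Wrong: aggregating the ratio itself (a simple average of the stored percentages).
avg_of_margins = sum((r["Revenue"] - r["Cost"]) / r["Revenue"] for r in rows) / len(rows)

# Right: aggregate the additive base measures, then compute the ratio in the query.
total_revenue = sum(r["Revenue"] for r in rows)
total_cost = sum(r["Cost"] for r in rows)
margin_from_bases = (total_revenue - total_cost) / total_revenue

print(f"average of stored margins: {avg_of_margins:.1%}")    # 30.0%
print(f"margin from base measures: {margin_from_bases:.1%}")  # 13.6%
```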

Comments closed

In-Memory OLTP Using Ignite

Babu Elumalai explains how to use Apache Ignite to build an in-memory OLTP system on top of Amazon’s DynamoDB:

Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.

In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.

In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.

There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
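
If you want a feel for the windowing step, here is a minimal PySpark Streaming sketch. It swaps in a socket source as a stand-in for the DynamoDB Streams plumbing (the fiddly part of the real pipeline), so treat the hostnames, durations, and line format as assumptions rather than anything from the article.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# A minimal sketch of the "time window statistics" step from the pipeline.
sc = SparkContext(appName="window-stats-sketch")
ssc = StreamingContext(sc, batchDuration=10)           # 10-second micro-batches
ssc.checkpoint("/tmp/window-stats-checkpoint")         # required for windowed state

events = ssc.socketTextStream("localhost", 9999)       # stand-in for the change stream

# Count events per key over a sliding 5-minute window, updated every 10 seconds.
# Lines are assumed to be comma-delimited with the key in the first field.
counts = (events
          .map(lambda line: (line.split(",")[0], 1))
          .reduceByKeyAndWindow(lambda a, b: a + b,
                                lambda a, b: a - b,
                                windowDuration=300,
                                slideDuration=10))

# In the article's pipeline the results land in S3 and get queried with SparkSQL;
# here we just print each micro-batch.
counts.pprint()

ssc.start()
ssc.awaitTermination()
```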

Comments closed

Building A Prediction Engine

Richard Williamson explains how to build a prediction engine using technologies such as Spark, Kudu, Impala, and Kafka:

We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.

Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.

This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
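
For a taste of the streaming-regression piece (this is not Richard’s code; the source, features, and parameters are all stand-ins), PySpark’s StreamingLinearRegressionWithSGD updates its weights as each micro-batch arrives and can predict against the same stream:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

# The input stream is assumed to deliver lines like "label,f1,f2", where the label is
# the volume we want to predict and the features describe the current window.
sc = SparkContext(appName="streaming-regression-sketch")
ssc = StreamingContext(sc, batchDuration=60)              # one-minute micro-batches

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

training = ssc.socketTextStream("localhost", 9999).map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.01, numIterations=50)
model.setInitialWeights(Vectors.dense([0.0, 0.0]))        # two features in this sketch

model.trainOn(training)                                    # weights update per batch
predictions = model.predictOnValues(
    training.map(lambda lp: (lp.label, lp.features)))      # (actual, predicted) pairs
predictions.pprint()

ssc.start()
ssc.awaitTermination()
```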

Comments closed

Integrating Lambda With Relational Databases

Bob Strahan shows how to integrate AWS Lambda with relational databases running on EC2:

Here are a few reasons why you might find this capability useful:

  • Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.

  • Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.

  • Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.

  • Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.

His specific example is around Oracle/Postgres, but I’d imagine you could do the same on SQL Server with the CLR.
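
For flavor, here is a minimal sketch of the Lambda side of the “outbound streaming” use case above; the event shape and the SNS topic are my assumptions rather than anything from Bob’s post.

```python
import json
import os

import boto3

# A database trigger invokes this Lambda function with details of the modified row;
# the function forwards the change to an SNS topic for downstream consumers.
sns = boto3.client("sns")
TOPIC_ARN = os.environ["NOTIFICATION_TOPIC_ARN"]   # assumed environment variable

def handler(event, context):
    """Receive a change event from the database trigger and publish it downstream."""
    message = {
        "table": event.get("table"),
        "operation": event.get("operation"),   # e.g. INSERT / UPDATE / DELETE
        "row": event.get("row"),
    }
    response = sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"{message['operation']} on {message['table']}",
        Message=json.dumps(message),
    )
    return {"messageId": response["MessageId"]}
```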

Comments closed

Business Logic

Ed Elliott hits a classic architectural argument—whether business logic should be in stored procedures:

Stackoverflow is a specific use case and they decided to use .Net so they have a specific set of problems to deal with in terms of performance. They deploy (as I understand it) 10 times a day so if they need to change a query then they can quickly and easily – how quickly can you modify code and get it to production to fix a problem causing downtime on your mission critical app written in powerbuilder 20 years ago? (I jest but you get the point)

I like Ed’s back-and-forth arguing, as there are legitimate cases for both sides and the best answer is almost always somewhere in between for line-of-business apps.  I have three points that I tend to mention whenever this discussion comes up.

First, a lot of “business logic” is actually data logic.  Check constraints, foreign key constraints, unique key constraints, and even primary key constraints (for non-surrogate primary keys) are business rules, but they’re business rules around how the data is shaped and it’s a lot better to use your database system to maintain those rules.

Second, validation rules should be everywhere.  The fancy JavaScript library should do validation, the server-side business logic should do validation, and the database should do validation.  You don’t know what’s going to skip one or more of these layers, and your database is the final gatekeeper preventing bad data from sneaking into your system.

Third, at the margin, go where your maintenance developers are most comfortable.  If they’re really good with C# but not good with SQL, the marginal business logic (the stuff you could really go either way on) should stay in the app tier; if your maintainers have really strong SQL skills but are lagging on the .NET side, I’d stick the marginal logic in stored procedures.
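
To illustrate the first two points, here is a tiny self-contained sketch; it uses SQLite only so it runs end to end, and the table and rule are invented.  The application tier validates for a friendlier error message, but the CHECK constraint stops bad data even when that validation gets skipped.

```python
import sqlite3

# "A lot of business logic is data logic": the CHECK constraint encodes a rule about
# the shape of the data, and the database enforces it even when an application layer
# forgets to. The same idea applies to check and foreign key constraints in SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        Customer  TEXT    NOT NULL,
        Quantity  INTEGER NOT NULL CHECK (Quantity > 0)   -- a data rule, not app code
    )
""")

def validate_order(quantity):
    """Application-tier validation: the same rule, enforced earlier for a nicer error."""
    if quantity <= 0:
        raise ValueError("Quantity must be positive")

# The normal path validates in the app tier first, then writes the row.
validate_order(10)
conn.execute("INSERT INTO Orders (Customer, Quantity) VALUES (?, ?)", ("Acme", 10))

# An import script or ad hoc fix might skip the app-tier validation, but the database
# is the final gatekeeper preventing bad data from sneaking in.
try:
    conn.execute("INSERT INTO Orders (Customer, Quantity) VALUES (?, ?)", ("Acme", -5))
except sqlite3.IntegrityError as err:
    print(f"rejected by the database: {err}")
```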

Comments closed