Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, both the data streaming, database, and BI layers have all evolved significantly since then, and now there exist databases and other data storage engines which can support the feature trinity that is needed to do both real-time and historical analytics right, without a Lambda architecture:
- Accept real-time streams of data at high rates.
- Simultaneously respond to large volumes of queries, including on the most recently added data.
- Store all the history needed for analysis.
We call these engines “fast data sinks” and there are four main groups of them today:
It’s an interesting argument.
While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pro’s and con’s of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
James covers some of the reasoning behind this and the shift from using Polybase to integrate data with Hadoop + Azure Blob Storage to using SQL Server as a data virtualization engine.
It starts off easy – you fill in your name, your street, your apartment number, and then you reach the field labelled “Post Town”. What is a “Post Town”? Huh. Next you see “County”. Well, you know what a county is, but since when do you list it in your address? Then there’s “Postal code”. You might recognize it as what the US calls a “zip code”, but it’s still confusing to see.
So now you don’t really know what to do, right? Where do you put your city, or your state? Do you just cram your address into the form however you can and hope that you get your order? Or do you abandon your cart and decide not to buy anything from this site?
This is more application-level code than data (and so glides over the underlying database work), but Danielle does link to some good resources for figuring out what non-US addresses look like.
One of the things I think is important in modeling your particular entity is including a primary key (PK). In my DevOps talk I stress this, as I’d rather most attendees come away thinking a PK is important as their first takeaway from the session. There are exceptions, but they are rare, and I would prefer that most tables just have some PK included from the beginning.
A PK ought to be stable as well, and there are plenty of written words about how to pick the PK for your particular problem domain. Often I have received the advice that natural keys are preferred over surrogate keys, and it is worth the effort to try and identify a suitable column (or set of columns) that will guarantee uniqueness. I think that’s good advice, and it’s also advice I tend to ignore.
Read on for Steve’s reasoning. I tend to use surrogate keys out of habit, though I do prefer to put unique key constraints on natural keys to help me reason through data models.
Given the shortcomings of monitoring and testing, we should shift focus to building observable systems. This means treating observability of system behaviour as a primary feature of the system being built, and integrating this feature into how we design, build, test, and maintain our systems. This also means acknowledging that the ease with which we can debug our production environment will be a key indicator of system reliability, scalability, and ultimately customer experience. Designing a system to be observable requires effort from three disciplines of software development: development, testing, and operations. None of these disciplines is more important than the others, and the sum of them is greater than the value of the individual parts. Let’s take some time to look at each discipline in more detail, with a focus on observability.
My struggle has never been with the concept, but rather with getting the implementation details right. “Make everything observable” is great until you run out of disk space because you’re logging everything.
I strongly urge anyone with an interest in learning Kubernetes to watch this presentation, as it makes a great job of explaining Kubernetes from the ground up.
“Out of the box”, Kubernetes will look after scheduling. If there is a requirement to ensure that pods only run on a specific set of nodes, there is a means of doing this via label selectors, as documented here. A label selector is a directive of influencing resource utilization related decisions made by the cluster.
Read the whole thing.
Kubernetes is an open source container-orchestration system for automating deployments, scaling and management of containerized applications. In this tutorial, you will learn how to get started with Microservices on Kubernetes. I will cover the below topics in details —
How does Kubernetes help to build scalable Microservices?
Overview of Kubernetes Architecture
Create a Local Development Environment for Kubernetes using Minikube
Create a Kubernetes Cluster and deploy your Microservices on Kubernetes
Automate your Kubernetes Environment Setup
CI/CD Pipeline for deploying containerized application to Kubernetes
This sticks mostly to a high-level architecture discussion, and does a good job at that.
Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.
This is an architectural overview and a good read.
Go and search through it for a NoSQL data management system. I’ll wait.
You found one that was named NoSQL didn’t you. Oracle NoSQL, because Oracle. Of course. However, under the Database Model what did it say? Document Store. Why?
BECAUSE NOSQL IS NOT A DATA STORAGE ENGINE!
There is not a NoSQL thing to use. You can’t compare NoSQL to anything because NoSQL is just a different screed, a rant, a tantrum. A bunch of influential developers threw themselves on the floor, screaming and kicking, spittle flying everywhere, because they didn’t want to eat their broccoli.
Wait, that was my kids when they were three.
As the voice of reason (and you know you’re in trouble when that’s the case!), non-relational engines take up 6 of the top 15 slots. There are particular advantages to non-relational data stores—as Grant willingly notes—and as companies grow, specialized data storage can become quite useful. But relational databases are a great starting point and for good reason: they solve a subset of problems extremely well and solve most problems well enough. This is probably also a good place to drop in a reference to Feasel’s Law.
There are disadvantages to the layered architecture approach. And some aspects of it have been deemed not architecturally pure for domain-driven design.
However, there are disadvantages to most approaches. The key is to understand both the advantages and disadvantages. Pick what architectural patterns work best for your particular problem.
Here are some factors you want to keep in mind.
The layered architecture is very database-centric. As mentioned before, since everything flows down, it’s often the database that’s the last layer. Critics of this architecture point out that your application isn’t about storing data; it’s about solving a business problem. However, when so many applications are simple CRUD apps, maybe the database is more than just a secondary player.
Scaleability can be difficult with a layered architecture. This is tied to the fact that many layered applications tend to take on monolithic properties. If you need to scale your app, you have to scale the whole app! However, that doesn’t mean your layered application has to be a monolith. Once it becomes large enough, it’s time to split it out—just like you would with any other architecture.
A layered application is harder to evolve, as changes in requirements will often touch all layers.
A layered architecture is deployed as a whole. That’s even if it’s modular and separated into good components and namespaces. But that might not be a bad thing. Unless you have separate teams working on different parts of the application, deploying all at once isn’t the worst thing you can do.