With the announcement of SQL Server 2019 big data clusters at Ignite, Kubernetes (often abbreviated to K8s) now stands front and center as part of Microsoft’s data platform vision. The obvious inference being that this is something that the Microsoft data platform community is going to show an increased interest in. The post aims to provide some context around:
why container orchestration is required
how Kubernetes is architected
the basics of working with Kubernetes
and why embracing open source software should be approached in an eyes wide open manner
Kubernetes is another technology which is useful to learn and can be helpful down the line.
Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, both the data streaming, database, and BI layers have all evolved significantly since then, and now there exist databases and other data storage engines which can support the feature trinity that is needed to do both real-time and historical analytics right, without a Lambda architecture:
- Accept real-time streams of data at high rates.
- Simultaneously respond to large volumes of queries, including on the most recently added data.
- Store all the history needed for analysis.
We call these engines “fast data sinks” and there are four main groups of them today:
It’s an interesting argument.
While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pro’s and con’s of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
James covers some of the reasoning behind this and the shift from using Polybase to integrate data with Hadoop + Azure Blob Storage to using SQL Server as a data virtualization engine.
It starts off easy – you fill in your name, your street, your apartment number, and then you reach the field labelled “Post Town”. What is a “Post Town”? Huh. Next you see “County”. Well, you know what a county is, but since when do you list it in your address? Then there’s “Postal code”. You might recognize it as what the US calls a “zip code”, but it’s still confusing to see.
So now you don’t really know what to do, right? Where do you put your city, or your state? Do you just cram your address into the form however you can and hope that you get your order? Or do you abandon your cart and decide not to buy anything from this site?
This is more application-level code than data (and so glides over the underlying database work), but Danielle does link to some good resources for figuring out what non-US addresses look like.
One of the things I think is important in modeling your particular entity is including a primary key (PK). In my DevOps talk I stress this, as I’d rather most attendees come away thinking a PK is important as their first takeaway from the session. There are exceptions, but they are rare, and I would prefer that most tables just have some PK included from the beginning.
A PK ought to be stable as well, and there are plenty of written words about how to pick the PK for your particular problem domain. Often I have received the advice that natural keys are preferred over surrogate keys, and it is worth the effort to try and identify a suitable column (or set of columns) that will guarantee uniqueness. I think that’s good advice, and it’s also advice I tend to ignore.
Read on for Steve’s reasoning. I tend to use surrogate keys out of habit, though I do prefer to put unique key constraints on natural keys to help me reason through data models.
Given the shortcomings of monitoring and testing, we should shift focus to building observable systems. This means treating observability of system behaviour as a primary feature of the system being built, and integrating this feature into how we design, build, test, and maintain our systems. This also means acknowledging that the ease with which we can debug our production environment will be a key indicator of system reliability, scalability, and ultimately customer experience. Designing a system to be observable requires effort from three disciplines of software development: development, testing, and operations. None of these disciplines is more important than the others, and the sum of them is greater than the value of the individual parts. Let’s take some time to look at each discipline in more detail, with a focus on observability.
My struggle has never been with the concept, but rather with getting the implementation details right. “Make everything observable” is great until you run out of disk space because you’re logging everything.
I strongly urge anyone with an interest in learning Kubernetes to watch this presentation, as it makes a great job of explaining Kubernetes from the ground up.
“Out of the box”, Kubernetes will look after scheduling. If there is a requirement to ensure that pods only run on a specific set of nodes, there is a means of doing this via label selectors, as documented here. A label selector is a directive of influencing resource utilization related decisions made by the cluster.
Read the whole thing.
Kubernetes is an open source container-orchestration system for automating deployments, scaling and management of containerized applications. In this tutorial, you will learn how to get started with Microservices on Kubernetes. I will cover the below topics in details —
How does Kubernetes help to build scalable Microservices?
Overview of Kubernetes Architecture
Create a Local Development Environment for Kubernetes using Minikube
Create a Kubernetes Cluster and deploy your Microservices on Kubernetes
Automate your Kubernetes Environment Setup
CI/CD Pipeline for deploying containerized application to Kubernetes
This sticks mostly to a high-level architecture discussion, and does a good job at that.
Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.
This is an architectural overview and a good read.
Go and search through it for a NoSQL data management system. I’ll wait.
You found one that was named NoSQL didn’t you. Oracle NoSQL, because Oracle. Of course. However, under the Database Model what did it say? Document Store. Why?
BECAUSE NOSQL IS NOT A DATA STORAGE ENGINE!
There is not a NoSQL thing to use. You can’t compare NoSQL to anything because NoSQL is just a different screed, a rant, a tantrum. A bunch of influential developers threw themselves on the floor, screaming and kicking, spittle flying everywhere, because they didn’t want to eat their broccoli.
Wait, that was my kids when they were three.
As the voice of reason (and you know you’re in trouble when that’s the case!), non-relational engines take up 6 of the top 15 slots. There are particular advantages to non-relational data stores—as Grant willingly notes—and as companies grow, specialized data storage can become quite useful. But relational databases are a great starting point and for good reason: they solve a subset of problems extremely well and solve most problems well enough. This is probably also a good place to drop in a reference to Feasel’s Law.