If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part
The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.
I can’t begin to tell you how many terrible things you can avoid by starting your apps out using an optimistic isolation level. Read queries and write queries can magically exist together, at the expense of some tempdb.
Yes, that means you can’t leave transactions open for a very long time, but hey, you shouldn’t do that anyway.
Yes, that means you’ll suffer a bit more if you perform large modifications, but you should be batching them anyway.
Optimistic concurrency is huge—definitely worth the top slot in Erik’s list.
The major problem of the Lambda architecture is that we have to build two separate pipelines, which can be very complex, and, ultimately, difficult to combine the processing of batch and real-time data, however, it is now possible to overcome such limitation if we have the possibility to change our approach.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system. The core abstraction of Databricks Delta is an optimized Spark table that stores data as parquet files in DBFS and maintains a transaction log that tracks changes to the table.
It’s an interesting contrast and I recommend reading the whole thing.
Leona Zhang has a series going on Apache Kafka. Part one covers some of the concepts around messaging systems:
There is a difference between batch processing applications and stream processing applications. A visible boundary determines the most significant difference between batch processing and stream processing. If it exists, it is called batch processing. For example, a client collects the data once every hour, sends this data to the server for statistics, and then saves the statistical results in the statistical database.
If the boundary doesn’t exist, the processing is called streaming data (stream processing). Here is an example of stream processing: logs and orders are generated continuously on a large website just like a data flow. If the processing of each log and order takes less than several hundred milliseconds or several seconds after its generation, the application is called a stream application. If the collection of logs and orders happens once every hour followed by a unified transmission, the original stream data converts into batch data.
Occasionally, stream processing becomes imperative. For example, Jack Ma wanted to display the orders and sales on Tmall for November 11 on a large screen. If the data center works in a T+1 mode and can obtain data for November 11 on November 12, Jack Ma would not be happy.
Kafka uses the group concept to integrate the producer/consumer and publisher/subscriber models.
One topic may have multiple groups, and one group may include multiple consumers. Only one consumer in the group can consume one message. For different groups, consumers are in the publisher/subscriber model. All groups receive one message.
Note: Allocate one partition to only one consumer in the same group. If there are three partitions and four consumers in one of the groups, one consumer is redundant and cannot receive any data.
This looks to be the start to a good series.
Google is doing more SQL, or at least shifting towards relational SQL databases as a way of storing data. At least, some of their engineers see this as a better way to store data for many problems. Since I’m a relational database advocate, I found this to be interesting.
When Google first started to publish information on BigTable and other new ways of dealing with large amounts of data, I felt that these weren’t solutions I’d use or problems that many people had. The idea of Map Reduce is interesting and certainly applicable to the problem space Google had of a global database of sites, but that’s not a problem I’ve ever encountered. Instead, most of the struggles I’ve had with relational systems are still better addressed in a relational system.
Read the whole thing. Note that this is slightly different than Feasel’s Law, as Steve is focusing more on the consistency side of things rather than the interface.
Also, just going to leave this here:
Stop me if you’ve heard this one before.
The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them. But some cracks are beginning to show, and frankly you’re a little worried. You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools. This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.
But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools. Shit!!! Your ops team is in full revolt and you can’t really blame them. It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.
I hear a version of this everywhere I’ve gone for the past year or two. It’s crazy how often. I’ve been meaning to write my answer up for ages, and here it (finally) is.
I like the solution: embrace the sprawl but make the default a stable set of well-supported systems with reasons for people to want to start there. Read the whole thing.
…but evidently some people think of other things. Those people are obviously wrong, but let’s continue. When I’m not a walking zombie after reading a 10,000 word blog post some idiot wrote, I see monitoring as the process of keeping an eye on your stuff, like a security guard sitting at a desk full of cameras somewhere. Sometimes they fall asleep–that’s monitoring going down. Sometimes they’re distracted with a doughnut delivery–that’s an upgrade outage. Sometimes the camera is on a loop–I don’t know where I was going with that one, but someone’s probably robbing you. And then you have the fire alarm. You don’t need a human to trigger that. The same applies when a door gets opened, maybe that’s wired to a siren. Or maybe it’s not. Or maybe the siren broke in 1984.
I know what you’re thinking: Nick, what the hell? My point is only that monitoring any application isn’t that much different from monitoring anything else. Some things you can automate. Some things you can’t. Some things have thresholds for which alarms are valid. Sometimes you’ll get those thresholds wrong (especially on holidays). And sometimes, when setting up further automation isn’t quite worth it, you just make using human eyes easier.
This is a really good post covering monitoring techniques at a high level and getting into specific implementations at Stack Overflow.
Kafka applications are event based, and leverage stream processing to continuously process input data. If you’re using Kafka, then you can embed an analytic model natively in a Kafka Streams or KSQLapplication. There are various examples of Kafka Streams microservices embedding models built with TensorFlow, H2O or Deeplearning4j natively.
It is not always possible or feasible to embed analytic models directly due to architectural, security or organizational reasons. You can also choose to use RPC to perform model inference from your Kafka application (bearing in mind the the pros and cons discussed above). You can visit my project for an example of gRPC integration between a Kafka Streams microservice and locally hosted TensorFlow Serving container for making predictions with a hosted TensorFlow model.
There are a couple separate and interesting patterns here.
With the announcement of SQL Server 2019 big data clusters at Ignite, Kubernetes (often abbreviated to K8s) now stands front and center as part of Microsoft’s data platform vision. The obvious inference being that this is something that the Microsoft data platform community is going to show an increased interest in. The post aims to provide some context around:
why container orchestration is required
how Kubernetes is architected
the basics of working with Kubernetes
and why embracing open source software should be approached in an eyes wide open manner
Kubernetes is another technology which is useful to learn and can be helpful down the line.
Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, both the data streaming, database, and BI layers have all evolved significantly since then, and now there exist databases and other data storage engines which can support the feature trinity that is needed to do both real-time and historical analytics right, without a Lambda architecture:
- Accept real-time streams of data at high rates.
- Simultaneously respond to large volumes of queries, including on the most recently added data.
- Store all the history needed for analysis.
We call these engines “fast data sinks” and there are four main groups of them today:
It’s an interesting argument.