A Review Of AWS Managed Kafka Public Preview

Stephane Maarek is not impressed with AWS’s managed Kafka offering so far:

For me, the more people use Apache Kafka, the more business I get. As I teach Apache Kafka online on Udemy (links at https://kafka-tutorials.com/), the prospect of having an entire user base from AWS wanting to learn Apache Kafka is exciting! And as an Apache Kafka consultant, it’s always more fun to spend time deploying data pipelines than deploying infrastructure.

Unfortunately what AWS released today misses the mark. I think it’s reminiscent of managed services of open source software in AWS overall: they’re released early and lack features that I think should be MVP. In my opinion this will deter future users.

Based on Stephane’s reading, this is a product which should have sat in development for another 3-6 months to flesh out the features, upgrade the version of Kafka used, etc.  Definitely read this before jumping on AWS MSK.

Load Testing Spark To MongoDB

Abdelghani Tassi has a quick load test to see how fast Spark can load data into MongoDB:

Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.

We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!

26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.

I’m not sure the test was totally fair, but the results comport to my biases…  There is some good advice here:  storing data in optimized formats (Parquet in this instance) can make a big difference, Spark is useful for ETL style operations, and Scala is generally the fastest language in the Spark world.

Hive And Spark Integrated Together

Bikas Saha and Saumitra Buragohain share some of the direction the Apache Hive team is going in version 3:

The latest release of Apache Hive 3 (part of HDP 3) provides significant new capabilities including ACID support for data ingest. This functionality has many applications, a crucial one being privacy support for data modifications and deletions for GDPR. In addition, ACID also significantly reduces the time to ingest for data, thereby improving data freshness for Hive queries. To provide these features, Hive needs to take full control of the files that store the table data and thus this data is no longer directly accessible by third party systems like Apache Spark. Thus Apache Spark’s built-in support for Hive table data is no longer supported for data managed by Hive 3.

At the same time, Apache Spark has become the de-facto standard for a wide variety complex processing use cases on Big Data. This includes data stored in Hive 3 tables and thus we need a way to provide efficient, high-performance, ACID compliant access to Hive 3 table data from Spark. Fortunately, Apache Spark supports a pluggable approach for various data sources and Apache Hive itself can also be considered as one data source. We have implemented the Hive Warehouse Connector (HWC) as library to provide first class support for Spark to read Hive 3 data for subsequent complex processing (like machine learning) in Spark.

Spark is also commonly used to ETL raw data into Hive tables and this scenario should continue to be supported in the Hive ACID world. To do that, HWC integrates with the latest Hive Streaming APIs to support ingest into Hive both from batch jobs as well as structured streaming jobs.

Overall the Hive Warehouse connector provide efficient read write access to Hive warehouse data from Spark jobs, while providing transparent user identity propagation and maintaining consistent security and access control.

Spark has had some dependencies on Hive (or at least expectations of certain Hive conventions like /tmp/hive existing), but the two systems have historically been more (friendly) competitors than tools integrated in the same chain.

Dealing With System Sprawl

Charity Majors has a simple (but not easy) solution to system sprawl:

Stop me if you’ve heard this one before.

The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them.  But some cracks are beginning to show, and frankly you’re a little worried.  You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools.  This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.

But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools.  Shit!!!  Your ops team is in full revolt and you can’t really blame them.  It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.

I hear a version of this everywhere I’ve gone for the past year or two.  It’s crazy how often.  I’ve been meaning to write my answer up for ages, and here it (finally) is.

I like the solution:  embrace the sprawl but make the default a stable set of well-supported systems with reasons for people to want to start there.  Read the whole thing.

Understanding Recursive CTEs Via Execution Plans

Hugo Kornelis shows us how SQL Server’s database engine implements recursive common table expressions:

I am pretty sure that (almost) everyone reading this blog knows that a CTE (Common Table Expression) is an independent subquery that can be named and then referenced (multiple times if needed) in the main query. This makes CTEs an invaluable tool to increase the readability of complex queries. Almost everything we can do with a CTE can equally well be done by using subqueries directly in the query but at the cost of losing readability.

However, there is also one feature that is actually unique to CTEs: recursion. This blog post is not about what this feature is, what it does, or how to write the code. A brief description is given as part of the complete explanation of CTEs in Books Online, and many bloggers have written about recursive CTEs as well.

For this post, I’ll assume the reader knows what a recursive CTE is, and jump into the execution plan to learn how this recursion is actually implemented.

This is (as usual) a great article, and helps explain why recursive CTEs can be slow.

Advent Of Code Day One, In T-SQL Form

Wayne Sheffield walks through several useful techniques for T-SQL developers:

And here we have a T-SQL solution for Day 1 of the Advent of Code challenge. The key tasks that we can learn from today are:

  • Loading a file.

  • Split a string on a delimiter.

  • Including additional rows into a result set (adding the first zero with a UNION ALL).

  • Multiplying (duplicating) a result set multiple times.

  • Performing a running total calculation.

  • Assigning a sequential number to a set of rows in a specific order.

  • Use of the GROUP BY and HAVING clauses while performing an aggregation.

Read the whole thing.

Rounding Errors And Data Type Conversions In DAX

Marco Russo explains the rules behind data type conversions in DAX:

Any DAX formula involving arithmetical operators ( + – * / ) might produce a result in a different data type. While this is obvious when you have different data types in the arguments, it could be less intuitive when the arguments have the same data type. Indeed, the result might have a different data type. This is important. Indeed, in a complex expression there could be many operators, but every operator defines a single expression that produces a new data type – that is the argument of the next operator. We will start looking at the resulting data type of the standard operators, showing a few examples later of how they could affect the result in a more complex expression.

Marco shows some relatively drastic differences:  hundreds of dollars when dealing with millions (and any company okay with being off by hundreds of dollars when dealing with millions, please mail me a check for hundreds of dollars).

Technologies Surrounding Big Data Clusters In SQL Server 2019

Buck Woody has some long-term homework for people:

Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.

I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.

Click through for a list of some of the technologies in play.

Working With Firewall Rules From Azure SQL Database

Arun Sirpal shows us how we can use T-SQL to set and modify firewall rules within Azure SQL Database:

For this post I want to actually show you the TSQL code to do this, hopefully it will become a good reference point for the future. Before we step into the code lets understand the differences between database level and server level rules.

For server level rules they enable access your entire Azure SQL server, that is, all the databases within the same logical server. These rules are stored in the master database. Database level rules enable access to certain databases (yes you could also run this within master) within the same logical server, think of this as you being more granular with the access where they are created within the user database in question.

Personally, I try and always use database level rules, this is especially true when I work with failover groups.

Click through for instructions on how to work with both server and database level rules.


December 2018
« Nov