In this blog, we’ll move one step forward to get an understanding of the Dual streaming model to see what abstractions does KSQL use to process the data.
All the data that we are working on with KSQL is produced to Kafka topics by some client. This client can be any Application, Kafka connectors etc., which produces continuous never-ending data to the topics.
KSQL does not directly interact with these topics, it rather introduces a couple of abstractions in between to process the data, which are known as Streams and Tables.
Read on to learn what these are and why it’s useful to think in these terms.
A classic interview question is: “How do you go about displaying large amounts of data in a performant way?” Most people (at least on the front end), usually come up with pagination first. An implementation for pagination might go something like this:
Out of a list of 100, request 10 items at a time until 100 items are reached. So you would do 9 requests, asking for 1–10, 11–20, etc., until the 100 are reached.
In Kafka’s case, there could be 1 million messages between successive requests, so a user can never see the “latest” message, only the range as requested by the browser. In addition, there is a fundamental problem with pagination as it relates to Kafka. Message ordering across partitions is non-deterministic, so what is displayed in the UI, a linear sequence from 1–100, would not represent the data as it is laid out inside of Kafka.
Very interesting reading.
I couldn’t find numbers from Google but others say that by 2017 Google processed over 20PB a day (not to mention answering 40K search queries/second) so Google is definitely in the big data game. The numbers go down fast after that, even for companies who are really big data companies — Facebook presented back in 2017 that they handle 500TB+ of new data daily, the whole of Twitter’s data as of May 2018 was around 300PB, and Uber reported their data warehouse is in the 100+ PB range.
Ok, but what about the rest of us? Let’s take a look at an example.
I often fight with this myself—SQL Server can easily handle multi-billion row data sets, for example. It’s the same problem in Azure with SQL Data Warehouse: the “you must be this tall to ride the rides” marker is set pretty high.
I would not mind quoting the Druid documentation for this purpose: “Druid is a data store designed for high-performance slice-and-dice analytics (“OLAP“-style) on large data sets. Druid is most often used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations.”
You might be wondering where is “SQL” in that? Actually, the fact is Druid is designed for special kind of SQL workloads which we can relate with powering the GUI analytical applications which require low latency query response. But in this post, we will only look in the “how part” of it using Druid to quickly run queries.
Click through to see how.
To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own-managed VNETs. Such workspaces could be deployed using Azure Portal, or in an automated fashion using ARM Templates, which could be run using Azure CLI, Azure Powershell, Azure Python SDK, etc.
With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane.
This is a good article if the defaults won’t get past corporate security.
1. Cluster Size (N): Number of nodes/brokers in the Kafka cluster, we should have 2x+1, i.e. at least 3 nodes or more in an odd number.
2. Partitions: We write/publish data/event into a topic which is divided into partitions (by default 1), but we should have M times N, where M can be any integer number, i.e. M >= 1, to achieve more parallelism and partitioning of data over the cluster.
3.Replication Factor: determines the number of copies (including the original/Leader) of each partition in the cluster. All replicas of a partition exist on separate node/broker, and we should never have R.F. > N, but at least 3.
We recommend having 3 RF with 3 or 5 nodes cluster. This helps in having both availabilities as well as consistency.
Click through for several more tradeoff points.
There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included in here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.
You an click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.
Apache Druid is a distributed, high-performance columnar store for real-time analytics on a large dataset. Druid core design combines the OLAP analytics, time series database and search system to create a single operational analysis. Druid is most suitable for data with high cardinality column or queries having higher aggregation or group by.
Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.
This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is for benchmarking purposes, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.
The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to setup, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.
Most of the instructions should follow through to work with on-prem or non-HDInsight Hadoop clusters, though there will be some changes to accommodate differences in HDInsight.
The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.
To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.
This isn’t quite a step-by-step guide but does spur on ideas.