Press "Enter" to skip to content

Category: Hadoop

Big Data Often Isn’t

Arnon Rotem-gal-oz argues that “big data” is often a misnomer:

I couldn’t find numbers from Google but others say that by 2017 Google processed over 20PB a day (not to mention answering 40K search queries/second) so Google is definitely in the big data game. The numbers go down fast after that, even for companies who are really big data companies — Facebook presented back in 2017 that they handle 500TB+ of new data daily, the whole of Twitter’s data as of May 2018 was around 300PB, and Uber reported their data warehouse is in the 100+ PB range.

Ok, but what about the rest of us? Let’s take a look at an example.

I often fight with this myself—SQL Server can easily handle multi-billion row data sets, for example. It’s the same problem in Azure with SQL Data Warehouse: the “you must be this tall to ride the rides” marker is set pretty high.

Comments closed

Querying Apache Druid

Manish Mishra takes us through the basics of querying from Apache Druid:

I would not mind quoting the Druid documentation for this purpose:  “Druid is a data store designed for high-performance slice-and-dice analytics (“OLAP“-style) on large data sets. Druid is most often used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations.”

You might be wondering where is “SQL” in that? Actually, the fact is Druid is designed for special kind of SQL workloads which we can relate with powering the GUI analytical applications which require low latency query response. But in this post, we will only look in the “how part” of it using Druid to quickly run queries.

Click through to see how.

Comments closed

Deploying Azure Databricks in a Custom VNET

Abhinav Garg and Anna Shrestinian explain how you can use VNET injection with Azure Databricks:

To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own-managed VNETs. Such workspaces could be deployed using Azure Portal, or in an automated fashion using ARM Templates, which could be run using Azure CLI, Azure Powershell, Azure Python SDK, etc.

With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane. 

This is a good article if the defaults won’t get past corporate security.

Comments closed

Consistency Versus Availability with Kafka

Sourabh Verma lists some of the areas where you can make a conscious tradeoff between consistency and availability with Apache Kafka:

1. Cluster Size (N): Number of nodes/brokers in the Kafka cluster, we should have 2x+1, i.e. at least 3 nodes or more in an odd number.
2. Partitions: We write/publish data/event into a topic which is divided into partitions (by default 1), but we should have M times N, where can be any integer number, i.e. M >= 1, to achieve more parallelism and partitioning of data over the cluster.
3.Replication Factor: determines the number of copies (including the original/Leader) of each partition in the cluster. All replicas of a partition exist on separate node/broker, and we should never have R.F. > N, but at least 3. 
We recommend having 3 RF with 3 or 5 nodes cluster. This helps in having both availabilities as well as consistency.

Click through for several more tradeoff points.

Comments closed

Bring .NET Support to Spark

I have a request that you vote up a Spark issue:

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included in here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

You an click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.

Comments closed

Apache Druid Concepts

Jatin Demla takes us through some of the key concepts behind Apache Druid:

Apache Druid is a distributed, high-performance columnar store for real-time analytics on a large dataset. Druid core design combines the OLAP analytics, time series database and search system to create a single operational analysis. Druid is most suitable for data with high cardinality column or queries having higher aggregation or group by.

Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.

Comments closed

Generating TPC-DS Data Sets with HDInsight

Chris Koester shows how you can generate artificial data sets in the TCP-DS format using HDInsight:

This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is for benchmarking purposes, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.

The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to setup, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.

Most of the instructions should follow through to work with on-prem or non-HDInsight Hadoop clusters, though there will be some changes to accommodate differences in HDInsight.

Comments closed

Databricks Dashboards

Megan Quinn takes us through building dashboards with Apache Zeppelin on Databricks:

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

This isn’t quite a step-by-step guide but does spur on ideas.

Comments closed

Working with Columns in Spark

Achilleus has a two-parter on working with columns in Spark. Part 1 covers some of the basic syntax and several functions:

Also, we can have typed columns which is basically a column with an expression encoder specified for the expected input and return type.

scala> val name = $"name".as[String]
name: org.apache.spark.sql.TypedColumn[Any,String] = name
scala> val name = $"name"
name: org.apache.spark.sql.ColumnName = name

There are more than 50 methods(67 the last time I counted ) that can be used for transformations on the column object. We will be covering some of the important methods that are generally used.

Part 2 covers other functions including window functions:

17) over
This is one of the most important function that is used in many of the window operations.We can talk about the window function in detail when discuss about aggregation in spark but for now, it will be fair enough to say that over method provides a way to apply an aggregation over a window specification which in turn can be used to specify partition, order and frame boundaries of the aggregation.

Check out both of these posts for useful tidbits.

Comments closed

Creating Threadpools with ExecutorService in Kafka

Prasanth Nair shows how we can use Java’s ExecutorService to create threadpools for Kafka consumers:

Apache Kafka is one of today’s most commonly used event streaming platforms. While using the Kafka platform, quite often, we run into a scenario where we have to process a large number of events/messages that are placed on a broker. Traditional approaches, where a consumer is listening to a topic and then processes these message within the consumer itself, can become a performance bottleneck if the number of messages being placed on the topic is high. In such cases, the rate at which a consumer can process messages will be very low, as there are a large number of messages getting placed on the topic. A potential solution that can be applied in such a scenario is to offload message processing to the worker threads in a thread pool.

Click through for the Java code.

Comments closed