Category: Hadoop

An Introduction to Apache Livy

Guy Shilo explains why we should care about Apache Livy:

Apache Livy is an open source server that exposes Spark as a service. Its backend connects to a Spark cluster while the frontend exposes a REST API. This makes it possible to run it as the organization’s Spark gateway and even to run it in Docker containers.

Not only does it enable running Spark jobs from anywhere, but it also provides a shared Spark context and a shared RDD cache among all its users, which saves both time and memory.

I will demonstrate here how to set up Apache Livy on one of the cluster’s nodes and on a separate server.

Click through for the demonstration.
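To give a flavor of the REST API Livy exposes, here is a rough sketch of submitting a Spark batch job from Python. The server hostname, application path, and class name are placeholders (8998 is Livy's default port, but check your own deployment):

```python
import json
import time

import requests

# Hypothetical Livy endpoint -- substitute your own host; 8998 is Livy's default port.
LIVY_URL = "http://livy-server.example.com:8998"

# Submit a batch job. The jar path, class name, and sizing values are placeholders.
payload = {
    "file": "hdfs:///apps/my-spark-job.jar",
    "className": "com.example.MySparkJob",
    "args": ["2019-10-01"],
    "executorMemory": "4g",
    "numExecutors": 4,
}
resp = requests.post(f"{LIVY_URL}/batches", data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch until Livy reports a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    print(f"batch {batch_id}: {state}")
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)
```

The same server also supports interactive sessions (POST /sessions and then /sessions/{id}/statements), which is where the shared-context benefit comes in.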

Migrating Databricks Workspaces

Gerhard Brueckl has made DatabricksPS better:

I do not know what the problem is/was here, but I did not have time to investigate; instead, I needed to come up with a proper solution in time. So I had a look at what needs to be done for a manual export. Basically, there are 5 types of content within a Databricks workspace:

– Workspace items (notebooks and folders)
– Clusters
– Jobs
– Secrets
– Security (users and groups)

For all of them, Databricks provides an appropriate REST API to manage them, as well as to export and import them. This was fantastic news for me, as I knew I could use my existing PowerShell module DatabricksPS to do all of this without having to reinvent the wheel.

I’ve used DatabricksPS and really like it for cases where I’d otherwise have to loop over Databricks REST API calls, such as when uploading files.
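For a sense of the raw API the module wraps, here is a rough Python sketch that lists and exports workspace notebooks via the Databricks Workspace REST API. The host, token, and paths are placeholders, and DatabricksPS itself covers far more than this (clusters, jobs, secrets, and permissions):

```python
import base64
import os

import requests

# Placeholders -- point these at your own workspace and personal access token.
HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def export_workspace(path, target_dir):
    """Recursively export notebooks under `path` as .dbc archives, one file per notebook."""
    listing = requests.get(f"{HOST}/api/2.0/workspace/list",
                           headers=HEADERS, params={"path": path}).json()
    for obj in listing.get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            export_workspace(obj["path"], target_dir)
        elif obj["object_type"] == "NOTEBOOK":
            exported = requests.get(f"{HOST}/api/2.0/workspace/export",
                                    headers=HEADERS,
                                    params={"path": obj["path"], "format": "DBC"}).json()
            target = os.path.join(target_dir, obj["path"].lstrip("/") + ".dbc")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                # The export endpoint returns the notebook content base64-encoded.
                f.write(base64.b64decode(exported["content"]))

export_workspace("/Shared", "./workspace-backup")
```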

Resource Allocation in Spark Applications

The folks at Beginner’s Hadoop take us through resource allocation in Spark applications:

Tiny executors essentially means one executor per core. The following table depicts the values of our spark-config params with this approach:

Analysis: With only one executor per core, as we discussed above, we’ll not be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated in each core of the node, which is 16 times. Also, we are not leaving enough memory overhead for Hadoop/YARN daemon processes, and we are not counting in the ApplicationManager. NOT GOOD!

Read on for the full analysis.
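The quoted table did not survive the excerpt, but the arithmetic behind the usual sizing heuristic is easy to sketch. The numbers below (16 cores and 64 GB per node, 5 cores per executor, roughly 7% memory overhead) are the commonly cited rules of thumb rather than anything specific from the post:

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Sketch of the common Spark-on-YARN sizing heuristic:
    leave 1 core and 1 GB per node for OS/Hadoop daemons,
    reserve 1 executor's worth of capacity for the ApplicationMaster,
    and carve ~7% of executor memory out for YARN overhead."""
    usable_cores = cores_per_node - 1          # 1 core for OS / daemons
    usable_mem_gb = mem_per_node_gb - 1        # 1 GB for OS / daemons

    executors_per_node = usable_cores // cores_per_executor
    total_executors = executors_per_node * nodes - 1   # minus 1 for the ApplicationMaster

    mem_per_executor = usable_mem_gb / executors_per_node
    heap_per_executor = mem_per_executor * (1 - overhead_fraction)

    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": round(heap_per_executor, 1),
    }

# 10 nodes of 16 cores / 64 GB each -> roughly 29 executors, 5 cores, ~19 GB heap apiece.
print(size_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64))
```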

KarelDB: SQL on Kafka

George Leopold informs us about a new project called KarelDB:

Unlike Confluent’s Kafka-based platform, KarelDB is not a streaming database. Yokota nevertheless flagged the relational database largely because it’s based on open-source components backed by Kafka. Hence, he reckons there’s a chance it could take off.

Those open source components include Calcite, an SQL framework that pushes relational queries to the data store, an approach seen as providing more efficient processing. Yokota noted that KarelDB would “automatically benefit” from upcoming Calcite optimizations.

Other open source projects such as the Apache Flink stream processing engine also have leveraged Calcite, including an SQL API. Calcite also includes an SQL parser.

Kafka already had KSQL for Kafka Streams, but this is a totally different validation of Feasel’s Law.

HBase and S3

Krishna Maheshwari, et al, explain how we can allow Apache HBase to use S3 for storage:

Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data. Amazon S3 is an object store which offers a high degree of durability with a pay-per-use cost structure. There is no server-side component to run or manage for S3 — all that is needed is the S3 client library and AWS credentials. However, HBase requires a consistent and atomic filesystem, which means that it cannot directly use S3 because it is an eventually consistent object store. Both CDH and HDP have provided HBase solely using HDFS because there have been long-standing impediments that prevented HBase from natively using S3. To address these issues, we’ve built an out-of-the-box solution which we are delivering for the first time via CDP. When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3 and HBase write-ahead logs (WALs) are stored in an HDFS instance run alongside HBase as usual.

I hadn’t thought of using S3, but it’s an interesting post.

Delta Lake Schema Enforcement

Burak Yavuz, et al, explain the concept of schema enforcement with Databricks Delta Lake:

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.

Something something “relational database” something something. They also walk us through some examples in a Databricks notebook, so check that out.
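Here is a quick PySpark illustration of what that rejection looks like in practice; the table path and column names are made up. Appending a DataFrame with an extra column fails with an AnalysisException unless you explicitly opt in to schema evolution with mergeSchema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Assumes a session where the Delta Lake libraries are already available (e.g. Databricks).
spark = SparkSession.builder.getOrCreate()

path = "/tmp/demo/loans"  # hypothetical table location

# Create the table with two columns.
spark.createDataFrame([(1, 1000.0)], ["loan_id", "amount"]) \
     .write.format("delta").mode("overwrite").save(path)

# Appending a DataFrame with an extra column violates the schema and is rejected.
new_rows = spark.createDataFrame([(2, 2500.0, "WA")], ["loan_id", "amount", "state"])
try:
    new_rows.write.format("delta").mode("append").save(path)
except AnalysisException as e:
    print("Write rejected by schema enforcement:", e)

# Opting in to schema evolution lets the write through and adds the new column.
new_rows.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```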

Calculating YARN Utilization Metrics

Dmitry Tolpeko shows how you can calculate per-second cluster utilization measures from YARN’s resource manager logs:

But even if you query the YARN REST API every second, it can still only provide a snapshot of the used YARN resources. It does not show which application allocates or releases containers, their memory and CPU capacity, in which order these events occur, what their exact timestamps are, and so on.

For this reason, I prefer a different approach that is based on using the YARN Resource Manager logs to calculate exact per-second utilization metrics of a Hadoop cluster.

This is a bit more complicated than hitting the REST API, but Dmitry shows some of the benefits of doing so.
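As a very rough sketch of the approach (not Dmitry's actual script), you can scan the Resource Manager log for container allocation and release events and accumulate the capacities by timestamp. The regular expressions below assume log lines along the lines of "Assigned container ... of capacity <memory:..., vCores:...>" and "Released container ...", which vary across Hadoop versions, so treat them as illustrative only:

```python
import re
from collections import defaultdict

# Illustrative patterns only -- Resource Manager log formats differ across Hadoop versions.
ASSIGN = re.compile(r"^(?P<ts>\S+ \S+).*Assigned container (?P<cid>container_\S+) "
                    r"of capacity <memory:(?P<mem>\d+), vCores:(?P<vcores>\d+)")
RELEASE = re.compile(r"^(?P<ts>\S+ \S+).*Released container (?P<cid>container_\S+)")

def utilization_per_second(log_path):
    """Yield (timestamp, allocated memory in MB, allocated vcores) per log timestamp."""
    containers = {}                        # container id -> (mem, vcores)
    deltas = defaultdict(lambda: [0, 0])   # timestamp -> [memory delta, vcore delta]

    with open(log_path) as f:
        for line in f:
            m = ASSIGN.search(line)
            if m:
                mem, vcores = int(m["mem"]), int(m["vcores"])
                containers[m["cid"]] = (mem, vcores)
                deltas[m["ts"]][0] += mem
                deltas[m["ts"]][1] += vcores
                continue
            m = RELEASE.search(line)
            if m and m["cid"] in containers:
                mem, vcores = containers.pop(m["cid"])
                deltas[m["ts"]][0] -= mem
                deltas[m["ts"]][1] -= vcores

    # Running totals give the cluster's allocated resources at each second.
    mem_used = vcores_used = 0
    for ts in sorted(deltas):
        mem_used += deltas[ts][0]
        vcores_used += deltas[ts][1]
        yield ts, mem_used, vcores_used

for ts, mem, vcores in utilization_per_second("yarn-resourcemanager.log"):
    print(ts, mem, vcores)
```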

Spark Streaming DStreams

Manish Mishra explains the fundamental abstraction of Spark Streaming:

Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a stream. There are multiple ways in which we can get the inputs, e.g., from Kafka, Flume, etc., or from simple idle files. To get the details on the available input sources supported by Spark, you can refer to this section. As part of this blog, we will take the example of Kafka.

Read on to see an example of pulling data from Kafka and converting inputs into microbatches.
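For reference, here is roughly what that looks like with the DStream API in PySpark (Spark 2.x with the Kafka 0.8 direct connector; the broker addresses and topic name are placeholders, and newer Spark versions push you toward Structured Streaming instead):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Kafka 0.8 connector, available in Spark 2.x

sc = SparkContext(appName="dstream-kafka-example")
ssc = StreamingContext(sc, batchDuration=5)  # each microbatch covers 5 seconds of input

# Placeholders: point these at your own brokers and topic.
kafka_params = {"metadata.broker.list": "broker1:9092,broker2:9092"}
stream = KafkaUtils.createDirectStream(ssc, topics=["events"], kafkaParams=kafka_params)

# Each Kafka record arrives as a (key, value) pair; count words in the values per microbatch.
counts = (stream.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```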

Multi-Region Replication with Confluent Platform

David Arthur walks us through multi-region replication of Kafka clusters in the Confluent Platform 5.4 preview:

Running a single Apache Kafka® cluster across multiple datacenters (DCs) is a common, yet somewhat taboo architecture. This architecture, referred to as a stretch cluster, provides several operational benefits and unlocks the door to many use cases. Stretch clusters provide better durability guarantees and make disaster recovery much easier by avoiding the problem of offset translation and restarting clients. However, in order to operate a reliable stretch cluster, datacenters must be relatively close to each other and have a very stable, low latency, and high-bandwidth connection among the DCs.

This changes with the preview release of Confluent Platform 5.4, which includes multi-region replication built directly into Confluent Server. Now operators can choose to replicate data on a per-region basis, synchronously or asynchronously, per topic. This functionality allows operators to increase data durability and automate client failover in the event of a disaster.

And of course all of those rules about RPO, RTO, etc. apply to this.
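If you want a feel for what per-topic placement might look like, here is a sketch that creates a topic with replica placement constraints through the Python AdminClient. The confluent.placement.constraints property and the constraint JSON are my recollection of Confluent's multi-region documentation, so verify against the 5.4 docs before relying on them; the broker address, topic name, and rack labels are placeholders:

```python
import json

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker-east-1:9092"})  # placeholder broker

# Keep synchronous replicas in the "east" racks and asynchronous observers in "west".
placement = {
    "version": 1,
    "replicas": [{"count": 2, "constraints": {"rack": "east"}}],
    "observers": [{"count": 2, "constraints": {"rack": "west"}}],
}

topic = NewTopic(
    "payments",
    num_partitions=6,
    replication_factor=-1,  # let the placement constraints drive replica counts
    config={
        "confluent.placement.constraints": json.dumps(placement),
        "min.insync.replicas": "2",
    },
)

futures = admin.create_topics([topic])
for name, fut in futures.items():
    fut.result()  # raises if topic creation failed
    print(f"created topic {name}")
```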

Diagnosing TCP SACKs-Related Slowdown in Databricks

Chris Stevens, et al, walk us through troubleshooting a slowdown after using Linux images which have been patched for the TCP SACKs vulnerabilities:

In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all but one of the tasks for the save operation would complete within a few minutes. Sorting the tasks in that stage by the Status column, it did not take long for there to be only one task in the RUNNING state. We had found our skewed task and the IP address in the Host column pointed us at the executor experiencing the regression.

This is a nice case study of network troubleshooting, so of course there are Wireshark screenshots in it.
