Press "Enter" to skip to content

Category: Hadoop

Performance Testing Hadoop File Formats

Zbigniew Baranowski looks at the performance of several Hadoop file formats for various activities:

The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

  • 2 x 8 cores @2.60GHz
  • 64GB of RAM
  • 2 x 24 SAS drives

Hadoop cluster was installed from Cloudera Data Hub(CDH) distribution version 5.7.0, this includes:

  • Hadoop core 2.6.0
  • Impala 2.5.0
  • Hive 1.1.0
  • HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
  • (not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

I would have liked to have seen ORC included as a file format for testing.  Regardless, I think this article shows that there are several file formats for a reason, and you should choose your file format based on most likely expected use.  For example, Avro or Parquet for “write-only” systems or Kudu for larger-scale analytics.

Comments closed

Apache Zeppelin 0.7.0

Vinay Shulka announces Apache Zeppelin 0.7.0:

SPARK IMPROVEMENTS

This release also adds support for Spark 2 including version Spark 2.1. Zeppelin now also links to Spark History Server UI from Zeppelin so users can more easily track Spark jobs. The Livy interpreter now supports specifying packages with the job.

SECURITY IMPROVEMENTS

The major security improvement in Zeppelin 0.7.0 is using Apache Knox’s LDAP Realm to connect to LDAP. Zeppelin home page now lists only the nodes to which the user is authorized to access. Zeppelin now also has the ability to support PAM based authentication.

The full list of improvements is available here

This visualization platform is growing up nicely.

Comments closed

Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Comments closed

Encryption In ElasticMapReduce

Sai Sriparasa shows how to enable encryption in an ElasticMapReduce cluster:

In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.

Data at rest

  • Data residing on Amazon S3—S3 client-side encryption with EMR
  • Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key System (LUKS)

Data in transit

  • Data in transit from EMR to S3, or vice versa—S3 client side encryption with EMR

  • Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption

  • Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption

Turns out this is rather straightforward.

Comments closed

Securing MapR

Mitesh Shah provides some high-level information on how to secure a MapR cluster:

  • Security Best Practice #2:  Require Authentication for All Services.  While it’s important for ports to be accessible exclusively from the network segment(s) that require access, you need to go a step further to ensure that only specific users are authorized to access the services running on these ports.  All MapR services — regardless of their accessibility — should require authentication.  A good way to enforce this for MapR platform components is by turning on security.  Note that MapR is the only big data platform that allows for username/password-based authentication with the user registry of your choice, obviating the need for Kerberos and all the complexities that Kerberos brings (e.g., setting up and managing a KDC). MapR supports Kerberos, too, so environments that already have it running can use it with MapR if preferred.

There’s nothing here which is absolutely groundbreaking, but they are good practices.

Comments closed

Beginning With Amazon Athena

Jen Underwood looks at the basics behind Amazon Athena:

Today early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.

Athena is serverless parallel query pay-per-use service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.

The easy way of thinking about Athena is that it’s ElasticMapReduce (a pay-as-you-go Hadoop cluster) without the ceremony of administering or spinning up the cluster.

Comments closed

Performance Tuning A Streaming Application

Mathieu Dumoulin explains how he was able to get 10x performance out of a streaming application built around Kafka, Spark Streaming, and Apache Ignite:

The main issues for these applications were caused by trying to run a development system’s code, tested on AWS instances on a physical, on-premise cluster running on real data. The original developer was never given access to the production cluster or the real data.

Apache Ignite was a huge source of problems, principally because it is such a new project that nobody had any real experience with it and also because it is not a very mature project yet.

I found this article fascinating, particularly because the answer was a lot more than just “throw some more hardware at the problem.”

Comments closed

Secure Enterprise Data Hub On Azure

James Morantus has a two-parter on Azure, Active Directory, and Cloudera’s enterprise data hub solution.  Part one hits on DNS and Samba:

As you can see, the hostname -f command displays a very long FQDN for my VM and hostname -i gives us the IP address associated with the VM. Next, I did a forward DNS lookup using the host FQDN command, which resolved to the IP address. Then, I did a reverse DNS lookup using host IPaddress as shown in the red box above, it did not locate a reverse entry for that IP address. A reverse lookup is a requirement for a CDH deployment. We’ll revisit this later.

Part two looks at tying everything together in the Azure portal as well as within AD:

The remaining steps must be executed as the Cloudera Director admin user you created earlier. In my case, that’s the “azuredirectoradmin” account. All resources created by Cloudera Director in the Azure Portal will be owned by this account. The “root” user is not allowed to create resources on the Azure Portal.

First, we’ll need to create a SSH key as the “azuredirectoradmin” user on the VM where Cloudera Director is installed. This key will be added to our deployment configuration file, which will be added on all the VMs provisioned by Cloudera Director. This will allow us to use passwordless SSH to the cluster nodes with this key.

This isn’t trivial, but considering all that’s going on, it’s rather straightforward.

Comments closed

Submitting A Spark Job On HDInsight

Bharath Venkatesh shows different ways to run a Spark job on HDInsight:

From HDI 3.5 onwards, our clusters come preinstalled with Zeppelin Notebooks. Much like Jupyter notebooks, Zeppelin is a web-based notebook that enables interactive data analytics. It provides built-in Spark intergration that allows for:

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from local filesystem or maven repository. Learn more about dependency loader.
  • Canceling job and displaying its progress

This MSDN article provides a quick easy-to-use onboarding guide to help get acclimatized to Zeppelin. You can also try several applications that come pre-installed on your cluster to get hands on experience of Zeppelin.

Zeppelin is probably my favorite method, but there are good reasons to use all of these.

Comments closed

Monitoring Car Data With Spark And Kafka

Carol McDonald builds a model to determine where Uber cars are clustered:

Uber trip data is published to a MapR Streams topic using the Kafka API. A Spark streaming application, subscribed to the topic, enriches the data with the cluster Id corresponding to the location using a k-means model, and publishes the results in JSON format to another topic. A Spark streaming application subscribed to the second topic analyzes the JSON messages in real time.

This is a fairly detailed post, well worth the read.

Comments closed