Category: Hadoop

Google Analytics is a very powerful platform for monitoring your web site’s metrics including top pages, visitors, bounce rate, etc. As more and more businesses start using Big Data processes, the need to compile as much data as possible becomes more advantageous. The more data you have available, the more options you have to analyze that data and produce some very interesting results that can help you shape your business.

This article assumes that you already have a running Kafka cluster. If you don’t, then please follow my article Kafka Tutorial for Fast Data Architecture to get a cluster up and running. You will also need to have a topic already created for publishing the Google Analytics metrics to. The aforementioned article covers this procedure as well. I created a topic called admintome-ga-pages since we will be collection Google Analytics (ga) metrics about my blog’s pages.

In this article, I will walk you through how to pull metrics data from Google Analytics for your site then take that data and push it to your Kafka Cluster. In a later article, we will cover how to take that data that is consumed in Kafka and analyze it to give us some meaningful business data.

This is a good tutorial and I’m looking forward to part two.

Comments closed

Benefits To Federating The Hadoop NameNode

Published 2018-07-06 by Kevin Feasel

Hanisha Koneru and Arpit Agarwal show us a few benefits to NameNode federation:

The Apache Hadoop Distributed File System (HDFS) is highly scalable and can support petabyte-sizes clusters. However, the entire Namespace (file system metadata) is stored in memory. So even though the storage can be scaled horizontally, the namespace can only be scaled vertically. It is limited by the how many files, blocks and directories can be stored in the memory of a single NameNode process.

Federation was introduced in order to scale the name service horizontally by using multiple independent Namenodes/ Namespaces. The Namenodes are independent of each other and there is no communication between them. The Namenodes can share the same Datanodes for storage.

KEY BENEFITS

Scalability: Federation adds support for horizontal scaling of Namespace

Performance: Adding more Namenodes to a cluster increases the aggregate read/write throughput of the cluster

Isolation: Users and applications can be divided between the Namenodes

Read on for examples.

Comments closed

How Qubole Optimizes Apache Spark Clusters

Published 2018-07-05 by Kevin Feasel

Mikhail Stolpner gives us some tips on how to optimize Apache Spark clusters:

There are four major resources: memory, compute (CPU), disk, and network. Memory and compute are by far the most expensive. Understanding how much compute and memory your application requires is crucial for optimization.

You can configure how much memory and how many CPUs each executor gets. While the number of CPUs for each task is fixed, executor memory is shared between the tasks processed by a single executor.

A few key parameters provide the most impact on how Spark is executed in terms of resources: spark.executor.memory, spark.executor.cores, spark.task.cpus, spark.executor.instances, and spark.qubole.max.executors.

This article gives us some idea of the levers we have available as well as when to pull them. Though the article itself is vendor-specific, a lot of the advice is general.

Comments closed

RStudio Integration With Databricks

Published 2018-07-03 by Kevin Feasel

Brian Dirking, et al, announce support between RStudio and the Databricks platform:

With Databricks RStudio Integration, both popular R packages for interacting with Apache Spark, SparkR or sparklyr can be used the inside the RStudio IDE on Databricks. When multiple users use a cluster, each creates a separate SparkR Context or sparklyr connection, but they are all talking to a single Databricks managed Spark application allowing unique opportunities for collaboration between users. Together, RStudio can take advantage of Databricks’ cluster management and Apache Spark to perform such as a massive model selection as noted in the figure below.

I like seeing this level of integration, especially from a language like R, which has historically been limited to operating on a single machine’s memory.

Comments closed

Auto-Scaling Amazon ElasticMapReduce

Published 2018-07-03 by Kevin Feasel

Brandon Schller gives us some tips on sizing and scaling ElasticMapReduce:

EMR scaling is more complex than simply adding or removing nodes from the cluster. One common misconception is that scaling in Amazon EMR works exactly like Amazon EC2 scaling. With EC2 scaling, you can add/remove nodes almost instantly and without worry, but EMR has more complexity to it, especially when scaling a cluster down. This is because important data or jobs could be running on your nodes.

To prevent data loss, Amazon EMR scaling ensures that your node has no running Apache Hadoop tasks or unique data that could be lost before removing your node. It is worth considering this decommissioning delay when resizing your EMR cluster. By understanding and accounting for how this process works, you can avoid issues that have plagued others, such as slow cluster resizes and inefficient automatic scaling policies.

If you’re using EMR today or think you might use it in the future, you should read this.

Comments closed

Backing Up Azure Data Lake Store Data

Published 2018-06-29 by Kevin Feasel

Hugo Almeida has some hints for backing up Azure Data Lake Store data using Azure Data Factory:

Our Hadoop HDP IaaS cluster on Azure uses Azure Data Lake Store (ADLS) for data repository and accesses it through an applicational user created on Azure Active Directory (AAD). Check this tutorial if you want to connect your own Hadoop to ADLS.

Our ADLS is getting bigger and we’re working on a backup strategy for it. ADLS provides locally-redundant storage (LRS), however, this does not prevent our application from corrupting data or accidentally deleting it. Since Microsoft hasn’t published a new version of ADLS with a clone feature we had to find a way to backup all the data stored in our data lake.

We’re going to show you How to do a full ADLS backup with Azure Data Factory (ADF). ADF does not preserve permissions. However, our Hadoop client can only access the AzureDataLakeStoreFilesystem (adl) through hive with a “hive” user and we can generate these permissions before the backup.

Read the whole thing if you’re thinking of using Azure Data Lake Store.

Comments closed

Enriching Syslog Data In A Kafka Pipeline

Published 2018-06-28 by Kevin Feasel

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams. If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.

Comments closed

Taxi Cab Data On sqlite And Parquet

Published 2018-06-28 by Kevin Feasel

Mark Litwintschik loads the 1.1 billion rows of New York City taxi data into a SQLite database using data stored on Parquet-formatted files living on HDFS:

The dataset used in this benchmark has 1.1 billion records, 51 columns and is 500 GB in size when in uncompressed CSV format. Instructions on producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive and Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.

I’ve downloaded the Parquet files to my local file system and imported them onto HDFS. Since this is all running on a single SSD drive I’ve set the HDFS replication factor to 1.

It’s not the fastest result I’ve seen from Mark’s work, but I was impressed that SQLite could take that abuse.

Comments closed

Confluent Hub: A Central Repo For Kafka Connect

Published 2018-06-25 by Kevin Feasel

Tim Berglund announces Confluent Hub:

Connect has been an integral part of Apache Kafka since version 0.9, released late 2015. It has proved to be an effective framework for streaming data in and out of Kafka from nearby systems like relational databases, Amazon S3, HDFS clusters, and even nonstandard legacy systems that typically show themselves in the enterprise. Connect is an API on which the connectors themselves are built, plus a run-time framework that runs them in a scalable, fault-tolerant way. The intent was for the community to provide its own connectors to plug into this framework and do the work of data integration while saving everyone a bunch of unrewarding coding that was near-boilerplate and didn’t add a lot of differentiated value to the business.

So where would those connectors live? Well, GitHub, for starters. At the time of this writing, there were 660 repositories matching the search phrase “Kafka Connect” on the popular hosting service, all in various stages of repair and levels of maintenance. Beyond those, Confluent’s popular Connectors page has proven to be one of the best ways to find connectors, some of which are supported by Confluent, and others of which have robust community support behind them. The Connectors page lists for each entry the type of connector, the developer, a few tags, and how you could obtain the code—but that’s really all it did. You still had to go find the released JARs for the connector, download them, and know how to install them properly. And if there were no released JARs available, you had to clone the repository, figure out how to run the build, and then install the JARs into your own Kafka Connect installation. Maybe not rocket science, but we all know it’s never as simple as it sounds. And besides, this was just connectors—no transformations or converters were available on this page.

We knew there was a better way. We wanted something that was easier to use, would avoid you having to building a connector from source every time you wanted an update (and learning a new build tool every now and then), and would be built on top of a meaningful and functional discovery mechanism. And most importantly, we wanted to avoid the pitfalls of manually moving JARs around and having to debug why Connect didn’t find them.

This looks like a good addition to the Kafka ecosystem.

Comments closed

Tuning Spark Jobs Running On YARN

Published 2018-06-25 by Kevin Feasel

Anubhav Tarar shows us ways of optimizing YARN to run Apache Spark jobs:

1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage the memory first make sure that you have your yarn-site.xml in spark,

spark.yarn.am.memory: To increase the memory you should set spark.yarn.am.memory property in spark-defaults.conf but make sure that you do not allocate more memory than capacity of node manager which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb or you can also give it when you are running spark submit with –conf parameter

For example $SPARK_HOME/bin/spark-submit –conf spark.yarn.am.memory=1024m

Check it out for a few other configuration settings you can tweak.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31