Press "Enter" to skip to content

Category: Hadoop

Pitfalls Of DIY Hadoop

Ben Davis discusses considerations when rolling your own Hadoop cluster:

5. Security hardening
I find it is easier to deploy Hadoop in a fairly low-security configuration. This is because there is a range of ports that Hadoop talks on, and an incorrectly configured firewall can cause you problems. So after deployment, set aside time to identify how to customise your firewalls, user and group settings, Kerberos, and SSL settings.

I think the article makes some good points.  DIY is great for a proof of concept or for playing around with a technology, but if you don’t already have a good amount of experience with a technology, you’ll probably make costly mistakes in development and administration.  This is not Hadoop-specific:  I’ve seen companies do terrible things to SQL Server because they didn’t know the correct way to do it but needed to get work done.  As part of a proof of concept, do all the terrible things you’d like; they’re how you’ll learn.  But if this is going to production, it’s a good idea to have people who know what they’re doing involved.


Getting Started With Hadoop

Jon Morisi is looking at Hadoop:

From here I think I’ll start playing around on a sandbox.  Each of the distributions offers a way to spin up a VM or log into a cloud-based environment.  There are also docker images out there (search hadoop, cloudera, or hortonworks).  Most of these docker images look fairly new, so don’t cut yourself.

I’m looking at the Hortonworks distro, so I’ll probably set up a Hortonworks Sandbox.

Jon includes some resources he’s used to learn a bit about the topic.  I think he’s going down the right path with videos and mailing lists—the ecosystem changes too quickly for books to have much long-term value, and mailing lists & forums tend to be better for keeping up to date.  My biggest suggestion is to get case studies and play around.  Check out studies on e-mail ingest, real-time data analysis, and analyzing fantasy sports for starters.


Explaining Yarn Container Memory Allocations

Skumar T explains container sizes in YARN:

Jobs on a YARN cluster run in individual containers, which are allocated by the Node Manager, which in turn gets permission from the Resource Manager.

A few Node Manager configuration parameters are important in the context of jobs running in containers:

yarn.nodemanager.resource.memory-mb (value: 8192)

The amount of physical memory, in MB, that can be allocated for containers.

yarn.nodemanager.pmem-check-enabled (value: true)

Whether physical memory limits will be enforced for containers.

The bottom half of the article goes into an extended example.
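
If you want to play with the arithmetic yourself, here’s a minimal sketch of how the node memory setting bounds the number of containers per node. The container request sizes and the scheduler minimum-allocation value are illustrative assumptions on my part, not figures from the article; check your own yarn-site.xml.

```python
import math

# Values as they might appear in yarn-site.xml. The 8192 MB figure matches the
# article's example; the minimum-allocation setting is an assumed default.
node_memory_mb = 8192                  # yarn.nodemanager.resource.memory-mb
scheduler_min_allocation_mb = 1024     # yarn.scheduler.minimum-allocation-mb (assumed)

def containers_per_node(requested_mb: int) -> int:
    """Estimate how many containers of a given size fit on one Node Manager.

    YARN rounds each request up to a multiple of the scheduler's minimum
    allocation, and the node's total container memory caps the count.
    """
    granted_mb = math.ceil(requested_mb / scheduler_min_allocation_mb) * scheduler_min_allocation_mb
    return node_memory_mb // granted_mb

# A 2 GB container request fits four times into an 8 GB node.
print(containers_per_node(2048))   # -> 4
# A 1.5 GB request is rounded up to 2 GB, so still four containers per node.
print(containers_per_node(1500))   # -> 4
```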


Developing Spark Applications In .NET

Kaarthik Sivashanmugam talks about Mobius, a Microsoft-driven .NET wrapper for Spark:

The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.

Looks like there’s some fuzziness on just how well F# is supported.  Still, this is very exciting as a way of bridging the gap for .NET developers.
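
For a sense of what a “language binding” buys you, here’s roughly what the PySpark side of that picture looks like; Mobius exposes an analogous RDD API to C#, with CSharpRDD playing the role PythonRDD plays here. The input path is a placeholder and this is just an illustrative sketch, not code from the article.

```python
# Minimal PySpark word count; Mobius mirrors these RDD operations in C#.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-sketch")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///tmp/input.txt")   # placeholder path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print a handful of (word, count) pairs.
for word, count in counts.take(10):
    print(word, count)

sc.stop()
```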


Optimizing HBase In HDInsight

Ashish Thapliyal links to a 30-minute presentation on HBase optimization:

This session was presented by Nitin Verma (Sr. Software Engineer) and Pravin Mittal (Principal Engineering Manager) at HBaseCon 2016. The session goes deeper into the success story of enabling a big internal customer on HDInsight HBase.

HBase design is a totally different mindset from relational design, so you have to unlearn a lot of habits when moving over to it.


Spatial Functions In Hive

Constantin Stanca has a couple of posts on using Hive to implement geospatial queries.  First, an overview:

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL-like experience.

The Spatial Framework for Hadoop includes, among other things, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDFs could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase, etc.

He follows that up with some pieces Hive misses compared to SQL Server, Oracle, etc.:

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMS like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% missing functions is to contribute to ESRI’s open source repository: https://github.com/Esri/spatial-framework-for-hadoop. ESRI does not provide a commercial library for Hive including all spatial functions.

Be sure to check out that second link to get an understanding of exactly what’s missing.  Via Mark Herring.
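
If you want a feel for the workflow the first excerpt describes, here’s a rough sketch of registering a few of the ESRI UDFs and running a point-in-polygon query through Spark’s Hive support. The jar deployment, table, and polygon are placeholders of mine, and the function class names should be verified against the spatial-framework-for-hadoop build you actually deploy.

```python
# Sketch: spatial query through Hive-compatible SQL using the ESRI UDFs.
# Assumes the spatial-sdk-hive and esri-geometry-api jars are on the classpath
# (e.g. passed via --jars) and a Hive table places(name STRING, lon DOUBLE, lat DOUBLE).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-spatial-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Register UDFs shipped with the Spatial Framework for Hadoop (class names assumed).
spark.sql("CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point'")
spark.sql("CREATE TEMPORARY FUNCTION ST_Polygon AS 'com.esri.hadoop.hive.ST_Polygon'")
spark.sql("CREATE TEMPORARY FUNCTION ST_Contains AS 'com.esri.hadoop.hive.ST_Contains'")

# Which places fall inside a (placeholder) bounding polygon?
result = spark.sql("""
    SELECT name
    FROM places
    WHERE ST_Contains(ST_Polygon('polygon ((-123 37, -121 37, -121 39, -123 39, -123 37))'),
                      ST_Point(lon, lat))
""")
result.show()
```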


Ingesting E-Mail Into Hadoop

Jordan Volz and Stefan Salandy show how to feed e-mails into Hadoop for almost-immediate analysis:

In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communication (email, IM, social media, proprietary communication tools, and so on) for a set period of time. Once data has grown past its retention period, it can then be permanently removed; in the meantime, such data is subject to e-discovery requests and legal holds. Even outside of compliance use cases, most large organizations that are subject to litigation have some form of archive in place for purposes of e-discovery.

Traditional solutions in this area comprise various moving parts and can be quite costly and complex to implement, maintain, and upgrade. By using the Hadoop stack to take advantage of cost-efficient distributed computing, companies can expect significant cost savings and performance benefits.

In this post, as a simple example of this use case, I’ll describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange.

Most of this post is about setting up the interconnections between Exchange and Apache James, and feeding data in.  It looks like this will be part 1 of a multi-part series.


Don’t Use Cron For Scheduling Hadoop Jobs

Matthew Rathbone explains why cron is not a great choice for scheduling Hadoop and Spark jobs:

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.


Securing The Data Plane

Michael Schiebel gives an overview of security architecture inside a data lake:

Existing platform-based Hadoop architectures make several implicit assumptions about how users interact with the platform, such as developmental research versus production applications.  While this was perfectly good in a research mode, as we move to a modern data application architecture we need to bring modern application concepts back to the Hadoop ecosystem.  For example, existing Hadoop architectures tightly couple the user interface with the source of data.  This is done for good reasons that apply in a data discovery research context, but it causes significant issues in developing and maintaining a production application.  We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc.  Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.

Read the whole thing.  Also, “Securing the data plane” sounds like a terrible ’90s action film.
