Category: Hadoop

MapR Goes Spark-First

Published 2016-06-08 by Kevin Feasel

MapR has introduced a new version of their platform which is based on Spark:

With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode either programmatically (using Scala, Java, Python, or R) or with procedural SQL (using Spark SQL or Hive QL).

With MapR converging the data management platform, you can now take a preferential Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark as part of your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).

MapReduce is so 2012…

Comments closed

HDInsight Tool For IntelliJ

Published 2016-06-08 by Kevin Feasel

Xiaoyong Zhu introduces the new HDInsight Tool for IntelliJ:

This tools extends IntelliJ to support Spark job life cycle from create, author, debug and submit job to Azure cluster and view results. This IntelliJ HDInsight tool integrates well with Azure to allow user navigate HDInsight Spark clusters and view associated Azure storage account. To further boost productivity, the IntelliJ HDInsight tool also offers the capability to view Spark job history, display detailed job logs, and the job output to boost developer productivity. A few usability improvements have been implemented upon user preview feedback, which includes auto locate artifact, add intelligence to remember assembly location, caches spark logs, etc.

It looks like this is specifically designed for Spark-enabled clusters.

Comments closed

Incorporating NiFi Into Brownfield Code

Published 2016-06-07 by Kevin Feasel

Paul Boal discusses how he incorporated Apache NiFi in an existing process:

Typically, data warehousing and ETL tool vendors recommended that we write your own custom components. After all, the target market for ETL tools is a space where the tools are specifically marketed as reducing the need for “error prone and time consuming” manual coding. When I ran across this tutorial on writing your own NiFi processor it occurred to me that NiFi is the exact opposite. It’s both Open Source and designed for extensibility from the ground up. I found it quite reasonable to write a custom NiFi processor that leverages our existing code base.

The existing code is a Java program with separate classes for each device vendor, all with the same interface to abstract the nuances of each vendor from the main data export program. This interface follows a traditional paradigm: login, query, query, query, logout. Given that my input to NiFi above takes in simple username, password, and query criteria arguments, it seems trivial to create a NiFi processor class that adapts the existing code into the NiFi API. Here’s a slightly abbreviated version of the actual code. (In reality, it’s all of 70 lines of code.)

In almost any realistic scenario, you’re not going to have the opportunity to start from scratch. You will always have legacy components, external dependencies, and existing user bases to satisfy. I like this article because it moves forward from that starting point.

Comments closed

Lipwig

Published 2016-06-07 by Kevin Feasel

Peter Coates shows how to make Hive EXPLAIN plans a lot prettier:

As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects such variations as code changes, the use of analyze, turning on/off the cost-based optimizer (CBO), and so on. It’s an essential tool for optimizing Hive.

The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.

I’m going to have to keep this in mind.

Comments closed

Lambda And Kappa

Published 2016-06-06 by Kevin Feasel

Alex Woodie has a story on two competing data architectures:

Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent, was one of the first big data architects to espouse an alternative to the Lambda architecture, which he did with his 2014 O’Reilly story “Questioning the Lambda Architecture.” While Kreps appreciated some aspects of the Lambda architecture—in particular how it deals with reprocessing data—he stated that the downside was just too great.

“The Lambda architecture says I have to have Hadoop and I have to have Storm and I’m going to implement everything in both places and keep them in sync. “I think that’s extremely hard to do,” Kreps tells Datanami. “I think one of the biggest things hurting stream processing is the amount of complexity that you have to incur to build something. That makes it slow to build applications that way, hard to roll them out, and hard to make them reliable enough to be a key part of the business.

I wonder if we’re seeing the next generation of Kimball v Inmon here, or if one will absolutely dominate.

Comments closed

Storm 1.0: Enhanced Debugging

Published 2016-06-06 by Kevin Feasel

Taylor Goetz discusses improvements in Storm 1.0:

The log file viewer added in the Apache Storm 0.9.1 release made accessing Storm’s log files significantly easier, but in some cases still required examination individual log files one-by-one. In Storm 1.0 the UI now includes a powerful search feature that allows you to search a specific topology log file, or across all topology log files in the cluster, even archived files.

When performing a topology-wide search, the UI will search across all supervisor nodes for a match. The search results include a link to the matching log file, as well as host and port information that allow you quickly identify on which machine a specific log event occurred. This feature is particularly helpful when trying to track down when and where a particular error occurred.

The examples Taylor gives are all built around scaling. When you have dozens or hundreds of nodes, one-by-one solutions just don’t work.

Comments closed

EMR Now Supports Phoenix

Published 2016-06-03 by Kevin Feasel

Jonathan Fritz notes that Amazon’s ElasticMapReduce now offers easy support for Apache Phoenix:

Once your HBase table is ready, it’s time to map a table in Phoenix to your data in HBase. You use a JDBC connection to access Phoenix, and there are two drivers included on your cluster under /usr/lib/phoenix/bin. First, the Phoenix client connects directly to HBase processes to execute queries, which requires several ports to be open in your Amazon EC2 Security Group (for ZooKeeper, HBase Master, and RegionServers on your cluster) if your client is off-cluster.

Second, the Phoenix thin client connects to the Phoenix Query Server, which runs on port 8765 on the master node of your EMR cluster. This allows you to use a local client without adjusting your Amazon EC2 Security Groups by creating a SSH tunnel to the master node and using port forwarding for port 8765. The Phoenix Query Server is still a new component, and not all SQL clients can support the Phoenix thin client.

I am not HBase’s biggest fan, but I do think that Phoenix fixes one of the biggest problems HBase has: its being completely foreign to most data professionals. It’s not an accident that as a data platform matures, its development language looks more and more like SQL.

Comments closed

CDH Update

Published 2016-06-01 by Kevin Feasel

Cloudera reports that CDH 5.7 includes a large number of changes to Hue, the web-based Hive UI:

Single-page app: The initial page loads very quickly and asynchronously fetches the list of tables, table statistics, data sample, and partition list. Subsequent navigation clicks will trigger only 1 or 2 calls to the server, instead of reloading all the page resources again. As an added bonus, the browser history now works on all the pages.

These are some nice changes. I still don’t think a web app replaces quality tooling (like Management Studio), but if a web app is what you have, it should at least be nice.

Comments closed

Heron

Published 2016-05-27 by Kevin Feasel

Twitter is open sourcing Heron, a competitor to Apache Storm:

Apache Storm was the original solution to Twitter’s problems. It was created by a marketing intelligence company called BackType, and Twitter bought the company in 2011 and eventually open-sourced Storm, providing it to the Apache Foundation.

There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or systems for receiving data from established sources. But it was reputedly also hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its own revised streaming framework.

It’s good to see this competition in the streaming space.

Comments closed

Kafka 0.10

Published 2016-05-25 by Kevin Feasel

Kafka 0.10 is now available:

Kafka Streams: Kafka Streams was introduced as part of thetech preview release of the Confluent Platform few months ago and is now available through Apache Kafka 0.10.0.0. Kafka Streams is a library that turns Apache Kafka into a full featured, modern stream processing system. Kafka Streams includes a high level language for describing common stream operations (such as joining, filtering, and aggregating records), allowing developers to quickly develop powerful streaming applications. Kafka Streams offers a true event-at-a-time processing model, handles out-of-order data, allows stateful and stateless processing and can easily be deployed on many different systems— Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just embedded into existing Java applications.

There are some nice improvements in this latest version of Kafka.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31