Press "Enter" to skip to content

Category: Hadoop

HDInsight With Hive LLAP

Rashin Gupta explains some performance benefits of using Hive 2.0 (LLAP) on HDInsight:

With LLAP, we allow data scientists to query data interactively in the same storage location where data is prepared. This means that customers do not have to move their data from a Hadoop cluster to another analytic engine for data warehousing scenarios. Using ORC file format, queries can use advanced joins, aggregations and other advanced Hive optimizations against the same data that was created in the data preparation phase.

In addition, LLAP can also cache this data in its containers so that future queries can be served from memory rather than from disk. Caching brings Hadoop closer to other in-memory analytic engines and opens Hadoop up to many new scenarios where interactive response is a must, like BI reporting and data analysis.

Even with this, Hive is still more of a “warehousing” technology, but this moves it closer to real-time (or at least “not slow”) warehousing.
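
To make that concrete, here is a minimal sketch of the kind of interactive query this enables, issued over PyHive against an ORC-backed table. The host, port, credentials, and table names are placeholders of my own, not anything from the post.

```python
# A minimal sketch: an interactive aggregation against an ORC-backed Hive table
# through HiveServer2 Interactive (the LLAP endpoint).  Host, port, credentials,
# and table names are placeholders.
from pyhive import hive

conn = hive.Connection(
    host="llap-hs2.example.com",  # placeholder; on HDInsight this sits behind the cluster gateway
    port=10500,                   # common HiveServer2 Interactive port, but check your cluster
    username="admin",
)
cursor = conn.cursor()

# A join plus aggregation -- the warehousing-style query that benefits from LLAP
# caching on repeated runs against the same ORC data.
cursor.execute("""
    SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)
```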


Building A Multi-Node Hadoop Cluster With Spark

Rao Swati has a step-by-step guide to setting up a multi-node cluster with Hadoop 2.7.3 and Spark 1.6.2:

Important Notes:

  1. start-dfs.sh  will start the NameNode, SecondaryNameNode, and DataNode on the master and a DataNode on all slave nodes.
  2. start-yarn.sh  will start the NodeManager and ResourceManager on the master node and a NodeManager on each slave.
  3. Run  hadoop namenode -format  only once; otherwise you will get an incompatible clusterID exception. To resolve that error, clear the DataNode's temporary data location, i.e., remove the files in the $HADOOP_HOME/dfs/name/data folder.

If you’d like to set up your own Hadoop cluster rather than using one of the big vendors (Hortonworks, Cloudera, MapR) or a PaaS solution like HDInsight or ElasticMapReduce, this will give you a head start.
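
The format step is the easiest one to get wrong, so here is a rough sketch of the startup sequence with a guard around it. The NameNode directory below is an assumption on my part; adjust it to match your dfs.namenode.name.dir setting.

```python
import os
import subprocess

HADOOP_HOME = os.environ["HADOOP_HOME"]
# Assumed NameNode metadata directory; match this to dfs.namenode.name.dir.
name_dir = os.path.join(HADOOP_HOME, "dfs", "name")

# Format the NameNode only if it has never been formatted.  Re-running the format
# after DataNodes have registered is what causes the incompatible clusterID error.
if not os.path.exists(os.path.join(name_dir, "current", "VERSION")):
    subprocess.run(["hadoop", "namenode", "-format"], check=True)

# Start the HDFS daemons (NameNode, SecondaryNameNode, DataNodes) and then the
# YARN daemons (ResourceManager, NodeManagers), as described in the notes above.
subprocess.run([os.path.join(HADOOP_HOME, "sbin", "start-dfs.sh")], check=True)
subprocess.run([os.path.join(HADOOP_HOME, "sbin", "start-yarn.sh")], check=True)
```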


Jupyter On ElasticMapReduce

Tom Zeng shows how to install Jupyter Notebooks on Amazon’s ElasticMapReduce:

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.)  hadoop, the default admin user for JupyterHub, can be used to set up other users. The --password option sets the password for Jupyter and for the hadoop user for JupyterHub.

Installation is fairly straightforward, and they include a series of samples you can use to try out Jupyter.
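
For context, launching a cluster with that bootstrap action looks roughly like this via boto3. The S3 path to the install script, the bootstrap arguments, and the key pair name are placeholders; the real script location and full argument list are in the post.

```python
# A rough sketch of launching EMR with a Jupyter-installing bootstrap action via
# boto3.  The S3 script path, bootstrap arguments, and key pair name are
# placeholders; the real values are in the post.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="jupyter-on-emr",
    ReleaseLabel="emr-5.2.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],  # Spark is required for --r and --toree
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",  # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-jupyter",
            "ScriptBootstrapAction": {
                "Path": "s3://your-bucket/install-jupyter.sh",  # placeholder for the post's script
                "Args": ["--r", "--toree", "--jupyterhub", "--password", "jupyter", "--port", "8880"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster id:", response["JobFlowId"])
```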


Getting Finer-Grained Security In Spark

Vadim Vaks explains how to get finer-grained permissions within Spark using Ranger and LLAP:

With LLAP enabled, Spark reads from HDFS go directly through LLAP. Besides conferring all of the aforementioned benefits on Spark, LLAP is also a natural place to enforce fine-grained security policies. The only other capability required is a centralized authorization system. This need is met by Apache Ranger. Apache Ranger provides centralized authorization and audit services for many components that run on Yarn or rely on data from HDFS. Ranger allows authoring of security policies for HDFS, Yarn, Hive (Spark with LLAP), HBase, Kafka, Storm, Solr, Atlas, and Knox. Each of these services integrates with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time.

Read on for more details.
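
As a rough illustration of the Ranger side of this, here is a sketch of creating a column-level Hive policy through Ranger's public REST API. The host, credentials, service name, and database/table/column values are all placeholders, and the payload is a minimal take on the v2 policy format rather than anything from the linked post.

```python
# A sketch of creating a column-level Hive policy through Ranger's public REST
# API (basic auth).  Host, credentials, service name, and the database/table/
# column values are placeholders.
import requests

RANGER_URL = "http://ranger.example.com:6080"
RANGER_AUTH = ("admin", "changeme")  # placeholder credentials

policy = {
    "service": "cluster_hive",        # the Hive service registered in Ranger
    "name": "analysts-orders-no-pii",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["orders"]},
        "column": {"values": ["order_id", "region", "amount"]},  # only non-sensitive columns
    },
    "policyItems": [
        {
            "accesses": [{"type": "select", "isAllowed": True}],
            "groups": ["analysts"],
            "users": [],
        }
    ],
}

resp = requests.post(RANGER_URL + "/service/public/v2/api/policy",
                     json=policy, auth=RANGER_AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json()["id"])
```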


Spark Versus Flink

Sibanjan Das compares Apache Flink to Apache Spark:

The core concept of Apache Flink is a high-throughput, low-latency stream processing framework which also supports batch processing. The architecture is a flip of other Big Data processing architectures, where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low-latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospect of Apache Flink seems significant, and it looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams.  We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.


Understanding HDFS Disk Checks

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of the block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, the block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in a while, at a relatively lower frequency, without significant I/O usage.

This is a nice article for operations folks who own Hadoop clusters.
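
If you end up tuning the scanner, these are the two hdfs-site.xml properties I would look at first. The values shown are what I understand the defaults to be, so verify them against your Hadoop version.

```python
# The standard hdfs-site.xml knobs for the DataNode block scanner, shown as a
# simple property map.  Values are the commonly cited defaults; verify them
# against your Hadoop version before relying on them.
block_scanner_settings = {
    # I/O throttle for scanning, in bytes per second per volume (0 disables the scanner).
    "dfs.block.scanner.volume.bytes.per.second": "1048576",  # ~1 MB/s
    # How often each block should be scanned, in hours (504 hours = 3 weeks).
    "dfs.datanode.scan.period.hours": "504",
}
```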


Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is that the error message itself is correct, but it could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?
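
For anyone who wants to try this at home, the setup looks roughly like the sketch below, sent over pyodbc. The server name, namenode address, and object names are placeholders, and the external data source and file format are assumed to exist already; this is the standard Polybase pattern rather than the exact code from the post.

```python
# A rough sketch of the setup being described: a Polybase external table whose
# LOCATION is a folder rather than a single file, which is what lets INSERT INTO
# work (Polybase writes several files underneath that folder).  Server, namenode,
# and object names are placeholders; the external data source and file format
# are assumed to exist already.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=sqlserver.example.com;"
    "Database=Scratch;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("""
CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleDate DATE,
    Amount   DECIMAL(12, 2)
)
WITH
(
    LOCATION = '/tmp/sales_staging/',   -- a folder, not a file
    DATA_SOURCE = HDP,                  -- existing EXTERNAL DATA SOURCE (TYPE = HADOOP)
    FILE_FORMAT = DelimitedText         -- existing EXTERNAL FILE FORMAT
);
""")

# Requires sp_configure 'allow polybase export', 1 on the SQL Server instance.
cur.execute("INSERT INTO dbo.SalesStaging SELECT SaleDate, Amount FROM dbo.Sales;")
```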


Connecting Apache Drill To Power BI

Bryan Smith shows how to connect Apache Drill to Power BI:

Clicking Next takes me to the From ODBC dialog.  Here, I click on the Advanced options item, ignoring the Data Source Name (DSN) drop-down, and enter a connection string with the appropriate substitution for the host parameter:

driver={MapR Drill ODBC Driver};connectiontype=Direct;host=maprcluster-3xrrusnk-node0.westus.cloudapp.azure.com;port=31010;authenticationtype=No Authentication

Notice the connection string employs a Direct connection type, indicating that the app will speak directly to one of the nodes in the cluster (as identified by the host parameter) and not to the ZooKeeper service. ZooKeeper is in use on the cluster but is not exposed externally, given the network security group changes made during my earlier deployment.  Even if ZooKeeper were exposed, it tracks the nodes of the cluster using their internal names so that any app outside the virtual network containing the cluster would not be able to leverage the information in ZooKeeper to form a connection.  The only option that works here is the Direct connection type.

It’s worth reading the whole thing, as well as checking out the UserVoice suggestion for implementing full Apache Drill support.
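
That connection string works outside of Power BI as well; here is a quick connectivity check from pyodbc using the same string, with a placeholder host standing in for your own Drill node. The query hits Drill's built-in classpath sample data, so it should run on any cluster.

```python
# A quick connectivity check using the same MapR Drill ODBC connection string
# outside of Power BI.  Replace the host with one of your own Drill nodes.
import pyodbc

conn_str = (
    "driver={MapR Drill ODBC Driver};"
    "connectiontype=Direct;"
    "host=your-drill-node.example.com;"   # placeholder; use your node's public name
    "port=31010;"
    "authenticationtype=No Authentication"
)
conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()

# Drill ships sample data in its classpath storage plugin, so this query should
# work on any cluster.
cursor.execute("SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5")
for row in cursor.fetchall():
    print(row)
```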


Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access the Spark UI for the running application and get more detailed information on its execution, use the Application Master link and navigate through the different tabs, which contain more information on jobs, stages, executors, and so on.

These methods also apply for on-prem Spark clusters, although the resource locations might be a little different.
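
If you would rather script this than click through the UI, the same job, stage, and executor details are exposed through Spark's monitoring REST API. The history server host and port below are placeholders; 18080 is the usual default.

```python
# A rough sketch of pulling job/stage/executor information from Spark's
# monitoring REST API instead of clicking through the UI.  The history server
# host and port are placeholders; 18080 is the usual default.
import requests

HISTORY_SERVER = "http://sparkhistory.example.com:18080"

apps = requests.get(HISTORY_SERVER + "/api/v1/applications").json()
for app in apps[:5]:
    app_id = app["id"]
    print(app_id, app["name"])

    # Per-application endpoints mirror the UI tabs: /jobs, /stages, /executors.
    jobs = requests.get(HISTORY_SERVER + "/api/v1/applications/" + app_id + "/jobs").json()
    for job in jobs:
        print("  job", job["jobId"], job["status"])
```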


Hadoop And Active Directory

RK Kuppala explains how to integrate a Hadoop cluster with Active Directory:

This post explains kerberizing an existing Hadoop cluster using Ambari. Kerberos helps with the authentication part of enterprise security (with authorization, auditing, and data protection being the remaining parts).

HDP uses Kerberos, which is an industry standard for authenticating users and resources and providing strong identities for users. Apache Ambari can kerberize an existing cluster by using an existing MIT key distribution center (KDC) or Microsoft’s Active Directory.

This was a lot easier than I expected.
