WebHCat

Kevin Feasel

2017-03-08

Hadoop

Jiang Mouren has a two-parter on WebHCat.  First, how it works:

SSH shell/Oozie hive action directly interact with YARN for HIVE execution where as Program using HdInsight Jobs SDK/ADF (Azure Data Factory) uses WebHCat REST interface to submit the jobs.

WebHCat is a REST interface for remote jobs (Hive, Pig, Scoop, MapReduce) execution. WebHCat translates the job submission requests into YARN applications and reports the status based on the YARN application status. WebHCat results are coming from YARN and troubleshooting some of them needs to go to YARN.

Then, how to debug issues:

2.1.2. WebHCat times out

HDInsight Gateway times out responses which take longer than 2Minutes resulting in “502 BadGateway”. WebHCat queries YARN services for job status and if they take longer than the request might timeout.

When this happens collect the following logs for further investigation:

/var/log/webchat. Typical contents of directory will be like

  • webhcat.log is the log4j log to which server writes logs
  • webhcat-console.log is stdout of server is started.
  • webhcat-console-error.log is stderr of server process

NOTE: webhcat.log will roll-over daily hence files like webhcat.log.YYYY-MM-DD will also present. For logs to a specific time range make sure that appropriate file is selected.

Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.

Related Posts

Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data: The problem is that there are 3,000 map tasks are launched to read the daily data and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s […]

Read More

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming: Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot […]

Read More

Categories

March 2017
MTWTFSS
« Feb Apr »
 12345
6789101112
13141516171819
20212223242526
2728293031