Jiang Mouren has a two-parter on WebHCat. First, how it works:
An SSH shell or Oozie Hive action interacts directly with YARN for Hive execution, whereas a program using the HDInsight Jobs SDK or ADF (Azure Data Factory) uses the WebHCat REST interface to submit jobs.
WebHCat is a REST interface for remote job execution (Hive, Pig, Sqoop, MapReduce). WebHCat translates job submission requests into YARN applications and reports status based on the YARN application status. Because WebHCat results come from YARN, troubleshooting some of them requires going to YARN.
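To make the REST side concrete, here is a minimal sketch of what a Hive submission and a status check might look like as HTTP requests. It only builds the URLs and form body and sends nothing. The `/templeton/v1/hive` and `/templeton/v1/jobs/{id}` resources and the `execute`, `statusdir`, and `user.name` parameters come from the Apache WebHCat API. The `azurehdinsight.net` host pattern, the cluster name, and the helper function names are assumptions for illustration.

```python
from urllib.parse import urlencode


def webhcat_base(cluster: str) -> str:
    # Assumption: the cluster's public endpoint follows this host pattern
    # and exposes WebHCat (Templeton) under /templeton/v1.
    return f"https://{cluster}.azurehdinsight.net/templeton/v1"


def hive_submit_request(cluster: str, user: str, query: str, status_dir: str):
    """Return (url, form_body) for a POST to the hive resource; nothing is sent."""
    url = f"{webhcat_base(cluster)}/hive"
    body = urlencode({
        "user.name": user,         # user the job runs as
        "execute": query,          # Hive statement(s) to run
        "statusdir": status_dir,   # where WebHCat writes job output and status
    })
    return url, body


def job_status_url(cluster: str, user: str, job_id: str) -> str:
    """Return the GET URL for a job's status, which WebHCat derives from YARN."""
    return f"{webhcat_base(cluster)}/jobs/{job_id}?{urlencode({'user.name': user})}"
```

A real client would send these with HTTP basic authentication and poll the status URL until the job completes.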
Then, how to debug issues:
2.1.2. WebHCat times out
The HDInsight Gateway times out responses that take longer than 2 minutes, resulting in “502 BadGateway”. WebHCat queries YARN services for job status, and if those queries take longer than that, the request might time out.
When this happens, collect the following logs for further investigation:
/var/log/webhcat. Typical contents of the directory are:
- webhcat.log is the log4j log to which the server writes logs
- webhcat-console.log is the stdout of the server process when it is started
- webhcat-console-error.log is the stderr of the server process
NOTE: webhcat.log rolls over daily, so files like webhcat.log.YYYY-MM-DD will also be present. For logs covering a specific time range, make sure the appropriate file is selected.
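Picking the right rolled-over file can be sketched as a small helper. This is an illustration, not part of the original article. It assumes each webhcat.log.YYYY-MM-DD file holds the entries for that date and that the current day's entries are still in webhcat.log. The function name is hypothetical and the log directory is passed in by the caller.

```python
from datetime import date, timedelta


def webhcat_logs_for_range(start: date, end: date, today: date, log_dir: str):
    """List the log files expected to cover start..end (inclusive).

    Assumption: days before `today` have rolled over to
    webhcat.log.YYYY-MM-DD; `today` (or later) is still in webhcat.log.
    """
    paths = []
    day = start
    while day <= end:
        if day >= today:
            name = "webhcat.log"
        else:
            name = f"webhcat.log.{day.isoformat()}"
        path = f"{log_dir}/{name}"
        if path not in paths:  # avoid listing webhcat.log twice
            paths.append(path)
        day += timedelta(days=1)
    return paths
```

Because rollover typically happens on the first write after midnight, it is worth also checking webhcat.log when the range ends on the previous day.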
Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.