Press "Enter" to skip to content

Category: Hadoop

Sqoop From MySQL To Cloudera

Alan Choi and Laurel Hale show us how to use Sqoop to migrate data from MySQL into Impala:

The basic import steps described for tiny tables apply to importing bigger tables into Impala. The difference occurs when you construct your sqoop import command. For large tables, you want it to run fast, so setting parallelism to 1, which specifies one map task during the import, won’t work well. Instead, using the default parallelism setting, which is 4 map tasks to import in parallel, is a good place to start. So you don’t need to specify a value for the -m option unless you want to increase the number of parallel map tasks.
Another difference is that bigger tables usually have a primary key, which becomes a good candidate for splitting the data without skewing it. The tiny_table we imported earlier doesn’t have a primary key. Also note that the -e option for the sqoop import command, which instructs Sqoop to import the data returned for the specified SQL statement, doesn’t work if you split data on a string column. If string columns are used to split the data with the -e option, it generates incompatible SQL. So if you decide to split data on the primary key for your bigger table, make sure the primary key is on a column of a numeric data type, such as int, which works best with the -e option because it generates compatible SQL.

Read the whole thing. Sqoop has been around for a while because it does its job well.
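
For reference, an import along those lines tends to look something like the command below. This is a sketch rather than the exact command from the post: the connection string, table, paths, and the numeric primary key order_id are placeholders. Leaving out -m keeps the default of four parallel map tasks, and --split-by points at a numeric key, as the authors recommend.

sqoop import \
  --connect jdbc:mysql://mysql-host/sales_db \
  --username sqoop_user \
  --password-file /user/sqoop/.mysql-password \
  --table orders \
  --split-by order_id \
  --as-parquetfile \
  --target-dir /user/hive/warehouse/orders

Once the Parquet files land in HDFS, an Impala table created over that directory can query them.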


Lessons Learned From A Kafka Streams Implementation

Rishi Dhanaraj provides us with some lessons learned from implementing Kafka Streams to read data from Cassandra and Mongo and write into Mongo:

This Python script ran on a single machine, and is from the early days of the company. However, it didn’t scale, since it can’t run in a distributed manner. As a result, this Python job ends up flapping—crashing and restarting regularly in production, depending on the load it needs to process.

Second, the Python script puts read pressure on MongoDB and Cassandra, because it has to query the databases for each batch of walk-ins and Zenreach Messages. MongoDB and Cassandra are our primary databases for serving customer read queries. So we wanted to remove the additional read pressure added by this job, which currently competes for resources with our customers.

For these reasons, we wanted to move to a streaming solution—specifically, Kafka Streams. We already switched to Kafka Streams for walk-in detection, which my teammate Eugen Feller explained in a previous post.

Click through for a review of the architecture and some tips if you want to do this yourself.
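
For a sense of what the replacement looks like in code, here is a minimal Kafka Streams topology in Scala, assuming the kafka-streams-scala artifact (in newer Kafka versions the Serdes import moved to org.apache.kafka.streams.scala.serialization). The topic names and the transformation are made up for illustration; the actual pipeline in the post involves Cassandra and MongoDB, which this sketch doesn’t attempt to reproduce.

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object WalkInStream extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "walkin-stream")      // application id doubles as the consumer group
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker

  val builder = new StreamsBuilder()
  builder.stream[String, String]("walkins-raw")   // hypothetical input topic
    .mapValues(_.toUpperCase)                     // stand-in for the real enrichment logic
    .to("walkins-enriched")                       // hypothetical output topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.ShutdownHookThread(streams.close())         // shut down cleanly on exit
}

Because the topology runs on Kafka’s consumer group protocol, scaling out is a matter of starting more instances with the same application ID, which is exactly the property the single-machine Python script lacked.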


Overwriting Data In Use With Databricks

Piotr Starczynski shows us how we can read data from a table, transform it, and write it back to the same file:

Recently I ran into an interesting problem in Databricks with a non-Delta table. I tried to read data from a table (a table on top of a file), slightly transform it, and write it back to the same location I had been reading from. An attempt to execute code like that manifests with the exception: “org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from”

So let’s try to answer the question: how do we write into a table (dataframe) that we are reading from? This might be a common use case.

This problem is trivial, but it is very confusing if we do not understand how queries are processed in Spark.

Click through for the answer. I’m a little squeamish about doing this because my expectation is for data to flow from one source to another source; feeding the data back to the initial source feels strange, like running a load of clothes through the washer and dryer and then dumping them back into the hamper with the remainder of the dirty clothes.
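
To give away the flavor of the fix without spoiling the post (and this may not be the exact approach Piotr takes): the overwrite fails because the write plan still references the table being read, so one common workaround is to materialize the transformed result somewhere else first. A rough Scala sketch, assuming a Databricks notebook where spark is already defined and the paths are placeholders:

import spark.implicits._

val df          = spark.read.parquet("/data/events")    // the non-Delta table on top of files
val transformed = df.filter($"status" === "active")     // some transformation

// write to a staging location first, so the final overwrite no longer reads from its own source
transformed.write.mode("overwrite").parquet("/tmp/events_staging")
spark.read.parquet("/tmp/events_staging")
  .write.mode("overwrite").parquet("/data/events")

An eager df.checkpoint() can achieve something similar by cutting the lineage, at the cost of configuring a checkpoint directory.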


Enabling Cloudera Manager Debug Mode

Guy Shilo has a quick tip around debugging in Cloudera Manager:

This is a short post but it can save you some wandering and searching.

Sometimes when you try to find and fix issues with Cloudera Manager you will want to increase the log level to debug so you can see what’s wrong.

The procedure cannot be found in the documentation (or at least cannot be found easily), so here is how it’s done:

As you’d expect, going into debug mode generates a lot of data on a real cluster, so use sparingly.


Working With WebHDFS From Node.js

Somanth Veettil shows us how to use Node.js to work with the WebHDFS REST API:

There is an npm module, “node-webhdfs,” with a wrapper that allows you to access Hadoop WebHDFS APIs. You can install the node-webhdfs package using npm:
npm install webhdfs 
After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.

Click through for examples on how the package works.
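
The npm package is a thin wrapper over the WebHDFS REST endpoints, so it helps to know what the raw calls look like. The hostname, user, and paths below are placeholders, and the port is 9870 on Hadoop 3 (50070 on older 2.x clusters):

curl -i "http://namenode-host:9870/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice"
curl -i -L "http://namenode-host:9870/webhdfs/v1/user/alice/data.csv?op=OPEN&user.name=alice"

The -L on the second call matters: OPEN responds with a redirect to a datanode, which is the sort of detail the wrapper’s read and write streams handle for you.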


Impala Improvements in CDH 5.15.0

Michael Ho, et al, share some improvements in Apache Impala’s scalability in the Cloudera Distribution of Hadoop:

Kudu RPC (KRPC) supports asynchronous RPCs. This removes the need to have a single thread per connection. Connections between hosts are long-lived. All RPCs between two hosts multiplex on the same established connection. This drastically cuts down the number of TCP connections between hosts and decouples the number of connections from the number of query fragments.

The error handling semantics are much cleaner and the RPC library transparently re-establishes broken connections. Support for SASL and TLS is built-in. KRPC uses protocol buffers for payload serialization. In addition to structured data, KRPC also supports attaching binary data payloads to RPCs, which removes the cost of data serialization and is used for large data objects like Impala’s intermediate row batches. There is also support for RPC cancellation, which comes in handy when a query is cancelled because it allows query teardown to happen sooner.

Looks like there were some pretty nice gains out of this project.


Azure Data Factory Data Flows

Joost van Rossum takes a look at data flows in Azure Data Factory:

2) Create Databricks Service
Yes, you are reading this correctly. Under the hood, Data Factory is using Databricks to execute the Data flows, but don’t worry, you don’t have to write code.
Create a Databricks service and choose the right region. This should be the same as your storage region to prevent high data movement costs. For the pricing tier, you can use Standard for this introduction. Creating the service itself doesn’t cost anything.

Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool, but even within that space, there are times when you need to do a bit more than pump-and-dump.
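
If you would rather script that Databricks step than click through the portal, there is an Azure CLI route as well. This is a sketch with placeholder names, and it assumes the databricks CLI extension is available; the key point carries over from Joost’s warning, which is that the workspace location should match your storage region to avoid data movement costs.

az extension add --name databricks
az databricks workspace create \
  --resource-group my-adf-rg \
  --name my-dataflow-databricks \
  --location westeurope \
  --sku standard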


Password Protect Everything, Including Hadoop

George Leopold summarizes a recent Securonix report:

The malware spreads via brute-force attacks on weak passwords “or by exploiting one of three vulnerabilities found on Hadoop YARN Resource Manager, Redis [in-memory key-value store service] and ActiveMQ,” Securonix said. Once logged into database services, the malware can for example delete existing databases stored on a server and create another with a ransom note specifying a bitcoin payment.

The security analyst recommends continuous review of cloud-based services like Hadoop and YARN instances and their exposure to the Internet. Along with strong passwords, companies should “restrict access whenever possible to reduce the potential attack surface.”

It’s pretty standard advice: secure your data, password-protect your systems, and minimize the number of computers that get to touch your computers.


What To Know Before Integrating With Apache Kafka

Adi Polak gives us seven helpful tips to think about before building a Kafka cluster:

2 — You shouldn’t send large messages or payloads through Kafka
According to Apache Kafka, for better throughput, the max message size should be 10KB. If the messages are larger than this, it is better to check the alternatives or find a way to chop the message into smaller parts before writing to Kafka. The best practice for doing so is to use a message key to make sure all chopped messages will be written to the same partition.

Read the whole thing.
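
The keying advice in that excerpt is the part people tend to miss: if you chop a large payload into pieces, producing every piece with the same key routes them all to one partition, so they stay together and in order for the consumer to reassemble. A minimal Scala sketch, where the topic, broker, and chunk size are made up for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer   = new KafkaProducer[String, String](props)
val bigPayload = "x" * 50000        // stand-in for a payload well over 10KB
val messageKey = "order-12345"      // one key for every chunk of this message

bigPayload.grouped(10 * 1024).zipWithIndex.foreach { case (chunk, i) =>
  // same key => same partition => chunks arrive in order
  producer.send(new ProducerRecord("chunked-payloads", messageKey, s"$i:$chunk"))
}
producer.close()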


Spark And Splitting DataFrames

Giovanni Lanzani explains that one technique to split a data frame doesn’t quite work as expected:

Recently I was delivering a Spark course. One of the exercises asked the students to split a Spark DataFrame in two, non-overlapping, parts.

One of the students came up with a creative way to do so.

He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id — you can find how to use it in the docs.

Read on to see how this didn’t quite work right, why it didn’t work as expected, and one alternative.
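Whatever the exact failure mode (read on for Giovanni's diagnosis), if all you need is two non-overlapping pieces, randomSplit is the usual tool and may well be the alternative he has in mind. A small Scala sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-sketch").master("local[*]").getOrCreate()
val df    = spark.range(1000).toDF("id")   // toy DataFrame

// weights are relative, and the seed makes the split reproducible
val Array(left, right) = df.randomSplit(Array(0.5, 0.5), seed = 42)

println(s"left: ${left.count()}, right: ${right.count()}")  // roughly 500 each; each row lands in exactly one split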
