Category: Hadoop

Kafka Streams Basics

Published 2017-07-25 by Kevin Feasel

Anuj Saxena walks through Kafka Streams and provides a quick example:

The features provided by Kafka Streams:

Highly scalable, elastic, distributed, and fault-tolerant application.
Stateful and stateless processing.
Event-time processing with windowing, joins, and aggregations.
We can use the already-defined most common transformation operation using Kafka Streams DSL or the lower-level processor API, which allow us to define and connect custom processors.
Low barrier to entry, which means it does not take much configuration and setup to run a small scale trial of stream processing; the rest depends on your use case.
No separate cluster requirements for processing (integrated with Kafka).
Employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time based windowing operations with the late arrival of records.
Supports Kafka Connect to connect to different applications and databases.

Read on for more details as well as a sample script to get started.

Comments closed

R For Apache Impala

Published 2017-07-24 by Kevin Feasel

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.

Comments closed

Apache Drill Interface For R

Published 2017-07-18 by Kevin Feasel

Bob Rudis announces a new package on CRAN:

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t work “out of the box” with sergeant‘s dplyr interface, file an issue and I’ll make a special one for it in the package.

Seems quite useful if you’re working with MapR. H/T R-bloggers

Comments closed

Using Sqoop To Move Data To Hadoop

Published 2017-07-18 by Kevin Feasel

The folks at Redglue have a few hints on using Sqoop to move data from a relational database to Hadoop:

“Data gets updated” problem

Data gets updated many times and loading data with Sqoop is not a single event as data that you are importing can be updated (INSERTed, DELETed or UPDATed). What is important here, is that, HDFS is an “append-only filesystem” (exceptions made to HBase and Hive with ACID, but they are mostly tricks) and the options are pretty simple: replace the dataset, add data to dataset (partition for example) or merge datasets between old and new data.

If the data that you are loading is a small dataset, don’t think twice, replace and overwrite it.

If the data that you are loading is a big data set, a “incremental” load is recommended. This can be a little tricky as Sqoop needs to know what modification were done since the last incremental or full import.

I’m not a huge fan of Sqoop and prefer to use my own ingest mechanisms, but it’s an easy way to get started.

Comments closed

Hadoop Name Node Capacity Planning

Published 2017-07-17 by Kevin Feasel

Mamta Chawla has some rules of thumb for sizing your Hadoop name node:

Both name node servers should have highly reliable storage for their namespace storage and edit-log journaling. That’s why — contrary to the recommended JBOD for data nodes — RAID is recommended for name nodes.

Master servers should have at least four redundant storage volumes — some local and some networked — but each can be relatively small (typically 1TB).

It is easy to determine the memory needed for both name node and secondary name node. The memory needed by name node to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by the secondary name node should be identical to the name node.

Click through for some specific recommendations.

Comments closed

An Introduction To Kafka

Published 2017-07-14 by Kevin Feasel

Prashant Sharma explains the basics of Apache Kafka:

Apache describes Kafka as a distributed streaming platform that lets us:

Publish and subscribe to streams of records.
Store streams of records in a fault-tolerant way.
Process streams of records as they occur.

Kafka is probably the most generally interesting of the current Hadoop ecosystem, with Spark not too far behind. By “generally interesting,” I mean in the sense that companies with no vested interest in Hadoop as a whole could still be excited by the prospect of Kafka.

Comments closed

Cloudera Enterprise 5.12 GA

Published 2017-07-14 by Kevin Feasel

Fred Koopmans announces that Cloudera Enterprise 5.12 is now generally available:

Data Science & Engineering

Cloudera Data Science Workbench enhancements include:
- GPU Support: Cloudera Data Science Workbench now enables popular deep learning frameworks to run on GPUs, both on-premises and in the cloud.
- Embedded Web UIs: Users can work with the Apache Spark Web UI for Spark sessions. Other interactive web applications like TensorBoard, Shiny, and Plotly now appear directly in the workbench.
- Enhanced Job Scheduling: Cloudera Data Science Workbench users can now schedule jobs directly from external schedulers or orchestration systems via the new Jobs API.

Read on for more enhancements.

Comments closed

How Kafka Does Exactly-Once

Published 2017-07-05 by Kevin Feasel

Neha Narkhede explains exactly-once messaging and describes how Kafka implements their exactly-once process:

In a distributed system, the computers that make up the system can always fail independently of one another. In the case of Kafka, an individual broker can crash, or a network failure can happen while the producer is sending a message to a topic. Depending on the action the producer takes to handle such a failure, you can get different semantics:

At least once semantics: if the producer receives an acknowledgement (ack) from the Kafka broker and acks=all, it means that the message has been written exactly once to the Kafka topic. However, if a producer ack times out or receives an error, it might retry sending the message assuming that the message was not written to the Kafka topic. If the broker had failed right before it sent the ack but after the message was successfully written to the Kafka topic, this retry leads to the message being written twice and hence delivered more than once to the end consumer. And everybody loves a cheerful giver, but this approach can lead to duplicated work and incorrect results.
At most once semantics: if the producer does not retry when an ack times out or returns an error, then the message might end up not being written to the Kafka topic, and hence not delivered to the consumer. In most cases it will be, but in order to avoid the possibility of duplication, we accept that sometimes messages will not get through.
Exactly once semantics: even if a producer retries sending a message, it leads to the message being delivered exactly once to the end consumer. Exactly-once semantics is the most desirable guarantee, but also a poorly understood one. This is because it requires a cooperation between the messaging system itself and the application producing and consuming the messages. For instance, if after consuming a message successfully you rewind your Kafka consumer to a previous offset, you will receive all the messages from that offset to the latest one, all over again. This shows why the messaging system and the client application must cooperate to make exactly-once semantics happen.

Read on for a discussion of technical details. I appreciate how Neha linked to a 60+ page design document as well, for those wanting to dig into the details.

Comments closed

Connecting Hadoop To LDAP

Published 2017-07-05 by Kevin Feasel

Giovanni Lanzani walks through some of the difficulties of getting LDAP working with Hadoop:

This section could probably could have much less workarounds if I’d knew more about LDAP.

But I’m a data scientist at heart and I want to get things done.

If you ever dealt with Hadoop, you know that there are a bunch of non-interactive users, i.e. users who are not supposed to login, such as hdfs, spark, hadoop, etc. These users are important to have. However the groups with the same name are also important to have. For example when using airflow and launching a spark job, the log folders will be created under the airflow user, in the spark group.

LDAP, however, doesn’t allow you, to my knowledge, to have overlapping user/groups, as Unix does.

The way I solved it was to create, in LDAP, the spark_user (or hdfs_user or …) to work around this limitation.

Also, Giovanni apparently lives in an interesting neighborhood.

Comments closed

S3Guard

Published 2017-06-30 by Kevin Feasel

Mingliang Liu and Rajesh Balamohan explain why you shouldn’t use S3 as your primary Hadoop data store, as well as a tool which helps mitigate those problems:

Some of the real world use cases which can be impacted due to the S3 eventual consistency model are:

Listing Files. Newly created files might not be visible for data processing. In Hive, Spark and MapReduce, this can lead to erroneous results from incomplete source data or failure to commit all intermediate results.
ETL Workflow. Systems like Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.
Existence-guarded path operations. Any action which fails if the destination path is present may see a deleted file in a listing, and so fail — even though the file has already been deleted.

Read on to see how S3Guard works and how to enable it in HDP 2.6.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30