Press "Enter" to skip to content

Category: Hadoop

Launching A Sparklyr Cluster

David Smith shows how to launch a sparklyr cluster in Azure:

When you’re finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You’ll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)

That’s it! Once you get used to it, it’s all quick and easy — the longest part is waiting for the cluster to spin up in Step 5. This is just a summary; for the full details, see the guide SparklyR on Azure with AZTK.

It’ll take a bit more than five minutes to get started, but it is a good sight easier than building the servers yourself.

Comments closed

Securing KSQL

Yeva Byzek shows the methods available to secure a Kafka Streams application:

To connect to a secured Kafka cluster, Kafka client applications need to provide their security credentials. In the same way, we configure KSQL such that the KSQL servers are authenticated and authorized, and data communication is encrypted when communicating with the Kafka cluster. We can configure KSQL for:
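
Concretely, those client-side credentials are just additional properties on the Kafka client. The sketch below is a minimal Java consumer configured for SASL_SSL; the broker address, truststore path, and username/password are placeholder values, and KSQL servers take the analogous settings in their own configuration, which is what the post walks through.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SecuredConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker address on the TLS listener
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "secured-example");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // Encrypt traffic and authenticate via SASL/PLAIN (credentials are placeholders)
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "PLAIN");
            props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"ksql-user\" password=\"ksql-secret\";");
            props.put("ssl.truststore.location", "/etc/kafka/secrets/client.truststore.jks");
            props.put("ssl.truststore.password", "changeit");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Once authenticated and authorized, normal client operations work as usual
                consumer.listTopics().keySet().forEach(System.out::println);
            }
        }
    }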

Read the whole thing if you’re thinking about using Kafka Streams.

Comments closed

Streaming ETL In Practice Using KSQL

Robin Moffatt builds an example of streaming ETL using Oracle, GoldenGate, and Kafka:

So in this post I’m going to show an example of what streaming ETL looks like in practice. I’m replacing batch extracts with event streams, and batch transformation with in-flight transformation of these event streams. We’ll take a stream of data from a transactional system built on Oracle, transform it, and stream it into Elasticsearch to land the results, but your choice of datastore is up to you—with Kafka’s Connect API you can stream the data to almost anywhere! Using KSQL we’ll see how to filter streams of events in real-time from a database, how to join between events from two database tables, and how to create rolling aggregates on this data.
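
Robin does all of this with KSQL statements, which execute as Kafka Streams applications under the covers. Purely to make the filter/join/aggregate ideas concrete, here is a rough Java sketch of the same kinds of operations in the Kafka Streams DSL, with invented topic names and a toy string payload standing in for the real Avro records.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class StreamingEtlSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Change events from the source system, keyed by customer id (topic names are invented)
            KStream<String, String> orders = builder.stream("orders");
            KTable<String, String> customers = builder.table("customers");

            // Filter the event stream in flight (stand-in predicate)
            KStream<String, String> validOrders =
                orders.filter((customerId, order) -> order != null && !order.isEmpty());

            // Join events against a table built from a second topic (enrichment)
            KStream<String, String> enriched =
                validOrders.join(customers, (order, customer) -> order + " | " + customer);
            enriched.to("orders_enriched");

            // Rolling aggregate: count of orders per customer
            KTable<String, Long> ordersPerCustomer = enriched.groupByKey().count();
            ordersPerCustomer.toStream()
                .to("orders_per_customer", Produced.with(Serdes.String(), Serdes.Long()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-etl-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }
    }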

It’s a very useful example.

Comments closed

Automating HDF Cluster Deployment

Ali Bajwa has a how-to guide for automating HDF 3.1 cluster deployment on AWS:

The release of HDF 3.1 brings about a significant number of improvements in HDF: Apache NiFi 1.5, Kafka 1.0, plus the new NiFi Registry. In addition, there were improvements to the Storm, Streaming Analytics Manager, and Schema Registry components. This article shows how you can use the ambari-bootstrap project to easily generate a blueprint and deploy HDF clusters to either single-node or development/demo environments in 5 easy steps. For a quick single-node setup, a prebuilt AMI is available for AWS, as well as a script that automates these steps, so you can deploy the cluster in a few commands.

Click through for the installation guide.

Comments closed

The Business Value Of Upgrading To Hadoop 3

Roni Fontaine, Vinod Vavilapalli, and Saumitra Buragohain explain some of the business case for upgrading to Hadoop 3 from Hadoop 2:

Hadoop 2 doesn’t support GPUs. Hadoop 3 enables scheduling of additional resources, such as disks and GPUs, for better integration with containers, deep learning & machine learning. This feature provides the basis for supporting GPUs in Hadoop clusters, which enhances the performance of computations required for Data Science and AI use cases.

Hadoop 2 cannot accommodate intra-node disk balancing; Hadoop 3 can. If you are repurposing or adding new storage to an existing server with older, smaller-capacity drives, the disks within that server end up unevenly filled. With intra-node disk balancing, the space in each disk is evenly distributed.

Hadoop 2 has only inter-queue preemption across queues. Hadoop 3 introduces intra-queue preemption, which takes this a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within the queue based on user limits and/or application priority.

Read on for more examples.

Comments closed

Installing Apache Mesos On EC2

Anubhav Tarar has a guide for setting up Apache Mesos along with Spark and Hadoop on EC2:

Apache Mesos is an open source project for managing computer clusters, originally developed at the University of California. It sits between the application layer and the operating system to help applications run efficiently in a large-scale distributed environment.

In this blog, we will see how to set up a Mesos client and master on EC2 from scratch.

Read on for the step-by-step guide.

Comments closed

User-Defined Functions In KSQL

Kai Waehner demonstrates building a user-defined function for Kafka Streams:

As you can see, the full implementation is just a few lines of Java code. In general, you need to implement the logic between receiving input and returning output of the UDF in the evaluate() method. You also need to implement exception handling (e.g. invalid input arguments) where applicable. The init() method is empty in this case, but could initialise any required object instances.

Note that this UDF has state: dateFormat can be null or already initialized. However, no worries. You do not have to manage the scope as Kafka Streams (and therefore KSQL) threads are independent of each other. So this won’t cause any issues.
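
To make that concrete, here is a minimal sketch of what such a UDF could look like in the KSQL of that era, where custom functions implemented a Kudf-style interface with init() and evaluate() methods. The class name, the exact interface package, and the timestamp-formatting behavior below are illustrative assumptions rather than details taken from Kai’s post, and the interface signature may differ between KSQL versions.

    import java.text.SimpleDateFormat;
    import io.confluent.ksql.function.udf.Kudf;  // interface location may vary by KSQL version

    // Hypothetical UDF: converts an epoch-millisecond timestamp into a formatted string.
    public class TimestampToStringUdf implements Kudf {

        // Lazily initialized state, as described in the post
        private SimpleDateFormat dateFormat = null;

        @Override
        public void init() {
            // Nothing to set up ahead of time in this example
        }

        @Override
        public Object evaluate(Object... args) {
            if (args.length != 2) {
                throw new IllegalArgumentException("expected (timestamp, format) arguments");
            }
            long timestamp = ((Number) args[0]).longValue();
            String pattern = args[1].toString();

            // Initialize the formatter on first use; KSQL threads do not share this instance,
            // so no extra synchronization is needed.
            if (dateFormat == null) {
                dateFormat = new SimpleDateFormat(pattern);
            }
            return dateFormat.format(new java.util.Date(timestamp));
        }
    }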

Click through for the entire process.

Comments closed

Hortonworks DataFlow 3.1 Released

George Vetticaden and Haimo Liu announce Hortonworks DataFlow version 3.1:

Apache Kafka 1.0 support with full integration with HDF Services – Kafka 1.0 provides important new features including more stringent message processing semantics with support for message headers and transactions, performance improvements and advanced security options.

  • Apache Ambari support for Kafka 1.0 – Install, configure, manage, upgrade, monitor, and secure Kafka 1.0 clusters with Ambari.

  • Apache Ranger support for Kafka 1.0 – Manage access control policies (ACLs) using resource or tag-based security for Kafka 1.0 clusters.

  • New NiFi and SAM processors for Kafka 1.0 – New processors in NiFi and Hortonworks Streaming Analytics Manager (SAM) support Kafka 1.0 features including message headers and transactions.
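
Message headers and transactions are surfaced through Kafka’s Java producer API, which is what the new NiFi and SAM processors are able to take advantage of. As a quick illustration, here is a minimal producer sketch using both; the broker address, topic, and header name are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class HeadersAndTransactionsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-tx-producer"); // enables transactions

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();

                ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "key-1", "{\"value\": 42}");
                // Attach a header: metadata that travels alongside the message payload
                record.headers().add("source-system", "demo".getBytes(StandardCharsets.UTF_8));

                producer.send(record);
                producer.commitTransaction();  // all records in the transaction become visible, or none
            }
        }
    }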

Click through for the list of top changes.

Comments closed

KSQL 0.4 Released

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enable multiple consumer processes to divide the work of consuming a Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.
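
As a rough sketch of how that division of work looks from the client side: every consumer started with the same group.id joins the same group and is assigned a disjoint subset of the topic’s partitions. Running the class below twice against the same (placeholder) topic splits the partitions between the two processes.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupMemberExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-processors");     // same id = same group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));       // placeholder topic
                while (true) {
                    // Each group member receives records only from its assigned partitions
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition %d, offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }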

Read on for the full set of changes.

Comments closed

Understanding Kafka Consumers And Offsets

Simarpreet Kaur Monga builds a simple Kafka consumer in Scala to demonstrate how offsets work:

The method endOffsets accepts a collection of TopicPartition, for which you want to find the endOffsets.

As I want to find the endOffsets of the partitions that are assigned to my topic, I have passed the value of consumer.assignment() as the parameter to endOffsets. consumer.assignment gives the set of TopicPartitions that the Consumer has been assigned.

Note: You should call the method assignment only after calling poll on the consumer; otherwise, it will give null as the result. Additionally, the method endOffsets doesn’t change the position of the consumer, unlike seek methods, which do change the consumer position/offset.
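
Simarpreet’s example is in Scala; the same KafkaConsumer calls from Java look roughly like the sketch below, with the broker address and topic name as placeholders. Note the poll before assignment, per the caveat above.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class EndOffsetsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-inspector");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));       // placeholder topic

                // Partitions are only assigned after poll has been called at least once
                consumer.poll(Duration.ofSeconds(1));
                Set<TopicPartition> assigned = consumer.assignment();

                // endOffsets reports the next offset to be written for each partition,
                // without moving the consumer's own position
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assigned);
                endOffsets.forEach((tp, offset) ->
                    System.out.println(tp + " end offset = " + offset));
            }
        }
    }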

Read the whole thing.

Comments closed