Category: Hadoop

How Humio Uses Kafka

Published 2018-10-19 by Kevin Feasel

Kresten Krab describes ways that Humio uses Apache Kafka for their product:

Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed for “on-prem first” because, in many logging use cases, you need the privacy and security of managing your own logging solution. And because volume limitations can often be a problem in Hosted scenarios.

From a software provider’s point of view, fixing issues in an on-prem solution is inherently problematic, and so we have strived to make the solution simple. To realize this goal, a Humio installation consists only of a single process per node running Humio itself, being dependent on Kafka running nearby (we recommend deploying one Humio node per physical CPU so a dual-socket machine typically runs two Humio nodes).

We use Kafka for two things: buffering ingest and as a sequencer of events among the nodes of a Humio cluster.

Read on for more details and a few tips on using Kafka to its fullest.

Comments closed

Looking At Databricks Cluster Pricing

Published 2018-10-19 by Kevin Feasel

Tristan Robinson takes a look at Azure Databricks pricing:

The use of databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms.

To run any type of workload on the platform, you will need to setup a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes, i.e. give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration/cost. In the following blog I’ll start by looking at the pricing in a bit more detail which will aim to provide a cost element to the cluster configuration process.

There are a few complicating factors in figuring out cluster price but rest assured that it will be costly.

Comments closed

Generating Load For Kafka With JMeter

Published 2018-10-17 by Kevin Feasel

Anup Shirolkar shows us a way to use JMeter to generate load for Apache Kafka clusters:

The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina will be accumulating a lot of events which are to be processed to discover anomalies. The exact sequence of processing is still being prototyped at this point in time, but there is a solid requirement of a tool/mechanism to load the Kafka cluster with lots of data in a hurry.

The requirements pointed me in direction of looking for ‘Kafka Load Testing’. Firstly thinking of load testing, one tool comes to mind which is used very widely for load testing of Java based systems: ‘Jmeter’. Jmeter has rich toolset to perform various types of testing. It also comes with many advantages viz. Open source, easy to use, platform independent, distributed testing etc. I can use Jmeter and test its ability to perform cluster loading.

Read on for the demonstration.

Comments closed

Data Science And Data Engineering In HDP 3.0

Published 2018-10-17 by Kevin Feasel

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers:

We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different blog (please see the video)- to summarize, as we trained the car on a track, we collected about 30K images with corresponding steering angle data. The training data was stored in a HDP 3.0 cluster and the TensorFlow model was trained using 6 GPU cards and then the model was deployed back on the car. The deep learning use case highlights the combined power of HDP 3.0.

Click through for more additions and demos.

Comments closed

Security Improvements In Kafka And Confluent Platform

Published 2018-10-12 by Kevin Feasel

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0:

Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature set to help with:

Using standard and central directory services like Active Directory (AD)/Lightweight Directory Access Protocol (LDAP)

Simplifying the management of access control lists (ACLs)

Proactive management and monitoring of security configurations to address the gaps as soon as possible

The following new security features are available in both Confluent Platform 5.0 and Apache Kafka 2.0:

Support for ACL-prefixed wildcards to simplify the management of access control

Kafka Connect password protection with support for externalizing secrets (to “secrets stores,” etc., like Hashicorp Vault)

The following security features are available only in Confluent Platform 5.0:

AD/LDAP group support

Feature access controls in Confluent Control Center

Viewing of broker configurations in Confluent Control Center, including differences in security configurations between brokers

Let’s walk through each of these enhancements in detail.

Read on for examples.

Comments closed

SparkSession Versus SparkContext

Published 2018-10-12 by Kevin Feasel

Abhishek Baranwal explains the differences between the SparkSession object and the SparkContext object when writing Spark code:

Prior to spark 2.0, SparkContext was used as a channel to access all spark functionality. The spark driver program uses sparkContext to connect to the cluster through resource manager.

SparkConf is required to create the spark context object, which stores configuration parameters like appName (to identify your spark driver), number core and memory size of executor running on worker node.

In order to use API’s of SQL, Hive, and Streaming, separate context needs to be created.

Read on to see where SparkSession fits in.

Comments closed

Hortonworks And Cloudera To Merge

Published 2018-10-08 by Kevin Feasel

Ashley Stirrup analyzes the merger of the two largest Hadoop vendors:

Overall, this is great news for customers, the Hadoop ecosystem and the future of the market. Both company’s customers can now sleep at night knowing that the pace of innovation from Cloudera 2.0 will continue and accelerate. Combining the Cloudera and Hortonworks technologies means that instead of having to pick one stack or the other, now customers can have the best of both worlds. The statement from their press release “From the Edge to AI” really sums up how complementary some of the investments that Hortonworks made in IoT complement Cloudera’s investments in machine learning. From an ecosystem and innovation perspective, we’ll see fewer competing Apache projects with much stronger investments. This can only mean better experiences for any user of big data open source technologies.

At the same time, it’s no secret how much our world is changing with innovation coming in so many shapes and sizes. This is the world that Cloudera 2.0 must navigate. Today, winning in the cloud is quite simply a matter of survival. That is just as true for the new Cloudera as it is for every single company in every industry in the world. The difference is that Cloudera will be competing with a wide range of cloud-native companies both big and small that are experiencing explosive growth. Carving out their place in this emerging world will be critical.

The company has so many of the right pieces including connectivity, computing, and machine learning. Their challenge will be, making all of it simple to adopt in the cloud while continuing to generate business outcomes. Today we are seeing strong growth from cloud data warehouses like Amazon Redshift, Snowflake, Azure SQL Data Warehouse and Google Big Query. Apache Spark and service players like Databricks and Qubole are also seeing strong growth. Cloudera now has decisions to make on how they approach this ecosystem and they choose to compete with and who they choose to complement.

Rob Bearden on the Hortonworks side:

Cloudera has a like-minded approach to next generation data management and analytics solutions for hybrid deployments. Like Hortonworks, Cloudera believes data can drive high velocity business model transformations, and has innovated in ways that benefit the market and create new revenue opportunities. We are confident that our combined company will be ideally positioned to redefine the future of data as we extend our leadership and expand our offerings.

This transformational event will create benefits and growth opportunities for our stakeholders. Together with Cloudera, we will accelerate market development, fuel innovation and produce substantial benefits for our customers, partners, employees and the community.

By merging Cloudera’s investments in data warehousing and machine learning with Hortonworks’ investments in end-to-end data management, we are generating a winning combination, which will establish the standard for hybrid cloud data management.

Mike Olson on the Cloudera side:

We’re announcing the combination today, but we don’t expect the deal to close for several months. We’ll undergo the normal regulatory review that any merger of scale involving public companies gets, and the shareholders from both companies will have to meet and approve the deal.

Between now and the close date, we remain independent companies. Our customers are running our respective products. Our sales teams are working separate from each other with current and new customers to win more business and to make those customers successful. We’ll both continue to do that.

Customers who are running CDH, HDP and HDF are getting a new promise. Those product lines will each be supported and maintained for at least three years from the date our merger closes. Any customer who chooses either can be sure of a long-term future for the platform selected.

Guy Shilo isn’t quite as pleased:

On the business side, the new company will be a de facto monopoly, as those two are the largest Hadoop vendors in terms of market share. Less competition often leads to lack of incentive to innovate and rising prices. Let’s hope the joint company will not go this way and leverage its funds and power to improving their products and services.

On the technological side, it will be interesting to see the way CDH and HDP will go. Will they keep both products alive ? will they continue only one ? which of them will it be ? Will they take the HortonWorks approach that embraces the Hadoop open source community and its fast changing versions or the Cloudera more conservative approach ?

I am cautiously pessimistic about this. Cloudera and Hortonworks combined for a huge amount of the Hadoop market (approximately 80% as of a couple of years ago). There are several competitors in the broader market, but I thought that Cloudera and Hortonworks gave us two separate visions for different types of companies.

Comments closed

Troubleshooting KSQL Executions

Published 2018-10-05 by Kevin Feasel

Robin Moffatt shows us some of the tools available for researching problems with KSQL queries executed against a server:

What does any self-respecting application need? Metrics! We need to know how many messages have been processed, when the last message was processed and so on.

The simplest option for gathering these metrics comes from within KSQL itself, using the same DESCRIBE EXTENDED command that we saw before:
ksql> DESCRIBE EXTENDED GOOD_RATINGS;
[...]
Local runtime statistics
------------------------
messages-per-sec:      1.10 total-messages:     2898 last-message: 9/17/18 1:48:47 PM UTC
 failed-messages:         0 failed-messages-per-sec:         0 last-failed: n/a
(Statistics of the local KSQL server interaction with the Kafka topic GOOD_RATINGS)
ksql>

You can get more details, including explain plans, from this. There are external tools which Robin demonstrates as well, which let you track the streams over time.

Comments closed

HDP 3.0 Updates To Hive And Druid

Published 2018-10-03 by Kevin Feasel

Nishant Bangarwa has some updates to Apache Druid in HDP 3.0:

There are numerous improvements that went into HDP 3.0 and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration :

1. Druid Expressions Support – HIVE-18893/ CALCITE-2170 added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query and they can be evaluated by Druid.
2. Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.
3. GroupBy Query Improvements – Many optimizations are done in order to address the performance of GroupBy queries on Druid side. Main ones are –
  1. #4660 Parallel sort for ConcurrentGrouper
  2. #4576 Array-based aggregation for GroupBy query
  3. #4668 Add IntGrouper to avoid unnecessary boxing/unboxing in array-based aggregation
  4. #4704 Parallel merging of intermediate groupby results on the broker nodes.
4. Better column pruning – In some cases when hive cannot push any operator to druid, hive ended up pulling all the columns from druid. This led to lots of unnecessary data transfer between druid and hive. HIVE-15619 improved the column pruner logic to only fetch the columns from druid which are required to answer a query.
5. Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with latest version of Druid i.e 0.12.2 which has many new features, performance enhancements and bug-fixes over the previous version.

Druid is still a specialty technology which doesn’t fit every use case, but if it does fit your use, you’ll get a lot of performance benefit out of it.

Comments closed

Building A Basic Kafka Producer

Published 2018-10-02 by Kevin Feasel

M. Mallikarjun shows us a simple producer in Kafka:

A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.

So, how many ways are there to implement a Kafka producer? Well, there are a lot! But in this article, we shall walk you through two ways.

Kafka Command Line Tools

Kafka Producer Java API

You can write producers in quite a few languages. Java is the example here, but there are several libraries, including a good one for .NET.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31