Press "Enter" to skip to content

Category: Hadoop

What’s New In Cloudera Enterprise 6.0

The Cloudera Hive team looks at the introduction of Apache Hive 2.1 into Cloudera Enterprise 6:

We are also focusing on efficiency across our platform. While on-premises platform efficiency helps manage costs in the long run, the immediate benefits of in-cloud deployments come from reducing total cost of ownership (TCO). We introduced Hive-on-Spark two years ago to meet this goal in collaboration with Intel, our strategic partner. We have a longstanding collaboration with Intel to optimize Cloudera’s stack on Intel architecture for our customers’ benefit.

In Enterprise 6.0, advancing our strategic partnership with Intel for further efficiency gains in Hive, we introduce a major performance and efficiency enhancement in HoS called Parquet Vectorization. This feature enables the HoS engine to process a vector of columns instead of one row at a time: data rows are batched together into column vectors, and each operator works on those column vectors. This leads to better utilization of CPU caches and achieves high instructions per cycle by efficiently using the CPU instruction pipeline. In addition, we include numerous other performance improvements. For example, Hive often scans a given table multiple times during self-joins, self-unions, or shared subqueries. To address this, Dynamic RDD caching in HoS reuses a single scan across all these operations. Similarly, when the same subquery is used repeatedly, HoS executes it only once instead of separately for each invocation. Overall, with all these enhancements, Hive in Enterprise 6.0 can be up to 2.2X faster than Hive on the latest Enterprise 5.x release. The majority of these gains can be attributed to Parquet Vectorization for Hive-on-Spark.
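
For those who want to experiment, vectorized execution in Hive is controlled by a session-level setting. Here’s a minimal sketch using the PyHive library, assuming a HiveServer2 endpoint and a Parquet-backed table; the host, port, and table name are all hypothetical:

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Connect to HiveServer2; host and port are placeholders for your cluster.
conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Vectorized execution is a session-level Hive setting.
cur.execute("SET hive.vectorized.execution.enabled=true")

# Queries against Parquet-backed tables in this session can now process
# column-vector batches instead of one row at a time.
cur.execute("SELECT COUNT(*) FROM sales_parquet WHERE region = 'EMEA'")
print(cur.fetchone())
```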

This is another case where the Cloudera-Hortonworks merger will get interesting: Cloudera seemed to hitch its wagon to Impala and Hortonworks to Hive; will they support both as much as they each did independently, or will the new corporate overlords settle on one of the two?

Whither Running Kafka On Kubernetes

Gwen Shapira walks through some of the costs and benefits of using Kubernetes to host your Apache Kafka brokers:

First, if you are running most of your other applications and microservices on Kubernetes, it becomes the organizational path of least resistance. This is just like how organizations that standardized on VMs have found it very difficult to allocate physical machines with local disks for Kafka.

I see situations with larger organizations where deploying Kafka outside of Kubernetes causes significant organizational headaches that involve many approvals. When this is the case, I usually say that this isn’t a good hill to die on. It is possible to run Kafka on Kubernetes, so just do it. You’ll get your environment allocated faster and will be able to use your time to do productive work rather than fight an organizational battle.
And if things go wrong, you’ll get much better service from your internal infrastructure teams, because you’ll be running in an environment that is familiar to them.

Read on for more benefits as well as a few drawbacks.

Medium-Term Effects Of The Cloudera-Hortonworks Merger

Alex Woodie describes some of the ramifications of Cloudera’s merger with Hortonworks:

Whatever camp you sit in, the merger undoubtedly caught the attention of the 2,500 organizations that have adopted Cloudera’s Distribution of Hadoop (CDH) or the Hortonworks Data Platform (HDP) over the years — not to mention the thousands of other companies that have adopted open source Apache Hadoop platforms or Hadoop ecosystem components in the cloud. These Global 2000 companies have invested billions of dollars into building giant clusters to store and process many exabytes’ worth of data, and they’re not going to just turn them off overnight because the two biggest players suddenly decided to merge.

At the same time, these customers need to be reassured that Cloudera has a plan to maintain the investments they’ve already made in HDP and CDH platforms, both in a short-term, tactical sense and in terms of Cloudera’s long-range strategy to evolve its platform to meet emerging compute and storage needs.

Read on for more detail.

Spark Streaming On Azure Databricks

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves consuming messages from a queue or files, doing some processing in the middle (querying, filtering, aggregation), and then forwarding the result to a sink, all with minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. Often, the two will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics

  • Spark Streaming / Storm on HDInsight

  • Spark Streaming on Databricks

  • Azure Functions

Click through for more.
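
To give a flavor of the Spark Streaming option, here’s a minimal Structured Streaming sketch in PySpark. It uses the built-in rate source as a stand-in for a real queue such as Event Hubs or Kafka, so it should run on a plain local Spark install as well as on Databricks:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; this line
# makes the sketch runnable on a local Spark installation too.
spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Source: the built-in rate source stands in for a real queue; a real job
# would use format("kafka") or the Event Hubs connector instead.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Processing in the middle: filter, then aggregate over a one-minute window.
counts = (events
          .filter(F.col("value") % 2 == 0)
          .groupBy(F.window(F.col("timestamp"), "1 minute"))
          .count())

# Sink: the console for demonstration; production jobs would target Delta,
# a database, or another queue.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```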

Clients For Working With HDFS

Mark Litwintschik reviews several clients for working with the Hadoop Distributed Filesystem:

The Hadoop Distributed File System (HDFS) allows you both to federate storage across many computers and to distribute files in a redundant manner across a cluster. HDFS is a key component of many storage clusters that possess more than a petabyte of capacity.

Each computer acting as a storage node in a cluster can contain one or more storage devices. Using several mechanical storage drives can store data more reliably than SSDs, keep the cost per gigabyte down, and go some way toward exhausting the SATA bus capacity of a given system.

Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many who interact with HDFS directly, it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I’ll walk through a few of these offerings.

Read on for reviews of those offerings.
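
As a taste of the Python side, here’s a minimal sketch using the hdfs package, one of the WebHDFS-based clients in this space. The namenode URL, user, and paths are placeholders:

```python
from hdfs import InsecureClient  # pip install hdfs

# The hdfs package speaks WebHDFS, so it uses the namenode's HTTP port
# (50070 on Hadoop 2.x, 9870 on Hadoop 3.x).
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# List a directory, write a file, and read it back.
print(client.list("/user/hadoop"))

with client.write("/user/hadoop/greeting.txt", overwrite=True) as writer:
    writer.write(b"hello from webhdfs")

with client.read("/user/hadoop/greeting.txt") as reader:
    print(reader.read())
```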

Monitoring Apache NiFi With A Custom Dashboard

Tim Spann has started a new series on monitoring Apache NiFi:

In this little proof of concept, we grab some of these flows, process them in Apache NiFi, and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and to Druid for time series. We will see as this expands.

There are also other data access options including the NiFi REST API and the NiFi Python APIs.

Bootstrap Notifier

  • Send a notification when NiFi starts, stops, or dies unexpectedly
  • Two OOTB notifications
  • Email notification service
  • HTTP notification service
  • It’s easy to write a custom notification service

Reporting Tasks

  • AmbariReportingTask (global, per process group)

  • MonitorDiskUsage (flowfile, content, and provenance repositories)

  • MonitorMemory

Much of this is an overview of the tools and measures available.
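
For the REST API route mentioned above, polling a couple of standard endpoints gets you most of the raw numbers for a basic dashboard. A sketch using Python’s requests against an unsecured NiFi instance; the hostname is a placeholder, and a secured cluster would need a token:

```python
import requests

# Base URL for an unsecured NiFi instance.
NIFI = "http://nifi.example.com:8080/nifi-api"

# Controller-level flow status: active threads, queued flowfiles and bytes.
status = requests.get(f"{NIFI}/flow/status").json()
print(status["controllerStatus"]["queued"])

# System diagnostics: heap usage plus flowfile, content, and provenance
# repository utilization.
diag = requests.get(f"{NIFI}/system-diagnostics").json()
snapshot = diag["systemDiagnostics"]["aggregateSnapshot"]
print(snapshot["heapUtilization"], snapshot["usedHeap"])
```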

Big Data Clusters In SQL Server 2019

James Serra lays out some of the architecture behind SQL Server 2019 Big Data Clusters:

While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.

SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:

James covers some of the reasoning behind this and the shift from using PolyBase to integrate data with Hadoop and Azure Blob Storage to using SQL Server as a data virtualization engine.
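
To make the virtualization idea concrete, here’s a sketch of the PolyBase external-table pattern, issuing T-SQL from Python via pyodbc. The server, credential, and Oracle object names are all hypothetical, and this assumes PolyBase is already enabled:

```python
import pyodbc

# Connection details are placeholders for your SQL Server 2019 instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql2019.example.com;DATABASE=Virtualized;"
    "UID=demo_user;PWD=demo_password"
)
cur = conn.cursor()

# An external data source pointing at an Oracle instance. Assumes a
# database-scoped credential named OracleCred already exists; SQL Server
# 2019 PolyBase also supports Teradata, MongoDB, and generic ODBC sources.
cur.execute("""
CREATE EXTERNAL DATA SOURCE OracleSales
WITH (LOCATION = 'oracle://oracle-host:1521', CREDENTIAL = OracleCred);
""")

# The external table is metadata only; queries against it are pushed to
# Oracle at run time, so no data is copied into SQL Server.
cur.execute("""
CREATE EXTERNAL TABLE dbo.Orders (
    OrderID INT,
    Amount DECIMAL(10, 2)
)
WITH (LOCATION = 'XE.SALES.ORDERS', DATA_SOURCE = OracleSales);
""")
conn.commit()
```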

Mounting HDFS As A Local Filesystem

Guy Shilo looks at two techniques for mounting HDFS as a local filesystem:

NFS Gateway is an HDFS component that lets you expose HDFS through an NFSv3 interface, so Linux machines can mount it and access it just as they would a local filesystem.

The manual installation is quite cumbersome and is covered here.

Cloudera Manager automates the process, so we will use it. If you do not already have NFS Gateway installed in your Cloudera cluster, go to HDFS -> Instances -> Add role instances and choose a host for NFS Gateway:

Guy also looks at Fuse and runs a quick test to see which is faster.
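
Once the NFS Gateway role is running, the mount itself is an ordinary NFSv3 mount. A small Python wrapper as a sketch; the gateway host and mount point are placeholders, the options follow the ones the HDFS NFS Gateway documentation suggests, and this needs root privileges:

```python
import subprocess

# Placeholders for your environment.
gateway = "nfsgateway.example.com"
mountpoint = "/mnt/hdfs"

# NFSv3 mount of the HDFS root; must run as root.
subprocess.run(
    ["mount", "-t", "nfs",
     "-o", "vers=3,proto=tcp,nolock,noacl,sync",
     f"{gateway}:/", mountpoint],
    check=True,
)

# HDFS paths now appear as ordinary files under the mount point.
print(subprocess.run(["ls", mountpoint],
                     capture_output=True, text=True).stdout)
```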

How Humio Uses Kafka

Kresten Krab describes ways that Humio uses Apache Kafka for their product:

Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed for “on-prem first” because, in many logging use cases, you need the privacy and security of managing your own logging solution, and because volume limitations can often be a problem in hosted scenarios.

From a software provider’s point of view, fixing issues in an on-prem solution is inherently problematic, so we have strived to make the solution simple. To realize this goal, a Humio installation consists of only a single process per node running Humio itself, with a dependency on Kafka running nearby (we recommend deploying one Humio node per physical CPU, so a dual-socket machine typically runs two Humio nodes).

We use Kafka for two things: buffering ingest and as a sequencer of events among the nodes of a Humio cluster.

Read on for more details and a few tips on using Kafka to its fullest.
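
The two roles map onto familiar Kafka patterns: a producer buffering ingest, and a totally ordered topic acting as a sequencer. A rough sketch with the kafka-python client; the broker address and topic name are made up, and this is an illustration of the pattern rather than Humio’s actual internals:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "kafka.example.com:9092"  # placeholder

# Ingest buffering: events are produced to a topic and absorbed by Kafka
# even when downstream consumers temporarily fall behind.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("ingest-events", {"host": "web-1", "line": "GET / 200"})
producer.flush()

# Sequencing: within a partition, every consumer sees the same total order
# of events, which is what makes a single-partition topic usable as a
# sequencer among cluster nodes.
consumer = KafkaConsumer(
    "ingest-events",
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of no messages
)
for record in consumer:
    print(record.offset, record.value)
```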

Looking At Databricks Cluster Pricing

Tristan Robinson takes a look at Azure Databricks pricing:

The use of Databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and it has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes (give it a name, select a runtime, select the type of VMs you want, and away you go), for production workloads a bit more thought needs to go into the configuration and cost. In the following blog I’ll start by looking at the pricing in a bit more detail, which will aim to provide a cost element to the cluster configuration process.

There are a few complicating factors in figuring out cluster price but rest assured that it will be costly.
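
The core arithmetic is VM cost plus DBU cost. A back-of-the-envelope sketch in Python, with every rate below an illustrative placeholder rather than a current Azure price:

```python
# Rough Azure Databricks cluster cost estimate. All rates are hypothetical;
# check the current Azure price list for real numbers.
vm_per_hour = 0.50    # $/hour for one node's VM
dbu_per_hour = 0.75   # DBUs one node consumes per hour (varies by VM size)
dbu_rate = 0.40       # $/DBU for the chosen tier and workload type

nodes = 4 + 1         # four workers plus the driver
hours = 10 * 22       # ten hours a day, twenty-two working days a month

vm_cost = nodes * hours * vm_per_hour
dbu_cost = nodes * hours * dbu_per_hour * dbu_rate
print(f"VM cost: ${vm_cost:,.2f}, DBU cost: ${dbu_cost:,.2f}, "
      f"total: ${vm_cost + dbu_cost:,.2f}")
```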
