Category: Hadoop

Operational Database Management Tools in Cloudera Data Platform

Gokul Kamaraj, et al., describe tools available to DBAs in the Cloudera Data Platform:

Cloudera provides multiple mechanisms to allow backup and recovery, including:

– Snapshots
– Replication
– Export
– CopyTable
– HTable API
– Offline backup of HDFS data

These can be run manually or scheduled using Replication Manager. Backups can also be moved to other instances of the OpDB or alternate storage targets such as AWS S3 or Azure ADLS gen 2.

Even in the Platform-as-a-Service world, there’s still plenty of scope for database administration.

Comments closed

Serialization in Apache Flink

Nico Kruber walks us through the viable set of serializers in Apache Flink:

Flink handles data types and serialization with its own type descriptors, generic type extraction, and type serialization framework. We recommend reading through the documentation first in order to be able to follow the arguments we present below. In essence, Flink tries to infer information about your job’s data types for wire and state serialization, and to be able to use grouping, joining, and aggregation operations by referring to individual field names, e.g. stream.keyBy(“ruleId”) or dataSet.join(another).where("name").equalTo("personName"). It also allows optimizations in the serialization format as well as reducing unnecessary de/serializations (mainly in certain Batch operations as well as in the SQL/Table APIs).

Click through for notes on each serializer and a graph which shows how the choice of a serializer can make a huge difference.
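
To make the point about serializer choice concrete, here is a minimal Python sketch (not Flink code) that encodes the same record two ways. It is an analogy only: `pickle` stands in for a generic, self-describing serializer, and `struct` stands in for a serializer that knows the field types up front. The record and its fields `rule_id` and `name` are hypothetical.

```python
import pickle
import struct

# Hypothetical record; the field names are made up for illustration.
record = {"rule_id": 42, "name": "alice"}

# Assumed fixed layout: little-endian int32 rule_id, uint16 name length.
HEADER = "<iH"


def encode_generic(rec):
    # Generic encoding: field names and type tags travel with every record.
    return pickle.dumps(rec)


def encode_with_schema(rec):
    # Schema-aware encoding: both sides agree on the layout in advance,
    # so only the values go into the payload.
    name_bytes = rec["name"].encode("utf-8")
    return struct.pack(HEADER, rec["rule_id"], len(name_bytes)) + name_bytes


def decode_with_schema(payload):
    # Read the fixed-size header, then slice out the variable-length name.
    rule_id, name_len = struct.unpack_from(HEADER, payload, 0)
    offset = struct.calcsize(HEADER)
    name = payload[offset:offset + name_len].decode("utf-8")
    return {"rule_id": rule_id, "name": name}


generic_bytes = encode_generic(record)
schema_bytes = encode_with_schema(record)

print("generic:", len(generic_bytes), "bytes")
print("schema-aware:", len(schema_bytes), "bytes")
```

The schema-aware payload is smaller because it carries no field names or type tags. That is roughly why it pays to let a framework know your types rather than fall back to a generic serializer.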

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

DASK is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as Numpy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built significant following and support. 

DASK uses Python natively, distinguishing it from Spark, which is written in Java, and has the overhead of running JVMs and context switching between Python and Java. It is also much harder to debug Spark errors vs. looking at a Python stack trace that comes from DASK.

We will run Xgboost on DASK to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Is Kafka a Database?

Kai Wähner asks a question I hadn’t thought about:

Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. Short answers like “Yes” or “It depends” are not good enough for you? Then this read is for you! This blog post explains the idea behind databases and different features like storage, queries, and transactions to evaluate when Kafka is a good fit and when it is not.

This is an interesting review of the Kafka ecosystem, and it shows that Apache Kafka really does blur the lines of what counts as a database.

Database Administration in Cloudera Data Platform

Gokul Kamaraj and Liliana Kadar walk through tools for the DBA in Cloudera Data Platform:

You can use Cloudera Manager to automate the process of upgrading the operational database in your Cloudera Data Platform-Data Center (CDP-DC). Upgrades are provided through releases or maintenance patches. Cloudera Manager installs the releases and/or patches and manages the configuration as well as the restart process.

If you are using CDP on a public cloud such as Amazon AWS, you have to create a new Data hub cluster to upgrade to the new versions of various components. For more information about creating a new operational database Data hub cluster, see Getting Started with Operational Database on CDP.

Cloudera’s offering is a cluster-based offering; upgrades and patches all span multiple nodes (servers) and installation, configuration, reboot are all automated, including rolling reboots where applicable.

Click through for a walkthrough of other tools for Hadoop DBAs.

Stateful Functions in Apache Flink

Stephan Ewen announces Stateful Functions 2.0:

Today, we are announcing the release of Stateful Functions (StateFun) 2.0 — the first release of Stateful Functions as part of the Apache Flink project. This release marks a big milestone: Stateful Functions 2.0 is not only an API update, but the first version of an event-driven database that is built on Apache Flink.

Stateful Functions 2.0 makes it possible to combine StateFun’s powerful approach to state and composition with the elasticity, rapid scaling/scale-to-zero and rolling upgrade capabilities of FaaS implementations like AWS Lambda and modern resource orchestration frameworks like Kubernetes.

With these features, Stateful Functions 2.0 addresses two of the most cited shortcomings of many FaaS setups today: consistent state and efficient messaging between functions.

Read on to see how it works.

Decade Two of Hadoop

Arun Murthy takes us through decade two of Hadoop:

By the end of the first decade, we needed a fundamental rethink — not just for the public cloud, but also for on-premises. It’s also helpful to cast an eye on the various technological forces driving Hadoop’s evolution over the next decade:

– Cloud experiences fundamentally changed expectations for easy to use, self-service, on-demand, elastic consumption of software and apps as services.
– Separation of compute and storage is now practical in both public and private clouds, significantly increasing workload performance.
– Containers and Kubernetes are ubiquitous as a standard operating environment that is more flexible and agile.
– The integration of streaming, analytics and machine learning — the data lifecycle — is recognized as a prerequisite for nearly every data-driven business use case.

“Core” Hadoop (not including products in the broader Hadoop ecosystem like Spark, Kafka, etc.) hit a major stress point with migration out of data centers running direct attached storage. This is how Cloudera is working to pick up some of that lost momentum.

Using Apache Flink to Read from Apache Kafka

Preetdeep Kumar crosses the streams:

Apache Flink provides various connectors to integrate with other systems. In this article, I will share an example of consuming records from Kafka through FlinkKafkaConsumer and producing records to Kafka using FlinkKafkaProducer.

Read on for an example. I’m glad to see that integrating the two is so easy, given that Flink competes with Kafka Streams.

Hive + LLAP Now Faster with Elastic MapReduce 6

Suthan Phillips has a benchmark for Elastic MapReduce 5 versus 6:

To evaluate the performance benefits of running Hive with Amazon EMR release 6.0.0, we’re using 70 TPC-DS queries with a 3 TB Apache Parquet dataset on a six-node c4.8xlarge EMR cluster to compare the total runtime and geometric mean with results from EMR release 5.29.0.

The results show that the TPC-DS queries run twice as fast in Amazon EMR 6.0.0 (Hive 3.1.2) compared to Amazon EMR 5.29.0 (Hive 2.3.6) with the default Amazon EMR Hive configuration.

The following graph shows performance improvements measured as total runtime for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) runtime.

Click through for the measures and a bit more info on LLAP.
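
As a quick illustration of the two summary measures mentioned above, here is a minimal Python sketch that computes total runtime and geometric mean for two sets of per-query runtimes. The numbers are hypothetical and are not taken from the benchmark; they are chosen only to show how the two measures can disagree.

```python
import math


def total_runtime(runtimes):
    # Sum of all query runtimes; dominated by the slowest queries.
    return sum(runtimes)


def geometric_mean(runtimes):
    # exp(mean(log(t))): equivalent to the n-th root of the product,
    # but avoids overflow when there are many queries.
    return math.exp(sum(math.log(t) for t in runtimes) / len(runtimes))


# Hypothetical per-query runtimes in seconds (NOT benchmark results).
old_release = [10.0, 40.0, 160.0]
new_release = [5.0, 20.0, 160.0]  # two short queries halved, the long one unchanged

speedup_total = total_runtime(old_release) / total_runtime(new_release)
speedup_geo = geometric_mean(old_release) / geometric_mean(new_release)

print(f"speedup by total runtime:  {speedup_total:.2f}x")
print(f"speedup by geometric mean: {speedup_geo:.2f}x")
```

In this made-up case the geometric mean shows a larger speedup than total runtime does, because it weights each query's relative improvement equally instead of letting the longest query dominate. That is one reason benchmarks often report both.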
