Hadoop – Page 53 – Curated SQL

All Data Replication (DR) combinations are supported:
– hot-hot
– hot-warm
– hot-cold
– hot-warm-cold
– other permutations of these configurations
The direction of replication can be uni-directional, bi-directional or multi-directional replication through advanced geo-distributed topologies.

It’s interesting to watch the evolution of Hadoop administration, going from “the cluster is our HA option” to having realistic plans if problems occur. The post doesn’t really cover DR, where the evolution has been greater.


Memory Management in Flink 1.10

Published 2020-04-21 by Kevin Feasel

Andrey Zagrebin walks us through some memory management improvements in the most recent version of Apache Flink:

Apache Flink 1.10 comes with significant changes to the memory model of the Task Managers and configuration options for your Flink applications. These recently-introduced changes make Flink more adaptable to all kinds of deployment environments (e.g. Kubernetes, Yarn, Mesos), providing strict control over its memory consumption. In this post, we describe Flink’s memory model, as it stands in Flink 1.10, how to set up and manage memory consumption of your Flink applications and the recent changes the community implemented in the latest Apache Flink release.

Click through to learn about the current model and methods to control memory utilization.
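To make the idea of a Task Manager memory budget concrete, here is a minimal sketch of how one process-size setting could be split into components. The component names follow the Flink 1.10 memory model, but every default below (96 MB metaspace, the 0.1 and 0.4 fractions, the min/max bounds, the 128 MB framework sizes) is an assumption on my part and not a value taken from the post, and the function `task_manager_memory` is purely hypothetical. Check the Flink documentation for your version before relying on any of these numbers.

```python
def clamp(value, low, high):
    """Keep value within [low, high]."""
    return max(low, min(value, high))


def task_manager_memory(
    process_mb,
    metaspace_mb=96,            # assumed default, not from the post
    overhead_fraction=0.1,      # assumed: fraction of total process memory
    overhead_min_mb=192,
    overhead_max_mb=1024,
    network_fraction=0.1,       # assumed: fraction of total Flink memory
    network_min_mb=64,
    network_max_mb=1024,
    managed_fraction=0.4,       # assumed: fraction of total Flink memory
    framework_heap_mb=128,
    framework_off_heap_mb=128,
    task_off_heap_mb=0,
):
    """Split a total process size (in MB) into memory components.

    Values are rounded to whole megabytes to keep the arithmetic readable.
    """
    jvm_overhead = clamp(round(process_mb * overhead_fraction),
                         overhead_min_mb, overhead_max_mb)
    # Total Flink memory is what remains after JVM-specific memory.
    total_flink = process_mb - metaspace_mb - jvm_overhead
    network = clamp(round(total_flink * network_fraction),
                    network_min_mb, network_max_mb)
    managed = round(total_flink * managed_fraction)
    # Task heap gets whatever is left once everything else is reserved.
    task_heap = (total_flink - framework_heap_mb - framework_off_heap_mb
                 - task_off_heap_mb - network - managed)
    if task_heap < 0:
        raise ValueError("process size too small for the other components")
    return {
        "total_flink_mb": total_flink,
        "framework_heap_mb": framework_heap_mb,
        "task_heap_mb": task_heap,
        "managed_mb": managed,
        "framework_off_heap_mb": framework_off_heap_mb,
        "task_off_heap_mb": task_off_heap_mb,
        "network_mb": network,
        "jvm_metaspace_mb": metaspace_mb,
        "jvm_overhead_mb": jvm_overhead,
    }


if __name__ == "__main__":
    for name, size in task_manager_memory(1568).items():
        print(f"{name}: {size}")
```

Under these assumed defaults, a 1568 MB process size leaves 1280 MB of total Flink memory and 384 MB of task heap. The point of the sketch is that a single top-level setting determines every other component, which is why one knob can give strict control in a container with a hard memory limit.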


Operational Database Management Tools in Cloudera Data Platform

Published 2020-04-20 by Kevin Feasel

Gokul Kamaraj, et al., describe tools available to DBAs in the Cloudera Data Platform:

Cloudera provides multiple mechanisms to allow backup and recovery, including:
– Snapshots
– Replication
– Export
– CopyTable
– HTable API
– Offline backup of HDFS data
These can be run manually or scheduled using Replication Manager. Backups can also be moved to other instances of the OpDB or alternate storage targets such as AWS S3 or Azure ADLS gen 2.

Even in the Platform-as-a-Service world, there’s still plenty of scope for database administration.


Apache Kafka 2.5 Released

Published 2020-04-17 by Kevin Feasel

David Arthur announces Apache Kafka 2.5:

KIP-500 update
In Apache Kafka 2.5, some preparatory work has been done towards the removal of Apache ZooKeeper™ (ZK).
– KIP-555: details about the ZooKeeper deprecation process in admin tools
– KIP-543: dynamic configs will not require ZooKeeper access

KIP-500 looks like a doozy.


Serialization in Apache Flink

Published 2020-04-15 by Kevin Feasel

Nico Kruber walks us through the viable set of serializers in Apache Flink:

Flink handles data types and serialization with its own type descriptors, generic type extraction, and type serialization framework. We recommend reading through the documentation first in order to be able to follow the arguments we present below. In essence, Flink tries to infer information about your job’s data types for wire and state serialization, and to be able to use grouping, joining, and aggregation operations by referring to individual field names, e.g. stream.keyBy("ruleId") or dataSet.join(another).where("name").equalTo("personName"). It also allows optimizations in the serialization format as well as reducing unnecessary de/serializations (mainly in certain Batch operations as well as in the SQL/Table APIs).

Click through for notes on each serializer and a graph which shows how the choice of a serializer can make a huge difference.
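As a loose analogy for why the choice of serializer matters, the sketch below encodes the same record three ways using only the Python standard library. This is not Flink code and none of these encoders appear in the post. The record, its field names, and the fixed binary layout are all hypothetical. The idea it illustrates is that an encoder that knows the record's types up front can skip the field names and type tags that a generic encoder has to write.

```python
import json
import pickle
import struct

# Hypothetical record: an integer rule id and a floating-point score.
record = {"rule_id": 42, "score": 3.5}

# Schema-aware layout: little-endian 4-byte int followed by 8-byte double.
RECORD_LAYOUT = struct.Struct("<id")


def encode_json(rec):
    """Generic text encoding: field names travel with every record."""
    return json.dumps(rec).encode("utf-8")


def encode_pickle(rec):
    """Generic binary encoding: still carries keys and type opcodes."""
    return pickle.dumps(rec)


def encode_struct(rec):
    """Schema-aware encoding: only the raw values are written."""
    return RECORD_LAYOUT.pack(rec["rule_id"], rec["score"])


def decode_struct(data):
    """Rebuild the record; the schema supplies the field names."""
    rule_id, score = RECORD_LAYOUT.unpack(data)
    return {"rule_id": rule_id, "score": score}


sizes = {
    "json": len(encode_json(record)),
    "pickle": len(encode_pickle(record)),
    "struct": len(encode_struct(record)),
}

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name}: {size} bytes")
```

The fixed layout is the smallest of the three because the schema lives in the code rather than in each message. The trade-off is that every reader and writer must agree on that schema.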


Distributed XGBoost in Cloudera

Published 2020-04-13 by Kevin Feasel

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

Dask is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as NumPy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.
Dask uses Python natively, distinguishing it from Spark, which is written in Java, and has the overhead of running JVMs and context switching between Python and Java. It is also much harder to debug Spark errors vs. looking at a Python stack trace that comes from Dask.
We will run XGBoost on Dask to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.


Is Kafka a Database?

Published 2020-04-10 by Kevin Feasel

Kai Wähner asks a question I hadn’t thought about:

Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. Short answers like “Yes” or “It depends” are not good enough for you? Then this read is for you! This blog post explains the idea behind databases and different features like storage, queries, and transactions to evaluate when Kafka is a good fit and when it is not.

This is an interesting review of the Kafka ecosystem and shows that Apache Kafka really does blur the lines regarding what is a database.


Database Administration in Cloudera Data Platform

Published 2020-04-10 by Kevin Feasel

Gokul Kamaraj and Liliana Kadar walk through tools for the DBA in Cloudera Data Platform:

You can use Cloudera Manager to automate the process of upgrading the operational database in your Cloudera Data Platform-Data Center (CDP-DC). Upgrades are provided through releases or maintenance patches. Cloudera Manager installs the releases and/or patches and manages the configuration as well as the restart process.
If you are using CDP on a public cloud such as Amazon AWS, you have to create a new Data hub cluster to upgrade to the new versions of various components. For more information about creating a new operational database Data hub cluster, see Getting Started with Operational Database on CDP.
Cloudera’s offering is a cluster-based offering; upgrades and patches all span multiple nodes (servers) and installation, configuration, reboot are all automated, including rolling reboots where applicable.

Click through for a walkthrough of other tools for Hadoop DBAs.


Stateful Functions in Apache Flink

Published 2020-04-09 by Kevin Feasel

Stephan Ewen announces Stateful Functions 2.0:

Today, we are announcing the release of Stateful Functions (StateFun) 2.0 — the first release of Stateful Functions as part of the Apache Flink project. This release marks a big milestone: Stateful Functions 2.0 is not only an API update, but the first version of an event-driven database that is built on Apache Flink.
Stateful Functions 2.0 makes it possible to combine StateFun’s powerful approach to state and composition with the elasticity, rapid scaling/scale-to-zero and rolling upgrade capabilities of FaaS implementations like AWS Lambda and modern resource orchestration frameworks like Kubernetes.
With these features, Stateful Functions 2.0 addresses two of the most cited shortcomings of many FaaS setups today: consistent state and efficient messaging between functions.

Read on to see how it works.


Decade Two of Hadoop

Published 2020-04-08 by Kevin Feasel

Arun Murthy takes us through decade two of Hadoop:

By the end of the first decade, we needed a fundamental rethink — not just for the public cloud, but also for on-premises. It’s also helpful to cast an eye on the various technological forces driving Hadoop’s evolution over the next decade:
– Cloud experiences fundamentally changed expectations for easy to use, self-service, on-demand, elastic consumption of software and apps as services.
– Separation of compute and storage is now practical in both public and private clouds, significantly increasing workload performance.
– Containers and Kubernetes are ubiquitous as a standard operating environment that is more flexible and agile.
– The integration of streaming, analytics and machine learning — the data lifecycle — is recognized as a prerequisite for nearly every data-driven business use case.

“Core” Hadoop (not including products in the broader Hadoop ecosystem like Spark, Kafka, etc.) hit a major stress point with migration out of data centers running direct attached storage. This is how Cloudera is working to pick up some of that lost momentum.


