Category: Hadoop

Database Integrity in Cloudera Data Platform

Gokul Kamaraj and Liliana Kadar continue a series on operational database tooling in Hadoop:

Referential integrity is supported through the implementation of ‘constraints’, as well as by enforcing business rules for attributes in the table.

Constraints are configurable, and you can use them across different tables. Keep in mind that you have to choose a behavior depending on the specific configuration given to that constraint.

This is rather underdeveloped compared to relational database platforms, but it’s still an improvement over the olden days, in which referential integrity was “write code which does that after the fact.”

Reading Query Plans in Spark

Daniel Ciocirlan has a primer on query plans in Apache Spark:

Let’s go over some examples of query plans and how to read them. Let’s go back to the one we’ve just shown:

== Physical Plan ==
*(1) Project [(id#0L * 5) AS id#2L]
+- *(1) Range (1, 1000000, step=1, splits=6)

We read this plan backwards, bottom to top.

Spark does have some UI components which make this a bit easier, but you’ll probably end up in a situation where you need to read it in this format.
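
For reference, here is a minimal PySpark sketch that produces a plan like the one quoted above (assuming a local SparkSession; the number of splits will vary with your parallelism):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The same transformation as in the quoted plan: a range multiplied by 5
df = spark.range(1, 1000000).selectExpr("id * 5 AS id")

# Print the physical plan; explain(True) also prints the parsed,
# analyzed, and optimized logical plans.
df.explain()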

Developing for Databricks with VS Code

Gerhard Brueckl tells us what comes after notebooks for users with development backgrounds:

For those users Databricks has developed Databricks Connect (Azure docs) which allows you to work with your local IDE of choice (Jupyter, PyCharm, RStudio, IntelliJ, Eclipse or Visual Studio Code) but execute the code on a Databricks cluster. This is awesome and provides a lot of advantages compared to the standard notebook UI. The two most important ones are probably the proper integration into source control / git and the ability to extend your IDE with tools like automatic formatters, linters, custom syntax highlighting, …

While Databricks Connect solves the problem of local execution and debugging, there was still a gap when it came to pushing your local changes back to Databricks to be executed as part of a regular ETL or ML pipeline. So far you had to either “deploy” your changes by manually uploading them via the Databricks UI again or write a script that uploads them via the REST API (Azure docs).
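
As a minimal sketch of the Databricks Connect workflow, assuming the databricks-connect package is installed and configured (workspace URL, token, and cluster ID supplied via `databricks-connect configure`), ordinary local PySpark code then executes against the remote cluster:

# With databricks-connect configured, this SparkSession resolves to the
# remote Databricks cluster rather than a local one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The count below runs on the cluster, not on the local machine.
print(spark.range(1000).count())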

Gerhard has a nice extension for Visual Studio Code which helps with this. I’m also a huge fan of the DatabricksPS module, so I’ll happily plug that here.

Cloudera Data Platform High Availability Options

Liliana Kadar, et al., explain what options DBAs have around high availability when working with the Cloudera Data Platform:

All Data Replication (DR) combinations are supported:

– hot-hot
– hot-warm
– hot-cold
– hot-warm-cold
– other permutations of these configurations

The direction of replication can be uni-directional, bi-directional, or multi-directional through advanced geo-distributed topologies.

It’s interesting to watch the evolution of Hadoop administration, going from “the cluster is our HA option” to having realistic plans if problems occur. The post doesn’t really cover DR, where the evolution has been greater.

Memory Management in Flink 1.10

Andrey Zagrebin walks us through some memory management improvements in the most recent version of Apache Flink:

Apache Flink 1.10 comes with significant changes to the memory model of the Task Managers and configuration options for your Flink applications. These recently-introduced changes make Flink more adaptable to all kinds of deployment environments (e.g. Kubernetes, Yarn, Mesos), providing strict control over its memory consumption. In this post, we describe Flink’s memory model, as it stands in Flink 1.10, how to set up and manage memory consumption of your Flink applications and the recent changes the community implemented in the latest Apache Flink release.
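
For a flavor of the new options, here is a minimal flink-conf.yaml sketch; the key names come from the Flink 1.10 documentation, but the sizes are purely illustrative:

# Total memory of the TaskManager process, including JVM overhead;
# the natural choice for containerized deployments:
taskmanager.memory.process.size: 4096m

# Alternatively, configure only the memory used by Flink itself:
# taskmanager.memory.flink.size: 3072m

# Fraction of Flink memory reserved as managed (off-heap) memory:
taskmanager.memory.managed.fraction: 0.4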

Click through to learn about the current model and methods to control memory utilization.

Operational Database Management Tools in Cloudera Data Platform

Gokul Kamaraj, et al., describe tools available to DBAs in the Cloudera Data Platform:

Cloudera provides multiple mechanisms to allow backup and recovery, including:

– Snapshots
– Replication
– Export
– CopyTable
– HTable API
– Offline backup of HDFS data

These can be run manually or scheduled using Replication Manager. Backups can also be moved to other instances of the OpDB or alternate storage targets such as AWS S3 or Azure ADLS gen 2.
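
For a sense of what the snapshot path looks like with standard HBase tooling (a sketch; the table, snapshot, and bucket names here are hypothetical):

# From the HBase shell: take a snapshot of a table
snapshot 'orders', 'orders_snapshot_20200401'

# From the command line: export that snapshot to S3
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot orders_snapshot_20200401 \
  -copy-to s3a://backup-bucket/hbase/ -mappers 4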

Even in the Platform-as-a-Service world, there’s still plenty of scope for database administration.

Serialization in Apache Flink

Nico Kruber walks us through the viable set of serializers in Apache Flink:

Flink handles data types and serialization with its own type descriptors, generic type extraction, and type serialization framework. We recommend reading through the documentation first in order to be able to follow the arguments we present below. In essence, Flink tries to infer information about your job’s data types for wire and state serialization, and to be able to use grouping, joining, and aggregation operations by referring to individual field names, e.g. stream.keyBy("ruleId") or dataSet.join(another).where("name").equalTo("personName"). It also allows optimizations in the serialization format as well as reducing unnecessary de/serializations (mainly in certain Batch operations as well as in the SQL/Table APIs).

Click through for notes on each serializer and a graph which shows how the choice of a serializer can make a huge difference.

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

DASK is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as Numpy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.

DASK uses Python natively, distinguishing it from Spark, which is written in Java, and has the overhead of running JVMs and context switching between Python and Java. It is also much harder to debug Spark errors vs. looking at a Python stack trace that comes from DASK.

We will run Xgboost on DASK to train in parallel on CML. The source code for this blog can be found here.
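
The shape of the training code looks roughly like this, a sketch using the xgboost.dask module rather than the post's exact source; the data and parameters here are made up:

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()  # connect to (or start) a Dask cluster

# Hypothetical training data, partitioned into chunks across workers
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

# DaskDMatrix distributes the data; train() coordinates the workers
dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=50,
)
booster = result["booster"]  # the trained model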

Click through for the process.

Is Kafka a Database?

Kai Wähner asks a question I hadn’t thought about:

Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. Short answers like “Yes” or “It depends” are not good enough for you? Then this read is for you! This blog post explains the idea behind databases and different features like storage, queries, and transactions to evaluate when Kafka is a good fit and when it is not.
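
One concrete illustration of the storage angle: because Kafka retains records according to its retention policy, a consumer can rewind and re-read a topic from the beginning, which is the property that makes database-style use cases thinkable at all. A minimal sketch with the kafka-python client (the topic name and broker address are hypothetical):

from kafka import KafkaConsumer

# Re-reading a topic from offset zero: records persist per the topic's
# retention settings and can be replayed at any time.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once caught up
)
for record in consumer:
    print(record.offset, record.value)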

This is an interesting review of the Kafka ecosystem and shows that Apache Kafka really does blur the lines regarding what is a database.
