
Category: Hadoop

Stream Processing with ksqlDB

Michael Drogalis takes us through how stream processing works with ksqlDB:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.
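To make those moving parts a little more concrete, here's a minimal sketch (mine, not the post's) of the sort of thing ksqlDB runs: a stream declared over a Kafka topic plus a persistent aggregating query, submitted from Python to the server's documented `/ksql` REST endpoint. The topic, stream, and column names are hypothetical, and the URL assumes a local server on the default port.

```python
import requests

# Hypothetical stream/table names; assumes a local ksqlDB server
# listening on its default port (8088).
KSQL_URL = "http://localhost:8088/ksql"

statements = """
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;
"""

# The /ksql endpoint accepts semicolon-separated statements as JSON.
resp = requests.post(
    KSQL_URL,
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json={"ksql": statements, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```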

Click through for a demo with animations.


Delta Lake DML Internals

Tathagata Das, et al., take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.
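To make that concrete, here's a hedged PySpark sketch of a `DELETE` followed by a `VACUUM`; the table name and predicate are hypothetical, and it assumes the delta-spark package is available.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-delete-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical table registered in the metastore.
events = DeltaTable.forName(spark, "events")

# Scan 1 finds files with matching rows; scan 2 rewrites those files
# without them. The old files are tombstoned in the transaction log.
events.delete("event_date < '2020-01-01'")

# Tombstoned files stay on disk for time travel until vacuumed away;
# vacuum(168) removes unreferenced files older than 168 hours (7 days).
events.vacuum(168)
```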

Click through for a video as well as a blog post with the details.


Building a Hadoop Cluster with Spark in Kubernetes

Gopal takes us through building up a Hadoop cluster via Kubernetes:

In our current scenario, we have a four-node cluster, where one is the master node (HDFS NameNode and YARN ResourceManager) and the other three are slave nodes (HDFS DataNode and YARN NodeManager).

In this cluster, we have implemented Kerberos, which makes the cluster more secure.

The Kerberos services are already running on a different server, which is treated as the KDC server.

On all of the nodes, we have to do client configuration for Kerberos, which I have already covered in my previous blog. Please go through the Kerberos authentication link below for more info.

Kerberos authentication
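For context, the client-side piece largely amounts to pointing each node at the KDC. Here's a hedged sketch of what that configuration might contain, expressed as a small Python script that writes a minimal `/etc/krb5.conf`; the realm and hostnames are placeholders, and a real rollout would use your own values (and root privileges).

```python
# Hypothetical minimal Kerberos client config; EXAMPLE.COM and
# kdc.example.com are placeholders for your realm and KDC host.
KRB5_CONF = """\
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
"""

# Writing to /etc requires root; shown here only to make the shape
# of the client configuration concrete.
with open("/etc/krb5.conf", "w") as f:
    f.write(KRB5_CONF)
```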

Read on for the walkthrough.


Choosing Between Hive LLAP and Impala

David Dichmann walks us through the differences between Impala and Hive LLAP:

Written in C++, which is very CPU efficient, with a very fast query planner and metadata caching, Impala is optimized for low-latency queries. Because of this, Impala is an ideal engine for use with a data mart, since people working with data marts are mostly running read-only queries and not large-scale writes.

Impala also has a very efficient run-time execution framework, using code generation, process-to-process communication, massive parallelism, and metadata caching. Because of this, Impala is also great when working with ad-hoc queries, like when exploring by iteratively digging into data.  You’ll want to change your query over and over again, at a moment’s notice, and have very fast response times so you’re not waiting forever for each iteration.  
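For a feel of that ad-hoc loop, here's a hedged sketch of querying Impala from Python with the impyla client; the host, port, and table names are placeholders.

```python
from impala.dbapi import connect

# 21050 is Impala's default port for the HiveServer2 protocol.
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# The kind of read-only, iteratively refined query Impala targets.
cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales_mart.orders
    WHERE order_date >= '2020-01-01'
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```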

I was curious what would end up happening with Hive and Impala once old Cloudera (Impala) and Hortonworks (Hive) merged together. Looks like the answer, at least for now, is that they’re both useful in different circumstances. But I do wonder how long that lasts—it’s not impossible to sell using two separate data platform products for different steps in a warehouse implementation, but I could see architects and CIOs wanting to make things simpler and narrow down to one unless there was a particularly smooth bridge between the two.


Operational Database Security in Cloudera Data Platform

Liliana Kadar, et al., walk us through some of the database security and auditing features in Cloudera Data Platform:

Database object-level security is available through the centralized authorization framework of Apache Ranger. 

Both fine-grained access control of database objects and access to metadata are provided. Protected database objects include: database, table, column, view, and User Defined Functions (UDFs).

Fine-grained access control for special administrative operations that can be performed on the OpDBMS is also supported.
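As a rough illustration of what an object-level policy looks like (the host, service name, and principals are placeholders, and the payload shape follows Ranger's public v2 policy API), creating a column-level policy over REST might look like this:

```python
import requests

# Hypothetical column-level policy: the "analysts" group may SELECT
# two columns of sales.orders through the Hive service.
policy = {
    "service": "cm_hive",
    "name": "analysts-read-sales",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["orders"]},
        "column":   {"values": ["order_id", "total"]},
    },
    "policyItems": [{
        "groups":   ["analysts"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    "http://ranger-host:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"),   # placeholder credentials
)
resp.raise_for_status()
```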

Click through for the full story.


Key Metrics for Kafka Monitoring

Preetdeep Kumar shares three metrics which are important for monitoring Kafka clusters:

There are hundreds of metrics documented as part of Kafka monitoring, of which CPU, memory, disk, and network-related metrics are always useful in monitoring any system. In this article, I share 3 metrics that I found to be very useful from a development point of view, which saved us some time while triaging a few corner cases reported by customers.
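As a taste of the genre (this is my sketch, not one of the article's three picks), here's a quick consumer-lag check using the kafka-python client; the broker address, topic, and consumer group are placeholders.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="my-group",
                         enable_auto_commit=False)

tp = TopicPartition("my-topic", 0)
consumer.assign([tp])

end_offset = consumer.end_offsets([tp])[tp]  # newest offset on the broker
committed = consumer.committed(tp) or 0      # last offset the group committed

# Lag = how far the consumer group trails the head of the partition.
print(f"partition 0 lag: {end_offset - committed}")
```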

Click through for those measures.


Connecting to Azure Databricks from Power BI

Gerhard Brueckl walks us through the Power BI connector to Azure Databricks:

I work a lot with Azure Databricks, and a topic that always comes up is reporting on top of the data that is processed with Databricks. Even though notebooks offer some great ways to visualize data for analysts and power users, it is usually not the kind of report top management would expect. For those scenarios, you still need to use a proper reporting tool, which usually is Power BI when you are already using Azure and other Microsoft tools.

So, I am very happy that there is finally an official connector in Power BI to access data from Azure Databricks! Previously you had to use the generic Spark connector (docs), which was rather difficult to configure and only supported authentication using a Databricks Personal Access Token.

Click through to see how it works.


The Session Window in Flink

Kundan Kumarr continues a series on windows in Apache Flink:

In the real world, all the work that we do online (visiting a website, clicking around it, making online transactions, and so on) happens in sessions. We might go to an e-commerce website like Amazon, look for products, click around for a bit, and then stop, all within a single session. These websites may want to track the pages we visited in a single session. For that, they need to group together all of the clicks streaming in, based on a session. These streaming use cases can be implemented easily with Flink's session window.

The session window assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, and the number of elements within a session window is not fixed, because it is typically the user's behavior that determines how long a session lasts. A session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurs. For example, once we have been idle on the Amazon website for, say, one minute, that is the end of the previous session, and if we go back to the site afterward, a new session starts. What determines the session boundary is the pause between one click and the next.
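As a rough sketch of the idea (the table, columns, and the datagen source are stand-ins for a real Kafka source with event-time watermarks), a one-minute session window expressed in Flink SQL via PyFlink might look like this:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(settings)

# Stand-in source; a real job would read clicks from Kafka and
# declare the same watermark on the event-time column `ts`.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        page    STRING,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# Group clicks into per-user sessions that close after one minute of
# inactivity; each result row is one session with its click count.
sessions = t_env.sql_query("""
    SELECT user_id,
           SESSION_START(ts, INTERVAL '1' MINUTE) AS session_start,
           SESSION_END(ts, INTERVAL '1' MINUTE)   AS session_end,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, SESSION(ts, INTERVAL '1' MINUTE)
""")
sessions.execute().print()
```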

Click through for a depiction and an example.


From Kafka Into Azure Data Explorer

Anagha Khanolkar walks us through a data movement scenario:

Here is an end-to-end, hands-on lab showcasing the connector in action. You can see an overview of the lab below. In our lab example, we’re going to stream the Chicago crimes public dataset to Kafka on Confluent Cloud on Azure using Spark on Azure Databricks. Then, we will use the Kusto connector to stream the data from Kafka to Azure Data Explorer.
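As a hedged sketch of the first leg of that pipeline (the Kafka-to-ADX leg is configured in the Kusto sink connector rather than written as code), writing a DataFrame to a Kafka topic from Spark looks roughly like this; the file path, broker address, and topic are placeholders, and the SASL settings Confluent Cloud requires are omitted for brevity.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct, col

# Assumes the spark-sql-kafka package is on the cluster.
spark = SparkSession.builder.appName("crimes-to-kafka").getOrCreate()

# Placeholder path for the Chicago crimes public dataset.
df = spark.read.option("header", True).csv("/data/chicago_crimes.csv")

# The Kafka sink expects string key/value columns; here each row is
# serialized to JSON ("ID" is assumed to be the dataset's key column).
(df.select(col("ID").cast("string").alias("key"),
           to_json(struct(*df.columns)).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "<confluent-bootstrap>:9092")
   .option("topic", "crimes")
   .save())
```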

There’s also a lab to try this out, though the estimated spend is a bit high.


Finding Skew in a Spark DataFrame

Landon Robinson walks us through skew in Spark DataFrames:

Ignoring issues caused by skew can be worth it sometimes, especially if the skew is not too severe, or isn’t worth the time spent for the performance gained. This is particularly true with one-off or ad-hoc analysis that isn’t likely to be repeated, and simply needs to get done.

However, the rest of the time, we need to find out where the skew is occurring, and take steps to dissolve it and get back to processing our big data. This post will show you one way to help find the source of skew in a Spark DataFrame. It won't delve into the handful of ways to mitigate it, such as repartitioning, distributing/clustering, and isolation (though our new book will), but it will certainly help pinpoint where the issue may be.
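The post has its own approach; as a generic starting point (not necessarily the author's technique), counting rows per key and per partition will often surface skew quickly. The path and key column below are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

df = spark.read.parquet("/data/events")  # placeholder dataset

# Logical skew: which key values dominate the data?
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(20)

# Physical skew: how evenly are rows spread across Spark partitions?
(df.groupBy(F.spark_partition_id().alias("partition"))
   .count()
   .orderBy(F.desc("count"))
   .show(20))
```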

Click through to learn more.
