Press "Enter" to skip to content

Category: Hadoop

Choosing a SQL Platform on Hadoop

Sagar Kewalramani walks us through the choices for SQL platforms on the Cloudera Data Platform:

CDW on CDP is a new service that enables you to create a self-service data warehouse for teams of Business Intelligence (BI) analysts. You can quickly provision a new data warehouse and share any data set with a specific team or department. Do you remember when you could provision a data warehouse on your own? Without infrastructure and platform teams getting involved? This was never possible. CDW fulfills this mission.

However, CDW makes several SQL engines available, and with more choice come more opportunities for confusion. Let’s explore the SQL engines available in CDW on CDP and talk about which is the right SQL option for the right use case.

So many choices!  Impala? Hive LLAP?  Spark? What to use when?  Let’s explore.

Infrastructure and platform teams start to get involved approximately two days after the unexpectedly large bill arrives.

That aside, this is a really nice article covering several platform technologies, including Impala, Hive LLAP, and Spark SQL.

Comments closed

Parquet Versus Avro

Matthew Rathbone compares the Parquet and Avro file formats:

JSON improves upon CSV as each row provides some indication of schema, but without a special header-row, there’s no way to derive a schema for every record in the file, and it isn’t always clear what type a ‘null’ value should be interpreted as.

Avro and Parquet on the other hand understand the schema of the data they store. When you write a file in these formats, you need to specify your schema. When you read the file back, it tells you the schema of the data stored within. This is super useful for a framework like Spark, which can use this information to give you a fully formed data-frame with minimal effort.

I was kind of hoping to see ORC in the comparison as well, though even when the Hortonworks-Cloudera competition was at its peak, my recollection is that the differences between the two formats were pretty small: ORC was a little faster for non-nested data and Parquet a little faster for nested data.
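To make the schema point concrete, here is a minimal Spark sketch in Scala. The file paths are made up, and the Avro equivalent would additionally need the spark-avro module on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-demo").getOrCreate()

// Parquet (and Avro) carry their schema with the data, so Spark can build a
// fully typed DataFrame without any hints from us.
val parquetDf = spark.read.parquet("/data/events.parquet")   // hypothetical path
parquetDf.printSchema()   // column names and types come from the file metadata

// CSV has no embedded schema: either Spark infers one by scanning the data,
// or we have to spell it out ourselves.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // an extra pass over the data just to guess types
  .csv("/data/events.csv")         // hypothetical path
csvDf.printSchema()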

Comments closed

Fixing the Small File Problem in Hadoop

Guy Shilo takes us through the Hadoop Archive format:

It has a hard time handling many small files. The memory footprint of the namenodes becomes high as they have to keep track of many small blocks, and the performance of scans goes down.

The best way to fix this situation is, of course, to avoid it in the first place. This can be done when designing the application or the pipeline that inserts the data into HDFS, for example, by bundling many files into one container such as a SequenceFile, Avro, or a Hadoop archive (.har file).

Hadoop archive is a somewhat overlooked option that I want to demonstrate today. You will see that it can be very useful in some cases but not so great in others.

Read the whole thing before giving it a try, as there are some downsides.
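Guy Shilo's post demonstrates the .har route; as a rough illustration of the "bundle many files into one container" idea he mentions, here is a minimal Spark compaction sketch in Scala. The paths and target file count are hypothetical, and note that this rewrites the data rather than archiving it in place:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Read a directory full of small files...
val df = spark.read.parquet("/data/raw/events/")      // hypothetical input path

// ...and rewrite it as a handful of larger files, easing the NameNode's
// block-tracking burden. coalesce(8) is an arbitrary target; size it so each
// output file is at least one HDFS block (typically 128 MB or more).
df.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("/data/compacted/events/")                 // hypothetical output path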

Comments closed

Sticky Partitioning in Kafka 2.4

Justine Olshan takes us through sticky partitioning in Kafka 2.4:

The sticky partitioner addresses the problem of spreading out records without keys into smaller batches by picking a single partition to send all non-keyed records to. Once the batch at that partition is filled or otherwise completed, the sticky partitioner randomly chooses and “sticks” to a new partition. That way, over a larger period of time, records are about evenly distributed among all the partitions while getting the added benefit of larger batch sizes.

It looks like this is an improvement with few downside tradeoffs.
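For reference, the sticky behavior kicks in for records sent without a key under the 2.4 default partitioner; there is also a UniformStickyPartitioner if you want stickiness regardless of keys. A minimal producer sketch in Scala, with a hypothetical broker address and topic:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// With the 2.4+ default partitioner, records with a null key "stick" to one
// partition until the batch completes, then a new partition is chosen.
// To use sticky assignment even for keyed records, you could set:
// props.put("partitioner.class", "org.apache.kafka.clients.producer.UniformStickyPartitioner")

val producer = new KafkaProducer[String, String](props)
(1 to 1000).foreach { i =>
  // No key supplied, so the sticky logic decides the partition.
  producer.send(new ProducerRecord[String, String]("events", s"record-$i"))
}
producer.close()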

Comments closed

New Features in Kafka 2.4

Manikumar Reddy announces new features in Apache Kafka 2.4:

KIP-392: Allow consumers to fetch from closest replica

Historically, consumers were only allowed to fetch from leaders. In multi-datacenter deployments, this often means that consumers are forced to incur expensive cross-datacenter network costs in order to fetch from the leader. With KIP-392, Kafka now supports reading from follower replicas. This gives the broker the ability to redirect consumers to nearby replicas in order to save costs.

It’s not the biggest release of Kafka ever, but there are some really nice updates here.
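As I understand KIP-392, taking advantage of it needs two pieces: a rack-aware replica selector on the brokers and a rack hint from the consumer. A minimal sketch in Scala, with hypothetical rack names, addresses, and topic (the broker-side property names are worth double-checking against the Kafka 2.4 docs):

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

// Broker side (server.properties), per KIP-392:
//   broker.rack=us-east-1a
//   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // hypothetical
props.put("group.id", "my-group")                  // hypothetical
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
// Tell the broker which "rack" (e.g. availability zone) this consumer lives in,
// so fetches can be served by a nearby follower instead of a remote leader.
props.put("client.rack", "us-east-1a")             // hypothetical rack id

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("events"))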

Comments closed

Schiphol Takeoff: Low-Code Automated Deployment

Tim van Cann and Daniel van der Ende have an open source project for automatic deployment on Azure:

To give a bit more insight into why we built Schiphol Takeoff, it’s good to take a look at an example use case. This use case ties a number of components together:

– Data arrives in a (near) real-time stream on an Azure Eventhub.
– A Spark job running on Databricks consumes this data from Eventhub, processes the data, and outputs predictions.
– A REST API is running on Azure Kubernetes Service, which exposes the predictions made by the Spark job.

Conceptually, this is not a very complex setup. However, there are quite a few components involved:

– Azure Eventhub
– Azure Databricks
– Azure Kubernetes Service

Each of these individually has some form of automation, but there is no unified way of coordinating and orchestrating deployment of the code to all at the same time. If, for example, you were to change the name of the consumer group for Azure Eventhub, you could script that. However, you’d also need to manually update your Spark job running on Databricks to ensure it could still consume the data.

This looks pretty nice. I’ll need to dive into it some more.

Comments closed

YARN Container Sizing

Dmitry Tolpeko explains why large YARN containers aren’t always a great idea:

First I noticed that the job used only 100 containers, i.e. just one container per cluster node. This was very suspicious, as Hive uses the Apache Tez execution engine, which can run only one task at a time per container.

Looking at the Hive script I found:

set hive.tez.container.size = 10240; -- 10 GB

Looks like someone had a memory problem with this query before and wanted to solve it once and forever!

Read on to see why this was not a great idea.
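For a rough sense of the arithmetic: if each node offers around 10-12 GB to YARN, a 10 GB Tez container means only one task fits per node, so a 100-node cluster tops out at about 100 concurrent tasks. Dropping the container size to, say, 2 GB would fit roughly five tasks per node, on the order of 500 concurrent tasks, for a query that probably never needed the extra memory in the first place.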

Comments closed

Geospatial Data Processing with Databricks

Razavi and Michael Johns walk us through examples of processing geospatial data with Databricks:

Earlier, we loaded our base data into a DataFrame. Now we need to turn the latitude/longitude attributes into point geometries. To accomplish this, we will use UDFs to perform operations on DataFrames in a distributed fashion. Please refer to the provided notebooks at the end of the blog for details on adding these frameworks to a cluster and the initialization calls to register UDFs and UDTs. For starters, we have added GeoMesa to our cluster, a framework especially adept at handling vector data. For ingestion, we are mainly leveraging its integration of JTS with Spark SQL, which allows us to easily convert to and use registered JTS geometry classes. We will be using the function st_makePoint, which, given a latitude and longitude, creates a Point geometry object. Since the function is a UDF, we can apply it to columns directly.

Looks like they have some pretty good functionality here, and they have shared the demos in notebook form.
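Here is a rough sketch of the st_makePoint step in Scala, assuming the GeoMesa Spark JTS module is on the cluster. The registration call, the column names, and the (longitude, latitude) argument order are assumptions on my part, so check them against the notebooks linked in the post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("geo-demo").getOrCreate()

// Registers GeoMesa's JTS-backed UDFs and UDTs (st_makePoint, st_contains, ...)
// with this Spark session; requires the geomesa-spark-jts artifact on the classpath.
import org.locationtech.geomesa.spark.jts._
spark.withJTS

val raw = spark.read.parquet("/data/pings.parquet")   // hypothetical input with lat/lon columns

// st_makePoint takes (x, y), i.e. (longitude, latitude), and returns a Point geometry.
val withGeom = raw.withColumn("geom", expr("st_makePoint(longitude, latitude)"))
withGeom.printSchema()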

Comments closed

Flink 1.8.3 Released

Hequn Cheng announces Flink 1.8.3:

The Apache Flink community released the third bugfix version of the Apache Flink 1.8 series.

This release includes 45 fixes and minor improvements for Flink 1.8.2. Below is a detailed list of all fixes and improvements.

We highly recommend that all users upgrade to Flink 1.8.3.

There’s a nice list of bugfixes in the update.

Comments closed

Kryo Serialization in Spark

Pinku Swargiary shows us how to configure Spark to use Kryo serialization:

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. Join and grouping operations are where serialization has an impact, and they usually involve data shuffling. The less data there is to shuffle, the faster the operation.
Caching also has an impact when caching to disk or when data is spilled over from memory to disk.

Also, if we look at the size metrics below for both Java and Kryo, we can see the difference.

Sounds like it’s better overall but requires some custom configuration.
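The core of that configuration is a couple of SparkConf settings plus registering your own classes; a minimal sketch in Scala, where the registered class is hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain class we want serialized efficiently.
case class SensorReading(id: Long, value: Double)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Switch from the default Java serializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional but useful: fail fast if a class isn't registered,
  // instead of silently falling back to writing full class names.
  .set("spark.kryo.registrationRequired", "true")
  // Register the classes that will be shuffled or cached.
  .registerKryoClasses(Array(classOf[SensorReading]))

val spark = SparkSession.builder().config(conf).getOrCreate()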

Comments closed