Category: Hadoop

MLflow 0.8.1 Released

Aaron Davidson, et al., announce a new version of Databricks’ MLflow:

When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code below demonstrates result type selection using the result_type parameter, and the short example notebook illustrates a Spark model being logged and then loaded as a Spark UDF.
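As a concrete illustration, here is a minimal sketch of scoring a logged model as a Spark UDF with a result type filter. The model URI and column names are placeholders of mine, not from the release notes, and the exact spark_udf signature has shifted a bit across MLflow versions:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.appName("mlflow-udf").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0)], ["feature1", "feature2"])

# Hypothetical model location. DoubleType() would return the first column
# of double scalars; ArrayType(DoubleType()) returns every such column.
predict = mlflow.pyfunc.spark_udf(
    spark,
    "runs:/<run_id>/model",
    result_type=ArrayType(DoubleType()),
)

scored = df.withColumn("prediction", predict("feature1", "feature2"))
```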

Read on for a pretty long list of updates.

File Formats Supported In HDFS

Manoj Pandey covers a few of the file types supported by the Hadoop Distributed File System:

HDFS, or the Hadoop Distributed File System, is the distributed file system provided by the Hadoop big data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster, which it achieves by replicating data across different racks in the cluster infrastructure. The files stored in HDFS are then used for further processing by data processing engines like Hadoop MapReduce, Hive, Spark, Impala, Pig, etc.

There are a few other formats not included in this list, including RCFile (which has been superseded by both ORC and Parquet), but this hits the highlights.
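If you want to play along at home, here is a hedged PySpark sketch of writing one dataset out in a few of these formats; the HDFS paths and input file are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-formats").getOrCreate()
df = spark.read.csv("hdfs:///data/input.csv", header=True, inferSchema=True)

# Row-oriented output: human-readable, but scans must read whole records.
df.write.mode("overwrite").json("hdfs:///data/out/json")

# Columnar output: compressed, splittable, and predicate-pushdown friendly,
# which is one reason ORC and Parquet have largely superseded RCFile.
df.write.mode("overwrite").parquet("hdfs:///data/out/parquet")
df.write.mode("overwrite").orc("hdfs:///data/out/orc")
```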

Using Hive For Data Virtualization

Gunther Hagleitner, et al., walk us through some reasons why we might want to use Apache Hive for data virtualization:

Assume you want to execute a Hive query that accesses data from an external RDBMS behind a JDBC connection. A possible naïve way of doing this would treat the JDBC source as a “dumb” storage system, reading all the raw data over JDBC and processing it in Hive. In this case you would ignore the query capabilities of the RDBMS and pull too much data over the JDBC link, thus ending up with poor performance and an overloaded system.
For that reason, Hive implements smart push-down to other systems by relying on its storage handler interfaces and its cost-based optimizer (CBO), powered by Apache Calcite. In particular, Calcite provides rules that match a subset of operators in the logical representation of the query and generate a new, equivalent representation with more operations executed in the external system. Hive includes those rules that push computation to the external systems in its query planner, and then relies on Calcite to generate a valid query in the language those systems support. The storage handler implementations are responsible for sending the generated query to the external system, retrieving its results, and transforming the incoming data into Hive’s internal representation so it can be processed further if needed.
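To make that concrete, here is a hedged sketch of the setup side: exposing an RDBMS table to Hive through the JDBC storage handler, issued from Python via PyHive. All connection details and names are placeholders:

```python
from pyhive import hive  # assumes a reachable HiveServer2 instance

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Once the external table exists, Calcite's push-down rules can rewrite
# filters, projections, and joins to run inside the remote RDBMS instead
# of pulling all raw rows over the JDBC link.
cur.execute("""
CREATE EXTERNAL TABLE pg_orders (
  order_id    INT,
  customer_id INT,
  total       DOUBLE
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "POSTGRES",
  "hive.sql.jdbc.driver"   = "org.postgresql.Driver",
  "hive.sql.jdbc.url"      = "jdbc:postgresql://pg.example.com/shop",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table"         = "orders"
)
""")
```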

A lot of platforms are moving toward data virtualization (e.g., SQL Server with its Big Data Clusters). That appears to be the next product battleground.

Vectorization With Apache Hive And Parquet Tables

Vihang Karajgaonkar, et al., take us through a performance improvement in Apache Hive when using Parquet tables:

The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling Parquet vectorization gave a 26.5% performance improvement on average (geometric mean of runtime across all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions, like supported column data types and expressions, are satisfied; if the query cannot be vectorized, its execution falls back to the non-vectorized path. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.
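If you want to check whether your own queries benefit, vectorization hangs off a session-level switch, and EXPLAIN VECTORIZATION reports whether (and why not) a query vectorizes. A hedged sketch via PyHive, with a made-up host and table:

```python
from pyhive import hive

cur = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

# The master switch; on CDH 6.0 this also covers the vectorized Parquet reader.
cur.execute("SET hive.vectorized.execution.enabled=true")

# Reports per-operator vectorization status, including fallback reasons.
cur.execute(
    "EXPLAIN VECTORIZATION SUMMARY "
    "SELECT store_id, SUM(total) FROM sales_parquet GROUP BY store_id"
)
for row in cur.fetchall():
    print(row)
```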

This is worth looking into, especially if you are on the Cloudera stack.

More On Confluent’s Licensing Change

Alex Woodie has an article covering Confluent’s recent licensing change:

Confluent this month became the latest commercial open source software company to restrict the use of its software in the cloud. The move prevents cloud companies from using parts of the Confluent Platform, such as the KSQL component that uses SQL to process streaming data, as standalone software as a service (SaaS) offerings.
Jay Kreps, the co-creator of Apache Kafka and the CEO of Confluent, explained the significance of switching the Confluent Platform from the Apache 2.0 license to the new Confluent Community License.

Over at Aiven, CTO Heikki Nousiainen shares his thoughts:

The new Confluent Community License is a proprietary software license, specifically excluding “making available any software-as-a-service, platform-as-a-service, infrastructure-as-a-service or other similar online service that competes with Confluent products or services that provide the Software.”
While the license change does apply to all future versions of the specific software, it doesn’t alter the licensing status of the components in the versions that have been released and utilized by Aiven.

I believe it would be best to read the latter article looking for the significant silences.

TPC-DS Testing With HDP 3.0

Nita Dembla and Gopal Vijayaraghavan compare HDP 3.0 versus HDP 2.6.5 when running the TPC-DS query set and note performance improvements in Hive LLAP:

Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as an enterprise-ready, real-time data warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities: ACID support, materialized views, SQL constraints, and a query result cache, to name a few. Additionally, we continued to build and improve on the performance enhancements introduced in earlier releases.
In this blog, we will provide an update on our performance benchmark blog, comparing the performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in this benchmark is that all tables are transactional by default and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints: the hive-testbench schema has been enhanced to declare primary/foreign key, NOT NULL, and UNIQUE constraints.
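To ground the “transactional by default, with declared constraints” point, here is a hedged sketch of the flavor of DDL involved; the table and column names are mine, not hive-testbench’s:

```python
from pyhive import hive

cur = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

# In HDP 3.0, managed ORC tables are ACID by default. Constraints like the
# primary key below are declared as optimizer hints rather than enforced,
# hence DISABLE NOVALIDATE RELY.
cur.execute("""
CREATE TABLE store_sales (
  sale_id  BIGINT,
  store_id INT,
  total    DECIMAL(10, 2) NOT NULL,
  PRIMARY KEY (sale_id) DISABLE NOVALIDATE RELY
)
STORED AS ORC
""")
```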

Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.

Using Sqoop’s Logic To Improve Spark JDBC Performance

Avi Yehuda analyzes how Sqoop works to make relational database access from Spark faster:

Sqoop performed much better almost instantly; all you needed to do was set the number of mappers according to the size of the data, and it worked perfectly.
Since both Spark and Sqoop are based on the Hadoop MapReduce framework, it was clear that Spark could work at least as well as Sqoop; I only needed to find out how. So I decided to look closer at what Sqoop does to see if I could imitate it with Spark.
By turning on Sqoop’s verbose flag, you can get a lot more detail. What I found was that Sqoop splits the input across the different mappers, which makes sense: this is MapReduce, after all, and Spark does the same thing. But before doing that, Sqoop does something smart that Spark doesn’t do.

Read on to see what in particular Sqoop does, and how you can use that in your Spark code.
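Without spoiling the whole article: the Spark-side analogue of Sqoop’s trick is the partitioned JDBC read, where you hand Spark a split column plus its bounds so each partition pulls one slice. A hedged sketch with placeholder connection details and table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned").getOrCreate()
url = "jdbc:postgresql://db.example.com/shop"
props = {"user": "etl", "password": "secret", "driver": "org.postgresql.Driver"}

# Like Sqoop, first ask the database for the boundaries of the split column.
bounds = spark.read.jdbc(
    url, "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM orders) b", properties=props
).collect()[0]

# Then read in parallel: Spark issues one bounded query per partition.
orders = spark.read.jdbc(
    url, "orders",
    column="id", lowerBound=bounds["lo"], upperBound=bounds["hi"],
    numPartitions=8, properties=props,
)
```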

KSQL Deployment Options

Hojjat Jafarpour shows us two deployment options for Kafka Streams with KSQL:

As I mentioned, we have implemented KSQL on top of the Kafka Streams API. This means that every KSQL query is compiled into a Kafka Streams application; therefore, KSQL queries follow the same execution model as Kafka Streams applications.
A query can be executed on multiple instances, and each instance will process a portion of the input data from the input topic(s) as well as generate portions of the output data to the output topic(s). Based on this execution model, and depending on how we want to run our queries, KSQL currently provides two deployment options.
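For flavor, here is a hedged sketch of the first (interactive) option: submitting a persistent query to the KSQL server’s REST endpoint. The server URL, stream, and topic names are invented; in headless mode, the same statement would instead sit in a queries file passed to the server at startup:

```python
import json
import requests

# A persistent query; KSQL compiles it into a Kafka Streams application.
statement = (
    "CREATE STREAM pageviews_enriched AS "
    "SELECT userid, pageid FROM pageviews WHERE userid IS NOT NULL;"
)

resp = requests.post(
    "http://ksql-server.example.com:8088/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    data=json.dumps({"ksql": statement, "streamsProperties": {}}),
)
print(resp.status_code, resp.json())
```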

Read on for those options.

Summarizing Improvements In Spark 2.4

Anmol Sarna summarizes Apache Spark 2.4 and pushes his meme game at the same time:

The next major enhancement was the addition of a lot of new built-in functions, including higher-order functions, to make dealing with complex data types easier.
Spark 2.4 introduced 24 new built-in functions, such as array_union and array_max/array_min, and 5 higher-order functions, such as transform and filter.
The entire list can be found here.
Earlier, there were two typical solutions for manipulating complex types (e.g., array type) directly:
1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again, or
2) building a user-defined function (UDF).
In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function, similar to UDFs but with much better performance.
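A quick sketch of the contrast, with a toy dataset of mine. Note that in Spark 2.4 the higher-order functions are reached through SQL expressions (selectExpr below) rather than the Python DSL:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3])], ["id", "xs"])

# The old route: explode to rows, transform, then re-assemble the array
# (collect_list does not even guarantee the original element order).
exploded = (
    df.select("id", F.explode("xs").alias("x"))
      .withColumn("x", F.col("x") + 1)
      .groupBy("id").agg(F.collect_list("x").alias("xs"))
)

# The Spark 2.4 route: transform the array in place with a lambda.
df.selectExpr("id", "transform(xs, x -> x + 1) AS xs").show()
```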

2.4 was a big release, so check this out for a great summary of the improvements it brings.

What’s New In Ambari 2.7

Paul Codding and Kat Petre share some of the new features in Ambari 2.7:

With this release, we wanted to make Ambari more enjoyable to use every day, simplify finding and using our API, and unblock teams managing very large clusters. Here is a preview of a few features we’re excited to share with you:
Revamped UI – Our Hortonworks Product Design Team has created a great new UI design system, and we’ve taken the time to update Ambari to use the new styles, colors, and navigation concepts from it.

I’ll have to rebuild my home Hadoop cluster someday, so I’ll have to check this out when I do.
