Press "Enter" to skip to content

Category: Hadoop

Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

Higher-order Functions to easily manipulate arrays and structs
Support for Apache Avro, a row-oriented data serialization framework
Custom Serialization using R functions to read and write any data format
Other Improvements such as compatibility with EMR 6.0 and Spark 3.0, and initial support for the Flint time series library

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.
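
If you want a quick feel for what the higher-order function support wraps, here is a minimal PySpark sketch of the underlying Spark SQL transform() and filter() array functions (sparklyr 1.3 exposes the same operations from R; the sample data and column names below are invented for illustration):

    # Minimal sketch of Spark SQL higher-order functions on array columns.
    # sparklyr 1.3 wraps the same functionality from R; the sample data here
    # is invented purely for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hof-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0])],
        ["id", "values"],
    )

    # transform() applies a lambda to every array element;
    # filter() keeps only the elements matching a predicate.
    result = df.select(
        "id",
        F.expr("transform(values, x -> x * 2) AS doubled"),
        F.expr("filter(values, x -> x > 1.5) AS large_values"),
    )
    result.show(truncate=False)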


Survival Analysis in Spark

Rob Saker and Bryan Smith hit on a topic close to my heart:

These patterns seem to indicate that KKBox could actually differentiate between customers based on their lifetime potential using information known at the time of acquisition. This information might help inform or steer specific discounts or promotions to customers as they register for a trial. This information might also inform KKBox of which offerings or capabilities to discontinue as some, e.g. Initial Payment Method 35 or the 7-day payment plan as shown in Figure 3, align with exceptionally high churn rates in the first 30 days with little long-term survivorship.

Of course, there are relationships between these factors so that we should be careful in viewing them in isolation. By deriving a baseline risk (hazard) of customer churn (Figure 4), we can calculate the influence of different factors on the baseline in such a manner that each factor may be considered an independent hazard multiplier. When combined (through simple multiplication) against the baseline, we can plot a specific customer’s chances of abandoning a subscription by a given point in time (Table 1).

Click through for the story as well as a set of notebooks.
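
The multiplication the authors describe is the proportional hazards idea: a baseline hazard scaled by per-factor multipliers. Here is a toy sketch with invented numbers, not values from the KKBox analysis:

    # Toy illustration of combining a baseline hazard with independent
    # hazard multipliers, as described above. All numbers are invented.
    import numpy as np

    # Baseline daily hazard of churn over the first 10 days of a subscription.
    baseline_hazard = np.full(10, 0.02)

    # Hypothetical multipliers for one customer's attributes.
    multipliers = {
        "initial_payment_method_35": 1.8,
        "seven_day_plan": 2.1,
        "registered_via_mobile": 0.9,
    }

    # Combine multiplicatively against the baseline.
    customer_hazard = baseline_hazard * np.prod(list(multipliers.values()))

    # Survival curve: probability of still being subscribed after each day.
    survival = np.cumprod(1 - customer_hazard)
    print(survival.round(3))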


Calculating Spark Application Resource Allocations

The Hadoop in Real World team walks us through resource allocation for Spark applications:

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of resource allocation properties in YARN and also resource related properties in Spark. Let’s look at both.

This post covers the properties you want to keep an eye on when running Spark applications.
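
As a rough illustration of the arithmetic involved, here is a sketch using common rules of thumb (roughly five cores per executor, a core and some memory reserved per node, and a memory overhead allowance); these defaults are general guidance, not numbers from the linked post:

    # Back-of-the-envelope executor sizing for a YARN cluster.
    # The reserved-core/memory and cores-per-executor figures are common
    # rules of thumb, not values taken from the linked post.
    def size_executors(nodes, cores_per_node, mem_per_node_gb,
                       cores_per_executor=5, overhead_fraction=0.10):
        usable_cores = cores_per_node - 1        # leave a core for OS/daemons
        usable_mem = mem_per_node_gb - 1         # leave ~1 GB per node
        executors_per_node = usable_cores // cores_per_executor
        total_executors = executors_per_node * nodes - 1   # one slot for the driver
        mem_per_executor = usable_mem / executors_per_node
        # spark.executor.memory excludes the off-heap overhead YARN adds on top.
        executor_memory = mem_per_executor * (1 - overhead_fraction)
        return {
            "spark.executor.instances": total_executors,
            "spark.executor.cores": cores_per_executor,
            "spark.executor.memory": f"{int(executor_memory)}g",
        }

    print(size_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64))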


Downsides to Optimization in Spark SQL

Anuj Saxena takes us through some of the pros and cons of using the Catalyst Optimizer in Spark, including a couple of issues:

I am sure the optimizations make the calculation time very short and these optimizations are implemented in such a way that you just have to provide the logic and everything else will be done in abstraction. But as my friend and colleague Ramandeep says, “Abstract features come with abstract issues”. So the following are a few issues I have faced in my recent interactions with Spark SQL:

1. Too large of a query to be stored in memory
2. Implicit optimizations interfere with partitioning

Click through for examples of both issues.
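
One common mitigation for the first issue (a query plan that grows too large to hold in memory) is to checkpoint intermediate DataFrames so Catalyst restarts from a truncated lineage. A hedged PySpark sketch with an invented iterative job:

    # Sketch of truncating a Catalyst plan that has grown too large,
    # e.g. inside an iterative job. The loop and column names are invented.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("plan-truncation").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    df = spark.range(0, 1_000_000).withColumn("value", F.rand())

    for i in range(50):
        # Each iteration adds more nodes to the logical plan.
        df = df.withColumn("value", F.col("value") * 1.01)
        if i % 10 == 0:
            # checkpoint() materializes the data and cuts the plan/lineage,
            # keeping the optimizer's working set bounded.
            df = df.checkpoint()

    df.agg(F.sum("value")).show()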


Performance Tuning for Cloudera’s Operational Database

Liliana Kadar, et al., show us the tools we can use to tune Cloudera’s Operational Database:

A query optimizer determines the most efficient way to run a query. Query optimization helps you to reduce the hardware resources required to run a query and also speeds up your query response time. Cloudera’s Operational Database provides you with various tools such as plan analyzers to make optimal use of your computing resources.

Cloudera’s OpDB provides various cost-based and rules-based optimizers. You can use different optimizers based on your use cases. OpDB is primarily used for Online Transactional Processing (OLTP) use cases with Apache Phoenix in the OpDB used as a SQL engine. But you can also use Hive and Impala for Online Analytical Processing (OLAP) use cases. 

Read on for recommendations on platform choice as well as indexing and tuning options.


Schema References and Multiple Event Types in a Kafka Topic

Robert Yokota updates some prior knowledge:

In the article Should You Put Several Event Types in the Same Kafka Topic?, Martin Kleppmann discusses when to combine several event types in the same topic and introduces new subject name strategies for determining how Confluent Schema Registry should be used when producing events to an Apache Kafka® topic.

Schema Registry now supports schema references in Confluent Platform 5.5, and this blog post presents an alternative means of putting several event types in the same topic using schema references, discussing the advantages and disadvantages of this approach.

Click through to see how this works out.
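
For a rough sense of the mechanics, the sketch below registers a wrapper schema with references via the Schema Registry REST API; the subject names, versions, and URL are placeholders, and the linked post covers the recommended approach in detail.

    # Rough sketch of registering an Avro schema that references other
    # already-registered schemas, using Schema Registry's REST API.
    # Subject names, versions, and the URL are placeholders.
    import json
    import requests

    schema_registry = "http://localhost:8081"

    # A wrapper schema whose value can be either referenced event type.
    wrapper_schema = {
        "schema": json.dumps(["com.example.PageView", "com.example.Purchase"]),
        "schemaType": "AVRO",
        "references": [
            {"name": "com.example.PageView", "subject": "pageview-value", "version": 1},
            {"name": "com.example.Purchase", "subject": "purchase-value", "version": 1},
        ],
    }

    resp = requests.post(
        f"{schema_registry}/subjects/events-value/versions",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps(wrapper_schema),
    )
    print(resp.json())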


Query Scheduling with Apache Hive

Zoltan Haindrich and Jesus Camacho Rodriguez walk us through scheduled queries in Apache Hive:

To fulfill that purpose, Apache Hive recently introduced a new feature called scheduled queries. Using SQL statements, users can schedule Hive queries to run on a recurring basis, monitor their progress, and optionally disable a query schedule.

In a nutshell, every scheduled query in Hive consists of (i) a unique name to identify the schedule, (ii) the actual SQL statement to be executed, and (iii) the schedule at which the query should be executed, defined by a Quartz cron expression. In addition, a scheduled query belongs to a namespace, i.e., a collection of HiveServer2 instances that are responsible for executing the query.

Read on for examples of how you might create, use, and learn about scheduled queries running on a system.
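
As a rough sketch of those three parts (name, statement, and Quartz cron schedule), here is one way to submit the DDL from Python, assuming HiveServer2 access via the PyHive client; the host, table, and schedule are invented:

    # Rough sketch of creating and inspecting a Hive scheduled query.
    # Assumes a HiveServer2 endpoint reachable via PyHive; the host, table
    # names, and schedule are placeholders.
    from pyhive import hive

    conn = hive.connect(host="hs2.example.com", port=10000)
    cur = conn.cursor()

    # (i) unique name, (ii) SQL statement, (iii) Quartz cron schedule:
    # rebuild a summary table every hour, on the hour.
    cur.execute("""
        CREATE SCHEDULED QUERY hourly_sales_rollup
        CRON '0 0 * * * ? *'
        AS INSERT OVERWRITE TABLE sales_hourly
           SELECT hour(sold_at), sum(amount) FROM sales GROUP BY hour(sold_at)
    """)

    # Scheduled query metadata is exposed through the information_schema.
    cur.execute("SELECT * FROM information_schema.scheduled_queries")
    print(cur.fetchall())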


Apache Flink 1.11.0 Released

Marta Paes announces Apache Flink version 1.11:

Change Data Capture (CDC) has become a popular pattern to capture committed changes from a database and propagate those changes to downstream consumers, for example to keep multiple datastores in sync and avoid common pitfalls such as dual writes. Being able to easily ingest and interpret these changelogs into the Table API/SQL has been a highly demanded feature in the Flink community — and it’s now possible with Flink 1.11.

Click through for the full list of updates.
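
To give a flavor of the CDC support, here is a hedged PyFlink sketch that declares a table over Debezium-formatted change events in Kafka; the topic, columns, and broker address are invented, and the exact API details vary a bit by Flink version:

    # Sketch of ingesting Debezium change events into Flink SQL (Flink 1.11+).
    # Topic name, columns, and broker address are placeholders.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = TableEnvironment.create(settings)

    # The 'debezium-json' format interprets each Kafka record as an
    # insert/update/delete changelog row rather than a plain append.
    t_env.execute_sql("""
        CREATE TABLE customers (
            id BIGINT,
            name STRING,
            city STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'dbserver1.inventory.customers',
            'properties.bootstrap.servers' = 'localhost:9092',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'debezium-json'
        )
    """)

    # A continuous query over the changelog; in recent PyFlink versions the
    # result can be printed directly for quick inspection.
    t_env.execute_sql(
        "SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city"
    ).print()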


TF-IDF using Spark .NET

Ed Elliott shows how you can use the Spark .NET library to perform machine learning in Apache Spark:

Native Spark has two APIs for creating your ML applications. The historical one is Spark.MLLib and the newer API is Spark.ML. Much like the old RDD API was superseded by the DataFrame API, Spark.ML supersedes Spark.MLLib.

At the end of last year, .NET for Apache Spark had no support for either the Spark.ML or Spark.MLLib APIs, so I started implementing Spark.ML. In a similar way that .NET for Apache Spark supports the DataFrame API and not the RDD API, I started with Spark.ML, and I believe that having the full Spark ML API will be enough for anyone.

It’s awesome that Ed is helping to move Spark .NET forward in this way.
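
For readers who know PySpark better than C#, the Spark.ML TF-IDF pipeline being ported looks roughly like this (the sample documents are invented):

    # Rough PySpark equivalent of the Spark.ML TF-IDF pipeline discussed above.
    # The sample documents are invented.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

    docs = spark.createDataFrame(
        [(0, "spark makes big data simple"),
         (1, "dotnet for apache spark brings spark to csharp")],
        ["id", "text"],
    )

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1 << 12)
    idf = IDF(inputCol="raw_features", outputCol="features")

    model = Pipeline(stages=[tokenizer, tf, idf]).fit(docs)
    model.transform(docs).select("id", "features").show(truncate=False)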


FlinkSQL in Cloudera Streaming Analytics

Marton Balassi announces support for FlinkSQL in Cloudera Streaming Analytics:

Our 1.2.0.0 release of Cloudera Streaming Analytics Powered by Apache Flink brings a wide range of new functionality, including support for lineage and metadata tracking via Apache Atlas, support for connecting to Apache Kudu and the first iteration of the much-awaited FlinkSQL API.

Flink’s SQL interface democratizes stream processing, as it caters to a much larger community than the currently widely used Java and Scala APIs focusing on the Data Engineering crowd. Generalizing SQL to stream processing and streaming analytics use cases poses a set of challenges: we have to tackle expressing infinite streams and timeliness of records. 

All is happening as Feasel’s Law foretold.
