Press "Enter" to skip to content

Category: Hadoop

Data Privacy in Confluent Platform

David Millman shows off the Privitar Kafka Connector:

The initial message structure, in the left column above, is a simple JSON document with five fields. The middle column contains the list of rules that must be applied, defining the policy. On the right is a sample output message generated as a result of the policy being applied to the initial message.

In the Privitar Policy Manager, a user maps the individual fields to the appropriate rule, as shown in the screenshot below. A rule is applied to each of the fields and the schema is read as a single table structure, named testfile. These rules can be applied for every instance of the schema.
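
To make the idea concrete, here is a rough sketch in Python of what a field-to-rule mapping like that does to a message. The field names and rules below are my own invention for illustration; they are not Privitar's API or the exact policy from the post.

import hashlib
import json

# Hypothetical field-to-rule mapping, standing in for a policy like the one in
# the screenshot. These field names and rules are illustrative only.
RULES = {
    "name":  lambda v: "REDACTED",                                   # drop the value entirely
    "email": lambda v: hashlib.sha256(v.encode()).hexdigest()[:12],  # consistent pseudonym
    "age":   lambda v: (v // 10) * 10,                               # generalize to a decade
}

def apply_policy(message: dict) -> dict:
    # Apply the rule mapped to each field; fields without a rule pass through unchanged.
    return {k: RULES.get(k, lambda v: v)(v) for k, v in message.items()}

raw = json.loads('{"id": 42, "name": "Ada", "email": "ada@example.com", "age": 37, "city": "London"}')
print(apply_policy(raw))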

Read on for more.


The Cloudera Operational Database Experience

Liliana Kadar, et al., cover scalability options for DBAs working with Cloudera:

Cloudera’s Operational Database (OpDB) supports a scale-up (SMP) environment. The caching layer is able to consume all memory in a large SMP environment. Memory has to be large enough to cover RegionServers, DataNodes and operating system, and to have enough extra space to allow the block cache to assist with reads. When HBase is running with other components, CPU contention and memory contention can be a problem that is easy to address with proper YARN tuning. 

As a result of the scale-up architecture, multiple services and engines can be run on a single node. For smaller nodes, multiple services and engines have to be spread out amongst a larger set of nodes. 
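
As a back-of-the-envelope illustration of that memory split, here is a quick calculation; the node size and reservations are my assumptions for illustration, not Cloudera sizing guidance.

# Back-of-the-envelope memory split for a co-located HBase + HDFS node.
# All figures below are assumptions, not Cloudera sizing guidance.
node_memory_gb = 256

os_reserved_gb       = 16    # operating system and other daemons
datanode_heap_gb     = 8     # HDFS DataNode
regionserver_heap_gb = 31    # keep under ~32 GB so compressed oops stay enabled
block_cache_gb = node_memory_gb - (os_reserved_gb + datanode_heap_gb + regionserver_heap_gb)

print(f"Off-heap block cache left to assist with reads: {block_cache_gb} GB")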

In addition, Krishna Maheshwari, et al., announce a technical preview of their Cloudera Operational Database experience:

The Cloudera Operational Database (COD) experience is a managed dbPaaS solution which abstracts the underlying cluster instance as a Database. It can auto-scale based on the workload utilization of the cluster and will be adding the ability to auto-tune (better performance within the existing infrastructure footprint) and auto-heal (resolve operational problems automatically) later this year. It offers multi-modal client access with NoSQL key-value using Apache HBase APIs and relational SQL with JDBC (via Apache Phoenix).  The latter makes COD accessible to developers who are used to building applications that use MySQL, Postgres, etc.
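
Since Phoenix is the relational front end, client access from Python typically goes through the Phoenix Query Server with the phoenixdb package. A minimal sketch follows, with a made-up endpoint and table; COD's actual connection details and authentication will differ.

# Connecting to the Phoenix SQL front end via the Phoenix Query Server with
# phoenixdb. Host, port, and table are made up for illustration.
import phoenixdb

conn = phoenixdb.connect("http://cod-phoenix-query-server.example.com:8765/", autocommit=True)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (?, ?)", (1, "Ada"))
cur.execute("SELECT id, name FROM users")
print(cur.fetchall())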

It’s interesting to see Cloudera move in this direction.


Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

– Higher-order Functions to easily manipulate arrays and structs
– Support for Apache Avro, a row-oriented data serialization framework
– Custom Serialization using R functions to read and write any data format
– Other Improvements such as compatibility with EMR 6.0 & Spark 3.0, and initial support for the Flint time series library
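
As a rough illustration of the first item, these are the underlying Spark SQL higher-order functions that sparklyr 1.3 now wraps, shown here through PySpark rather than R:

# Spark SQL higher-order functions on array columns (the feature sparklyr 1.3
# surfaces from R), demonstrated via PySpark for illustration.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("hof-demo").getOrCreate()

df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5, 6])], ["id", "values"])

df.select(
    "id",
    F.expr("transform(values, x -> x * 10)").alias("scaled"),           # apply a lambda to each element
    F.expr("filter(values, x -> x % 2 = 0)").alias("evens"),            # keep matching elements
    F.expr("aggregate(values, 0, (acc, x) -> acc + x)").alias("total"), # fold the array to a scalar
).show()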

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.


Survival Analysis in Spark

Rob Saker and Bryan Smith hit on a topic close to my heart:

These patterns seem to indicate that KKBox could actually differentiate between customers based on their lifetime potential using information known at the time of acquisition. This information might help inform or steer specific discounts or promotions to customers as they register for a trial. This information might also inform KKBox of which offerings or capabilities to discontinue as some, e.g. Initial Payment Method 35 or the 7-day payment plan as shown in Figure 3, align with exceptionally high churn rates in the first 30-days with little long-term survivorship.

Of course, there are relationships between these factors so that we should be careful in viewing them in isolation. By deriving a baseline risk (hazard) of customer churn (Figure 4), we can calculate the influence of different factors on the baseline in such a manner that each factor may be considered an independent hazard multiplier. When combined (through simple multiplication) against the baseline, we can plot a specific customer’s chances of abandoning a subscription by a given point in time (Table 1).
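
That baseline-hazard-times-multipliers formulation is essentially a Cox proportional hazards model. Here is a minimal sketch with the lifelines package; the tiny data frame and column names are fabricated for illustration and are not KKBox’s data or the authors’ notebooks.

# Cox proportional hazards: a baseline hazard scaled by per-factor multipliers.
# The data frame and column names below are made up for illustration.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "tenure_days":       [30, 45, 400, 220, 15, 365, 90, 600, 120, 500],
    "churned":           [1,  1,  0,   1,   1,  0,   1,  0,   0,   1],
    "payment_method_35": [1,  0,  0,   1,   1,  0,   1,  0,   1,   0],
    "plan_7_day":        [1,  1,  0,   0,   1,  0,   0,  0,   1,   0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_days", event_col="churned")
print(cph.summary[["coef", "exp(coef)"]])   # exp(coef) is the hazard multiplier for each factor

# Survival curve for one hypothetical customer (multipliers applied to the baseline)
customer = pd.DataFrame({"payment_method_35": [1], "plan_7_day": [1]})
print(cph.predict_survival_function(customer).head())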

Click through for the story as well as a set of notebooks.


Calculating Spark Application Resource Allocations

The Hadoop in Real World team walks us through resource allocation for Spark applications:

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of resource allocation properties in YARN and also resource related properties in Spark. Let’s look at both.
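
As a quick sketch of the kind of arithmetic involved (the node specs and rules of thumb below, roughly five cores per executor, a core and a gigabyte reserved for the OS and daemons, about ten percent for memory overhead, are common defaults rather than the post’s exact numbers):

# Rough rule-of-thumb sizing for Spark on YARN. Node specs and heuristics are
# illustrative defaults, not the article's exact figures.
nodes, cores_per_node, mem_per_node_gb = 10, 16, 64

usable_cores = cores_per_node - 1      # leave a core for the OS and Hadoop daemons
usable_mem_gb = mem_per_node_gb - 1    # leave memory for the OS and Hadoop daemons

executor_cores = 5                     # keep HDFS client throughput healthy
executors_per_node = usable_cores // executor_cores
num_executors = executors_per_node * nodes - 1        # minus one for the YARN ApplicationMaster

mem_per_executor_gb = usable_mem_gb // executors_per_node
executor_memory_gb = int(mem_per_executor_gb * 0.9)   # ~10% goes to spark.executor.memoryOverhead

print(f"--num-executors {num_executors} "
      f"--executor-cores {executor_cores} "
      f"--executor-memory {executor_memory_gb}G")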

This post covers the properties you want to keep an eye on when running Spark applications.


Downsides to Optimization in Spark SQL

Anuj Saxena takes us through some of the pros and cons of using the Catalyst Optimizer in Spark, including a couple of issues:

I am sure the optimizations make the calculation time very short and these optimizations are implemented in such a way that you just have to provide the logic and everything else will be done in abstraction. But as my friend and colleague Ramandeep says “Abstract features come with abstract issues”. So following are the few issues which I have faced in my recent interaction with Spark SQL:

1. Too large of a query to be stored in memory
2. Implicit optimizations interfere with partitioning

Click through for examples of this.
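
For the first issue, one common mitigation (not necessarily the author’s) is to truncate the lineage with a checkpoint so Catalyst only ever optimizes a short plan. A quick PySpark sketch:

# Iterative transformations grow the logical plan; checkpointing materializes
# the data and cuts the lineage so the optimizer works on a short plan.
# Illustrative only, not the author's fix.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("plan-growth").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(100_000).withColumnRenamed("id", "value")

for i in range(50):                       # each iteration adds to the logical plan
    df = df.withColumn("value", F.col("value") + 1)
    if i % 10 == 9:
        df = df.checkpoint()              # materialize and cut the lineage/plan here

df.explain()                              # plan stays small thanks to the checkpoints
print(df.agg(F.sum("value")).first())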


Performance Tuning for Cloudera’s Operational Database

Liliana Kadar, et al., show us the tools we can use to tune Cloudera’s Operational Database:

A query optimizer determines the most efficient way to run a query. Query optimization helps you to reduce the hardware resources required to run a query and also speeds up your query-response time. Cloudera’s Operational Database provides you with various tools such as plan analyzers to make optimal use of your computing resources.

Cloudera’s OpDB provides various cost-based and rules-based optimizers. You can use different optimizers based on your use cases. OpDB is primarily used for Online Transactional Processing (OLTP) use cases with Apache Phoenix in the OpDB used as a SQL engine. But you can also use Hive and Impala for Online Analytical Processing (OLAP) use cases. 
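
With Phoenix as the OLTP engine, a typical tuning loop is to add a secondary index and confirm with EXPLAIN that the optimizer actually uses it. A small sketch via phoenixdb follows; the connection details, table, and index names are made up.

# Add a Phoenix secondary index and check the plan with EXPLAIN.
# Endpoint, table, and index names are made up for illustration.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server.example.com:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id) INCLUDE (total)")
cur.execute("EXPLAIN SELECT customer_id, total FROM orders WHERE customer_id = 1001")
for row in cur.fetchall():     # the plan should show a range scan over the index, not a full scan
    print(row[0])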

Read on for recommendations on platform choice as well as indexing and tuning options.


Schema References and Multiple Event Types in a Kafka Topic

Robert Yokota updates some prior knowledge:

In the article Should You Put Several Event Types in the Same Kafka Topic?, Martin Kleppmann discusses when to combine several event types in the same topic and introduces new subject name strategies for determining how Confluent Schema Registry should be used when producing events to an Apache Kafka® topic.

Schema Registry now supports schema references in Confluent Platform 5.5, and this blog post presents an alternative means of putting several event types in the same topic using schema references, discussing the advantages and disadvantages of this approach.
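
For a rough idea of what that looks like from a client, here is a sketch using confluent-kafka-python’s Schema Registry client to register two event schemas and then a topic-level union that points at them via references. The subject names, namespaces, and registry URL are made up; the post itself works through the full Java and Avro details.

# Register two event schemas, then a union schema for the topic whose branches
# are resolved through schema references. Names and URL are made up.
from confluent_kafka.schema_registry import Schema, SchemaReference, SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

customer = Schema('{"type":"record","name":"Customer","namespace":"example","fields":[{"name":"name","type":"string"}]}', "AVRO")
order = Schema('{"type":"record","name":"Order","namespace":"example","fields":[{"name":"amount","type":"double"}]}', "AVRO")
client.register_schema("customer", customer)
client.register_schema("order", order)

# The topic's value schema is a union; each branch resolves via a reference.
all_types = Schema(
    '["example.Customer", "example.Order"]',
    "AVRO",
    [
        SchemaReference("example.Customer", "customer", 1),
        SchemaReference("example.Order", "order", 1),
    ],
)
client.register_schema("events-value", all_types)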

Click through to see how this works out.


Query Scheduling with Apache Hive

Zoltan Haindrich and Jesus Camacho Rodriguez walk us through scheduled queries in Apache Hive:

To fulfill that purpose, recently Apache Hive introduced a new feature called scheduled queries. Using SQL statements, users can schedule Hive queries to run on a recurring basis, monitor their progress, and optionally disable a query schedule.

In a nutshell, every scheduled query in Hive consists of (i) a unique name to identify the schedule, (ii)  the actual SQL statement to be executed, and (iii) the schedule at which the query should be executed defined by a Quartz cron expression. In addition, a scheduled query belongs to a namespace, i.e., a collection of HiveServer2 instances that are responsible to execute the query.
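
Here is a minimal sketch of what that looks like from a client, issuing the DDL over HiveServer2 with PyHive; the host, table names, and cron schedule are made up, and the post itself has the authoritative syntax.

# Create, inspect, and disable a Hive scheduled query over HiveServer2.
# Host, schedule, and table names are made up for illustration.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Run every night at 01:00 (Quartz cron: sec min hour day-of-month month day-of-week)
cur.execute("""
    CREATE SCHEDULED QUERY nightly_rollup
    CRON '0 0 1 * * ? *'
    AS INSERT INTO sales_daily SELECT to_date(ts), sum(amount) FROM sales GROUP BY to_date(ts)
""")

# Monitor progress, then optionally disable the schedule
cur.execute("SELECT * FROM information_schema.scheduled_queries")
print(cur.fetchall())
cur.execute("ALTER SCHEDULED QUERY nightly_rollup DISABLE")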

Read on for examples of how you might create, use, and learn about scheduled queries running on a system.


Apache Flink 1.11.0 Released

Marta Paes announces Apache Flink version 1.11:

Change Data Capture (CDC) has become a popular pattern to capture committed changes from a database and propagate those changes to downstream consumers, for example to keep multiple datastores in sync and avoid common pitfalls such as dual writes. Being able to easily ingest and interpret these changelogs into the Table API/SQL has been a highly demanded feature in the Flink community — and it’s now possible with Flink 1.11.
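
As a taste of the feature, here is a hedged PyFlink sketch that registers a Kafka topic carrying Debezium changelogs as a table using the new debezium-json format; the topic, fields, and addresses are invented for illustration.

# Register a Kafka topic of Debezium change events as a dynamic table with the
# debezium-json format added in Flink 1.11. Topic, schema, and addresses are
# made up; a real job would add a sink table and an INSERT INTO statement.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

t_env.execute_sql("""
    CREATE TABLE products_cdc (
        id INT,
        name STRING,
        price DECIMAL(10, 2)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver1.inventory.products',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-cdc-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

# The changelog is now queryable like any other dynamic table.
changelog = t_env.sql_query("SELECT id, name, price FROM products_cdc")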

Click through for the full list of updates.
