Press "Enter" to skip to content

Category: Hadoop

Replicating ACID Tables in Hive

Ashutosh Bapat shows off some of the improvements in Apache Hive replication:

Transactional tables in Hive support ACID properties. Unlike non-transactional tables, data read from transactional tables is transactionally consistent, irrespective of the state of the database. Of course, this imposes specific demands on replication of such tables, hence why Hive replication was designed with the following assumptions:

1. A replicated database may contain more than one transactional table with cross-table integrity constraints.
2. A target may host multiple databases, some replicated and some native to the target. The databases replicated from the same source may have transactional tables with cross-database integrity constraints.
3. A user should be able to run read-only workloads on the target and should be able to read transactionally consistent data.
4. Since in Hive a read-only transaction requires a new transaction-id, the transaction-ids on the source and the target of replication may differ. Thus transaction-ids can not be used for reading transactionally consistent data across source and replicated target.

Read on to learn why these assumptions are in place and what they mean for replication.

Comments closed

ClassNotFoundException and .NET Spark

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.

To fix the issue you need to use the new package name which adds an extra dotnet near the end, change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what you should change this line of code to read. If that change doesn’t fix your problem, Ed has a broader solution.

Comments closed

Confluent Platform 5.3

Gaetan Castelein announces Confluent Platform 5.3:

Introducing Confluent Operator for Kubernetes
Kubernetes has become the open source standard for orchestrating containerized applications, but running stateful applications such as Kafka can be difficult and requires a specialized skill set. Thus, we decided to automate the process for you.

For the past few months, we have been working closely with a set of customers and partners as part of a preview program to gather their early feedback. We are now ready to release Confluent Operator, our enterprise-ready implementation of the Kubernetes Operator API to automate deployment and key lifecycle operations of Confluent Platform on Kubernetes.

Looks like they’ve been busy lately.

Comments closed

Stream Processing with Kafka

Satish Sharma has a four-part series covering stream processing with Apache Kafka. Part 1 gives us an overview of Kafka:

Apache Kafka is an open-source distributed stream processing platform originally developed by LinkedIn and later donated to Apache in 2011.

We can describe Kafka as a collection of files, filled with messages that are distributed across multiple machines. Most of Kafka analogies revolve around tying these various individual logs together, routing messages from producers to consumers reliably, replicating for fault tolerance, and handling failure gracefully. Its architecture inherits more from storage systems like HDFS, HBase, or Cassandra than it does from traditional messaging systems that implement JMS or AMQP. The underlying abstraction is a partitioned log, essentially a set of append-only files spread over several machines. This encourages sequential access patterns. A Kafka cluster is a distributed system that spreads data over many machines both for fault tolerance and for linear scale-out.

Part 2 covers terminology and concepts:

Kafka Streams API
Kafka Streams API is a Java library that allows you to build real-time applications. These applications can be packaged, deployed, and monitored like any other Java application — there is no need to install separate processing clusters or similar special-purpose and expensive infrastructures!

The Streams API is scalable, lightweight, and fault-tolerant; it is stateless and allows for stateful processing. 

Part 3 has you install and configure Kafka:

For quick testing, let’s start a handy console consumer, which reads messages from a specified topic and displays them back on the console. We will use the same to consumer to read all of our messages from this point forward. Use the following command: 

Linux -> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Windows -> bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Part 4 is forthcoming.

Comments closed

Deploying a Big Data Cluster

Mohammad Darab takes us through the Big Data Cluster deployment process using Azure Data Studio:

I’ve been “playing around” with Big Data Clusters for some time now and CTP 3.2 is way ahead when it comes to streamlining the BDC deployment process. You can check out my 4-part series on deploying BDC on AKS to see how cumbersome the process used to be. New in CTP 3.2, you can deploy a BDC on AKS (an existing cluster OR a new cluster) using an Azure Data Studio notebook. Let’s see how.

Click through for instructions. It was rather smart of Microsoft to release the instructions as a notebook.

Comments closed

Spark Access Control in Qubole

Achuth Rajagopal and Shridhar Ramachandran show off the Spark Data Access Control Framework on Qubole’s platform:

With these requirements in mind, we decided to implement Hive Authorization as our first Policy Manager. Hive Authorization policies are stored in the Qubole Metastore which acts as a shared central component and stores metadata related to Hive Resources like Hive Tables. We enhanced Spark to honor the policies stored in the Qubole Metastore while accessing Hive Tables or for adding and modifying those policies.

In summary, we implemented a SQL standard access control layer identical to what is present in Apache Hive or Presto today. The following sections detail the architecture and provide an example that illustrates how it works.

Click through to learn more.

Comments closed

IDEs and Cloudera Data Science Workbench

Bethann Noble walks us through some of the options available for IDEs operating against Cloudera Data Science Workbench:

Other coders on the team including ML and DevOps engineers often work in local IDEs such as PyCharm.  These applications run locally on the user’s computer and connect to CDSW remotely over SSH for code completion and execution.  They must be configured per user and are not associated at the project level in CDSW. The documentation provides sample instructions for the Professional Edition of PyCharm v2019.1.

They support both browser-based and local IDEs.

Comments closed

MLflow 1.1 Released

Max Allen, et al, announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.

Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
Pandas DataFrame based search API
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination

Looks like they’re putting in a lot of work on this.

Comments closed

Monitoring Backpressure in Apache Flink

Nico Kruber and Piotr Nowosjki explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:

– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.

– The network channel is slow.
Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not a topic unique to Flink, but affects any ETL or streaming operation.

Comments closed