Press "Enter" to skip to content

Category: Hadoop

Dimensional Load with Databricks

Leo Furlong shows how we can load an Azure SQL Data Warehouse dimension with Databricks:

Ingesting data into the Data Lake occurs in steps 1 and 2 in our architecture. Azure Data Factory (ADF) provides an excellent mechanism for loading data from source applications into a Data Lake stored in Azure Data Lake Store Gen2. In fact, Microsoft offers a template in the ADF Template gallery which provides a metadata-driven approach for doing so. The template comes with a control table example in a SQL Server Database, a data source dataset, and a data destination dataset. More on this template can be found here in the official documentation.

I appreciate that this is a full walkthrough of the process, not just one step.
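
To give a rough sense of the Databricks end of things once the data has landed in the lake, here’s a minimal sketch in Scala that reads staged Parquet files from ADLS Gen2 and pushes them into a staging table in Azure SQL Data Warehouse using the Databricks SQL DW connector. Every name, path, and connection detail below is a placeholder of mine, not something from Leo’s post:

// Hypothetical sketch: read staged files from ADLS Gen2 and land them in a SQL DW staging table.
// Assumes storage access and SQL DW credentials are already configured for the cluster/session.
val staged = spark.read
  .format("parquet")
  .load("abfss://raw@mydatalake.dfs.core.windows.net/sales/customers/")

staged.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")
  .option("tempDir", "wasbs://staging@mystorageaccount.blob.core.windows.net/tmp")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "stg.DimCustomer")
  .mode("overwrite")
  .save()

From there, the merge into the actual dimension table would typically run inside SQL Data Warehouse itself, via a stored procedure or a downstream ADF activity.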


Stream Processing with Kafka

Satish Sharma wraps up a series on Kafka and Kafka Streams:

Consider a hypothetical fleet management company that needs a dashboard for insight into its day-to-day vehicle activity. Each vehicle in the fleet is fitted with a GPS-based geolocation emitter, which emits location data containing the following information:

1. Vehicle Id: a unique ID given to each vehicle on registration with the company.
2. Latitude and Longitude: the vehicle's geolocation.
3. Availability: whether the vehicle is available to take a booking.
4. Current Status (Online/Offline): whether the vehicle is on duty.
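
To make that event shape concrete, here’s a minimal Scala sketch of producing one of these geolocation events to Kafka. The topic name, field names, and delimited encoding are my own assumptions, not from Satish’s article:

// Hypothetical shape of the geolocation event and a bare-bones producer for it.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

case class VehicleLocation(vehicleId: String, latitude: Double, longitude: Double,
                           available: Boolean, online: Boolean)

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

val event = VehicleLocation("V-1001", 28.6139, 77.2090, available = true, online = true)
// A real application would use JSON or Avro; a delimited string keeps the sketch short.
val value = s"${event.vehicleId},${event.latitude},${event.longitude},${event.available},${event.online}"

// Keying by vehicle id keeps each vehicle's readings in a single partition, preserving their order.
producer.send(new ProducerRecord[String, String]("vehicle-locations", event.vehicleId, value))
producer.close()

Keying by vehicle id is the design choice that keeps per-vehicle readings ordered, which matters for a dashboard showing each vehicle’s latest position and status.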

Read through the article and then check out Satish’s GitHub repo for more.


Replicating ACID Tables in Hive

Ashutosh Bapat shows off some of the improvements in Apache Hive replication:

Transactional tables in Hive support ACID properties. Unlike non-transactional tables, data read from transactional tables is transactionally consistent, irrespective of the state of the database. Of course, this imposes specific demands on the replication of such tables, which is why Hive replication was designed with the following assumptions:

1. A replicated database may contain more than one transactional table with cross-table integrity constraints.
2. A target may host multiple databases, some replicated and some native to the target. The databases replicated from the same source may have transactional tables with cross-database integrity constraints.
3. A user should be able to run read-only workloads on the target and should be able to read transactionally consistent data.
4. Since in Hive a read-only transaction requires a new transaction-id, the transaction-ids on the source and the target of replication may differ. Thus, transaction-ids cannot be used for reading transactionally consistent data across the source and the replicated target.

Read on to learn why these assumptions are in place and what they mean for replication.


ClassNotFoundException and .NET Spark

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.

To fix the issue, you need to use the new package name, which adds an extra dotnet near the end. Change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what you should change this line of code to read. If that change doesn’t fix your problem, Ed has a broader solution.


Confluent Platform 5.3

Gaetan Castelein announces Confluent Platform 5.3:

Introducing Confluent Operator for Kubernetes
Kubernetes has become the open source standard for orchestrating containerized applications, but running stateful applications such as Kafka can be difficult and requires a specialized skill set. Thus, we decided to automate the process for you.

For the past few months, we have been working closely with a set of customers and partners as part of a preview program to gather their early feedback. We are now ready to release Confluent Operator, our enterprise-ready implementation of the Kubernetes Operator API to automate deployment and key lifecycle operations of Confluent Platform on Kubernetes.

Looks like they’ve been busy lately.


Stream Processing with Kafka

Satish Sharma has a four-part series covering stream processing with Apache Kafka. Part 1 gives us an overview of Kafka:

Apache Kafka is an open-source distributed stream processing platform originally developed by LinkedIn and later donated to Apache in 2011.

We can describe Kafka as a collection of files, filled with messages that are distributed across multiple machines. Most Kafka analogies revolve around tying these various individual logs together, routing messages from producers to consumers reliably, replicating for fault tolerance, and handling failure gracefully. Its architecture inherits more from storage systems like HDFS, HBase, or Cassandra than it does from traditional messaging systems that implement JMS or AMQP. The underlying abstraction is a partitioned log, essentially a set of append-only files spread over several machines. This encourages sequential access patterns. A Kafka cluster is a distributed system that spreads data over many machines both for fault tolerance and for linear scale-out.
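
That partitioned-log abstraction shows up directly in the consumer API: every record reports the partition (which append-only file) it came from and the offset (its position within that file) it was read at. Here’s a quick Scala sketch using the tutorial-topic from Part 3; the group id and other settings are placeholders I’ve picked, not anything from the series:

import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "log-inspector")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("tutorial-topic"))

// Each record carries the partition it was appended to and its offset within that partition.
while (true) {
  consumer.poll(Duration.ofMillis(500)).asScala.foreach { r =>
    println(s"partition=${r.partition()} offset=${r.offset()} key=${r.key()} value=${r.value()}")
  }
}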

Part 2 covers terminology and concepts:

Kafka Streams API
Kafka Streams API is a Java library that allows you to build real-time applications. These applications can be packaged, deployed, and monitored like any other Java application — there is no need to install separate processing clusters or similar special-purpose and expensive infrastructure!

The Streams API is scalable, lightweight, and fault-tolerant; it supports both stateless and stateful processing.
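
For a flavor of the API, here’s a minimal sketch of a Streams topology using the official Scala wrapper for the Java library. The topic names and transformations are made up: it upper-cases each value (a stateless step) and counts records per key (a stateful step backed by a local state store), all running inside an ordinary application process:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tutorial-streams-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val builder = new StreamsBuilder()
val messages = builder.stream[String, String]("tutorial-topic")

// Stateless step: normalize each message value.
val upper = messages.mapValues(_.toUpperCase)
upper.to("tutorial-topic-upper")

// Stateful step: count messages per key; the state lives in a local store
// backed by a changelog topic rather than in an external processing cluster.
upper.groupByKey.count().toStream.to("tutorial-topic-counts")

val streams = new KafkaStreams(builder.build(), props)
streams.start()
sys.addShutdownHook(streams.close())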

Part 3 has you install and configure Kafka:

For quick testing, let's start a handy console consumer, which reads messages from a specified topic and displays them back on the console. We will use this same consumer to read all of our messages from this point forward. Use the following command:

Linux -> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Windows -> bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Part 4 is forthcoming.


Learning Confluent Kafka

Stephane Maarek has some pointers on where to go to learn Confluent Kafka:

I teach online about Apache Kafka, and a very frequent and recurring question I get is:

How can I learn Confluent Kafka?

Let’s get right to it!

I’ve gone through a couple of Stephane’s Kafka courses and they’re excellent. There’s still a lot more for me to go through, but if you’re interested in learning Kafka, this is a great method.


Deploying a Big Data Cluster

Mohammad Darab takes us through the Big Data Cluster deployment process using Azure Data Studio:

I’ve been “playing around” with Big Data Clusters for some time now and CTP 3.2 is way ahead when it comes to streamlining the BDC deployment process. You can check out my 4-part series on deploying BDC on AKS to see how cumbersome the process used to be. New in CTP 3.2, you can deploy a BDC on AKS (an existing cluster OR a new cluster) using an Azure Data Studio notebook. Let’s see how.

Click through for instructions. It was rather smart of Microsoft to release the instructions as a notebook.


Spark Access Control in Qubole

Achuth Rajagopal and Shridhar Ramachandran show off the Spark Data Access Control Framework on Qubole’s platform:

With these requirements in mind, we decided to implement Hive Authorization as our first Policy Manager. Hive Authorization policies are stored in the Qubole Metastore, which acts as a shared central component and stores metadata related to Hive Resources like Hive Tables. We enhanced Spark to honor the policies stored in the Qubole Metastore when accessing Hive Tables, and to support adding and modifying those policies.

In summary, we implemented a SQL standard access control layer identical to what is present in Apache Hive or Presto today. The following sections detail the architecture and provide an example that illustrates how it works.

Click through to learn more.


IDEs and Cloudera Data Science Workbench

Bethann Noble walks us through some of the options available for IDEs operating against Cloudera Data Science Workbench:

Other coders on the team, including ML and DevOps engineers, often work in local IDEs such as PyCharm. These applications run locally on the user’s computer and connect to CDSW remotely over SSH for code completion and execution. They must be configured per user and are not associated at the project level in CDSW. The documentation provides sample instructions for the Professional Edition of PyCharm v2019.1.

They support both browser-based and local IDEs.
