Press "Enter" to skip to content

Category: Hadoop

Survival Analysis Notebooks

Dan Morris, et al, walk us through a survival analysis scenario:

In contrast to other methods that may seem similar on the surface, such as linear regression, survival analysis takes censoring into account. Censoring occurs when the start and/or end of a measured value is unknown. For example, suppose our historical data includes records for the two customers below. In the case of customer A, we know the precise duration of the subscription because the customer churned in December 2020. For customer B, we know that the contract started four months ago and is still active, but we do not know how much longer they will be a customer. This is an example of right censoring because we do not yet know the end date for the measured value. Right censoring is what we most commonly see with this form of analysis.

Click through for an intro as well as a half-dozen notebooks.

Leave a Comment

Updates to Message Keys in ksqlDB

Victoria Xia announces an improvement to ksqlDB:

One of the most highly requested enhancements to ksqlDB is here! Apache Kafka® messages may contain data in message keys as well as message values. Until now, ksqlDB could only read limited kinds of data from the key position. ksqlDB’s latest release—ksqlDB 0.15—adds support for many more types of data in messages keys, including message keys with multiple columns. Users of Confluent Cloud ksqlDB already have access to these new features as Confluent Cloud always runs the latest release of ksqlDB.

Read on for more information on this, as well as some of the ramifications of this change.

Leave a Comment

Kafka in .NET

Diogo Souza walks us through building an application which produces and consumes messages using Apache Kafka:

Kafka is just the broker, the stage in which all the action takes place. The producers send messages to the world while the consumers read specific chunks of data. How do you differentiate one specific portion of data from the others? How do consumers know what data to consume? To understand this, you need a new actor in the play: the topics.

Kafka topics are the channels, the carriage that transport messages around. Kafka records produced by producers are organized and stored into topics.

This is a nice overview of Kafka followed by the basics of building a consumer and a producer in C#. I just wish that there was more community usage of Kafka so that the Confluent .NET driver would include some of the really cool stuff they’ve added to Kafka over the past couple of years.

Leave a Comment

RBAC in Hadoop with Kudu and Ranger

Attila Bukor takes us through the process of setting up role-based access controls on Impala tables:

After setting up the integration it’s time to create some policies, as now only trusted users are allowed to perform any action; everyone else is locked out. Resource-based access control (RBAC) policies can be set up for Kudu in Ranger, but Kudu currently doesn’t support tag-based policies, row-level filtering or column masking.

Click through for the process, as well as current limitations.

Leave a Comment

Join Algorithm Selection in Spark

The Hadoop in Real World team takes us through the selection criteria for join types:

There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime.

Spark has the following 5 algorithms to choose from –

1. Broadcast Hash Join
2. Shuffle Hash Join
3. Shuffle Sort Merge Join
4. Broadcast Nested Loop Join
5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Read on to learn which join types are supported in which circumstances, as well as rules of precedence.

Leave a Comment

Synchronizing Metadata between Spark Tables and Serverless Pool

Charl Roux takes us through one back-end integration mechanism between tables in Azure Synapse Analytics Spark pools and serverless SQL pool:

Synapse provides an exciting feature which allows you to sync Spark database objects to Serverless pools and to query these objects without the Spark pool being active or running.  Synapse workspaces are accessed exclusively through an Azure AD Account and objects are created within this context in the Spark pool. In some scenarios I would like to share the data which I’ve created in my Spark database with other users for reporting or analysis purposes. This is possible with Serverless and in this article I will show you how to complete the required steps from creation of the object to successful execution. 

Click through for the demonstration.

Leave a Comment

Clustering with Apache NiFi

Brent Segner and Ryan O’Donnell explain how clustering works with Apache NiFi:

Although it is entirely possible to deploy NiFi in a single node configuration, this does not represent a best practice for an enterprise graded deployment and would introduce unnecessary risk into a production environment where scaling to meet demand and resiliency are paramount.  In order to get around this concern, as of release 1.0.0, NiFi provides the ability to cluster nodes together using either an embedded or external Apache Zookeeper instance as a highly reliable distributed coordinator. While a simple Google search shows there is plenty of debate around whether it is better to use an embedded or external Zookeeper service as both sides have merit, for the sake of argument and this blog, we will use the embedded flavor in the deployment. 

Click through for more information, including a walkthrough on configuration.

Leave a Comment

Secure Cluster Connectivity in Azure Databricks

Abhinav Garg and Premal Shah have an announcement:

We’re excited to announce the general availability of Secure Cluster Connectivity (also commonly known as No Public IP) on Azure Databricks. This release applies to Microsoft Azure Public Cloud and Azure Government regions, in both Standard and Premium pricing tiers. Hundreds of our global customers including large financial services, healthcare and retail organizations have already adopted the capability to enable secure and reliable deployments of the Azure Databricks unified data platform. It allows them to securely process company and customer data in private Azure Virtual Networks, thus satisfying a major requirement of their enterprise governance policies.

Read on fore more detail about how this works.

Leave a Comment