Press "Enter" to skip to content

Category: Hadoop

Testing Kafka with Kerberos and SSH

Daniel Osvath has a guide for us:

Kerberos authentication is widely used in today’s client/server applications; however getting started with Kerberos may be a daunting task if you don’t have prior experience. Information on setting up Kerberos with an SSH server and client on the web is fragmented and hasn’t been presented in a comprehensive end-to-end way on a simple local setup.

At Confluent, several of our connectors for Apache Kafka® support Kerberos-based authentication. For development and testing of these connectors, we often leverage containers due to their fast, iterative benefits. This tutorial aims to provide a simple setup for a Kerberos test environment with SSH for a passwordless authentication that uses Kerberos tickets. You may use this as a guide for testing the Kerberos functionality of SSH-based client-server applications in a local environment or as a hands-on tutorial if you’re new to Kerberos. To understand the basics of Kerberos before diving into this tutorial, you may find this video helpful. Additionally, if you are looking for a non-SSH-based setup, the setup below for the KDC server container may also be useful.

Click through for two approaches to the problem.

Comments closed

Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e. using EXPLAIN FORMATTED)

There have been quite a few advancements around the SQL side.

Comments closed

Installing Spark on Windows Subsystem for Linux

David Alcock wants Spark, but not Windows Spark:

The post won’t cover any instructions for installing Ubuntu and instead I’ll assume you’ve installed already and downloaded the tgz file from the Apache Spark download page (Step 3 in the above link).

Let’s go straight into the terminal window and get going! I’ve put the commands in bold text (don’t include the $) just so anyone can see a bit easier and who also prefers to ignore my jibberish! 

Click through for the instructions.

Comments closed

Applied ML Prototypes

Alex Bleakley and Santiago Giraldo announce Applied ML Prototypes:

To directly address these challenges, we’ve released Applied ML Prototypes (AMPs) — a revolutionary new way of developing and shipping enterprise ML use cases — which provide complete ML projects that can be deployed with one click directly from Cloudera Machine Learning. AMPs enable data scientists to go from an idea to a fully working ML use case in a fraction of the time, with an end-to-end framework for building, deploying, and monitoring business-ready ML applications instantly. 

AMPs move the starting line for any ML project by enabling data scientists to start with a full end-to-end project developed for a similar use case, including a trained and deployed ML model, as well as prebuilt predictive business applications, out of the box. This means that ML development teams can tackle their own ML business use cases more quickly, from those involving churn modeling, to sentiment analysis, to anomaly detection and beyond.

Getting past the marketing fluff, there are some interesting ideas here.

Comments closed

Power BI Connector for Databricks

Stefania Leone, et al, announce general availability of the Power BI connector for Databricks:

We are excited to announce General Availability (GA) of the Microsoft Power BI connector for Databricks for Power BI Service and Power BI Desktop 2.85.681.0. Following the public preview, we have already seen strong customer adoption, so we are pleased to extend these capabilities to our entire customer base. The native Power BI connector for Databricks in combination with the recently launched SQL Analytics service provides Databricks customers with a first-class experience for performing BI workloads directly on their Delta Lake. SQL Analytics allows customers to operate a multi-cloud lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance than traditional cloud data warehouses.

This is easier to work with than the Apache Spark connector and it looks like it should be faster than that connector as well.

Comments closed

Survival Analysis Notebooks

Dan Morris, et al, walk us through a survival analysis scenario:

In contrast to other methods that may seem similar on the surface, such as linear regression, survival analysis takes censoring into account. Censoring occurs when the start and/or end of a measured value is unknown. For example, suppose our historical data includes records for the two customers below. In the case of customer A, we know the precise duration of the subscription because the customer churned in December 2020. For customer B, we know that the contract started four months ago and is still active, but we do not know how much longer they will be a customer. This is an example of right censoring because we do not yet know the end date for the measured value. Right censoring is what we most commonly see with this form of analysis.

Click through for an intro as well as a half-dozen notebooks.

Comments closed

Updates to Message Keys in ksqlDB

Victoria Xia announces an improvement to ksqlDB:

One of the most highly requested enhancements to ksqlDB is here! Apache Kafka® messages may contain data in message keys as well as message values. Until now, ksqlDB could only read limited kinds of data from the key position. ksqlDB’s latest release—ksqlDB 0.15—adds support for many more types of data in messages keys, including message keys with multiple columns. Users of Confluent Cloud ksqlDB already have access to these new features as Confluent Cloud always runs the latest release of ksqlDB.

Read on for more information on this, as well as some of the ramifications of this change.

Comments closed

Kafka in .NET

Diogo Souza walks us through building an application which produces and consumes messages using Apache Kafka:

Kafka is just the broker, the stage in which all the action takes place. The producers send messages to the world while the consumers read specific chunks of data. How do you differentiate one specific portion of data from the others? How do consumers know what data to consume? To understand this, you need a new actor in the play: the topics.

Kafka topics are the channels, the carriage that transport messages around. Kafka records produced by producers are organized and stored into topics.

This is a nice overview of Kafka followed by the basics of building a consumer and a producer in C#. I just wish that there was more community usage of Kafka so that the Confluent .NET driver would include some of the really cool stuff they’ve added to Kafka over the past couple of years.

Comments closed