
Category: Hadoop

Kafka Replication with MirrorMaker

Paul Brebner starts a new series:

In this new two-part blog series we’ll turn our gaze to MirrorMaker 2 (MM2), the newest version of Apache Kafka’s cross-cluster mirroring, or replication, technology. MirrorMaker 2 is built on top of the Kafka Connect framework for increased reliability and scalability, and is suitable for more demanding geo-replication use cases including migration, backup, disaster recovery, and failover. In part one we’ll focus on MirrorMaker 2 theory (Kafka replication, architecture, components, and terminology) and invent some MirrorMaker 2 rules. Part two will be more practical: we’ll try out Instaclustr’s managed MirrorMaker 2 service and test the rules out with some experiments.

Go check out part 1.
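
The moving parts are easier to picture with a concrete configuration in hand. Below is a minimal sketch of an MM2 properties file for a hypothetical active/passive pair; the cluster aliases and broker addresses are illustrative, not from the post:

```properties
# mm2.properties — hypothetical two-cluster, one-way replication setup
clusters = primary, backup

primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# Enable the primary -> backup replication flow and mirror every topic
primary->backup.enabled = true
primary->backup.topics = .*
```

Fed to Kafka’s bundled connect-mirror-maker.sh driver, a file like this replicates primary’s topics into backup as remote topics (primary.topicname by default), which is where much of the terminology in part one comes from.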


Batch Execution Mode in Flink’s DataStream API

Dawid Wysakowicz takes us through batch execution mode in a streaming solution:

Flink has been following the mantra that Batch is a Special Case of Streaming since the very early days. As the project evolved to address specific use cases, different core APIs ended up being implemented for batch (DataSet API) and streaming execution (DataStream API), but the higher-level Table API/SQL was subsequently designed following this mantra of unification. With Flink 1.12, the community worked on bringing a similarly unified behaviour to the DataStream API, and took the first steps towards enabling efficient batch execution in the DataStream API.

Read on to see the progress they’ve achieved so far.
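
As a quick illustration of what the unification means in practice, here is a minimal PyFlink sketch; it assumes a recent PyFlink release that exposes the runtime-mode setter (the Java API is analogous):

```python
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
# On bounded input, BATCH mode trades checkpoint-based streaming execution
# for batch scheduling and blocking shuffles
env.set_runtime_mode(RuntimeExecutionMode.BATCH)

# A tiny bounded pipeline: summing counts per key; in BATCH mode each key
# emits only its final total rather than a stream of incremental updates
env.from_collection([("a", 1), ("b", 2), ("a", 3)]) \
    .key_by(lambda t: t[0]) \
    .reduce(lambda x, y: (x[0], x[1] + y[1])) \
    .print()

env.execute("batch-over-datastream")
```

The runtime mode can also be chosen at submission time rather than hard-coded, which keeps the same program free to run in either mode.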


Testing Kafka with Kerberos and SSH

Daniel Osvath has a guide for us:

Kerberos authentication is widely used in today’s client/server applications; however, getting started with Kerberos may be a daunting task if you don’t have prior experience. Information on the web about setting up Kerberos with an SSH server and client is fragmented and hasn’t been presented in a comprehensive, end-to-end way on a simple local setup.

At Confluent, several of our connectors for Apache Kafka® support Kerberos-based authentication. For development and testing of these connectors, we often leverage containers due to their fast, iterative benefits. This tutorial aims to provide a simple setup for a Kerberos test environment with SSH for passwordless authentication using Kerberos tickets. You may use this as a guide for testing the Kerberos functionality of SSH-based client-server applications in a local environment or as a hands-on tutorial if you’re new to Kerberos. To understand the basics of Kerberos before diving into this tutorial, you may find this video helpful. Additionally, if you are looking for a non-SSH-based setup, the setup below for the KDC server container may also be useful.

Click through for two approaches to the problem.
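
For a flavor of the client side, the flow once a KDC is up is roughly: obtain a ticket with kinit, then let OpenSSH authenticate over GSSAPI. Here is a sketch of illustrative client settings (the host alias is hypothetical):

```
# ~/.ssh/config — illustrative settings for Kerberos (GSSAPI) authentication
Host kerberized-host
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
```

The server needs GSSAPIAuthentication enabled in its sshd_config and a host principal in its keytab; wiring up that plumbing inside containers is exactly the sort of thing the tutorial walks through.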


Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884)) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030), and EXPLAIN (SPARK-32337). Starting from this release, in the Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e., using EXPLAIN FORMATTED).

There have been quite a few advancements around the SQL side.
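
To make a couple of those additions concrete, here is a small PySpark sketch exercising the new built-ins and the structured plan output (the values are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-3.1-demo").getOrCreate()

# width_bucket (SPARK-21117): place 5.3 into one of 5 equal-width
# buckets spanning [0.2, 10.6)
spark.sql("SELECT width_bucket(5.3, 0.2, 10.6, 5) AS bucket").show()

# regexp_extract_all (SPARK-24884): extract every run of digits
spark.sql(r"SELECT regexp_extract_all('a1b22c333', '(\\d+)', 1) AS nums").show()

# The simpler, structured plan rendering now used in the web UI
spark.sql("EXPLAIN FORMATTED SELECT id % 3 AS g, count(*) FROM range(10) GROUP BY id % 3") \
    .show(truncate=False)
```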


Installing Spark on Windows Subsystem for Linux

David Alcock wants Spark, but not Windows Spark:

The post won’t cover any instructions for installing Ubuntu and instead I’ll assume you’ve installed already and downloaded the tgz file from the Apache Spark download page (Step 3 in the above link).

Let’s go straight into the terminal window and get going! I’ve put the commands in bold text (don’t include the $) just to make them easier to spot for anyone who prefers to ignore my gibberish!

Click through for the instructions.
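
If you just want the shape of it, the install boils down to a handful of commands along these lines; the version and target directory are illustrative, so match them to the tgz you actually downloaded:

```bash
# Unpack the downloaded release and move it somewhere permanent
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
sudo mv spark-3.1.1-bin-hadoop3.2 /opt/spark

# Add to ~/.bashrc so spark-shell and pyspark resolve on the PATH
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
```

A JDK on the Ubuntu side is also a prerequisite; with that in place, spark-shell should drop you straight into a REPL.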


Applied ML Prototypes

Alex Bleakley and Santiago Giraldo announce Applied ML Prototypes:

To directly address these challenges, we’ve released Applied ML Prototypes (AMPs) — a revolutionary new way of developing and shipping enterprise ML use cases — which provide complete ML projects that can be deployed with one click directly from Cloudera Machine Learning. AMPs enable data scientists to go from an idea to a fully working ML use case in a fraction of the time, with an end-to-end framework for building, deploying, and monitoring business-ready ML applications instantly. 

AMPs move the starting line for any ML project by enabling data scientists to start with a full end-to-end project developed for a similar use case, including a trained and deployed ML model, as well as prebuilt predictive business applications, out of the box. This means that ML development teams can tackle their own ML business use cases more quickly, from those involving churn modeling, to sentiment analysis, to anomaly detection and beyond.

Getting past the marketing fluff, there are some interesting ideas here.


Power BI Connector for Databricks

Stefania Leone, et al, announce general availability of the Power BI connector for Databricks:

We are excited to announce General Availability (GA) of the Microsoft Power BI connector for Databricks for Power BI Service and Power BI Desktop 2.85.681.0. Following the public preview, we have already seen strong customer adoption, so we are pleased to extend these capabilities to our entire customer base. The native Power BI connector for Databricks in combination with the recently launched SQL Analytics service provides Databricks customers with a first-class experience for performing BI workloads directly on their Delta Lake. SQL Analytics allows customers to operate a multi-cloud lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance than traditional cloud data warehouses.

This is easier to work with than the Apache Spark connector, and it looks like it should be faster than that connector as well.


Survival Analysis Notebooks

Dan Morris, et al, walk us through a survival analysis scenario:

In contrast to other methods that may seem similar on the surface, such as linear regression, survival analysis takes censoring into account. Censoring occurs when the start and/or end of a measured value is unknown. For example, suppose our historical data includes records for the two customers below. In the case of customer A, we know the precise duration of the subscription because the customer churned in December 2020. For customer B, we know that the contract started four months ago and is still active, but we do not know how much longer they will be a customer. This is an example of right censoring because we do not yet know the end date for the measured value. Right censoring is what we most commonly see with this form of analysis.

Click through for an intro as well as a half-dozen notebooks.
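
To see right censoring expressed in code, here is a minimal sketch using the lifelines library (not necessarily what the notebooks themselves use); rows like customer B get event_observed = 0:

```python
from lifelines import KaplanMeierFitter

# Subscription lengths in months; 1 = churn observed (like customer A),
# 0 = still active, i.e., right-censored (like customer B)
durations = [12, 4, 7, 20, 3, 9]
event_observed = [1, 0, 1, 1, 0, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)

print(kmf.median_survival_time_)  # censoring-aware median time to churn
print(kmf.survival_function_)     # P(still subscribed) as a function of time
```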
