Hadoop – Page 57 – Curated SQL

Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.
This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.

Finding YARN Cluster Idle Time

Published 2020-02-24 by Kevin Feasel

Dmitry Tolpeko has a Python script to track YARN cluster idle time:

In the previous article Calculating Utilization of Cluster using Resource Manager Logs I showed how to estimate per-second utilization for a Hadoop cluster.
This information can be useful to calculate the idle time statistics for a cluster, i.e., the time when no containers are running.

Click through for the script.
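
To make the idea concrete, here is a minimal Python sketch. It is not Tolpeko's script. It assumes you have already extracted each container's start and end time (as seconds) from the Resource Manager logs, and it treats any gap between merged busy periods as idle time. The function names and sample numbers are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_second, end_second), end exclusive


def idle_intervals(containers: List[Interval],
                   window_start: int,
                   window_end: int) -> List[Interval]:
    """Return the gaps inside the window during which no container runs."""
    # Keep only containers that overlap the window, clipped to its bounds.
    busy = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in containers
        if end > window_start and start < window_end
    )

    idle = []
    cursor = window_start  # end of the busy time seen so far
    for start, end in busy:
        if start > cursor:
            idle.append((cursor, start))  # nothing ran between cursor and start
        cursor = max(cursor, end)
    if cursor < window_end:
        idle.append((cursor, window_end))
    return idle


def idle_seconds(containers: List[Interval],
                 window_start: int,
                 window_end: int) -> int:
    """Total idle time in the window, in seconds."""
    return sum(end - start
               for start, end in idle_intervals(containers, window_start, window_end))


if __name__ == "__main__":
    # Hypothetical container lifetimes, in seconds from the start of the window.
    sample = [(10, 20), (15, 30), (50, 60)]
    print(idle_intervals(sample, 0, 100))
    print(idle_seconds(sample, 0, 100))
```

Sorting and merging the intervals avoids building a per-second array, which matters when the window covers days or weeks of logs.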

MR3: Hive on Kubernetes

Published 2020-02-21 by Kevin Feasel

Alex Woodie reports on a DataMonad production:

MR3 is a software product developed by a team led by Sungwoo Park. The software, which is not open source, is sold by a Delaware-based software company called DataMonad. After prototyping a Java-based execution engine called MR2 in the 2013 timeframe, development of Scala-based MR3 began in 2015. The first release of MR3 was delivered in early 2018, and version 1.0 was released yesterday.
According to DataMonad, MR3 is an execution engine for big data processing, and Hive is the first and main application that’s been configured to run on it (Tez is also supported). The company says MR3 offers comparable performance to the latest release of Hive, dubbed LLAP, but without the technical complexity.

The closed-sourcedness is a bit of a downer, but I like having more competition in the space.

Creating Sources and Sinks with Blink

Published 2020-02-21 by Kevin Feasel

Seth Wiesman has a tutorial showing how to create sources and sinks using Apache Flink’s SQL interface, Blink:

A lot of work focused on improving runtime performance and progressively extending its coverage of the SQL standard. Flink now supports the full TPC-DS query set for batch queries, reflecting the readiness of its SQL engine to address the needs of modern data warehouse-like workloads. Its streaming SQL supports an almost equal set of features – those that are well defined on a streaming runtime – including complex joins and MATCH_RECOGNIZE.
As important as this work is, the community also strives to make these features generally accessible to the broadest audience possible. That is why the Flink community is excited in 1.10 to offer production-ready DDL syntax (e.g., CREATE TABLE, DROP TABLE) and a refactored catalog interface.

Click through for a demonstration. One of the nicest things about the ANSI SQL standard is that it was intended to be a one-language solution, where the language used for administration is the same as the language used for regular queries. That cuts down on the number of languages you need to know to get your job done.

Two Performance Tricks for Spark SQL

Published 2020-02-20 by Kevin Feasel

Divyansh Jain shares a couple of tips when optimizing Apache Spark code:

1. Avoid UDFs. But why?
Because internally, Catalyst cannot optimize UDFs: it treats them as black boxes, so you lose the optimizations it would otherwise apply. Instead, try using the Spark SQL API to develop your application.

Click through for a demo and for the second tip.

Using Sqoop to Import Data into HDFS

Published 2020-02-19 by Kevin Feasel

Jon Morisi has a primer on Sqoop:

In this article, I’ll walk through using Sqoop to import data to Hadoop (HDFS).
“Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”

With respect to SQL Server, Sqoop has two good use cases: pulling data from SQL Server into HDFS, and pulling data from HDFS into a staging table in SQL Server.

Benchmarking Apache Hadoop Ozone

Published 2020-02-17 by Kevin Feasel

Istvan Fajth and Mukul Kumar Singh take us through a benchmarking test of Apache Hadoop Ozone:

Apache Hadoop Ozone was designed to address the scale limitation of HDFS with respect to small files and the total number of file system objects. On current data center hardware, HDFS has a limit of about 350 million files and 700 million file system objects. Ozone’s architecture addresses these limitations[4]. This article compares the performance of Ozone with HDFS, the de-facto big data file system.
We chose a widely used benchmark, TPC-DS, for this test and a conventional Hadoop stack consisting of Hive, Tez, YARN, and HDFS side by side with Ozone. True to the current industry need for separation of compute and storage, which enables dense storage nodes and elastic compute, we run these tests with the datanodes and node managers segregated. The fundamental ambition of this endeavor, and the subsequent effort in optimizing the product, is to be comparable in terms of stability and performance to HDFS. To that end, we would like to call out the amazing amount of work put in by the community over the past several months towards this goal.

It’s interesting to watch the Hadoop community work through these sorts of challenges, as the hardware paradigm has shifted quite a bit since HDFS was created.

Installing Spark on Windows 10

Published 2020-02-14 by Kevin Feasel

Gopal Tiwari shows how you can install Apache Spark on Windows 10:

By default, Spark SQL projects do not run on Windows and require us to perform some basic setup first; that’s all we are going to discuss in this article, as I didn’t find it well documented anywhere on the internet or in books.
This article can also be used for setting up a Spark development environment on Mac or Linux. Just make sure you’re downloading the correct OS version from Spark’s website.
You can refer to the Scala project used in this article from GitHub here: https://github.com/gopal-tiwari/LocalSparkSql.

I’ve seen (and written) installation guides for Spark. This is a good one, as it goes beyond installation and into kicking off a project and ensuring that it works.

Flink 1.10.0 Released

Published 2020-02-14 by Kevin Feasel

Marta Paes announces the release of Apache Flink 1.10.0:

The Apache Flink community is excited to hit the double digits and announce the release of Flink 1.10.0! As a result of the biggest community effort to date, with over 1.2k issues implemented and more than 200 contributors, this release introduces significant improvements to the overall performance and stability of Flink jobs, a preview of native Kubernetes integration and great advances in Python support (PyFlink).
Flink 1.10 also marks the completion of the Blink integration, hardening streaming SQL and bringing mature batch processing to Flink with production-ready Hive integration and TPC-DS coverage. This blog post describes all major new features and improvements, important changes to be aware of and what to expect moving forward.

Read on for the improvements and let me once more point out the validation of Feasel’s Law.

Building a Cache in ksqlDB

Published 2020-02-12 by Kevin Feasel

Michael Drogalis shows how to build a materialized cache to reduce the load on your Kafka Streams servers:

There are a lot of ways that you can introduce a materialized cache into your architecture. One such way is to leverage ksqlDB, an event streaming database purpose-built for stream processing applications. With native Kafka integration, ksqlDB makes it easy to replicate the pattern of scaling out many sets of distributed caches.
Let’s look at how this works in action with an example application. Imagine that you have a database storing geospatial data of pings from drivers at a ridesharing company. You have a particular piece of logic that you want to move out of the database—a frequently run query to aggregate how active a territory is. You can build a materialized cache for it using ksqlDB.

The tutorial starts you from “grab the Docker container” and takes you through the process.
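
As a rough illustration of the pattern rather than of ksqlDB itself, the Python sketch below keeps a per-territory ping count up to date as events arrive, so that a lookup is a cheap read instead of an aggregate query against the database. ksqlDB expresses this as a materialized view over a Kafka topic. The `Ping` and `TerritoryActivityCache` names and the sample events here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Ping(NamedTuple):
    """One location ping from a driver (hypothetical event shape)."""
    driver_id: str
    territory: str


class TerritoryActivityCache:
    """Incrementally maintained count of pings per territory."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)

    def apply(self, ping: Ping) -> None:
        # Update the aggregate as each event arrives, instead of
        # recomputing it from the full history on every read.
        self._counts[ping.territory] += 1

    def apply_all(self, pings: Iterable[Ping]) -> None:
        for ping in pings:
            self.apply(ping)

    def lookup(self, territory: str) -> int:
        # Reads never touch the source database; unknown territories are 0.
        return self._counts.get(territory, 0)


if __name__ == "__main__":
    cache = TerritoryActivityCache()
    cache.apply_all([
        Ping("driver-1", "north"),
        Ping("driver-2", "north"),
        Ping("driver-3", "south"),
    ])
    print(cache.lookup("north"), cache.lookup("south"), cache.lookup("east"))
```

The trade-off is the usual one for materialized views: writes do a little more work so that the frequently run read becomes a constant-time lookup.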
