Press "Enter" to skip to content

Category: Hadoop

Creating Sources and Sinks with Blink

Seth Wiesman has a tutorial showing how to create sources and sinks using Apache Flink’s SQL interface, Blink:

A lot of work focused on improving runtime performance and progressively extending its coverage of the SQL standard. Flink now supports the full TPC-DS query set for batch queries, reflecting the readiness of its SQL engine to address the needs of modern data warehouse-like workloads. Its streaming SQL supports an almost equal set of features – those that are well defined on a streaming runtime – including complex joins and MATCH_RECOGNIZE.

As important as this work is, the community also strives to make these features generally accessible to the broadest audience possible. That is why the Flink community is excited to offer, in Flink 1.10, production-ready DDL syntax (e.g., CREATE TABLE, DROP TABLE) and a refactored catalog interface.

Click through for a demonstration. One of the nicest things about the ANSI SQL standard is that it was intended to be a one-language solution, where the language used for administration is the same as the language used for regular queries. That cuts down on the number of languages you need to know to get your job done.
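
If you want a feel for what that DDL looks like in practice, here is a minimal Scala sketch against the Blink planner. The table name and connector properties are made up for illustration, and sqlUpdate is the Flink 1.10-era entry point for DDL statements.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object BlinkDdlSketch {
  def main(args: Array[String]): Unit = {
    // Blink planner in streaming mode, the planner highlighted in the post.
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    val tableEnv = TableEnvironment.create(settings)

    // Register a source table purely through DDL; the connector properties are
    // illustrative and depend on the connector/format jars on your classpath.
    tableEnv.sqlUpdate(
      """CREATE TABLE orders (
        |  order_id BIGINT,
        |  price DOUBLE,
        |  order_time TIMESTAMP(3)
        |) WITH (
        |  'connector.type' = 'filesystem',
        |  'connector.path' = 'file:///tmp/orders',
        |  'format.type' = 'csv'
        |)""".stripMargin)

    // And the same table goes away with plain DDL as well.
    tableEnv.sqlUpdate("DROP TABLE orders")
  }
}
```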

MR3: Hive on Kubernetes

Alex Woodie reports on a DataMonad production:

MR3 is a software product developed by a team led by Sungwoo Park. The software, which is not open source, is sold by a Delaware-based software company called DataMonad. After prototyping a Java-based execution engine called MR2 in the 2013 timeframe, development of Scala-based MR3 began in 2015. The first release of MR3 was delivered in early 2018, and version 1.0 was released yesterday.

According to DataMonad, MR3 is an execution engine for big data processing, and Hive is the first and main application that’s been configured to run on it (Tez is also supported). The company says MR3 offers comparable performance to the latest release of Hive, dubbed LLAP, but without the technical complexity.

The closed-sourcedness is a bit of a downer, but I like having more competition in the space.

Using Sqoop to Import Data into HDFS

Jon Morisi has a primer on Sqoop:

In this article, I’ll walk through using Sqoop to import data to Hadoop (HDFS).

“Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”

With respect to SQL Server, Sqoop has two good use cases: pulling data from SQL Server into HDFS, and pulling data from HDFS into a staging table in SQL Server.
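
Sqoop itself is driven from the command line, so there is no library API to show. Purely to give a sense of that first use case (and keeping the examples in one language), here is a rough sketch that shells out to sqoop import from Scala; every connection detail below is invented, and you would need Sqoop plus the SQL Server JDBC driver available on the node running it.

```scala
import scala.sys.process._

object SqoopImportSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection details: server, database, table, and target
    // directory are all placeholders for whatever your environment uses.
    val importCmd = Seq(
      "sqoop", "import",
      "--connect", "jdbc:sqlserver://sqlserver01:1433;databaseName=Sales",
      "--username", "sqoop_reader",
      "--password-file", "/user/sqoop/.sqlserver.password",
      "--table", "Orders",
      "--target-dir", "/data/raw/sales/orders",
      "--num-mappers", "4"
    )

    // Runs the sqoop CLI and returns its exit code.
    val exitCode = importCmd.!
    println(s"sqoop import exited with code $exitCode")
  }
}
```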

Benchmarking Apache Hadoop Ozone

Istvan Fajth and Mukul Kumar Singh take us through a benchmarking test of Apache Hadoop Ozone:

Apache Hadoop Ozone was designed to address the scale limitation of HDFS with respect to small files and the total number of file system objects. On current data center hardware, HDFS has a limit of about 350 million files and 700 million file system objects. Ozone’s architecture addresses these limitations[4]. This article compares the performance of Ozone with HDFS, the de-facto big data file system. 

We chose a widely used benchmark, TPC-DS, for this test and a conventional Hadoop stack consisting of Hive, Tez, YARN, and HDFS side by side with Ozone. True to the current industry need for separation of compute and storage, which enables dense storage nodes and elastic compute, we run these tests with the datanodes and node managers segregated. The fundamental ambition of this endeavor, and the subsequent effort in optimizing the product, is to be comparable in terms of stability and performance to HDFS. To that end, we would like to call out the amazing amount of work put in by the community over the past several months towards this goal.

It’s interesting to watch the Hadoop community work through these sorts of challenges, where the hardware paradigm has differed quite a bit from when HDFS was created.

Installing Spark on Windows 10

Gopal Tiwari shows how you can install Apache Spark on Windows 10:

By default, Spark SQL projects do not run on Windows OS and require us to perform some basic setup first; that’s all we are going to discuss in this article, as I didn’t find it well documented anywhere over the internet or in books.

This article can also be used for setting up a Spark development environment on Mac or Linux. Just make sure you download the correct OS version from Spark’s website.

You can refer to the Scala project used in this article from GitHub here: https://github.com/gopal-tiwari/LocalSparkSql.

I’ve seen (and written) installation guides for Spark. This is a good one, as it goes beyond installation and into kicking off a project and ensuring that it works.
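
As a quick sanity check once you have followed a guide like this, something along these lines is enough to confirm a local Spark SQL session comes up. The winutils path is an assumption on my part; the rest is just a trivial query.

```scala
import org.apache.spark.sql.SparkSession

object LocalSparkCheck {
  def main(args: Array[String]): Unit = {
    // On Windows, Spark expects winutils.exe under %HADOOP_HOME%\bin;
    // C:\hadoop is an assumed location -- point this at wherever you put it.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val spark = SparkSession.builder()
      .appName("local-spark-check")
      .master("local[*]")   // everything runs inside this JVM; no cluster required
      .getOrCreate()

    // A trivial DataFrame plus a SQL query to confirm Spark SQL works end to end.
    spark.range(1, 6).createOrReplaceTempView("numbers")
    spark.sql("SELECT id, id * id AS squared FROM numbers").show()

    spark.stop()
  }
}
```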

Flink 1.10.0 Released

Marta Paes announces the release of Apache Flink 1.10.0:

The Apache Flink community is excited to hit the double digits and announce the release of Flink 1.10.0! As a result of the biggest community effort to date, with over 1.2k issues implemented and more than 200 contributors, this release introduces significant improvements to the overall performance and stability of Flink jobs, a preview of native Kubernetes integration and great advances in Python support (PyFlink).

Flink 1.10 also marks the completion of the Blink integration, hardening streaming SQL and bringing mature batch processing to Flink with production-ready Hive integration and TPC-DS coverage. This blog post describes all major new features and improvements, important changes to be aware of and what to expect moving forward.

Read on for the improvements and let me once more point out the validation of Feasel’s Law.

Building a Cache in ksqlDB

Michael Drogalis shows how to build a materialized cache to reduce the load on your Kafka Streams servers:

There are a lot of ways that you can introduce a materialized cache into your architecture. One such way is to leverage ksqlDB, an event streaming database purpose-built for stream processing applications. With native Kafka integration, ksqlDB makes it easy to replicate the pattern of scaling out many sets of distributed caches.

Let’s look at how this works in action with an example application. Imagine that you have a database storing geospatial data of pings from drivers at a ridesharing company. You have a particular piece of logic that you want to move out of the database—a frequently run query to aggregate how active a territory is. You can build a materialized cache for it using ksqlDB.

The tutorial starts you from “grab the Docker container” and takes you through the process.
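
The tutorial drives everything through Docker and the ksqlDB CLI. Purely as a hedged sketch of the same idea from code, here is roughly what the two statements look like if you push them through ksqlDB’s Java client from Scala; the stream, table, and topic names are invented, and using the Java client at all is my assumption rather than anything from the post.

```scala
import io.confluent.ksql.api.client.{Client, ClientOptions}

object TerritoryActivityCache {
  def main(args: Array[String]): Unit = {
    // Assumes a ksqlDB server is listening on localhost:8088.
    val options = ClientOptions.create().setHost("localhost").setPort(8088)
    val client  = Client.create(options)

    // Declare a stream over the (hypothetical) Kafka topic of driver pings.
    client.executeStatement(
      """CREATE STREAM driver_pings (driver_id VARCHAR, territory VARCHAR, lat DOUBLE, lon DOUBLE)
        |WITH (KAFKA_TOPIC = 'driver-pings', VALUE_FORMAT = 'JSON');""".stripMargin).get()

    // Materialize the frequently run aggregate as a table that ksqlDB keeps
    // up to date, so the app reads the cache instead of hammering the database.
    client.executeStatement(
      """CREATE TABLE territory_activity AS
        |  SELECT territory, COUNT(*) AS ping_count
        |  FROM driver_pings
        |  GROUP BY territory
        |  EMIT CHANGES;""".stripMargin).get()

    client.close()
  }
}
```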

Finding the Right Disk and Data Node Sizes in HDFS

Lokesh Jain has some advice when it comes to disk and data node size:

There are two factors to keep in mind when choosing node capacity. These will be discussed in detail in the next sections.

1. Large Disks – total node capacity being the same, using more disks is better as it yields higher aggregate IO bandwidth.
2. Dense Nodes – as nodes get denser, recovery after node failure takes longer.

These factors are not HDFS-specific and will impact any distributed storage service that replicates data for redundancy and serves live workloads.

Click through for specific advice on maximum disk and node sizes.
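
To put a rough number on that second factor: the back-of-the-envelope math below uses invented figures (a 400 TB node and 10 GB/s of aggregate re-replication bandwidth), but it shows why denser nodes stretch out recovery windows.

```scala
object NodeRecoverySketch {
  def main(args: Array[String]): Unit = {
    // Illustrative numbers only -- substitute your own node density and the
    // re-replication bandwidth your cluster can actually sustain.
    val nodeCapacityTb   = 400.0  // data sitting on the failed node, in TB
    val recoveryGbPerSec = 10.0   // aggregate cluster re-replication bandwidth, in GB/s

    val seconds = (nodeCapacityTb * 1000.0) / recoveryGbPerSec
    val hours   = seconds / 3600.0

    // Doubling node density roughly doubles this window, all else being equal.
    println(f"Re-replicating $nodeCapacityTb%.0f TB at $recoveryGbPerSec%.0f GB/s takes about $hours%.1f hours")
  }
}
```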

Unit Testing in Apache Flink

Kartik Khare has a guide to help us write unit tests for our Apache Flink code:

Writing tests for process functions that work with time is quite similar to writing tests for stateful functions, because you can also use the test harness. However, you need to take care of another aspect: providing timestamps for events and controlling the current time of the application. By setting the current (processing or event) time, you can trigger registered timers, which will call the onTimer method of the function.

Click through for demos and more details on the test harness.
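
To make that concrete, here is a hedged sketch of the pattern in Scala: a toy KeyedProcessFunction that registers a processing-time timer, exercised with Flink’s operator test harness (which lives in the flink-streaming-java test-jar). The function itself is invented for illustration.

```scala
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.operators.KeyedProcessOperator
import org.apache.flink.streaming.util.KeyedOneInputStreamOperatorTestHarness
import org.apache.flink.util.Collector

// A toy function: for each element, register a processing-time timer 10 seconds
// out and emit a message for the key when it fires.
class TimeoutFlagger extends KeyedProcessFunction[String, String, String] {
  override def processElement(value: String,
                              ctx: KeyedProcessFunction[String, String, String]#Context,
                              out: Collector[String]): Unit = {
    ctx.timerService().registerProcessingTimeTimer(
      ctx.timerService().currentProcessingTime() + 10000L)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, String, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    out.collect(s"timer fired for key ${ctx.getCurrentKey} at $timestamp")
  }
}

object TimeoutFlaggerTest {
  def main(args: Array[String]): Unit = {
    val harness = new KeyedOneInputStreamOperatorTestHarness[String, String, String](
      new KeyedProcessOperator(new TimeoutFlagger),
      new KeySelector[String, String] { override def getKey(value: String): String = value },
      BasicTypeInfo.STRING_TYPE_INFO)

    harness.open()
    harness.setProcessingTime(0L)          // pin the harness's processing time
    harness.processElement("driver-1", 0L) // element arrives, timer set for t = 10000
    harness.setProcessingTime(10001L)      // advancing past the timer triggers onTimer

    // getOutput() holds whatever the function emitted -- exactly one record here.
    assert(harness.getOutput.size() == 1)
    harness.close()
  }
}
```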
