Category: Hadoop

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.
For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.

Comments closed

Implicit Type Conversions with Spark SQL

Published 2020-03-06 by Kevin Feasel

Manoj Pandey walks us through an unexpected error with Spark SQL:

While working on some data analysis I saw that one Spark SQL query was not returning the expected results. The table had a good amount of data, and I was filtering on a value, but some records were missing. So, I checked online and found that Spark SQL works differently compared to SQL Server, in this case when comparing columns or variables of two different data types.

Read on to learn more about the issue. This is the downside of Feasel’s Law: just because both system interfaces are SQL doesn’t mean that they’re equivalent or that the assertions and assumptions you can make for one follow through to the next.

Confluent Developer

Published 2020-03-05 by Kevin Feasel

Tim Berglund announces Confluent Developer:

Today, I am pleased to announce the launch of Confluent Developer, the one and only portal for everything you need to get started with Apache Kafka®, Confluent Platform, and Confluent Cloud! Everything on Confluent Developer is completely free and ungated. It’s a single online source of everything you’ll need to learn Kafka: links to documentation, collections of video tutorials, links to sample code, the entire collection of guided Kafka Tutorials, an index of podcast episodes, and a link to our global network of meetups.

The site is laid out really well.

Secure Azure Data Source Access from Databricks

Published 2020-03-04 by Kevin Feasel

Bhavin Kukadia, Abhinav Garg, and Michal Marusan show us the right way to access Azure data sources from Azure Databricks:

Enterprise Security is a core tenet of building software at both Databricks and Microsoft, and thus it’s considered as a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity refers to ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options which customers could use to establish such a data access mechanism to Azure Data services like Azure Blob Storage, Azure Data Lake Store Gen2, Azure Synapse Data Warehouse, Azure CosmosDB etc. Please read further for a discussion on Azure Private Link and Service Endpoints.

This is more about network configuration than things like “store your credentials and other secrets in Azure Key Vault,” which is also a good idea.

Incremental Imports with Sqoop

Published 2020-03-03 by Kevin Feasel

Jon Morisi continues a series on Sqoop:

In my last two blog posts I walked through how to use Sqoop to perform full imports. Nightly full imports with overwrite have their place for small tables like dimension tables. However, in real-world scenarios you’re also going to want a way to import only the delta values since the last time an import was run. Sqoop offers two ways to perform incremental imports: append and lastmodified.
Both incremental imports can be run manually or created as a job using the “sqoop job” command. When running incremental imports manually from the command line, the “--last-value” arg is used to specify the reference value for the check-column. Alternatively, sqoop jobs track the “check-column” in the job, and the value of the check-column is used for subsequent job runs as the where predicate in the SQL statement, i.e., select columns from table where check-column > (last-max-check-column-value).
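
To make the check-column idea concrete, here is a minimal sketch of the append-style pattern in plain Python against an in-memory SQLite table. It is not Sqoop itself and does not reflect how Sqoop is implemented; the function name incremental_append, the orders table, and its columns are hypothetical names chosen for illustration. It only shows the "where check-column > last value" predicate and how the stored value advances between runs.

```python
import sqlite3


def incremental_append(conn, table, check_column, last_value):
    """Fetch rows whose check column exceeds last_value.

    Returns the new rows plus the updated high-water mark, mimicking the
    idea of tracking a last value between runs (hypothetical helper).
    """
    # Identifiers cannot be bound as SQL parameters, so this sketch assumes
    # that table and check_column come from trusted configuration.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {check_column} > ? ORDER BY {check_column}"
    )
    cursor = conn.execute(query, (last_value,))
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if not rows:
        # Nothing new: keep the previous high-water mark.
        return rows, last_value
    # Rows are ordered by the check column, so the last row holds the maximum.
    new_last_value = rows[-1][columns.index(check_column)]
    return rows, new_last_value


# Hypothetical source table with a monotonically increasing id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)

# First run starts from 0 and picks up everything.
first_batch, last_value = incremental_append(conn, "orders", "id", 0)

# A new row arrives; the second run only picks up the delta.
conn.execute("INSERT INTO orders VALUES (4, 40.0)")
second_batch, last_value = incremental_append(conn, "orders", "id", last_value)
```

The sketch covers only the append case, where the check column always increases. A lastmodified-style import would compare a timestamp column instead and would also need to handle rows that were updated rather than newly inserted.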

This is where Sqoop starts to break down for me, and Jon lists some of the issues in the post.

Hive: Shuffle Failed with Too Many Fetch Failures

Published 2020-02-28 by Kevin Feasel

Dmitry Tolpeko takes us through an ugly error:

On one of the clusters I noticed an increased rate of shuffle errors, and restarting a job did not help; it still failed with the same error.
The error was as follows:
Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal (Shuffle.java:301)
Caused by: java.io.IOException: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true

Click through to understand what this error means and what you can do about it.

How Apache Beam Runs on Top of Apache Flink

Published 2020-02-27 by Kevin Feasel

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.

Loading Data into Delta Lake

Published 2020-02-27 by Kevin Feasel

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.

Using a Spark Listener

Published 2020-02-26 by Kevin Feasel

Bipin Patwardhan shares with us an event ingestion engine for Apache Spark:

In the last quarter of 2019, I developed a metadata-driven ingestion engine using Spark. The framework/library has multiple patterns to cater to multiple source and destination combinations. For example, two patterns are available for loading flat files to cloud storage (one to load data to AWS S3 and another to load data to Azure Blob).
As data loading philosophies have changed from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT), such a framework is very useful, as it reduces the time needed to set up ingestion jobs.

Is anyone else getting Integration Services or Informatica flashbacks? Because I sure am.

Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Published 2020-02-25 by Kevin Feasel

Steffen Hausmann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.
This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.
