Press "Enter" to skip to content

Category: Hadoop

Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.
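If you want a feel for what the setup involves before watching, here is a minimal sketch of registering the Elasticsearch sink connector through the Kafka Connect REST API. The endpoints, topic name, and connector name are placeholders, and the config is pared down to the basics rather than anything taken from the demo itself.

```python
import requests

# Hypothetical Kafka Connect and Elasticsearch endpoints, topic, and connector name.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "orders-elasticsearch-sink",
    "config": {
        # Confluent's Elasticsearch sink connector class
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",                         # Kafka topic(s) to index
        "connection.url": "http://localhost:9200",  # Elasticsearch endpoint
        "key.ignore": "true",     # let Elasticsearch assign document IDs
        "schema.ignore": "true",  # index the JSON values without requiring a schema
        "tasks.max": "1",
    },
}

# Register the connector with the Kafka Connect REST API.
response = requests.post(CONNECT_URL, json=connector)
response.raise_for_status()
print(response.json())
```

Once the POST succeeds, records from the topic start flowing into Elasticsearch; by default the connector writes to an index named after the topic.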


Implicit Type Conversions with Spark SQL

Manoj Pandey walks us through an unexpected error with Spark SQL:

While working on some data analysis, I saw that one Spark SQL query was not getting me the expected results. The table had a good amount of data and I was filtering on a value, but some records were missing. So I checked online and found that Spark SQL works differently from SQL Server in this case: when comparing columns or variables of two different data types.

Read on to learn more about the issue. This is the downside of Feasel’s Law: just because both system interfaces are SQL doesn’t mean that they’re equivalent or that the assertions and assumptions you can make for one follow through to the next.
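To see the sort of thing that can bite you, here is a small sketch (not Manoj's exact query) of comparing a string column against a numeric literal in Spark SQL; the implicit cast rules vary by Spark version, which is exactly why rows can silently go missing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("implicit-cast-demo").getOrCreate()

# A string column holding numeric-looking values, including one with a decimal point.
df = spark.createDataFrame(
    [("1",), ("2",), ("2.0",), ("3",)],
    ["id"],
)
df.createOrReplaceTempView("t")

# Comparing a string column to an integer literal forces an implicit cast.
# Depending on the Spark version's coercion rules, '2.0' may or may not be
# treated as equal to 2, so rows can quietly drop out of the result compared
# to what a SQL Server user expects.
spark.sql("SELECT * FROM t WHERE id = 2").show()

# Being explicit about the cast removes the ambiguity.
spark.sql("SELECT * FROM t WHERE CAST(id AS DOUBLE) = 2.0").show()
```

Being explicit with CAST (or keeping the literal the same type as the column) sidesteps the implicit conversion entirely.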


Confluent Developer

Tim Berglund announces Confluent Developer:

Today, I am pleased to announce the launch of Confluent Developer, the one and only portal for everything you need to get started with Apache Kafka®, Confluent Platform, and Confluent Cloud! Everything on Confluent Developer is completely free and ungated. It’s a single online source of everything you’ll need to learn Kafka: links to documentation, collections of video tutorials, links to sample code, the entire collection of guided Kafka Tutorials, an index of podcast episodes, and a link to our global network of meetups.

The site is laid out really well.


Secure Azure Data Source Access from Databricks

Bhavin Kukadia, Abhinav Garg, and Michal Marusan show us the right way to access Azure data sources from Azure Databricks:

Enterprise Security is a core tenet of building software at both Databricks and Microsoft, and thus it’s considered as a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity refers to ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options which customers could use to establish such a data access mechanism to Azure Data services like Azure Blob Storage, Azure Data Lake Store Gen2, Azure Synapse Data Warehouse, Azure Cosmos DB, etc. Please read further for a discussion on Azure Private Link and Service Endpoints.

This is more about network configuration rather than things like “store your credentials and other secrets in Azure Key Vault,” which is also a good idea.


Incremental Imports with Sqoop

Jon Morisi continues a series on Sqoop:

In my last two blog posts I walked through how to use Sqoop to perform full imports. Nightly full imports with overwrite have their place for small tables like dimension tables. However, in real-world scenarios you’re also going to want a way to import only the delta values since the last time an import was run. Sqoop offers two ways to perform incremental imports: append and lastmodified.

Both incremental imports can be run manually or created as a job using the “sqoop job” command. When running incremental imports manually from the command line, the “--last-value” arg is used to specify the reference value for the check-column. Alternatively, Sqoop jobs track the “check-column” in the job, and the value of the check-column is used for subsequent job runs as the WHERE predicate in the SQL statement, i.e., select columns from table where check-column > (last-max-check-column-value).

This is where Sqoop starts to break down for me, and Jon lists some of the issues in the post.
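Sqoop itself is driven from the command line. Purely as an illustration of the flags Jon mentions, here is a sketch of an append-mode incremental import, wrapped in Python's subprocess so it can be scheduled from a script; the JDBC URL, credentials, table, and last value are all hypothetical.

```python
import subprocess

# Hypothetical connection details; replace with your own.
jdbc_url = "jdbc:sqlserver://dbserver:1433;databaseName=Sales"

# An append-mode incremental import: only rows with order_id greater than
# the supplied --last-value are pulled in this run.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", jdbc_url,
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop_user/.password",
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "1000",
]

subprocess.run(sqoop_cmd, check=True)
```

Creating the equivalent saved job with “sqoop job --create” is what lets Sqoop track the last value of the check column for you between runs, which is the behavior Jon describes.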


Hive: Shuffle Failed with Too Many Fetch Failures

Dmitry Tolpeko takes us through an ugly error:

On one of the clusters I noticed an increased rate of shuffle errors, and restarting the job did not help; it still failed with the same error.

The error was as follows:

Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal (Shuffle.java:301)

Caused by: java.io.IOException: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true

Click through to understand what this error means and what you can do about it.


How Apache Beam Runs on Top of Apache Flink

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.
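For a sense of how little the pipeline code cares about the runner, here is a minimal word-count sketch using the Beam Python SDK; the Flink master address and file paths are placeholders, and swapping runner="FlinkRunner" for the default DirectRunner is the only change needed to run it locally instead.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Selecting FlinkRunner (instead of the default DirectRunner) is what hands
# this same pipeline definition to a Flink cluster for execution.
options = PipelineOptions(
    runner="FlinkRunner",
    flink_master="localhost:8081",  # hypothetical Flink REST endpoint
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("/tmp/input.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("/tmp/word_counts")
    )
```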


Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.
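As a rough sketch of the shape of it (on Databricks, where the “cloudFiles” source is available, and with hypothetical paths), an Auto Loader stream looks like a normal Structured Streaming job with a different source format:

```python
# Runs on Databricks, where `spark` is the SparkSession the runtime provides
# and the "cloudFiles" streaming source is available. Paths are placeholders.
input_path = "abfss://landing@myaccount.dfs.core.windows.net/events/"
delta_path = "/delta/events"
checkpoint_path = "/delta/events/_checkpoints"

# Incrementally pick up new files as they land in the input directory.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")  # format of the incoming raw files
    .load(input_path)
)

# Write the stream out as a Delta table.
(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .start(delta_path)
)
```

The checkpoint location is where the stream keeps track of which files it has already processed.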


Using a Spark Listener

Bipin Patwardhan shares with us an event ingestion engine for Apache Spark:

In the last quarter of 2019, I developed a metadata-driven ingestion engine using Spark. The framework/library has multiple patterns to cater to multiple source and destination combinations. For example, two patterns are available for loading flat files to cloud storage (one to load data to AWS S3 and another to load data to Azure Blob).

As data loading philosophies have changed from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT), such a framework is very useful, as it reduces the time needed to set up ingestion jobs.

Is anyone else getting Integration Services or Informatica flashbacks? Because I sure am.
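For what it's worth, the metadata-driven part of such a framework tends to boil down to a dispatch table. This toy sketch (not Bipin's code) shows the idea of mapping a (source, destination) pair from job metadata onto a loader:

```python
# A toy illustration of the metadata-driven idea: each (source, destination)
# combination maps to a small loader function, and a job is just a metadata
# dictionary handed to the dispatcher.

def flat_file_to_s3(meta: dict) -> None:
    print(f"Loading {meta['path']} into s3://{meta['bucket']}/{meta['prefix']}")

def flat_file_to_blob(meta: dict) -> None:
    print(f"Loading {meta['path']} into {meta['container']}@{meta['account']}")

# Registry of supported source/destination patterns.
PATTERNS = {
    ("flat_file", "s3"): flat_file_to_s3,
    ("flat_file", "azure_blob"): flat_file_to_blob,
}

def run_ingestion(meta: dict) -> None:
    loader = PATTERNS[(meta["source"], meta["destination"])]
    loader(meta)

# Adding a new feed means adding a metadata entry, not writing a new pipeline.
run_ingestion({
    "source": "flat_file",
    "destination": "s3",
    "path": "/landing/orders.csv",
    "bucket": "my-data-lake",
    "prefix": "raw/orders",
})
```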


Using Sqoop to Move Data into Hive

Jon Morisi continues a series on Sqoop:

Sqoop completes the import task by running MapReduce jobs that import the data to HDFS, and then running Hive commands (CREATE TABLE / LOAD DATA INPATH) to move the data to Hive. The default HDFS location is /user/[login]/[TABLENAME]. If you have any issues during the import you may need to remove the HDFS directory prior to re-running, or else you will get an error.

Read on for sample calls and additional notes.
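As a rough illustration of the flow Jon describes (cleaning up the HDFS staging directory and then importing straight into Hive), here is a sketch using Python's subprocess; the connection string, user, and table names are hypothetical.

```python
import subprocess

# Hypothetical connection details; replace with your own.
jdbc_url = "jdbc:sqlserver://dbserver:1433;databaseName=Sales"

# Clean up the HDFS staging directory from a previous failed run, since
# Sqoop stages data under /user/[login]/[TABLENAME] before loading Hive.
subprocess.run(
    ["hdfs", "dfs", "-rm", "-r", "-f", "/user/sqoop_user/orders"],
    check=True,
)

# Import the table and let Sqoop create and load the Hive table.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", "sqoop_user",
        "--password-file", "/user/sqoop_user/.password",
        "--table", "orders",
        "--hive-import",
        "--hive-table", "staging.orders",
    ],
    check=True,
)
```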
