Press "Enter" to skip to content

Category: Hadoop

Shipping Kafka Logs to Kibana via Filebeat

Shivani Sarthi uses Filebeat to perform log shipping:

To ship the Kafka logs, we will be using the filebeat agent. A filebeat agent is a lightweight shipper whose purpose is to forward and centralize the log data.

For filebeat to work, you need to install it as an agent on the desired servers. Filebeat then monitors the log files, collects the log events, and forwards them to the ElasticSearch or LogStash for indexing.

Click through for an Ansible script to install Filebeat, integrate with Kafka, and communicate with Logstash for eventual querying via Kibana.

Comments closed

Flink 1.15 Released

Joe Moser and Yun Gao announce Apache Flink 1.15:

Thanks to our well-organized and open community, Apache Flink continues to grow as a technology and remain one of the most active projects in the Apache community. With the release of Flink 1.15, we are proud to announce a number of exciting changes.

One of the main concepts that makes Apache Flink stand out is the unification of batch (aka bounded) and stream (aka unbounded) data processing, which helps reduce the complexity of development. A lot of effort went into this unification in the previous releases, and you can expect more efforts in this direction.

Apache Flink is not only growing when it comes to contributions and users, but also out of the original use cases. We are seeing a trend towards more business/analytics use cases implemented in low-/no-code. Flink SQL is the feature in the Flink ecosystem that enables such uses cases and this is why its popularity continues to grow.

Flink SQL is Feasel’s Law in action.

Comments closed

Comparing Databricks to Synapse Spark Pools

Corrinna Peters makes comparisons:

There are different cases for using both depending on the specific needs and requirements, Synapse and Databricks are similar, but both have their own areas of specialities or rather areas where they are above the other.

Data Lake – they both allow you to query the data from the data lake, Synapse uses either the SQL on demand pool or Spark and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominately a SQL user and prefer the code and the BI developer feel then Synapse would be the correct choice whereas if you are a Data Scientist and prefer to code in Python or R then Databricks would feel more at home.

Read on for a nuanced take. My less nuanced take is, Databricks beats the pants off of Synapse Spark pools in terms of performance. Synapse has a much better overall ecosystem, expanding beyond Spark and into T-SQL (in two flavors) and log/event analytics with KQL. If you’re spending 100% of your time in Spark and don’t care about the rest, use Databricks; if Spark is a relatively small part of your warehousing work, use Synapse.

1 Comment

Azure Databricks Security Considerations

Craig Porteous provides some advice on configuring Azure Databricks:

Azure Databricks is an analytics platform and often serves as the central compute component of a data platform, to process ETL/ELT data pipelines and data science workloads. As Databricks is a third-party platform-as-a-service offering securing it works differently to most other first-party services in Azure; for example, we can’t use private endpoints. (More on these in the Azure Storage post)

The two main approaches to working with Databricks in our secure platform are VNet Peering or VNet Injection

Click through to learn the difference between these two, as well as a few other factors to keep in mind as you’re deploying Databricks.

Comments closed

From Confluent Cloud into Azure Synapse Analytics

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall, is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synched to dedicated SQL pools via the fully managed Synapse sink connector and attached to Synapse Analytics workspace. Once added to the Synapse Analytics workspace, analysts have the ability to perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.

Comments closed

Connecting Kafka Cross-Network

Praful Khandelwal sets up a hybrid Kafka cluster:

In this article, we will be talking about a simple set-up involving local machine (macOS) and Azure VM. We’ll discuss the step-by-step procedure to produce events from local machine to Kafka broker hosted on Azure VM and also to consume those events back in local machine. While this does not cover the exact scenario described above, it gives a fair idea about how the Kafka messages can be exchanged across the network.

Kafka is pretty chatty, so I’d hope to have really good network connectivity, such as a Direct Connect (for AWS) or Express Route (Azure) in place.

Comments closed

Zero-Rename Writes in ElasticMapReduce Hive

Suthan Phillips, et al, show off some updates to the way Hive transactions commit in AWS’s ElasticMapReduce:

Our customers use Apache Hive on Amazon EMR for large-scale data analytics and extract, transform, and load (ETL) jobs. Amazon EMR Hive uses Apache Tez as the default job execution engine, which creates Directed Acyclic Graphs (DAGs) to process data. Each DAG can contain multiple vertices from which tasks are created to run the application in parallel. Their final output is written to Amazon Simple Storage Service (Amazon S3).

Hive initially writes data to staging directories and then move it to the final location after a series of rename operations. This design of Hive renames supports task failure recovery, such as rescheduling the failed task with another attempt, running speculative execution, and recovering from a failed job attempt. These move and rename operations don’t have a significant performance impact in HDFS because it’s only a metadata operation when compared to Amazon S3 where the performance can degrade significantly based on the number of files written.

This post discusses the new optimized committer for Hive in Amazon EMR and also highlights its impressive performance by running a TPCx-BB performance benchmark and comparing it with the Hive default commit logic.

Read on for a description of how commit operations work in general and how the updated Hive committer can help with certain types of queries.

Comments closed

Right to Be Forgotten in Delta Lake

Milos Colic, et al, tackle a tricky problem:

With Delta, we have one more tool at our disposal to address GDPR compliance and, in particular, “the right to be forgotten” – VACUUM. Vacuum operation removes the files that are no longer needed and that are older than a predefined retention period. The default retention period is 30 days to align with GDPR definition of undue delay. Our earlier blog on a similar topic explains in detail how you can find and delete personal information related to a consumer by running two commands:

The part I’m finding tricky here is, how does this handle “time travel” scenarios in which you’re looking at prior iterations of data? I haven’t run through all of the scenarios so this is just speculation, but it seems that even with all of these changes, you’d still have to worry about historical data containing that sensitive information.

Comments closed