Hadoop – Page 18 – Curated SQL

Using the HAVING Clause in Spark

Published 2022-05-16 by Kevin Feasel

Lnadon Robinson continues the Spark Starter Guide:

Having is similar to filtering (filter(), where() or where (in a SQL clause)), but the use cases differ slightly. While filtering allows you to apply conditions on your non-aggregated columns to limit the result set, Having allows you to apply conditions on aggregate functions / columns instead.

Read on for examples in Spark SQL, both as a SQL query and Scala/Python function calls.

Comments closed

Optimizing Hive Performance with Tez

Published 2022-05-11 by Kevin Feasel

Jay Desai has some recommendations around tuning Tez queries:

Tuning Hive on Tez queries can never be done in a one-size-fits-all approach. The performance on queries depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time during performance testing of the workload, and would be best to assess the impact of tuning changes in your development and QA environments before using them in production environments. Cloudera WXM can assist in evaluating the benefits of query changes during performance testing.

Click through for several configuration and query considerations.

Comments closed

Shipping Kafka Logs to Kibana via Filebeat

Published 2022-05-11 by Kevin Feasel

Shivani Sarthi uses Filebeat to perform log shipping:

To ship the Kafka logs, we will be using the filebeat agent. A filebeat agent is a lightweight shipper whose purpose is to forward and centralize the log data.
For filebeat to work, you need to install it as an agent on the desired servers. Filebeat then monitors the log files, collects the log events, and forwards them to the ElasticSearch or LogStash for indexing.

Click through for an Ansible script to install Filebeat, integrate with Kafka, and communicate with Logstash for eventual querying via Kibana.

Comments closed

Flink 1.15 Released

Published 2022-05-05 by Kevin Feasel

Joe Moser and Yun Gao announce Apache Flink 1.15:

Thanks to our well-organized and open community, Apache Flink continues to grow as a technology and remain one of the most active projects in the Apache community. With the release of Flink 1.15, we are proud to announce a number of exciting changes.
One of the main concepts that makes Apache Flink stand out is the unification of batch (aka bounded) and stream (aka unbounded) data processing, which helps reduce the complexity of development. A lot of effort went into this unification in the previous releases, and you can expect more efforts in this direction.
Apache Flink is not only growing when it comes to contributions and users, but also out of the original use cases. We are seeing a trend towards more business/analytics use cases implemented in low-/no-code. Flink SQL is the feature in the Flink ecosystem that enables such uses cases and this is why its popularity continues to grow.

Flink SQL is Feasel’s Law in action.

Comments closed

Comparing Databricks to Synapse Spark Pools

Published 2022-05-04 by Kevin Feasel

Corrinna Peters makes comparisons:

There are different cases for using both depending on the specific needs and requirements, Synapse and Databricks are similar, but both have their own areas of specialities or rather areas where they are above the other.
Data Lake – they both allow you to query the data from the data lake, Synapse uses either the SQL on demand pool or Spark and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominately a SQL user and prefer the code and the BI developer feel then Synapse would be the correct choice whereas if you are a Data Scientist and prefer to code in Python or R then Databricks would feel more at home.

Read on for a nuanced take. My less nuanced take is, Databricks beats the pants off of Synapse Spark pools in terms of performance. Synapse has a much better overall ecosystem, expanding beyond Spark and into T-SQL (in two flavors) and log/event analytics with KQL. If you’re spending 100% of your time in Spark and don’t care about the rest, use Databricks; if Spark is a relatively small part of your warehousing work, use Synapse.

1 Comment

Generating Identity Integers in Spark

Published 2022-04-28 by Kevin Feasel

The Hadoop in Real World team hits a favorite topic of mine, monotonicity:

We would like to add an index column with a unique incrementing value like below.
There are few options to implement this use case in Spark. Let’s see them one by one.

These are all functions which you apply to existing data rather than generating new values like IDENTITY or Sequences in relational databases.

Comments closed

Azure Databricks Security Considerations

Published 2022-04-26 by Kevin Feasel

Craig Porteous provides some advice on configuring Azure Databricks:

Azure Databricks is an analytics platform and often serves as the central compute component of a data platform, to process ETL/ELT data pipelines and data science workloads. As Databricks is a third-party platform-as-a-service offering securing it works differently to most other first-party services in Azure; for example, we can’t use private endpoints. (More on these in the Azure Storage post)
The two main approaches to working with Databricks in our secure platform are VNet Peering or VNet Injection

Click through to learn the difference between these two, as well as a few other factors to keep in mind as you’re deploying Databricks.

Comments closed

From Confluent Cloud into Azure Synapse Analytics

Published 2022-04-12 by Kevin Feasel

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall, is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synched to dedicated SQL pools via the fully managed Synapse sink connector and attached to Synapse Analytics workspace. Once added to the Synapse Analytics workspace, analysts have the ability to perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.

Comments closed

Connecting Kafka Cross-Network

Published 2022-04-11 by Kevin Feasel

Praful Khandelwal sets up a hybrid Kafka cluster:

In this article, we will be talking about a simple set-up involving local machine (macOS) and Azure VM. We’ll discuss the step-by-step procedure to produce events from local machine to Kafka broker hosted on Azure VM and also to consume those events back in local machine. While this does not cover the exact scenario described above, it gives a fair idea about how the Kafka messages can be exchanged across the network.

Kafka is pretty chatty, so I’d hope to have really good network connectivity, such as a Direct Connect (for AWS) or Express Route (Azure) in place.

Comments closed

Parsing URLs with Hive

Published 2022-04-05 by Kevin Feasel

The Hadoop in Real World team shreds some URLs:

Hive offers 2 functions to work with URLS – parse_url and parse_url_tuple.
With both functions you can extract information like – PROTOCOL, HOST, PATH, QUERY, Query parameters etc.
Let’s see them in action.

Let’s, shall we?

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Hadoop