Press "Enter" to skip to content

Category: Hadoop

Run Spark within Azure ML Compute

James Nguyen makes an announcement:

Following the blog post on Turning AML compute into Ray and Dask, we’ve added an exciting new capability to run Spark within AML compute, where Spark shares the same context with your ML code. The Spark version is 3.2.1 with support for Delta Lake and Synapse SQL read/write. This enables users of AML to perform powerful data transformation and even Spark ML within an AML interactive notebook or in a job run.

Traditionally, Azure ML integrates with Synapse Spark or external compute services via a pipeline step or, better, via a magic command like %synapse, but the computing context is separate from your AML logic, so you still need to run Spark in a separate step, persist the output to some storage, and then load it in your AML script.

With this approach, Spark is available right within your AML code, whether that’s an AML notebook, a Python script, or a pipeline step. It shares the common computing context, and in most cases you can directly convert a Spark DataFrame to a Pandas or Dask DataFrame without first persisting it to intermediary storage.

I’ll have to try this out to see if it makes up for their getting rid of the Spark-based curated environments last year.
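Here’s a minimal sketch of what that shared context means in practice: do the heavy lifting in Spark, then hand the (now small) result straight to pandas for the ML code, with no intermediate write to storage. It assumes a SparkSession is already available in the AML notebook, as the announcement describes, and uses made-up data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for a large source table (the announcement also covers
# Delta Lake and Synapse SQL reads).
events = spark.createDataFrame(
    [("c1", 12.0), ("c1", 3.5), ("c2", 7.0)],
    ["customer_id", "amount"],
)

# Heavy aggregation stays in Spark...
features = events.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))

# ...then straight to pandas, in the same process, for the ML code.
pdf = features.toPandas()
print(pdf)
```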

Comments closed

Filter on Aggregate Columns in Spark

Landon Robinson shows off the HAVING clause:

Having is similar to filtering (filter(), where(), or WHERE in a SQL clause), but the use cases differ slightly. While filtering allows you to apply conditions on your data to limit the result set, Having allows you to apply conditions on aggregate functions on your data to limit your result set.

Both limit your result set – but the difference in how they are applied is the key. In short: where filters are for row-level filtering; Having filters are for aggregate-level filtering. As a result, using a Having statement can also simplify (or outright negate) the need to use some sub-queries.

Click through for examples of HAVING in use.
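To make the distinction concrete, here’s a minimal Spark SQL sketch (not taken from Robinson’s post; the table and numbers are made up): WHERE filters rows before aggregation, HAVING filters on the aggregate itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [("US", 10.0), ("US", 25.0), ("CA", 5.0)],
    ["country", "amount"],
).createOrReplaceTempView("sales")

spark.sql("""
    SELECT country, SUM(amount) AS total
    FROM sales
    WHERE amount > 1.0          -- row-level filter
    GROUP BY country
    HAVING SUM(amount) > 20.0   -- aggregate-level filter
""").show()
```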

Comments closed

Ranking Data in Spark

Landon Robinson continues the Spark Starter Guide:

Ranking is, fundamentally, ordering based on a condition. So, in essence, it’s like a combination of a where clause and order by clause—the exception being that data is not removed through ranking; it is, well, ranked instead. While ordering allows you to sort data based on a column, ranking allows you to allocate a number (e.g. row number or rank) to each row (based on a column or condition) so that you can utilize it in logical decision making, like selecting a top result, or applying further transformations.

One very common ranking function is row_number(), which allows you to assign a unique value or “rank” to each row or rows within a grouping based on a specification. That specification, at least in Spark, is controlled by partitioning and ordering a dataset. The result allows you, for example, to achieve “top n” analysis in Spark.

One minor adjustment I’d make is not calling the output of ROW_NUMBER() “Rank” because then it’d make me think that’s the output of the RANK() window function. In the event of ties, those two outputs will differ.
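A quick, made-up illustration of that difference: over the same window, row_number() keeps counting through the tied scores while rank() repeats the value.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 100), ("a", 100), ("a", 90), ("b", 50)],
    ["grp", "score"],
)

w = Window.partitionBy("grp").orderBy(F.desc("score"))

# On the tied 100s, row_number gives 1 and 2, while rank gives 1 and 1.
df.withColumn("row_number", F.row_number().over(w)) \
  .withColumn("rank", F.rank().over(w)) \
  .show()
```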

Comments closed

get_json_object and json_tuple in Hive

The Hadoop in Real World team looks at a pair of Hive functions:

Both the get_json_object and json_tuple functions in Hive are meant to work with JSON data.

Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the below JSON text into the table.

Click through for that script as well as seeing how you use each of get_json_object() and json_tuple() and which might be better-suited for your purposes.
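Spark SQL happens to expose the same two functions, so here’s a quick sketch of what each looks like, using a made-up JSON document rather than the hirw_courses data (which is behind the link).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('{"course": "Hive", "price": 100, "level": "beginner"}',)],
    ["json_col"],
)

# get_json_object: one JSONPath expression per call.
df.select(
    F.get_json_object("json_col", "$.course").alias("course"),
    F.get_json_object("json_col", "$.price").alias("price"),
).show()

# json_tuple: pull several top-level keys in a single call.
df.select(F.json_tuple("json_col", "course", "price", "level")).show()
```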

Comments closed

Databricks Delta Sharing for Azure

Will Girten, et al, announce Delta Sharing on Azure:

Included in this release is a new and improved API for listing all the tables under all schemas in a share. The new API supports pagination similar to other APIs.

For example, to list all the tables in the Delta share my_share, you can simply send a GET request to the /shares/{share_name}/all-tables endpoint on the sharing server.

Prior to that, you might want to read up on Delta Sharing.
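For a rough idea of the shape of the call, here’s a sketch using Python’s requests library. The server URL, token, and pagination parameter names are placeholders and assumptions on my part; check the Delta Sharing protocol docs for the authoritative request format.

```python
import requests

SERVER = "https://sharing.example.com/delta-sharing"   # hypothetical sharing server
TOKEN = "bearer-token-from-share-profile"              # hypothetical credential

resp = requests.get(
    f"{SERVER}/shares/my_share/all-tables",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"maxResults": 100},   # assumed pagination parameter
)
resp.raise_for_status()

body = resp.json()
for table in body.get("items", []):
    print(table)

# If a nextPageToken comes back, pass it as pageToken on the next request
# (again, assumed field names).
```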

Comments closed

Using Synapse Link for Cosmos DB

I have a post combining Synapse Link for Cosmos DB and the Spark to Synapse SQL Connector:

In this post, we saw how to enable Cosmos DB’s Analytical store, access data using Synapse Link for Cosmos DB, and use the Spark to Synapse SQL Connector to move that data into a dedicated SQL pool. We saw how to do this in a workspace using a managed virtual network with data exfiltration protection enabled, meaning this is the largest number of steps necessary.

Click through for product descriptions and step-by-step instructions.
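As a very rough outline of the two halves, here’s a PySpark sketch as it might run from a Synapse Spark notebook (where spark is predefined). The linked service, container, and table names are placeholders, and the exact connector syntax, especially the PySpark write, depends on your Synapse runtime version, so treat this as a sketch rather than the post’s method.

```python
# Read from the Cosmos DB analytical store through Synapse Link.
df = (
    spark.read.format("cosmos.olap")
         .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # placeholder
         .option("spark.cosmos.container", "MyContainer")                 # placeholder
         .load()
)

# Write into a dedicated SQL pool with the Spark to Synapse SQL Connector
# (newer runtimes expose it to PySpark as df.write.synapsesql(...)).
df.write.synapsesql("MyDedicatedPool.dbo.MyTable")  # placeholder three-part name
```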

Comments closed

Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files, which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node.

In simple words, you can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.
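Hive’s DDL spells this as CLUSTERED BY (col) INTO n BUCKETS; Spark’s DataFrame writer has the same concept, so here’s a minimal, made-up sketch of the Spark-side equivalent.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

clicks = spark.createDataFrame(
    [(1, "/home"), (2, "/pricing"), (1, "/docs")],
    ["user_id", "page"],
)

# user_id has far too many distinct values to be a partition column,
# so hash each row into one of a fixed number of buckets instead.
(
    clicks.write
          .bucketBy(32, "user_id")
          .sortBy("user_id")
          .saveAsTable("clicks_bucketed")
)
```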

Comments closed

Static versus Dynamic Partitioning in Hive

The Hadoop in Real World team explains the difference between two partitioning strategies:

The difference between static and dynamic partitioning only exists while the partitions are being created, that is, in how the partitions are added to the table. Once the partitions are created, the table doesn’t distinguish between static and dynamic partitions; all partitions are treated as one and the same.

Click through for the difference.
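As a quick, made-up illustration of the two insert styles, expressed here as Spark SQL (in Hive proper, the fully dynamic insert also needs SET hive.exec.dynamic.partition.mode=nonstrict):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (id BIGINT, amount DOUBLE, country STRING)
    USING PARQUET
    PARTITIONED BY (country)
""")

# Static partitioning: the partition value is spelled out in the statement.
spark.sql("INSERT INTO orders PARTITION (country = 'US') VALUES (1, 10.0)")

# Dynamic partitioning: the engine derives the partition from the data itself,
# here from the trailing 'CA' value.
spark.sql("INSERT INTO orders PARTITION (country) VALUES (2, 20.0, 'CA')")
```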

Comments closed

Configuring a Debezium Connector for Event Hub Streaming

Niels Berglund continues a series:

This series came about as I, in the post How to Use Kafka Client with Azure Event Hubs, somewhat foolishly said:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

I wrote the above without testing it myself, so when I was called out on it, I started researching (read “Googling”) to see if it was possible. The result of the “Googling” didn’t give a 100% answer, so I decided to try it out, and this series is the result.

In the first post – as mentioned – we configured Kafka Connect to connect into Event Hubs. In this post, we look at configuring the Debezium connector.

Click through and enjoy the fruits of Berglund's Folly. As follies go, I'd still rate Seward's Folly higher, but this one's pretty good too.
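If you just want the general shape of registering a Debezium connector against the Kafka Connect REST API, here’s a rough Python sketch. Every host, credential, and property value is a placeholder, some Debezium property names vary by version, and the Event Hubs-specific worker settings are exactly what Berglund’s series walks through.

```python
import json
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint

connector = {
    "name": "sqlserver-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.example.com",  # placeholder
        "database.port": "1433",
        "database.user": "debezium_user",              # placeholder
        "database.password": "********",               # placeholder
        # ...plus the database/topic naming and schema history settings,
        # whose property names vary by Debezium version; see the post.
    },
}

resp = requests.post(
    CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.text)
```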

Comments closed