Press "Enter" to skip to content

Category: Hadoop

Run Spark within Azure ML Compute

James Nguyen makes an announcement:

Following the blog post on Turning AML compute into Ray and Dask, we’ve added an exciting new capability to run Spark within AML compute, where Spark shares the same context with your ML code. The Spark version is 3.2.1 with support for Delta Lake and Synapse SQL read/write. This enables users of AML to perform powerful data transformation and even Spark ML within an AML interactive notebook or in a job run.

Traditionally, Azure ML integrates with Synapse Spark or external compute services via a pipeline step or, better, via a magic command like %synapse, but the computing context is separate from your AML logic, so you still need to run Spark in a separate step, persist the output to some storage, and then load it in your AML script.

With this approach, Spark is available right within your AML code, whether that’s an AML notebook, a Python script, or a pipeline step. It shares the common computing context, and in most cases you can directly convert a Spark DataFrame to a Pandas or Dask DataFrame without first persisting it to intermediary storage.

I’ll have to try this out to see if it makes up for their getting rid of the Spark-based curated environments last year.
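Here’s a minimal sketch of what that shared context means in practice: do the heavy lifting in Spark, then hand the (now small) result straight to pandas for the ML code, with no intermediate write to storage. It assumes a SparkSession is already available in the AML notebook, as the announcement describes, and uses made-up data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for a large source table (the announcement also covers
# Delta Lake and Synapse SQL reads).
events = spark.createDataFrame(
    [("c1", 12.0), ("c1", 3.5), ("c2", 7.0)],
    ["customer_id", "amount"],
)

# Heavy aggregation stays in Spark...
features = events.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))

# ...then straight to pandas, in the same process, for the ML code.
pdf = features.toPandas()
print(pdf)
```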

Comments closed

Filter on Aggregate Columns in Spark

Landon Robinson shows off the HAVING clause:

Having is similar to filtering (filter(), where(), or WHERE in a SQL clause), but the use cases differ slightly. While filtering allows you to apply conditions on your data to limit the result set, Having allows you to apply conditions on aggregate functions on your data to limit your result set.

Both limit your result set – but the difference in how they are applied is the key. In short: where filters are for row-level filtering; Having filters are for aggregate-level filtering. As a result, using a Having statement can also simplify (or outright negate) the need to use some sub-queries.

Click through for examples of HAVING in use.
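To make the distinction concrete, here’s a minimal Spark SQL sketch (not taken from Robinson’s post; the table and numbers are made up): WHERE filters rows before aggregation, HAVING filters on the aggregate itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [("US", 10.0), ("US", 25.0), ("CA", 5.0)],
    ["country", "amount"],
).createOrReplaceTempView("sales")

spark.sql("""
    SELECT country, SUM(amount) AS total
    FROM sales
    WHERE amount > 1.0          -- row-level filter
    GROUP BY country
    HAVING SUM(amount) > 20.0   -- aggregate-level filter
""").show()
```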

Comments closed

Ranking Data in Spark

Landon Robinson continues the Spark Starter Guide:

Ranking is, fundamentally, ordering based on a condition. So, in essence, it’s like a combination of a where clause and order by clause—the exception being that data is not removed through ranking; it is, well, ranked instead. While ordering allows you to sort data based on a column, ranking allows you to allocate a number (e.g. row number or rank) to each row (based on a column or condition) so that you can utilize it in logical decision making, like selecting a top result, or applying further transformations.

One very common ranking function is row_number(), which allows you to assign a unique value or “rank” to each row or rows within a grouping based on a specification. That specification, at least in Spark, is controlled by partitioning and ordering a dataset. The result allows you, for example, to achieve “top n” analysis in Spark.

One minor adjustment I’d make is not calling the output of ROW_NUMBER() “Rank” because then it’d make me think that’s the output of the RANK() window function. In the event of ties, those two outputs will differ.
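A quick, made-up illustration of that difference: over the same window, row_number() keeps counting through the tied scores while rank() repeats the value.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 100), ("a", 100), ("a", 90), ("b", 50)],
    ["grp", "score"],
)

w = Window.partitionBy("grp").orderBy(F.desc("score"))

# On the tied 100s, row_number gives 1 and 2, while rank gives 1 and 1.
df.withColumn("row_number", F.row_number().over(w)) \
  .withColumn("rank", F.rank().over(w)) \
  .show()
```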

Comments closed

get_json_object and json_tuple in Hive

The Hadoop in Real World team looks at a pair of Hive functions:

Both the get_json_object and json_tuple functions in Hive are meant to work with JSON data.

Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the below JSON text into the table.

Click through for that script as well as seeing how you use each of get_json_object() and json_tuple() and which might be better-suited for your purposes.
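Spark SQL happens to expose the same two functions, so here’s a quick sketch of what each looks like, using a made-up JSON document rather than the hirw_courses data (which is behind the link).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('{"course": "Hive", "price": 100, "level": "beginner"}',)],
    ["json_col"],
)

# get_json_object: one JSONPath expression per call.
df.select(
    F.get_json_object("json_col", "$.course").alias("course"),
    F.get_json_object("json_col", "$.price").alias("price"),
).show()

# json_tuple: pull several top-level keys in a single call.
df.select(F.json_tuple("json_col", "course", "price", "level")).show()
```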

Comments closed

Databricks Delta Sharing for Azure

Will Girten, et al, announce Delta Sharing on Azure:

Included in this release is a new and improved API for listing all the tables under all schemas in a share. The new API supports pagination similar to other APIs.

For example, to list all the tables in the Delta share my_share, you can simply send a GET request to the /shares/{share_name}/all-tables endpoint on the sharing server.

Prior to that, you might want to read up on Delta Sharing.
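For a rough idea of the shape of the call, here’s a sketch using Python’s requests library. The server URL, token, and pagination parameter names are placeholders and assumptions on my part; check the Delta Sharing protocol docs for the authoritative request format.

```python
import requests

SERVER = "https://sharing.example.com/delta-sharing"   # hypothetical sharing server
TOKEN = "bearer-token-from-share-profile"              # hypothetical credential

resp = requests.get(
    f"{SERVER}/shares/my_share/all-tables",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"maxResults": 100},   # assumed pagination parameter
)
resp.raise_for_status()

body = resp.json()
for table in body.get("items", []):
    print(table)

# If a nextPageToken comes back, pass it as pageToken on the next request
# (again, assumed field names).
```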

Comments closed

Using Synapse Link for Cosmos DB

I have a post combining Synapse Link for Cosmos DB and the Spark to Synapse SQL Connector:

In this post, we saw how to enable Cosmos DB’s Analytical store, access data using Synapse Link for Cosmos DB, and use the Spark to Synapse SQL Connector to move that data into a dedicated SQL pool. We saw how to do this in a workspace using a managed virtual network with data exfiltration protection enabled, meaning this is the largest number of steps necessary.

Click through for product descriptions and step-by-step instructions.
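As a very rough outline of the two halves, here’s a PySpark sketch as it might run from a Synapse Spark notebook (where spark is predefined). The linked service, container, and table names are placeholders, and the exact connector syntax, especially the PySpark write, depends on your Synapse runtime version, so treat this as a sketch rather than the post’s method.

```python
# Read from the Cosmos DB analytical store through Synapse Link.
df = (
    spark.read.format("cosmos.olap")
         .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # placeholder
         .option("spark.cosmos.container", "MyContainer")                 # placeholder
         .load()
)

# Write into a dedicated SQL pool with the Spark to Synapse SQL Connector
# (newer runtimes expose it to PySpark as df.write.synapsesql(...)).
df.write.synapsesql("MyDedicatedPool.dbo.MyTable")  # placeholder three-part name
```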

Comments closed

Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files, which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node.

In simple words, you can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.
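Hive’s DDL spells this as CLUSTERED BY (col) INTO n BUCKETS; Spark’s DataFrame writer has the same concept, so here’s a minimal, made-up sketch of the Spark-side equivalent.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

clicks = spark.createDataFrame(
    [(1, "/home"), (2, "/pricing"), (1, "/docs")],
    ["user_id", "page"],
)

# user_id has far too many distinct values to be a partition column,
# so hash each row into one of a fixed number of buckets instead.
(
    clicks.write
          .bucketBy(32, "user_id")
          .sortBy("user_id")
          .saveAsTable("clicks_bucketed")
)
```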

Comments closed

Static versus Dynamic Partitioning in Hive

The Hadoop in Real World team explains the difference between two partitioning strategies:

The difference between static and dynamic partitioning only exists while the partitions are being created, that is, in how the partitions are added to the table. Once the partitions are created, the table doesn’t distinguish between static and dynamic partitions; all partitions are treated as one and the same.

Click through for the difference.
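As a quick, made-up illustration of the two insert styles, expressed here as Spark SQL (in Hive proper, the fully dynamic insert also needs SET hive.exec.dynamic.partition.mode=nonstrict):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (id BIGINT, amount DOUBLE, country STRING)
    USING PARQUET
    PARTITIONED BY (country)
""")

# Static partitioning: the partition value is spelled out in the statement.
spark.sql("INSERT INTO orders PARTITION (country = 'US') VALUES (1, 10.0)")

# Dynamic partitioning: the engine derives the partition from the data itself,
# here from the trailing 'CA' value.
spark.sql("INSERT INTO orders PARTITION (country) VALUES (2, 20.0, 'CA')")
```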

Comments closed

Configuring a Debezium Connector for Event Hub Streaming

Niels Berglund continues a series:

This series came about as I, in the post How to Use Kafka Client with Azure Event Hubs, somewhat foolishly said:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

I wrote the above without testing it myself, so when I was called out on it, I started researching (read “Googling”) to see if it was possible. The result of the “Googling” didn’t give a 100% answer, so I decided to try it out, and this series is the result.

In the first post – as mentioned – we configured Kafka Connect to connect into Event Hubs. In this post, we look at configuring the Debezium connector.

Click through and enjoy the fruits of Berglund's Folly. As follies go, I'd still rate Seward's Folly higher, but this one's pretty good too.
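If you just want the general shape of registering a Debezium connector against the Kafka Connect REST API, here’s a rough Python sketch. Every host, credential, and property value is a placeholder, some Debezium property names vary by version, and the Event Hubs-specific worker settings are exactly what Berglund’s series walks through.

```python
import json
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint

connector = {
    "name": "sqlserver-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.example.com",  # placeholder
        "database.port": "1433",
        "database.user": "debezium_user",              # placeholder
        "database.password": "********",               # placeholder
        # ...plus the database/topic naming and schema history settings,
        # whose property names vary by Debezium version; see the post.
    },
}

resp = requests.post(
    CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.text)
```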

Comments closed