Spark – Page 30 – Curated SQL

Wrapping up the Azure Databricks Advent

Published 2021-01-04 by Kevin Feasel

Tomaz Kastrun laughs at 24-day advent calendars:

In the last two days we have focused on understanding Apache Spark through performance tuning and through troubleshooting. Both require some deeper understanding of Spark and Azure Databricks, but gives also a great insight to all who will need to improve performance and work with Spark.
Today, I would like to list couple of additional Learning material, documentation and any other additional resources for further exploration on Azure Databricks.

Click through for links to additional resources on Apache Spark and Databricks, as well as the other 30 entries in the series.

Comments closed

Apache Spark Performance Tuning

Published 2020-12-29 by Kevin Feasel

Tomaz Kastrun provides a few hints when performance tuning Apache Spark code:

DataFrame versus Datasets versus SQL versus RDD is another choice, yet it is fairly easy. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spar 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects of functions (UDF), there will be some performance degradation with both R or Python, so switching to Scala or Java might be a optimisation.

Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.

Comments closed

Using Powershell to Automate Azure Databricks Processes

Published 2020-12-28 by Kevin Feasel

Tomaz Kastrun continues a series on Databricks:

Yesterday we looked into bringing the capabilities of Databricks closer to your client machine. And making that coding, data wrangling and data science little bit more convenient.
Today we will look into deploying Databricks workspace using Powershell.

By the way, if Powershell automation of Databricks tasks is of interest to you, also check out Gerhard Brueckl’s extension module for much more along those lines.

Also, I give Tomaz a lot of credit: most Advent calendars stop at 24 days but Tomaz laughs off such limitations.

Comments closed

Spark Streaming in a Databricks Notebook

Published 2020-12-23 by Kevin Feasel

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming is the process that can analyse not only batches of data but also streams of data in near real-time. It gives the powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault tolerance system, meaning due to lineage of operations, Spark will always remember where you stopped and in case of a worker error, another worker can always recreate all the data transformation from partitioned RDD (assuming that all the RDD transformations are deterministic).

Click through for the demo.

Comments closed

Repartitioning and Coalescing in Spark

Published 2020-12-22 by Kevin Feasel

The Hadoop in Real World team contrasts repartitioning and coalescing in Spark:

The function of repartition and coalesce functions in Spark is to change the number of partitions on a DataFrame.

Read on to see how the two differ.

Comments closed

Using Scala in a Databricks Notebook

Published 2020-12-22 by Kevin Feasel

Tomaz Kastrun take a look at the original Spark language:

Let us start with Databricks datasets, that are available within every workspace and are here mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example the Iris dataset that is available with Base R engine and Seaborn Python package. Same goes with Databricks and sample dataset can be found in /databricks-datasets folder.

Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.

Comments closed

Apache Spark Basics in Azure Synapse Analytics

Published 2020-12-21 by Kevin Feasel

Euan Garden shows off some Apache Spark functionality in Azure Synapse Analytics:

Apache Spark has been a long-time favorite tool amongst data engineers and data scientists; it is well known for handling large scale data processing and complex machine learning workloads.
Azure Synapse Analytics offers a fully managed and integrated Apache Spark experience. By leveraging Apache Spark in Azure Synapse, you can benefit from integrated security, fully managed provisioning, and tight-coupling to other Azure services, such as SQL databases (dedicated and serverless), Azure Key Vault , ADLS Gen2, and Azure Blob Storage as well as fast starting, high performance compute instances.

Click through for the demo.

Comments closed

TF-IDF in .NET for Spark, Updated

Published 2020-12-21 by Kevin Feasel

Ed Elliott has been busy:

Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark.
In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post, the first version was written before version 1 of .NET for Apache Spark and there was a vital piece of the implementation missing which meant although we could build the model in .NET, we couldn’t actually use it. The necessary functionality is now available and so I am updating the post. To see the previous version go to: https://the.agilesql.club/2020/07/tf-idf-in-.net-for-apache-spark-using-spark-ml/

Read on for more information, as well as a call to action.

Comments closed

End-to-End ML with Azure Databricks

Published 2020-12-17 by Kevin Feasel

Tomaz Kastrun takes us through a machine learning scenario using Azure Databricks:

In the past couple of days we looked into configurations and infrastructure and today it is again time to do an analysis, let’s call it end-to-end analysis using R or Python or SQL.

Read on for the process.

Comments closed

sparklyr 1.5 Released

Published 2020-12-16 by Kevin Feasel

Yitao Li announces version 1.5 of sparklyr:

A large fraction of pull requests that went into the sparklyr 1.5 release were focused on making Spark dataframes work with various dplyr verbs in the same way that R dataframes do. The full list of dplyr-related bugs and feature requests that were resolved in sparklyr 1.5 can be found in here.
In this section, we will showcase three new dplyr functionalities that were shipped with sparklyr 1.5.

Read on to learn more about this update. H/T R-Bloggers

Comments closed

Category: Spark