Spark – Page 43 – Curated SQL

Joining RDDs in Spark

Published 2019-10-29 by Kevin Feasel

Brad Llewellyn takes us through more Spark RDD and DataFrame exercises, including joins:

We can make use of the built-in .join() function for RDDs. Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value. So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element. After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.

This is a place where DataFrames make so much more sense.
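
To make the three steps in the excerpt concrete, here is a minimal plain-Python sketch of the shapes involved. It does not use Spark: the `join_by_key` helper and the sample rows are hypothetical stand-ins that only mimic the `(key, (left_value, right_value))` structure an RDD `.join()` returns.

```python
from collections import defaultdict

# Hypothetical sample rows: (person_id, name) and (person_id, income).
people = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
incomes = [(1, "<=50K"), (2, ">50K"), (4, "<=50K")]


def join_by_key(left, right):
    """Inner join of (key, value) pairs.

    Returns (key, (left_value, right_value)) tuples, mimicking the output
    shape of an RDD .join(). This is an illustration, not Spark code.
    """
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


# Step 1: restructure so the key comes first and the original row second.
people_kv = [(row[0], row) for row in people]
incomes_kv = [(row[0], row) for row in incomes]

# Step 2: join, which yields the nested (key, (left_row, right_row)) shape.
nested = join_by_key(people_kv, incomes_kv)

# Step 3: flatten the nested result back into one flat tuple per match.
flat = [(key, left_row[1], right_row[1]) for key, (left_row, right_row) in nested]
```

The extra map before and after the join is exactly the ceremony the excerpt complains about; with DataFrames, the join condition names the columns directly and no reshaping is needed.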

Azure AD Credential Passthrough and Databricks

Published 2019-10-28 by Kevin Feasel

Anna Shrestinian, et al., explain how Azure Databricks enables Azure Active Directory credential passthrough when working with Azure Data Lake Storage Gen2:

Azure Data Lake Storage (ADLS) Gen2, which became generally available earlier this year, is quickly becoming the standard for data storage in Azure for analytics consumption. ADLS Gen2 enables a hierarchical file system that extends Azure Blob Storage capabilities and provides enhanced manageability, security and performance.
The hierarchical file system provides granular access control to ADLS Gen2. Role-based access control (RBAC) can be used to grant role assignments to top-level resources, and POSIX-compliant access control lists (ACLs) allow for finer-grained permissions at the folder and file level. These features allow users to securely access their data within Azure Databricks using the Azure Blob File System (ABFS) driver, which is built into the Databricks Runtime.

There are some tradeoffs involved, particularly around using High Concurrency clusters (or limiting yourself to one user account), but it’s a nice bit of added value when you’re a heavy Azure user.

A New Notebook Tool: Polynote

Published 2019-10-24 by Kevin Feasel

Jeremy Smith, et al., announce a new product:

We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.
Polynote provides data scientists and machine learning researchers with a notebook environment that allows them the freedom to seamlessly integrate our JVM-based ML platform — which makes heavy use of Scala — with the Python ecosystem’s popular machine learning and visualization libraries. It has seen substantial adoption among Netflix’s personalization and recommendation teams, and it is now being integrated with the rest of our research platform.

There are some nice pieces to it, especially around language interop.

Spark Transformations and Actions

Published 2019-10-22 by Kevin Feasel

Divyansh Jain differentiates the key sets of functions in Spark:

One point to note here: when you apply a transformation to an RDD, Spark does not perform the operation immediately. Instead, it creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD, and the function used for the transformation. It keeps building this graph from those references until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.
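
The lazy behavior described here can be sketched with a toy class in plain Python. This is not Spark and not how Spark is implemented: `LazySeq` is a hypothetical stand-in that records transformations as a lineage and only replays them when an action (`collect`) is called.

```python
class LazySeq:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = steps  # lineage of pending (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new object with a longer lineage.
        return LazySeq(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also just extends the lineage.
        return LazySeq(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: only now is the recorded lineage replayed over the data.
        items = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []


def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


pending = LazySeq([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = pending.collect()        # the action triggers the work
calls_after_action = len(calls)
```

Here `double` is never called while the chain is being built; it runs only once `collect()` is invoked, which is the same division between transformations and actions the excerpt describes.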

Read on for more details.

MLflow on Databricks Community Edition

Published 2019-10-18 by Kevin Feasel

Jules Damji and Siddharth Murching have an interesting announcement:

Today, we are excited to extend Databricks Community Edition with hosted MLflow for free, as part of our ongoing commitment to help developers learn about the machine learning lifecycle. With the Community Edition, you can try tutorials that demonstrate how to track results and experiments as you build machine learning models, a crucial stage in the machine learning model’s development lifecycle.
MLflow is an open-source platform for the machine learning lifecycle with four components: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Registry. MLflow is now included in Databricks Community Edition, meaning that you can utilize its Tracking and Model APIs within a notebook or from your laptop just as easily as you would with managed MLflow in Databricks Enterprise Edition.

I like showing off Databricks Community Edition, and I’m glad to see them extend it a bit.

Delta Lake to Become an Open Standard

Published 2019-10-17 by Kevin Feasel

Michael Armbrust and Reynold Xin have exciting news about Delta Lake:

At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.
Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

Read on to see what this means for Delta Lake.

The Benefits of Delta Lake

Published 2019-10-15 by Kevin Feasel

Kaushik Nath explains what a Delta Lake is and why it is beneficial:

Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right?
All this hype around data lakes has ignored their inherent drawbacks and limitations. I’m not here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.

Delta Lake is intended to mitigate some of the general issues that tend to turn data lakes into data swamps.

PySpark DataFrame Joining

Published 2019-10-15 by Kevin Feasel

Monika Rathor shows the various ways you can join DataFrames with PySpark:

PySpark provides multiple ways to combine dataframes: join, merge, union, the SQL interface, and so on. In this article, we will take a look at how the PySpark join function is similar to SQL join, where two or more tables or dataframes can be combined based on conditions.

One join type you don’t directly get in SQL Server is the left anti join. We can build something quite similar with NOT EXISTS, though.
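
As a rough sketch of the NOT EXISTS approach, the example below uses Python's built-in `sqlite3` module as a stand-in for SQL Server, on the assumption that this simple NOT EXISTS pattern reads the same in T-SQL. The `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (an assumption made
# so the example runs anywhere; table names and data are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
    """
)

# Anti-join semantics: keep left-side rows that have no match on the right,
# and return only left-side columns.
rows = conn.execute(
    """
    SELECT c.customer_id, c.name
    FROM customers AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id
    )
    ORDER BY c.customer_id
    """
).fetchall()
conn.close()
```

Only the customer with no orders comes back, and a customer with several orders never produces duplicate rows. In PySpark, the corresponding DataFrame call would pass `how="left_anti"` to `join()`.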

Financial Time Series Analysis in Databricks

Published 2019-10-11 by Kevin Feasel

Ricardo Portilla shares a demo of financial time series analysis in Databricks:

We’ve shown a merging technique above, so now let’s focus on a standard aggregation, namely Volume-Weighted Average Price (VWAP), which is the average price weighted by volume. This metric is an indicator of the trend and value of the security throughout the day. The vwap function within our wrapper class (in the attached notebook) shows where the VWAP falls above or below the trading price of the security. In particular, we can now identify the window during which the VWAP (in orange) falls below the trade price, showing that the stock is overbought.
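
For reference, the VWAP calculation itself is small: the sum of price times volume, divided by total volume. The sketch below is plain Python with hypothetical trades, not the wrapper class from the article's notebook.

```python
# Hypothetical trades for one security: (price, volume).
trades = [(10.0, 100), (10.5, 300), (12.0, 100)]


def vwap(trades):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        raise ValueError("VWAP is undefined when total volume is zero")
    return sum(price * volume for price, volume in trades) / total_volume


def simple_average(trades):
    """Unweighted mean price, shown only for comparison."""
    return sum(price for price, _ in trades) / len(trades)


vwap_value = vwap(trades)
mean_value = simple_average(trades)
```

With these numbers the VWAP (10.7) sits below the simple average (about 10.83) because most of the volume traded at 10.5, which is the point of weighting by volume.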

Click through for the article, as well as a notebook you can try out.

Differences in Spark RDDs and Datasets

Published 2019-10-09 by Kevin Feasel

Brad Llewellyn looks at some of the differences between RDDs and Datasets in Spark:

We see that there are some differences between filtering RDDs, Data Frames and Datasets. The first major difference is the same one we keep seeing: RDDs reference by indices instead of column names. There’s also an interesting difference of using 2 =’s vs 3 =’s for equality operators. Simply put, “==” tries to directly equate two objects, whereas “===” tries to dynamically define what “equality” means. In the case of filter(), it’s typically used to determine whether the value in one column (income, in our case) is equal to the value of another column (string literal “<=50K”, in our case). In other words, if you want to compare values in one column to values in another column, “===” is the way to go.

Interestingly, there was another difference caused by the way we imported our data. Since we custom-built our RDD parsing algorithm to use <COMMA><SPACE> as the delimiter, we don’t need to trim our RDD values. However, we used the built-in sqlContext.read.csv() function for the Data Frame and Dataset, which doesn’t trim by default. So, we used the ltrim() function to remove the leading whitespace. This function can be imported from the org.apache.spark.sql.functions library.

Read on for more, including quite a few code samples.

Category: Spark