Single-Node PySpark

Gengliang Wang, et al., explain that even a single Spark node can be useful:

It’s been a few years since Intel was able to push CPU clock rate higher. Rather than making a single core more powerful with higher frequency, the latest chips are scaling in terms of core count. Hence, it is not uncommon for laptops or workstations to have 16 cores, and servers to have 64 or even 128 cores. In this manner, these multi-core single-node machines resemble a distributed system more than a traditional single-core machine.

We often hear that distributed systems are slower than single-node systems when data fits in a single machine’s memory. By comparing memory usage and performance between Spark and Pandas using common SQL queries, we observed that this is not always the case. We used three common SQL queries to show a single-node comparison of Spark and Pandas:

Query 1. SELECT max(ss_list_price) FROM store_sales

Query 2. SELECT count(distinct ss_customer_sk) FROM store_sales

Query 3. SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk

To demonstrate the above, we measure the maximum data size (both Parquet and CSV) Pandas can load on a single node with 244 GB of memory, and compare the performance of three queries.
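For a sense of what the Pandas side of such a comparison looks like, here is a minimal sketch of the three queries expressed as Pandas operations. The tiny DataFrame and its values are made up for illustration; only the column names come from the queries above, and this is not the benchmark code the authors used.

```python
import pandas as pd

# Toy stand-in for the store_sales table. The rows are hypothetical;
# only the column names are taken from the three queries.
store_sales = pd.DataFrame({
    "ss_store_sk": [1, 1, 2, 2, 2],
    "ss_customer_sk": [10, 11, 10, 12, 12],
    "ss_list_price": [9.99, 24.50, 5.00, 24.50, 3.25],
    "ss_net_profit": [1.50, -0.25, 2.00, 4.75, 0.50],
})

# Query 1: SELECT max(ss_list_price) FROM store_sales
max_list_price = store_sales["ss_list_price"].max()

# Query 2: SELECT count(distinct ss_customer_sk) FROM store_sales
# nunique() skips missing values, as COUNT(DISTINCT ...) skips NULLs.
distinct_customers = store_sales["ss_customer_sk"].nunique()

# Query 3: SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk
profit_by_store = store_sales.groupby("ss_store_sk")["ss_net_profit"].sum()

print(max_list_price)
print(distinct_customers)
print(profit_by_store)
```

Each of these runs on a single core over a fully materialized in-memory DataFrame, which is the behavior the memory and performance comparison is probing.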

Click through for the results.

