Single-Node PySpark

Gengliang Weng, et al, explain that even a single Spark node can be useful:

It’s been a few years since Intel was able to push CPU clock rate higher. Rather than making a single core more powerful with higher frequency, the latest chips are scaling in terms of core count. Hence, it is not uncommon for laptops or workstations to have 16 cores, and servers to have 64 or even 128 cores. In this manner, these multi-core single-node machines’ work resemble a distributed system more than a traditional single core machine.

We often hear that distributed systems are slower than single-node systems when data fits in a single machine’s memory. By comparing memory usage and performance between Spark and Pandas using common SQL queries, we observed that is not always the case. We used three common SQL queries to show single-node comparison of Spark and Pandas:

Query 1. SELECT max(ss_list_price) FROM store_sales

Query 2. SELECT count(distinct ss_customer_sk) FROM store_sales

Query 3. SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk

To demonstrate the above, we measure the maximum data size (both Parquet and CSV) Pandas can load on a single node with 244 GB of memory, and compare the performance of three queries.

Click through for the results.

Related Posts

Choose Your Hadoop File Format

Alex Woodie explains three of the most common Hadoop file formats: You have many choices when it comes to storing and processing data on Hadoop, which can be both a blessing and a curse. The data may arrive in your Hadoop cluster in a human readable format like JSON or XML, or as a CSV […]

Read More

What’s New In Hadoop 3.1?

Wangda Tan, et al, look at some of the new features in Apache Hadoop 3.1: The diagram below captures the building blocks together at a high level. If you have to tie this back to a fictitious self-flying drone company, the company will collect tons of raw images from the test drones’ built-in cameras for […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

Categories

May 2018
MTWTFSS
« Apr  
 123456
78910111213
14151617181920
21222324252627
28293031