Using IO Cache To Speed Up Spark Jobs

Chris Seferlis looks at what the HDInsight team has done to speed up Apache Spark jobs:

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX to name a few big ones. The IO Cache is also based on RubiX and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management. While other comparable caching products leverage the reservation of operating memory for caching the data.

Read on for more details.

Related Posts

From pandas to Spark with koalas

Achilleus tries out Koalas: Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium […]

Read More

Quick Hits on Managed Instance Backup / Restore

Jovan Popovic has some pieces of advice for backing up and restoring databases on Azure SQL Managed Instances: Managed Instance takes automatic backups (full backups every week, differential every 12 hours, and log backups every 5-10 min) that you can use to restore a database to some point of time in past within the retention […]

Read More

Categories

October 2018
MTWTFSS
« Sep Nov »
1234567
891011121314
15161718192021
22232425262728
293031