Partition Handling In Spark 2.1

Eric Liang, et al, discuss a change to Spark 2.1 which will make certain partitioned table access faster:

In Spark 2.1, we drastically improve the initial latency of queries that touch a small fraction of table partitions. In some cases, queries that took tens of minutes on a fresh Spark cluster now execute in seconds. Our improvements cut down on table memory overheads, and make the SQL experience starting cold comparable to that on a “hot” cluster with table metadata fully cached in memory.

This looks like a nice improvement in Spark.

Related Posts

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More

Categories

December 2016
MTWTFSS
« Nov Jan »
 1234
567891011
12131415161718
19202122232425
262728293031