Press "Enter" to skip to content

Day: March 31, 2021

Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

cache() and persist() functions are used to cache intermediate results of an RDD, DataFrame or Dataset. You can mark an RDD, DataFrame or Dataset to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, the objects behind the RDD, DataFrame or Dataset on which cache() or persist() is called will be kept in memory or at the configured storage level on the nodes.

That’s the similarity, but click through for the difference.
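
In short, cache() always stores at the default storage level, while persist() also accepts an explicit StorageLevel. A quick PySpark sketch of that distinction (the session setup and sample data here are illustrative, not taken from the linked post):

```python
# Minimal sketch: cache() vs persist() in PySpark.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "n")

# cache() is shorthand for persist() at the default storage level
# (MEMORY_AND_DISK for DataFrames).
df_cached = df.cache()

# persist() takes an explicit StorageLevel, e.g. keep blocks on disk only.
df_persisted = df.persist(StorageLevel.DISK_ONLY)

# Nothing is materialized until an action runs.
df_cached.count()
df_persisted.count()

# unpersist() releases the stored blocks when you are done with them.
df_cached.unpersist()
df_persisted.unpersist()
```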


Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is "no," that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, model types). For our examples we are assuming the answer is "yes," our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).

Click through for the article.
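
The linked post works through this in R. As a rough Python illustration of one simple way to get such intervals, the sketch below builds 90% prediction intervals from holdout residuals; the dataset, model, and split are stand-ins rather than Bryan's Ames Sale_Price workflow:

```python
# Sketch: turn point predictions into rough 90% prediction intervals
# using residuals from a holdout set.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Holdout residuals drive both checks from the quote:
# overall accuracy and the width of the intervals.
resid = y_hold - model.predict(X_hold)
mean_abs_pct_error = np.mean(np.abs(resid) / y_hold)

# Empirical 5th/95th percentile residuals give a simple 90% interval
# around any new point prediction.
lo, hi = np.percentile(resid, [5, 95])
pred = model.predict(X_hold[:3])
intervals = np.column_stack([pred + lo, pred + hi])

print(mean_abs_pct_error)
print(intervals)
```

If the resulting average error and interval widths are too wide to act on, that is the signal to go back and improve the model before worrying about interval construction.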


Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors – 28g memory, 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.


Staging Your Data with ETL

Martin Schoombee provides some advice on creating ETL processes:

The concept of staging is not a complicated one, but you shouldn’t be deceived by the apparent simplicity of it. There’s a lot you can do in this phase of your ETL process, and it is as much a skill as it is an art to get it all right and still appear simplistic.

I have a few primary objectives when designing a staging process: Efficiency, Modularity, Recoverability & Traceability. Let’s take a closer look at each one of these, and some ideas & good practices that will help you achieve it…

Click through for several tips around each of those points.


Optimizing Power BI Data Load from a Folder of Parquet Files

Chris Webb has a tip for us:

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

Click through to see how Chris cut load time down to approximately half what it was.


A Primer on Transparent Data Encryption

Matthew McGiffen walks us through the intention of Transparent Data Encryption:

Transparent Data Encryption (TDE) was introduced in SQL 2008 as a way of protecting "at rest" data. It continues to be available in all versions of SQL right up until the present. Until recently it was only available in the Enterprise editions of SQL Server, but from SQL 2019 it was made available in Standard edition.

Read on for more detail.
