Month: March 2021

Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

cache() and persist() functions are used to cache intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame or Dataset to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, the objects behind the RDD, DataFrame or Dataset on which cache() or persist() is called will be kept in memory or at the configured storage level on the nodes.

That’s the similarity, but click through for the difference.
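For reference, here is roughly what the two calls look like in PySpark (my own minimal sketch, not code from the article): cache() takes no arguments and uses the default storage level, while persist() lets you choose one explicitly.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "order_id")

# cache() is shorthand for persist() with the default storage level
# (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for RDDs).
df.cache()

# persist() lets you pick the storage level explicitly, e.g. spill to disk only.
df_disk = spark.range(1_000_000).persist(StorageLevel.DISK_ONLY)

# Nothing is materialized until an action runs.
df.count()
df_disk.count()
```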

Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is “no,” that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, model types). For our examples we are assuming the answer is “yes,” our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).

Click through for the article.
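Bryan works in R with tidymodels; as a very rough Python analogue of what a 90% prediction interval means, here is a sketch using quantile regression in scikit-learn (the data and numbers below are made up for illustration, not Bryan's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

# Toy data standing in for something like Sale_Price ~ features.
X = rng.normal(size=(500, 3))
y = 200_000 + 50_000 * X[:, 0] + 10_000 * rng.normal(size=500)

# One model per quantile: the 5th and 95th percentiles bound a 90% interval,
# and the 50th percentile serves as the point prediction.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.50, 0.95)
}

X_new = rng.normal(size=(5, 3))
lower = models[0.05].predict(X_new)
point = models[0.50].predict(X_new)
upper = models[0.95].predict(X_new)

# Interval half-width relative to the point prediction, akin to the
# "+/- ~25% of Sale_Price" figure Bryan evaluates.
print((upper - lower) / (2 * point))
```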

Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors – 28g memory, 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.

Staging Your Data with ETL

Martin Schoombee provides some advice on creating ETL processes:

The concept of staging is not a complicated one, but you shouldn’t be deceived by the apparent simplicity of it. There’s a lot you can do in this phase of your ETL process, and it is as much a skill as it is an art to get it all right and still appear simplistic.

I have a few primary objectives when designing a staging process: Efficiency, Modularity, Recoverability & Traceability. Let’s take a closer look at each one of these, and some ideas & good practices that will help you achieve it…

Click through for several tips around each of those points.
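To make those four objectives a bit more concrete, here is a minimal Python sketch of my own (not Martin's code; the table and column names are invented) of a staging load that tags each batch for traceability and can simply be re-run for recoverability:

```python
import uuid
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in your own warehouse.
engine = create_engine("postgresql://user:pass@localhost/dw")

def stage_file(path: str, staging_table: str) -> str:
    """Load one source file into a staging table, truncate-and-load style."""
    batch_id = str(uuid.uuid4())
    df = pd.read_csv(path)

    # Traceability: tag every staged row with the batch and load time.
    df["load_batch_id"] = batch_id
    df["loaded_at_utc"] = datetime.now(timezone.utc)

    # Recoverability: staging data is replaceable, so a failed run can be rerun.
    df.to_sql(staging_table, engine, if_exists="replace", index=False)

    # Modularity: the caller decides what happens next (validation, merge, etc.).
    return batch_id
```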

Optimizing Power BI Data Load from a Folder of Parquet Files

Chris Webb has a tip for us:

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

Click through to see how Chris cut load time down to approximately half what it was.

A Primer on Transparent Data Encryption

Matthew McGiffen walks us through the intention of Transparent Data Encryption:

Transparent Data Encryption (TDE) was introduced in SQL 2008 as a way of protecting “at rest” data. It continues to be available in all versions of SQL right up to the present. Until recently it was only available in the Enterprise edition of SQL Server, but from SQL 2019 it was made available in Standard edition.

Read on for more detail.

Avoiding Temporal Coupling in Code

Yamini Bansal explains a common error in class and method design:

Temporal means time based. So, consider whether the time (instant/order) at which one member of a class is called affects the calling of another member. Or, if the sequence in which a class’s members must be called is something that should be kept in mind before using those members, then they are coupled.

Click through for an example. The basic concept is, I shouldn’t need to know that I must call setup method X() before I can take advantage of some useful method Y(). This is because a new person coming in might not realize that X() exists, will try to call Y(), and something breaks. Calling a method with a valid set of input parameters and having it immediately break is a sign of a dodgy API.
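Here is a contrived Python sketch of the smell and one way to remove it (the class and method names are mine, not from the post):

```python
# Temporally coupled: send() silently depends on connect() having been called first.
class FragileClient:
    def __init__(self, host: str):
        self.host = host
        self.conn = None

    def connect(self) -> None:
        self.conn = f"connection to {self.host}"  # stand-in for real setup work

    def send(self, message: str) -> None:
        # AttributeError here if connect() was never called: self.conn is still None.
        print(f"sent {message!r} over {self.conn.upper()}")


# One fix: do the setup where it cannot be skipped, so call order no longer matters.
class SaferClient:
    def __init__(self, host: str):
        self.conn = f"connection to {host}"  # established up front

    def send(self, message: str) -> None:
        print(f"sent {message!r} over {self.conn.upper()}")


SaferClient("db01").send("hello")  # no hidden X()-before-Y() ordering requirement
```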

Hyperparameter Tuning in Azure Machine Learning

Dinesh Asanka takes us through hyperparameter tuning with Azure Machine Learning’s designer:

In the above experiment, both the previous model and the TMH-tuned model are included so that we can compare the two. The Tune Model Hyperparameters control is inserted between the Split Data and Score Model controls as shown. The TMH control has three inputs: the first needs the relevant technique, in this scenario the Two-Class Logistic Regression technique; the second needs the training data set; and the last needs the evaluation data set, for which the test data set can be used.

The Tune Model Hyperparameters control provides the best combination of hyperparameters and is connected to the Score Model control. After the test data stream is connected to the Score Model control, its output is connected to the second input of the Evaluate Model control so that the previous model and the tuned model can be compared.

I’m not sure if there’s something handled internally in the Tune Model Hyperparameters component, but based on the pipeline images, I’d actually want two separate Split components so that I ended up with something more like 50-20-30 for training, hyperparameter testing, and validation. The first two pipelines appear to be 70-30-0 instead, and so can give you a false sense of confidence in model quality.
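If you were doing the same thing in code rather than in the designer, the split I have in mind would look something like this scikit-learn sketch (my own illustration, with made-up data, mirroring the 50-20-30 proportions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)

# Hold out 30% of the data for the final evaluation of the chosen model.
X_rest, X_valid, y_rest, y_valid = train_test_split(X, y, test_size=0.30, random_state=0)

# Of the remaining 70%, keep 50 points of the total for training and 20 for tuning.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=20 / 70, random_state=0
)

# Pick the hyperparameter value that scores best on the tuning set...
candidates = {
    c: LogisticRegression(C=c, max_iter=1_000).fit(X_train, y_train)
    for c in (0.01, 0.1, 1, 10)
}
best_c = max(candidates, key=lambda c: candidates[c].score(X_tune, y_tune))

# ...and report quality only on the untouched validation set.
print(best_c, candidates[best_c].score(X_valid, y_valid))
```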

Auto-Generating Relative Links in Azure Data Studio Notebooks

Julie Koesmarno points out a new feature:

As you enrich your collection of notebooks (organized in a Jupyter Book, hopefully), you will likely want to link from one notebook to another notebook in the directory you are working on.

If you are familiar with markdown, you know that this process can be painful as you’d need to know where the target link is located and where it is located in relation to the notebook that you want to link from.

Luckily, in Azure Data Studio v1.27.0, there is a new Insert Link button in the Text Cell that automatically translates a “hard coded path” into a “relative path” link. Check this out!

Click through for a demo. I like the idea as a way of preventing a common problem when sending artifacts somewhere: all of those hard-coded links to a network share I can’t access or a folder on somebody else’s laptop.
