Fresh R Installation On Linux

Kevin Feasel



Marcelo Perlin has a script to install R on Linux:

Since I formatted all my three computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages in a daily basis. For some of them, it is required to install dependencies using the terminal. Each time that a install.package() failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

Another option might be to grab a Docker image.

S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.

Understanding Random Forests

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  • At row level: Each of these decision trees gets a random sample of the training data (say 10%) i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. Keep in mind that each of these decision trees is getting trained on 100 randomly chosen rows from the dataset i.e they are different from each other in terms of predictions.
  • At column level: The second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree. This means a randomly selected 3 column will be sent to each tree. So for the first decision tree, may be column C1, C2 and C4 were chosen. The next DT will have C4, C5, C10 as chosen columns and so on.

This  is a nice article and includes cases when not to use random forests.

Making Entity Framework Writes A Little Less Slow

Ilya Chumakov has some tips for making Entity Framework inserts and updates a lot faster:

When adding or modifying a large number of records (10³ and more), the Entity Framework performance is far from perfect. The reasons are architectural peculiarities of the framework, and non-optimality of the generated SQL. Leaping ahead, I can reveal that saving data through a bypass of the context significantly minimizes the execution time.

There’s some good advice in here, though not my favorite advice, which is don’t use Entity Framework.

Check Those Aliases

Kevin Feasel



Erik Darling warns you about accidentally using the wrong alias in a query:

People will often tell you to clearly alias your tables, and they’re right. It will make them more readable and understandable to whomever has to read your code next, puzzling over the 52 self joins and WHERE clause that starts off with 1 = 2. It can also help solve odd performance problems.

Take this query, for instance.

This isn’t just for subqueries; even simple joins can go haywire when you accidentally use the wrong alias and both tables happen to have the same column name.

Restoring A BACPAC File

Steve Jones shows how to restore a database saved in .bacpac format:

I needed to get the WideWorldImporters sample database for a project and noticed that there was a BACPAC available. I downloaded it and needed to restore this as a database. At least, that’s what many people would think.

However, if you go to the restore dialog, and select Device and then pick your location, there’s no filter for a .bacpac. In fact, if you choose one, it won’t restore. You’ll get the “no backupset selected” error.

Read on for a step-by-step guide showing how to do this.

Automating Azure SQL DB Maintenance

Tim Radney shows several methods for performing automated Azure SQL Database maintenance, including runbooks:

Once you create your account, you can then start creating runbooks. You can do just about anything with the runbooks. There are numerous existing run books that you can browse through and modify for your own use, including provisioning, monitoring, life cycle management, and more.

You can create the runbooks offline, or using the Azure Portal, and they’re built using PowerShell. In this example, we will reuse the code from the PowerShell demo and also demonstrate how we can use the built in Azure Service scheduler to run our existing PowerShell code and not have to rely on an on-premises scheduler, task scheduler, or Azure VM to schedule a job.

Read the whole thing if you have Azure SQL Database instances in your environment.

Dimensional Design Tips

Koen Verbeeck provides some helpful hints when designing dimensions in SQL Server Analysis Services Multidimensional models:

Although traditional dimension modeling – as explained by Ralph Kimball – tries to avoid snowflaking, it might help the processing of larger dimensions. For example, suppose you have a large customer dimension with over 10 million members. One attribute is the customer country. Realistically, there should only be a bit over 200 countries, maximum. When SSAS processes the dimension, it sends SELECT DISTINCT commands to SQL Server. Such a query on top of a large dimension might take some time. However, if you would snowflake (aka normalize) the country attribute into another dimension, the SELECT DISTINCT will run much faster. Here, you need to trade-off performance against the simplicity of your design.

There are several good tips here.

Smarter Differential Backups

Dennes Torres shows us how we can use a new column in an old DMV to make our full vs differential backup processes smarter:

What are the possibilities with this new field ? We are now able to check how many extents have changed since last full backup and decide if a full backup is really needed or we can live with a differential backup, achieving smarter backup plans.

Change our full backup jobs to first check this field and decide if the backup will be full or differential can save space and maintenance time with databases that aren’t updated so often.

Read on to learn more about this new column, which will be available in SQL Server 2017.


June 2017
« May