Author: Kevin Feasel

S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob Storage, etc.) instead of HDFS when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTBF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e., at least 9 hours of downtime per year.

It’s interesting how opinion has shifted; even a year ago, the recommendation would have been different.
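
Operationally, switching Spark from HDFS to blob storage is mostly a matter of URIs and connector configuration. Here is a minimal PySpark sketch, assuming the hadoop-aws (S3A) connector is on the classpath and AWS credentials come from the environment; the bucket and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-instead-of-hdfs").getOrCreate()

# Read raw data from S3 rather than HDFS -- only the URI scheme differs.
events = spark.read.json("s3a://my-bucket/raw/events/")

# Write curated output back to S3 as Parquet.
events.write.mode("overwrite").parquet("s3a://my-bucket/curated/events/")

# The HDFS equivalent would be a path like "hdfs://namenode:8020/raw/events/".
```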

Understanding Random Forests

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  • At row level: Each of these decision trees gets a random sample of the training data (say 10%), i.e., each of these trees will be trained independently on 100 randomly chosen rows out of the 1,000 rows of data. Because each tree is trained on a different random subset of rows, the trees differ from each other in their predictions.
  • At column level: The second level of randomness is introduced at the column level. Not all of the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree; this means 3 randomly selected columns will be sent to each tree. So for the first decision tree, maybe columns C1, C2, and C4 were chosen. The next DT will have C4, C5, and C10 as its chosen columns, and so on.
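
Mapped onto scikit-learn (one concrete implementation, not the article’s own code), these two levels of randomness correspond to the max_samples and max_features parameters, with the caveat that scikit-learn re-samples the columns at each split rather than once per tree, and that max_samples requires scikit-learn 0.22 or later. A minimal sketch with synthetic data matching the numbers above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 1,000-row, 30-column dataset in the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = rng.integers(0, 2, size=1000)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,    # row-level randomness: sample rows for each tree...
    max_samples=0.1,   # ...10% of them, i.e. ~100 rows per tree
    max_features=0.1,  # column-level randomness: 3 of 30 columns per split
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```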

This is a nice article and includes cases when not to use random forests.

Making Entity Framework Writes A Little Less Slow

Ilya Chumakov has some tips for making Entity Framework inserts and updates a lot faster:

When adding or modifying a large number of records (10³ and more), Entity Framework performance is far from perfect. The reasons are architectural peculiarities of the framework and the non-optimal SQL it generates. Leaping ahead, I can reveal that saving data while bypassing the context significantly reduces execution time.

There’s some good advice in here, though it doesn’t include my favorite piece of advice: don’t use Entity Framework.
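
Entity Framework is .NET-specific, but the bottleneck described here, per-entity change tracking during bulk writes, exists in most ORMs. As a rough analogue of the “bypass the context” advice (not Chumakov’s code), here is a SQLAlchemy 1.4+ sketch with a hypothetical table, handing rows straight to the Core instead of tracking them in the session:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = "people"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

rows = [{"name": f"person {i}"} for i in range(10_000)]

with Session(engine) as session:
    # Slow path: 10,000 tracked objects flushed through the unit of work.
    #   session.add_all(Person(**r) for r in rows)

    # Faster path: one Core executemany that skips change tracking entirely.
    session.execute(Person.__table__.insert(), rows)
    session.commit()
```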

Check Those Aliases

Erik Darling warns you about accidentally using the wrong alias in a query:

People will often tell you to clearly alias your tables, and they’re right. Doing so will make your queries more readable and understandable to whoever has to read your code next, puzzling over the 52 self joins and a WHERE clause that starts off with 1 = 2. It can also help solve odd performance problems.

Take this query, for instance.

This isn’t just for subqueries; even simple joins can go haywire when you accidentally use the wrong alias and both tables happen to have the same column name.
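
Erik’s examples are in T-SQL, but the failure mode is portable. In this self-contained sqlite3 sketch (the tables are hypothetical), the subquery names a column that exists only in the outer table; rather than raising an error, it becomes a correlated predicate that matches every row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER, name TEXT);
    CREATE TABLE banned (user_id INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO banned VALUES (99);
""")

# Intended: users whose id appears in banned (none of them). But banned has
# no "id" column, so "id" binds to the outer users.id and everyone matches.
rows = con.execute(
    "SELECT name FROM users WHERE id IN (SELECT id FROM banned)"
).fetchall()
print(rows)  # [('alice',), ('bob',)]; with an alias and a qualified column
             # (SELECT b.id FROM banned AS b), this would error out instead
```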

Restoring A BACPAC File

Steve Jones shows how to restore a database saved in .bacpac format:

I needed to get the WideWorldImporters sample database for a project and noticed that there was a BACPAC available. I downloaded it and needed to restore this as a database. At least, that’s what many people would think.

However, if you go to the restore dialog, and select Device and then pick your location, there’s no filter for a .bacpac. In fact, if you choose one, it won’t restore. You’ll get the “no backupset selected” error.

Read on for a step-by-step guide showing how to do this.
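
For the impatient: a .bacpac is imported, not restored. In SSMS that means right-clicking Databases and choosing Import Data-tier Application rather than Restore; from a script, the SqlPackage utility does the same job. A sketch of the scripted route, assuming SqlPackage is on the PATH and using hypothetical file and server names:

```python
import subprocess

# Import the .bacpac as a new database via SqlPackage's Import action.
subprocess.run(
    [
        "SqlPackage",
        "/Action:Import",
        "/SourceFile:WideWorldImporters.bacpac",    # hypothetical local path
        "/TargetServerName:localhost",
        "/TargetDatabaseName:WideWorldImporters",
    ],
    check=True,
)
```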

Automating Azure SQL DB Maintenance

Tim Radney shows several methods for performing automated Azure SQL Database maintenance, including runbooks:

Once you create your account, you can then start creating runbooks. You can do just about anything with them. There are numerous existing runbooks that you can browse through and modify for your own use, including ones for provisioning, monitoring, life cycle management, and more.

You can create the runbooks offline or in the Azure Portal, and they’re built using PowerShell. In this example, we will reuse the code from the PowerShell demo and also demonstrate how we can use the built-in Azure scheduler to run our existing PowerShell code, without having to rely on an on-premises scheduler, task scheduler, or Azure VM to schedule a job.

Read the whole thing if you have Azure SQL Database instances in your environment.
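
Tim’s runbooks are PowerShell, but Azure Automation also supports Python runbooks, so the same scheduler can drive Python-based maintenance. A hypothetical sketch using pyodbc; the server, database, and credentials are placeholders, and a real runbook would pull them from Automation assets instead of hard-coding them:

```python
import pyodbc

# Placeholder connection details; store real ones as Automation credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=maint_user;PWD=<password>",
    autocommit=True,
)

# A simple stand-in for real maintenance work: refresh statistics.
conn.cursor().execute("EXEC sp_updatestats;")
```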

Dimensional Design Tips

Koen Verbeeck provides some helpful hints when designing dimensions in SQL Server Analysis Services Multidimensional models:

Although traditional dimensional modeling – as explained by Ralph Kimball – tries to avoid snowflaking, it might help the processing of larger dimensions. For example, suppose you have a large customer dimension with over 10 million members. One attribute is the customer country. Realistically, there should only be a bit over 200 countries, maximum. When SSAS processes the dimension, it sends SELECT DISTINCT commands to SQL Server. Such a query on top of a large dimension might take some time. However, if you snowflake (a.k.a. normalize) the country attribute into another dimension, the SELECT DISTINCT will run much faster. Here, you need to trade off performance against the simplicity of your design.

There are several good tips here.
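
To make the snowflaking trade-off concrete: processing issues something like SELECT DISTINCT Country against the dimension table, so the win comes from pointing that query at a roughly 200-row table instead of a 10-million-row one. A hypothetical before-and-after, sketched as T-SQL strings:

```python
# Star-style: Country lives on the big dimension table, so the processing
# query "SELECT DISTINCT Country FROM dbo.DimCustomer" scans ~10M rows.
star_schema = """
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY PRIMARY KEY,
    CustomerName NVARCHAR(200),
    Country      NVARCHAR(100)
);
"""

# Snowflaked: Country moves to its own ~200-row table, and the DISTINCT
# issued during processing now touches only dbo.DimCountry.
snowflake_schema = """
CREATE TABLE dbo.DimCountry (
    CountryKey  INT IDENTITY PRIMARY KEY,
    CountryName NVARCHAR(100)
);
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY PRIMARY KEY,
    CustomerName NVARCHAR(200),
    CountryKey   INT REFERENCES dbo.DimCountry (CountryKey)
);
"""
```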

Smarter Differential Backups

Dennes Torres shows us how we can use a new column in an old DMV to make our full vs differential backup processes smarter:

What are the possibilities with this new field? We are now able to check how many extents have changed since the last full backup and decide whether a full backup is really needed or whether we can live with a differential backup, achieving smarter backup plans.

Changing our full backup jobs to first check this field and decide whether the backup will be full or differential can save space and maintenance time for databases that aren’t updated very often.

Read on to learn more about this new column, which will be available in SQL Server 2017.
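
The column in question is modified_extent_page_count in sys.dm_db_file_space_usage, new in SQL Server 2017. A sketch of the decision logic from Python; the connection string is a placeholder and the 60% threshold is an arbitrary example, not a recommendation:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Compare extents changed since the last full backup to total allocation.
cur.execute("""
    SELECT SUM(modified_extent_page_count) AS changed_pages,
           SUM(allocated_extent_page_count) AS total_pages
    FROM sys.dm_db_file_space_usage;
""")
changed, total = cur.fetchone()

# If most extents have changed, a differential would be nearly as large as
# a full backup anyway, so just take the full.
backup_type = "FULL" if changed / total > 0.6 else "DIFFERENTIAL"
print(backup_type)
```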

Cross-Database Queries With Azure SQL DB

Dustin Ryan shows how to set up cross-database queries within Azure SQL Database:

2. Vertical queries (in preview): A vertical elastic query is a query that is executed across databases that contain different schemas and different data sets. An elastic query can be executed across any two Azure SQL Database instances. This is actually really easy to set up, and that’s what this blog post is about! The diagram below represents a query being issued against tables that exist in separate Azure SQL Database instances that contain different schemas.

Read on to learn how to implement vertical elastic queries today.
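
The moving parts of a vertical elastic query are a database scoped credential, an external data source of type RDBMS, and an external table, all created in the database that issues the query. A sketch run from Python; every name and secret below is a placeholder, and the column list must match the remote table:

```python
import pyodbc

# Connect to the "head" database that will issue the cross-database query.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=HeadDb;UID=admin_user;PWD=<password>",
    autocommit=True,
)
cur = conn.cursor()

for stmt in [
    "CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>'",
    """CREATE DATABASE SCOPED CREDENTIAL RemoteCred
           WITH IDENTITY = 'remote_user', SECRET = '<password>'""",
    """CREATE EXTERNAL DATA SOURCE RemoteOrdersDb WITH (
           TYPE = RDBMS,
           LOCATION = 'myserver.database.windows.net',
           DATABASE_NAME = 'OrdersDb',
           CREDENTIAL = RemoteCred)""",
    """CREATE EXTERNAL TABLE dbo.Orders (OrderID INT, CustomerID INT)
           WITH (DATA_SOURCE = RemoteOrdersDb)""",
]:
    cur.execute(stmt)

# A plain SELECT in HeadDb now transparently reaches the remote database.
for row in cur.execute("SELECT TOP 5 OrderID, CustomerID FROM dbo.Orders"):
    print(row)
```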

Using Common Table Expressions To Drive Queries

Lukas Eder wants one result set which returns records using predicate B if and only if there were no records using predicate A:

We’ve seen that we can easily solve the original problem with SQL only: Select some data from a table using predicate A, and if we don’t find any data for predicate A, then try finding data using predicate B from the same table.

Oracle and PostgreSQL can both optimise away the unnecessary query 2 by inserting a “probe” in their execution plans that knows whether the query 2 needs to be executed or not. In Oracle, we’ve even seen a situation where the combined query outperforms two individual queries. SQL Server 2014 surprisingly does not have such an optimisation.

Interesting totally-not-a-comparison between the three database products. There are some things I’d ideally like the SQL Server optimizer to do with common table expressions, but as Lukas notes, it doesn’t, so user beware.
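
The fallback pattern itself ports anywhere CTEs work. A runnable illustration with sqlite3 (the table and predicates are made up): take the rows matching predicate A, and append predicate B’s rows only when A found nothing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE films (id INTEGER, title TEXT, language TEXT);
    INSERT INTO films VALUES (1, 'Metropolis', 'de'), (2, 'Amelie', 'fr');
""")

query = """
WITH a AS (SELECT * FROM films WHERE language = ?),
     b AS (SELECT * FROM films WHERE language = ?)
SELECT * FROM a
UNION ALL
SELECT * FROM b WHERE NOT EXISTS (SELECT 1 FROM a);
"""

print(con.execute(query, ("en", "fr")).fetchall())  # A empty -> fall back to B
print(con.execute(query, ("de", "fr")).fetchall())  # A matches -> B suppressed
```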
