Press "Enter" to skip to content

Category: Storage

S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.

Comments closed

Selecting Into A Filegroup

Denis Gobo points out a new feature in SQL Server 2017:

With the CTP2 build of SQL Server 2017, you now have the ability of doing a SELECT INTO operation into a specific filegroup.

The syntax looks like this

SELECT * INTO TableName ON FileGroup FROM SomeQuery

What is the use of this you might ask? Well maybe you have some ad-hoc queries where you save some data but you don’t want it sitting on your expensive SAN. Or maybe you populate staging tables on the fly and you want it to end up on a specific SSD because you need the speed of the SSD disk for these operations.

This might also be helpful in migrating tables to different storage.

Comments closed

Re-Provisioning Volumes In Azure

Robert Bishop shows how to create data disks in Azure and attach them to VMs:

Recently a client found this article on “Best Practices for SQL Server in Azure Virtual Machines” and wanted to re-provision his volumes to adhere to them.

No my first thoughts was wait, I’m a DBA, not a System Admin that’s not my role! But thinking more about it I realized the client views this as a SQL Server issue and I am the SQL Server Consultant and that it is my job to remedy this problem.

Not being 100% confident in Azure, I spun up a VM SQL Server and attempted to add some volumes.  To my surprise, this was way too easy.

Click through for the steps.

Comments closed

Securing Azure Storage

Christos Matskas has an article on securing Azure blobs and containers:

All communication with the Azure Storage via connection strings and BLOB URLs enforce the use of HTTPS, which provides Encryption in Transit. You can enforce the use of “Always HTTPS” by setting the connection string like this: “DefaultEndpointsProtocol=https;AccountName=myblob1…” or in SAS signatures, as in the example below:

https://myblob1.blob.core.windows.net/?sv=2015-04-05&ss=bfqt&srt=sco&sp=rwdlacup&se=2016-09-22T02:21:41Z&st=2016-09-21T18:21:41Z&spr=https&sig=hxInpKBYAxvwdI9kbBglbrgcl1EJjHqDRTF2lVGeSUU%3D

To protect data at rest, the service provides an option to encrypt the data as they are stored in the account. There’s no additional cost associated with encrypting the data at rest and it’s a good idea to switch it on as soon as the account is created. There is a one-click setting at the Storage Account level to enable it, and the encryption is applied on both new and existing storage accounts. The data is encrypted with AES 256 cipher and it’s now generally available to all Azure regions and Azure clouds (public, government etc)

There’s some good information here, making it worth the read.

Comments closed

Removing Tempdb Files

Erin Stellato relates a painful experience of tempdb filling up a VM’s drive:

I made a mistake with a script today. I created three new tempdb files sized at 10GB each that filled up a hard drive.

Whoops.

Luckily it was in one of my own testing VMs, so it wasn’t awful. Fixing it, however, was a fun one.

**NOTE: All work was done in a test environment. Proceed with caution if you’re running these commands in Production and make sure you understand the ramifications.

It’s a good opportunity to learn from Erin’s experience.

Comments closed

PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 w/Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its cluster section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefor my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.

Comments closed

Considerations For Reducing I/O Costs

Monica Rathbun gives a few methods for reducing how many I/O operations a query requires:

Implicit Conversions

Implicit conversions often happen when a query is comparing two or more columns with different data types. In the below example, the system is having to perform extra I/O in order to compare a varchar(max) column to an nvarchar(4000) column, which leads to an implicit conversion, and ultimately a scan instead of a seek. By fixing the tables to have matching data types, or simply converting this value before evaluation, you can greatly reduce I/O and improve cardinality (the estimated rows the optimizer should expect).

There’s some good advice here if your main hardware constraint is being I/O bound.

Comments closed

Azure Data Lake Store Best Practices

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats, however, – files what are not row oriented and data is often delivered in blocks, such as XML or JSON – cannot be parallelized as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.

Comments closed

Partition Alignment In Storage

Wayne Sheffield has an article on I/O partition alignment:

What we need to do is to offset the beginning of the data being stored on disk to a location more conducive to how the program is operating. This offset is known as the “Partition Alignment Offset”. To be in tune with SQL Server, this value should be an increment of 64KB. However, you also need to consider the entire storage subsystem – the disks, controllers and memory. Starting with Windows Server 2008, this offset is at 1024KB – a nice increment of 64KB that also works very nicely with most RAID disks/controllers. Prior to Windows Server 2008, partition alignment offset was not explicitly performed, so this will need to be performed.

If you’ve migrated disk from server to server to server over the years, this is worth checking out.

Comments closed