Press "Enter" to skip to content

Category: Storage

Securing Azure Storage

Christos Matskas has an article on securing Azure blobs and containers:

All communication with the Azure Storage via connection strings and BLOB URLs enforce the use of HTTPS, which provides Encryption in Transit. You can enforce the use of “Always HTTPS” by setting the connection string like this: “DefaultEndpointsProtocol=https;AccountName=myblob1…” or in SAS signatures, as in the example below:

https://myblob1.blob.core.windows.net/?sv=2015-04-05&ss=bfqt&srt=sco&sp=rwdlacup&se=2016-09-22T02:21:41Z&st=2016-09-21T18:21:41Z&spr=https&sig=hxInpKBYAxvwdI9kbBglbrgcl1EJjHqDRTF2lVGeSUU%3D

To protect data at rest, the service provides an option to encrypt the data as they are stored in the account. There’s no additional cost associated with encrypting the data at rest and it’s a good idea to switch it on as soon as the account is created. There is a one-click setting at the Storage Account level to enable it, and the encryption is applied on both new and existing storage accounts. The data is encrypted with AES 256 cipher and it’s now generally available to all Azure regions and Azure clouds (public, government etc)

There’s some good information here, making it worth the read.

Comments closed

Removing Tempdb Files

Erin Stellato relates a painful experience of tempdb filling up a VM’s drive:

I made a mistake with a script today. I created three new tempdb files sized at 10GB each that filled up a hard drive.

Whoops.

Luckily it was in one of my own testing VMs, so it wasn’t awful. Fixing it, however, was a fun one.

**NOTE: All work was done in a test environment. Proceed with caution if you’re running these commands in Production and make sure you understand the ramifications.

It’s a good opportunity to learn from Erin’s experience.

Comments closed

PySpark Persistence

David Crook shows how to save data to disk from PySpark:

This is working on HDInsight v3.5 w/Spark 2.0 and Azure Data Lake Storage as the underlying storage system.  What is nice about this is that my cluster only has access to its cluster section of the folder structure.  I have the structure root/clusters/dasciencecluster.  This particular cluster starts at dasciencecluster, while other clusters may start somewhere else.  Therefor my data is saved to root/clusters/dasciencecluster/data/open_data/RF_Model.txt

It’s pretty easy to do, and the Scala code would look suspiciously similar.  The Java version of the code would be seven pages long.

Comments closed

Considerations For Reducing I/O Costs

Monica Rathbun gives a few methods for reducing how many I/O operations a query requires:

Implicit Conversions

Implicit conversions often happen when a query is comparing two or more columns with different data types. In the below example, the system is having to perform extra I/O in order to compare a varchar(max) column to an nvarchar(4000) column, which leads to an implicit conversion, and ultimately a scan instead of a seek. By fixing the tables to have matching data types, or simply converting this value before evaluation, you can greatly reduce I/O and improve cardinality (the estimated rows the optimizer should expect).

There’s some good advice here if your main hardware constraint is being I/O bound.

Comments closed

Azure Data Lake Store Best Practices

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats, however, – files what are not row oriented and data is often delivered in blocks, such as XML or JSON – cannot be parallelized as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.

Comments closed

Partition Alignment In Storage

Wayne Sheffield has an article on I/O partition alignment:

What we need to do is to offset the beginning of the data being stored on disk to a location more conducive to how the program is operating. This offset is known as the “Partition Alignment Offset”. To be in tune with SQL Server, this value should be an increment of 64KB. However, you also need to consider the entire storage subsystem – the disks, controllers and memory. Starting with Windows Server 2008, this offset is at 1024KB – a nice increment of 64KB that also works very nicely with most RAID disks/controllers. Prior to Windows Server 2008, partition alignment offset was not explicitly performed, so this will need to be performed.

If you’ve migrated disk from server to server to server over the years, this is worth checking out.

Comments closed

Cool Storage

James Serra talks about “cool storage” in Azure Blob Storage:

The access tiers available for blob storage accounts are “hot” and “cold”.  In general, hot data is classified as data that is accessed very frequently and needs to be highly durable and available.  On the other hand, cool data is data that is infrequently accessed and long-lived.  Cool data can tolerate a slightly lower availability, but still requires high durability and similar time to access and throughput characteristics as hot data.  For cool data, slightly lower availability SLA and higher access costs are acceptable tradeoffs for much lower storage costs.  Azure Blob storage now addresses this need for differentiated storage tiers for data with different access patterns and pricing model.  So you can now choose between Cool and Hot access tiers to store your less frequently accessed cool data at a lower storage cost, and store more frequently accessed hot data at a lower access cost.  The Access Tier attribute of hot or cold is set at an account level and applies to all objects in that account.  So if you want to have both a hot access tier and a cold access tier, you will need two accounts.  If there is a change in the usage pattern of your data, you can also switch between these access tiers at any time.

It looks like there shouldn’t be a performance difference between the two;  it’s more of a cost difference in which you might be able to save money by choosing your tier wisely.

Comments closed

LDF Stamp Change

The CSS SQL Server engineering team points out that the LDF stamp has changed from 0x00 to 0xC0:

Question:  If the log is stamped with 0xC0’s instead of 0x00’s how is it a performance gain?

Many of the new hardware implementations detect patterns of 0x00’s.   The space is acquired and zero’s written to stable media, then a background, hardware based garbage collector reclaims the blocks.

This is a very interesting background article which shows an integration pain point between the database platform and the storage platform.

Comments closed

Who Grew The Database?

Andy Galbraith helps us figure out who to blame for database growth:

I feel a little dirty writing about the Default Trace in the world of Extended Events, but I also know that many people simply don’t know how to use XEvents, and this can be faster if you already have it in your toolbox.  Also it will work back to SQL 2005 where XEvents were new in SQL 2008.

I have modified this several times to improve it – I started with a query from Tibor Karaszi (blog/@TiborKaraszi), modified it with some code from Jason Strate (blog/@StrateSQL), and then modified that myself for what is included and what is filtered.  There are links to both Tibor’s and Jason’s source material in the code below.

I usually just blame the BI team for database growth.

Comments closed