
Category: Storage

How Writing to Parquet Works

Dmitry Tolpeko walks us through the algorithm for writing to Parquet format:

After writing the first 100 values for a column (i.e., for 100 rows), the Parquet writer checks whether the column content for those 100 values exceeds the specified page size (the default is 1 MB).

If the raw data size for the column does not exceed the page size threshold, the interval to the next page size check is adjusted based on the actual column size, so the check happens neither after every column value nor after every 100 values. Thus the page size is not a strict limit.

If the raw data size exceeds the page size, the column content is compressed (if compression is specified for the Parquet file) and flushed into the page store for the column.

This is a nice explanation of the process.
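If you want to poke at those thresholds yourself, here is a minimal sketch (mine, not Dmitry's) using the pyarrow library, where data_page_size sets the roughly 1 MB threshold the writer checks against and compression controls how pages are compressed before being flushed:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a simple table with one million rows in a single column.
table = pa.table({"value": pa.array(range(1_000_000), type=pa.int64())})

# Write the table to Parquet. data_page_size is the threshold the writer's
# periodic size checks compare against; pages are compressed before flushing.
pq.write_table(
    table,
    "values.parquet",
    data_page_size=1024 * 1024,  # ~1 MB, which is also the default
    compression="snappy",
)

# Inspect the resulting column chunk to see compressed vs. uncompressed sizes.
metadata = pq.ParquetFile("values.parquet").metadata
print(metadata.row_group(0).column(0))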


Blob Storage Enhancements

James Serra is keeping on top of things around Azure Blob Storage:

Account failover: Customer-initiated storage account failover is now generally available, allowing you to determine when to initiate a failover instead of waiting for Microsoft to do so. When you perform a failover, the secondary replica of the storage account becomes the new primary. The DNS records for all storage service endpoints (blob, file, queue, and table) are updated to point to this new primary. Once the failover is complete, clients will automatically begin reading from and writing to the storage account in the new primary region, with no code changes. Customer-initiated failover is available for GRS, RA-GRS, GZRS and RA-GZRS accounts. To learn more, see Disaster recovery and account failover.

Read on for several more improvements.
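As a rough illustration of the customer-initiated part, here is a hedged sketch using the Python azure-mgmt-storage management SDK; the subscription, resource group, and account names are placeholders, and I'm assuming the SDK's begin_failover long-running operation:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder values -- substitute your own subscription, resource group, and account.
subscription_id = "<subscription-id>"
resource_group = "my-resource-group"
account_name = "mygrsaccount"  # must be GRS, RA-GRS, GZRS, or RA-GZRS

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the failover; the secondary replica becomes the new primary.
poller = client.storage_accounts.begin_failover(resource_group, account_name)
poller.result()  # long-running operation; blocks until the failover completes
print("Failover complete for", account_name)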


Query Acceleration for Blob Storage and Data Lake Gen2

James Serra takes us through Query Acceleration for Azure Blob Storage and Azure Data Lake Storage Gen2:

Just announced is Query Acceleration for Azure Data Lake Storage Gen2 (ADLS) as well as Blob Storage. This is a new capability for ADLS that enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data that they require to perform a given operation from storage. This reduces the time and processing power that is required to query stored data.

For example, if an application will execute a SELECT statement that filters columns and rows from a csv file, instead of pulling the entire csv file over the network into the application and then filtering the data, it will do the filtering at the time the data is read from the disk, so that only the filtered data is transferred over the network to the application. So if you have a csv file with 50 columns and 1 million rows, but the filters limit the data to 5 columns and 1000 rows, then only the 5 columns and 1000 rows will be retrieved from the disk and sent over the network to the application.

Click through to learn more, including current libraries which support this and information on the additional cost. I’d really like to see PolyBase support this, as it would alleviate one of the problems with using Blob Storage + PolyBase: the need to pull all of that data down to your SQL Server instance before doing any filtering.
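For a sense of what that looks like from client code, here is a small sketch using the azure-storage-blob package's query_blob call; the connection string, container, blob, and column names are all placeholders:

from azure.storage.blob import BlobClient, DelimitedTextDialect

# Placeholder connection details -- substitute your own account, container, and blob.
blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="data", blob_name="big-file.csv"
)

# The filter is pushed down to storage, so only the selected columns and
# matching rows travel over the network rather than the whole csv file.
reader = blob.query_blob(
    "SELECT Col1, Col2, Col3, Col4, Col5 FROM BlobStorage WHERE Col1 > 100",
    blob_format=DelimitedTextDialect(has_header=True),
)
print(reader.readall()[:200])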


Querying Database and Log File Sizes with T-SQL

Allen White takes us through an easy technique to check database and log file sizes:

As a consultant, I have to be able to quickly spot problems, and one of the problems I frequently find is transaction log files that are incorrectly sized.

There are two catalog views in the master database which make this easy to do – sys.master_files and sys.databases. The sys.master_files view contains the database and individual file names, and the data_space_id column always has a value of 0 for the log file. The size column returns the value in 8KB pages, so we have to multiply the column by 8, then divide by 1024 to get the size in megabytes (MB).

Click through for the demo.
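The arithmetic is easy to sketch. Here is an illustrative version (not Allen's script) that runs the catalog view query from Python via pyodbc; the connection string and driver are assumptions you would adjust:

import pyodbc

# Assumes a local SQL Server instance and ODBC Driver 17; adjust as needed.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes;"
)

# size is reported in 8 KB pages: multiply by 8 for KB, then divide by 1024 for MB.
# data_space_id = 0 identifies the log file for each database.
query = """
SELECT d.name AS database_name,
       mf.name AS file_name,
       CASE WHEN mf.data_space_id = 0 THEN 'LOG' ELSE 'DATA' END AS file_type,
       CAST(mf.size * 8 / 1024.0 AS decimal(18, 2)) AS size_mb
FROM sys.master_files AS mf
JOIN sys.databases AS d ON d.database_id = mf.database_id
ORDER BY d.name, mf.data_space_id;
"""

for row in conn.cursor().execute(query):
    print(f"{row.database_name}\t{row.file_name}\t{row.file_type}\t{row.size_mb} MB")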


Bulk Migration of Data and Log Files in SQL Server

David Fowler shows how you can change file paths for all of your databases in one fell swoop:

You’ve got a SQL Server with a few hundred databases on it (to be honest it doesn’t even need to be quite that many) and you need to move all the data and log files to a new location. Perhaps you’re going to be migrating onto a new, shiny SAN or maybe your disks are just about full and you need to shift a bunch of the files off somewhere else.

The first thing that you’re going to need to do is change the paths of the files in SQL. That’s easy enough to do with an ALTER DATABASE statement.

ALTER DATABASE SQLUndercover MODIFY FILE (NAME = SQLUndercover_Log, FILENAME = 'F:\SQLLogs\SQLUndercover_Log.ldf');

But that’s going to get very tedious very quickly if you’ve got to do that for a whole lotta databases. So to help out, I thought I’d share a little script that I’ve been using for a while (or a variation on it at least) to make the process far easier and generate all the ALTER statements for you.

Click through for the script as well as a bit of advice around the actual moving of the files.
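David's post has the real script in T-SQL; purely as a sketch of the same idea, here is a Python version that reads sys.master_files and prints one ALTER DATABASE ... MODIFY FILE statement per file. The target paths and connection string are placeholders:

import ntpath
import pyodbc

# Hypothetical new locations -- point these at your new SAN or drives.
NEW_DATA_PATH = r"G:\SQLData"
NEW_LOG_PATH = r"G:\SQLLogs"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes;"
)

rows = conn.cursor().execute("""
    SELECT d.name AS database_name, mf.name AS logical_name,
           mf.physical_name, mf.data_space_id
    FROM sys.master_files AS mf
    JOIN sys.databases AS d ON d.database_id = mf.database_id;
""")

# Generate one ALTER statement per file, keeping the original file name but
# swapping in the new directory (data_space_id = 0 means it's a log file).
for row in rows:
    new_dir = NEW_LOG_PATH if row.data_space_id == 0 else NEW_DATA_PATH
    new_path = new_dir + "\\" + ntpath.basename(row.physical_name)
    print(
        f"ALTER DATABASE [{row.database_name}] "
        f"MODIFY FILE (NAME = [{row.logical_name}], FILENAME = '{new_path}');"
    )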


Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.
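For a flavor of the setup, here is a minimal Auto Loader sketch in PySpark; it assumes a Databricks environment (the cloudFiles source is Databricks-specific), and the paths and schema are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema for the incoming JSON files.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# The "cloudFiles" source watches the input directory and picks up new files
# as they arrive, via the file notification services described above.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(schema)
    .load("/mnt/raw/events/")
)

# Stream the raw records into a Delta table.
(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/events/_checkpoints")
    .start("/mnt/delta/events")
)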


Storing Power BI Audit Logs in Blob Storage

Gilbert Quevauvilliers works around a built-in constraint with Power BI Audit Logs:

With the new Power BI Get-PowerBIActivityEvent cmdlet, I wanted to find a way to automate the entire process so that it all runs in the cloud.

One of the current challenges with the Audit logs is that they only store 90 days, so if you want to do analysis for longer than 90 days the log files have to be stored somewhere. Why not use Azure Blob Storage?

Whilst these steps might appear to be rather technical, if you follow them and you have access to an Azure subscription, you can do this too.

Gilbert warns us up-front that this will be a lengthy post and that is quite true. But if you need to hold those audit logs more than 90 days, this is a great way of doing so.
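Gilbert's solution is PowerShell running in the cloud; purely as a sketch of the storage half of the idea, here is how a daily activity-event export might be pushed into a container with the Python azure-storage-blob package. The connection string, container name, and file name are placeholders, and the JSON export itself is assumed to already exist on disk:

from datetime import date
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("powerbi-audit-logs")

# One blob per day keeps the activity events well past the 90-day window.
log_file = f"activity-events-{date.today():%Y-%m-%d}.json"
with open(log_file, "rb") as data:
    container.upload_blob(name=log_file, data=data, overwrite=True)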


Finding the Right Disk and Data Node Sizes in HDFS

Lokesh Jain has some advice when it comes to disk and data node size:

There are two factors to keep in mind when choosing node capacity. These will be discussed in detail in the next sections.

1. Large Disks – total node capacity being the same, using more disks is better as it yields higher aggregate IO bandwidth.
2. Dense Nodes – as nodes get denser, recovery after node failure takes longer.

These factors are not HDFS-specific and will impact any distributed storage service that replicates data for redundancy and serves live workloads.

Click through for specific advice on maximum disk and node sizes.
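To see why density matters, here is a back-of-the-envelope calculation. Every number below is an assumption, but the shape of the result is the point: re-replication time after a node failure grows with the amount of data stored on that node.

# Illustrative figures only -- substitute your own cluster's numbers.
node_capacity_tb = 100        # data stored on the failed node
remaining_nodes = 50          # nodes available to re-replicate that data
per_node_recovery_gbps = 1.0  # bandwidth each node can spare for recovery

data_gigabits = node_capacity_tb * 1000 * 8              # TB -> gigabits
aggregate_gbps = remaining_nodes * per_node_recovery_gbps
recovery_hours = data_gigabits / aggregate_gbps / 3600

print(f"Estimated re-replication time: {recovery_hours:.1f} hours")
# Doubling the node's density to 200 TB roughly doubles that recovery window.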


Tips for Using Azure Storage

James Serra takes us through Azure Data Lake Store Gen2 and Azure Blob Storage:

Azure Data Lake Store (ADLS) Gen2 should be used instead of Azure Blob Storage unless there is a needed feature that is not yet GA’d in ADLS Gen2.

The major features that are missing from ADLS Gen2 are premium tier, soft delete, page blobs, append blobs, and snapshots. The major features that are in preview are archive tier, lifecycle management, and diagnostic logs. Check out all the missing features at Known issues with Azure Data Lake Storage Gen2.

Note that underneath the covers, ADLS Gen2 uses Azure Blob Storage and is simply a layer over blob storage providing additional features (i.e. hierarchical file system, better performance, enhanced security, Hadoop compatible access).

Click through for a bullet point list of useful information.
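To make the hierarchical file system point concrete, here is a small sketch using the azure-storage-file-datalake package, where directories are real objects rather than a naming convention over a flat namespace. The account details are placeholders and the filesystem (container) is assumed to already exist:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)

# Assumes a filesystem named "raw" already exists in the account.
filesystem = service.get_file_system_client("raw")

# Directories are first-class objects, so operations like ACLs and renames
# apply to a whole folder rather than to a prefix of flat blob names.
directory = filesystem.create_directory("landing/2020/06")

payload = b'{"hello": "adls gen2"}'
file_client = directory.create_file("events.json")
file_client.append_data(payload, offset=0, length=len(payload))
file_client.flush_data(len(payload))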


Creating a Gen-2 Azure Data Lake Store

Cecilia Brusatori shares how to build a generation-2 data lake in Azure:

Finally, you’ve decided that Data Lake Gen 2 is good for your Data Analytics Scenario and you’ve started the journey, went to the Azure Portal and searched for it. Mhh you don’t see it in the options to create it, let’s try the search bar [typing Data Lake Gen2….] Nothing… Ok maybe you’ve missed something…. nope!
So what is, in fact, a Data Lake Gen 2? It is a blob storage account, optimized for Data Analytics.
Let’s take a look at how you are able to create it!

If you’re used to the first generation, where Azure Data Lake Storage was its own thing, it might take a minute to realize where it went.
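Cecilia works through the portal; the same thing can be sketched from code, since a Gen2 account is just a StorageV2 account with the hierarchical namespace flag switched on. Here is an illustrative example with the azure-mgmt-storage SDK, where the names, region, and SKU are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create a StorageV2 account with the hierarchical namespace enabled --
# that flag is what makes it Data Lake Storage Gen2.
poller = client.storage_accounts.begin_create(
    "my-resource-group",
    "mydatalakegen2",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="westeurope",
        is_hns_enabled=True,
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)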
