Using Hive: Tiered Or Decoupled Storage?

Brandon Wilson and Gopal Vijayaraghavan compare a series of Hive queries against EC2 instances with persistent storage and S3:

There are advantages and disadvantages to each approach. The tiered approach has the most flexibility for an operator to tune the performance of the cluster while trading off size of the hot data zone for better performance or smaller resource footprint. The downside of this approach is that, having data on HDFS, resizing the cluster is a slow and tedious process due to HDFS needing to be rebalanced to achieve performance and fault-tolerance expectations. Thus this architecture is generally only used for statically sized clusters with steady, well-known workloads.

The decoupled architecture, on the other hand, enables maximum flexibility for cluster growth and reduction. For example, a cluster could run at 100 nodes during the day to support analytics and reporting and then shrink to 24 nodes overnight to support smaller ETL workloads. Historically, the disadvantage to decoupling is that cloud storage is not local and therefore could drastically affect runtime of the analytical workloads (hence the hybrid approach of tiered storage). However, the advent of LLAP in Hive 2.0 has limited this overhead making the approach far more attractive. The dynamic cache within LLAP also means that we do not need to statically define what data is hot. It can be inferred at query time (i.e., as users access the data, that data will become hot). We will look closer at how LLAP closes the runtime gap in the next section.

Historically, the argument was that you should avoid S3 in part because it’s relatively flaky compared to disks (in terms of performance and in its eventual consistency model).  It looks like that’s no longer a pressing concern.

Storage Spaces Direct Complications With SQL Server

Allan Hirt walks us through some of the sharp edges around using Storage Spaces Direct (S2D) beneath a Windows Failover Cluster Instance running SQL Server:

Arguably, the biggest thing about S2D is that the solutions currently have to be certified (see this bit of documentation from MS for more detail). This obviously doesn’t really affect, say, virtualized versions or ones up in the public cloud such as in Azure in a meaningful way, but it’s still technically a requirement much like logoed hardware for Windows Server supportability. Anyone want to point me to the logo stamped on your VMs? Didn’t think so. Now, from a pure FCI perspective none of this is an issue. The way a Windows Server failover cluster (WSFC) is currently designed, it is expecting that all nodes participating in the WSFC are also using/needed S2D. Why am I mentioning this? Disaster recovery.

It shouldn’t stop you from moving forward with S2D, but means you’ll probably have a bit more research on your hands.

Testing Disk Speed With diskspd

Marek Masko shows how to test I/O performance using the diskspd utility:

What is Diskspd?

Diskspd is a storage testing tool created by Microsoft Windows, Windows Server and Cloud Server Infrastructure Engineering teams. It combines robust and granular IO workload definition with flexible runtime and output options. That makes it a perfect tool for storage performance testing, validation and benchmarking.

Where to find Diskspd?

Diskspd is a free and open source utility. Its source code can be found on GitHub. The repository also hosts other frameworks which use Diskspd. You can find them under ‘Frameworks’ directory. A binary release is hosted by Microsoft at the following location: http://aka.ms/diskspd.

Click through for more details, including an example of a poorly-performing I/O solution.

Expanding LVM Drives

David Klee shows how to expand an LVM drive on Linux:

Next in our SQL Server on Linux series is one important question. On Windows, if you’re about to run out of space, you get your VM admin / storage admin to expand one or more of your drives, and you go to Disk Management and expand the drive with no downtime. How do we accomplish this same task on Linux?

First, SSH into your VM. Get your appropriate system engineer to expand the drive that needs to be expanded. You won’t be able to see it at first in Linux because, just like in Windows, it’ll need to rescan the storage to ‘see’ the extra space. Sometimes Windows does it automatically, and sometimes you have to initiate it manually. In Linux it only does this on system startup.

Let’s grow our data drive from 250GB to 300GB first.

Click through to see how to do that.

In-Memory OLTP: When You’re Out Of Space

Ned Otter shows us what happens when you run out of disk space and you’re using memory-optimized objects:

In my lab, I’m running Windows Server 2012. Let’s use Powershell to install the File System Resource Manager, which will allow us to create a quota for the relevant folder:

add-windowsfeature –name fs-resource-manager –includemanagementtools

After installing the Windows feature we can set the quota for the folder, but we shouldn’t enable it just yet, because first we have to verify the current size of the folder.

On my server, I created a quota of 1.5GB, and then enabled it.

Now let’s INSERT rows into the table, in batches of 1000, until we reach the limit (the INSERT script is listed in Part 2, I’m trying to keep this post from getting too long).

Click through to see what happens.  It’s not exactly a swath of carnage, but it’s also something you really don’t want to happen.

Backups And Distributed File Shares

Wayne Sheffield ran into a new oddity recently:

I was working on a client’s site today, setting up database backup routines. Part of which is to perform a database backup, and verify that everything went okay. I had Windows Explorer open to the location that the backup was going to. When the backup finished, I navigated over to Windows Explorer… and I have a missing database backup. There wasn’t a file in the directory for the backup that I had just performed.

After double and triple checking that I was looking at the same path that I had backed up the database to, I went in search of the network sysadmin to help me figure it out.

Read on for the solution.

Thinking About Scalable Persistent Memory

Ned Otter has a good post thinking about Scalable Persistent Memory:

There were other potential issues when using Persistent Memory, detailed in this blog post. But what’s not covered in that post is the fact that deploying NVIDMM-N reduced the memory speed and/or capacity, because they are not compatible with LRDIMM. This causes you to use RDIMM, which reduces capacity, and because NVDIMM-N operates at a slower speed than RDIMM, it also affects total memory speed.

HP has since released Gen10 servers, and they have changed the landscape for those seeking reduced latency by storing larger data sets in memory. For one thing, they raise the bar for what’s now referred to as Scalable Persistent Memory, with a total server capacity of 1TB. To be clear, NVDIMM-N is not used in this configuration. Instead, regular DIMMs are used, and they are persisted to flash via a power source (this was also the case for NVDIMM-N, but both the flash, DIMM, and power source were located on the NVDIMM-N).

Check it out.  I’m happy that things are improving, but it sounds like this won’t be a panacea.

Using Diskspd To Test Storage Performance

Aamir Syed gives an example of Diskspd parameters to test a storage subsystem:

It’s important to test your storage performance especially prior to installing or deploying a new SQL Server.

Microsoft has provided us with a great tool called Diskspd, which was meant to replace SQLIO. Diskspd synthetically generates workloads to run against your server.  It’s pretty robust and has a lot of parameters so that you can customize your test.

Ex. In the command below, I specified -b8k, which means the block size is going to run at 8k, which is the size that SQL uses for pages.

Click through for a sample run and explanation of each parameter.

Storage For In-Memory OLTP

Ned Otter has started a new series on In-Memory OLTP:

A memory-optimized database must have a special filegroup designated for memory-optimized data, known as a memory-optimized filegroup. This special filegroup is logically associated with one or more “containers”.What the heck is a “container”? Well, it’s just a fancy word for “folder”, nothing more, nothing less. But what is actually stored in those fancy folders?

Containers hold files known as “checkpoint file pairs”, which are also known as “data and delta files”, and these files persist durable memory-optimized data (in this blog post series, I’ll use the terms CFP and data/delta files interchangeably). You’ll note on the following image that it clearly states in bold red letters, “NO MAXSIZE” and “STREAMING”. “NO MAXSIZE” means that you can’t specify how large these files will grow, nor can you specify how large the container that houses them can grow (unless you set a quota, but you should NOT do that). And there’s also no way at the database level to control the size of anything having to do with In-Memory OLTP storage – you simply must have enough available free space for the data and delta files to grow.

This is the first potential resource issue for In-Memory OLTP: certain types of data modifications are no longer allowed if the volume your container resides upon runs out of free space. I’ll cover workload recovery from resource depletion in a future blog post.

Read the whole thing.  I’m looking forward to this series.

Automating Azure Data Lake Storage ACLs

Shannon Lowder shows how to automate Azure Data Lake Storage access control lists:

Now that you have these, you can use a for each loop to set your permissions.

foreach ($ACL in $ACLs) { write-host "Grant $useremail " $ACL[1] " access to " $ACL[0]; Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adls -Path $ACL[0] -AceType User -Id $(Get-AzureRmADUser -Mail $useremail ).Id -Permissions $ACL[1] Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adls -Path $ACL[0] -AceType User -Id $(Get-AzureRmADUser -Mail $useremail ).Id -Permissions $ACL[1] -Default
}

Now, for each permission, we’ll set the ACL and the default.  Why set both?  Well, when folders are created under each of the target folders, you want to cascade those permissions down from parent to child, right?  Well, that’s what the Default ACL controls.  If you skip the second Set-AzureRMDataLakeStoreItemAclEntry, then new folders would not inherit the permissions of the containing folder and your users would be unable to access their files properly.

Read the whole thing.  Shannon also has one of the very few valid use cases for 3D pie charts.

Categories

April 2018
MTWTFSS
« Mar  
 1
2345678
9101112131415
16171819202122
23242526272829
30