
Day: May 13, 2019

Auto-Terminating Unused EMR Clusters

Praveen Krishnamoorthy Ravikumar shows how you can use AWS Lambda to terminate Elastic MapReduce (EMR) clusters that have been idle for a certain amount of time:

To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.

In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.

We went a slightly different route for auto-termination, killing clusters after a fixed number of hours.
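The terminate step itself is tiny. Here is a minimal sketch of what a Lambda handler for this pattern can look like, assuming the cluster ID arrives via a hypothetical CLUSTER_ID environment variable; the post's actual function wires up the alarm, the cluster ID, and permissions its own way, so treat this as an illustration rather than their code.

```python
# Hedged sketch: terminate an idle EMR cluster when the CloudWatch alarm fires.
# CLUSTER_ID is a hypothetical environment variable used here for illustration.
import os
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    cluster_id = os.environ["CLUSTER_ID"]

    # Clear termination protection first, in case the cluster was launched with it.
    emr.set_termination_protection(
        JobFlowIds=[cluster_id],
        TerminationProtected=False,
    )

    # Terminate the idle cluster.
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
    return {"terminated": cluster_id}
```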


The Pains of Small Files in Hadoop

Shashank Naik and Bhagya Gummalla explain why having a large number of small data files in Hadoop is a problem:

When a NameNode restarts, it must load the filesystem metadata from local disk into memory. This means that if the NameNode metadata is large, restarts will be slower. The NameNode must also track changes in the block locations on the cluster. Too many small files can also cause the NameNode to run out of metadata space in memory before the DataNodes run out of data space on disk. The DataNodes also report block changes to the NameNode over the network; more blocks means more changes to report over the network.

More files mean more read requests that need to be served by the NameNode, which may end up clogging NameNode’s capacity to do so. This will increase the RPC queue and processing latency, which will then lead to degraded performance and responsiveness. An overall RPC workload of close to 40K~50K RPCs/s is considered high.

There are a few reasons you want to pack data into large files on Hadoop, and this post explains them well.
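To put a rough number on the memory argument: every file, directory, and block becomes an object in NameNode memory, and about 150 bytes per object is the commonly quoted rule of thumb. Here is a quick back-of-the-envelope sketch; the 150-byte figure and the file sizes are illustrative, not exact.

```python
# Back-of-the-envelope NameNode heap estimate; ~150 bytes per metadata object
# is a widely cited rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)

def namenode_heap(num_files, file_size):
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)            # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

one_tb = 1024 ** 4
small = namenode_heap(one_tb // (4 * 1024), 4 * 1024)      # 1 TB stored as 4 KB files
large = namenode_heap(one_tb // (1024 ** 3), 1024 ** 3)    # 1 TB stored as 1 GB files
print(f"4 KB files: ~{small / 2**30:,.0f} GiB of NameNode heap")
print(f"1 GB files: ~{large / 2**20:,.1f} MiB of NameNode heap")
```

Same terabyte of data either way; the metadata footprint differs by more than four orders of magnitude, which is exactly the pressure the post describes.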


Changes in R 3.6.0

David Smith lays out the major changes in R 3.6.0:

A major update to the open-source R language, R 3.6.0, was released on April 26 and is now available for download for Windows, Mac and Linux. As a major update, it has many new features, user-visible changes and bug fixes. You can read the details in the release announcement, and in this blog post I’ll highlight the most significant ones.

There are some good changes in here.


Compress Those Indexes

Pamela Mooney reminds us that if you’re going to compress your heap or clustered index, remember those non-clustered indexes as well:

We compress to gain space, correct? So why sacrifice it to our indexes?

Here is a script (complete with demo databases and tables) that will find the culprits and fix them for you.  Put it (minus the demo databases and tables) in a job, and quit worrying about this.  Your tables (and indexes) will be sparkling clean. You’re welcome.

Click through for the script. It’s a good reminder that compressing a clustered index does not automatically compress non-clustered indexes—nor should it, as these won’t necessarily compress at the same ratio because data type makeup can differ.
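For a feel of what such a script does (this is not Pamela's script; click through for the real thing), here is a minimal sketch that lists non-clustered indexes still stored uncompressed and prints the rebuild statements. The connection string, the choice of PAGE compression, and driving it from Python with pyodbc are all assumptions made for illustration.

```python
# Hedged sketch: find uncompressed non-clustered indexes and print rebuild statements.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Sandbox;Trusted_Connection=yes;"
)
cur = conn.cursor()

# index_id > 1 = non-clustered; data_compression = 0 = NONE
cur.execute("""
    SELECT s.name, t.name, i.name
    FROM sys.indexes AS i
    JOIN sys.tables AS t ON t.object_id = i.object_id
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.partitions AS p ON p.object_id = i.object_id AND p.index_id = i.index_id
    WHERE i.index_id > 1
      AND p.data_compression = 0
    GROUP BY s.name, t.name, i.name;
""")

for schema_name, table_name, index_name in cur.fetchall():
    statement = (
        f"ALTER INDEX [{index_name}] ON [{schema_name}].[{table_name}] "
        "REBUILD WITH (DATA_COMPRESSION = PAGE);"
    )
    print(statement)          # review the output first...
    # cur.execute(statement)  # ...then uncomment to actually rebuild
```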


Getting Prior Year’s Year-To-Date with DAX

Kasper de Jonge takes a look at how to calculate a prior year’s year-to-date over the same period as the current year:

Well maybe.. what happens here is that the DAX engine took the whole date range we have in context and shifts it back 12 months. This means for year 2019 it will use January first to December 31. So we get the entire year, is that what we want? Or do we want to see the sales for the previous year until the day we have data for this year so we can compare? Both need different DAX so let’s take a look.

Read on for a detailed analysis, including where you might go wrong.


Dropping Tables in Bulk

Jeff Mlakar talks about a topic I like—dropping lots and lots of stuff:

Let’s assume that you have lots of tables that need to be dropped according to some criteria. Trying to do them all at once isn’t a good idea. Even with a powerful server it will either take forever or simply never finish.

For example – you may have millions of tables in sys.tables or millions of indexes you need to drop. SQL Server won’t process them well if you try to run it as one big statement.

I’ve never had millions of tables or millions of indexes to drop, and now I am jealous. Regardless, Jeff has two techniques for us when you have a lot of work to do. And if you do need to figure out key dependencies, I have a script for that.
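To give a flavor of the batching idea (not Jeff's script; the table filter and batch size here are invented for illustration), a loop that drops a bounded number of tables per pass keeps any single transaction small:

```python
# Hedged sketch: drop matching tables in small batches instead of one giant statement.
# The 'staging_%' filter and batch size are invented; FK dependencies are not handled.
import pyodbc

BATCH_SIZE = 100

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Sandbox;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

while True:
    # Grab the next batch of tables that match the criterion.
    cur.execute("""
        SELECT TOP (?) s.name, t.name
        FROM sys.tables AS t
        JOIN sys.schemas AS s ON s.schema_id = t.schema_id
        WHERE t.name LIKE 'staging[_]%';
    """, BATCH_SIZE)
    batch = cur.fetchall()
    if not batch:
        break
    for schema_name, table_name in batch:
        cur.execute(f"DROP TABLE [{schema_name}].[{table_name}];")
```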


Table Partitioning: WAIT_AT_LOW_PRIORITY on Standard Edition

Michael Bourgon explains what the WAIT_AT_LOW_PRIORITY option does with table partitioning and that it is available in Standard Edition:

But how about WAIT_AT_LOW_PRIORITY?  That was introduced in 2014 to make table partitioning deal better with That-Dude-From-Accounting-Who-Kicks-Off-A-Massive-Query-On-Friday-at-5pm, which causes partitioning to hang on Saturday when you’re trying to add and remove partitions.

Read on for a demo.
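If you haven't seen the clause before, here is roughly where it goes. The table names and partition numbers are invented, and running it from Python with pyodbc is only to keep these examples consistent; the WITH (WAIT_AT_LOW_PRIORITY ...) clause is the piece under discussion.

```python
# Hedged sketch: a partition switch that waits politely for blockers instead of
# queueing behind them indefinitely. Table names and partition numbers are invented.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Sandbox;Trusted_Connection=yes;",
    autocommit=True,
)

conn.cursor().execute("""
    ALTER TABLE dbo.Sales
    SWITCH PARTITION 1 TO dbo.SalesArchive PARTITION 1
    WITH (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES,
                                ABORT_AFTER_WAIT = SELF));
""")
```

ABORT_AFTER_WAIT controls who loses the standoff once MAX_DURATION expires: SELF cancels the switch itself, BLOCKERS kills the sessions holding the locks, and NONE falls back to waiting at normal priority.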
