To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.
In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.
We went a slightly different route for auto-termination, killing after a fixed number of hours.
When a NameNode restarts, it must load the filesystem metadata from local disk into memory. This means that if the namenode metadata is large, restarts will be slower. The NameNode must also track changes in the block locations on the cluster. Too many small files can also cause the NameNode to run out of metadata space in memory before the DataNodes run out of data space on disk. The datanodes also report block changes to the NameNode over the network; more blocks means more changes to report over the network.
More files mean more read requests that need to be served by the NameNode, which may end up clogging NameNode’s capacity to do so. This will increase the RPC queue and processing latency, which will then lead to degraded performance and responsiveness. An overall RPC workload of close to 40K~50K RPCs/s is considered high.
There are a few reasons you want to pack data into large files on Hadoop and this explains them well.
Using sp_WhoIsActive in a slightly different way, we can see what a query has been up to over a duration of our choosing.
The delta columns are what happened over the span of time we pick. The columns that come back normally are cumulative to where the query is at now.
Click through to see what it does and how you might benefit from it.
A major update to the open-source R language, R 3.6.0, was released on April 26 and is now available for download for Windows, Mac and Linux. As a major update, it has many new features, user-visible changes and bug fixes. You can read the details in the release announcement, and in this blog post I’ll highlight the most significant ones.
There are some good changes in here.
We compress to gain space, correct? So why sacrifice it to our indexes?
Here is a script (complete with demo databases and tables) that will find the culprits and fix them for you. Put it (minus the demo databases and tables) in a job, and quit worrying about this. Your tables (and indexes) will be sparkling clean. You’re welcome.
Click through for the script. It’s a good reminder that compressing a clustered index does not automatically compress non-clustered indexes—nor should it, as these won’t necessarily compress at the same ratio because data type makeup can differ.
Well maybe.. what happens here is that the DAX engine took the whole date range we have in context and shifts it back 12 months. This means for year 2019 it will use January first to December 31. So we get the entire year, is that what we want? Or do we want to see the sales for the previous year until the day we have data for this year so we can compare? Both need different DAX so let’s take a look.
Read on for a detailed analysis, including where you might go wrong.
Let’s assume that you have lots of tables that need to be dropped according to some criteria. Trying to do them all at once isn’t a good idea. Even with a powerful server it will either take forever or simply never finish.
For example – you may have millions of tables in sys.tables or millions of indexes you need to drop. SQL Server won’t process them well if you try to run it as one big statement.
I’ve never had millions of tables or millions of indexes to drop and now I am jealous. Regardless, Jeff has two techniques for us when you have a lot of work to do. And if you do need to figure out key dependencies, I have a script for that.
But how about WAIT_AT_LOW_PRIORITY? That was introduced in 2014 to make table partitioning deal better with That-Dude-From-Accounting-Who-Kicks-Off-A-Massive-Query-On-Friday-at-5pm, which causes partitioning to hang on Saturday when you’re trying to add and remove partitions.
Read on for a demo.