Automating Log File Expansion

Max Vernon shows how you can automatically expand log files to optimize VLF counts:

SQL Server Database Log file expansion can be fairly tedious if you need to make the log bigger in many reasonably-sized growth increments. It can be tedious because you may need to write and execute a large number of ALTER DATABASE ... MODIFY FILE ... commands.

The following code automatically grows a SQL Server Database log file, using the size and growth increments you configure in the script. If you set the @DebugOnly flag to 1, the script will only print the commands required, instead of executing them. This allows you to see what exactly will be executed ahead of time. Alternately, you could copy-and-paste the commands into a query window and execute them one-by-one.

Click through for that code.
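In the meantime, here is a rough illustration of the general pattern (a minimal sketch, not Max's actual script; the database and log file names are hypothetical placeholders). It generates one MODIFY FILE command per growth increment and prints it, mimicking the @DebugOnly = 1 behavior:

    -- Run this in the target database.
    -- Grows the log to @TargetSizeMB in @GrowthMB increments.
    DECLARE @TargetSizeMB int = 32000,    -- hypothetical target size
            @GrowthMB     int = 4000,     -- hypothetical growth increment
            @CurrentMB    int,
            @sql          nvarchar(max);

    SELECT @CurrentMB = size / 128        -- size is in 8KB pages
    FROM sys.database_files
    WHERE type_desc = N'LOG';

    WHILE @CurrentMB < @TargetSizeMB
    BEGIN
        SET @CurrentMB = IIF(@CurrentMB + @GrowthMB > @TargetSizeMB,
                             @TargetSizeMB, @CurrentMB + @GrowthMB);
        SET @sql = N'ALTER DATABASE [YourDatabase] MODIFY FILE '
                 + N'(NAME = N''YourDatabase_log'', SIZE = '
                 + CAST(@CurrentMB AS nvarchar(10)) + N'MB);';
        PRINT @sql;     -- print only, like @DebugOnly = 1
        -- EXEC (@sql); -- uncomment to execute each growth step
    END;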

Azure Data Studio May Release

Alan Yu announces the May release of Azure Data Studio:

Since its release two months ago, the community continues to love SQL Notebooks. This month, we had a laser-eyed focus on quality of life bug fixes instead of new features. These improvements include:

– Markdown rendering improvements, including better support for notes and tables
– Usability improvements to the toolbar
– Markdown links for trusted notebooks no longer require Command/Ctrl + click and can be clicked directly
– Improvements in cleaning up Jupyter processes after closing notebooks and reducing errors when starting multiple notebooks concurrently
– Improvements to SQL Notebook connections to ensure errors don’t occur when running two notebooks against the same database
– Improvements to notebook auto-scrolling to the currently executing cell when clicking the run cells button from the toolbar
– General stability and performance improvements

And based on some of the GitHub comments, I’m going to really like the June release if those changes all make it in.

Memory Pressure and Azure SQL Managed Instances

Jovan Popovic takes us through determining whether we have enough memory on an Azure SQL Managed Instance:

Managed Instance has memory that is proportional to the number of cores. As an example, in the Gen5 architecture you have 5.1GB of memory per vCore, meaning that an 8-core instance will have 41GB of memory. You should periodically check whether this amount of memory is good for your workload.

Do not monitor whether the Managed Instance uses ~100% of available memory. Some people believe that this is an issue (like hitting 100% CPU), but this is expected and desired behavior. Managed Instance should use as much memory as possible to cache pages from disk into the buffer pool. The only case where you will not see near 100% memory usage is when the databases are much smaller than the available memory, so all of them can fit into memory.

The spoiler version is that it’s the same process as on-prem.
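Since it is the same process, the usual on-prem buffer pool checks apply. As one quick example (a sketch of a standard check, not Jovan's queries; healthy thresholds depend on your workload), you can watch Page Life Expectancy from the performance counters DMV:

    -- How long, in seconds, pages survive in the buffer pool.
    -- A low or steadily falling value under load suggests memory pressure.
    SELECT [object_name],
           counter_name,
           cntr_value AS ple_seconds
    FROM sys.dm_os_performance_counters
    WHERE counter_name LIKE 'Page life expectancy%'
      AND [object_name] LIKE '%Buffer Manager%';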

Azure Data Factory: Mapping and Wrangling Data Flows

Cathrine Wilhelmsen explains the difference between Mapping Data Flows and Wrangling Data Flows in Azure Data Factory:

Now, we all know that the consultant answer to “which should I use?” is It Depends ™ 🙂 But what does it depend on?

To me, it boils down to a few key questions you need to ask:
– What is the task or problem you are trying to solve?
– Where and how will you use the output?
– Which tool are you most comfortable using?

Read on to see how they both work.

Auto-Terminating Unused EMR Clusters

Praveen Krishamoorthy Ravikumar shows how you can use AWS Lambda to terminate ElasticMapReduce clusters which have been idle for a certain amount of time:

To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.

In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.

We went a slightly different route for auto-termination, killing after a fixed number of hours.

The Pains of Small Files in Hadoop

Shashank Naik and Bhagya Gummalla explain why having a large number of small data files in Hadoop is a problem:

When a NameNode restarts, it must load the filesystem metadata from local disk into memory. This means that if the NameNode metadata is large, restarts will be slower. The NameNode must also track changes in the block locations on the cluster. Too many small files can also cause the NameNode to run out of metadata space in memory before the DataNodes run out of data space on disk. The DataNodes also report block changes to the NameNode over the network; more blocks means more changes to report over the network.

More files mean more read requests that need to be served by the NameNode, which may end up clogging NameNode’s capacity to do so. This will increase the RPC queue and processing latency, which will then lead to degraded performance and responsiveness. An overall RPC workload of close to 40K~50K RPCs/s is considered high.

There are a few reasons you want to pack data into large files on Hadoop and this explains them well.

Changes in R 3.6.0

David Smith lays out the major changes in R 3.6.0:

A major update to the open-source R language, R 3.6.0, was released on April 26 and is now available for download for Windows, Mac and Linux. As a major update, it has many new features, user-visible changes and bug fixes. You can read the details in the release announcement, and in this blog post I’ll highlight the most significant ones.

There are some good changes in here.

Compress Those Indexes

Pamela Mooney reminds us that if you’re going to compress your heap or clustered index, remember those non-clustered indexes as well:

We compress to gain space, correct? So why sacrifice it to our indexes?

Here is a script (complete with demo databases and tables) that will find the culprits and fix them for you. Put it (minus the demo databases and tables) in a job, and quit worrying about this. Your tables (and indexes) will be sparkling clean. You’re welcome.

Click through for the script. It’s a good reminder that compressing a clustered index does not automatically compress non-clustered indexes—nor should it, as these won’t necessarily compress at the same ratio because data type makeup can differ.
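If you want to eyeball the situation first, a simplified check (a sketch, not Pamela's script; the index and table names in the commented rebuild are hypothetical) lists nonclustered indexes that carry no compression at all:

    -- Nonclustered indexes with no row or page compression.
    SELECT OBJECT_NAME(i.object_id) AS table_name,
           i.name                   AS index_name,
           p.data_compression_desc
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
      ON p.object_id = i.object_id
     AND p.index_id  = i.index_id
    WHERE i.index_id > 1                     -- skip heaps and clustered indexes
      AND p.data_compression_desc = N'NONE';

    -- Then compress a candidate, for example:
    -- ALTER INDEX [IX_YourIndex] ON [dbo].[YourTable]
    --     REBUILD WITH (DATA_COMPRESSION = PAGE);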
