
Author: Kevin Feasel

Azure Data Factory V2 Pricing

Chris Seferlis gives us the details on how Azure Data Factory V2 pricing works:

2. Volume of data moved – this is measured in DMUs (data movement units). This is one you should be aware of, as it will default to auto, which basically uses all the DMUs it can, and it is paid for by the hour. Let’s say you specify and use 2 DMUs and it takes an hour to move that data. The other option is you could use 8 DMUs and it takes 15 minutes; the price is going to end up the same. You’re using 4X the DMUs, but it’s happening in a quarter of the time.

This is good to look at and do some comparisons, since how many DMUs you’re using is where the bulk of your spend is going to be.

There are a few moving parts here, so the calculation is not trivial.  But Chris makes good sense of it all.
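
As a rough sanity check on that trade-off (the back-of-the-envelope T-SQL below is mine, not Chris’s), the billable quantity works out to DMUs multiplied by hours, so both of his scenarios land on the same number:

-- both scenarios consume the same 2 DMU-hours
SELECT 2 * 1.00 AS DmuHours_TwoDmusForAnHour,
       8 * 0.25 AS DmuHours_EightDmusForFifteenMinutes;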

Power BI Color Palettes

Meagan Longoria helps us choose a color palette for Power BI reports:

A color palette is simply a collection of colors applied to the visual elements in your report. What we typically refer to as color is a combination of three main properties: hue (base color on the color wheel), intensity (brightness or gray-ness) and value (lightness or darkness). You can build an engaging and professional looking report with just 6 colors. It’s possible to have fewer colors or more colors, but 6 should cover many common visualization needs. If you are using more than 6 colors, you might want to check that you are optimizing engagement and cognitive load.

  1. Main color – default color on graphs

  2. Color 2 – used when multiple colors are needed in a graph or report

  3. Color 3 – used when multiple colors are needed in a graph or report and Color 2 has already been used

  4. Highlight color – a color used to highlight important data points to make them stand out from other points on the page

  5. Border color – a light color used for borders on tables and KPIs where necessary

  6. Title color – color used for visual titles and axis labels as appropriate

There’s a lot of good advice in here.

Finding Procedure Parameters Which Don’t Match Column Names

Shane O’Neill has a process to update procedures to make input parameter names match output column names:

I was asked to standardise stored procedures we use for common support cases before we hand them over to IT Helpdesk.

One of the comments that came back from the Helpdesk while testing was that the parameter names that they had to put values in for didn’t match what they saw in the application.

Luckily for me (or unluckily) the application was a third party developed one and they didn’t bother renaming the columns. So if the column is called create_date in the database then the application is going to show create_date.

However, if I created a parameter called DateCreated or even CreateDate, then they don’t want it.

Shane has a PowerShell script which uses the Find-DbaStoredProcedure cmdlet in dbatools; click through to see it in action.
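
If you’d rather stay in T-SQL, a rough equivalent is to compare sys.parameters against the columns of the table in question. This is my sketch rather than Shane’s script, and the table name is a placeholder:

-- parameters whose names (minus the @) don't match any column in dbo.SomeTable
SELECT pr.name AS ProcedureName, p.name AS ParameterName
FROM sys.procedures AS pr
JOIN sys.parameters AS p
    ON p.object_id = pr.object_id
WHERE STUFF(p.name, 1, 1, '') NOT IN
    (SELECT c.name FROM sys.columns AS c
     WHERE c.object_id = OBJECT_ID('dbo.SomeTable'));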

Probabilities And Poker

Steve Miller has a notebook on 5-card draw probabilities:

The population of 5 card draw hands, consisting of 52 choose 5 or 2598960 elements, is pretty straightforward both mathematically and statistically.

So of course ever the geek, I just had to attempt to show her how probability and statistics converge. In addition to explaining the “combinatorics” of the counts and probabilities, I undertook two computational exercises. The first was to delineate all possible combinations of 5 card draws from a 52 card deck, counting occurrences of relevant combinations such as 2 pair, a straight, or nothing in a cell loop.

Steve has made his notebook available for us.
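
The hand count itself is easy to verify, by the way, since 52 choose 5 reduces to a single integer expression (Steve works in a notebook; the T-SQL translation is mine):

-- 52 choose 5 = (52 * 51 * 50 * 49 * 48) / 5! = 2,598,960 possible hands
SELECT (52 * 51 * 50 * 49 * 48) / (5 * 4 * 3 * 2 * 1) AS FiveCardHands;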

Picking An Azure SQL Database Tier

Esat Erkec has various methods you can use to figure out your Azure SQL Database tier:

When we are beginning to think of migrating our on-premises databases to Azure SQL, we have to decide on a proper purchase model, a service tier, and a performance level. Before starting the Azure SQL migration process, we have to find logical and provable answers to the following questions:

  • Which purchase model is suitable for my apps and business requirements?
  • How much budget do I need?
  • Which performance level meets my requirements?
  • Can I achieve the acceptable performance of my apps?

The purchase model and service tiers will certainly affect the Azure bills. On the other hand, we have to achieve the maximum performance with the selected service tier.

It’s a good article with helpful tips for people thinking of moving to Azure SQL Database.
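
For databases already running in Azure, one concrete input to that decision is the resource usage history Azure SQL Database keeps for roughly the last hour. This query is a generic starting point, not one of Esat’s methods:

-- recent resource consumption, sampled every 15 seconds
SELECT TOP (10)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;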

Gartner’s Cloud IaaS Magic Quadrant Changes

Bruno Aziza analyzes Gartner’s Magic Quadrant for Cloud Infrastructure as a Service offerings:

The first and most drastic change that occurred over the last year is the number of players that Gartner decided to highlight in its report: the count went from 14 vendors to just 6 this year.

Why is that?! Have the big become bigger and the small smaller?! Or has the space shrunk? The latter is highly improbable. Quite the contrary: earlier last year, Gartner forecasted that the highest growth in the cloud market would come from the sector this MQ covers, predicting that cloud system infrastructure services would grow over 36% to reach $34B+ in 2017.

So, what gives?!

Read on to learn what gives.  As far as the rankings themselves go, I think they’re reasonable:  AWS and Azure can generally go head-to-head on features, though Amazon does have the advantage.  Google is a distant third and the rest aren’t major players.

Bitwise Logic To Make Values Negative

David Fowler is working with arcane magic:

The short answer to the question is to simply apply a bitwise NOT to the number that you want to convert and add 1 to the result.  I wrote about bitwise logic in The Arcane Science of Bitwise Logic and SQL Server but didn’t mention the bitwise NOT.

If you want to apply a bitwise NOT to a value, simply use the ~ operator.

Give it a go and try converting 10 to a negative:

SELECT ~ 10 + 1

The result, -10.

One of my rules of thumb (given how many I have, I need more thumbs) is that if you’re using bitwise logic in SQL Server, something has likely gone wrong somewhere along the way.  But read David’s explanation, which is quite clear.
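
If you want to convince yourself that the trick is just two’s complement negation, it round-trips cleanly in both directions (this is my extension of David’s example):

-- NOT-and-add-one negates in either direction
SELECT ~ 10 + 1    AS NegTen,  -- returns -10
       ~ (-10) + 1 AS PosTen,  -- returns 10
       ~ 0 + 1     AS Zero;    -- returns 0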

What TDE Does To Query Performance

Matthew McGiffen has a few tests on using Transparent Data Encryption:

By the time it had been executed 5 times (with the memory flushed between each execution), each query read about 600,000 pages sized at 8KB each – just under 5GB. If it took 50 seconds on the decryption of those pages, then each page took about one-twelfth of a millisecond to decrypt – or alternatively, TDE decrypted about 12 pages per millisecond. Or in terms of disk size, 100MB per second. These were tests on a server with magnetic spinning disks (not SSDs), and you can see from the above figures that the straight disk access took about 40 seconds on its own.

When TDE doesn’t read from disk it doesn’t add any overhead, but how do we quantify what the overhead to queries is when it does have to access the disk?

Matthew has some good advice here, and I’d say his experience is within the norm for TDE: it doesn’t contradict the general guidelines by enough to shift priors.
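
Matthew’s per-page figures are also easy to re-derive from the raw numbers, for those who like checking the arithmetic (the query is mine, not his):

-- ~600,000 8KB pages decrypted in ~50 seconds
SELECT 600000 * 8 / 1024.0 / 1024.0 AS ApproxGBRead,   -- just under 5 GB
       50000.0 / 600000             AS MsPerPage,      -- about 1/12 ms
       600000 / 50000.0             AS PagesPerMs;     -- about 12 pages/ms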

Technical Debt, T-SQL Business Rules Edition

Paul Turley tells a story of technical debt:

They’ve been writing reports using some pretty complicated SQL queries embedded in SSRS paginated reports.  Every time a user wants a new report, a request is sent to the IT group.  A developer picks up the request, writes some gnarly T-SQL query with pre-calculated columns and business rules.  Complex reports might take days or weeks of development time.  I needed to update a dimension table in the data model and needed a calculated column to differentiate case types.  Turns out that it wasn’t a simple addition and his response was “I’ll just send you the SQL for that…you can just paste it”.  The dilemma here is that all the complicated business rules had already been resolved using layers of T-SQL common table expressions (CTEs), nested subqueries and CASE statements.  It was very well-written SQL and it would take considerable effort to re-engineer the logic into a dimensional tabular model to support general-use reporting.  After beginning to nod off while reading through the layers of SQL script, my initial reaction was to just paste the code and be done with it.  After all, someone had already solved this problem, right?

It’s the persistent battle between “don’t fix what isn’t broken” and “the process is broken, even if the code isn’t.”
