Day: December 2, 2021

ElasticMapReduce Serverless

Damon Cortesi, et al., announce that serverless EMR is now in preview:

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

If you’re already using EMR for ephemeral work—that is, using a Spark cluster to perform data transformations and then shutting it down—this makes a lot of sense as long as there’s not a major difference in cost.

A Primer on Apache Spark

Tomaz Kastrun has started a new series:

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It was originally developed at UC Berkeley’s AMPLab and later donated to the Apache Software Foundation, which has maintained it since.

Click through to learn more about the product.

Creating an Availability Group on Linux in Azure with Pacemaker

Andrew Pruski slams in all of the exciting nouns:

There are new Ubuntu Pro 20.04 images available in the Azure marketplace with SQL Server 2019 pre-installed so I thought I’d run through how to create a three node pacemaker cluster with these new images in order to deploy a SQL Server availability group.

Disclaimer – The following steps will create the cluster but will not have been tested in a production environment. Any HA configuration for SQL Server needs to be thoroughly tested before going “live”.

Click through to see how.

To and From Date Filtering in one Slicer

Reza Rad uses a date slicer:

It happens that you might have two fields as From and To date (or Start and End date) in your dataset, and you want a date slicer in the report. The date slicer has to filter records in a way that the FROM and the TO dates are in the range of dates selected in the slicer. There are multiple ways of doing this. In this article and video, I’ll explain a simple but effective method for that. I have explained in another article how this can be done using two date slicers; you can read that from here.

Click through to see what you need and how you can put one of these in place.
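
To make the filtering rule concrete, here is a minimal Python sketch of the condition described in the excerpt: keep a record only when both its From and To dates fall inside the range selected in the slicer. This is not Reza's Power BI implementation. The `Record` class, its field names, and the sample rows are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    # Hypothetical record shape: a label plus From and To dates.
    name: str
    from_date: date
    to_date: date


def within_slicer(records, slicer_start, slicer_end):
    """Keep records whose From and To dates both fall inside the slicer range.

    Assumes from_date <= to_date for every record.
    """
    return [
        r for r in records
        if slicer_start <= r.from_date and r.to_date <= slicer_end
    ]


# Made-up sample rows, not data from the original article.
records = [
    Record("A", date(2021, 1, 5), date(2021, 1, 20)),    # fully inside January
    Record("B", date(2021, 1, 25), date(2021, 2, 10)),   # ends after the range
    Record("C", date(2020, 12, 15), date(2021, 1, 10)),  # starts before the range
]

selected = within_slicer(records, date(2021, 1, 1), date(2021, 1, 31))
print([r.name for r in selected])
```

With a slicer range covering January 2021, only record A passes, because B ends after the range and C starts before it. A different rule, such as keeping any record that merely overlaps the range, would need a different condition.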

Diagnosing Stalled Dispatchers

Sean Gallardy has another memory dump to investigate:

If you’ve been following along with the series of demystifying dumps then you’re probably thinking or have thought something along the lines of, “All of these issues seem to be around stalled or non-yielding things…” and you’d be correct. If you revisit the non-yielding scheduler dump post, I explain a little about cooperative scheduling and how things need to play nice. If one piece isn’t playing nice by doing its part in a timely manner or not giving up time to others, that’s going to be a very large problem for the system in general. The stalled dispatcher falls into the same set of issues.

There are various dispatcher pools in SQL Server, though the most widely known one is for extended events. Dispatcher pools are items that hold pieces of work to be completed, like a dumping ground for “someone pick this up and do this” type of tasks. If it helps, you can think of them in a similar way as connection pooling, where there are background threads which will do something (like making or updating connections) when needed. A pool can respond to pressure by making new threads if needed or trim threads when the work slows down. In SQL Server, dispatchers set their own attributes including when an item is considered stuck, the number of threads to use, etc., which means some may never be considered stalled as they are able to disable their own timeouts.

Read on to see what might cause a stalled dispatcher and what a memory dump for one can tell us.

PowerShell Equality Operations

Dave Mason is not amused:

When comparing two values in PowerShell, you’ll have to march to the beat of a different drum. The syntax is drastically different:

The short reason why PowerShell uses equality operators like -eq is that Bash uses them, though the funny thing is that Bash actually uses == for string equality comparison and only uses -eq for numeric equality comparisons. The norm for POSIX is =, adding yet another level of fun.

The Data Professional Salary Survey

Brent Ozar has re-opened the data professional salary survey:

We’re data people, you and I. We make better decisions when we work off data instead of feelings.

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:

Click through for the link to the survey. It looks like most of the questions have stayed the same this year, which is good for longer-term analysis.
