Press "Enter" to skip to content

Day: March 16, 2020

Tuning EMR Performance with Dr. Elephant and Sparklens

Nivas Shankar and Mert Hocanin show us how to use a couple of products to tune Hive and Spark jobs:

Data engineers and ETL developers often spend a significant amount of time running and tuning Apache Spark jobs with different parameters to evaluate performance, which can be challenging and time-consuming. Dr. Elephant and Sparklens help you tune your Spark and Hive applications by monitoring your workloads and providing suggested changes to optimize performance parameters, like required Executor nodes, Core nodes, Driver Memory and Hive (Tez or MapReduce) jobs on Mapper, Reducer, Memory, Data Skew configurations. Dr. Elephant gathers job metrics, runs analysis on them, and presents optimization recommendations in a simple way for easy consumption and corrective actions. Similarly, Sparklens makes it easy to understand the scalability limits of Spark applications and compute resources, and runs efficiently with well-defined methods instead of leaning by trial and error, which saves both developer and compute time.

This post demonstrates how to install Dr. Elephant and Sparklens on an Amazon EMR cluster and run workloads to demonstrate these tools’ capabilities. Amazon EMR is a managed Hadoop service offered by AWS to easily and cost-effectively run Hadoop and other open-source frameworks on AWS.

Even if you aren’t using ElasticMapReduce, Dr. Elephant and Sparklens are quite useful products.

Comments closed

Collation Conflicts with Extended Events

Jason Brimhall takes us through a nasty scenario:

Have you ever run into an error like this?

Cannot resolve the collation conflict between “pick a collation” and “pick another collation” in the equal to operation.

This kind of error seems pretty straight forward and it is safe to say that it generally happens in a query. When you know what the query is and you get this error, it is pretty easy to spot and fix the problem. At least you can band-aid it well enough to get past the error with a little use of the collate clause in your query.

But what if the error you are seeing is popping up when you are trying to use Management Studio (SSMS)? The error is less than helpful and can look a little something like this.

In this case, it was when trying to expand Extended Events sessions. Click through to see what’s going on and how to fix the problem.

Comments closed

Network Security Changes Around Azure SQL DB

Rohit Nayak announces some changes to Azure SQL Database’s connectivity and network security:

Now in general availability, Private Link enables users to have private connectivity from a Microsoft Azure Virtual Network to Azure SQL Database.

This feature creates a private endpoint which maps a private IP Address from the Virtual Network to your Azure SQL Database.

From security perspective, Private Link provides you with data exfiltration protection on the login path to SQL Database. Additionally, it does not require adding of any IP addresses to the firewall on Azure SQL Database or changing the connection string of your application.

Private Link is built on best of class Software Defined Networking (SDN) functionality from the Azure Networking team. Clients can connect to the Private endpoint from within the same Virtual Network, peered Virtual Networking the same region, or via VNet-to-VNet connection across regions. Additionally, clients can connect from on-premises using ExpressRoute, private peering, or VPN tunneling. More information can be found here

Click through to see what else they’ve been working on.

Comments closed

Finding Active Users by Date with DAX

Reza Rad takes us through the subscription pattern:

One of the common patterns of a data model is the subscription pattern. In this pattern, we have subscriptions (or tickets, or issues, or whatever else you want to call it) open date and close date. The way that this data is stored in a table makes it a bit challenging to get informative insight out of it. In this post (first of the series), I am going to explain about the subscription pattern and one of the common calculations needed for it; which is active subscribers at any given date, or open tickets at any given date.

In case you’re curious, here’s a solution which works in T-SQL. I’ve really taken to event-style tables, where there’s one row per state change, so instead of having a begin date and an end date for each action, have a row which contains the date and the type of action. This makes operating on the data a lot easier, though it does make rules preventing common entry problems (end date before start date, etc.) a bit trickier.

Comments closed

Restoring Databases Inside a Container

Aaron Bertrand has a quick-and-easy method of restoring databases in a Docker container:

I have been using Docker containers for local development and testing for some time now; I first blogged about my steps into this brave new world back in late 2016. Most of the time, I just need to create some throwaway database, with a throwaway table, to prove a point or to validate an answer I’ve supplied.

Sometimes, though, I need to work with a real database. It’s a little trickier to do this in a container, because it’s isolated — I can’t just attach or restore from my Downloads folder. I could fire up a VM and attach there, but I actually don’t even use Parallels on my work laptop, and I find that using VMs leads to response times that are a lot more sluggish across the board.

It’s not too difficult to get files into your containers, provided your instance will have a total size less than the size of the container, and this quick tutorial proved to be very helpful.

Read on for the walkthrough. Aaron notes that this is a Mac-specific walkthrough, though the Windows and Linux versions are pretty similar as well, as we still create persistent volumes.

Comments closed

Using Python to Pivot Data in SQL Server

Rajendra Gupta shows a few ways to pivot data using Python:

We can use groupby and lambda functions as well in the Python scripts for Pivot tables. For this example, I have a data set of a few states of India and their cities in a SQL table.

We need a pivot table from this data. In the output, it should list all cities for a state in a column; it should use || as a city name separator.

This is an unorthodox but interesting use of Machine Learning Services.

Comments closed