Press "Enter" to skip to content

Day: September 14, 2018

ElasticMapReduce And RStudio

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node:

RStudio Server provides a browser-based interface for R and a popular tool among data scientists. Data scientist use Apache Spark cluster running on  Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

  • Running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.
  • Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.
  • You might have multiple Amazon EMR clusters in your environment. With RStudio on Edge node, you have the flexibility to connect to any EMR clusters in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster vs. running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than EMR master node.

Click through for detailed, step-by-step instructions on how to do this.

Comments closed

Hortonworks Data Analytics Studio

Will Xu and Syed Mahmood announce Hortonworks Data Analytics Studio:

DAS leverages open-source technologies such as Apache Hive to share and extend the value of a modern data architecture in heterogeneous environments. It helps infrastructure administrators manage and optimize the performance of their Hive workloads by delivering visibility into query patterns and storage hotspots. DAS improves performance by uncovering inhibitors to query speed as well as providing recommendations to improve its efficiency.

In the past, Hive view did not provide full auto-complete capability during authoring time. We’ve addressed this shortcoming in DAS. This is not a trivial task especially on large databases, however through a number of caching optimizations we were able to make it work smoothly even with thousands of tables.

This product feels more like Management Studio or SQL Operations Studio than prior Hive UIs.  That’s definitely a good thing.

Comments closed

Clearing The SSAS Cache Using C#

Shabnam Watson shows us a small console program to clear the SQL Server Analysis Services cache:

First let me give you a little background of why you would want to clear SSAS cache from C# code when you can do this using an XMLA command from SSMS.

If you have a slow MDX/DAX SSAS query , you have a couple of options for improving the performance (assuming no hardware changes):

  1. Rewrite the query differently if you have control over the query. (You will have two queries that you want to compare against the same database.)
  2. Make changes to the SSAS database to follow a better design. (You will have one query to run against two databases)

Regardless of which route you go, you should compare the performance before and after the changes to see how much you gained from the change.

Click through for more, including the code.

Comments closed

Azure Data Factory V2 Or SSIS?

Merrill Aldrich explains the differences between Azure Data Factory V2 and SQL Server Integration Services:

The next consideration is a bit more involved if you are new to data integration. Both of these tools excel at transporting data from place to place, but they have important differences in terms of what you can do to modify the data in transit. As a matter of emphasis, ADF has more features geared toward moving the data than performing any complex transformation along the way. SSIS, on the other hand, was built with a large library of transformations that you can chain together to make elaborate data flows including lookups, matching, splitting data, and more.

The tools also overlap quite a lot. In projects this seems to lead to the question of whether you’ll transform the data “in flight” using Extract Transform Load (ETL), or instead move the data to a destination where it’ll be transformed using Extract Load Transform (ELT).

These are not “pretty much the same thing” and Merrill does a good job of explaining what those differences in design mean for the products.

Comments closed

The Story Behind Grant’s Execution Plan Book

Grant Fritchey shares a funny (in a Schadenfreude sense) look at what it took to get SQL Server Execution Plans, 3rd Edition out:

Four years ago, after a bunch of dithering and some negotiations with Tony Davis, my editor, I started to update my book, SQL Server Execution Plans. We managed to convince Hugo Kornelis to be the tech editor. I started to do the real writing in early 2015.

I was most of the way through a first draft and no one liked it. Tony was unhappy. Hugo was unhappy. I was unhappy. I was just trying to update the existing book, SQL Server Execution Plans. It wasn’t working.

We all came to the conclusion that the old book was wrong. Not simply in a technical sense, although there was a lot of that, but in a structural sense. So we started rearranging things. SQL Server 2014 came out, but I was ready for it, having been on a preview, so it was no big deal. We started a second draft.

The book Grant mentions is free in PDF format thanks to Red Gate, so go give it a download and enjoy the fruits of Grant’s labor.

Comments closed