Press "Enter" to skip to content

Category: Cloud

Using mrsdeploy To Run R On Azure

John-Mark Agosta shows how to use mrsdeploy to send R batch jobs up to an Azure VM:

Alternatively, there are other Azure platforms for operationalization using R Server in the Marketplace, with other operating systems and platforms, including HDInsight, Microsoft’s Hadoop offering. Or, equivalently, one could use the Data Science VM available in the Marketplace, since it has a copy of R Server installed. Configuration of these platforms is similar to the example covered in this posting.

Provisioning an R Server VM, as referenced in the documentation, takes a few steps that are detailed here, which consist of configuring the VM and setting up the server account to authorize remote access. To set up the server, you’ll use the system account you set up as a user of the Linux machine. The server account is used for client interaction with the R Server and should not be confused with the Linux system account. This is a major difference from the Windows version of the R Server VM, which uses Active Directory services for authentication.

You can also use mrsdeploy to run batch jobs against Microsoft R Server on a local Hadoop cluster.


The Hive Metastore In HDInsight

Ashish Thapliyal shows how to create a custom Hive metastore in HDInsight:

Custom Metastore – HDInsight lets you pick a custom Metastore. It’s the recommended approach for production clusters for a number of reasons, such as:

  • You bring your own Azure SQL database as the Metastore

  • As the Metastore’s lifecycle is not tied to a cluster’s lifecycle, you can create and delete clusters without worrying about metadata loss.

  • A custom Metastore lets you attach multiple clusters and cluster types to the same Metastore. Example – a single Metastore can be shared across Interactive Hive, Hive, and Spark clusters in HDInsight

  • You pay for the cost of the Metastore (Azure SQL DB)

Read on to see how to do this.
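
To give a rough sense of what "bring your own Azure SQL database" means in practice, here is a minimal sketch of the Hive settings a custom-Metastore deployment carries. The server, database, and credential values are placeholders (not from the post), and the exact wrapper (an ARM template's configurations block, the portal blade, or the management SDK) varies.

```python
# Sketch only: standard Hive metastore properties pointed at an external Azure SQL
# database.  Every name and credential below is a placeholder, not a value from the post.
custom_metastore_config = {
    "hive-site": {
        "javax.jdo.option.ConnectionURL": (
            "jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=hivemetastore;encrypt=true"
        ),
        "javax.jdo.option.ConnectionDriverName": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "javax.jdo.option.ConnectionUserName": "metastore_admin",
        "javax.jdo.option.ConnectionPassword": "<password>",
    }
}
```

Because the database lives outside the cluster, dropping the cluster leaves the table metadata intact for whatever cluster (or cluster type) you attach to it next.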


Scala + Hadoop + HDInsight

Emmanouil Gkatziouras shows that you can run a Hadoop job written in Scala on Azure’s HDInsight:

Previously, we set up a Scala application in order to execute a simple word count on Hadoop.

What comes next is uploading our application to HDInsight. So, we shall proceed with creating a Hadoop cluster on HDInsight.

Read the whole thing, but the upshot is that Scala apps build jar files just like Java would, so there’s nothing special about running them.


Azure Elastic Pools

Derik Hammer explains what Azure SQL Database Elastic Pools do:

Azure SQL Database Elastic Pools are a mechanism for grouping your Azure SQL Databases together into a shared resource pool. Imagine for a moment that you had a physical server on-premises. On that server, you have a single SQL Server instance and a single database. This example is similar to how Azure SQL Database works. You have a fixed amount of resources, and you pay for those resources even when you are not using them.

An Elastic Pool is analogous to that same server and instance, except that you add several databases to the instance. The databases will share the same resource pool, which can be cheaper than paying for separate sets of resources, as long as your databases’ peak usage times do not align with each other.

Read on to see how you can potentially save money on databases using an elastic pool instead of spinning up the databases independently.


Azure Managed Disks

Dave Bermingham explains what Azure Managed Disks are and why you might want to use them:

What are Managed Disks, you ask? Well, just on February 8th, Corey Sanders announced the GA of Managed Disks. You can read all about Managed Disks here: https://azure.microsoft.com/en-us/services/managed-disks/

The reason why Managed Disks would have helped in this outage is that by leveraging an Availability Set combined with Managed Disks, you ensure that each of the instances in your Availability Set is connected to a different “Storage scale unit”. So in this particular case, only one of your cluster nodes would have failed, leaving the remaining nodes to take over the workload.

Prior to Managed Disks being available (anything deployed before 2/8/2017), there was no way to ensure that the storage attached to your servers resided on different Storage scale units. Sure, you could use a different storage account for each instance, but in reality that did not guarantee that those Storage Accounts provisioned storage on different Storage scale units.

Read on for more details.


Kinesis vs SQS

Kevin Sookocheff compares and contrasts Amazon’s Kinesis and SQS offerings:

Complicated Producer and Consumer Libraries

For maximum performance, Kinesis requires deploying producer and consumer libraries alongside your application. As a producer, you deploy a C++ binary with a Java interface for reading and writing data records to a Kinesis stream. As a consumer, you deploy a Java application that can communicate with other programming languages through an interface built on top of standard in and standard out. In either of these cases, adding new producers or consumers to a Kinesis stream presents some investment in development and maintenance.
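
For lighter-weight producers and consumers, you can skip those libraries and call both services directly from the AWS SDK; here is a minimal boto3 sketch (stream, queue, and region names are placeholders) showing the difference between the two models:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Kinesis: records land on a shard chosen by partition key; consumers read per shard.
kinesis.put_record(
    StreamName="example-stream",            # placeholder stream name
    Data=b'{"event": "signup", "user": 42}',
    PartitionKey="user-42",
)
shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iterator)["Records"]

# SQS: messages are pulled from a queue and deleted once processed.
queue_url = sqs.get_queue_url(QueueName="example-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "signup", "user": 42}')
for message in sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1).get("Messages", []):
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```

Going direct like this gives up the batching the KPL provides and the checkpointing the KCL provides, which is part of the trade-off being described.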

Click through for the full comparison and figuring out where each fits.


R On Athena

Gopal Wunnava shows how to run R scripts using Amazon Athena as a data source:

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”
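
The post does this from R; for a rough sense of the same query flow from Python, here is a hedged sketch with boto3, where the database name, table, query, and S3 result location are all placeholders rather than the post's values.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql, database="gdelt_db", output="s3://my-athena-results/"):
    """Submit a query and poll until Athena finishes it."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)

# Placeholder query against a GDELT-style table defined over files in S3.
qid, state = run_query("SELECT eventcode, COUNT(*) AS events FROM events GROUP BY eventcode LIMIT 10")
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```

Athena executes asynchronously, which is why the sketch polls for completion before fetching results; the R route hides that behind a driver.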

This is an interesting use case for Athena.


Coalescing In DocumentDB

Melissa Coates shows how to use the null coalesce operator in DocumentDB to provide default values for missing attributes:

This is a quick post to share how we can use the coalesce operator in Azure DocumentDB (which is a schema-free, NoSQL database) to handle situations when the data structure varies from file to file. Varying data structure is a common issue in big data and analytics projects. A schema-free database like DocumentDB allows us to ingest and store the data with varying structures without a lot of upfront effort. However, accommodating these varying data structures is challenging later when we want to analyze the data. When querying the data (think Schema on Read here), I do need to impose a consistent structure on the data to perform analytics.
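
The operator in question is ?? in the DocumentDB SQL dialect. Here is a small sketch of it from Python; DocumentDB has since been folded into Azure Cosmos DB, so the example uses the azure-cosmos package, and the account, container, and attribute names are placeholders.

```python
from azure.cosmos import CosmosClient  # successor SDK to the DocumentDB client

# Placeholder account endpoint, key, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("telemetry").get_container_client("readings")

# ?? supplies a default when an attribute is missing from a document, so documents
# with varying shapes come back with a consistent structure for analysis.
query = "SELECT c.id, (c.deviceType ?? 'unknown') AS deviceType FROM c"
for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item)
```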

Read the whole thing.


Biml And ADF, Part 2

Meagan Longoria builds Azure Data Factory pipelines using BimlScript:

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

Click through for the script.
