Press "Enter" to skip to content

Category: Cloud

Azure Data Lake Store File Management With httr

Leila Etaati shows how to generate RESTful statements in R using httr:

In this post, I am going to share my experiment with how to do file management in ADLS using RStudio.

To do this, you need the following items:

1. An Azure subscription

2. An Azure Data Lake Store account

3. An Azure Active Directory application (for service-to-service authentication)

4. An authorization token from the Azure Active Directory application

It’s pretty easy to do, as Leila shows.


Using R In Azure Data Lake Analytics

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There’s a 500MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.


Testing Azure Data Lake Store Performance

Zhen Zeng and Govind Kamat stress test Azure Data Lake Store:

Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.

The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop.  This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster.  The suite is comprised of three tools that are typically executed in sequence:

  • TeraGen, the tool that generates the input data.  We use it to test the write performance of HDFS and ADLS.

  • TeraSort, which sorts the input data in a distributed fashion.  This test is CPU-bound and we don’t really use it to characterize the I/O performance of HDFS and ADLS, but it is included for completeness.

  • TeraValidate, the test that reads and validates the sorted data from the previous stage.  We use it to test the read performance of HDFS and ADLS.

It’s an interesting look at how well ADLS scales.  In general, my reading of this is fairly positive for Azure Data Lake Store.


Azure Database-Level Firewall Rules And Geo-Replication

Arun Sirpal explains that you don’t need to create database-level firewall rules in Azure on secondary databases when using Active Geo-Replication:

The main purpose of this post today is to discuss this point – If you have an Azure SQL Database involved in Active Geo Replication and opt to use database level firewall rules do you need to create the rules in both the primary and secondary database?

I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated).

Read on for Arun’s demonstration.
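For reference, database-level firewall rules are managed entirely in T-SQL from inside the database, which is why you do not have to recreate them on the secondary. A minimal sketch, with an illustrative rule name and IP range rather than Arun’s obfuscated code:

```sql
-- Run against the primary database: create (or update) a database-level rule.
-- Because the rule lives inside the database itself, Active Geo-Replication
-- carries it to the secondary for you.
EXECUTE sp_set_database_firewall_rule
    @name = N'ClientOfficeRange',          -- hypothetical rule name
    @start_ip_address = '203.0.113.10',    -- illustrative IP range
    @end_ip_address = '203.0.113.20';

-- Check which database-level rules exist; run this on both the primary and
-- the secondary to see the same rule in each.
SELECT name, start_ip_address, end_ip_address
FROM sys.database_firewall_rules;
```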


Polybase And HDInsight

I have a post up on trying to integrate Polybase with HDInsight:

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  ports are not available.  If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
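For context, the PolyBase side of that setup is ordinary T-SQL. A minimal sketch of the external objects involved, with hypothetical host, path, and table names (it is the data movement behind these objects that the blocked data node ports break):

```sql
-- Point PolyBase at the Hadoop cluster (hypothetical head node name;
-- RESOURCE_MANAGER_LOCATION is what enables MapReduce pushdown).
CREATE EXTERNAL DATA SOURCE HDInsightCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://my-hdinsight-head-node:8020',
    RESOURCE_MANAGER_LOCATION = 'my-hdinsight-head-node:8050'
);

-- Describe the layout of the files we want to read.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Expose an HDFS directory as a queryable table.
CREATE EXTERNAL TABLE dbo.Flights
(
    FlightDate date,
    Origin varchar(5),
    Dest varchar(5)
)
WITH (
    LOCATION = '/data/flights/',
    DATA_SOURCE = HDInsightCluster,
    FILE_FORMAT = CsvFormat
);
```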

As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.


Working With CosmosDB

Derik Hammer has an introductory article showing how to work with CosmosDB to store and use document-style data:

Querying Cosmos DB is more powerful and versatile. The CreateDocumentQuery method is used to create an IQueryable<T> object, a member of System.Linq, which can output the query results. The ToList() method will output a List<T> object from the System.Collections.Generic namespace.

Derik also shows how to import the data into Power BI and visualize it.  It’s a nice article if you’ve never played with CosmosDB before.


Scheduled U-SQL Jobs With Azure Data Factory

Melissa Coates shows how to schedule Azure Data Factory workflows to run U-SQL:

This post is a continuation of the blog where I discussed using U-SQL to standardize JSON input files which vary in format from file to file, into a consistent standardized CSV format that’s easier to work with downstream. Now let’s talk about how to make this happen on a schedule with Azure Data Factory (ADF).

This was all done with Version 1 of ADF. I have not tested this yet with the ADF V2 Preview which was just released.

It’s a bit lengthy, but Melissa lays it out step-by-step, making it straightforward to follow.


Using Query Performance Insight To Find High-IO Queries

Jim Donahoe shows how he used Azure’s Query Performance Insight to eliminate 10 billion logical reads:

To access QPI, you simply need to click on the database you want to work with. Once you click on your database, scroll down in the portal to Query Performance Insight (QPI). Once QPI opens, you will see three options to sort on: CPU, DATA I/O, and LOG I/O.  You can also set the timeframe to view; I set it for 24 hours.  Now, I have my timeline of 24 hours, and I am able to view which queries had the highest DATA I/O. I made a list of the top 3 from each category (CPU, DATA I/O, and LOG I/O) and presented it to my client. I presented the number of times each was executed, and the resources it utilized each time (all from the QPI information). The client then sent me 10 queries they wanted tuned and listed them in a prioritized list.

Well, by the end of tuning their 3 highest-priority queries, we removed over 10 billion logical reads!  Yep, 10 BILLION! The client was very happy with our results and is currently awaiting Standard Elastic Pools to come out of preview and become GA. I have provided a few screenshots of an AdventureWorksLT database on my personal instance just to show you how to access QPI and change metrics.
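Query Performance Insight is a portal feature, but it sits on top of Query Store, so you can pull a similar top-I/O list yourself with T-SQL along these lines (a sketch; the grouping and ordering choices are mine):

```sql
-- Top 10 queries by total logical reads across the captured Query Store intervals.
SELECT TOP (10)
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions) AS executions,
    SUM(rs.avg_logical_io_reads * rs.count_executions) AS total_logical_reads
FROM sys.query_store_query_text AS qt
    JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
    JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_logical_reads DESC;
```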

Click through for a demo.


SSIS In Azure

Richie Lee reports that SQL Server Integration Services is now available as a service in Azure:

I’ve written about it elsewhere in greater depth, but here are the headlines:

  • It makes use of SSIS Scale Out, which was released as part of SQL Server 2017.

  • Although it is based on SSIS Scale Out, you can’t actually configure SSIS Scale Out to run on the instance. If this confuses you then read my in-depth post.

  • SSISDB is installed either in SQL Azure or on a Managed Instance.

  • You don’t have to create Integration Services Catalog/SSISDB yourself; it is done for you. So that annoying key management is no longer a problem.

Richie’s got more to say on the topic, so check out the highlights and then his in-depth post.


Machine Learning Services Updates

Umachandar Jayachandran and team have been busy.  First, they announced a preview of SQL Server ML Services in Azure SQL Database:

In-database Machine Learning support was added in SQL Server 2016 and we are now bringing the same functionality to Azure SQL Database. You can now train and score machine learning models in Azure SQL Database and the predictions can be exposed to any application using your database, easily and seamlessly.

The preview functionality allows you to train and score machine learning models using data that fits in memory (in an R data frame). Please note that the amount of memory available for R script execution depends on the edition of the Azure SQL database and cannot be modified.
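Because it is the same functionality that shipped in SQL Server 2016, the entry point is the same sp_execute_external_script procedure. A minimal sketch, with hypothetical table and column names:

```sql
-- The @input_data_1 query arrives in R as the data frame InputDataSet;
-- whatever data frame is assigned to OutputDataSet comes back as a result set.
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- lm(sales ~ price, data = InputDataSet)    -- trained entirely in memory
        OutputDataSet <- data.frame(
            price = InputDataSet$price,
            predicted = predict(model, InputDataSet)
        )
    ',
    @input_data_1 = N'SELECT sales, price FROM dbo.SalesHistory'
WITH RESULT SETS ((price float, predicted float));
```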

No Python support there yet, but it’s upcoming.  Second, we can use the PREDICT function in Azure SQL Database:

Today we are announcing the general availability of the native PREDICT Transact-SQL function in Azure SQL Database. The PREDICT function allows you to perform scoring in real-time using certain RevoScaleR or revoscalepy models in a SQL query without invoking the R or Python runtime.

The PREDICT function support was added in SQL Server 2017. It is a table-valued function that takes a RevoScaleR or revoscalepy model & data (in the form of a table or view or query) as inputs and generates predictions based on the machine learning model. More details of the PREDICT function can be found here.

There are a limited number of models which support PREDICT—things like linear and logistic regression, RevoScaleR’s fast decision trees, etc.  If you have this type of model, however, the predictions stay within SQL Server and end up being much faster than going out to R.
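For reference, a minimal sketch of the syntax, assuming a serialized model already stored in a hypothetical dbo.Models table (the column declared in the WITH clause depends on what the model outputs):

```sql
-- Load a previously trained, serialized RevoScaleR/revoscalepy model.
DECLARE @model varbinary(max) =
    (SELECT model FROM dbo.Models WHERE model_name = 'SalesLinMod');

-- Score rows natively, without invoking the R or Python runtime.
SELECT d.sale_id, p.Score
FROM PREDICT(MODEL = @model, DATA = dbo.NewSales AS d)
WITH (Score float) AS p;
```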
