Press "Enter" to skip to content

Day: October 12, 2017

Azure Data Lake Store File Management With httr

Leila Etaati shows how to generate RESTful statements in R using httr:

In this post, I am going to share my experiment in how to do file management in ADLS using R studio,

to do this you need to have below items

1. An Azure subscription

2. Create an Azure Data Lake Store Account

3. Create an Azure Active Directory Application (for the aim of service-to-service authentication).

4. An Authorization Token from Azure Active Directory Application

It’s pretty easy to do, as Leila shows.

Comments closed

Using R In Azure Data Lake Analytics

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R Script. There’s a 500Mb limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.

Comments closed

Linear Regression With Deducer

Sunil Kappal demonstrates how to use Deducer, a GUI for R, to perform a simple linear regression:

Selecting the variables in the Deducer GUI:

  • Outcome variable: Y, or the dependent variable, should be put on this list

  • As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor level is correct

  • As factor: Categorically independent variables (language, ethnicity, etc.).

  • Weights: This option allows the users to apply sampling weights to the regression model.

  • Subset: Helps to define if the analysis needs to be done within a subset of the whole dataset.

Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.

Comments closed

Testing Azure Data Lake Store Performance

Zhen Zeng and Govind Kamat stress test Azure Data Lake Store:

Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.

The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop.  This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster.  The suite is comprised of three tools that are typically executed in sequence:

  • TeraGen, that tool that generates the input data.  We use it to test the write performance of HDFS and ADLS.

  • TeraSort, which sorts the input data in a distributed fashion.  This test is CPU bound and we don’t really use it to characterize the I/O performance or HDFS and ADLS, but it is included for completeness.

  • TeraValidate, the test that reads and validates the sorted data from the previous stage.  We use it to test the read performance of HDFS and ADLS.

It’s an interesting look at how well ADLS scales.  In general, my reading of this is fairly positive for Azure Data Lake Store.

Comments closed

New Management Studio Features

Wayne Sheffield looks at two new features in SQL Server Management Studio 17.3:

For this simple test, it worked pretty well, and it should work well for most of the requirements that you have. Time will tell how reliably this new feature does work.

The Import Flat File is available when connecting to SQL Server version 2005 or higher. I haven’t tried this on a lower version, but I don’t see any reason why it wouldn’t work there either. You can read more about this feature in Microsoft’s documentation.

You can definitely break the Import Flat File feature, but I appreciate it being smoother than the SSIS-based wizard of yore.  Wayne also shares his thoughts on the Extended Events Profiler.

Comments closed

Azure Database-Level Firewall Rules And Geo-Replication

Arun Sirpal explains that you don’t need to create database-level firewall rules in Azure on secondary databases when using Active Geo-Replication:

The main purpose of this post today is to discuss this point – If you have an Azure SQL Database involved in Active Geo Replication and opt to use database level firewall rules do you need to create the rules in both the primary and secondary database?

I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated) .

Read on for Arun’s demonstration.

Comments closed

Converting DATETIME2 To VARBINARY

Randolph West unravels a mystery around byte lengths:

Quite a lot to take in. Let’s break this down.

DATETIME2 is a data type that was introduced in SQL Server 2008. It uses up to 8 bytes to store a date and time: 3 bytes for the date component, and up to 5 bytes for the time component.

The point here is that it uses 8 bytes in total. That’s it. No more.

Jemma noted that when converting the DATETIME2 data type to BINARY, it suddenly became longer by exactly one byte, which seems strange.

Read on for the solution.

Comments closed

System Objects And Helper Functions

Kenneth Fisher compares and contrasts querying system tables versus using built-in helper functions which query the system tables:

So which should you use? Well, like most things in the database world, it depends. Do you need to cross databases? Do you need to use NOLOCK for a diagnostic query? Do you need additional information available in one of the views that isn’t available in one of the helper functions?

Click through for the comparison.  I agree with his end result, that you should at least know both methods, even if you lean toward one.

Comments closed