In this post, I am going to share my experiment with file management in ADLS using RStudio. To do this, you need the following items:
1. An Azure subscription
2. An Azure Data Lake Store account
3. An Azure Active Directory application (for service-to-service authentication)
4. An authorization token from the Azure Active Directory application
It’s pretty easy to do, as Leila shows.
The Azure Data Lake Store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within it directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.
To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There’s a 500 MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.
Click through for the details.
Selecting the variables in the Deducer GUI:
Outcome variable: Y, or the dependent variable, should be put in this list.
As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor levels is correct.
As factor: Categorical independent variables (language, ethnicity, etc.).
Weights: This option allows the users to apply sampling weights to the regression model.
Subset: Defines whether the analysis should be run on a subset of the whole dataset.
Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.
Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.
The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop. This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster. The suite comprises three tools that are typically executed in sequence:
TeraGen, the tool that generates the input data. We use it to test the write performance of HDFS and ADLS.
TeraSort, which sorts the input data in a distributed fashion. This test is CPU bound and we don’t really use it to characterize the I/O performance of HDFS and ADLS, but it is included for completeness.
TeraValidate, the test that reads and validates the sorted data from the previous stage. We use it to test the read performance of HDFS and ADLS.
It’s an interesting look at how well ADLS scales. In general, my reading of this is fairly positive for Azure Data Lake Store.
For this simple test, it worked pretty well, and it should handle most of the requirements you have. Time will tell how reliably this new feature works.
The Import Flat File feature is available when connecting to SQL Server version 2005 or higher. I haven’t tried this on a lower version, but I don’t see any reason why it wouldn’t work there either. You can read more about this feature in Microsoft’s documentation.
You can definitely break the Import Flat File feature, but I appreciate it being smoother than the SSIS-based wizard of yore. Wayne also shares his thoughts on the Extended Events Profiler.
The main purpose of this post today is to discuss this point: if you have an Azure SQL Database involved in Active Geo-Replication and opt to use database-level firewall rules, do you need to create the rules in both the primary and secondary databases?
I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated).
Read on for Arun’s demonstration.
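If you want to try this yourself, a minimal sketch of such a test might look like the following. The stored procedure and catalog view are the standard ones for database-level firewall rules; the rule name and IP range are made up for illustration:

```sql
-- Run against the PRIMARY database: create a database-level firewall rule.
-- (Hypothetical rule name and IP range.)
EXECUTE sp_set_database_firewall_rule
    @name             = N'AppServerRule',
    @start_ip_address = '10.0.0.1',
    @end_ip_address   = '10.0.0.16';

-- Then query the rules on both the primary and the geo-replicated secondary
-- to see where the rule shows up.
SELECT name, start_ip_address, end_ip_address, create_date, modify_date
FROM sys.database_firewall_rules;
```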
Quite a lot to take in. Let’s break this down.
DATETIME2 is a data type that was introduced in SQL Server 2008. It uses up to 8 bytes to store a date and time: 3 bytes for the date component, and up to 5 bytes for the time component.
The point here is that it uses 8 bytes in total. That’s it. No more.
Jemma noted that when converting the DATETIME2 data type to BINARY, it suddenly became longer by exactly one byte, which seems strange.
Read on for the solution.
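If you want to see the behavior for yourself before reading the explanation, a minimal repro (not Jemma’s exact script) looks something like this:

```sql
-- Compare the storage size of a DATETIME2 value with the length of
-- its binary conversion.
DECLARE @dt2 datetime2(7) = SYSDATETIME();

SELECT
    DATALENGTH(@dt2)                        AS datetime2_bytes, -- 8 bytes
    DATALENGTH(CAST(@dt2 AS varbinary(12))) AS binary_bytes;    -- 9 bytes: one byte longer
```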
So which should you use? Well, like most things in the database world, it depends. Do you need to cross databases? Do you need to use NOLOCK for a diagnostic query? Do you need additional information available in one of the views that isn’t available in one of the helper functions?
Click through for the comparison. I agree with his end result, that you should at least know both methods, even if you lean toward one.
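One common instance of this trade-off, assuming the comparison is between metadata helper functions such as OBJECT_NAME() and catalog views such as sys.objects (the table name dbo.SomeTable is made up for illustration):

```sql
-- Helper function: concise, and OBJECT_NAME() can even take a database_id
-- to reach across databases, but it does not honor a NOLOCK hint or
-- READ UNCOMMITTED, so it can block behind uncommitted DDL.
SELECT OBJECT_NAME(OBJECT_ID(N'dbo.SomeTable')) AS object_name;

-- Catalog view: exposes far more columns and accepts a NOLOCK hint,
-- which can matter for a quick diagnostic query on a busy system.
SELECT o.name, o.type_desc, o.create_date, o.modify_date
FROM sys.objects AS o WITH (NOLOCK)
WHERE o.name = N'SomeTable'
  AND o.schema_id = SCHEMA_ID(N'dbo');
```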