At Microsoft Ignite 2017, the Azure ML team announced new on-premises tools for doing machine learning. These tools are much more comprehensive, as they provide:
1- A workspace that helps with data wrangling
2- Data visualization
3- Easy deployment
4- Support for Python code
In this post and the next ones, I will share my experience working with these tools.
Click through for the step-by-step installation guide.
Excel is easy to use, but not user friendly
Excel is on nearly every desktop in any Windows-based organisation and, with the Master Data Services Add-in, it puts the data well within the reach of the users. Whilst it is simple, it is in no way user friendly when compared to other applications that your users may be using. Not to mention that for most this will be the only part of the solution they see! Wouldn’t it be great if there was a way to supply the same data but with an intuitive, mobile-ready front end that people enjoy using?
Developers are tightly constrained
Developers like to develop, not choose options from drop-down menus in a web-based portal. With MDS, not only can devs not make use of Visual Studio and the like, but they are very tightly constrained by the business rules engine. At this point we should be able to make use of our preferred IDE so that we can benefit from source control, frameworks and customised business logic.
Not scalable according to modern expectations
Finally, MDS cannot scale to handle any kind of “big data”. It’s a bit of a buzzword, but as businesses collect more and more data, we need a data management option that can grow with that data. Because MDS must be deployed from a server, there is no easy way to meet those big data requirements.
There are a few pieces to Matt’s solution, making for an interesting read.
It shows the many different layers involved with a product like Azure SQL Database. What happens if there is a loss of service for a specific component? Obviously we as customers would not be able to fix the issue, as this is the responsibility of Microsoft engineers; the key for me is being kept in the loop with the issue, and it is something that they do pretty well. So what happens if the load balancer has issues?
All communication is done via Service Health within the Azure portal.
Check the comments for another useful Azure status site.
In this post, I am going to share my experience doing file management in ADLS using RStudio.
To do this, you need the items below:
1. An Azure subscription
2. An Azure Data Lake Store account
3. An Azure Active Directory application (for service-to-service authentication)
4. An authorization token from the Azure Active Directory application
It’s pretty easy to do, as Leila shows.
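The original post works from RStudio; as a rough illustration of what the authorization token from item 4 is for, here is a minimal Python sketch instead. It only builds a WebHDFS-style request and does not send it. The account name, path and token are hypothetical placeholders, and the host pattern and `LISTSTATUS` operation are assumptions based on the WebHDFS-compatible REST interface of Azure Data Lake Store.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_adls_request(account: str, path: str, token: str,
                       op: str = "LISTSTATUS") -> Request:
    """Build (but do not send) a WebHDFS-style request against an ADLS account."""
    # Assumed host pattern: <account>.azuredatalakestore.net
    clean_path = quote(path.strip("/"))
    query = urlencode({"op": op})
    url = f"https://{account}.azuredatalakestore.net/webhdfs/v1/{clean_path}?{query}"
    # The Azure AD token travels as a standard bearer token header.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


# Hypothetical values, for illustration only.
request = build_adls_request("mylake", "/data/raw", "<token-from-azure-ad>")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can be read without a live subscription.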
The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.
To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There’s a 500 MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.
Click through for the details.
Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.
The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop. This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster. The suite is comprised of three tools that are typically executed in sequence:
TeraGen, the tool that generates the input data. We use it to test the write performance of HDFS and ADLS.
TeraSort, which sorts the input data in a distributed fashion. This test is CPU bound and we don’t really use it to characterize the I/O performance of HDFS and ADLS, but it is included for completeness.
TeraValidate, the test that reads and validates the sorted data from the previous stage. We use it to test the read performance of HDFS and ADLS.
It’s an interesting look at how well ADLS scales. In general, my reading of this is fairly positive for Azure Data Lake Store.
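When sizing a run of the Tera* suite, it helps to remember that TeraGen is told how many rows to write rather than how many bytes, and each row is 100 bytes. The Python sketch below is not from the linked post; the function names are made up for illustration. It converts a target data size into a row count and turns a measured duration into a throughput figure.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes: int) -> int:
    """Number of rows TeraGen must write to produce at least target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Ceiling division so the generated data is never smaller than requested.
    return -(-target_bytes // TERAGEN_ROW_BYTES)


def throughput_mb_per_s(bytes_moved: int, seconds: float) -> float:
    """Average throughput in MB/s (1 MB = 1,000,000 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1_000_000


# A 1 TB input needs ten billion rows.
print(teragen_rows(10**12))
```

Dividing the cluster-wide figure by the number of Data Nodes gives the per-node number the excerpt is interested in.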
The main purpose of this post today is to discuss this point: if you have an Azure SQL Database involved in Active Geo-Replication and opt to use database-level firewall rules, do you need to create the rules in both the primary and secondary database?
I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated).
Read on for Arun’s demonstration.
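For readers who have not created a database-level firewall rule before, the sketch below shows the general shape of one. It is not the statement from Arun's post, which is not reproduced here. It is a small Python helper (a hypothetical name) that validates an IP range and formats a call to the `sp_set_database_firewall_rule` stored procedure; the rule name and addresses are placeholders from a documentation-only range.

```python
import ipaddress


def database_firewall_rule_sql(name: str, start_ip: str, end_ip: str) -> str:
    """Format a T-SQL call that creates or updates a database-level firewall rule."""
    # IPv4Address raises ValueError on malformed input, so bad ranges fail early.
    start = ipaddress.IPv4Address(start_ip)
    end = ipaddress.IPv4Address(end_ip)
    if start > end:
        raise ValueError("start_ip must not be greater than end_ip")
    safe_name = name.replace("'", "''")  # escape single quotes for T-SQL
    return (
        "EXECUTE sp_set_database_firewall_rule "
        f"@name = N'{safe_name}', "
        f"@start_ip_address = '{start}', "
        f"@end_ip_address = '{end}';"
    )


# Placeholder rule name and addresses.
print(database_firewall_rule_sql("OfficeRange", "203.0.113.10", "203.0.113.20"))
```

The generated statement is meant to be run while connected to the user database itself, which is what distinguishes it from a server-level rule.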
But now we run into a problem: there are certain ports which need to be open for Polybase to work. This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs. This goes back to the issue we see with spinning up data nodes in Docker: ports are not available. If you put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.
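A quick way to see whether a port such as 50010 is reachable from the machine running SQL Server is a plain TCP connection attempt. This Python sketch is a generic check, not something from the linked post, and the host name in the usage line is a hypothetical placeholder.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical data node name; replace with a real host to try it.
    print(port_is_open("datanode1.example.internal", 50010))
```

A `False` result only says the connection failed from this machine; it does not tell you whether a firewall, a network security group or the service itself is the cause.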
Querying Cosmos DB is more powerful and versatile. The CreateDocumentQuery method is used to create an IQueryable<T> object, a member of System.Linq, which can output the query results. The ToList() method will output a List<T> object from the System.Collections.Generic namespace.
Derik also shows how to import the data into Power BI and visualize it. It’s a nice article if you’ve never played with Cosmos DB before.
This post is a continuation of the blog where I discussed using U-SQL to standardize JSON input files which vary in format from file to file, into a consistent standardized CSV format that’s easier to work with downstream. Now let’s talk about how to make this happen on a schedule with Azure Data Factory (ADF).
This was all done with Version 1 of ADF. I have not tested this yet with the ADF V2 Preview which was just released.
It’s a bit lengthy, but Melissa lays it out step-by-step, making it straightforward to follow.