Yan Li has a three-part series looking at management of Azure Data Lake compute. First, an overview:
Scenario 2: Set One Specific Group to Different Limits
New members are joining and sharing the same ADLA account. To prevent any new members, who are just learning ADLA, from mistakenly submitting a job that consumes too much compute resource (increasing cost and blocking other jobs), customers want to set the maximum AU per job for new employees at 30 AUs while others can submit jobs with up to 100 AUs.
- Job AU limit: 100
- Priority limit: 1
Exception Policy: New Employee Policy
Job AU limit: 30
Priority limit: 200
Group: New Employee Group
Next up is a look at job-level policies:
With job-level policies, you can control the maximum AUs and the maximum priority that individual users (or members of security groups) can set on the jobs that they submit. This allows you to not only control the costs incurred by your users but also control the impact they might have on high priority production jobs running in the same ADLA account.
There are two parts to a job level policy:
- Default Policy: This is the policy that is applied to all users of the service.
- Exceptions: The set of “exception” policies apply to specific users.
Submitted jobs that do not violate the job-level policies are still subject to the account level policies as described in Azure Data Lake Analytics Account Level Policy.
Finally, account-level policies:
ADLA supports three types of account-level policies:
Maximum AUs — Controls the maximum number of AUs that can be used by running jobs
Maximum Number of Running Jobs — Controls the maximum number of concurrently running jobs.
Days to Retain Job Queries — Controls how long detailed information about jobs are retained in the users ADLS account.
There’s a good amount of information here.
At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package . As we’ve noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.
— David Smith (@revodavid) June 7, 2017
That lowers the barrier to usage significantly, so it’s a very welcome update.
In the previous blog post we created an Azure cloud service. Now we are going to create a private virtual Azure network. The importance of this is that when you create a virtual machine in Azure you will use this virtual network to connect to your virtual machine.
This is a screenshot-driven, step-by-step post that makes setting these up easy.
Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:
The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:
“none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.
“keep” – Don’t change the compression of the files but copy them as-is.
This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.
We start with a 16.04 image, we run some upgrades, install python, upgrade pip, install our requirements and expose port 8888 (jupyter’s default port).
Here is our requirements.txt file
Notice how Jupyter is in there, I also added a few other things that I very commonly use including numpy, pandas, plotly, scikit-learn and some azure stuff.
The big benefit to doing this is that your installation of Jupyter can exist independently from your notebooks, so if you accidentally mess up Jupyter, you kill and reload from the image in a couple commands.
As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note HDFS in brackets beside the Azure HDInsight and Hadoop File options as this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures.
But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in PowerBI.
Read the whole thing. Connecting to Hive is pretty easy.
Once you create your account, you can then start creating runbooks. You can do just about anything with the runbooks. There are numerous existing run books that you can browse through and modify for your own use, including provisioning, monitoring, life cycle management, and more.
You can create the runbooks offline, or using the Azure Portal, and they’re built using PowerShell. In this example, we will reuse the code from the PowerShell demo and also demonstrate how we can use the built in Azure Service scheduler to run our existing PowerShell code and not have to rely on an on-premises scheduler, task scheduler, or Azure VM to schedule a job.
Read the whole thing if you have Azure SQL Database instances in your environment.
2. Vertical queries (in preview): A vertical elastic query is a query that is executed across databases that contain different schemas and different data sets. An elastic query can be executed across any two Azure SQL Database instances. This is actually really easy to set up and that what this blog post is about! The diagram below represents a query being issued against tables that exist in separate Azure SQL Database instances that contain different schemas.
Read on to learn how to implement vertical elastic queries today.
When the IoT Hub is created you will get an endpoint hosted in Azure. This is the target for the JSON events being generated from the mobile device.
Azure IoT Hubs are more complex than an Azure Event Hub, perform a lot more device based functions and also have stronger security capability. However, operationally they work pretty much the same.
If you want to learn more about the differences between the two Hubs then this is a great article – https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-compare-event-hubs
It’s a neat tutorial for a fun weekend project.
I think it is important to highlight a couple of points, more specifically around the requirement of ADALSQL.DLL and proper setup of AD which I will highlight below and reference some links, please do this as it lays the foundation for you.
You need ADALSQL.DLL which is part of the latest SQL Server Management Studio (SSMS) to test access. This stands for Active Directory Authentication Library for SQL Server.
This goes through some of the issues Arun had setting everything up and provides workarounds and explanations.