2) Create Databricks Service
Yes you are reading this correctly. Under the hood Data Factory is using Databricks to execute the Data flows, but don’t worry you don’t have to write code.
Create a Databricks Service and choose the right region. This should be the same as your storage region to prevent high data movement costs. As Pricing Tier you can use Standard for this introduction. Creating the service it self doesn’t cost anything.
Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool but even within that space, there are times when you need to do a bit more than pump-and-dump.
The steps that need to be performed are:
– Allow the new port in the RHEL firewall
– Change the SSH daemon to listen on the new port
– Add an incoming rule in the VM network security group for the new port
– Remove the rule that allows port 22
The Ubuntu process will be pretty close to this as well.
We understand that streaming data isn’t typically considered “Data Science” by itself. However, it’s are often associated and setting up this background now opens up some cool applications in later posts. For this post, we’ll cover how to sink streaming data to Power BI using Stream Analytics.
The previous posts in this series used Power BI Desktop for all of the showcases. This post will be slightly different in that we will leverage the Power BI Service instead. The Power BI Service is a collaborative web interface that has most of the same reporting capabilities as Power BI Desktop, but lacks the ability to model data at the time of writing. However, we have heard whispers that data modeling capabilities may be coming to the service at some point. The Power BI Service is also the standard method for sharing datasets, reports and dashboards across organizations. For more information on the Power BI Service, read this.
Brad has a nice demo, so check it out.
I logged a case with MS Support and when they came back to me, they advised that the service principal that is spun up in the background had expired. This service principal is required to allow the cluster to interact with the Azure APIs in order to create other Azure resources.
When a service is created within AKS with a type of LoadBalancer, a Load Balancer is created in the background which provides the external IP I was waiting on to allow me to connect to the cluster.
Because this principal had expired, the cluster was unable to create the Load Balancer and the external IP of the service remained in the pending state.
There were a lot of steps here; click through to see just how many.
To generate these, you can log into your AWS dashboard, go to the IAM (Identity and Access Management) dashboard and select the
Userstab. On the
Userstab, add a user and also the administration rights that you want the user to have.Remember to restart R once you have filled in the access key information in the .Renviron file for it to take effect.
At this point, those familiar with
cloudyrsuite is probably asking – “This is exactly the same as
library(aws.ec2), so why use boto3?“. Well, to be honest, I was using
aws.ec2for a while, but I find spot-instances, which the current version of
aws.ec2does not support. In addition I found that
boto3has some other functionalitue – which I prefer. For a full list of
boto3functions to interact with an EC2 instance, have a look at the reference manual.
It’s pretty good stuff; check it out.
Azure SQL Managed Instances do not utilise windows authentication – so your two methods of authenticating applications and users are:
– SQL Authentication:This authentication method uses a username and password.
– Azure Active Directory Authentication:This authentication method uses identities managed by Azure Active Directory and is supported for managed and integrated domains. Use Active Directory authentication (integrated security) whenever possible.
Hamish also elaborates on some of the trickier bits about Azure Active Directory for someone used to on-prem AD solutions.
However, I’m working with a customer who is building their own service based on Azure SQL Database, and I have fully automated their database deployment process. I wanted to take this a few steps further and add the SQL Analytics step as part of our deployment. This was harder than I expected it to be—the code samples in the books online post above weren’t working in my environment. And furthermore, once I got it working, I was having failures in my Azure Automation Runbook once I got the code running in the PowerShell ISE (I was having issues using VS Code on my Mac).
Joey takes us through the problems and provides a working script.
When I speak about CosmosDB, I always get questions like “How can I retrieve information about the execution plans?” or “Isn’t there a tool like SSMS which can show me what’s happening in the background?” Usually, questions like that comes from DBAs. If you have questions like that, I have good and bad news for you. Good news is, Yes you can get retrieve metrics from CosmosDB about execution plans. Bad news is, you need to know some programming to be able to do that because you need to use CosmosDB SDK.
The only way to access this information is from CosmosDB SDK 2.x. I couldn’t retrieve execution metrics by using SDK 3.x for custom queries. Here is the available metrics you can retrieve from CosmosDB queries.
I wonder if this is a “this is still new” thing, a “you don’t need these where you’re going” thing, or a “this is exactly how we envisioned implementation” thing. Especially around getting per-query metrics after the fact.
Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not as sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances. However, the speedup depends on the engine and the speedup for r5 (Intel) is lower for Spark (10%) than for Hive (25%).
– r5 instances are also either cheaper (by about 10% for Hive) or the same cost (for Spark)
At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.
RStudio Server Pro is now available on the Azure Marketplace, the company announced on the RStudio Blog earlier this month. This means you can launch RStudio Server Pro on an virtual machine with the memory, disk, and CPU configuration of your choice, and pay by the minute for the VM instance plus a the RStudio software charge. Then, you can use a browser to access the remote RStudio Server (the interface is nigh-indistinguishable from the desktop version), with access to the commercial features of RStudio including support for multiple R version and concurrent R sessions, load-balancing and high availability instances, and enhanced security.
RStudio Server Pro and Microsoft R Server are both very nice for production-quality R servers. You can get away with the open source versions, but there are some good reasons to use the enterprise-grade products in an enterprise.