As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up having 20+ single scripts which was not really maintainable any more. So I packed them into a PowerShell module and also published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
This looks like a pretty good module if you work with Databricks.
If you have run through my last Managed Instance blog post, you have a Managed Instance at your disposal. The PowerShell script for creating the network requirements also contains steps to create an Azure VM in a different subnet in the same VNet. Unless you have a site-to-site VPN or Express Route between your on-prem environment and Azure, you will use this VM to connect to your Managed Instance.
Install Management Studio on the Azure VM. To connect to your Managed Instance, you will need the host name for your Managed Instance. You can find the Managed Instance host name on the resource page for your Managed Instance in the Portal.
I think this migration story is a bit easier for DBAs than the old Azure SQL Database strategy of building dacpacs.
The concept of workload management is a key factor for Azure SQL DW as there is only limited concurrency slots available and depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot is opened up. Let’s recap what Resource Classes are and how they affect workload management.
A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a “large” resource class and be allocated a generous amount of the server, however an analyst may use a “small” resource class and therefore only use up a small amount of the server with their queries. There are actually 2 types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query and actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the Server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the Server and therefore as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker.
This looks like a useful addition. Click through for a few examples of how it will work.
An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).
In effect dataflows are an online data collection and storage tool.
- Collection: It uses Power Query to connect to the data at the source and transform that data as needed.
- You will need to be able to access the data either through a cloud service (such as Dynamics 365) or to your PC/Network via a gateway.
- You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
- Storage: Dataflows then stores that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view) directly from Power BI Desktop.
Dataflows leverage the Power Query skills you have learnt (or are learning) using other tools (like Power BI Desktop, Power Query for Excel) allowing you to reuse those same skills in this online tool.
Tables that are created as a result of the dataflow are stored in an Azure Data Lake.
- If you don’t know what that is, don’t worry – I don’t understand it either. The point is it doesn’t matter because it is all done automatically for you by the tool.
Dataflows include the concept of the common data service (CDS) or common data model directly in the tool and you don’t have to know what it is, nor care.
If you don’t know what that is, don’t worry – it doesn’t matter now/yet.
This will become very important in the future as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier in the future.
Click through for more detail as well as some good uses for Dataflows.
As background, some of you may remember the AzureSMR package, which was written a few years back as an R interface to Azure. AzureSMR was very successful and gained a significant number of users, but it was never meant to be maintainable in the long term. As more features were added it became more unwieldy until its design limitations became impossible to ignore.
The AzureR family is a refactoring/rewrite of AzureSMR that aims to fix the earlier package’s shortcomings.
The core package of the family is AzureRMR, which provides a lightweight yet powerful interface to Azure Resource Manager. It handles authentication (including automatically renewing when a session token expires), managing resource groups, and working with individual resources and templates. It also calls the Resource Manager REST API directly, so you don’t need to have PowerShell or Python installed; it depends only on commonly used R packages like httr, jsonlite and R6.
This won’t replace the Powershell libraries, but looks like it’d be useful for scenarios like if you need to set up a VM, train a model, and then shut down the VM.
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).
Although people mentioned in their GitHub page that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is if you have an Apache Spark 2.2.1 cluster with Scala 2.11. The cluster I used for this experience consisted of a Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled.
In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.
It sounds like this is much faster than using U-SQL to perform the same task.
If you, like me, are a SQL Server guy, you are probably quite familiar with installing SQL Server instances by mounting an ISO file, and running setup. Well, you can forget all that when you deploy a SQL Server 2019 Big Data Cluster. The setup is all done via Python utilities, and various Docker images pulled from a private repository. So, you need Python3. On my box I have Python 3.5, and – according to Microsoft – version 3.7 also works. Make you that you have your Python installation on the path.
When you deploy you use a Python utility:
mssqlctl. To download
mssqlctl, you need Python’s package management system
pipinstalled. During installation you also need a Python HTTP library: Requests. If you do not have it you need to install it:
python -m pip install requests
This isn’t available to the general public quite yet, but when it is publicly available (or if you are part of the Early Access Program), the instructions are nice and clear.
As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement). It is in private preview in US East 2, US Central and US West regions.
This is a performance tier in Azure Blob Storage, complimenting the existing Hot, Cool, and Archive tiers. Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.
It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions. It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.
It’s in private preview for now, but my guess is that it’ll be available to the general public soon enough.
How do I run SQL Server on AWS?
Running SQL Server on AWS can be done in 2 ways.
Relation Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.
EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.
Click through for the comparison.
We recently implemented a Spark streaming application, which consumes data from from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.
As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.
In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.
Check it out. If this is interesting, Priya’s blog has the full series.