As background, some of you may remember the AzureSMR package, which was written a few years back as an R interface to Azure. AzureSMR was very successful and gained a significant number of users, but it was never meant to be maintainable in the long term. As more features were added it became more unwieldy until its design limitations became impossible to ignore.
The AzureR family is a refactoring/rewrite of AzureSMR that aims to fix the earlier package’s shortcomings.
The core package of the family is AzureRMR, which provides a lightweight yet powerful interface to Azure Resource Manager. It handles authentication (including automatically renewing when a session token expires), managing resource groups, and working with individual resources and templates. It also calls the Resource Manager REST API directly, so you don’t need to have PowerShell or Python installed; it depends only on commonly used R packages like httr, jsonlite and R6.
This won’t replace the Powershell libraries, but looks like it’d be useful for scenarios like if you need to set up a VM, train a model, and then shut down the VM.
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).
Although people mentioned in their GitHub page that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is if you have an Apache Spark 2.2.1 cluster with Scala 2.11. The cluster I used for this experience consisted of a Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled.
In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.
It sounds like this is much faster than using U-SQL to perform the same task.
If you, like me, are a SQL Server guy, you are probably quite familiar with installing SQL Server instances by mounting an ISO file, and running setup. Well, you can forget all that when you deploy a SQL Server 2019 Big Data Cluster. The setup is all done via Python utilities, and various Docker images pulled from a private repository. So, you need Python3. On my box I have Python 3.5, and – according to Microsoft – version 3.7 also works. Make you that you have your Python installation on the path.
When you deploy you use a Python utility:
mssqlctl. To download
mssqlctl, you need Python’s package management system
pipinstalled. During installation you also need a Python HTTP library: Requests. If you do not have it you need to install it:
python -m pip install requests
This isn’t available to the general public quite yet, but when it is publicly available (or if you are part of the Early Access Program), the instructions are nice and clear.
As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement). It is in private preview in US East 2, US Central and US West regions.
This is a performance tier in Azure Blob Storage, complimenting the existing Hot, Cool, and Archive tiers. Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.
It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions. It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.
It’s in private preview for now, but my guess is that it’ll be available to the general public soon enough.
How do I run SQL Server on AWS?
Running SQL Server on AWS can be done in 2 ways.
Relation Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.
EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.
Click through for the comparison.
We recently implemented a Spark streaming application, which consumes data from from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.
As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.
In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.
Check it out. If this is interesting, Priya’s blog has the full series.
Let’s start with the positive.
Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.
It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage etc) are hidden from the user. If you thought SQL Server is easy with it’s “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your finger tips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW for example, you have to about distribution of the data, how you are going to set things up etc. Not here.
It’s not all positive, but Koen seems quite happy to work with the product.
With the Execute R Script module you can immediately use more than 650 R packages which come preinstalled in the Azure ML Studio environment. You can also use other R packages (including packages not on CRAN) and source in R scripts you develop elsewhere (as shown above), although this does require the time to install them in the Studio environment. You can even create custom ML Studio models encapsulating R code for others to use in the drag-and-drop environment.
If you’re new to Azure ML Studio, check out the Quickstart Tutorial for R to learn how use the Execute R Script module, and to check out what’s new in the latest update follow the link below.
Click through for more details.
It is said to be safe, reliable and proven using complex algorithms and built-in intelligence where it can do the following (see this link for more details: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-automatic-tuning)
- CREATE INDEX – identifies indexes that may improve performance of your workload, creates indexes, and automatically verifies that performance of queries has improved.
- DROP INDEX – identifies redundant and duplicate indexes daily, except for unique indexes, and indexes that were not used for a long time (>90 days). Please note that at this time the option is not compatible with applications using partition switching and index hints.
- FORCE LAST GOOD PLAN – identifies SQL queries using execution plan that is slower than the previous good plan, and queries using the last known good plan instead of the regressed plan.
Personally I don’t enable the option where it is allowed a “free-for-all” when creating/dropping indexes and forcing certain query plans. I like controlling the change, especially for production databases. To force this concept I wanted to use Extended Events to know when / if someone changed my settings for automatic tuning against my database.
Click through for the script.
The problem with self-service BI is that you never quite know what your users are up to. For example, what data sources are they using? Are there hundreds of Excel files being used as data sources for reports that you don’t know about? If so, where are they? Could they and should they be replaced by a database or something else more robust? In this post I’ll show you how you can use Microsoft Flow and the Power BI REST API (see part 1 to find out how to create a Flow custom connector to call the Power BI API) to get the details of all the data sources used in all of the workspaces of your Power BI tenant.
I’ll admit that doing this turned out to be a bit trickier than I had expected. My plan was to use the GetDatasetsAsAdmin endpoint to get a list of all datasets, loop over each one and then call the (undocumented, but in the REST API’s Swagger file and therefore in my custom connector) GetDatsourcesAsAdmin endpoint to get the datasources in each dataset. Both these endpoints require administrative permissions to call, so I made sure my custom connector had the correct permissions (at least Tenant.Read.All – you can check this in the Azure Portal on the app you registered in Azure Active Directory) and I ran the Flow as a user with Power BI Admin permissions. But I kept getting 404 errors when requesting the data sources for certain datasets .
Chris explains why those 404s appear and what you can do about them.