Press "Enter" to skip to content

Category: Cloud

Installing Python Libraries on EMR Clusters with Notebooks

Parag Chaudhari shows how we can install Python libraries on existing Elastic MapReduce clusters using EMR Notebooks:

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).

Read the whole thing.


Ordered Clustered Columnstore Indexes in Azure SQL DW

Niko Neugebauer takes us through a new feature in preview for Azure SQL Data Warehouse:

After creating (or dropping and recreating) a Clustered Columnstore Index, we can specify the reserved word ORDER and then one or !!!MULTIPLE!!! columns. This looks like an extremely promising feature!

On Azure SQL Data Warehouse, one can of course define a table as a Columnstore, and with that specification it is also possible to define an ORDER option with one or multiple columns.

For syntax and basic functionality testing purposes on Azure SQL Data Warehouse, let us then create a table with a Clustered Columnstore Index, load some data, and see if by recreating an Ordered Clustered Columnstore Index we can achieve some improvements.

Niko has a few hard-earned lessons from this post.
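If you want to try the syntax yourself, here is a minimal sketch of defining an Ordered Clustered Columnstore Index at table creation time on Azure SQL Data Warehouse, run from PowerShell via Invoke-Sqlcmd (SqlServer module); the server, database, credentials, and table definition are all placeholders:

# Minimal sketch: placeholder table, distribution, and ORDER columns.
$query = @"
CREATE TABLE dbo.FactSale
(
    SaleDate   DATE           NOT NULL,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX ORDER (SaleDate, CustomerId)
);
"@

Invoke-Sqlcmd -ServerInstance 'yourserver.database.windows.net' `
    -Database 'yourdatawarehouse' `
    -Username 'youruser' -Password 'yourpassword' `
    -Query $query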


SQL Server on Azure: Performance Optimized Storage Config

Mine Tokus announces a new feature when using Azure to host IaaS SQL Server instances:

Today, we are excited to announce Performance Optimized Storage Configuration capabilities for the VMs registered with SQL VM RP. This feature automates storage configuration according to performance best practices for SQL Server on Azure virtual machines through the Azure Portal or Azure Quickstart Templates when creating a SQL VM. Automated performance best practices include:

– separating Data and Log files
– cache configuration for premium disks hosting data and log files
– support for Temp DB on local disk
– support for Ultra disks to host data, log, or Temp DB files
– database engine-only images

In this article, we will discuss each automated performance best practice in detail.

Read on for the description and check out those links for additional information.


Migrating Databricks Workspaces

Gerhard Brueckl has made DatabricksPS better:

I do not know what the problem is/was, and I did not have time to investigate, but instead needed to come up with a proper solution in time. So I had a look at what needs to be done for a manual export. Basically, there are 5 types of content within a Databricks workspace:

– Workspace items (notebooks and folders)
– Clusters
– Jobs
– Secrets
– Security (users and groups)

For all of them, Databricks provides an appropriate REST API to manage them and also to export and import their contents. This was fantastic news for me, as I knew I could use my existing PowerShell module DatabricksPS to do all of this without having to re-invent the wheel.

I’ve used DatabricksPS and really like it for cases where I’d have to loop with the Databricks REST API—for example, when uploading files.
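For a sense of what DatabricksPS is wrapping, here is a rough sketch of pulling workspace notebooks straight from the documented Databricks REST API with Invoke-RestMethod; the workspace URL and token are placeholders, and a real export would also recurse into DIRECTORY items and cover the other four content types:

$baseUri = 'https://westeurope.azuredatabricks.net/api/2.0'
$headers = @{ Authorization = "Bearer $env:DATABRICKS_TOKEN" }

# List items under the workspace root (object_type is NOTEBOOK or DIRECTORY).
$items = Invoke-RestMethod -Uri "$baseUri/workspace/list?path=%2F" -Headers $headers

foreach ($item in $items.objects | Where-Object { $_.object_type -eq 'NOTEBOOK' }) {
    # Export the notebook source; the API returns it base64-encoded.
    $path    = [uri]::EscapeDataString($item.path)
    $export  = Invoke-RestMethod -Uri "$baseUri/workspace/export?path=$path&format=SOURCE" -Headers $headers
    $content = [Text.Encoding]::UTF8.GetString([Convert]::FromBase64String($export.content))
    # ... write $content to disk, keyed by $item.path
}

Clusters, jobs, secrets, and security each have their own endpoints as well, which is what makes a full workspace migration scriptable.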


Backing Up Cosmos DB

Josh Smith takes us through backing up Cosmos DB yourself:

Unfortunately, if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure), then Data Factory won’t work as of this writing, as Azure Data Factory doesn’t operate like a trusted Azure service and presents an IP address from somewhere in the data center where it is spun up. Thankfully, they are working on this. In the meantime, however, the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware that, in addition to the RU cost of returning the data, if you bring these backups back out of the data center where the Cosmos DB lives, you’ll also incur egress charges on the data.

Having a plan for this kind of thing is important, even if you normally rely on service-provided automated backups.
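For illustration, a scripted dump with the Cosmos DB Data Migration Tool (dt.exe) can look roughly like this from PowerShell; the endpoint, key, database, collection, and file paths are placeholders, and Josh's post has the actual scripts he uses:

# Rough sketch: dump one collection to a JSON file with the Cosmos DB Data Migration Tool.
# Connection string, collection name, tool location, and output path are placeholders.
$connectionString = 'AccountEndpoint=https://yourcosmos.documents.azure.com:443/;AccountKey=yourkey;Database=yourdb'

& 'C:\Tools\dt\dt.exe' `
    /s:DocumentDB `
    /s.ConnectionString:$connectionString `
    /s.Collection:yourcollection `
    /t:JsonFile `
    /t.File:"C:\Backups\yourcollection_$(Get-Date -Format yyyyMMdd).json" `
    /t.Overwrite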


Building an Azure Usage Report with PowerShell

June Castillote shows us how we can use PowerShell to get usage data from Azure for our subscriptions:

In the section above, it would be common for the command to return many thousands of objects, especially for long date ranges. To prevent overwhelming the API, the Get-UsageAggregates command only returns a maximum of 1000 results. If you’ve saved the $usageData variable as covered in the previous section, you can confirm it by running $usageData.UsageAggregations.count.

What if there are more than 1000 results? You’re going to have to do a little more work.

Knowing how much you’re spending is critical in an Op-Ex world like Azure or AWS.
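That "little more work" is a paging loop. A minimal sketch, assuming you are already signed in with Connect-AzAccount and have the Az.Billing module, is to keep calling Get-UsageAggregates with the continuation token from the previous page until none comes back:

# Page through usage data until the API stops returning a continuation token.
$end   = (Get-Date).Date
$start = $end.AddDays(-30)

$allUsage = @()
$token    = $null

do {
    $params = @{
        ReportedStartTime = $start
        ReportedEndTime   = $end
    }
    if ($token) { $params.ContinuationToken = $token }

    $page      = Get-UsageAggregates @params
    $allUsage += $page.UsageAggregations
    $token     = $page.ContinuationToken
} while ($token)

$allUsage.Count   # total rows across all pages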


SQL Server Monitoring with Grafana and Telegraf

Denzil Ribeiro shows how you can use Telegraf and Grafana to monitor Azure SQL Database databases:

SQL DB Storage
This is the dashboard for monitoring file size and space used, IO latency, and IO throughput, for each file in the database. When using Standard or General Purpose databases, which use data files in Azure blob storage, storage performance depends on several blob properties, exposed via the FILEPROPERTYEX() function.

Aside from a couple Azure SQL DB-specific steps, it’s basically the same process that Tracy Boggiano has for monitoring on-prem and IaaS SQL Server instances.
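If you want to look at those blob properties directly rather than through the dashboards, a quick sketch (connection details are placeholders) is to query FILEPROPERTYEX() for each file in the database:

# Rough sketch: inspect blob storage properties for each file in an Azure SQL Database.
# BlobTier and AccountType are two of the extended properties FILEPROPERTYEX() exposes.
$query = @"
SELECT name,
       physical_name,
       FILEPROPERTYEX(name, 'BlobTier')    AS BlobTier,
       FILEPROPERTYEX(name, 'AccountType') AS AccountType
FROM sys.database_files;
"@

Invoke-Sqlcmd -ServerInstance 'yourserver.database.windows.net' `
    -Database 'yourdatabase' `
    -Username 'youruser' -Password 'yourpassword' `
    -Query $query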


ADLS Gen2 Navigation in Power Query

Chris Webb shows off hierarchical navigation in Power Query against Azure Data Lake Storage Gen2:

While the documentation on how to import data from Azure Data Lake Gen2 Storage into Power BI is pretty detailed, the connector (which at the time of writing is in beta) that supports this functionality in the Power Query engine has some useful functionality that isn’t so obvious. If you look at the built-in documentation on the AzureStorage.DataLake M function in the Power Query Editor you’ll see there are a lot of options that aren’t in the documentation on the web yet:

Click through for an example.


Hammering Azure SQL DB

John Morehouse shows how we can configure HammerDB to run against Azure SQL Database:

Benchmarking your environment is an important step when introducing new hardware, and it is accomplished by running a test workload against that hardware. There are multiple ways to accomplish this to get SQL Server performance data. One of these methods is using HammerDB, a free tool that provides TPC-standard benchmarking metrics for multiple database systems, including Microsoft SQL Server. These metrics are an industry standard and are defined by the Transaction Processing Performance Council (TPC). The results from benchmarking will help you to ensure that the new infrastructure will be able to support the expected workload.

Azure introduces ways to quickly implement new hardware. However, if the Azure environment isn’t set up correctly, you can introduce issues that could potentially degrade performance. Thankfully, HammerDB is Azure-aware, which allows you to easily benchmark your cloud environment.

HammerDB isn’t as good as having your own perfectly-created set of scripts which exactly replicate your environment, but I’ve never seen an environment with one of those.


Variable Header Counts and Azure Data Factory

Mark Kromer shows how you can convince Azure Data Factory to skip a variable number of lines before beginning processing:

A common requirement that I see from customers who are processing text files in data lakes with Azure Data Factory is to read and process files where there are variable numbers of lines that precede both the data headers and the data that needs to be processed. ADF already has facilities that handle the ability to switch headers off or on as well as the ability to specify parameterized skip line counts. However, in many cases, files that are received for processing have variable numbers of superfluous lines that need to be skipped.

In ADF, between pipeline activities and data flows, there are a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique. 

Read on to see which technique Mark chose and how to get it working.
