On a recent project I used Azure Data Factory (ADF) to retrieve data from an on premises SQL Server 2014 instance and land them in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the DMG for development on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production, and made sure the settings for development and production were the same. Then we set up a separate server for the production DMG.
Deployment and execution went well in the dev environment. Testing was completed, so we deployed to prod. Deployment went fine, but the pipelines failed execution and returned the following error on the output data sets.
Weird solution, but I’m going to guess that it makes perfect sense if you are able to look at the code.
Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn’t always a problem!
If I wanted to perform a count of records, grouped by a particular field, I can perform this on a round-robin table. Each distribution will run the query in parallel and return it’s grouped results. The results can be simply added together as a second part of the query, and adding together 60 smaller datasets shouldn’t be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
However, the issues arise when we have multiple tables in our query. In order to join two tables. Let’s take a very simple join between a fact table and a dimension. I’ve shown 6 distributions for simplicity, but this would be happening across all 60.
Figuring out which distribution key to use can make a huge difference in performance.
Recently a client found this article on “Best Practices for SQL Server in Azure Virtual Machines” and wanted to re-provision his volumes to adhere to them.
No my first thoughts was wait, I’m a DBA, not a System Admin that’s not my role! But thinking more about it I realized the client views this as a SQL Server issue and I am the SQL Server Consultant and that it is my job to remedy this problem.
Not being 100% confident in Azure, I spun up a VM SQL Server and attempted to add some volumes. To my surprise, this was way too easy.
Click through for the steps.
All communication with the Azure Storage via connection strings and BLOB URLs enforce the use of HTTPS, which provides Encryption in Transit. You can enforce the use of “Always HTTPS” by setting the connection string like this: “DefaultEndpointsProtocol=https;AccountName=myblob1…” or in SAS signatures, as in the example below:
To protect data at rest, the service provides an option to encrypt the data as they are stored in the account. There’s no additional cost associated with encrypting the data at rest and it’s a good idea to switch it on as soon as the account is created. There is a one-click setting at the Storage Account level to enable it, and the encryption is applied on both new and existing storage accounts. The data is encrypted with AES 256 cipher and it’s now generally available to all Azure regions and Azure clouds (public, government etc)
There’s some good information here, making it worth the read.
The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely-used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high throughput enterprise-scale hierarchical file system data lake repository for big data.
Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!
That’s an interesting development.
Classify text using BigDL
In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.
We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.
There’s a lot to this tutorial.
We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.
The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it’s not just luck.
It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.
Power BI Privacy Levels “specify an isolation level that defines the degree that one data source will be isolated from other data sources”. After working through some testing scenarios and trying to discover the real impact to data security, I was unable to effectively show how this might have any bearing on data security in Power BI. During one test was I shown a warning about using data from a website with data I had marked Organizational and Private. In all cases, I was able to merge the data in the query and in the relationships with no warning or filtering. All of the documentation makes the same statement and most bloggers are restating what is found in the Power BI documentation as were not helpful. My takeaway after reviewing this for a significant amount of time is to not consider these settings when evaluating data security in Power BI. I welcome comments or additional references which actually demonstrate how this isolation actually works in practice. In most cases, we are using organizational data within our Power BI solutions and will not be impacted by this setting and my find improved performance when disabling it.
As Steve notes, this is not really a security feature. Instead, it’s intended to be more a warning to users about which data is confidential and which is publicly-sharable .
H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.
H2O FLOW web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed when the spark application/Notebook is running
You can click the available link in the Jupyter Notebook, or you can directly access this URL:
Setup is pretty easy.
I worked on an interesting issue today where a user couldn’t restore a backup. Here is what this customer did:
- backed up a database from an on-premises server (2008 R2)
- copied the file to an Azure VM
- tried to restore the backup on the Azure VM (2008 R2 with exact same build#)
But he got the following error:
Msg 3241, Level 16, State 0, Line 4
The media family on device ‘c:\temp\test.bak’ is incorrectly formed. SQL Server cannot process this media family.
Msg 3013, Level 16, State 1, Line 4
RESTORE HEADERONLY is terminating abnormally.
We verified that he could restore the same backup on the local machine (on-premises). Initially I thought the file must have been corrupt during transferring. We used different method to transfer file and zipped the file. The behavior is the same. When we backed up a database from the same Azure VM and tried to restore, it was successful.
Click through for Jack’s findings as well as a couple workarounds.