Press "Enter" to skip to content

Category: Cloud

Spark Deep Learning On AWS

Joseph Spisak, et al., show how to configure and use BigDL in Amazon Web Services’ Elastic MapReduce:

Classify text using BigDL

In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.

We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.
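To give a sense of the model, the example stacks one-dimensional convolution and pooling blocks over embedded token sequences, ending in a 20-way softmax. Below is a minimal sketch of that architecture written in Keras purely for illustration (the tutorial itself builds an analogous network with BigDL’s own layer API on Spark, and every hyperparameter here is an assumption):

# Illustrative only: a Keras rendition of a convolutional text classifier
# for the 20 Newsgroup categories. The BigDL tutorial builds an analogous
# network with BigDL's layer API on Spark; all sizes here are assumptions.
from tensorflow.keras import layers, models

SEQUENCE_LEN = 500   # tokens kept per post (assumption)
EMBEDDING_DIM = 100  # e.g., GloVe 100-d word vectors (assumption)
NUM_CLASSES = 20     # the 20 Newsgroup categories

model = models.Sequential([
    layers.Conv1D(128, 5, activation="relu",
                  input_shape=(SEQUENCE_LEN, EMBEDDING_DIM)),
    layers.MaxPooling1D(5),
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(5),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()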

There’s a lot to this tutorial.


Instantiating Cortana Intelligence Gallery Solutions

Melissa Coates has a step-by-step guide showing how to install a solution from the Cortana Intelligence Gallery:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions, like I do). The deployment did go to the subscription I wanted; however, it would be really nice to have that as a selection, to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.


Data Classification In Power BI

Steve Hughes describes how Power BI data classification works:

Power BI Privacy Levels “specify an isolation level that defines the degree that one data source will be isolated from other data sources”. After working through some testing scenarios and trying to discover the real impact on data security, I was unable to effectively show how this might have any bearing on data security in Power BI. Only during one test was I shown a warning about using data from a website with data I had marked Organizational and Private. In all cases, I was able to merge the data in the query and in the relationships with no warning or filtering. All of the documentation makes the same statement, and most bloggers are restating what is found in the Power BI documentation, which was not helpful. My takeaway after reviewing this for a significant amount of time is to not consider these settings when evaluating data security in Power BI. I welcome comments or additional references which actually demonstrate how this isolation works in practice. In most cases, we are using organizational data within our Power BI solutions, which will not be impacted by this setting, and we may find improved performance when disabling it.

As Steve notes, this is not really a security feature.  Instead, it’s intended to be more of a warning to users about which data is confidential and which is publicly sharable.


Using h2o.ai On HDInsight

Xiaoyong Zhu shows how to set up h2o.ai on Azure HDInsight:

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed when the Spark application/notebook is running.

You can click the available link in the Jupyter Notebook, or you can directly access this URL:

https://yourclustername-h2o.apps.azurehdinsight.net/flow/index.html
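If you would rather drive H2O from code than from the Flow UI, Sparkling Water also ships a Python package. A minimal sketch, assuming pysparkling is installed on the cluster (as in the post’s setup) and that spark is the notebook’s SparkSession:

# Start an H2O cloud inside the running Spark application; this is also
# what brings the Flow endpoint above to life.
from pysparkling import H2OContext

hc = H2OContext.getOrCreate(spark)
print(hc)  # connection details, including where Flow is being served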

Setup is pretty easy.


SQL Server Backup To Azure Tool Causing Restore Errors

Jack Li diagnoses an issue in which the Microsoft SQL Server Backup to Microsoft Azure Tool causes errors when trying to restore a database on an Azure VM with SQL Server 2008 R2:

I worked on an interesting issue today where a user couldn’t restore a backup.   Here is what this customer did:

  1. backed up a database from an on-premises server (2008 R2)
  2. copied the file to an Azure VM
  3. tried to restore the backup on the Azure VM (2008 R2 with exact same build#)

But he got the following error:

Msg 3241, Level 16, State 0, Line 4
The media family on device ‘c:\temp\test.bak’ is incorrectly formed. SQL Server cannot process this media family.
Msg 3013, Level 16, State 1, Line 4
RESTORE HEADERONLY is terminating abnormally.

We verified that he could restore the same backup on the local machine (on-premises). Initially, I thought the file must have been corrupted during transfer. We used a different method to transfer the file and also zipped it; the behavior was the same. When we backed up a database from the same Azure VM and tried to restore it, that was successful.

Click through for Jack’s findings as well as a couple workarounds.


On-Prem Power BI Gateway

Steve Hughes shows how to set up a data gateway for Power BI:

First, I will not be discussing the personal gateway in this post. If you have chosen to use the personal gateway, you have limited functionality and should consider using the on-premises data gateway for corporate use.

The on-premises data gateway (referred to as gateway throughout this post) “acts as a bridge, providing quick and secure data transfer between on-premises data and the Power BI, Microsoft Flow, Logic Apps, and PowerApps services.” (ref) Much of what is discussed here will apply to all of the services referenced above, but our primary concern is related to Power BI. Please refer to references at the end of this post for details about data sources supported within the gateway.

Click through for more information.


Thinking About Automation

Chrissy LeMaire has a series of thoughts on this month’s T-SQL Tuesday, and it was worth separating out from the rest of today’s batch:

Y’all know what I’m gonna say here! I love automation and PowerShell. I know for a fact that PowerShell and T-SQL together are the future of SQL Server administration. As someone who often presents about dbatools, the popular SQL PowerShell community project, I’ve seen the excitement and relief that PowerShell automation brings to SQL Server Database Pros.

From making it way easier to migrate entire instances to automating backup testing and verification, PowerShell makes it straight up more enjoyable to be a DBA.

There’s a lot of well-deserved plugging of dbatools.  Hint, hint.


Moving Files In Azure Data Factory

Meagan Longoria has a workaround for the fact that you cannot move a file using Azure Data Factory:

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.

Read on for the full details on this workaround.  Also, vote on this feedback item if you want the ability to move files instead of just copying them.
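If you do need the delete half of a move, one option is to handle it outside of ADF, for example from a scheduled script. A hedged sketch using the azure-datalake-store Python package; the tenant, credentials, store name, and path below are all placeholders:

# Delete a staged file from Azure Data Lake Store after downstream
# processing has finished; every value here is a placeholder.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="your-tenant-id",
                 client_id="your-client-id",
                 client_secret="your-client-secret")
adls = core.AzureDLFileSystem(token, store_name="yourdatalakestore")

staged_file = "/staging/sales/sales.csv"  # hypothetical fixed-name staging file
if adls.exists(staged_file):
    adls.rm(staged_file)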


Azure Data Lake Store Best Practices

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for storage and parallelisation. Splittable formats – files which are row-oriented, such as CSV – are parallelisable, as data does not span extents. Non-splittable formats, however – files which are not row-oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.
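As a toy illustration of the splittable-format point: a single JSON document has to be handled by one vertex, while a row-oriented rewrite gives the engine line boundaries it can split across extents. A minimal Python sketch (file names are placeholders):

import pandas as pd

# Flatten a JSON array of records into row-per-line CSV before landing it
# in the lake, so downstream jobs can process extents in parallel.
records = pd.read_json("posts.json")
records.to_csv("posts.csv", index=False)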


HDInsight 3.6 Available

Ashish Thapliyal points out some Hive improvements in HDInsight 3.6:

2. Create a new Hive table from scratch, or alter a table

Create a new table by clicking on the ‘+’ icon, which opens the create table wizard. Enter the table name and column names, and choose a data type from the dropdown. You can pick the following advanced Hive settings directly from the UI (a sketch of the equivalent DDL appears after this list):

  • Transactional: Turn on transaction support in Hive by checking this flag. Note that the table must be bucketed and stored using an ACID-compliant format (such as ORC).

  • Location: Hive stores the table data for managed tables in the Hive warehouse directory in HDFS, which is configured in hive-site.xml with the property hive.metastore.warehouse.dir. The default location is /apps/hive/warehouse. The location can be changed using the Location text field.

  • File Format: The default file format for the CREATE TABLE statement is ORC. Choose a format from the file format dropdown.

  • Row Format: Select row format options such as the field terminator, lines terminator, and stored file type.

  • Tables can be altered to add new columns or to change a column’s name or datatype.

  • Tables can also be renamed and altered.
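The wizard is a front end over ordinary HiveQL. Here is a hedged sketch of the general shape of DDL behind a transactional table, submitted via the PyHive package for illustration (host, credentials, table, and column names are placeholders; you could equally paste the statement into the Hive view):

from pyhive import hive

# Connect to HiveServer2 on the cluster (placeholder host and username).
conn = hive.Connection(host="headnode-host", port=10000, username="admin")
cursor = conn.cursor()

# Transactional tables must be bucketed and stored in an ACID-compliant
# format such as ORC, per the settings described above.
cursor.execute("""
    CREATE TABLE sales (
        sale_id INT,
        amount  DECIMAL(10, 2)
    )
    CLUSTERED BY (sale_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")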

Read on for more improvements, including a graphical plan viewer and improved autocomplete.
