Press "Enter" to skip to content

Category: Cloud

Understanding Data Gateways

James Serra walks us through the different data gateways available in Azure:

On-premises data gateway: Formerly called the enterprise version.  Multiple users can share and reuse a gateway in this mode.  This gateway can be used by Power BI, PowerApps, Microsoft Flow or Azure Logic Apps.  For Power BI, this includes support for both scheduled refresh and DirectQuery.  To add a data source such as SQL Server that can be used by the gateway, check out Manage your data source – SQL Server.  To connect the gateway to Power BI, you will sign in to Power BI after you install it (see On-premises data gateway in-depth).

Click through for more details on additional gateways.

Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access the Spark UI for the running application and get more detailed information on its execution, use the Application Master link and navigate through the different tabs containing more information on jobs, stages, executors, and so on.

These methods also apply to on-prem Spark clusters, although the resource locations might be a little different.

Understanding Azure SQL Elastic Pool

Vincent-Philippe Lauzon explains how SQL Elastic Pools work and why we might want to use them in Azure:

Along came Elastic Pool.  Interestingly, Elastic Pools brought back the notion of centralized compute shared across databases.  Unlike on-premises SQL Server, though, that compute doesn’t sit with the server itself but with a new resource called an elastic pool.

This allows us to provision certain compute, i.e. DTUs, to a pool and share it across many databases.

His example uses a large number of small databases, where the total load is never the sum of the individual expected loads.  Another reason to use a pool is for cross-database queries in Azure.
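
For a concrete sense of how assignment works, here is a minimal T-SQL sketch, assuming an existing elastic pool; the pool and database names are made up:

```sql
-- A hedged sketch: assumes an elastic pool named SharedPool already exists
-- on the Azure SQL logical server; database names are hypothetical.
-- Create a new database directly inside the pool:
CREATE DATABASE TenantDb42
    ( SERVICE_OBJECTIVE = ELASTIC_POOL ( name = SharedPool ) );

-- Move an existing standalone database into the pool, so it draws from
-- the pool's shared DTUs rather than its own fixed allocation:
ALTER DATABASE TenantDb17
    MODIFY ( SERVICE_OBJECTIVE = ELASTIC_POOL ( name = SharedPool ) );
```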

Deploying VMs To Azure Using PowerShell

Rob Sewell shows how to use PowerShell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically, you first have to log in to the portal and click some buttons.

In the Portal, click New, then Marketplace, and then search for data science. Choose the Windows Data Science Machine, and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started.” Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.

Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general-purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.
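
To get a feel for the workflow, here’s a hedged sketch of the two steps involved: declare an external table over files in S3, then query it with ordinary SQL. The bucket path and column list are hypothetical, not Mark’s actual schema:

```sql
-- Hypothetical table over CSV trip data sitting in S3; Athena (Presto)
-- reads the files in place at query time.
CREATE EXTERNAL TABLE trips (
    pickup_datetime  STRING,
    passenger_count  INT,
    trip_distance    DOUBLE,
    total_amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/nyc-taxi/csv/';

-- Ask a question of the whole dataset; you pay per terabyte scanned:
SELECT passenger_count, AVG(total_amount) AS avg_fare
FROM trips
GROUP BY passenger_count;
```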

Locking Azure Resources

Arun Sirpal explains how to lock resources in Azure:

There are two types of resource locks in Azure.

  • Delete – Authorised users can read and modify a resource, but they cannot delete it.
  • ReadOnly – Authorised users can read a resource but they cannot edit or delete it.

For this blog post, I create a delete lock on one of my SQL Databases.

My overly simplistic advice:  lock any production resource which you wouldn’t want accidentally deleted.  It won’t prevent a malicious user from doing something catastrophic, but it can prevent the “Oops, I meant to click the thing above this” class of mistake.

Streaming Data With Kinesis

Assaf Mentzer shows how to join streaming data (specifically, AWS Kinesis) with lookup data:

In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and use S3 for enriching a streaming data source.

For example, bike share systems around the world can publish data files about available bikes and docks, at each station, in real time.  On bike-share system data feeds that follow the General Bikeshare Feed Specification (GBFS), there is a reference dataset that contains a static list of all stations, their capacities, and locations.

There are three different architectures in here, so if you’re looking for streaming data models with Kinesis (or want to apply them to Kafka), this is a solid read.
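
To make the enrichment pattern concrete, here is a hedged sketch of Kinesis Analytics’ streaming SQL. All names are hypothetical except SOURCE_SQL_STREAM_001, the default name for the incoming stream; the STATIONS reference table would be mapped to an S3 object in the application’s configuration:

```sql
-- An in-application stream to hold the enriched output (hypothetical schema):
CREATE OR REPLACE STREAM "ENRICHED_STREAM" (
    station_id   INTEGER,
    station_name VARCHAR(64),
    bikes_free   INTEGER
);

-- A pump continuously joins the incoming stream to the static
-- station list loaded from S3 as a reference table:
CREATE OR REPLACE PUMP "ENRICH_PUMP" AS
INSERT INTO "ENRICHED_STREAM"
SELECT STREAM s.station_id, r.station_name, s.bikes_free
FROM "SOURCE_SQL_STREAM_001" AS s
JOIN "STATIONS" AS r
  ON s.station_id = r.station_id;
```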

Multi-Tenant Database Backups

Kennie Nybo Pontoppidan thinks about multi-tenant databases in Azure and how you might back them up:

Backup-restore is not directly supported by standard methods in SQL Server/Azure SQL Database. One possible way to back up a tenant could be to have a script which could bcp data to text files. Restore could similarly be a script which could bcp from text files to tables in the destination database. Both scripts could be auto-generated from tenant metadata. If the schema for a tenant has 100 tables, the number of tables in a database in this model grows quickly, and the administrative cost of maintaining scripts and tenant metadata could be high. As a side note, no query execution plans can be reused across tenants, since table names are different.

Thinking about customers who share schemas, tables, etc., but need to be handled differently requires some additional effort; pretty much all of the tools around SQL Server assume that you care about things at the table, filegroup, or database level.
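
As a sketch of the generated-script approach Kennie mentions, assuming (hypothetically) that each tenant’s tables live in a schema named for the tenant, the bcp export commands could be built from the catalog views:

```sql
-- Hedged sketch with made-up names: emit one bcp export command per table
-- for a single tenant, driven entirely by metadata.
DECLARE @tenant sysname = N'Tenant042';

SELECT 'bcp ' + QUOTENAME(DB_NAME()) + '.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
     + ' out C:\Backup\' + @tenant + '\' + t.name + '.dat -n -T -S ' + @@SERVERNAME
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON s.schema_id = t.schema_id
WHERE s.name = @tenant;
```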

Replicating To Azure SQL DB

Jeffrey Verheul shows how to enable replication from your on-prem SQL Server up to Azure SQL DB:

Replication to another on-premises instance is easy. You just follow the steps in the wizard, it works out of the box, and the chances of this process failing are small. With replicating data to an Azure SQL database, it’s a bit more of a struggle. Just one single word took me a few HOURS of investigation and a lot of swearing…

The magic word is “secure.”  Read the whole thing if you’re thinking of migrating an app to use Azure SQL DB and want to minimize downtime, or if you just want that extra level of protection that having a copy of your database out of the data center can give you.
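
For orientation, the subscriber side of a push subscription looks roughly like the following hedged sketch (all names made up; Azure SQL DB only supports push subscriptions, and the subscriber must use SQL authentication):

```sql
-- Run at the on-premises publisher, in the publication database.
-- Hypothetical publication, server, and credential names throughout.
EXEC sp_addsubscription
    @publication       = N'MyPublication',
    @subscriber        = N'yourserver.database.windows.net',
    @destination_db    = N'TargetDb',
    @subscription_type = N'Push',
    @sync_type         = N'automatic';

EXEC sp_addpushsubscription_agent
    @publication              = N'MyPublication',
    @subscriber               = N'yourserver.database.windows.net',
    @subscriber_db            = N'TargetDb',
    @subscriber_security_mode = 0,            -- SQL authentication
    @subscriber_login         = N'repl_user',
    @subscriber_password      = N'<strong password>';
```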

Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena, as well as demonstrate how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.
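
If you want the flavor before diving into the post, here’s a hedged sketch of querying Parquet-formatted variant data in S3, with a hypothetical table and columns rather than Aaron’s actual schema:

```sql
-- Made-up schema for illustration; Athena reads the Parquet files
-- directly from S3 with no cluster to provision.
CREATE EXTERNAL TABLE variants (
    chromosome  STRING,
    position    BIGINT,
    reference   STRING,
    alternate   STRING,
    sample_id   STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/thousand-genomes/parquet/';

-- A typical genomics-style aggregate: variant counts by chromosome.
SELECT chromosome, COUNT(*) AS variant_count
FROM variants
GROUP BY chromosome
ORDER BY variant_count DESC;
```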
