Cloud – Page 186 – Curated SQL

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics. Check out Mark’s blog post for a good way of getting started with Athena.

Comments closed

Locking Azure Resources

Published 2016-12-16 by Kevin Feasel

Arun Sirpal explains how to lock resources in Azure:

There are 2 types of lock resources in Azure.

Delete – Obviously you can’t delete but you can read / modify a resource, this applies to authorised users.

ReadOnly – Authorised users can read a resource but they cannot edit or delete it.

For this blog post I create a delete lock on one of my SQL Databases.

My overly simplistic advice: lock any production resource which you wouldn’t want accidentally deleted. It won’t prevent a malicious user from doing something catastrophic, but it can prevent the “Oops, I meant to click the thing above this” class of mistake.

Comments closed

Streaming Data With Kinesis

Published 2016-12-15 by Kevin Feasel

Asaaf Mentzer shows how to join streaming data (specifically, AWS Kinesis) with lookup data:

In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and use S3 for enriching a streaming data source.

For example, bike share systems around the world can publish data files about available bikes and docks, at each station, in real time. On bike-share system data feeds that follow the General Bikeshare Feed Specification (GBFS), there is a reference dataset that contains a static list of all stations, their capacities, and locations.

There are three different architectures in here, so if you’re looking for streaming data models with Kinesis (or want to apply them to Kafka), this is a solid read.

Comments closed

Multi-Tenant Database Backups

Published 2016-12-14 by Kevin Feasel

Kennie Nybo Pontoppidan thinks about multi-tenant databases in Azure and how you might back them up:

Backup-restore is not directly supported by standard methods in SQL Server/Azure SQL database. One possible way to backup a tenant could be to have a script, which could bcp data to text files. Restore could similarly be a script, which could bcp from txt files to tables in the destination database. Both scripts could be auto-generated from tenant metadata. If the schema for a tenant has 100 tables, the number of tables in a database in this model grows quickly, and the administrative cost of maintaining scripts and tenant metadata could be high. As a side note, no query execution plans can be reused across tenants, since table names are different.

Thinking about customers which share schema, tables, etc. but need to be handled differently requires some additional effort; pretty much all of the tools around SQL Server assume that you care about things at the table, filegroup, or database level.

Comments closed

Replicating To Azure SQL DB

Published 2016-12-09 by Kevin Feasel

Jeffrey Verheul shows how to enable replication from your on-prem SQL Server up to Azure SQL DB:

Replication to another on-premise instance is easy. You just follow the steps in the wizard, it works out-of-the-box, and the chances of this process failing are small. With replicating data to an Azure SQL database it’s a bit more of a struggle. Just one single word took me a few HOURS of investigation and a lot of swearing…

The magic word is “secure.” Read the whole thing if you’re thinking of migrating an app to use Azure SQL DB and want to minimize downtime, or if you just want that extra level of protection that having a copy of your database out of the data center can give you.

Comments closed

Querying Genomic Data With Athena

Published 2016-12-08 by Kevin Feasel

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms. I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.

Comments closed

Azure TCO Calculator

Published 2016-12-07 by Kevin Feasel

James Serra links to the Azure Total Cost of Ownership calculator:

For a long time clients would ask me how to determine the cost savings by migrating their applications and databases to Azure. I never had a good answer until now: The Total Cost of Ownership (TCO) Calculator. Now in preview, just provide a brief description of your on-premises environment to get an instant estimate of the cost savings you can realize by migrating your application workloads to Microsoft Azure

I think this is a useful start, though a big part of the value I see in Azure is moving certain things from IaaS to PaaS, like getting rid of the web servers and going to Azure Websites.

Comments closed

Apache Ranger On ElasticMapReduce

Published 2016-12-06 by Kevin Feasel

Varun Rao explains role-based access control using Apache Ranger on Amazon ElasticMapReduce:

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

If your job includes securing a Hadoop cluster, this is a nice read, even if you don’t use EMR.

Comments closed

Identity Not Found

Published 2016-12-05 by Kevin Feasel

Melissa Coates explains Azure Active Directory tenancy to solve an Azure Analysis Services error:

The Analysis Services product team explained to me that a a user from a tenant which has never provisioned Azure Analysis Services cannot be added to another tenant’s provisioned server. Put another way, our Corporate tenant had never provisioned AAS so the Development tenant could not do so via cross-tenant guest security.

One resolution for this is to provision an AAS server in a subscription associated with the Corporate tenant, and then immediately delete the service from the Corporate tenant. Doing that initial provisioning will do the magic behind the scenes and allow the tenant to be known to Azure Analysis Services. Then we can proceed to provision it for real in the Development tenant.

Read the whole thing.

Comments closed

Elastic Database Pools

Published 2016-12-05 by Kevin Feasel

Arun Sirpal describes Azure elastic database pools:

The key to using elastic database pools is that you must understand the characteristics of the databases involved and their utilisation patterns, if you do not understand this then the idea of using an elastic database pool may cause problems.

The maximum amount my pool has is 100 eDTUs, I know for a fact that the S2 databases will not be used at the same time, the other S0 databases might be used at the same time at the most 3 of them at the same time. Basically what I am saying here is that I know that when the databases concurrently peak I know that it will not go beyond the 100 eDTU limit.

One thing that Arun does not mention is the relative ease of interconnecting databases within a pool, so even if it doesn’t end up being cheaper on net, that might be a benefit worth having.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Cloud

Taxi Rides And Amazon Athena

Locking Azure Resources

Streaming Data With Kinesis

Multi-Tenant Database Backups

Replicating To Azure SQL DB

Querying Genomic Data With Athena

Azure TCO Calculator

Apache Ranger On ElasticMapReduce

Identity Not Found

Elastic Database Pools