Press "Enter" to skip to content

Category: Cloud

Understanding Data Gateways

James Serra walks us through the different data gateways available in Azure:

On-premises data gateway: Formerly called the enterprise version.  Multiple users can share and reuse a gateway in this mode.  This gateway can be used by Power BI, PowerApps, Microsoft Flow or Azure Logic Apps.  For Power BI, this includes support for both scheduled refresh and DirectQuery.  To add a data source such as SQL Server that can be used by the gateway, check out Manage your data source – SQL Server.  To connect the gateway to Power BI, you will sign in to Power BI after you install it (see On-premises data gateway in-depth).

Click through for more details on additional gateways.

Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access the Spark UI for the running application and get more detailed information on its execution, use the Application Master link and navigate through the different tabs containing more information on jobs, stages, executors, and so on.

These methods also apply to on-prem Spark clusters, although the resource locations might be a little different.

Understanding Azure SQL Elastic Pool

Vincent-Philippe Lauzon explains how SQL Elastic Pools work and why we might want to use them in Azure:

Along came Elastic Pool.  Interestingly, Elastic Pools brought back the notion of centralized compute shared across databases.  Unlike on-premises SQL Server, though, that compute doesn’t sit with the server itself but with a new resource called an elastic pool.

This allows us to provision certain compute, i.e. DTUs, to a pool and share it across many databases.

His example uses a large number of small databases, where the total load is never the sum of the individual expected loads.  Another reason to use a pool is for cross-database queries in Azure.
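
For a concrete sense of how assignment works, here is a minimal T-SQL sketch, assuming an existing elastic pool; the pool and database names are made up:

```sql
-- A hedged sketch: assumes an elastic pool named SharedPool already exists
-- on the Azure SQL logical server; database names are hypothetical.
-- Create a new database directly inside the pool:
CREATE DATABASE TenantDb42
    ( SERVICE_OBJECTIVE = ELASTIC_POOL ( name = SharedPool ) );

-- Move an existing standalone database into the pool, so it draws from
-- the pool's shared DTUs rather than its own fixed allocation:
ALTER DATABASE TenantDb17
    MODIFY ( SERVICE_OBJECTIVE = ELASTIC_POOL ( name = SharedPool ) );
```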

Deploying VMs To Azure Using PowerShell

Rob Sewell shows how to use PowerShell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically, you first have to log in to the portal and click some buttons.

In the Portal, click New, then Marketplace, and then search for data science. Choose the Windows Data Science Machine, and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started.” Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.

Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general-purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.
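
To get a feel for the workflow, here’s a hedged sketch of the two steps involved: declare an external table over files in S3, then query it with ordinary SQL. The bucket path and column list are hypothetical, not Mark’s actual schema:

```sql
-- Hypothetical table over CSV trip data sitting in S3; Athena (Presto)
-- reads the files in place at query time.
CREATE EXTERNAL TABLE trips (
    pickup_datetime  STRING,
    passenger_count  INT,
    trip_distance    DOUBLE,
    total_amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/nyc-taxi/csv/';

-- Ask a question of the whole dataset; you pay per terabyte scanned:
SELECT passenger_count, AVG(total_amount) AS avg_fare
FROM trips
GROUP BY passenger_count;
```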

Locking Azure Resources

Arun Sirpal explains how to lock resources in Azure:

There are two types of resource locks in Azure.

  • Delete – Authorised users can read and modify a resource, but they cannot delete it.
  • ReadOnly – Authorised users can read a resource but they cannot edit or delete it.

For this blog post, I create a delete lock on one of my SQL Databases.

My overly simplistic advice:  lock any production resource which you wouldn’t want accidentally deleted.  It won’t prevent a malicious user from doing something catastrophic, but it can prevent the “Oops, I meant to click the thing above this” class of mistake.

Streaming Data With Kinesis

Assaf Mentzer shows how to join streaming data (specifically, AWS Kinesis) with lookup data:

In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and use S3 for enriching a streaming data source.

For example, bike share systems around the world can publish data files about available bikes and docks, at each station, in real time.  On bike-share system data feeds that follow the General Bikeshare Feed Specification (GBFS), there is a reference dataset that contains a static list of all stations, their capacities, and locations.

There are three different architectures in here, so if you’re looking for streaming data models with Kinesis (or want to apply them to Kafka), this is a solid read.
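
To make the enrichment pattern concrete, here is a hedged sketch of Kinesis Analytics’ streaming SQL. All names are hypothetical except SOURCE_SQL_STREAM_001, the default name for the incoming stream; the STATIONS reference table would be mapped to an S3 object in the application’s configuration:

```sql
-- An in-application stream to hold the enriched output (hypothetical schema):
CREATE OR REPLACE STREAM "ENRICHED_STREAM" (
    station_id   INTEGER,
    station_name VARCHAR(64),
    bikes_free   INTEGER
);

-- A pump continuously joins the incoming stream to the static
-- station list loaded from S3 as a reference table:
CREATE OR REPLACE PUMP "ENRICH_PUMP" AS
INSERT INTO "ENRICHED_STREAM"
SELECT STREAM s.station_id, r.station_name, s.bikes_free
FROM "SOURCE_SQL_STREAM_001" AS s
JOIN "STATIONS" AS r
  ON s.station_id = r.station_id;
```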

Multi-Tenant Database Backups

Kennie Nybo Pontoppidan thinks about multi-tenant databases in Azure and how you might back them up:

Backup-restore is not directly supported by standard methods in SQL Server/Azure SQL Database. One possible way to back up a tenant could be to have a script which could bcp data to text files. Restore could similarly be a script which could bcp from text files to tables in the destination database. Both scripts could be auto-generated from tenant metadata. If the schema for a tenant has 100 tables, the number of tables in a database in this model grows quickly, and the administrative cost of maintaining scripts and tenant metadata could be high. As a side note, no query execution plans can be reused across tenants, since table names are different.

Thinking about customers who share schemas, tables, etc., but need to be handled differently requires some additional effort; pretty much all of the tools around SQL Server assume that you care about things at the table, filegroup, or database level.
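
As a sketch of the generated-script approach Kennie mentions, assuming (hypothetically) that each tenant’s tables live in a schema named for the tenant, the bcp export commands could be built from the catalog views:

```sql
-- Hedged sketch with made-up names: emit one bcp export command per table
-- for a single tenant, driven entirely by metadata.
DECLARE @tenant sysname = N'Tenant042';

SELECT 'bcp ' + QUOTENAME(DB_NAME()) + '.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
     + ' out C:\Backup\' + @tenant + '\' + t.name + '.dat -n -T -S ' + @@SERVERNAME
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON s.schema_id = t.schema_id
WHERE s.name = @tenant;
```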

Replicating To Azure SQL DB

Jeffrey Verheul shows how to enable replication from your on-prem SQL Server up to Azure SQL DB:

Replication to another on-premises instance is easy. You just follow the steps in the wizard, it works out of the box, and the chances of this process failing are small. With replicating data to an Azure SQL database, it’s a bit more of a struggle. Just one single word took me a few HOURS of investigation and a lot of swearing…

The magic word is “secure.”  Read the whole thing if you’re thinking of migrating an app to use Azure SQL DB and want to minimize downtime, or if you just want that extra level of protection that having a copy of your database out of the data center can give you.
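
For orientation, the subscriber side of a push subscription looks roughly like the following hedged sketch (all names made up; Azure SQL DB only supports push subscriptions, and the subscriber must use SQL authentication):

```sql
-- Run at the on-premises publisher, in the publication database.
-- Hypothetical publication, server, and credential names throughout.
EXEC sp_addsubscription
    @publication       = N'MyPublication',
    @subscriber        = N'yourserver.database.windows.net',
    @destination_db    = N'TargetDb',
    @subscription_type = N'Push',
    @sync_type         = N'automatic';

EXEC sp_addpushsubscription_agent
    @publication              = N'MyPublication',
    @subscriber               = N'yourserver.database.windows.net',
    @subscriber_db            = N'TargetDb',
    @subscriber_security_mode = 0,            -- SQL authentication
    @subscriber_login         = N'repl_user',
    @subscriber_password      = N'<strong password>';
```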

Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena, as well as demonstrate how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.
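
If you want the flavor before diving into the post, here’s a hedged sketch of querying Parquet-formatted variant data in S3, with a hypothetical table and columns rather than Aaron’s actual schema:

```sql
-- Made-up schema for illustration; Athena reads the Parquet files
-- directly from S3 with no cluster to provision.
CREATE EXTERNAL TABLE variants (
    chromosome  STRING,
    position    BIGINT,
    reference   STRING,
    alternate   STRING,
    sample_id   STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/thousand-genomes/parquet/';

-- A typical genomics-style aggregate: variant counts by chromosome.
SELECT chromosome, COUNT(*) AS variant_count
FROM variants
GROUP BY chromosome
ORDER BY variant_count DESC;
```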
