Category: Cloud

With upgrades in the underlying Kafka Streams library, the Kafka community introduced many improvements to the underlying stream configuration defaults. Where in previous, more unstable iterations of the client library we spent a lot of time tweaking config values such as session.timeout.ms, max.poll.interval.ms, and request.timeout.ms to achieve some level of stability.

With new releases we found ourselves discarding these custom values and achieving better results. However, some timeout issues persisted on some of our services, where a service would frequently get stuck in a rebalancing state. We noticed that reducing the max.poll.records value for the stream configs would sometimes alleviate issues experienced by these services. From partition lag profiles we also saw that the consuming issue seemed to be confined to only a few partitions, while the others would continue processing normally between re-balances. Ultimately we realised that the processing time for a record in these services could be very long (up to minutes) in some edge cases. Kafka has a fairly large maximum offset commit time before a stream consumer is considered dead (5 minutes) but with larger message batches of data this timeout was still being exceeded. By the time the processing of the record was finished the stream was already marked as failed and so the offset could not be committed. On rebalance, this same record would once again be fetched from Kafka, would fail to process in a timely manner and the situation would repeat. Therefore for any of the affected applications we introduced a processing timeout, ensuring there was an upper bound on the time taken by any of our edge cases.

There are some interesting tidbits in here.

Comments closed

Testing Cosmos DB Performance With Geospatial Data

Published 2017-10-26 by Kevin Feasel

Vincent-Philippe Lauzon has done some performance testing of Cosmos DB when querying geospatial data:

Here are the main attributes of the sample set:

There are 1 200 000 documents

Documents are distributed on 4000 logical partitions with 300 documents per logical partition

%33 of documents (i.e. 400 000 documents) have a location node with a geospatial “point” in there

Points are scattered uniformly on the geospatial rectangle

There are no correlation between the partition key and the geospatial point coordinates

We ran the tests with 4 different Request Units (RUs) configurations:

2500
10000
20000
100000

Read on for the test results and his findings.

Comments closed

Query Store Capture Modes

Published 2017-10-25 by Kevin Feasel

Arun Sirpal notes an important difference in the default Query Store settings for SQL Server 2017 versus Azure SQL Database:

So just remember the only difference when analyzing settings is the difference in Query Store Capture Mode. For Azure it is set to AUTO whereas with local installed SQL Servers it is set to ALL.

What does this mean? ALL means that it is set to capture all queries but AUTO means infrequent queries and queries with insignificant cost are ignored. Thresholds for execution count, compile and runtime duration are internally determined.

Read on to learn more, including how to change these settings.

Comments closed

Azure SQL Database FAQ

Published 2017-10-24 by Kevin Feasel

Dimitri Furman answers some common questions about Azure SQL Database:

Q7. Can I use Windows Authentication in Azure SQL Database?

The short answer is no. Therefore, if you are migrating an application dependent on Windows Authentication from SQL Server to Azure SQL Database, you may have to either switch to SQL Authentication (i.e. use a separate login and password for database access), or use Azure Active Directory Authentication (AAD Authentication).

The latter is conceptually similar to Windows Authentication in the sense that connections from directory principals are authenticated without the need to provide additional secrets, such as a password. Since Azure Active Directory can be federated with the on-premises Active Directory Domain Services, it can effectively authenticate the same Active Directory principals that could access the database prior to migration. However, the authentication flow for AAD Authentication is significantly different, so the analogy with Windows Authentication only goes so far.

There are some good questions in here, especially the one about retry logic; that’s good to have in any situation, but becomes vital when working with a cloud service.

Comments closed

Installing The Azure ML Workbench

Published 2017-10-19 by Kevin Feasel

Leila Etaati walks us through setting up the Azure ML workbench:

In Microsoft ignite 2017, Azure ML team announce new on-premises tools for doing machine learning. this tools much more comprehensive as it provides

1- a workspace helps data wrangling

2- Data Visualization

3-Easy to deploy

4-Support Python codes

in this post and next posts, I will share my experiment with working this tools.

Click through for the step-by-step installation guide.

Comments closed

Master Data In Azure

Published 2017-10-17 by Kevin Feasel

Matt How explains why Master Data Services isn’t a great cloud-based master data management solution and offers up an alternative:

Excel is easy to use, but not user friendly

Excel is on nearly every desktop in any Windows based organisation and with the Master Data Services Add-in, it puts the data well within the reach of the users. Whilst it is simple it is in no way user friendly when compared to other applications that your users may be using. Not to mention that for most this will be the only part of the solution they see! Wouldn’t it be great if there was a way to supply the same data but with an intuitive, mobile ready front end that people enjoy using?

Developers are tightly constrained

Developers like to develop, not choose options from drop down menus in a web based portal. With MDS, not only can Devs not make use of Visual Studio and a like but they are very tightly constrained by the business rules engine. At this point we should be able to make use of our preferred IDE so that we can benefit from source control, frameworks and customised business logic.

Not scalable according to modern expectations

Finally, MDS cannot scale to handle any kind of “big data”. It’s a bit of buzz word but as businesses collect more and more data, we need a data management option that can grow with that data. Due to the fact that MDS must be deployed from a server, there is no easy way to meet those big data requirements.

There are a few pieces to Matt’s solution, making for an interesting read.

Comments closed

Checking Azure Status

Published 2017-10-16 by Kevin Feasel

Arun Sirpal shows where to look if you think you’re experiencing an Azure SQL Database outage:

It shows the many different layers involved with a product like Azure SQL Database. What happens if there is a loss of service for a specific component? Obviously we as customers would not be able to fix the issue as this is the responsibility of Microsoft Engineers, the key for me is being kept in the loop with the issue and it is something that they do pretty well. So what happens if the load balancer has issues?

All communication is done via Service Health within the Azure portal.

Check the comments for another useful Azure status site.

Comments closed

Azure Data Lake Store File Management With httr

Published 2017-10-12 by Kevin Feasel

Leila Etaati shows how to generate RESTful statements in R using httr:

In this post, I am going to share my experiment in how to do file management in ADLS using R studio,

to do this you need to have below items

1. An Azure subscription

2. Create an Azure Data Lake Store Account

3. Create an Azure Active Directory Application (for the aim of service-to-service authentication).

4. An Authorization Token from Azure Active Directory Application

It’s pretty easy to do, as Leila shows.

Comments closed

Using R In Azure Data Lake Analytics

Published 2017-10-12 by Kevin Feasel

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R Script. There’s a 500Mb limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.

Comments closed

Testing Azure Data Lake Store Performance

Published 2017-10-12 by Kevin Feasel

Zhen Zeng and Govind Kamat stress test Azure Data Lake Store:

Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.

The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop. This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster. The suite is comprised of three tools that are typically executed in sequence:

TeraGen, that tool that generates the input data. We use it to test the write performance of HDFS and ADLS.
TeraSort, which sorts the input data in a distributed fashion. This test is CPU bound and we don’t really use it to characterize the I/O performance or HDFS and ADLS, but it is included for completeness.
TeraValidate, the test that reads and validates the sorted data from the previous stage. We use it to test the read performance of HDFS and ADLS.

It’s an interesting look at how well ADLS scales. In general, my reading of this is fairly positive for Azure Data Lake Store.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31