Press "Enter" to skip to content

Category: Cloud

Kinesis Analytics

Ryan Nienhuis shows how to implement Amazon Kinesis Analytics:

As I covered in the first post, streaming data is continuously generated; therefore, you need to specify bounds when processing data to make your result set deterministic. Some SQL statements operate on individual rows and have natural bounds, such as a continuous filter that evaluates each row based upon a defined SQL WHERE clause. However, SQL statements that process data across rows need to have set bounds, such as calculating the average of particular column. The mechanism that provides these bounds is a window.

Windows are important because they define the bounds for which you want your query to operate. The starting bound is usually the current row that Amazon Kinesis Analytics is processing, and the window defines the ending bound.

Windows are required with any query that works across rows, because the in-application stream is unbounded and windows provide a mechanism to bind the result set and make the query deterministic. Analytics supports three types of windows: tumbling, sliding, and custom.

The concepts here are very similar to Azure’s Stream Analytics.

Comments closed

Improving HBase Cluster Restart Time

Nitin Verma explains how to re-create an HBase cluster a bit faster:

When flush ‘table’ operation is triggered, all the regions belonging to that table will flush independently. Once the HFile corresponding to a region is flushed, it records the max sequence id in metadata and notifies the WAL corresponding to the regionserver. WAL maintains a mapping table for regions and their corresponding flushed sequence id’s. When the HBase cluster restarts, the hMaster will distribute flushed sequence id’s per region to the recovery threads splitting the WAL, so that they can skip the edits which have already been persisted in HFiles.

This is particularly important for clusters which frequently spin up and down, a feature of Platform-as-a-Service solutions like HDInsight.

Comments closed

S3 Or EBS?

Devadutta Ghat, et al, compare Amazon S3 versus Elastic Block Storage (EBS) on the basis of cost and Apache Impala performance:

EBS is attached to the AWS compute node as a fully-functional filesystem (similar to an attached SSD on an on-premise node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching it again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. However, with S3, throughput is better than simple network-attached storage because of its dedicated, high-performance networks. In Cloudera’s internal benchmark testing (detailed below), on an r3.2xlarge, we saw a consistent throughput of about 100MB/s. Furthermore, in S3, there is currently no equivalent to HDFS short-circuit reads. Move/rename operations for data stored in S3 is a copy followed by a delete, while a file move on HDFS is a metadata operation—which is usually problematic for ETL workloads, as they create large number of small files that are typically moved.

It looks like EBS is a solid choice for many workloads.

Comments closed

Azure Automation

Melissa Coates explains Azure Automation:

Azure Automation is a cloud service in Microsoft Azure which let you schedule execution of PowerShell cmdlets and PowerShell workflows. Azure Automation uses the concept of runbooks to execute a set of repeatable, repetitive tasks via PowerShell. Consistency in execution, reduction of errors, and of course saving time, are all key objectives – which makes DBAs and system admins happy, eh?

This is a higher-level discussion including some good tips on the product.

Comments closed

Migrating To Azure SQL Database

Niko Neugebauer is building a compendium of methods to migrate an on-prem database to Azure SQL Database:

I decided to put a list of the migration methods that can be useful for migrating to Azure SQLDatabase. By all means it is not complete and if you have any suggestions to expand it – do not be shy.

The current list of the ways that I am considering is here:

  • SQL Server Management Studio (SSMS)

  • BACPAC + SSMS/Portal/Powershell

  • SQL Azure Migration Wizard (SAMW)

  • SQL Server Data Tools (Visual Studio) + BCP/SSIS

  • Azure Data Factory

  • Transactional Replication

  • Linked Server

Read on for the details on each method.

Comments closed

Real-Time Power BI Dashboards

Reza Rad builds a real-time dashboard with Stream Analytics and Power BI:

IoT Devices or Applications can pass their data to Azure Event Hub, and Azure Event hub can be used as an input to Azure Stream Analytics (which is a data streaming Azure service). Then Azure stream analytics can pass the data from input based on queries to outputs. If Power BI be used as an output then a dataset in Power BI will be generated that can be used for real-time dashboard.

As a result anytime a new data point from application or IoT device comes through Event hubs, and then Stream Analytics, Power BI dashboard will automatically update with new information.

This is a pretty nice weekend project.

Comments closed

Azure SQL Database Size Quotas

Dimitri Furman discusses the MAXSIZE property on an Azure SQL Database:

Customers can use this ability to allow scaling down to a lower service objective, when otherwise scaling down wouldn’t be possible because the database is too large.

While this capability is useful for some customers, the fact that the actual size quota for the database may be different from the maximum size quota for the selected service objective can be unexpected, particularly for customers who are used to working with the traditional SQL Server, where there is no explicit size quota at the database level. Exceeding the unexpectedly low database size quota will prevent new space allocations within the database, which can be a serious problem for many types of applications.

One more thing to think about, I suppose.

Comments closed

Restoring An Azure SQL Database

Arun Sirpal discusses ways to restore a database within Azure SQL Database:

You won’t have the ability to use the same name of the restoring database and the database that you want to replace; if you try you get the screen shot below: To get around this I think you would need to drop the old one once the new one has restored then do a rename.

This is a big difference compared to the on-prem version, so be sure to practice this before you find yourself in a crisis.

Comments closed

Migrating To Azure SQL Database

Mike Fal discusses BACPACs, DACPACs, and migrating on-prem databases to Azure SQL Database:

SQL Server Data Tools(SSDT) have always had a process to extract your database. There are two types of extracts you can perform:

  • DACPAC – A binary file that contains the logical database schema and possibly the data. This file retains the platform version of the database (i.e. 2012, 2014, 2016).

  • BACPAC – A binary file that contains the logical database schema and the data as insert statements. This stores the platform version, but is not locked into it.

Mike also walks through SqlPackage.exe.

Comments closed