Category: Cloud

Azure SQL Data Warehouse Setup

Arun Sirpal configures a new instance of Azure SQL Data Warehouse:

The information shown here is the DSQL (Distributed SQL) plan. When you send a SQL query to SQL Data Warehouse, the Control node processes the query, converts the code to DSQL, and then sends the commands to run on each of the compute nodes.

The returned query plan depicts sequential SQL statements; when the query runs, it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at https://msdn.microsoft.com/en-us/library/mt631615.aspx.
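
If you want to see the DSQL plan for a query without running it, Azure SQL Data Warehouse supports prefixing the query with EXPLAIN, which returns the distributed plan as XML. A minimal sketch; dbo.FactSales is a hypothetical distributed table:

    -- Returns the distributed (DSQL) plan as XML instead of executing the query.
    -- dbo.FactSales is a hypothetical distributed table.
    EXPLAIN
    SELECT ProductKey,
           SUM(SalesAmount) AS TotalSales
    FROM dbo.FactSales
    GROUP BY ProductKey;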

Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.

Azure SQL Data Warehouse Architecture

Warner Chaves looks at system views in Azure SQL Data Warehouse:

Unlike the sys.dm_exec_requests view in SQL Server, the sys.dm_pdw_exec_requests view actually keeps up to 10,000 records with information about a request even after it has executed. This capability is very useful, as you can track specific query executions as long as their records are still among the 10,000 kept by the view. As time passes, the oldest records are phased out in favor of more recent ones.

This is an interesting look at some of the differences between Azure SQL Data Warehouse and a “normal” SQL Server installation. Good reading.
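
As a quick sketch of how you might query this view, the following pulls the most recent requests; the columns shown are documented on the DMV, but verify them in your environment:

    -- Most recent requests first; the view retains up to 10,000 rows.
    SELECT TOP 20
           request_id,
           status,
           submit_time,
           total_elapsed_time,
           command
    FROM sys.dm_pdw_exec_requests
    ORDER BY submit_time DESC;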

Kinesis Analytics

Ryan Nienhuis shows how to use Amazon Kinesis Analytics:

As I covered in the first post, streaming data is continuously generated; therefore, you need to specify bounds when processing data to make your result set deterministic. Some SQL statements operate on individual rows and have natural bounds, such as a continuous filter that evaluates each row based upon a defined SQL WHERE clause. However, SQL statements that process data across rows need to have set bounds, such as calculating the average of a particular column. The mechanism that provides these bounds is a window.

Windows are important because they define the bounds for which you want your query to operate. The starting bound is usually the current row that Amazon Kinesis Analytics is processing, and the window defines the ending bound.

Windows are required with any query that works across rows, because the in-application stream is unbounded and windows provide a mechanism to bind the result set and make the query deterministic. Analytics supports three types of windows: tumbling, sliding, and custom.

The concepts here are very similar to Azure’s Stream Analytics.
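
To make the tumbling window concrete, here is a minimal sketch in the Kinesis Analytics SQL dialect, along the lines of the service's sample applications; the stream names mirror the service defaults, and ticker_symbol/price are hypothetical columns:

    -- Average price per ticker over one-minute tumbling windows.
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        ticker_symbol VARCHAR(4),
        avg_price     DOUBLE
    );

    CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM ticker_symbol,
                  AVG(price) AS avg_price
    FROM "SOURCE_SQL_STREAM_001"
    -- FLOOR(ROWTIME TO MINUTE) sets the one-minute tumbling window bounds.
    GROUP BY ticker_symbol,
             FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);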

Improving HBase Cluster Restart Time

Nitin Verma explains how to re-create an HBase cluster a bit faster:

When the flush ‘table’ operation is triggered, all the regions belonging to that table flush independently. Once the HFile corresponding to a region is flushed, it records the max sequence id in its metadata and notifies the WAL corresponding to the region server. The WAL maintains a mapping table of regions and their corresponding flushed sequence ids. When the HBase cluster restarts, the HMaster distributes the flushed sequence ids per region to the recovery threads splitting the WAL, so that they can skip the edits which have already been persisted in HFiles.

This is particularly important for clusters which frequently spin up and down, a feature of Platform-as-a-Service solutions like HDInsight.

S3 Or EBS?

Devadutta Ghat, et al, compare Amazon S3 versus Elastic Block Storage (EBS) on the basis of cost and Apache Impala performance:

EBS is attached to the AWS compute node as a fully functional filesystem (similar to an attached SSD on an on-premises node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching them again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. However, with S3, throughput is better than simple network-attached storage because of its dedicated, high-performance networks. In Cloudera’s internal benchmark testing (detailed below), on an r3.2xlarge, we saw a consistent throughput of about 100MB/s. Furthermore, in S3, there is currently no equivalent to HDFS short-circuit reads. Move/rename operations for data stored in S3 are a copy followed by a delete, while a file move on HDFS is a metadata operation; the S3 behavior is usually problematic for ETL workloads, as they create a large number of small files that are typically moved.

It looks like EBS is a solid choice for many workloads.

Azure Automation

Melissa Coates explains Azure Automation:

Azure Automation is a cloud service in Microsoft Azure which lets you schedule execution of PowerShell cmdlets and PowerShell workflows. Azure Automation uses the concept of runbooks to execute a set of repeatable, repetitive tasks via PowerShell. Consistency in execution, reduction of errors, and of course saving time are all key objectives – which makes DBAs and system admins happy, eh?

This is a higher-level discussion including some good tips on the product.

Migrating To Azure SQL Database

Niko Neugebauer is building a compendium of methods to migrate an on-prem database to Azure SQL Database:

I decided to put together a list of the migration methods that can be useful for migrating to Azure SQL Database. It is by no means complete, and if you have any suggestions to expand it, do not be shy.

The current list of the ways that I am considering is here:

  • SQL Server Management Studio (SSMS)
  • BACPAC + SSMS/Portal/Powershell
  • SQL Azure Migration Wizard (SAMW)
  • SQL Server Data Tools (Visual Studio) + BCP/SSIS
  • Azure Data Factory
  • Transactional Replication
  • Linked Server

Read on for the details on each method.
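
As one small illustration, here is a hedged sketch of the linked server route from the list above; the server, database, and login names are hypothetical, and the provider choice is an assumption, so adapt it to your environment:

    -- Point a linked server at the Azure SQL Database (names are examples).
    EXEC sp_addlinkedserver
        @server     = N'AzureSqlDb',
        @srvproduct = N'',
        @provider   = N'SQLNCLI',
        @datasrc    = N'myserver.database.windows.net',
        @catalog    = N'TargetDb';

    -- Map a local login to a SQL authentication login on the Azure side.
    EXEC sp_addlinkedsrvlogin
        @rmtsrvname  = N'AzureSqlDb',
        @useself     = 'FALSE',
        @locallogin  = NULL,
        @rmtuser     = N'migration_user',
        @rmtpassword = N'<password>';

    -- Push data across using four-part naming.
    INSERT INTO AzureSqlDb.TargetDb.dbo.MyTable (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.MyTable;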

Real-Time Power BI Dashboards

Reza Rad builds a real-time dashboard with Stream Analytics and Power BI:

IoT devices or applications can pass their data to Azure Event Hubs, and Azure Event Hubs can be used as an input to Azure Stream Analytics (a data streaming Azure service). Azure Stream Analytics can then pass the data from the input to outputs based on queries. If Power BI is used as an output, a dataset will be generated in Power BI that can be used for a real-time dashboard.

As a result, any time a new data point from an application or IoT device comes through Event Hubs and then Stream Analytics, the Power BI dashboard will automatically update with the new information.

This is a pretty nice weekend project.
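
The Stream Analytics piece in the middle is just a SQL-like query that reads from the Event Hub input and writes to the Power BI output. A minimal sketch, assuming input/output aliases of eventhub-input and powerbi-output and hypothetical deviceId and temperature fields:

    -- Ten-second tumbling-window average per device, pushed to the Power BI dataset.
    SELECT
        deviceId,
        AVG(temperature) AS avg_temperature,
        System.Timestamp AS window_end
    INTO [powerbi-output]
    FROM [eventhub-input]
    GROUP BY deviceId, TumblingWindow(second, 10)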

Azure SQL Database Size Quotas

Dimitri Furman discusses the MAXSIZE property on an Azure SQL Database:

Customers can use this ability to allow scaling down to a lower service objective, when otherwise scaling down wouldn’t be possible because the database is too large.

While this capability is useful for some customers, the fact that the actual size quota for the database may be different from the maximum size quota for the selected service objective can be unexpected, particularly for customers who are used to working with the traditional SQL Server, where there is no explicit size quota at the database level. Exceeding the unexpectedly low database size quota will prevent new space allocations within the database, which can be a serious problem for many types of applications.

One more thing to think about, I suppose.
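
For reference, a quick sketch of checking and changing the quota; MyDatabase and the 250 GB figure are example values, and valid MAXSIZE settings depend on the service tier:

    -- Current size quota, in bytes, for the current database.
    SELECT DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') AS max_size_in_bytes;

    -- Lower the quota, e.g. ahead of scaling down to a smaller service objective.
    -- MyDatabase and 250 GB are example values.
    ALTER DATABASE [MyDatabase] MODIFY (MAXSIZE = 250 GB);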
