Press "Enter" to skip to content

Category: Cloud

Auditing Options With Azure SQL Data Warehouse

Janusz Rokicki explores what is available in Azure SQL Data Warehouse when it comes to auditing:

Auditing is disabled by default and the UI experience depends on the region to which the logical server is deployed. For instance, in UK South, the portal offers no options to manage auditing:

In North Europe, the portal allows Table Auditing (table-storage based) to be enabled on the SQL Data Warehouse scope, but it isn’t possible to enable Blob Auditing:

On top of that, Blob Auditing behaves differently when enabled on a logical server level in different regions. In locations that support Table Auditing, turning on Blob Auditing automatically enables it in all databases, including SQL Data Warehouses—and that’s expected. In other regions, Blob Auditing is not automatically enabled and has to be turned on programmatically by calling ARM REST API.

I imagine the plan is to support this across the board but it’s rolling out region by region.

Comments closed

User-Defined Restore Points In Azure SQL DW

Kevin Ngo announces a new feature in Azure SQL Data Warehouse:

Previously, SQL DW supported only automated snapshots guaranteeing an eight-hour recovery point objective (RPO). While this snapshot policy provided high levels of protection, customers asked for more control over restore points to enable more efficient data warehouse management capabilities leading to quicker times of recovery in the event of any workload interruptions or user errors.

Now, with user-defined restore points, in addition to the automated snapshots, you can initiate snapshots before and after significant operations on your data warehouse. With more granular restore points, you ensure that each restore point is logically consistent and limit the impact and reduce recovery time of restoring the data warehouse should this be needed. User-defined restore points can also be labeled so they are easy to identify afterwards.

Creating a user-defined restore point is a one-liner in Powershell, and it’s something you could do after each warehouse load, for example.

Comments closed

Azure Data Lake Analytics Updates

Michael Rys has a boatload of new updates for Azure Data Lake:

The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!

In addition, since the fast file set feature now has been generally released, we can consume hundreds of thousands of such files in bulk in a single EXTRACT statement. We will publish a blog at a later date to give you much more detailed information on how this capability helps you to process so many files efficiently in a scalable way.

Important aspects of processing files at scale include:

  1. the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum –and now it is in private preview!

  2. the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300MB to 4GB is a good range), but often, your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) are not able to reach that size. Thus, we are adding the ability to group several files into a vertex to increase efficiency and lower cost of your job (we have seen 10 to 30 times improvement in some customer jobs!).

Read on for the full changelog.

Comments closed

Flattening JSON Data With Databricks

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.

Comments closed

Figuring Out Azure Analysis Services Costs

Chris Webb explains that Azure Analysis Services might not be quite as expensive as you’d first think:

What does this mean for the cost of Azure Analysis Services? Basically, if you’re taking advantage of these features you won’t pay one of the monthly prices quoted on the pricing page linked to at the top of this post. Instead you may do things like:

  • Scale up for one hour every day when you need to process your SSAS database, just to get the extra memory and QPUs needed, then scale down when processing has finished
  • Scale out only on certain days, or certain times of day, to handle increased numbers of users
  • Pause your instance when you are sure that no-one needs to run queries

How do you then calculate the likely cost? For my Azure Analysis Services precon at SQLBits a few months ago I built an Excel workbook that shows how to go about this.

There are some good questions in the comments section, so check those out as well.

Comments closed

Restoring Point-In-Time To Another Azure SQL Managed Instance

Jovan Popovic announces an improvement to Azure SQL Database Managed Instances:

Azure SQL Database Managed Instance enables you to create a database as a copy of another database at some point in time in the past. This is known as point-in-time restore feature, and up till now you could perform point-in-time restore only within the same instance.

The latest release of Azure SQL Database Managed Instance enables you to perform point-in-time restore of a database from one instance to another. This might be useful if you need to be sure that you could easily restore a database to another instance if there is some issue on the original instance, or if you need a database for testing or auditing purposes on the test instance and you want to use copy of some of the existing database on another server.

Click through for the current requirements and limitations, as well as a sample.

Comments closed

Polybase Rejected Row Location

Casey Karst announces a nice improvement to Polybase on Azure SQL Data Warehouse:

Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality clean data is a luxury that is not always available. In those cases you need to know which rows failed to load and why.

In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.

This is a big improvement, one that I hope to see on the on-prem product.

Comments closed

Event Hub Performance Tips

Vincent-Philippe Lauzon has a few tips for improving Azure Event Hub performance:

Here are some recommendations in the light of the performance and throughput results:

  • If we send many events:  always reuse connections, i.e. do not create a connection only for one event.  This is valid for both AMQP and HTTP.  A simple Connection Pool pattern makes this easy.
  • If we send many events & throughput is a concern:  use AMQP.
  • If we send few events and latency is a concern:  use HTTP / REST.
  • If events naturally comes in batch of many events:  use batch API.
  • If events do not naturally comes in batch of many events:  simply stream events.  Do not try to batch them unless network IO is constrained.
  • If a latency of 0.1 seconds is a concern:  move the call to Event Hubs away from your critical performance path.

Let’s now look at the tests we did to come up with those recommendations.

Read the whole thing.

Comments closed

Databricks MLflow

Matai Zaharia announces a new Databricks offering:

MLflow is inspired by existing ML platforms, but it is designed to be open in two senses:

  1. Open interface: MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.
  2. Open source: We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organizations if you wish to open source your code.

Mlflow is still currently in alpha, but we believe that it already offers a useful framework to work with ML code, and we would love to hear your feedback. In this post, we’ll introduce MLflow in detail and explain its components.

Even in alpha, it looks nice.

Comments closed

The Basics Of Azure Stream Analytics

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.

Comments closed