Author: Kevin Feasel

Using The Azure Data Science VM With GPUs

Jennifer Marsman has some tips and tricks around using the Azure Data Science Virtual Machine on an instance running with GPU support:

To get GPU support, you need both hardware with GPUs in a datacenter and the right software – namely, a virtual machine image that includes GPU drivers so you can use the GPU.

The biggest tip is to use the Deep Learning Virtual Machine! The provisioning experience has been optimized to filter to the options that support GPU (the NC series – see below), which makes it easier to set it up correctly.

Read on for the rest of the advice.

Running Hive LLAP As A YARN Service

Gour Saha, et al., demonstrate running Apache Hive LLAP as a YARN service:

Making LLAP a first-class YARN service also enables us to use some of the other powerful features in YARN that were added in Apache Hadoop 3.0 / 3.1; some of them are noted below.

  1. Advanced container placement scheduling such as affinity and anti-affinity. What Slider used to handle in a custom way is now a core first-class feature (YARN-6592).

  2. Rich APIs for users to fetch/query application details using Timeline Service V2 (YARN-2928 and YARN-5355).

  3. New and improved Services UI in YARN UI2 improving debuggability and log access.

  4. Continuous rolling log aggregation of long running containers (YARN-2443).

  5. Auto-restart of containers by NodeManagers (YARN-4725).

  6. Windowing and threshold based container health monitor (YARN-8122).

  7. In the future, we can also leverage YARN level rolling upgrades for containers and the service as a whole (YARN-7512 and YARN-4726).

Looks like it’s been a fruitful transition.

Preventing Server Manager From Loading

Steve Stedman shows how to prevent the Server Manager app from loading whenever you RDP into a Windows Server machine:

If you frequently connect to many different SQL Servers as I do, you are probably used to the Server Manager loading slowly when you log in with Remote Desktop.

The Server Manager has a bad reputation for taking up lots of CPU over time and possibly even bogging down a SQL Server when left open for days on end.

To prevent this from automatically loading, you can do the following to quickly disable it for your user session and your future user sessions.

Read on for the steps.

Auditing Options With Azure SQL Data Warehouse

Janusz Rokicki explores what is available in Azure SQL Data Warehouse when it comes to auditing:

Auditing is disabled by default and the UI experience depends on the region to which the logical server is deployed. For instance, in UK South, the portal offers no options to manage auditing:

In North Europe, the portal allows Table Auditing (table-storage based) to be enabled on the SQL Data Warehouse scope, but it isn’t possible to enable Blob Auditing:

On top of that, Blob Auditing behaves differently when enabled at the logical server level in different regions. In locations that support Table Auditing, turning on Blob Auditing automatically enables it in all databases, including SQL Data Warehouses—and that’s expected. In other regions, Blob Auditing is not automatically enabled and has to be turned on programmatically by calling the ARM REST API.

I imagine the plan is to support this across the board but it’s rolling out region by region.

User-Defined Restore Points In Azure SQL DW

Kevin Ngo announces a new feature in Azure SQL Data Warehouse:

Previously, SQL DW supported only automated snapshots guaranteeing an eight-hour recovery point objective (RPO). While this snapshot policy provided high levels of protection, customers asked for more control over restore points to enable more efficient data warehouse management capabilities leading to quicker times of recovery in the event of any workload interruptions or user errors.

Now, with user-defined restore points, in addition to the automated snapshots, you can initiate snapshots before and after significant operations on your data warehouse. With more granular restore points, you ensure that each restore point is logically consistent and limit the impact and reduce recovery time of restoring the data warehouse should this be needed. User-defined restore points can also be labeled so they are easy to identify afterwards.

Creating a user-defined restore point is a one-liner in PowerShell, and it’s something you could do after each warehouse load, for example.

Identity Columns And Linked Servers

Kenneth Fisher points out an oddity when inserting data across a linked server into a table with an identity column:

So far so good. Now let’s throw in a twist. Let’s call it through a linked server.

INSERT INTO [(local)\sql2014cs].Test.dbo.IdentTest 
	VALUES ('Col1','Col2');

Msg 213, Level 16, State 1, Line 4
Column name or number of supplied values does not match table definition.

Well that’s a bit odd, right? I mean I used that exact command in the previous test. Turns out that when you do an insert across a linked server that identity column is not ignored. Which means we just need to include the identity value right? Nope.

INSERT INTO [(local)\sql2014cs].Test.dbo.IdentTest 
	VALUES (1,'Col1','Col2');

Msg 7344, Level 16, State 1, Line 4
The OLE DB provider “SQLNCLI11” for linked server “(local)\sql2014cs” could not INSERT INTO table “[(local)\sql2014cs].[Test].[dbo].[IdentTest]” because of column “Id”. The user did not have permission to write to the column.

Click through to see how to do this.
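As a rough sketch of the usual workaround (Kenneth’s post has the authoritative version; the table name below comes from his example), explicitly naming the non-identity columns keeps the remote insert aligned with the table definition:

-- Sketch only: list the non-identity columns explicitly so the insert
-- across the linked server no longer expects a value for the Id column.
INSERT INTO [(local)\sql2014cs].Test.dbo.IdentTest (Col1, Col2)
	VALUES ('Col1','Col2');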

Visualizing Model Input Effects

Ilknur Kaynar Kabul shows us how to use partial dependence plots and individual conditional expectation plots to view the specific effect of an input variable on a model:

A partial dependence (PD) plot depicts the functional relationship between a small number of input variables and predictions. They show how the predictions partially depend on values of the input variables of interest. For example, a PD plot can show whether the probability of flu increases linearly with fever. It can show whether a high energy level will decrease the probability of having the flu. PD can also show the type of relationship, such as a step function, curvilinear, linear and so on.

The simplest PD plots are 1-way plots, which show how a model’s predictions depend on a single input. The plot below shows the relationship (according to the model that we trained) between price (target) and number of bathrooms. Here, we see that house prices increase as we increase the number of bathrooms up to 4. After that, it does not change the house price.

These types of plots are helpful for understanding the mechanics behind a model.

Methods For Detecting Anomalies In Business Metrics

Sergey Bryl’ gives us four methods for detecting anomalies in business data:

In this article, by business metrics, we mean numerical indicators we regularly measure and use to track and assess the performance of a specific business process. There is a huge variety of business metrics in the industry: from conventional to unique ones. The latter are specifically developed for and used in one company or even just by one of its teams. I want to note that usually, business metrics have dimensions, which imply the possibility of drilling down into the structure of the metric. For instance, the number of sessions on the website can have dimensions: types of browsers, channels, countries, advertising campaigns, etc. where the sessions took place. The presence of a large number of dimensions per metric, on the one hand, provides a comprehensive, detailed analysis and, on the other, makes conducting that analysis more complex.

Anomalies are abnormal values of business indicators. We cannot claim anomalies are something bad or good for business. Rather, we should see them as a signal that there have been some events that significantly influenced a business process and our goal is to determine the causes and potential consequences of such events and react immediately. Of course, from the business point of view, it is better to find such events than ignore them.

It was interesting comparing the results of the four methods. H/T R-bloggers.

Understanding Hash Match Aggregates

Itzik Ben-Gan continues his series on grouping and aggregating data by looking at the hash match aggregation process:

The estimated CPU cost for the Hash Aggregate in the plan for Query 8 is 0.166344, and in Query 9 is 0.16903.

It could be an interesting exercise to try and figure out exactly in what way the cardinality of the grouping set, the data types, and aggregate function used affect the cost; I just didn’t pursue this aspect of the costing. So, after making a choice of the grouping set and aggregate function for your query, you can reverse engineer the costing formula. For example, let’s reverse engineer the CPU costing formula for the Hash Aggregate operator when grouping by a single integer column and returning the MAX(orderdate) aggregate. The formula should be:

Operator CPU cost = <startup cost> + @numrows * <cost per row> + @numgroups * <cost per group>

Using the techniques that I demonstrated in the previous articles in the series, I got the following reverse engineered formula:

Operator CPU cost = 0.017749 + @numrows * 0.00000667857 + @numgroups * 0.0000177087

Definitely worth reading in detail.
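
To get a feel for what the reverse-engineered formula implies, here is a quick back-of-the-envelope check; the row and group counts below are purely illustrative and are not numbers from Itzik’s tests:

-- Illustrative only: plug hypothetical cardinalities into the
-- reverse-engineered formula to see which term dominates.
DECLARE @numrows FLOAT = 100000,
	@numgroups FLOAT = 500;

SELECT 0.017749
	+ @numrows * 0.00000667857
	+ @numgroups * 0.0000177087 AS EstimatedOperatorCpuCost;
-- Returns roughly 0.6945; at these cardinalities the per-row term dominates.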

What’s Coming With Always Encrypted?

Monica Rathbun explains a new feature coming to SQL Server:

As I discussed in part 3, there are many roadblocks that can stop the implementation of Always Encrypted (AE). In the currently available versions of SQL Server 2016 and 2017, along with Azure SQL Database, the cost of using AE was way too high for many companies. There are so many code changes needed to implement AE that moving to it is not cost-effective for them. Microsoft recognizes this and has found a better way to handle things like aggregations, range comparisons, LIKE predicates, ORDER BYs, and other search criteria with the introduction of Secure Enclaves. For the client discussed in parts 1-3, this will make all the difference.

Per MSDN “An enclave is a protected region of memory that acts as a trusted execution environment. An enclave appears as a black box to the containing process and to other processes running on the machine. There is no way to view the data or the code inside the enclave from the outside, even with a debugger.”

If that’s a bit confusing, check out Monica’s explanation as well.
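
To make that more concrete, here is a rough sketch of the query shapes secure enclaves are meant to enable over encrypted columns. The table, columns, and values below are hypothetical; the columns would need enclave-enabled keys using randomized encryption, and in practice the comparison values arrive as parameters encrypted by the client driver:

-- Hypothetical illustration only: assumes dbo.Employees exists and that Salary
-- and LastName are protected with enclave-enabled column encryption keys.
DECLARE @MinSalary MONEY = 50000,
	@NamePattern NVARCHAR(60) = N'Rath%';

SELECT EmployeeID, LastName, Salary
FROM dbo.Employees
WHERE Salary > @MinSalary          -- range comparison on an encrypted column
	AND LastName LIKE @NamePattern -- LIKE predicate on an encrypted column
ORDER BY Salary;                   -- ordering by an encrypted column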
