Press "Enter" to skip to content

Curated SQL Posts

Machine Learning Data Preparation Tips

Jen Underwood has some good tips for preparing data for a machine learning operation:

Data preparation for machine learning requires business domain expertise, bias awareness and an experimental thought process. Before preparing your data, you’ll first define a business problem to solve. During that exercise, you’ll select an outcome metric and brainstorm potential input variables that influence it from many varied perspectives. From there you will begin identifying, collecting, cleaning, shaping and sampling data to run through automated machine learning model processes.

Note that it is also not unusual for relevant machine learning input data to occur outside of existing transactional processes. If that is the case, you can still start creating a first-generation machine learning model with existing data and continue to build new model versions over time as supplementary data is acquired.

Click through for the ten tips.


Configuring And Monitoring Distributed AGs

Tracy Boggiano has some advice on configuring and monitoring distributed Availability Groups:

Monitoring can be tricky with distributed AGs because of the way they show up in SSMS. The distributed AG created on the primary AG does not give you a dashboard to monitor traffic like a typical AG does, and the secondary replicas do not even show the distributed AG. So, to monitor a distributed AG, your best bet is to head to the DMVs. The DMV sys.availability_groups has an is_distributed column that lets you see only the distributed AGs. Below is a query, adapted from the Microsoft Docs query, that adds additional information to check on the status of the distributed AGs; you must run it on the primary AG.
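Tracy’s full query is in the post; as a rough sketch of the shape, filtering on that column looks like this (the DMV and column names are the documented ones, but treat this as a starting point rather than her exact query):

-- Run on the primary AG; returns only distributed availability groups
SELECT ag.name AS ag_name,
       ag.is_distributed,
       ags.primary_replica,
       ags.synchronization_health_desc
FROM sys.availability_groups AS ag
    JOIN sys.dm_hadr_availability_group_states AS ags
        ON ags.group_id = ag.group_id
WHERE ag.is_distributed = 1;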

Click through for more details.


Legacy Cardinality Estimation In SQL Server

Kellyn Pot’vin-Gorman explains what the Legacy Cardinality Estimation setting does in SQL Server:

Oracle DBAs have used the CARDINALITY hint for some time, and it should be understood that while this may appear to be similar, it is actually quite different. As hinting in TSQL is a bit different from PL/SQL, we can compare similar queries to assist:

TSQL:

SELECT CustomerId, OrderAddedDate
FROM OrderTable
WHERE OrderAddedDate >= '2016-05-01'
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
GO

PL/SQL:

Where you might first mistake the CE hint for the following CARDINALITY hint in Oracle:

SELECT /*+ CARDINALITY(ORD,15000) */ ORD.CUSTOMER_ID, ORD.ORDER_DATE
FROM ORDERS ORD
WHERE ORD.ORDER_DATE >= '2016-05-01';

This would be incorrect; the closest hint in Oracle to SQL Server’s legacy CE hint is the optimizer_features_enable hint:

SELECT /*+ optimizer_features_enable('9.2.0.7') */ ORD.CUSTOMER_ID, ORD.ORDER_DATE
FROM ORDERS ORD
WHERE ORD.ORDER_DATE >= '2016-05-01';

If you’re wondering why I chose a 9i version to force the optimizer to, keep reading and you’ll come to understand.
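For reference, the hint’s database-wide counterpart is the database-scoped configuration setting, which turns the legacy estimator on for every query in the database:

ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;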

Read on for the comparative explanation as well as more details on SQL Server’s legacy cardinality estimator hint and database-scoped configuration setting.


Columnstore Partition Management–Dealing With Non-Empty Partitions

Dmitri Korotkevitch shows a way of dealing with non-empty partitions on columnstore indexes:

The common, by-the-book approach recommends dropping the columnstore index, splitting or merging the partitions, and recreating the index afterwards. As you can imagine, this leads to an extremely inefficient process with a huge amount of unnecessary overhead on large tables. After all, you have to drop and recreate the columnstore index, converting the table to a heap, when just a subset of the partitions needs to be rebuilt. Fortunately, you can minimize the overhead with a simple workaround:

  1. Switch the partition(s) to split or merge to a separate staging table.

  2. Split or merge the partition(s) in the main table. You can do that because the partitions will be empty after the previous step.

  3. Drop the columnstore index in the staging table, split/merge the partition(s) there, and recreate the index afterwards.

  4. Switch the partition(s) back from staging to the main table.
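A rough T-SQL sketch of those four steps — all object names (dbo.Orders, dbo.OrdersStaging, the pfMain/pfStaging partition functions) and boundary values here are hypothetical, not Dmitri’s:

-- Assumes dbo.Orders is partitioned on psMain/pfMain and dbo.OrdersStaging
-- is partitioned identically on psStaging/pfStaging, both with a
-- clustered columnstore index.

-- 1. Switch the affected partition out to the staging table.
ALTER TABLE dbo.Orders SWITCH PARTITION 5 TO dbo.OrdersStaging PARTITION 5;

-- 2. Split the now-empty partition in the main table.
ALTER PARTITION SCHEME psMain NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMain() SPLIT RANGE ('20180201');

-- 3. In staging: drop the columnstore index, split, then rebuild the index.
DROP INDEX CCI_OrdersStaging ON dbo.OrdersStaging;
ALTER PARTITION SCHEME psStaging NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfStaging() SPLIT RANGE ('20180201');
CREATE CLUSTERED COLUMNSTORE INDEX CCI_OrdersStaging
    ON dbo.OrdersStaging ON psStaging(OrderDate);

-- 4. Switch each resulting partition back into the main table.
ALTER TABLE dbo.OrdersStaging SWITCH PARTITION 5 TO dbo.Orders PARTITION 5;
ALTER TABLE dbo.OrdersStaging SWITCH PARTITION 6 TO dbo.Orders PARTITION 6;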

Read on for a detailed walkthrough of these steps.


Good Query Store Default Settings

Erin Stellato gives us a starting point for good values for Query Store settings:

QUERY_CAPTURE_MODE

The default value for SQL Server 2016 and SQL Server 2017 is ALL.  For Azure SQL Database, the default value is AUTO.

With AUTO, queries that are insignificant from a resource utilization perspective, or executed infrequently, are not captured.  If you need to capture queries that may only execute a few times, or those that use very few resources, then use ALL.  Otherwise, use AUTO, as this will capture the relevant majority of your workload.

There is a third option, NONE, where no new queries are captured.  Runtime and wait statistics will continue to be captured for queries that are already in Query Store.

I recommend setting this option to AUTO, as the number of queries in your environment that need tuning/your attention is a small percentage of the total number of queries that execute.  You won’t miss out on important data if you exclude queries that don’t use a lot of resources or don’t execute very often.
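To apply that recommendation, the documented ALTER DATABASE syntax is:

ALTER DATABASE CURRENT
SET QUERY_STORE (QUERY_CAPTURE_MODE = AUTO);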

Click through for good defaults for all nine settings.


Installing SQL Server 2017 Machine Learning Services

Ginger Grant shows how to install SQL Server 2017 Machine Learning Services:

There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and have no knowledge of what the load may be, start by selecting Machine Learning Services In-Database. There are several reasons why you want to select the In-Database option by default. One of the problems Microsoft was looking to solve by incorporating advanced data analytics was to improve the performance of the native code by greatly reducing data latency. If you are analyzing a lot of data stored within SQL Server, performance will be improved if the data does not need to be moved around on a network. The licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server, should be performed prior to the decision to install Machine Learning Server Standalone.
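One post-install step worth remembering for the In-Database option: external script execution is off by default and must be enabled (the sp_configure option below is the documented one; a service restart may also be required before it takes effect):

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

-- Quick smoke test: push a row through the R runtime and back
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT 1 AS ok;';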

Read the whole thing.


Looping Through Pester Tests

Rob Sewell shows how to iterate through a collection of Pester tests:

The problem with Test Cases is that we can only easily loop through one collection, but as Pester is just PowerShell, we can simply use ForEach if we want to loop through multiple ones, like instances and then databases.

I like to use the ForEach method, as it is slightly quicker than other methods. It only works with PowerShell version 4 and above; below that version, you need to pipe the collection to ForEach-Object.

Check it out for a good explanation of running groups of PowerShell tests.


Base Versus Simple Containment

Joe Obbish takes a crack at explaining the difference between base containment and simple containment for cardinality estimation:

We know that the first query will return 500k rows and the second query will return 0 rows. However, can SQL Server know that? Each statistics object only contains information about its own column. There’s no correlation between the UNIQUE_ID and MOD_FILTER columns, so there isn’t a way for SQL Server to know that the queries will return different estimates. The query optimizer can create an estimate based on the filters on the WHERE clause and on the histograms of the join columns, but there’s no foolproof way to do that calculation. The presence of the filters introduces uncertainty into the estimate, even with statistics that perfectly describe the data for each column. The containment assumption is all about the modeling assumption that SQL Server has to make to resolve that uncertainty.
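To make the setup concrete, the two queries under discussion have roughly this shape (the table names here are hypothetical; UNIQUE_ID and MOD_FILTER come from the excerpt):

-- Per the excerpt, the first query returns 500k rows and the second 0,
-- but per-column statistics give the optimizer no way to tell them apart.
SELECT *
FROM dbo.T1 AS t1
    JOIN dbo.T2 AS t2
        ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 0
    AND t2.MOD_FILTER = 0;

SELECT *
FROM dbo.T1 AS t1
    JOIN dbo.T2 AS t2
        ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 0
    AND t2.MOD_FILTER = 1;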

It’s an interesting post aimed at trying to get you to think like a simplified cardinality estimator.  SQL Server doesn’t behave exactly like this, but it’s a good mental reference point.


Querying Elasticsearch

Swatee Chand has a tutorial on querying Elasticsearch:

In Elasticsearch, the aggregations framework is responsible for providing aggregated data based on a search query. Aggregations can be composed together in order to build complex summaries of the data. For a better understanding, consider an aggregation as a unit of work: it develops analytic information over a set of documents available in Elasticsearch. Various types of aggregations are available, each with its own purpose and output. For simplification, they are generalized into four major families:

  1. Bucketing

    Each bucket is associated with a key and a document criterion. Whenever the aggregation is executed, all the bucket criteria are evaluated against every document. Each time a criterion matches, the document is considered to “fall in” the relevant bucket.

  2. Metric

    Metric aggregations are responsible for keeping track of and computing metrics over a set of documents.

  3. Matrix

    Matrix aggregations operate on multiple fields, producing a matrix result out of the values extracted from the requested document fields. Matrix aggregations do not support scripting.

  4. Pipeline

    Pipeline aggregations aggregate the output of other aggregations and their associated metrics.

If you deal with Elasticsearch (or have log data that you want to query through), this tutorial will give you an idea of what you can do.


Azure Functions Basics

Vincent-Philippe Lauzon explains the basics of Azure Functions:

In general, serverless refers to an economic model where we pay for the compute resources we use, as opposed to paying for “servers”.

Wait…  isn’t that what the Cloud is about?

Well, yes, on a macro-scale it is, but serverless brings it to a micro-scale.

In the cloud we can provision a VM, for example, run it for 3 hours and pay for 3 hours.  But we can’t pay for 5 seconds of compute on a VM because it won’t have time to boot.

A lot of compute services have a “server-full” model.  In Azure, for instance, a Web App comes in a number of instances.  Each instance has a VM associated with it.  We do not manage that VM, but we pay for its compute regardless of the number of requests it processes.

In a serverless model, we pay for micro-transactions.

This is the first part in a series and is aimed at giving a conceptual explanation.
