Curated SQL – Page 1514 – A Fine Slice Of SQL Server

Basics Of Azure SQL Data Warehouse

Published 2017-08-29 by Kevin Feasel

Minette Steynberg has an article introducing Azure SQL Data Warehouse:

Azure SQL DW is best used for analytical workloads that make use of large volumes of data and need to consolidate disparate data into a single location.

Azure SQL DW has been specifically designed to deal with very large volumes of data. In fact, if there is too little data it may perform poorly because the data is distributed. You can imagine that if you had only 10 rows per distribution, the cost of consolidating the data would far outweigh the benefit gained by distributing it.

SQL DW is a good place to consolidate disparate data, transform, shape and aggregate it, and then perform analysis on it. It is ideal for running burst workloads, such as month end financial reporting etc.

Azure SQL DW should not be used when small row by row updates are expected as in OLTP workloads. It should only be used for large scale batch operations.

Azure SQL Data Warehouse is fantastic when you’ve got a setup like the one above and are willing to pay a premium to make things faster. And with appropriately distributed data, it certainly does get faster.
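
As a rough sketch of what "appropriately distributed" might look like, the statement below hash-distributes a fact table on a join key. The table and column names are hypothetical and do not come from Minette's article. Azure SQL DW spreads every table across 60 distributions, which is why a table like this only pays off once it holds a great many rows.

```sql
-- Hypothetical fact table; names are illustrative only.
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT         NOT NULL,
    CustomerId  INT            NOT NULL,
    SaleDate    DATE           NOT NULL,
    SaleAmount  DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),   -- rows with the same CustomerId land in the same distribution
    CLUSTERED COLUMNSTORE INDEX        -- a common choice for large analytical tables
);

-- Show row counts and space used per distribution to check for skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```

If the per-distribution row counts come back tiny or badly skewed, that is a hint the table is too small for this layout or that the distribution column is a poor fit.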

Comments closed

Sizing Memory-Optimized Workloads

Published 2017-08-29 by Kevin Feasel

Prashanth Purnananda gives us a few notes regarding memory-optimized table sizes:

Recovering a database with memory-optimized tables involves hydrating the contents of checkpoint files (data/delta files) into memory and then replaying the tail of the log (see this link for more details). One of the important differences between disk-based tables and the memory-optimized store is the frequency of checkpoints. Automatic checkpointing for in-memory tables occurs every 1.5 GB of log records, unlike traditional or indirect checkpoints (where checkpointing is done more often), leading to a longer tail of the log for in-memory tables. The 1.5 GB log flush is chosen to strike the right balance between flooding the IO subsystem with too many small, inefficient IO operations and too few large IO operations. In most scenarios observed by our CSS teams, long recovery times for memory-optimized databases are caused by the long tail of the log which needs to be recovered for in-memory tables in the database. For these scenarios, running a manual checkpoint before a restart can reduce recovery times, as a manual checkpoint forces the checkpoint for memory-optimized tables in addition to disk-based tables.

If you’re looking at creating memory-optimized tables, these are important administrative notes.
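
The manual checkpoint mentioned in the excerpt is a one-liner. This is a minimal sketch that assumes a planned restart and a hypothetical database name; it is not taken from the linked post.

```sql
-- Hypothetical database name; substitute your own.
USE InMemoryDemo;
GO

-- A manual checkpoint covers memory-optimized tables as well as disk-based ones,
-- which shortens the tail of the log that recovery has to replay.
CHECKPOINT;
GO
```

Running this shortly before a planned restart is the scenario the excerpt describes; it does nothing for an unplanned failure.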


Azure SQL Database Compatibility Level Change

Published 2017-08-29 by Kevin Feasel

Joe Sack reports that new Azure SQL Databases will have a compatibility level of 140 by default:

Once this new database compatibility default goes into effect, if you still wish to use database compatibility level 130 (or lower), please follow the instructions detailed here: View or Change the Compatibility Level of a Database. For example, you may wish to ensure that new databases created in Azure SQL Database use the same compatibility level as other databases in Azure SQL Database to ensure consistent query optimization behavior across development, QA and production versions of your databases. We recommend that database configuration scripts explicitly designate COMPATIBILITY_LEVEL rather than rely on the defaults, in order to ensure consistent application behavior.

For new databases supporting new applications, we recommend using the latest compatibility level (140). For pre-existing databases running at lower compatibility levels, the recommended workflow for upgrading the query processor to a higher compatibility level is detailed in the article, Change the Database Compatibility Mode and Use the Query Store. Note that this article refers to compatibility level 130 and SQL Server, but the same methodology applies for moves to 140 for SQL Server and Azure SQL DB.

It’s good to hear, and as Joe mentions, you have the ability to move back down to 130 if you need it.
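
Setting the level explicitly in a configuration script, as the excerpt recommends, might look something like this. The database name is a placeholder I made up.

```sql
-- Hypothetical database name; substitute your own.
-- Pin the compatibility level rather than relying on the service default.
ALTER DATABASE [MyAppDb] SET COMPATIBILITY_LEVEL = 130;

-- Confirm the current setting.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'MyAppDb';
```

Using the same explicit value in development, QA, and production scripts keeps the query optimizer behavior consistent across environments.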


Selecting Into Tables, Sans Identity

Published 2017-08-29 by Kevin Feasel

Kenneth Fisher shows a couple of ways to remove an identity property from a column when creating a new table:

A while back I did a post about creating an empty table using a SELECT statement. Basically doing something like this:

SELECT TOP 0 * INTO tableNameArchive FROM tableName

will create a new table with the exact same structure as the source table. It can be a really handy way to create an archive table, a temp table, etc. You don’t create any of the extra objects (indexes, triggers, constraints etc) but what you do end up with is every table property from the original table. This includes datatypes, nullability, and (as I’m sure you realized from the title) IDENTITY. Which if you are creating an archive table, a temp table, etc is probably not something you want. Fortunately, there are two easy ways to get rid of the identity.

Click through to see those two methods.
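
For a sense of the behavior, here is a small sketch. The UNION ALL variant relies on the documented rule that SELECT INTO does not carry the identity property over when multiple SELECT statements are combined with UNION; it may or may not be one of the two methods in Kenneth's post. The table and column definitions are assumptions for illustration.

```sql
-- Hypothetical source table with an identity column.
CREATE TABLE dbo.tableName
(
    Id      INT IDENTITY(1, 1) NOT NULL,
    Payload VARCHAR(50)        NULL
);

-- Plain SELECT INTO: the new table inherits the IDENTITY property.
SELECT TOP 0 * INTO dbo.tableNameArchive FROM dbo.tableName;

-- SELECT INTO over a UNION ALL: the identity property is not carried over.
SELECT TOP 0 * INTO dbo.tableNameArchiveNoIdentity FROM dbo.tableName
UNION ALL
SELECT TOP 0 * FROM dbo.tableName;

-- Compare the two results: 1 means the column is an identity, 0 means it is not.
SELECT
    COLUMNPROPERTY(OBJECT_ID('dbo.tableNameArchive'), 'Id', 'IsIdentity')           AS ArchiveIsIdentity,
    COLUMNPROPERTY(OBJECT_ID('dbo.tableNameArchiveNoIdentity'), 'Id', 'IsIdentity') AS NoIdentityIsIdentity;
```

If the behavior matches the documentation, the first column should come back as 1 and the second as 0.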


Early Thoughts On New AMD CPUs

Published 2017-08-29 by Kevin Feasel

Glenn Berry talks about the new AMD processor lines and how they might work with SQL Server:

AMD is really pushing the idea of a single-socket EPYC system as a better alternative to a two-socket Intel system for many server workloads. According to AMD, it will be much less expensive, yet will have plenty of cores, memory, and PCIe 3.0 lanes, along with no NUMA overhead. One key advantage AMD is touting is their Infinity Fabric modular interconnect technology, that works both within a single processor and between multiple processors.

For SQL Server 2016/2017 usage, you would still want the “top of the line” SKU for a given physical core count, to get the most performance for each physical core license that you buy. Unlike Intel, AMD does not increase the base clock speed in the lower core count models. These EPYC systems have a lot of PCIe 3.0 lanes and very high memory density, so they might work really well for large SQL Server DW/Reporting workloads. For OLTP workloads, the key will be how much single-threaded performance AMD is able to get from this first-generation of EPYC, and how they compare to Intel’s new Skylake-SP processors. Figure 3 shows the fastest EPYC processor at each core count, which is what you would want for SQL Server usage.

There aren’t too many hard numbers yet, but the worst case scenario is that they force Intel to improve their offerings.


Learning Spark Structured Streaming

Published 2017-08-28 by Kevin Feasel

Jules Damji has a nice compendium of links and additional resources for people wanting to learn more about Apache Spark’s Structured Streaming:

Structured Streaming In Apache Spark: A new high-level API for streaming

Databricks’ engineers and Apache Spark committers Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin expound on why streaming applications are difficult to write, and how Structured Streaming addresses all the underlying complexities.

There’s quite a bit of reading material on the other side.


Simpson’s Paradox Explained

Published 2017-08-28 by Kevin Feasel

Mehdi Daoudi, et al, have a nice explanation of Simpson’s Paradox:

E.H. Simpson first described the phenomenon of Simpson’s paradox in 1951. The actual name “Simpson’s paradox” was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.

The authors also provide a helpful example with operational metrics, showing how aggregating the data leads to an opposite (and invalid) conclusion.
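
To make the reversal concrete, here is a small made-up data set (not the authors' example) of response times for two releases across two regions. The query returns the average per release and region, plus the average per release overall.

```sql
-- All figures below are invented for illustration.
WITH metrics AS
(
    SELECT *
    FROM (VALUES
        ('old', 'US',   900,  90000),
        ('old', 'Asia', 100,  40000),
        ('new', 'US',   100,   9000),
        ('new', 'Asia', 900, 342000)
    ) AS v (release_name, region, request_count, total_ms)
)
SELECT
    release_name,
    region,                                        -- NULL on the overall rows
    GROUPING(region) AS is_overall,                -- 1 marks the per-release totals
    1.0 * SUM(total_ms) / SUM(request_count) AS avg_ms
FROM metrics
GROUP BY GROUPING SETS ((release_name, region), (release_name))
ORDER BY is_overall, region, release_name;
```

Within each region the new release is faster (90 ms versus 100 ms in the US, 380 ms versus 400 ms in Asia), yet its overall average is 351 ms against 130 ms for the old release, because most of the new release's traffic comes from the slower region.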


Gradient Boosting In R

Published 2017-08-28 by Kevin Feasel

Anish Singh Walia walks us through a gradient boosting exercise using R:

An important thing to remember in boosting is that the base learner being boosted should not be a complex learner with high variance, e.g., a neural network with lots of nodes and high weight values. For such learners, boosting will have inverse effects.

So I will explain boosting with respect to decision trees in this tutorial because they can be regarded as weak learners most of the time. We will generate a gradient boosting model.

Click through for more details. H/T R-Bloggers


Columnstore Predicate Pushdown Tipping Point

Published 2017-08-28 by Kevin Feasel

Lonny Niederstadt is hot on the trail, looking for evidence of a tipping point for COUNT(*) aggregates performing pushdown against a clustered columnstore index:

Below is what we want post-execution plans to look like when counting rows in a range – the thin arrow coming out of the Columnstore scan is a hint that predicate pushdown was successful. I didn’t specify MAXDOP in a query hint, and Resource Governor isn’t supplying MAXDOP; MAXDOP 8 is coming from the Database Scoped Configuration. The degree of parallelism turns out to be a significant factor in determining the tipping point. The [key] column is a BigInt. Maybe it’s surprising that I’m using 27213.9 as the upper bound. But… check out the estimated number of rows 🙂 Again – this estimate is coming from the Legacy CE, specified in the database scoped configuration for my database.

Interesting findings, although it looks like the specific values are going to be more settings-dependent than the usual finding of this nature.
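
A setup along the lines Lonny describes might be sketched as follows. The table name and the lower bound of the range are my assumptions; only the MAXDOP value, the Legacy CE setting, the BIGINT [key] column, and the 27213.9 upper bound come from the excerpt.

```sql
-- Settings described in the excerpt, applied at the database level.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 8;
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Hypothetical clustered columnstore table with a BIGINT [key] column.
-- The lower bound of 1 is an assumption, not a value from the post.
SELECT COUNT(*) AS row_count
FROM dbo.cci_demo
WHERE [key] BETWEEN 1 AND 27213.9;
```

Comparing the actual execution plan for a few different upper bounds would be one way to look for the tipping point on your own hardware and settings.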


On Global Temp Tables

Published 2017-08-28 by Kevin Feasel

Denis Gobo riffs on global temp tables:

SQL Azure has added something called database scoped global temporary tables.

Azure SQL Database supports global temporary tables that are also stored in tempdb and scoped to the database level. This means that global temporary tables are shared for all users’ sessions within the same Azure SQL database. User sessions from other Azure SQL databases cannot access global temporary tables.

The way you add these is by using a double pound sign (##) as the prefix on the table name.

Click through for more. I will occasionally use a global temp table, mostly in conjunction with Central Management Server queries going over a set of databases on each instance, but I limit my usage to manual operations and nothing as part of a normal application or administrative process.
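
For reference, a minimal sketch of a global temp table in use. The table name, columns, and values are hypothetical.

```sql
-- Session 1: create and fill a global temp table (hypothetical name and columns).
CREATE TABLE ##DatabaseSizes
(
    DatabaseName SYSNAME        NOT NULL,
    SizeMB       DECIMAL(18, 2) NOT NULL
);

INSERT INTO ##DatabaseSizes (DatabaseName, SizeMB)
VALUES (N'SalesDb', 1024.00);

-- Session 2 (any other connection that can see the table): read it by the same name.
SELECT DatabaseName, SizeMB
FROM ##DatabaseSizes;

-- Clean up explicitly rather than waiting for the creating session to end.
DROP TABLE ##DatabaseSizes;
```

On Azure SQL Database, per the excerpt, the second session would need to be connected to the same database to see the table.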

