SQL Scotsman is working on a very interesting series on statistics and the different cardinality estimators. So far, this is a three-part series. Part one is an overview:
A few of those assumptions changed in the new SQL Server 2014/2016 CE, namely:
Independence becomes Correlation: In the absence of existing multi-column statistics, the legacy CE views the distribution of data contained across different columns as uncorrelated with one another. This assumption of independence often does not reflect the reality of a typical SQL Server database schema, where implied correlations do actually exist. The new CE uses an increased correlation assumption for multiple predicates and an exponential backoff algorithm to derive cardinality estimates (a quick sketch follows below).
Simple Join Containment becomes Base Join Containment: Under the legacy CE, the assumption is that non-join predicates are somehow correlated, which is called “Simple Containment”. For the new Cardinality Estimator, these non-join predicates are assumed to be independent (called “Base Containment”), which can translate into a reduced row estimate for the join. At a high level, the new CE derives the join selectivity from the base-table histograms without first scaling them down by the associated filter predicates; the selectivity of the non-join filters is applied afterward.
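To make the backoff change concrete, here is a minimal sketch; the table, columns, and literals are hypothetical, and the comments describe how each model combines the predicate selectivities.

-- Hypothetical two-predicate query; City and PostCode are strongly correlated.
SELECT PersonID, City, PostCode
FROM dbo.Person
WHERE City = 'Aberdeen'
  AND PostCode = 'AB10 1AA';

-- Legacy CE (independence):      estimate = table rows * sel(City) * sel(PostCode)
-- New CE (exponential backoff):  estimate = table rows * sel1 * SQRT(sel2)
-- where sel1 is the most selective predicate; a third predicate would contribute its
-- fourth root, a fourth its eighth root, so correlated filters no longer multiply
-- the estimate down as aggressively.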
When To Use Trace Flag 9481
Query Scope: You’ve moved (migrated/upgraded) to SQL Server 2014/2016, your databases are at compatibility level 120/130 and using the new CE, and your workload is performing well overall, but there are a few regressions where a small number of queries actually perform worse. Use Trace Flag 9481 on a per-query basis as a temporary measure until you can tune or rewrite the query so it performs well without the hint.
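For reference, the per-query form is the QUERYTRACEON hint; the query below is a made-up illustration rather than one from the article.

-- Compile just this statement under the legacy (pre-2014) cardinality estimator.
SELECT OrderID, CustomerID, OrderDate
FROM dbo.SalesOrders
WHERE OrderDate >= '2016-01-01'
OPTION (QUERYTRACEON 9481);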
The problem with lowering the database compatibility level is that you can’t leverage the new engine functionality available under the latest compatibility level.
This problem was solved in SQL Server 2016 with the introduction of Database Scoped Configurations, which gives you the ability to make several database-level configuration changes for properties that were previously configured at the instance level. In particular, the LEGACY_CARDINALITY_ESTIMATION database scoped configuration allows you to set the cardinality estimation model independent of the database compatibility level. This option allows you to leverage all new functionality provided with compatibility level 130 but still use the legacy CE on the off chance that the latest CE causes severe query regressions across your workload.
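The database-scoped switch itself is a one-liner, run in the context of the database in question:

-- Keep compatibility level 130 but use the legacy cardinality estimator for this database.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Switch back to the new CE once any regressions have been tuned away.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF;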
The article on statistics is quite long for a blog post and a great read. I’m looking forward to reading more.
This post contains a list of various methods that can be used to process (i.e., load data into) an Azure AS tabular model. As you will see, not much has changed from the regular on-premises version (which is a very good thing, as it softens the learning curve).
Read on if you’re looking at putting an Analysis Services model into Azure.
The DATA_SOURCE and FILE_FORMAT options are easy: pick your external data source and external file format of choice.
The last major section deals with rejection. We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict checks of structure before inserting records. The Hadoop mindset is that there are two places in which you can perform data quality checks: in the original client (pushing data into HDFS) and in any clients reading data from HDFS. To make things simpler for us, the PolyBase engine will outright reject any records which do not adhere to the quality standards you define when you create the table. For example, let’s say that we have an Age column for each of our players, and that each age is an integer. If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail. PolyBase rejects this row (removing it from the result set stream) and increments a rejection counter. What happens next depends upon the reject options.
Creating an external table is pretty easy once you have the foundation prepared.
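As a rough sketch of how those pieces fit together (all object names here are made up, and the external data source and file format are assumed to exist already):

-- Hypothetical external table over delimited player files.
CREATE EXTERNAL TABLE dbo.Players
(
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    Age       INT
)
WITH
(
    LOCATION = '/baseball/players/',   -- path on the external system
    DATA_SOURCE = HDFSDataSource,      -- external data source created beforehand
    FILE_FORMAT = CsvFileFormat,       -- external file format created beforehand
    REJECT_TYPE = VALUE,               -- fail the query after an absolute number of bad rows
    REJECT_VALUE = 5                   -- e.g. a header row that fails INT conversion counts as one rejection
);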
Welcome to this month’s T-SQL Tuesday Round-Up! A few weeks ago, I sent out a call for bloggers and must say that I’m utterly blown away by the response. A whopping FORTY bloggers responded last week with contributions for Growing New Speakers! Four – zero! You people are all amazing!!!
There’s a lot to read here. If you’ve ever thought about speaking, give it a try; there are 40 people trying to convince you this month.
But what about connecting remotely? This isn’t going to be much use if we can’t remotely connect!
Actually connecting remotely is the same as connecting to a named instance. You just use the server’s IP address (not the container’s private IP) and the non-default port that we specified when creating the container (remember to allow access to the port in the firewall).
Containers are great, though I do have trouble wrapping my head around containerized databases and have had struggles getting containerized Hadoop to work the way I want.
DBCC OPTIMIZER_WHATIF can be used to pull down your resources or augment them. Often the differences in the execution plans have to do with parallelism and memory grants. This is an example of an execution plan running on an underpowered development machine.
This is a good tool to help figure out what an execution plan probably would look like in production when your test environment is much smaller.
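DBCC OPTIMIZER_WHATIF is undocumented and unsupported, so treat the following as a sketch of the commonly blogged usage rather than official syntax:

-- Pretend, for this session's plan compilation, that the hardware looks like production.
DBCC OPTIMIZER_WHATIF(CPUs, 32);
DBCC OPTIMIZER_WHATIF(MemoryMBs, 262144);   -- roughly 256 GB

-- Generate estimated plans for the queries of interest here, then put things back.
DBCC OPTIMIZER_WHATIF(ResetAll);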
You’re not running as yourself, even though that’s the account you signed into SSMS as.
You’re not running under the server account that SQL or SQL Launchpad run under.
You’re running as a new account created when you installed SQL Server R Services (In-Database) for the purpose of running R code.
John also looks at a couple ways of showing which user is running this code and notes that this solves his file share issue.
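A quick way to confirm this for yourself is to have the R script report the account it runs under; this is a minimal sketch, not John’s exact code.

-- Echo back the OS user name that the external R runtime is executing as.
EXEC sp_execute_external_script
    @language = N'R',
    @script   = N'OutputDataSet <- data.frame(RunningAs = Sys.info()[["user"]]);'
WITH RESULT SETS ((RunningAs NVARCHAR(128)));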
I like using PoSh for some tasks, especially when I don’t have an easy way to do something in SSMS or want to run a task across a variety of instances. In this case, as I glanced through the September updates, I found a good one.
I don’t love the mixed naming, but I’ll get used to it, and I do love the autocomplete in PoSh.
Steve has lots of screenshots walking you through this function.
The SELECT statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second baseman in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.
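If you want to run the same comparison yourself, PolyBase exposes query hints to force the predicate evaluation to either side; the table below is the same hypothetical one sketched earlier.

-- Evaluate the filter on the external (Hadoop) side via a MapReduce job.
SELECT FirstName, LastName
FROM dbo.Players
WHERE LastName = 'Turgeon'
OPTION (FORCE EXTERNALPUSHDOWN);

-- Pull the data back and evaluate the filter inside SQL Server instead.
SELECT FirstName, LastName
FROM dbo.Players
WHERE LastName = 'Turgeon'
OPTION (DISABLE EXTERNALPUSHDOWN);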
Cardinality aggregation is used to count distinct values in a data set. For example, if you want to know the number of IPs used in your system, you can use this aggregation on an IP field and then count the results.
Despite its usefulness, cardinality can also be a touchy Elasticsearch feature to use. Performing a unique count on a field with a multitude of possible values when configuring a visualization, for example, can bring Elasticsearch to a halt.
Most of it comes down to writing good queries. But if you don’t know what good Elasticsearch queries look like, read on.