Press "Enter" to skip to content

Author: Kevin Feasel

New MPP For Big Data

James Serra notes that there will be a Microsoft Professional Program for Big Data:

A few months back, Microsoft started the Microsoft Professional Program for Data Science (note the program name change from Microsoft Professional Degree to Microsoft Professional Program, or MPP).  This is online learning via edX.org as a way to learn the skills and get the hands-on experience that a data science role requires.  You may audit any courses, including the associated hands-on labs, for free.  However, to receive credit towards completing the data science track in the Microsoft Professional Program, you must obtain a verified certificate for a small fee for each of the ten courses you successfully complete in the curriculum.  The course schedule is presented in a suggested order, to guide you as you build your skills, but this order is only a suggestion.  If you prefer, you may take them in a different order.  You may also take them simultaneously or one at a time, so long as each course is completed within its specified session dates.

Look for it sometime next year.

Comments closed

Understanding The Cardinality Estimator

SQL Scotsman is working on a very interesting series on statistics and the different cardinality estimators.  So far, this is a three-part series.  Part one is an overview:

A few of those assumptions changed in the new  SQL Server 2014/2016 CE, namely:

  • Independence becomes Correlation: In absence of existing multi-column statistics, the legacy CE views the distribution of data contained across different columns as uncorrelated with one another. This assumption of independence often does not reflect the reality of a typical SQL Server database schema, where implied correlations do actually exist. The new CE uses an increased correlation assumption for multiple predicates and an exponential back off algorithm to derive cardinality estimates.

  • Simple Join Containment becomes Base Join Containment: Under the legacy CE, the assumption is that non-join predicates are somehow correlated which is called “Simple Containment”. For the new Cardinality Estimator, these non-join predicates are assumed to be independent (called “Base Containment”), and so this can translate into a reduced row estimate for the join. At a high level, the new CE derives the join selectivity from base-table histograms without scaling down using the associated filter predicates. Instead the new CE computes join selectivity using base-table histograms before applying the selectivity of non-join filters.

Part two looks at trace flag 9481:

When To Use Trace Flag 9481

Query Scope:  You’ve moved (migrated/upgraded) to SQL Server 2014 / 2016, your databases are at compatibility level 120 / 130 and using the new CE, your workload is performing well overall but there are a few regressions where a small number of queries actually perform worse.  Use Trace Flag 9481 on a per query basis as a temporary measure until you can tune / rewrite the query so it performs well without the hint.

Part three discusses database scoped configurations in SQL Server 2016:

The problem with lowering the database compatibility level is that you can’t leverage the new engine functionality available under the latest compatibility level.

This problem was solved in SQL Server 2016 with the introduction of Database Scoped Configurations which gives you the ability to make several database-level configuration changes for properties that were previously configured at the instance-level.  In particular, the LEGACY_CARDINALITY_ESTIMATION database scoped configuration allows you to set the cardinality estimation model independent of the database compatibility level. This option allows you to leverage all new functionality provided with compatibility level 130 but still use the legacy CE in the odd chance that the latest CE casuses severe query regressions across your workload.

The article on statistics is quite long for a blog post and a great read.  I’m looking forward to reading more.

Comments closed

Processing Azure Analysis Services

Bill Anton shows how to process an Azure Analysis Services tabular model:

This post contains a list of various methods that can be used to process (i.e. load data into) an Azure AS tabular model. As you will see – not much has changed from the regular on-premise version (which is a very good thing as it softens the learning curve).

Read on if you’re looking at putting an Analysis Services model into Azure.

Comments closed

External Tables To Hadoop

I have a post looking at creating external tables in Polybase to hit a Hadoop folder:

The DATA_SOURCE and DATA_FORMAT options are easy:  pick you external data source and external file format of choice.

The last major section deals with rejection.  We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict checks of structure before inserting records.  The Hadoop mindset is that there are two places in which you can perform data quality checks:  in the original client (pushing data into HDFS) and in any clients reading data from HDFS.  To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table.  For example, let’s say that we have a Age column for each of our players, and that each age is an integer.  If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail.  Polybase rejects this row (removing it from the result set stream) and increments a rejection counter.  What happens next depends upon the reject options.

Creating an external table is pretty easy once you have the foundation prepared.

Comments closed

Growing New Speakers

Andy Yun hosted this month’s T-SQL Tuesday and it was a huge success:

Welcome to this month’s T-SQL Tuesday Round-Up! A few weeks ago, I sent out a call for bloggers and must say that I’m utterly blown away by the response. A whopping FORTY bloggers responded last week with contributions for Growing New Speakers!  Four – zero!  You people are all amazing!!!

There’s a lot to read here.  If you’ve ever thought about speaking, give it a try; there are 40 people trying to convince you this month.

Comments closed

SQL Server In Containers

Andrew Pruski shows how to install Docker on Windows Server 2016 and pull down a SQL Express container:

But what about connecting remotely? This isn’t going to be much use if we can’t remotely connect!

Actually connecting remotely is the same as connecting to a named instance. You just use the server’s IP address (not the containers private IP) and the non-default port that we specified when creating the container (remember to allow access to the port in the firewall).
Easy, eh?

Containers are great, though I do have trouble wrapping my head around containerized databases and have had struggles getting containerized Hadoop to work the way I want.

Comments closed

DBCC OPTIMIZER_WHATIF

Derik Hammer shows how to use DBCC OPTIMIZER_WHATIF to get an idea of how your query would run with different hardware:

DBCC OPTIMIZER_WHATIF can be used to pull down your resources or augment them. Often the differences in the execution plans have to do with parallelism and memory grants. This is an example of an execution plan running on an under powered development machine.

This is a good tool to help figure out what an execution plan probably would look like in production when your test environment is much smaller.

Comments closed

SQL Server R Service Users

John Pertell shows how to figure out which user account is running SQL Server R Services code:

You’re not running as yourself, even though that’s the account you signed into SSMS as.

You’re not running under the server account that SQL or SQL Launchpad run under.

You’re running as a new account created when you installed SQL R Service In Database for the purpose of running R code.

John also looks at a couple ways of showing which user is running this code and notes that this solves his file share issue.

Comments closed

Get-DbaTcpPort

Steve Jones looks at one Powershell function inside dbatools:

I like using PoSh for some tasks, especially when I don’t have an easy way to do something in SSMS or want to run a task across a variety of instances. In this case, as I glanced through the September updates, I found a good one.

Get-DbaTcpPort

I don’t love the mixed naming, and I’ll get used to it, but I do love the autocomplete in PoSh.

Steve has lots of screenshots walking you through this function.

Comments closed

Tar And Polybase

I look at what the deal is with Polybase and Tar files:

The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108).  In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows.  When I search for another second basemen in the set, I get back four rows, exactly as I would have expected.

What’s really interesting is the result I get back from Wireshark when I run a query without pushdown:  it does actually return the row for Casey Turgeon.

This isn’t an ideal scenario, but it did seem to be consistent in my limited testing.

Comments closed