Month: September 2018

If you’ve been anywhere near social media this week you may have seen that Microsoft has announced SQL Server 2019.

I love it when a new version of SQL is released. There’s always a whole new bunch of features (and improvements to existing ones) that I want to check out. What I’m not too keen on however is installing a preview version of SQL Server on my local machine. It’s not going to be there permanently and I don’t want the hassle of having to uninstall it.

This is where containers come into their own. We can run a copy of SQL Server without it touching our local machine.

Click through for the step-by-step.
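
For a rough idea of what "running a copy of SQL Server without it touching our local machine" involves, here is a minimal Python sketch that assembles a `docker run` command. It is not taken from the linked post. The image tag `2019-latest`, the container name, and the sample password are assumptions for illustration; the preview builds available when this was written used different tags.

```python
import shlex


def sql_server_container_command(
    sa_password: str,
    image: str = "mcr.microsoft.com/mssql/server:2019-latest",  # assumed tag
    name: str = "sql2019",  # hypothetical container name
    host_port: int = 1433,
) -> list[str]:
    """Build the argument list for starting SQL Server in a Docker container."""
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--env", "ACCEPT_EULA=Y",                     # accept the license terms
        "--env", f"MSSQL_SA_PASSWORD={sa_password}",  # password for the sa login
        "--publish", f"{host_port}:1433",             # expose the default SQL port
        image,
    ]


command = sql_server_container_command("Str0ng!Passw0rd")  # sample password only
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the container on a machine with Docker installed, and `docker rm --force sql2019` removes it again when you are done with the preview.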

SQL Server 2019 Containers Available

Published 2018-09-26 by Kevin Feasel

The SQL Server team has a getting started post on pulling down the latest CTP in a container, as well as some additional container features:

SQL Server 2019 is now available on Red Hat Enterprise Linux as Red Hat Certified Container Images and as Ubuntu-based container images, enabling you to take advantage of the latest SQL Server engine innovations such as new SQL Graph features, and Data Discovery and Classification. We are also making it possible to adopt SQL Server in containers with existing scenarios such as Replication and Distributed Transactions, which are now part of SQL Server 2019 on Linux.

This makes it easier to get started with SQL Server 2019 without potentially messing up your already-working systems.

Improvements In Table Variable Performance In SQL Server 2019

Published 2018-09-26 by Kevin Feasel

Matthew McGiffen tries out SQL Server 2019 to test a scenario where table variables were giving poor estimates in prior versions:

One of the most popular posts on my blog last year was where I pretty much suggested that people not use table variables:

Think twice before using table variables

This wasn’t new information when I wrote it, but bad performance due to the use of table variables remained such a common anti-pattern that I thought it was worth stressing again.

So, when I saw the above 2019 feature I thought I’d better investigate and update what I’m telling people.

TL;DR It looks like table variables are no longer a problem.

Read the whole thing. This has the potential of changing long-standing advice going back a decade regarding table variables.

PFS Corruption When Moving From SQL Server 2014

Published 2018-09-26 by Kevin Feasel

Paul Randal notes a bug in SQL Server 2014:

I’m seeing reports from a few people of DBCC CHECKDB reporting PFS corruption after an upgrade from SQL Server 2014 to SQL Server 2016 or later. The symptoms are that you run DBCC CHECKDB after the upgrade and get output similar to this:

Msg 8948, Level 16, State 6, Line 5

Database error: Page (3:3863) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.

Msg 8948, Level 16, State 6, Line 5

Database error: Page (3:3864) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.

CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object.

CHECKDB found 2 allocation errors and 0 consistency errors in database 'MyProdDB'.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MyProdDB).

I’ve discussed with the SQL Server team and this is a known bug in SQL Server 2014.

Read on for the fix and additional good advice.
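
To make the "PFS status 0x40 expected 0x60" part of the message more concrete, here is a minimal Python sketch that decodes a PFS status byte and reports which flags differ. The bit layout used here is an assumption based on commonly published descriptions of the PFS byte (low three bits for fullness, then ghost records, IAM page, mixed-extent page, and allocated); it is not stated in the excerpt, and the function names are hypothetical.

```python
# Assumed PFS status bits (bits 0-2 hold a page-fullness code).
PFS_FLAGS = {
    0x08: "has ghost records",
    0x10: "IAM page",
    0x20: "mixed-extent page",
    0x40: "allocated",
}


def decode_pfs_status(status: int) -> dict:
    """Split a PFS status byte into its fullness code and named flags."""
    return {
        "fullness_code": status & 0x07,
        "flags": [name for bit, name in PFS_FLAGS.items() if status & bit],
    }


def differing_flags(actual: int, expected: int) -> list[str]:
    """Name the flags that are set in one status byte but not the other."""
    changed = actual ^ expected
    return [name for bit, name in PFS_FLAGS.items() if changed & bit]


print(decode_pfs_status(0x40))      # the status CHECKDB found
print(decode_pfs_status(0x60))      # the status CHECKDB expected
print(differing_flags(0x40, 0x60))  # ['mixed-extent page']
```

Under that assumed layout, both pages are marked as allocated and the only disagreement is the mixed-extent flag.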

Writing To Elasticsearch With Spark Streaming

Published 2018-09-25 by Kevin Feasel

Anuj Saxena has an example of writing data from a Spark Structured Streaming pipeline out to Elasticsearch:

We have been working on streaming data for a long time, and Apache Spark is a convenient tool for it. Spark provides two APIs for streaming data. One is Spark Streaming, a separate library provided by Spark; the other is Structured Streaming, which is built on the Spark SQL library. We will discuss the trade-offs and differences between these two libraries in another blog. But today we’ll focus on saving streaming data to Elasticsearch using Spark Structured Streaming. Elasticsearch added support for Spark Structured Streaming 2.2.0 onwards in version 6.0.0 of the “Elasticsearch For Apache Hadoop” dependency. We will be using these versions or higher to build our sbt-scala project.

Click through for an example.

Wasting Money With Data Science

Published 2018-09-25 by Kevin Feasel

Giovanni Lanzani has a post with the controversial title above:

Some data is gathered, given to data scientists, and — after two weeks — the first demo takes place. The results are promising, but they need a bit more time.

Fine. After all, the data was messy: they had to clean it up and go back to the source a couple of times.

Two weeks pass and the new results are even nicer. With 70% accuracy, they can predict if a patient will go home after their visit to the emergency room.

This is much better than random (50%)! A full-fledged pilot starts.

They are faced with a couple of challenges to go from model to data product:

How to send the source data to the model is unclear;
Where the model should run;
The hospital operations need to change, as the intake happens with pen and paper;
They realize that without knowing to which department the patient will go, they won’t add any value;
To predict the department, the model needs the diagnosis. But once the diagnosis gets typed in the computer, the patient has reached their destination: the model is useless!

I think it’s a fair point: it’s easy from the standpoint of internal researchers to look for things which they can do, but which don’t have much business value. The risk on the other side is that you’ll start diving into a high-potential-value problem and then realize that the data isn’t there to draw conclusions or that the relationships you expected simply aren’t there.

Be Careful Of P-Hacking

Published 2018-09-25 by Kevin Feasel

Vincent Granville discusses the problem of p-hacking:

I read an article this morning about a top Cornell food researcher having 13 studies retracted, see here. It prompted me to write this blog. It is about data science charlatans and unethical researchers in academia destroying the value of p-values again, using a well-known trick called p-hacking to get published in top journals and get grant money or tenure. The issue is widespread, not just in academic circles, and makes people question the validity of scientific methods. It fuels the fake “theories” of those who have lost faith in science.

The trick consists of repeating an experiment sufficiently many times, until the conclusions fit with your agenda, or of cherry-picking the data you use, or even discarding observations deemed to have a negative impact on conclusions. Sometimes, causation and correlations are mixed up on purpose, or misleading charts are displayed. Sometimes, the author lacks statistical acumen.

Usually, these experiments are not reproducible. Even top journals sometimes accept these articles, due to

Poor peer-review process
Incentives to publish sensational material

Wansink is a charlatan. But beyond p-hacking is Andrew Gelman and Eric Loken’s Garden of Forking Paths. Gelman’s blog, incidentally (example), is where I originally learned about Wansink’s shady behaviors. Gelman also warns us not to focus on the procedural, but instead on a deeper problem.
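
The "repeat the experiment until the conclusions fit" trick is easy to simulate. The sketch below is my own illustration, not code from Granville's post: every simulated researcher flips a fair coin, so there is no real effect to find, and the only thing that varies is how many attempts they allow themselves before reporting a result with p < 0.05. The sample sizes, seed, and function names are arbitrary assumptions.

```python
import math
import random


def two_sided_p_value(heads: int, flips: int) -> float:
    """Normal-approximation p-value for the null hypothesis of a fair coin."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return math.erfc(abs(z) / math.sqrt(2))


def run_experiment(rng: random.Random, flips: int = 100) -> float:
    """Flip a fair coin `flips` times and return the resulting p-value."""
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return two_sided_p_value(heads, flips)


def finds_effect(rng: random.Random, max_attempts: int, alpha: float = 0.05) -> bool:
    """Rerun the experiment up to `max_attempts` times, stopping at the first p < alpha."""
    return any(run_experiment(rng) < alpha for _ in range(max_attempts))


def false_positive_rate(max_attempts: int, researchers: int = 1000, seed: int = 42) -> float:
    """Share of researchers who end up reporting a 'significant' effect that is not there."""
    rng = random.Random(seed)
    hits = sum(finds_effect(rng, max_attempts) for _ in range(researchers))
    return hits / researchers


honest = false_positive_rate(max_attempts=1)
hacked = false_positive_rate(max_attempts=20)
print(f"one attempt:        {honest:.1%} report an effect")
print(f"up to 20 attempts:  {hacked:.1%} report an effect")
```

With a single attempt the false-positive rate stays close to the nominal 5%. With up to 20 attempts, theory says about 1 − 0.95^20 ≈ 64% of researchers would find something to publish, even though every coin is fair.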

Databricks Delta Now Available On Azure

Published 2018-09-25 by Kevin Feasel

Cihan Biyikoglu and Singh Garewal announce the availability of Databricks Delta on Azure Databricks:

Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:

Increased query performance – Able to deliver 10 to 100 times faster performance than Apache Spark(™) on Parquet through the use of key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.

Improved data reliability – By employing ACID (“all or nothing”) transactions, schema validation / enforcement, exactly once semantics, snapshot isolation and support for UPSERTS and DELETES.

Reduced system complexity – Through the unification of batch and streaming in a common pipeline architecture – being able to operate on the same table also means a shorter time from data ingest to query result. Schema evolution provides the ability to infer schema from input data making it easier to deal with changing business needs.

The Azure version of Databricks is quickly reaching parity with the classic AWS-hosted version.

Other Ignite Announcements

Published 2018-09-25 by Kevin Feasel

Denny Cherry gives us a quick roundup of Ignite announcements:

On the Azure Data Platform side of the world, we have the announcement that Azure SQL DB now supports databases up to 100 TB in size using the Hyperscale feature of Azure SQL DB, which you’ll see coming on October 1st, 2018. Hyperscale is an excellent move for customers, as many customers were blocked by the fact that they couldn’t move the database to Azure SQL DB simply because of size; and this limit is going away in just a few short days.

Along with the legacy database platform, we have Managed Instance, which was in Public Preview. It is in preview no more; Managed Instance is being released in General Availability starting on October 1st, 2018. Managed Instance will make migrations to Azure much more accessible for many clients that need support for a SQL Server instance because of features that aren’t available in Azure SQL DB. Managed Instance will bridge this gap, giving customers basically full SQL Server functionality within a PaaS service.

In the Azure SQL DB space, we see new features for optimization of query performance getting released to General Availability. These include three new features: row mode memory grant feedback, approximate query processing, and table variable deferred compilation. With minimal effort, these features can collectively optimize your memory usage and improve overall query performance.

They’re throwing a lot of stuff our way, including a less expensive version of Azure SQL Data Warehouse.

New Use Hint In SQL Server 2017 CU10

Published 2018-09-25 by Kevin Feasel

Pedro Lopes shows us a new use hint introduced in SQL Server 2017 CU10:

In this scenario, you only have this one query that apparently does better in SQL Server 2014 than 2017. That’s all “New CE” – there’s no CE70 vs CE 120+ at issue here. Using any known trace flag, the FORCE_LEGACY_CARDINALITY_ESTIMATION hint or the FORCE_DEFAULT_CARDINALITY_ESTIMATION hint doesn’t help. Rewriting the query is an option, but in the interim, I need a quick fix. How?

In SQL Server 2017 CU10, we have introduced a few new USE HINTs: the QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_n, where n is a supported database compatibility level. This forces the query optimizer behavior at a query level, as if the query was compiled with database compatibility level n. You can refer to sys.dm_exec_valid_use_hints for a list of currently supported values for n.

So to be clear, the new hint is not forcing only a specific CE model, it’s forcing the equivalent of the specific database compatibility level’s query optimizer behavior, including any query optimizer fixes that are enabled by default in that database compatibility level.

Something to keep in mind, though ideally not something you’d want to use regularly.
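
To show the shape of the hint, here is a small Python sketch that appends it to a query string. The helper name, the `dbo.Orders` table, and the hard-coded list of levels are assumptions for illustration; on a real instance, sys.dm_exec_valid_use_hints is the authority on which values of n are accepted.

```python
# Assumed set of levels for illustration; check sys.dm_exec_valid_use_hints
# on your own instance for the values it actually supports.
ASSUMED_LEVELS = (100, 110, 120, 130, 140)


def with_compat_level_hint(query: str, level: int) -> str:
    """Append a QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_n USE HINT to a query.

    This simple helper assumes the query has no OPTION clause of its own.
    """
    if level not in ASSUMED_LEVELS:
        raise ValueError(f"unsupported compatibility level: {level}")
    hint = f"QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_{level}"
    body = query.rstrip().rstrip(";")  # drop any trailing semicolon first
    return f"{body}\nOPTION (USE HINT ('{hint}'));"


# dbo.Orders is a hypothetical table.
hinted = with_compat_level_hint("SELECT COUNT(*) FROM dbo.Orders;", 120)
print(hinted)
```

Here the one query is optimized as if the database were at compatibility level 120 (SQL Server 2014), while everything else in the database keeps the current level.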
