Learning Spark Structured Streaming

Jules Damji has a nice compendium of links and additional resources for people wanting to learn more about Apache Spark’s Structured Streaming:

Structured Streaming In Apache Spark: A new high-level API for streaming

Databricks’ engineers and Apache Spark committers Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin expound on why streaming applications are difficult to write, and how Structured Streaming addresses all the underlying complexities.

There’s quite a bit of reading material on the other side.

Simpson’s Paradox Explained

Mehdi Daoudi, et al, have a nice explanation of Simpson’s Paradox:

E.H. Simpson first described the phenomenon of Simpson’s paradox in 1951. The actual name “Simpson’s paradox” was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.

The authors also provide a helpful example with operational metrics, showing how aggregating the data leads to an opposite (and invalid) conclusion.

Gradient Boosting In R

Anish Sing Walia walks us through a gradient boosting exercise using R:

An important thing to remember in boosting is that the base learner which is being boosted should not be a complex and complicated learner which has high variance for e.g a neural network with lots of nodes and high weight values.For such learners boosting will have inverse effects.

So I will explain Boosting with respect to decision trees in this tutorial because they can be regarded as weak learners most of the times.We will generate a gradient boosting model.

Click through for more details.  H/T R-Bloggers

Columnstore Predicate Pushdown Tipping Point

Lonny Niederstadt is hot on the trail, looking for evidence of a tipping point for COUNT(*) aggregates performing pushdown against a clustered columnstore index:

Below is what we want post-execution plans to look like when counting rows in a range – the thin arrow coming out of the Columnstore scan is a hint that predicate pushdown was successful.  I didn’t specify MAXDOP in a query hint, and Resource Governor isn’t supplying MAXDOP; MAXDOP 8 is coming from the Database Scoped Configuration.  The degree of parallelism turns out to be a significant factor in determining the tipping point.  The [key] column is a BigInt.  Maybe its surprising that I’m using 27213.9 as the upper bound.  But… check out the estimated number of rows 🙂 Again – this estimate is coming from the Legacy CE, specified in the database scoped configuration for my database.

Interesting findings, although it looks like the specific values are going to be more settings-dependent than the usual finding of this nature.

On Global Temp Tables

Denis Gobo riffs on global temp tables:

SQL Azure has added something called database scoped global temporary tables.

Azure SQL Database supports global temporary tables that are also stored in tempdb and scoped to the database level. This means that global temporary tables are shared for all users’ sessions within the same Azure SQL database. User sessions from other Azure SQL databases cannot access global temporary tables.

They way you add these is by using a double pound sign

Click through for more.  I will occasionally use a global temp table, mostly in conjunction with Central Management Server queries going over a set of databases on each instance, but I limit my usage to manual operations and nothing as part of a normal application or administrative process.

Columnstore Segment Alignment

Niko Neugebauer walks us through a new segment alignment detection function he has written:

There are 3 important factors that I use for the determination of the column that is really good for the Segment Elimination and hence the Segment Alignment:
– The support for the Segment Elimination (and then for the Predicate Pushdown). If the data type does not support Segment Elimination, than why would someone optimise for it ?
– The frequency with which the column is used in the predicates (not in the joins, because this is where generally the Segment Elimination/Predicate Pushdown does not function)
– The number of the distinct values within a table/partition (if we have more Segments than distinct values, it is not a very good sign generally: example – 10 million rows with 5 distinct values)

Read on for more details.

Waits With Outsized Importance

Brent Ozar has a few wait types whose appearance in your “uh-oh” list is disproportionate to the wait type’s relative percentage:

RESOURCE_SEMAPHORE_QUERY_COMPILE – this means a query came in, and SQL Server didn’t have an execution plan cached for it. In order to build an execution plan, SQL Server needs a little memory – not a lot, just a little – but that memory wasn’t available. SQL Server had to wait for memory to become available before it could even build an execution plan. For more details and a reproduction script, check out my Bad Idea Jeans: Dynamically Generating Ugly Queries post. In this scenario, cached query plans (and small ones) may be able to proceed just fine (depending on how much pressure the server is under), but the ugly ones will feel frozen.

Read on for a couple more examples of poison wait types.

In-Memory OLTP In SQL Server 2017

Ned Otter walks us through all of the changes to memory-optimized objects coming in SQL Server 2017:

  • Up to and including SQL 2016, the maximum number of nonclustered indexes on a memory-optimized table was eight, but that limitation has been removed for SQL 2017. I’ve tested this with almost 300 indexes, and it worked. With this many supported indexes, it’s no wonder they had to….

  • Enhance the index rebuild performance for nonclustered indexes during database recovery. I confirmed with Microsoft that the database does not have be in SQL 2017 compatibility mode (140) to benefit from the index rebuild enhancement. This type of rebuild happens not only for database restore and failover, but also for other “recovery events” – see my blog post here.

Read on for several more improvements, as well as a couple of things which Ned would like to see but aren’t there yet.

When CHECKDB Snapshots Run Out Of Disk Space

Andy Galbraith walks through an error message in DBCC CHECKDB when the snapshot runs out of disk space:

Looking in the SQL Error Log there were hundreds of these combinations in the minutes immediately preceding the job failure:

The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x000048a123e000 in file ‘E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17‘. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Error: 17053, Severity: 16, State: 1.

E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17: Operating system error 665(The requested operation could not be completed due to a file system limitation) encountered.

Read on for more information, including a rough idea of how much space the snapshot requires as well as a few workarounds and hints.


August 2017
« Jul Sep »