Stream Computing Platform

Ravi Peri shows how to set up the Stream Computing Platform for .NET (SCP.Net) library and kick off a job:

SCP.Net generates a zip file consisting of the topology DLLs and dependency jars.

It uses Java (if found in the PATH) or .net to generate the zip. Unfortunately, zip files generated with .net are not compatible with Linux clusters.

If you’re interesting in working with a Storm topology while writing .NET code, check this out.

Loading Columnstore Data

I have a post on an issue I had with loading columnstore data:

In this design, I have data coming from the transactional system, undergoing some ETL processing, and going into a staging table on the warehouse.  From there, I perform the remainder of the ETL work and insert into a rowstore table.  This rowstore table has the same attribute names and data types as the columnstore table, but instead of having a clustered columnstore index, it has a standard B-tree index and can have additional non-clustered indexes.  From there, I expose the combination table using a view which simply unions the two sets of data so the application doesn’t have to see rowstore versus columnstore tables.

If you’re doing nightly insertions, results tend to be a lot better.  But if you need close-to-but-not-quite real-time data, there are still ways to solve the problem.

Thinking About The Data Lake

Ust Oldfield gives architectural hints on Azure Data Lake Store:

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

This is how you get data swamps (a term which I’m so happy is catching on).  Read the whole thing.

Creating Plan Guides

Brent Ozar has a couple examples of creating plan guides to work around bad queries using OPTIMIZE FOR UNKNOWN:

You can see the actual execution plan here, and hover your mouse over various parts of it to see the estimated vs actual rows.

The OPTIMIZE FOR UNKNOWN hint tells SQL Server to use the density vector rather than column statistics, so it only estimates that 1,865 rows will come back – when in actuality, 3.3mm rows come back. In performance tuning, that’s what we call a “bad thing,” since SQL Server ends up doing around 10mm page reads due to that key lookup. It would have been much more efficient to just do a clustered index scan.

Creating plan guides can be ugly business, but sometimes they’re the best solution.

Upgrade To 2016?

Kendra Little explains why upgrading to 2016 is a good idea:

I know most of y’all care about Standard Edition.

With SP1, you can use all sorts of (formerly) super-expensive features like data compression, partitioning, Columnstore, Change Data Capture, Polybase, and more in Standard Edition.

A few features have scalability caps in “lower” editions. There’s a memory limit for In-Memory OLTP (aka Hekaton), and Columnstore. Columnstore also has restrictions on parallelism.

I’ll also throw out my experience as an anecdote:  2016 has been more stable than 2014 in our environment.  I don’t regret going to 2014, but we’re better off on 2016.

Power BI Admin Portal

Melissa Coates looks at the Power BI tenant settings in the admin portal:

Keep in mind that these selections apply to all users across the entire tenant. At this time we can’t control them by groups or anything of that nature.

In addition to the above settings for controlling user experience, the Admin Portal is also the place for viewing usage metrics which are helpful for determining who runs what how often (it’s not everything we could possibly want to know, but it’s a good start). The other two options, manage users and audit logs, redirect you over to the Office 365 Admin Center.

Another week, another few dozen Power BI additions…

Delayed Durability

Chris Taylor gives a use case for Delayed Durability:

During development and initial testing on our own hardware, we had the migration at the time running at ~25minutes for around 600 packages (ie. tables) covering (what we termed) RawSource–>Source–>Staging which was well within the performance requirements for the stage that development was at and for what was initially set out. The rest of this blog post will hone in specifically on Source–>Staging only.

However, once we transferred the solution to the clients development environment things took a turn for the worse. In our environment we were running VMs with 8 cores, 16GB RAM and utlising SSDs. The client environment was running SQL Server 2016 Enterprise on VMWare vSphere 5.5, 8 vCPUs, 32GB RAM (for Integration, Development was half this) but the infrastructure team have done everything in their power to force all VMs onto the lower tier (ie. slow disks) of their 3-PAR SAN and throttle them in every way possible, just to make things more of a challenge. Even though the VM’s themselves were throttled we were confident that we wouldn’t see too much of a performance impact, especially as this was only a subset of the processing to be done so we needed it to be quick and it will only ever get longer and longer.

Chris walks through the hallmarks of when Delayed Durability might work, and the big one for me is the way data migration works:  full reloads.  The important thing is to have a durable source and a process to repeat data loads when things get missed; in this case, it’s a full reload, but in other cases it could be watchdog applications which compare data sets on each side.

Query Folding In Power Query

Reza Rad discusses the performance implications of query folding in M:

I can’t start talking about the issue without explaining what Query Folding is, so let’s start with that. Query Folding means translating Power Query (M) transformations into native query language of the data source (for example T-SQL). In other words; when you run Power Query script on top of a SQL Server database, query folding will translate the M script into T-SQL statements, and fetch the final results.

Click through for more details.  The advice here sounds pretty similar to what we get for optimizing Integration Services:  push as much of the heavy lifting onto well-optimized source queries as possible, particularly when it comes to filtering out rows.


November 2016
« Oct Dec »