Press "Enter" to skip to content

Month: June 2018

Building Observability Tools At Scale

Kevin Lew and Sangeeta Narayanan give us some lessons learned from building logging and monitoring solutions at Netflix:

We started our tooling efforts with providing visibility into device and server logs, so that our users can go to one tool instead of having to use separate data-specific tools or logging into servers. Providing visibility into logs is valuable because log messages include important contextual information, especially when errors occur.

However, at some point in our business growth, storing device and server logs didn’t scale because the increasing volume of log data caused our storage cost to balloon and query times to increase. Besides reducing our storage retention time period, we addressed scalability by implementing a real-time stream processing platform called Mantis. Instead of saving all logs to persistent storage, Mantis enables our users to stream logs into memory, and keep only those logs that match SQL-like query criteria. Users also have the choice to transform and save matching logs to persistent storage. A query that retrieves a sample of playback start events for the Apple iPad is shown in the following screenshot:

It’s an interesting post.

Comments closed

JSON Output And SSIS

Stacia Varga works around an oddity in the way SSIS reads JSON outputs:

What happened? The T-SQL produces the correct results in SQL Server Management Studio (SSMS). However, in SSIS, the same T-SQL statement in an OLE DB Source in a Data Flow Task produces two rows of data which adds a line feed into the flat file and renders the JSON unusable.

The problem is visible even before sending output to the flat file.

Click the link to see how Stacia solves this problem.

Comments closed

Permissions Error Executing R Scripts

Niels Berglund walks through a permissions error on a new installation of SQL Server 2017 CU 7 with Machine Learning Services:

Cool, all is “A-OK”! A couple of days go by, and I see that there is a Cumulative Update (CU) for SQL Server 2017 – CU7. I install it and does not think much about it. I mean: “what can go wrong, how hard can it be?”. A couple of days later and I am busy writing the follow-up post to sp_execute_external_script and SQL Compute Context – I when I try to execute sp_execute_external_script, and it falls over!

Niels has a couple false starts that he walks us through, but then lands on a solid answer.

Comments closed

Trickle Insertion With Clustered Columnstore Indexes

Sunil Agarwal provides a pattern for trickle loading clustered columnstore indexes:

A traditional scenario of loading data into CCI is a nightly load from one or more data files containing millions of rows. Recommended technique is to load the data with batchsize >= 102400 as explained https://blogs.msdn.microsoft.com/sqlserverstorageengine/2014/07/27/clustered-column-store-index-bulk-loading-the-data/. However, we are seeing many scenarios where data source is parallel data stream (i.e. trickle insert) to be loaded to CCI for analytics, a typical IOT scenario. CCI allows concurrent data streams into the same delta rowgroup. However, you will see higher page latch contention as you increase the concurrency. You may wonder why this is so? Well, each delta RG is organized internally as a clustered btree index and the dataload follows the pattern of monotonically increasing clustered index key causing latch contention on the last page.

Check out Sunil’s post.  I also have an older post covering my experiences with CCI trickle loads and three ETL patterns which can work.

Comments closed

DAX: Caching And IF Statement Optimization

Marco Russo shows us a way of improving performance on conditional statements:

Consider the following measure.

1
2
3
4
5
Margin :=
IF (
    [Sales Amount] > 0 && [Total Cost] > 0,
    [Sales Amount] - [Total Cost]
)

The basic idea is that the difference between Sales Amount and Total Cost should be evaluated only whether both measures are greater than zero. In such a condition, the DAX engine produces a query plan that evaluates each measure twice. This is visible in the storage engine requests generated for the following query.

Read on to see how Marco avoids this performance issue.

Comments closed

Analytics Platform System V7 Released

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

This is useful for the six customers which can afford the licensing for APS.

Comments closed

Getting Wait Info From Extended Events

Grant Fritchey shows how to get wait information for particular sessions from the system_health extended event:

On your servers, any of them that are SQL Server 2008 or newer, right now, unless you’ve performed actions to prevent this, you’re running the Extended Events system_health session. It’s just happening, currently, on all your servers. Nothing you need to do about it at all. I’ll be a lot of you never even knew it was there.

If you follow the link you can see all the various types of information being gathered by the Extended Event system_health session. I won’t detail all of it here. Let me just provide a little context around how the session works. First and foremost, similar to the error log, this session consists of four files, each 5mb in size, rolling over as they get filled. For systems with a very high degree of activity, that means the information here may only be hours old. However, for most of us, this provides days, if not weeks worth of information about the behavior of your system.

The system_health extended event misses a lot of stuff, but it’s quite useful when you don’t have a purpose-built monitoring solution in place.

Comments closed

rqdatatable — Wrangling Lots Of Data, Fast

John Mount explains the motivation behind rqdatatable and puts together a performance test:

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.tableitself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

Click through for the benchmark and information on how to grab the package before it goes into CRAN.

Comments closed

Python And The Tidyverse

Leo at Locke Data looks at a couple Python packages which implement Tidyverse concepts:

The Dplython README provides some clear examples of how the package can be used. Below is an summary of the common functions:

  • select() – used to get specific columns of the data-frame.

  • sift() – used to filter out rows based on the value of a variable in that row.

  • sample_n() and sample_frac() – used to provide a random sample of rows from the data-frame.

  • arrange() – used to sort results.

  • mutate() – used to create new columns based on existing columns.

I think the Tidyverse is immediately accessible for data platform professionals, so it’s good to see these concepts making their way to Python as well as R.

Comments closed

Forwarded Records Without User Table Heaps

Erik Darling unravels a conundrum:

When people think about Heaps and the problems they can cause, they don’t often think of temp tables and table variables as Heaps.

Of course, without a clustered index, any table is a Heap.

This isn’t an argument for or against indexing temp tables, but while working with a client we came across something strange!

sp_BlitzFirst was reporting hundreds of thousands of Forwarded Records, but there were no Heaps in user databases.

When we dug in closer to what queries were doing, we found lots of places where temp tables had an insert/update pattern.

Click through for a demonstration and an explanation of why this can be trouble.

Comments closed