How The New York Times Uses Apache Kafka

Boerge Svingen gives us an architectural overview of how the New York Times uses Apache Kafka to link different services together:

These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.

On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.

This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache KafkaTM. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API is used to feed published content in real-time to the various applications and systems that make it available to our readers.  The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.

This is a nice write-up of a real-world use case for Kafka.

Nested Resampling In R

Max Kuhn describes how nested resampling works:

A common method for tuning models is grid search where a candidate set of tuning parameters is created. The full set of models for every combination of the tuning parameter grid and the resamples is created. Each time, the assessment data are used to measure performance and the average value is determined for each tuning parameter.

The potential problem is, once we pick the tuning parameter associated with the best performance, this value is usually quoted as the performance of the model. There is serious potential for optimization bias since we uses the same data to tune the model and quote performance. This can result in an optimistic estimate of performance.

Nested resampling does an additional layer of resampling that separates the tuning activities from the process used to estimate the efficacy of the model. An outer resampling scheme is used and, for every split in the outer resample, another full set of resampling splits are created on the original analysis set. For example, if 10-fold cross-validation is used on the outside and 5-fold cross-validation on the inside, a total of 500 models will be fit. The parameter tuning will be conducted 10 times and the best parameters are determined from the average of the 5 assessment sets.

Definitely worth the read.  H/T R-Bloggers

Sentiment Analysis In Power BI

Chris Webb has a new Power BI custom data connector:

I’m pleased to announce that I’ve published my first Power BI custom data connector on GitHub here:

Basically, it acts as a wrapper for the Microsoft Cognitive Services Text Analytics API and  makes it extremely easy to do language detection, sentiment analysis and to extract key phrases from text when you are loading data into Power BI.

Read the whole thing, as Chris has a great demo of it.

Integrating Active Directory: Local And Azure

Shannon Lowder sets up an on-prem Active Directory domain and links it to Azure Active Directory:

You’ll need to plan out your domain before you begin.  In my case, I already had my network configured to use 192.168.254.x. My Fiber router serves as my default gateway as well as my DHCP server and primary DNS server for my local network. My wireless access points, primary workstation, and printer are already set up for static IP addresses.  I have already set aside a subnet of addresses for static servers.  I also already own a domain name (  Having all this set up before trying to install my domain controller help by saving time.

Shannon glosses over the local AD part, but once that’s set up, shows how to tie it in with Azure Active Directory.

Will SQL Server On Linux Take Off?

Brent Ozar has his doubts about how popular SQL Server on Linux will be:

SQL Server 2017 runs on Linux, and the similarities between that and Windows Core are eerie:

  • Both present big stumbling blocks for traditional Windows DBAs
  • Both work mostly the same, but not exactly, as you can see in the SQL Server on Linux release notes
  • Both solved perceived problems for sysadmins
  • Neither solved a problem for database administrators

So why will you hear so much more about Linux support? Because this time around, it also solves a sales problem for Microsoft. Somebody, somewhere, has a spreadsheet showing that there will be a return on investment if they spend the development, marketing, and support resources necessary. (And I bet they’re right – if you compare this feature’s ROI against, say Hekaton or Polybase, surely Linux is going to produce a lot more new licenses sold.)

He does make some good points (though seriously, Polybase is awesome), but I think SQL Server on Linux is going to be quite a bit more popular for a couple of reasons.  First is core-based licensing in Windows Server:  that’s another big price increase that you get when upgrading to Server 2016, and at the margin, companies with a mixed OS setup will be more likely to move to Linux.  Second, Brent’s focus in the post is on current installations—that is, taking your Windows SQL Server instance and moving it to Linux.  As Koen Verbeeck mentions in the first comment, there’s a whole different market:  companies whose infrastructure is entirely Linux and are currently using MySQL, Oracle, or Postgres for their relational databases.  It’ll probably take a couple of years to get market penetration—especially because of the old guard Linux admin types who remember the Ballmer years with appropriate disdain—but this is a new market for Microsoft and they’ve already got a product which meets (or exceeds, depending upon your biases) the top competition.

How Functions Affect Data Retrieval Performance

Daniel Janik shows one of the many pain points around user-defined functions in SQL Server:

Note that Query 1 uses a function around the column and this causes a scan which increases IO and CPU utilization. This is because every value in the column for the whole table must have the hyphen removed to see if it is a match.

In Query 2, we see a seek. This is because the value is modified instead of the column.

Click through for a few examples.

Building An API To Read An API

Jesse Seymour shows how to build a WebAPI project to retrieve JSON data from another API:

In this file, our goal is to create a class library that connects to an API, authenticates, retrieves JSON formatted data, and deserializes to output for use in a SSIS package.  In this particular solution, I created a separate DLL for the class library which will require me to register it in the global assembly cache on the ETL server.  If your environment doesn’t allow for this, you can still use some of the code snippets here to work with JSON data.

Our order of operations will be to do the following tasks:  Create a web request, attach authentication headers to it, retrieve the serialized JSON data, and deserialize it into an object.  I use model-view-controller (MVC) architecture to organize my code, minus the views because I am not presenting the data to a user interface.

Read on for a depiction and all of the project code.  Building a separate WebAPI project to retrieve this data is usually a good move, as you gain a lot of flexibility:  you can run it on cheaper hardware, schedule data refreshes, send the data out to different locations, and so on.

Set A Fill Factor

Monica Rathbun wants you to set a better fill factor than the default:

Please, please, please Admins do not leave your default index fill factor at 0. This means you are telling SQL Server to fill the page 100% full when creating indexes. This also means you are forcing it to a new page when additional inserts are done. These are called PAGE SPLITS which can take time to perform and is a resource intensive operation. Having a high fill factor will cause more index fragmentation, decrease performance and increase IO.

If you find that this is how your system is configured, all is not lost. You can correct this by changing the default value so that new indexes will be created with proper factor and rebuilding your existing indexes with another fill factor value. I like to use 80 across the board for most, of course there is always the “it depends” scenario that arises but 80 is a pretty safe bet. One of those “it depends” would be on logging table that has the correct clustering key and never gets updates in between values (make sense?), I don’t want a fill factor of 80.  I’d want 0/100 to maximize page density as page splits wouldn’t occur if the clustered key is monotonically increasing.

Monica also has a couple scripts, one for changing the across-the-board default and one for changing a particular index.

Retrieving Disk Block Size With Powershell

Naveen Kumar shows how to use Powershell to find out the block size of a particular disk in Windows:

Do you need to worry about disk block size?
I would suggest you to read below articles for getting better understanding on this topic

Disk Partition Alignment Best Practices for SQL Server

Post discussion, the next question was how do we check the disk block size for a given server?
You can do it from command line using FSutil utility. But let’s do it with PowerShell.

Read on for the code.


September 2017
« Aug