Learn Machine Learning In Just 7 Years

Rwiddhi Chakraborty explains that machine learning isn’t a topic you pick up overnight:

Of course you could write a Hello World program in C++ in 24 hours, or a program to find the area of a circle in 24 hours, but that’s not the point. Do you grasp object-oriented programming as a paradigm? Do you understand the use cases of namespaces and templates? Do you know your way around the famed STL? If you do, you certainly didn’t learn all this in a week, or even a month. It took you a considerable amount of time. And the more you learned, the more you realised that the abyss is deeper than it looks from the cliff.

I’ve found a similar situation in the current atmosphere surrounding Machine Learning, Deep Learning, and Artificial Intelligence as a whole. Feeding the hype, thousands of blogs, articles, and courses have popped up everywhere. Thousands of them have the same kind of headlines: “Machine Learning in 7 lines of code”, “Machine Learning in 10 days”, etc. This has, in turn, led people on Quora to ask questions like “How do I learn Machine Learning in 30 days?”. The short answer is, “You can’t. No one can. And no expert (or even one comfortable with its ins and outs) did.”

This is a good antidote to the “I read a blog post and now I’m an expert” mentality, which is particularly pernicious.

The Theory Behind ARIMA

Bidyut Ghosh explains how the ARIMA forecasting method works:

The earlier models of time series are based on the assumption that the time series variable is stationary (at least in the weak sense).

But in practice, most time series variables will be non-stationary in nature, and they are integrated series.

This implies that you need to take either the first or second difference of the non-stationary time series to convert it into a stationary one.

Bidyut ends with a little bit of implementation in R, but I’d guess that’ll be the focus of part 2.
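
As a quick illustration of the differencing step described in the excerpt, here is a minimal sketch in base R. It is not Bidyut’s code: the `steps` values and the variable names are made up for the example, and the series is a hand-built integrated series (a cumulative sum of small steps) rather than real data.

```r
# Hypothetical data: small step changes that we accumulate into a
# wandering (integrated) series, the kind ARIMA's "I" is meant for.
steps <- c(0.5, -0.2, 0.9, 0.1, -0.4, 0.7, 0.3, -0.1)
y <- cumsum(steps)               # non-stationary level series

# First difference: y[t] - y[t-1]. This recovers the underlying steps
# (minus the first observation, which has no predecessor).
d1 <- diff(y)

# Second difference: the difference of the first difference.
d2 <- diff(y, differences = 2)

print(d1)
print(d2)
```

Each round of differencing shortens the series by one observation, which is worth remembering when you line the results back up with the original dates.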

Slicing In R

Kevin Feasel

John Mount recommends learning about the array slicing system in R:

R has a very powerful array slicing ability that allows for some very slick data processing.

Suppose we have a data.frame “d”, and for every row where d$n_observations < 5 we wish to “NA-out” some other columns (mark them as not yet reliably available). Using slicing techniques, this can be done quite quickly as follows.

d[d$n_observations < 5, qc(mean_cost, mean_revenue, mean_duration)] <- NA

Read on for more.  In general, I prefer the pipeline mechanics offered with the Tidyverse for readability.  But this is a good example of why you should know both styles.
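
To try the slicing idiom without any extra packages, here is a small self-contained sketch. The data frame contents are invented for illustration. Because `qc()` in John’s example comes from the wrapr package, this version assumes plain `c()` with quoted column names as a base R stand-in.

```r
# Hypothetical sample data; the column names follow the excerpt.
d <- data.frame(
  n_observations = c(2, 10, 4, 7),
  mean_cost      = c(1.5, 2.0, 3.5, 4.0),
  mean_revenue   = c(10, 20, 30, 40),
  mean_duration  = c(5, 6, 7, 8)
)

# Rows are selected by a logical vector, columns by name;
# every selected cell is overwritten with NA in one assignment.
d[d$n_observations < 5, c("mean_cost", "mean_revenue", "mean_duration")] <- NA

print(d)
```

Rows with fewer than five observations keep their `n_observations` value but lose the three summary columns, while the other rows are untouched.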

Cloud Savings And TCO

James Serra argues that moving to the cloud can be a net savings on cost:

I often tell clients that if you have your own on-premise data center, you are in the air conditioning business.  Wouldn’t you rather focus all your efforts on analyzing data?  You could also try to “save money” by doing your own accounting, but wouldn’t it make more sense to off-load that to an accounting company?  Why not also off-load the costly, up-front investment of hardware, software, and other infrastructure, and the costs of maintaining, updating, and securing an on-premises system?

And when dealing with my favorite topic, data warehousing, a conventional on-premise data warehouse can cost millions of dollars in the following: licensing fees, hardware, and services; the time and expertise required to set up, manage, deploy, and tune the warehouse; and the costs to secure and back up the data.  All items that a cloud solution eliminates or greatly minimizes.

When estimating hardware costs for a data warehouse, consider the costs of servers, additional storage devices, firewalls, networking switches, data center space to house the hardware, a high-speed network (with redundancy) to access the data, and the power and redundant power supplies needed to keep the system up and running.  If your warehouse is mission critical then you need to also add the costs to configure a disaster recovery site, effectively doubling the cost.

I don’t think this story plays quite as well.  For small and mid-sized companies, yes, the cloud is often a net savings.  For companies whose products were designed to be cloud-first and take advantage of burstiness and spot markets, yes, you can drive cost savings that way.  But for most mid-to-large companies, I think the calculus shifts to where sometimes cloud options work better but often they don’t.  Need a few hundred SQL Server instances with microsecond-level latency running SQL Server Enterprise Edition 24/7?  That’s not going to be cheaper.

Something’s Missing: Head Operators In Extended Event-Based Execution Plans

Grant Fritchey notices something odd about execution plans grabbed from an Extended Events session:

Notice anything missing? Yeah, the first operator, the SELECT operator (technically, not really an operator, but they don’t have any name or designation in the official documentation, so I’m calling them operators). It’s not there. Why do I care?

Because it’s where all the information about the plan itself is stored. Stuff like, Cached Plan Size, Compile Time, Optimizer Statistics Usage, Reason for Early Termination, is all there, properties and details about the plan itself. Now, the weird thing is, if you look to the XML, as shown here, all that data is available:

Read on for Grant’s best guess as to the root cause of the problem.

When Table Join Order Oughtn’t Matter…But It Sometimes Does

Bert Wagner looks at join order in SQL Server:

SQL is a declarative language: you write code that specifies *what* data to get, not *how* to get it.

Basically, the SQL Server query optimizer takes your SQL query and decides on its own how it thinks it should get the data.

It does this by using precalculated statistics on your table sizes and data contents in order to be able to pick a “good enough” plan quickly.

I like this post.  It also lets me push one of my favorite old-time performance tuning books, SQL Tuning by Dan Tow.  95+ percent of the time, you don’t need to think about join order.  But when you do, you want to have a systematic method of figuring the ideal join order out.

SQL Server And STIGs

Mohammad Darab has a quick summary of the Department of Defense’s STIG overview for SQL Server 2016:

To make it easier for people in charge of “STIG’ing” their SQL Server 2016 environment, this blog is aimed to go over the newest MS SQL Server 2016 STIG Overview document (Version 1, Release 1) that was released on 09 March 2018. If you want to read through the whole document you can download it here. Otherwise, below is my summation of the relevant sections.

This overview document was developed by both Microsoft and DISA for the Department of Defense.

The entire overview document is 9 pages (including title page, etc.)

Click through for Mohammad’s summary.  Also check out Chris Bell’s sp_woxcompliant.

Auto Soft-NUMA And Scheduler Waits

Joe Obbish walks us through a scenario with automatic soft-NUMA leading to poor performance:

Consider a server with soft-NUMA nodes of 8 schedulers with MAXDOP 8. The first parallel query will be sent to NUMA node 0. The number of active workers matches the number of schedulers exactly so each active worker is assigned to a different scheduler in the NUMA node. The second parallel query will be sent to NUMA node 1. The third parallel query will be sent to NUMA node 2, and so on. Execution of serial queries or creation of sessions does not matter. That advances a counter that’s separate from the “global enumerator” used for parallel query scheduler placement. As far as I can tell the scheduler assigned to execution context 0 does not affect the scheduling of the parallel worker threads, although it can certainly affect parallel query performance.

The scenario described above doesn’t sound so bad. It can work well if the parallel queries take roughly about the same amount of time to complete and query MAXDOP matches the number of schedulers per soft-NUMA node. Problems can emerge when at least one of those is not true. With the spread selection type it’s possible that the amount of work already assigned to schedulers has no effect on parallel query scheduler placement. Let that sink in. You could have 100 serial queries all assigned to schedulers in NUMA node 0 but SQL Server may send a parallel query to that NUMA node. It depends on the position of the “global enumerator” as opposed to current work on the server.

Joe offers up some alternatives if you find yourself dealing with this issue.  Definitely a must-read.
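
To make the placement behavior in the excerpt concrete, here is a toy model in R. It is only a sketch of the round-robin idea Joe describes, not SQL Server’s actual scheduler, and every name in it (`place_parallel_query`, `serial_load`, the node count of 4) is a hypothetical chosen for the example.

```r
# Toy model only: NOT SQL Server's real scheduling code.
n_nodes     <- 4                 # assumed number of soft-NUMA nodes
serial_load <- c(100, 0, 0, 0)   # serial queries already running per node
enumerator  <- 0                 # position of the "global enumerator" (0-based)

# Placement looks only at the enumerator; existing load is never consulted.
place_parallel_query <- function(enumerator, n_nodes) {
  list(node = enumerator %% n_nodes, enumerator = enumerator + 1)
}

placements <- numeric(0)
for (i in 1:5) {
  p <- place_parallel_query(enumerator, n_nodes)
  placements <- c(placements, p$node)
  enumerator <- p$enumerator
}

print(placements)   # node chosen for each parallel query, in arrival order
```

In this model the first and fifth parallel queries both land on node 0, even though that node already carries all 100 serial queries, which is the kind of imbalance the excerpt warns about.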


April 2018