Housing Prices In Ames, Iowa: A Kaggle Competition

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences.  First, Kathryn, et al:

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but  we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing than anything else.

Then, Aaron, et al, share their process and findings:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

This is the major upside to Kaggle:  it gives you the ability to work in a controlled environment with real data sets, which include real data problems.  Yeah, the data’s much cleaner than you’d experience in production pretty much anywhere, but that lets you practice technique with a relatively low barrier to entry.  H/T R-Bloggers (Kathryn | Aaron)

Data Wrangling At Scale

Kevin Feasel

2017-11-21

R, Spark

John Mount has a short article showing off the cdata package:

Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”. With the cdata package this transform is easy and performant, as we show below.

Read the whole thing.

Docker And R

Mara Averick has some resources to help you get started with running R in a Docker container:

liftr 📦 by Nan Xiao

liftr aims to solve the problem of persistent reproducible reporting. To achieve this goal, it extends the R Markdown metadata format, and uses Docker to containerize and render R Markdown documents.

Click through for those resources as well as an addictive 8-bit animated GIF.

Binder: Hosting Jupyter Notebooks

Julia Evans points out a really interesting service:

Binder lets you easily host interactive Jupyter notebooks and let anyone on the internet use them interactively immediately! It uses JupyterHub under the hood.

If you want to try it out, you can do that right now:

  1. Go to https://mybinder.org/v2/gh/jvns/pandas-cookbook/master (which will launch the github.com/jvns/pandas-cookbook repository)
  2. Wait for it to build and click ‘launch’
  3. click ‘cookbook’, click a notebook, and play around! There’s an “A quick tour of the IPython Notebook” notebook which shows off some of the basic features.

It apparently uses Kubernetes + Docker under the hood which is interesting! It must be much much more expensive to run than the read-only services, but it’s such a useful and cool thing! I hope it continues to exist.

Definitely worth checking out.  I’m going to have to see the steps for getting an R runtime so I can post some of my own notebook repos.

Three Sessions And A Funeral

Solomon Rutzky explains what happens to sessions after they see the light at the end of the tunnel:

Sessions, in SQL Server, are born when a Connection is made from a client library to SQL Server. Temporary objects – Tables and/or Stored Procedures (yes, these are a thing) – may be created during a Session’s lifetime. The question is: for those temporary objects that are not explicitly dropped, what exactly happens to them? It is commonly known that they magically (ok fine, “automagically” — ok, ok, FINE, “automatically”) get dropped. But when do they get dropped? When the Session ends, right? And the Session ends when the Connection is closed, right? Well, that is certainly the common / conventional wisdom, at least. But is that understanding of the nature of Sessions and temporary objects correct?

It’s a more complicated topic than you might get from first appearances.

When Join Order Matters

Bert Wagner takes a look at one of the lesser appreciated tricks in performance tuning:

I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

…I’ve been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes. And no.

One of my favorite query tuning books is SQL Tuning by Dan Tow.  Parts of it are rather dated at this point—like pretty much anything involving a rule-based optimizer—but the gist still works well.  What it comes down to is finding the best single table from which to drive your query (based on table size, filters, etc.) and selecting the appropriate join order afterward.  It’s fairly time-consuming effort, but for the 0.5-1% of queries which really need it, it can be the difference between an awful plan and a good one.

Myths With Page Life Expectancy

Kevin Hill has a public service announcement:

I’ve been hearing throughout my entire DBA career that 300 seconds is a good counter for PLE (Page Life Expectancy) to be above.

Paul Randal calls this “utter nonsense“.  If anyone would have the right to say that, its Paul.

It is good for the number to be above 300, but that’s like saying it’s good to have your brand new car last a year and a half.

T-SQL Tuesday Roundup

Ewald Cress has what might have been the largest T-SQL Tuesday ever:

Firstly, I want to thank every person who took part. SIXTY TWO blog posts got generated, including a few first-time #tsql2sday contributors as well as first-time bloggers. I am fairly glowing to have been a part of it, and I hope the other contributors are too.

Secondly, from my own experience in writing a post, I know it feels terrible when you start worrying about who to pick. There are many people I could have included, but I hope I have made my appreciation for them clear elsewhere. Not that I want to speak on your behalf, but I’ll assume that the same applies to many other contributors.

That’s a lot of reading.

Importing SSMS Registered Servers Into SQL Operations Studio

Drew Furgiuele has a hankering for SQL Operations Studio and wants to invite a few servers to the party:

One barrier to entry is that the initial setup can be a little daunting, especially if you use a local connection groups or central management servers to keep track of registered connections in SQL Server Management Studio. You’d be in for a lot of manual clicking and typing of connections if you have a lot of saved connections. But there’s a better way: you can import all that saved information right into SQL Operations Studio, and it’s pretty painless, too. Buckle up, because this involves a little knowledge of how settings are saved in Operations Studio, and how we can quickly get saved connection information out of SSMS and into your new application. Spoiler alert, we’re going to use PowerShell.

I’d love to see CMS support in SQL Operations Studio.  In the meantime, this is a more or less reasonable alternative, depending upon how many servers you have and how frequently they change.

Clusterless Availability Groups For Scaling Out Reads

Sean Gallardy shows a good use case for Availability Groups in scaling out reads:

Read-Scale availability groups are ones where we don’t want the availability group for high-availability or disaster recovery, instead, we want to use it to create multiple copies of our databases that span across multiple servers allowing for the spreading of a large read-only workload. There are various scenarios where this might be extremely valuable and in previous versions of SQL Server it was possible, though there was a requirement of using Windows Server Failover Clustering (WSFC). Read-Scale availability groups do not require the WSFC component and does not give high-availability or disaster recovery, it only acts as a mechanism (availability groups) to facilitate the synchronization of the databases across multiple servers.

To reiterate, this is not used for high-availability or disaster recovery but instead to scale your databases across multiple servers for read workloads.

The remainder of the post shows how to set up an Availability Group without the corresponding Windows Server Failover Clustering components.

Categories

November 2017
MTWTFSS
« Oct Dec »
 12345
6789101112
13141516171819
20212223242526
27282930