Using drop = FALSE On Data Frames

Kevin Feasel



John Mount explains why you might want to add drop = FALSE to your data.frame operations:

We were merely trying to re-order the rows and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if there had been only one input column. Another example is: d[,] is also a vector in this case.

The issue is: if we are writing re-usable code we are often programming before we know the complete contents of a variable or argument. For a data.frame named “g” supplied as an argument: g[vec, ] can be a data.frame or a vector (or even possibly a list). However we do know if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame (assuming vec is a vector of valid row indices or a logical vector, note: NA induces some special cases).

We care as vectors and data.frames have different semantics, so are not fully substitutable in later code.

Definitely read the comments on this one as well, as John extends his explanation and others chime in with very useful notes.

The Process Of Processing Data

I continue my series on launching a data science project:

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.

Finally, I’m calling it incorrect data when something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create roughly 1.33 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.
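As a rough sketch of how this kind of incorrect data might be flagged, here is a simple range check in Python.  The column name HoursWorkedPerWeek and the sample rows are assumptions made up for illustration; only SalaryUSD and the 200-hour example come from the survey discussion above.

```python
# A week has 7 * 24 = 168 hours, so anything above that is impossible.
HOURS_IN_WEEK = 7 * 24


def find_impossible_hours(rows, column="HoursWorkedPerWeek"):
    """Return the rows whose weekly hours fall outside the range 0..168.

    The column name is hypothetical; substitute whatever the data set uses.
    """
    return [row for row in rows if not 0 <= row[column] <= HOURS_IN_WEEK]


# Made-up sample rows, loosely modeled on the example in the text.
survey = [
    {"SalaryUSD": 95000, "HoursWorkedPerWeek": 200},
    {"SalaryUSD": 80000, "HoursWorkedPerWeek": 45},
]

suspect = find_impossible_hours(survey)
```

A check like this only surfaces the suspicious rows.  Whether to correct them, drop them, or keep them is still a judgment call.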

In this series, I’ve found myself writing a bit more than expected, so I’m breaking out theory from implementation.  This is the theory post, with implementation coming next week.

Image Recognition Using Viola-Jones

Ellen Talbot lays out some of the basics of image recognition:

Aggregate channel features (ACF) is a variation of channel features, which extracts features directly as pixel values in extended channels without computing rectangular sums at various locations and scales.

Common channels include the colour channels, such as grey-scale and RGB, but many other channels can be encoded, depending on the difficulty of your problem (e.g. gradient magnitude and gradient histograms).

ACF has advantages, such as a richer representation, accelerated detection speed and more accurate localisation of objects in the images when used in conjunction with a boosting method.

Click through for more, including a few resources around the Viola-Jones algorithm.

Configuring SQL Operations Studio

Ahmad Yaseen demonstrates how to configure SQL Operations Studio as well as how to write queries with it:

To customize your connection, click the Advanced button that provides a large number of options that can help you to draw a specific type of connection. For example, you can specify the application workload type when connecting to the server by setting the Application Intent option. You can also override the default Connect Timeout setting, the SQL Server Current Language, the default Column Encryption Setting for all commands on the connection, the Encrypt option to use the SSL encryption for all data sent between the client and the server if there is an installed certificate, Persist Security Info to prevent returning the password as a part of the connection, and use the SSL encryption although there is no certificate in the server by enabling the Trust Server Certificate.

You can also use the Advanced options to specify the number of attempts to restore connection and the delay between attempts using Connect Retry Count and Connect Retry Interval. In addition, you will also be able to specify the maximum and the minimum number of connections allowed in the pool with the ability to force that the connection object is drawn from the appropriate pool, and the minimum amount of time for that connection to live in the pool using Load Balance Timeout. The Failover Partner option allows you to provide the name of the SQL Server instance that acts as a failover partner. You can control the size of the network packets used to communicate with the SQL Server instance using the Packet Size option.

It’s interesting to see just how much you can configure in the tool.

Full-Screen SSMS

Wayne Sheffield has another SSMS tip for us:

Do you ever find yourself working on a query and realize that you need just a bit more real estate in the SSMS window? Or perhaps you find that all the toolbars, menus, etc. are cluttering things up? To solve these issues, you can toggle the full screen mode in SSMS on. It will remove all that clutter and maximize the query window. Below, you can see a cluttered SSMS with two rows of buttons, and toolbars on both sides of it.

Click through to see how to enable full-screen mode.

Query Store And Multiple Plans Per Query

Kendra Little follows Betteridge’s Law:

Can I Force Multiple Plans for a Query in Query Store?


At least, not right now.

I started thinking about this when I noticed that the sys.sp_query_store_unforce_plan requires you to specify both a @query_id and a @plan_id.

If there’s only ever one plan that can be forced for a query, why would I need to specify the @plan_id?

I’ve got no insider knowledge on this, I just started thinking about it.

Read on for Kendra’s thoughts.  Maybe we will get something like multiple plans for a single query in the future, though figuring out which forced plan would relate to which combination of parameters would get complex pretty fast.

Calculating The End Of The Month

Bob Pusateri gives us a few techniques for calculating the last day of a particular month:

Months are funny. Unlike other parts of a date, they vary in length:

  • The last second of a minute is always 59.
  • The last minute of an hour is always 59.
  • The last hour of a day is always 23.

But the last day of a month? Well that depends on what month it is. And the year matters too because a leap year means February gets an extra day.

Click through for several techniques, including the knuckle technique for advanced practitioners.  But what if I need to calculate the end of a lunar month?
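To make the month and leap-year dependence concrete, here is a minimal sketch in Python.  This is not one of Bob's techniques (his post works in T-SQL); the function names are made up for illustration and only the standard library is used.

```python
import calendar
from datetime import date, timedelta


def last_day_of_month(year: int, month: int) -> date:
    """Look up the month length, which accounts for leap years."""
    # monthrange returns (weekday of the first day, number of days in month).
    days_in_month = calendar.monthrange(year, month)[1]
    return date(year, month, days_in_month)


def last_day_via_next_month(d: date) -> date:
    """Go to the first day of the following month, then step back one day."""
    # Day 28 exists in every month, and adding 4 days always lands in the
    # next month, so replacing the day with 1 gives the first of next month.
    first_of_next = (d.replace(day=28) + timedelta(days=4)).replace(day=1)
    return first_of_next - timedelta(days=1)
```

Both functions should agree for any valid date.  The second mirrors the common trick of finding the first day of the next month and subtracting one day.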

Migrating Database Files

Jeff Mlakar gives us three methods for migrating database files from one location to another:

The database will be unavailable during this operation so we need to notify our end users. Consider the ramifications if an application is using the database – we might want to stop application services or take some other custom action during the move.

Plan ahead before starting the job. Know what you are going to do before doing it. If you can test your method against a lab or development database that will help too.

Sound advice and technique.  Click through to see those three methods.

Switching Partitions And Table Structure

Andrew Pruski demonstrates a gotcha when switching partitions between tables:

When working with partitioning the SWITCH operation has to be my favourite. The ability to move a large amount of data from one table to another as a META DATA ONLY operation is absolutely fantastic.

What’s also cool is that we can switch data into a non-partitioned table. Makes life a bit easier not having to manage two sets of partitions!

However, there is a bit of a gotcha when doing this. Let’s run through a quick demo.

Read on for more.


March 2018