Working With forcats

Kevin Feasel

2018-03-30

R

S. Richter-Walsh demonstrates what the forcats R package can do:

Synonymous factor levels

Sometimes a categorical variable may have two or more factor levels that refer to the same group. There may be subtle differences in syntax such as upper case leading letter versus lower case leading letter (GroupA vs. groupA), for example. In this situation, one can use forcats::fct_collapse() to collapse the synonymous levels into one. In our test data, let’s assume that Web and Online refer to the same sales channel and we want to combine both into a factor level called Online….

df$sales <- fct_collapse(df$sales, Online = c("Online", "Web"))

I don’t use forcats that often, but when I do, I definitely appreciate it being here.  H/T R-Bloggers

Tips For Processing Large Data Sets With Python

Kevin Feasel

2018-03-30

Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csvallows you to load massive file as small chunks in Pandas. We decide to take 10% of the total length for the chunksize which corresponds to 40 Million rows.
Be careful it is not necessarily interesting to take a small value. The time between each iteration can be too long with a small chaunksize. In order to find the best trade-off “Memory usage – Time” you can try different chunksize and select the best which will consume the lesser memory and which will be the faster.

Click through for more tips.

Plotting In R Using ggplot2

The folks at Sharp Sight Labs have another nice demo of ggplot2:

You’ve heard me say it a thousand times: to master data science, you need to practice.

You need to “practice small” by practicing individual techniques and functions. But you also need to “practice big” by working on larger projects.

To get some practice, my recommendation is to find reasonably sized datasets online and plot them.

Wikipedia is a nearly-endless source of good datasets. The great thing about Wikipedia is that many of the datasets are small and well contained. They are also fairly clean, with just enough messiness to make them a bit of a challenge.

As a quick example, this week, we’ll plot some economic data.

The code is deceptively easy considering the scope of the problem.

More Fun With Row Goals

Joe Obbish has a detailed investigation into performance of a simple-enough query:

Something looks very wrong here. The loop join plan has a significantly lower cost than the hash join plan! In fact, the loop join plan has a total cost of 0.0167621 optimizer units. Why would disabling row goals for such a plan cause a decrease in total query cost?

I uploaded the estimated plans here for those who wish to examine them without going through the trouble of creating tables.

It’s a long but worthwhile read.

Null Reference Exception In SSMS 17.5

David Fowler notes a bug in SQL Server Management Studio 17.5 around availability group failover:

After a spot of head scratching and thinking that there was something wrong with my AG setup, it turns out that there’s a bug in SSMS.  I was running SSMS 17.5 although this may well also affect earlier versions.

Looking at the release notes for SSMS 17.6, one of the bug fixes that this version addresses is…

  • Fixed an issue when the primary is down and manually failover to secondary, a NullReferenceException will be thrown.

David notes that upgrading fixed his issue; read on for more.

Storage Spaces Direct Complications With SQL Server

Allan Hirt walks us through some of the sharp edges around using Storage Spaces Direct (S2D) beneath a Windows Failover Cluster Instance running SQL Server:

Arguably, the biggest thing about S2D is that the solutions currently have to be certified (see this bit of documentation from MS for more detail). This obviously doesn’t really affect, say, virtualized versions or ones up in the public cloud such as in Azure in a meaningful way, but it’s still technically a requirement much like logoed hardware for Windows Server supportability. Anyone want to point me to the logo stamped on your VMs? Didn’t think so. Now, from a pure FCI perspective none of this is an issue. The way a Windows Server failover cluster (WSFC) is currently designed, it is expecting that all nodes participating in the WSFC are also using/needed S2D. Why am I mentioning this? Disaster recovery.

It shouldn’t stop you from moving forward with S2D, but means you’ll probably have a bit more research on your hands.

Temp Table Caching And Reuse

Shane O’Neill ran into an error with his stored procedure call:

We store the results in a temporary table first.

Don’t worry, that’s not the end of the post. That’s not even the point of this post. It is, however, what I was doing when I came across a weird error.

Let me show you!

INSERT EXEC failed because the stored procedure altered the schema of the target table.

He wasn’t really trying to alter the schema of that temp table, but read on to see the context and solution.

Using CONCAT_WS

Dave Mason points out another nice addition to the T-SQL toolbelt in SQL Server 2017:

In the last post, I looked at a new T-SQL function for SQL Server 2017. Let’s continue down that path and look at CONCAT_WS(), which is also new for SQL Server 2017. Here’s the definition of the function from Microsoft Docs:

“Concatenates a variable number of arguments with a delimiter specified in the 1st argument. (CONCAT_WS indicates concatenate with separator.)”

Read on for an example using CONCAT_WS.  It’s one of those functions that I haven’t quite committed to memory, but every time I get reminded of it, I remember that I really need to remember it.

Understanding Confusion Matrices

Eli Bendersky explains what it is a confusion matrix tells us:

Now comes our first batch of definitions.

  • True positive (TP): Positive test result matches reality — the person is actually sick and tested positive.
  • False positive (FP): Positive test result doesn’t match reality — the test is positive but the person is not actually sick.
  • True negative (TN): Negative test result matches reality — the person is not sick and tested negative.
  • False negative (FN): Negative test result doesn’t match reality — the test is negative but the person is actually sick.

Folks get confused with these often, so here’s a useful heuristic: positive vs. negative reflects the test outcome; true vs. false reflects whether the test got it right or got it wrong.

It’s a nice read.  The next step, after understanding these, is figuring out in which circumstances we want to weigh some of these measures more than others.

A Frugal Stretch Database Alternative

Chris Bell shares a version of Stretch databases for people with budgets:

Stretch databases were going to provide “Cost-effective” availability for cold data, and unlike typical cold data storage,  our data would always be online and available to query. Applications would not need to be modified to work with the seamless design of the stretch database. Run a query, and the data was there being pulled from the cloud when needed. Streamlining on-premises data maintenance by reducing the local footprint of the data files as well as the size of backups! It was even going to be possible to keep data secure via encrypted connections to the cloud and in theory, make a migration to the cloud even easier.

It was destined to be a major win!

Then the price was mentioned.

Do you know anyone using stretch databases today?

Yeah, me neither.

It’s an interesting workaround with several moving parts.

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031