Press "Enter" to skip to content

Author: Kevin Feasel

tibbletime: Time-Aware Data Sets In R

At Business Science, they’ve announced a new R package:

We are excited to announce the release of tibbletime v0.0.2 on CRAN. Loads of new functionality have been added, including:

  • Generic period support: Perform time-based calculations by a number of supported periods using a new period formula.
  • Creating series: Use create_series() to quickly create a tbl_time object initialized with a regular time series.
  • Rolling calculations: Turn any function into a rolling version of itself with rollify().
  • A number of smaller tweaks and helper functions to make life easier.

As we further develop tibbletime, it is becoming clearer that the package is a tool that should be used in addition to the rest of the tidyverse. The combination of the two makes time series analysis in the tidyverse much easier to do!

Check out their demos comparing New York and San Francisco weather.  It looks like it’ll be a useful package.  H/T R-bloggers
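As a taste of the new pieces, here’s a minimal sketch (mine, not from the announcement) chaining create_series() and rollify() together; the period formula syntax follows the v0.0.2 examples, so treat the exact form as an assumption:

# Build a regular daily series for 2017; the 1~d period formula
# means "every one day" (v0.0.2 syntax -- verify against the docs)
library(dplyr)
library(tibbletime)

series <- create_series(~2017, 1~d) %>%
  mutate(value = cumsum(rnorm(n())))

# rollify() turns mean() into a rolling 30-observation version of itself
rolling_mean_30 <- rollify(mean, window = 30)

series %>% mutate(roll_mean = rolling_mean_30(value))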

Comments closed

ANOVA

Mala Mahadevan explains what ANOVA is and why it’s interesting:

ANOVA, or analysis of variance, is a term given to a set of statistical models that are used to analyze differences among groups and determine whether those differences are statistically significant enough to draw any conclusion. The models were developed by the statistician and evolutionary biologist Ronald Fisher. To give a very simplistic definition, ANOVA is an extension of the two-sample t-test to multiple groups.

ANOVA is an older test and a fairly simple process, but is quite useful to understand.
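For a quick feel of the mechanics, here’s a minimal R illustration (mine, not from Mala’s post) using the built-in iris data set:

# One-way ANOVA: does mean sepal length differ across the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)

# The F statistic and its p-value test whether at least one group mean differs
summary(fit)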

Comments closed

Updating A Collection Of Variables In PowerShell

Klaas Vanbenberghe shows how to update multiple variables in the same way using PowerShell:

We turn to two cmdlets: Get-Variable and Set-Variable. They may seem redundant, as we get and set variables all the time without them. Maybe you have never even heard of these two cmdlets. Well, here they prove their usefulness: we will set the variables using their names, and use the current value as a starting point.
An important rule to remember is this:

$ is a token to indicate we’re dealing with a variable, but it is not part of the variable name!

It’s worth reading the whole thing.
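To see the pattern in miniature, here’s a small sketch of my own (the counter names are illustrative, not Klaas’s): read each current value with Get-Variable and write the updated value back with Set-Variable.

# Two counters we want to update the same way
$ServerCount = 3
$DatabaseCount = 10

# Note: no $ on the names we pass in -- the $ is not part of the variable name
foreach ($name in 'ServerCount', 'DatabaseCount') {
    $current = Get-Variable -Name $name -ValueOnly
    Set-Variable -Name $name -Value ($current + 1)
}

$ServerCount    # 4
$DatabaseCount  # 11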

Comments closed

Neural Nets Optimizing For Imperfect

John Cook describes a paradox with neural nets:

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

Conceptually, this feels a little weird but isn’t really much of a problem, as we have other analogues:  rational ignorance in economics (where we knowingly choose not to know something because the benefit is not worth the opportunity cost of learning), OPTIMIZE FOR UNKNOWN with SQL Server (where we knowingly do not use the passed-in parameter because we might get stuck in a lesser path), etc.  But the specific process here is interesting.
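As a concrete stand-in for that idea, here’s a toy R sketch of mine (not from Cook’s post) using ridge regression via glmnet: the penalty deliberately worsens the training-set fit, trading optimality for generalization in the spirit of the deliberately mediocre fitting described above.

# 100 observations, 50 predictors, but only the first two carry signal
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 50), nrow = 100)
y <- x[, 1] - x[, 2] + rnorm(100)

# alpha = 0 is ridge regression; cross-validation picks the penalty strength
cv <- cv.glmnet(x, y, alpha = 0)

# Coefficients are shrunk toward zero instead of being fit as tightly as possible
coef(cv, s = "lambda.min")[1:5, ]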

Comments closed

COALESCE, ISNULL, And Data Types

Shane O’Neill shows a subtle difference between ISNULL and COALESCE:

You may be asking yourself “What the…?!”.

I asked myself the same question but then I thought of a better idea and asked my Senior the question instead (he usually has better answers than I do), who proceeded to tell me that the final test would only work exactly at midnight!

….so I repeated my question to him.

It’s an interesting read, and not something you’d commonly think about.
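Without spoiling the punch line entirely: the difference comes down to return types, since ISNULL takes its type from the first argument while COALESCE uses the highest-precedence type among all of its arguments. A small repro of my own (not Shane’s script):

DECLARE @d DATE = NULL;

SELECT
    ISNULL(@d, GETDATE())   AS IsNullResult,   -- typed as DATE, so the time portion is truncated
    COALESCE(@d, GETDATE()) AS CoalesceResult; -- typed as DATETIME, so the time portion survives

-- A test like ISNULL(@d, GETDATE()) = GETDATE() can therefore
-- only succeed exactly at midnight.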

Comments closed

Backup Compression And Encryption

Arun Sirpal shows the combined effects of backup encryption and backup compression in SQL Server 2017:

Do not forget about the certificate! Warning: The certificate used for encrypting the database encryption key has not been backed up. Imagine if you need to recover the backup and you can’t?  You will get the dreaded thumbprint error.

Msg 33111, Level 16, State 3, Line 25
Cannot find server certificate with thumbprint ‘0x78FAB5A2A5D593FD3C4E163C90B745F70AB51233’.
Msg 3013, Level 16, State 1, Line 25
RESTORE DATABASE is terminating abnormally.

So make sure you respect this certificate (and the key): back them up and re-create them on the target server for a successful restore.

In SQL Server 2016 and 2017, there’s no reason not to encrypt backups; the marginal cost is practically nil even if you’re low enough on disk space that you need to do backup compression.
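The moving parts fit together roughly like this (a sketch with placeholder database, certificate, and path names, not Arun’s exact script): back up the certificate and its private key first, then take a compressed, encrypted backup.

USE master;

-- Back up the certificate and private key so the backup can be restored elsewhere
BACKUP CERTIFICATE BackupCert
    TO FILE = 'C:\Certs\BackupCert.cer'
    WITH PRIVATE KEY (
        FILE = 'C:\Certs\BackupCert.pvk',
        ENCRYPTION BY PASSWORD = '<StrongPasswordHere>');

-- Compression and encryption combined in a single backup
BACKUP DATABASE SalesDB
    TO DISK = 'D:\Backup\SalesDB.bak'
    WITH COMPRESSION,
         ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);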

Comments closed

Use Fixed Filegrowth Settings

Andy Galbraith notes that you should use fixed-increment filegrowth settings for log and data files:

As you probably already know, the key flaw to percentage-based FILEGROWTH is that over time the increment grows larger and larger, causing the actual growth itself to take longer and longer.  This is especially an issue with LOG files because they have to be zero-initialized before they can be used, causing excessive I/O and file contention while the growth is in progress.  Paul Randal (blog/@PaulRandal) describes why this is the case in this blog post.  (If you ever get a chance to see it, Paul also does a fun demo in some of his classes and talks on why zero initialization is important, using a hex editor to read the underlying contents of disk even after the old data is “deleted”.)

Andy also has a script to change filegrowth to fixed-increment growth depending upon the size of the file, so check that out.
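The fix itself is a one-line ALTER per file; for example (database and file names are placeholders of mine):

-- Switch a log file from percentage growth to a fixed 512MB increment
ALTER DATABASE SalesDB
    MODIFY FILE (NAME = SalesDB_log, FILEGROWTH = 512MB);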

Comments closed

Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalance can be seen in banking or financial data, where the majority of credit card uses are acceptable and very few are fraudulent.

With an imbalanced dataset, the information required to make an accurate prediction about the minority class cannot be obtained using an algorithm. So, it is recommended to use a balanced classification dataset.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnosis is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.
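To make the imbalance problem concrete, here is a small base-R sketch of mine showing one common remedy, oversampling the minority class:

# Simulate a heavily imbalanced outcome: 1% fraud, 99% legitimate
set.seed(42)
outcome <- factor(c(rep("fraud", 100), rep("ok", 9900)))
table(outcome)

# Oversample the minority class (with replacement) until the classes match
minority <- which(outcome == "fraud")
extra    <- sample(minority, sum(outcome == "ok") - length(minority), replace = TRUE)
balanced <- outcome[c(seq_along(outcome), extra)]
table(balanced)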

Comments closed

Getting Started With Apache Mesos

Mahesh Chand Kandpal shows how to install Apache Mesos:

Follow these instructions to install the required packages and other Mesos dependencies.

# Update the packages.
$ sudo apt-get update

# Install a few utility tools.
$ sudo apt-get install -y tar wget git

# Install the latest OpenJDK.
$ sudo apt-get install -y openjdk-8-jdk

# Install other Mesos dependencies.
$ sudo apt-get -y install build-essential python-dev python-six python-virtualenv libcurl4-nss-dev libsasl2-dev libsasl2-modules maven libapr1-dev libsvn-dev zlib1g-dev

Then, Anubhav Tarar shows how to install Spark on top of Mesos:

7. Now go to $SPARK_HOME/conf.

Inside your spark-env.sh, add the following parameters:

export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=/path/to/spark-2.2.0-bin-hadoop2.7.tgz

8. Start the Spark shell with Mesos as the master:

./bin/spark-shell --master mesos://127.0.0.1:5050

Mesos is a rather interesting platform, and if you’re getting into Hadoop and Spark, it’s worth learning about as well.

Comments closed