Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.

One-Way ANOVA Testing With R

Bidyut Ghosh shows how to perform a one-way ANOVA test in R:

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean value of mileage run across all the tyre brands. In other words, the average mileage of the four tyre brands are not equal.
Now you have to find out the pair of brands which differ. For this you may use the Tukey’s HSD test.

ANOVA is a fairly simple test, but it can be quite useful to know.

The Value Of Tidyeval

Kevin Feasel



Bruno Rodrigues explains why he likes tidyeval:

Last year, this column, let’s call it spam, had values 1 for good and 0 for bad. This year the column is called Spam and the values are 1 and 2. When I found out that this was the source of the problem, I just had to change the arguments of my functions from

generate_spam_plot(dataset = data2016, column = spam, value = 1)
generate_spam_plot(dataset = data2016, column = spam, value = 0)


generate_spam_plot(dataset = data2017, column = Spam, value = 1)
generate_spam_plot(dataset = data2017, column = Spam, value = 2)

without needing to change anything else. This is why I use tidyeval; without it, writing a function such as genereta_spam_plot would not be easy. It would be possible, but not easy.

Read the whole thing.

R Services Packet Captures

Niels Berglund continues his R Services internals series:

In Figure 15, I set the filter to be tcp.srcport==50755, and then I applied the filter by clicking the arrow. To start using this:

  • Clear the Process Monitor display, and make sure you are capturing events.
  • Start WireShark capturing (Ctrl+E). If you get a question whether you want to save the captured packets, just click “Continue without Saving”.
  • Execute the code in Code Snippet 3.

The Process Monitor output looks almost the same as in Figure 9, whereas the WireShark output looks like so:

Niels also includes a recap to help people who haven’t been following along get up to speed.

Loading Webpage Data Into Excel

Kerry Tyler has an interesting use of Excel:

Fortunately, there are some tools built right into Excel that make this a whole lot easier than scrolling through the list in your browser. Armed with nothing more than the URL to the CU’s KB article and Excel 2016 (or a few older versions) quick work can be made of generating custom filters for this data.

Here are the steps:

The same Power Query bits are available in Power BI, so check it out.

Using Visual Studio Code As A Powershell IDE

Mike Robbins has a new video up:

If you follow me on Twitter, then I’m sure you’re aware that I’ve been using nothing but VS Code (Visual Studio Code) as a replacement for the PowerShell ISE (Integrated Scripting Environment) for the past couple of weeks and while I had tried it in the past, I didn’t previously think it was ready for prime time. That’s now changed with all of the updates and work that has gone into it. From what I’ve found, it works fairly well flawlessly so I’ve created a short and simple video to help others get VS Code installed and configured as a replacement for the PowerShell ISE.

Click through for some interesting comments, especially one from Mike Fal.

Finding Compressable Indexes

Tracy Boggiano has a script to help you figure out which indexes make sense to compress:

We can write procedure check periodically rather a table will benefit from compression or not.  There are a few tricks though:

  1. If you have sparse columns you cannot compress the table, we check for that in lines 70-74.
  2. Next if is a index type of 1 (a clustered index) it cannot contain LOB data types image, ntext, or text.  Which these days shouldn’t be a problem, they have been after all deprecated, so you aren’t using them.  We check for these in lines 76-84.
  3. Finally, we perform the same test for LOB data on index type of 2 (nonclustered indexes) in lines 85-95.

If everything checks out OK we call our second procedure to calculate the percentage of space we would space and if we would save more than 25% then we create script to compression the index.  Now we have a @Debug parameter in here so it can just show us what code it would execute to determine what it will do.

Click through for the code for the two procedures Tracy has created.

Errors With Invalid Backup Location

Adrian Buckman shows the types of errors you should expect when your default backup location is invalid:

Recently I was looking through the error log on one of my test machines and I spotted some unusual errors:

SQL ERROR: 3634 – The operating system returned the error ‘3(The system cannot find the path specified.)’ while attempting ‘DeleteFile’
SQL ERROR: 18272 – During restore restart, an I/O error occurred on checkpoint file (operating system error (null)). The statement is proceeding but cannot be restarted. Ensure that a valid storage location exists for the checkpoint file.

At first I assumed that I may have tried restoring a database to a location that did not exist but this was not the case, the actual issue was with SQL Server’s Default Backup Location.

Read on for the full explanation.

Creating A Date Dimension In Power BI

Reza Rad has released part 1 of a series on creating date dimensions in Power BI:

When it comes to create the date (or calendar) dimension in Power BI, there is always a question: Should I create the dimension with Power Query or DAX? This is a very good question to ask. It means that you know that there are multiple ways of creating it. What is the difference? the answer is that for many scenarios these are similar. So, it might not be different to use Power Query or DAX for it. However, there is a big difference.

Power Query can fetch data from live web APIs. This functionality gives you the power to fetch public holidays live from an API. You cannot do this with DAX! Apart from this big difference, majority of other requirements can be done with both, you can write calculations in both M or DAX to get calendar columns as well as fiscal columns. In many scenarios public holidays plays an important role in analyzing data. You would like to know how the sales was in holidays compared to other holidays and etc.

There are a few more parts forthcoming, so stay tuned.


August 2017
« Jul Sep »