Using stringr To Remove HTML

Kevin Feasel

2018-02-22

R

I have a quick post on removing HTML markup with stringr:

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.

Click through for the script.  If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.

Related Posts

Using DALEX To Explain Black-Box Models

Przemyslaw Biecek explains that there’s more than LIME for explaining black-box models: I’ve heard about a number of consulting companies, that decided to use simple linear model instead of a black box model with higher performance, because ,,client wants to understand factors that drive the prediction’’. And usually the discussion goes as following: ,,We have tried LIME […]

Read More

Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R: From the plots above, one can see that: the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is […]

Read More

Categories

February 2018
MTWTFSS
« Jan Mar »
 1234
567891011
12131415161718
19202122232425
262728