Today's quick post is on removing HTML tags using the stringr package in R.
My purpose here is to take some raw data, which can include HTML markup, and prepare it for a vectorizer. I don’t need the resulting output to look pretty; I just want to get rid of the HTML tags.
Click through for the script. If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.
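For a rough idea of what this kind of cleanup can look like, here is a minimal sketch. It is not the linked script: the helper name `strip_html` and both regular expressions are my own assumptions, and the approach is a blunt pattern match rather than a real HTML parser. It uses stringr's `str_replace_all()` and `str_squish()`.

```r
library(stringr)

# Hypothetical helper (not from the linked script): crude tag removal
# intended only for pre-processing text before vectorization.
strip_html <- function(x) {
  # Assumption: anything between "<" and ">" is a tag. Swap it for a space
  # so that words on either side of a tag do not get glued together.
  no_tags <- str_replace_all(x, "<[^>]+>", " ")
  # Assumption: entities such as &amp; or &#39; are noise for the vectorizer,
  # so they are dropped rather than decoded.
  no_entities <- str_replace_all(no_tags, "&#?[A-Za-z0-9]+;", " ")
  # Trim the ends and collapse the runs of whitespace left behind.
  str_squish(no_entities)
}

raw <- c("<p>Hello <b>world</b>!</p>", "Tom &amp; Jerry<br/>cartoons")
strip_html(raw)
```

Because tags become spaces, punctuation can end up detached from the word before it (for example `world !`). That is fine for a bag-of-words style vectorizer, but it is exactly the kind of rough edge that makes this a sledgehammer rather than a tool for producing readable text.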