Harris Amjad does some text cleanup:
Natural Language Processing (NLP) is all the rage in the current machine learning landscape. With technologies like ChatGPT, Gemini, Llama, and so many other state-of-the-art text generators getting popular with the mainstream public, many newcomers are pouring into the field of NLP. Unfortunately, before we delve into how these fancy chatbots work, we must understand how we are engineering and treating our data before we feed it to our model. In this tip, we will introduce and implement some basic text preprocessing and cleaning techniques with Python.
Click through for some common operations. Some of these, such as lower-casing all words or removing stopwords, are very important for certain tasks but likely unhelpful for others. There are also operations like spell checking and jargon expansion (or replacement) that you will likely want to include in a real-life project where actual people enter the data, as opposed to a tidy sample dataset.
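
To make the "important for some tasks, unhelpful for others" point concrete, here is a minimal sketch of my own (not code from the linked tip) in which each operation is a switch you can turn off. The function name `clean_text`, the tiny `STOPWORDS` set, and the `JARGON` map are all hypothetical stand-ins; a real project would likely pull a stopword list from an NLP library and build the jargon map from its own domain.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"}

# Hypothetical jargon map; in practice this comes from your own domain.
JARGON = {"nlp": "natural language processing", "ml": "machine learning"}


def clean_text(text, lowercase=True, remove_stopwords=True, expand_jargon=True):
    """Apply optional cleaning steps and return the cleaned string."""
    if lowercase:
        text = text.lower()

    # Keep runs of letters, digits, and apostrophes; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z0-9']+", text)

    if expand_jargon:
        # Look up each token case-insensitively; unknown tokens pass through.
        tokens = [JARGON.get(tok.lower(), tok) for tok in tokens]

    if remove_stopwords:
        # Compare case-insensitively so "The" is removed even without lowercasing.
        tokens = [tok for tok in tokens if tok.lower() not in STOPWORDS]

    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_text("NLP is all the rage in the ML landscape!"))
    # Casing can carry meaning (e.g. for named entities), so it is optional.
    print(clean_text("The Cat sat.", lowercase=False))
```

Keeping each step behind a flag is a design choice rather than a requirement: it lets you test whether, say, stopword removal actually helps your particular task before committing to it.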