Fundamentals of NLP: Preprocessing Text Using NLTK & SpaCy
Tokenization, stemming, and lemmatization are essential natural language processing (NLP) tasks. Tokenization involves breaking text into units (tokens), such as words or phrases, to facilitate analysis. Stemming reduces words to a common base form by removing prefixes or suffixes, promoting simplicity in representation. In contrast, lemmatization considers grammatical aspects to transform words into their base or dictionary form.

You will begin this course by tokenizing text using the Natural Language Toolkit (NLTK) and SpaCy, which involves splitting a large block of text into smaller units called tokens, usually words or sentences. You will then remove stopwords: common words such as "a" and "the" that add little meaning to text. Next, you'll explore the WordNet lexical database, which contains information about the semantic relationships between words. You'll use synsets to view similar words and explore hypernyms, hyponyms, meronyms, and holonyms. Finally, you'll compare stemming and lemmatization: you will explore both processes with NLTK and perform lemmatization using SpaCy. The short sketches below preview what each of these steps looks like in code.
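A minimal tokenization sketch with both libraries, assuming the NLTK "punkt" tokenizer data and the SpaCy "en_core_web_sm" model have already been downloaded (nltk.download("punkt") and python -m spacy download en_core_web_sm). The example text is illustrative only.

```python
from nltk.tokenize import sent_tokenize, word_tokenize
import spacy

text = "NLP is fascinating. Tokenization splits text into smaller units."

# NLTK: split into sentences, then into word-level tokens
print(sent_tokenize(text))   # ['NLP is fascinating.', 'Tokenization splits text into smaller units.']
print(word_tokenize(text))   # ['NLP', 'is', 'fascinating', '.', 'Tokenization', ...]

# SpaCy: tokenization happens as part of the processing pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([token.text for token in doc])      # word-level tokens
print([sent.text for sent in doc.sents])  # sentence spans
```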
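Stopword removal can then be layered on top of tokenization. The sketch below assumes the NLTK "stopwords" corpus is installed (nltk.download("stopwords")); SpaCy exposes the same idea through each token's is_stop attribute.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The cat sat on a mat in the sun.")

# Keep only tokens that are not in the English stopword list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['cat', 'sat', 'mat', 'sun', '.']
```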
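WordNet's lexical relations are available directly through NLTK. A brief sketch, assuming the "wordnet" corpus is installed (nltk.download("wordnet")), using the first synset of "dog" as an example:

```python
from nltk.corpus import wordnet as wn

# A synset groups words that share one sense
dog = wn.synsets("dog")[0]
print(dog.definition())       # dictionary-style gloss for this sense
print(dog.lemma_names())      # words belonging to the same synset

print(dog.hypernyms())        # more general concepts (e.g. canine)
print(dog.hyponyms())         # more specific concepts (e.g. puppy)
print(dog.part_meronyms())    # parts of a dog
print(dog.member_holonyms())  # wholes a dog belongs to (e.g. pack)
```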
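Finally, a side-by-side sketch of stemming versus lemmatization. It uses NLTK's PorterStemmer and WordNetLemmatizer and SpaCy's lemma_ attribute, again assuming the "wordnet" corpus and the "en_core_web_sm" model are installed; the word list is an arbitrary example.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy

words = ["running", "studies", "better", "mice"]

# Stemming: crude suffix stripping, may produce non-words like "studi"
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])  # ['run', 'studi', 'better', 'mice']

# NLTK lemmatization: uses WordNet and a part-of-speech hint
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])

# SpaCy lemmatization: context-aware, produced as part of the pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("The mice were running and studying better strategies.")
print([(token.text, token.lemma_) for token in doc])
```

Note the difference in output: the stemmer can return truncated strings that are not dictionary words, while both lemmatizers map words to valid base forms.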