From Bag of Words to Lexicons: A Simple Journey Through NLP Basics
2026-02-24
admin
Let me walk you through how machines learned to understand text. Nothing fancy. Just concepts.

## Bag of Words: The First Baby Step

Bag of Words is exactly what it sounds like. You take a sentence, throw all the words into a bag, shake it up, and count how many times each word appears.

"I love chai" and "chai love I" look identical to Bag of Words. Same words. Same counts. Order doesn't matter.

The idea was simple. If a word appears more often, it probably matters more. If positive words show up frequently, the text might be positive.

The problem? Context disappears. "Not good" and "good" share the word "good", but they mean opposite things. Bag of Words can't tell that "not" reverses "good". It just counts.

## Stop Words: Cleaning Up the Noise

Early on, people noticed something obvious. Words like "the", "is", "at", and "which" were showing up everywhere but adding almost no value.

Stop words are the filler words you remove before doing any actual work: the, a, an, and, but, or, for, so, to, from, with, by, at, in, on, of, about, above, below, under, over, through, during, without, within, inside, outside, upon, onto, into, towards, via, plus, minus, including, excluding, and so on.

Remove them and your Bag of Words focuses on meaningful content instead of drowning in noise.

The stop words list keeps growing as people realize more words are useless. Different languages have different lists. Different problems need different filters. But the idea stays the same. Don't waste time on words that don't matter.

## N-Grams: Adding Some Order

N-grams tried to fix the order problem by looking at word sequences instead of individual words.

Unigrams are single words. Each word stands alone. "I love chai" becomes three separate items: ["I", "love", "chai"].

Bigrams look at word pairs.
["I love", "love chai"] captures some relationship between words. Now "not good" becomes one unit instead of two separate words. That also means "not" may need to come off the stop words list, because it actually changes meaning.

We represent these as 1 and 0: either a word or phrase appears in the text, or it doesn't. Simple binary presence.

This helped with context. But it also made the data bigger. Way bigger. Every distinct word pair in the text becomes a feature. Stop words sometimes get kept in bigrams because "not good" needs both words to make sense, even though "not" alone would be removed.

## Stemming: The Chainsaw Approach

Stemming took a different path. Instead of just counting words, it chops them down to their roots using simple rules.

Running → run
Cats → cat
Studied → studi
Studies → studi
Studying → studi

The rules were plain regex patterns on word endings. If a word ends in "ing", remove it. If it ends in "ed", remove it. Fast. Cheap. Violent. Stop words are already gone. Now we're hacking at the meaningful ones.

The problem? "Studied" becomes "studi". That's not a real word. The meaning is still there, barely, but it looks ugly. Machines don't care about looks. But humans reading the output? They notice. And sometimes the meaning drifts just enough to matter.

## Lemmatization: The Surgical Approach

Lemmatization fixes the "studi" problem by using dictionaries and grammar rules instead of regex hacks.

Studied → study
Studies → study
Studying → study
Better → good
Went → go
Is → be

It understands that "studied" is the past tense of "study". It knows "better" is the comparative of "good". It has a dictionary and it uses it. Stop words are already removed. Now we're normalizing the real content.

The tradeoff? Slower than stemming. Needs more memory. Needs a proper dictionary that understands grammar. But the output is actual words that make sense to humans and machines alike.

Under the hood, it's just a really big lookup table with grammar rules attached. Not magic. Just data.

## Lexicon Based Approach: The Dictionary Method

Lexicon approaches skip the complex math entirely. You build a dictionary of words with pre-assigned sentiment scores.

Love = +2
Good = +1
Bad = -1
Hate = -2
Not = flips the next word's score (special rule)

You remove stop words first (keeping negations like "not"), then scan the remaining text, look up each word, and add up the scores. Positive total means positive sentiment. Negative total means negative sentiment. Some lexicons handle negation by flipping scores when they see "not" or "never".

Simple. Fast. No training required. It works surprisingly well for basic sentiment analysis. No GPUs needed. No deep learning. Just a list and some addition. Stop words removed, negations handled, scores added.

The downside? Sarcasm still breaks it. "Oh great, another Monday" gets counted as positive because of "great", even though any human knows it's not. Sometimes context matters more than dictionaries.

## How They All Fit Together

Stop words clean the data. Bag of Words counts what's left. N-grams preserve some order. Stemming chops words to save space. Lemmatization normalizes them properly. Lexicons assign meaning without complex math.

Modern NLP uses all of these in different combinations. Nothing gets thrown away. Every technique still has its use case.

Sometimes you just need word counts with stop words removed. Sometimes you need n-grams to catch phrases. Sometimes you stem for speed. Sometimes you lemmatize for accuracy. Sometimes you grab a lexicon and call it a day.

That's the thing about NLP. More than fifty years of development and we're still using ideas from the 1960s. Because they work. Not perfectly. But enough to build things people actually use.
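
As a rough sketch of how these pieces can fit together, the Python below chains stop word removal, Bag of Words counting, bigrams, and lexicon scoring using only the standard library. The stop word list, the regex tokenizer, and the helper names (`tokenize`, `remove_stop_words`, `bag_of_words`, `bigrams`, `lexicon_score`) are assumptions made for illustration; the lexicon scores are the ones from the list above. Stemming and lemmatization are left out to keep it short.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption, not a standard list).
# "not" and "never" are deliberately left out so negation survives filtering.
STOP_WORDS = {"the", "a", "an", "and", "is", "at", "of", "to", "in", "on"}

# Toy sentiment lexicon using the scores from the article.
LEXICON = {"love": 2, "good": 1, "bad": -1, "hate": -2}
NEGATIONS = {"not", "never"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Count how often each token appears; word order is discarded."""
    return Counter(tokens)


def bigrams(tokens):
    """Join each pair of adjacent tokens into one feature."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def lexicon_score(tokens):
    """Sum lexicon scores; a negation flips only the next word's score."""
    score = 0
    flip = False
    for token in tokens:
        if token in NEGATIONS:
            flip = True
            continue
        value = LEXICON.get(token, 0)  # unknown words score 0
        score += -value if flip else value
        flip = False
    return score


tokens = remove_stop_words(tokenize("This is not good"))
print(bigrams(tokens))        # ['this not', 'not good']
print(lexicon_score(tokens))  # -1
```

Because `bag_of_words` returns a `Counter`, "I love chai" and "chai love I" compare equal, which is exactly the order-blindness described earlier. The negation rule here only looks one word ahead, so a phrase like "not very good" would slip past it.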