NLP Foundations

I’m Siddhesh, a Microsoft Certified Trainer, cloud architect, and AI practitioner focused on helping developers and organizations adopt AI effectively. As a Pluralsight instructor and speaker, I design and deliver hands-on AI enablement programs covering Generative AI, Agentic AI, Azure AI, and modern cloud architectures.
With a strong foundation in Microsoft .NET and Azure, my work today centers on building real-world AI solutions, agentic workflows, and developer productivity using AI-assisted tools. I share practical insights through workshops, conference talks, online courses, blogs, newsletters, and YouTube—bridging the gap between AI concepts and production-ready implementations.
Why This Module Exists (Big Picture)
Before an AI Agent can act, it must:
Understand what the user said
Extract useful signals
Decide what to do next
NLP is the bridge between raw text and agent decisions
1️⃣ NLP Foundations – What & Why
Natural Language Processing converts human language into structured signals that machines can reason over.
Why it matters
Users never speak in structured JSON
Agents rely on interpretable signals (intent, entities, sentiment)
Connection to Module 3
➡️ NLP prepares the input layer for agents
➡️ Agents use NLP outputs to choose tools, actions, or responses
2️⃣ Text Cleaning – Why Noise Removal is Critical
Removing:
Special characters
Emojis
Extra spaces
Inconsistent casing
Why it matters
Noise reduces accuracy
Inconsistent text leads to wrong interpretations
Example
"Camera!!! is GREAT 😍" → "camera is great"
Connection to Agents
➡️ Clean text = reliable intent detection
➡️ Dirty text = agent confusion or wrong tool usage
Create a folder nlp-demo
Create and activate a virtual environment
python -m venv labenv
./labenv/Scripts/Activate.ps1 [Windows]
pip install nltk spacy scikit-learn textblob regex
python -m nltk.downloader punkt punkt_tab stopwords wordnet averaged_perceptron_tagger_eng
python -m spacy download en_core_web_sm
Create a Python file nlp.py
text = """
Hello!!! I bought this phone for ₹25,000.
Battery-life is great :) but camera quality is poor!!!
Contact me at user123@email.com
"""
Demo 1: Basic Text Cleaning
import re
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text)
Terminal>> python nlp.py
What This Shows
Removes special characters
Converts to lowercase
3️⃣ Regular Expressions (Regex) – Pattern Detection
Rule-based pattern matching for:
Emails
Phone numbers
IDs
Keywords
Real-world relevance
Extract order numbers
Detect support tickets
Identify PII
Connection to Agents
➡️ Regex acts as a pre-filter
➡️ Agent decides:
“I already know this is an email / ID — no need to ask LLM”
Extract Email using Regex
# Run on the original text, since cleaning stripped '@' and '.'
emails = re.findall(r"\S+@\S+", text)
print(emails)
4️⃣ Tokenization – Breaking Text into Meaningful Units
Splitting text into words or tokens.
"I need health insurance" →
["I", "need", "health", "insurance"]
Why it matters
Machines don’t understand sentences
They understand tokens
➡️ Tokens help agents:
Detect keywords
Map commands
Route tasks
Example:
“book flight” → travel agent
“file claim” → insurance agent
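The keyword-to-agent mapping above can be sketched as a tiny router (the agent names and keyword pairs here are illustrative placeholders, not part of the module code):

```python
# Minimal keyword-based router: maps required tokens to a handler name.
# Agent names and keywords below are illustrative assumptions.
ROUTES = {
    ("book", "flight"): "travel_agent",
    ("file", "claim"): "insurance_agent",
}

def route(command: str) -> str:
    tokens = command.lower().split()
    for keywords, agent in ROUTES.items():
        # Route only if every required keyword appears in the command
        if all(k in tokens for k in keywords):
            return agent
    return "fallback_agent"

print(route("book flight to Pune"))    # travel_agent
print(route("file claim for my car"))  # insurance_agent
```

In a real agent this lookup table would give way to intent classification, but the routing idea stays the same.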
5️⃣ Stopwords – Removing Low-Value Words
Common words with little meaning:
is, the, a, and, but
Why remove them?
Reduce noise
Improve signal clarity
Example
"I want to buy a policy" →
["want", "buy", "policy"]
Connection to Agents
➡️ Helps agents focus on action words
➡️ Improves intent classification accuracy
6️⃣ Lemmatization – Normalizing Meaning
Converting words to their base form.
buying → buy
policies → policy
Why it matters
Same meaning, different forms
Avoids duplication of logic
Connection to Agents
➡️ Agents match intent patterns
➡️ Lemmatization ensures:
“buy”, “buying”, “bought” → same action
Demo: Tokenization, Stopwords, Lemmatization
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Tokenization
tokens = word_tokenize(clean_text)
print(tokens)
#Remove Stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)
#Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized)
Token → word
Stopwords → noise
Lemma → base meaning
7️⃣ POS Tagging
Labeling words as:
Noun
Verb
Adjective
Why it matters
Understanding what the user wants vs what they describe
Example:
"Buy health insurance"
Buy → Verb (action)
insurance → Noun (object)
Connection to Agents
➡️ Helps agents identify:
Action (what to do)
Entity (what to act on)
from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)
| Tag | Meaning | Example from output | Why it matters |
| --- | --- | --- | --- |
| NN | Noun (thing/object) | phone, batterylife, quality | Tells agent what is being talked about |
| VBD | Verb (past tense) | bought | Indicates an action already done |
| VBZ | Verb (present, singular) | is | Describes current state |
| JJ | Adjective | great, poor | Indicates opinion or sentiment |
| DT | Determiner | this | Points to a specific object |
| IN | Preposition | for, at | Shows relationships |
| CD | Cardinal number | 25000 | Used for amounts, pricing |
| CC | Conjunction | but | Shows contrast |
| PRP | Pronoun | me | Refers to a person |
8️⃣ Named Entity Recognition (NER)
Identifying real-world entities:
Names
Dates
Money
Locations
Products
Example
"I bought this phone for ₹25,000"
→ MONEY = 25000
Why it matters
Critical for automation
Enables personalization
Connection to Agents
➡️ Agents use entities as parameters
➡️ Example:
“Create insurance for ₹5L coverage”
Easiest NER: spaCy (Visual & Simple)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
9️⃣ Vectorization – Converting Text to Numbers
Transforming text into numerical form so machines can compare meaning.
Why it matters
Computers cannot compare words
Numbers allow similarity measurement
Example
“battery life is good”
“battery lasts long”
➡️ High similarity score
Connection to Agents
➡️ Used in:
Semantic search
Memory retrieval
RAG pipelines
🔟 Text Similarity – Understanding Meaning, Not Keywords
Measuring how close two sentences are in meaning.
Why it matters
Users phrase the same intent differently.
Connection to Agents
➡️ Enables:
Intent matching
Past conversation recall
Tool selection
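A simple sketch of similarity scoring combines the TF-IDF vectors above with cosine similarity from scikit-learn (note this measures word overlap, not true meaning; semantic matching in production usually relies on embeddings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "battery life is good",        # reference sentence
    "battery lasts long",          # similar topic
    "the camera quality is poor",  # different topic
]

vectors = TfidfVectorizer().fit_transform(sentences)

# Compare the first sentence against the other two
scores = cosine_similarity(vectors[0], vectors[1:])
print(scores)  # the battery sentence scores higher than the camera one
```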
1️⃣1️⃣ Sentiment Analysis – Emotional Context
Detects:
Positive
Negative
Neutral tone
Why it matters
Same intent, different response needed.
Example:
“This plan is terrible” → support
“This plan is okay” → explanation
Connection to Agents
➡️ Agents adjust:
Tone
Escalation
Decision paths
TextBlob Demo
from textblob import TextBlob
review = "The battery life is amazing but the camera is bad"
blob = TextBlob(review)
print(blob.sentiment)
What blob.sentiment Means
Sentiment(polarity=-0.05, subjectivity=0.78)
TextBlob returns two values:
1️⃣ Polarity (Emotional Direction)
Range:
-1.0 (Negative) → 0.0 (Neutral) → +1.0 (Positive)
Your value
polarity = -0.05
Meaning
Slightly negative / near neutral
Mixed emotions cancel each other out
Why?
Sentence contains both:
Positive: “battery life is amazing”
Negative: “camera is bad”
➡️ Result is almost neutral but slightly negative.
📌 Key point
“When a sentence has mixed opinions, polarity moves closer to zero.”
2️⃣ Subjectivity (Opinion vs Fact)
Range:
0.0 (Fact) → 1.0 (Opinion)
Your value
subjectivity = 0.78
Meaning
Highly opinion-based
Contains personal judgement
Why?
Words like:
amazing
bad
➡️ These are subjective adjectives, not facts.
🧠 Simple Interpretation
“The user has a mixed opinion,
mostly expressing personal feelings,
with a slight negative tilt overall.”
Why This Matters for AI Agents (Module 3 Link)
Agents don’t just respond — they decide actions.
Example logic
Polarity < 0 → route to support
Subjectivity high → empathetic response
Mixed sentiment → clarification question
Example Agent Behavior
“I see you like the battery but are unhappy with the camera.
Would you like help comparing alternatives?”
How Module 2 Feeds into Module 3
| NLP Concept | Used in AI Agents For |
| --- | --- |
| Cleaning | Reliable input |
| Regex | Fast rule detection |
| Tokenization | Command extraction |
| Lemmatization | Intent normalization |
| POS | Action-object mapping |
| NER | Parameter extraction |
| Similarity | Intent matching |
| Sentiment | Decision routing |
Assignment
review = "The laptop performance is excellent but the price is too high"
Your task is to:
Clean the text
Tokenize
Remove stopwords
Find sentiment
Starter Code (Easiest Execution)
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
clean = re.sub(r"[^a-zA-Z\s]", "", review.lower())
tokens = word_tokenize(clean)
filtered = [w for w in tokens if w not in stopwords.words("english")]
print("Tokens:", filtered)
print("Sentiment:", TextBlob(review).sentiment)




