NLP Foundations

I’m Siddhesh, a Microsoft Certified Trainer, cloud architect, and AI practitioner focused on helping developers and organizations adopt AI effectively. As a Pluralsight instructor and speaker, I design and deliver hands-on AI enablement programs covering Generative AI, Agentic AI, Azure AI, and modern cloud architectures.
With a strong foundation in Microsoft .NET and Azure, my work today centers on building real-world AI solutions, agentic workflows, and developer productivity using AI-assisted tools. I share practical insights through workshops, conference talks, online courses, blogs, newsletters, and YouTube—bridging the gap between AI concepts and production-ready implementations.
Why This Module Exists (Big Picture)
Before an AI Agent can act, it must:
Understand what the user said
Extract useful signals
Decide what to do next
NLP is the bridge between raw text and agent decisions
1️⃣ NLP Foundations – What & Why
Natural Language Processing converts human language into structured signals that machines can reason over.
Why it matters
Users never speak in structured JSON
Agents rely on interpretable signals (intent, entities, sentiment)
Connection to Module 3
➡️ NLP prepares the input layer for agents
➡️ Agents use NLP outputs to choose tools, actions, or responses
2️⃣ Text Cleaning – Why Noise Removal is Critical
Removing:
Special characters
Emojis
Extra spaces
Inconsistent casing
Why it matters
Noise reduces accuracy
Inconsistent text leads to wrong interpretations
Example
"Camera!!! is GREAT 😍" → "camera is great"
Connection to Agents
➡️ Clean text = reliable intent detection
➡️ Dirty text = agent confusion or wrong tool usage
Create a folder nlp-demo
Create and activate a virtual environment
python -m venv labenv
./labenv/Scripts/Activate.ps1 [Windows]
pip install nltk spacy scikit-learn textblob regex
python -m nltk.downloader punkt punkt_tab stopwords wordnet averaged_perceptron_tagger_eng
python -m spacy download en_core_web_sm
Create a Python file nlp.py
text = """
Hello!!! I bought this phone for ₹25,000.
Battery-life is great :) but camera quality is poor!!!
Contact me at user123@email.com
"""
Demo 1: Basic Text Cleaning
import re
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text)
Terminal>> python nlp.py
What This Shows
Removes special characters
Converts to lowercase
3️⃣ Regular Expressions (Regex) – Pattern Detection
Rule-based pattern matching for:
Emails
Phone numbers
IDs
Keywords
Real-world relevance
Extract order numbers
Detect support tickets
Identify PII
Connection to Agents
➡️ Regex acts as a pre-filter
➡️ Agent decides:
“I already know this is an email / ID — no need to ask LLM”
Extract Email using Regex
# Run on the original text, since cleaning stripped '@' and '.'
emails = re.findall(r"\S+@\S+", text)
print(emails)
4️⃣ Tokenization – Breaking Text into Meaningful Units
Splitting text into words or tokens.
"I need health insurance" →
["I", "need", "health", "insurance"]
Why it matters
Machines don’t understand sentences
They understand tokens
➡️ Tokens help agents:
Detect keywords
Map commands
Route tasks
Example:
“book flight” → travel agent
“file claim” → insurance agent
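The keyword-to-agent mapping above can be sketched as a tiny router (the agent names and keyword pairs here are illustrative placeholders, not part of the module code):

```python
# Minimal keyword-based router: maps required tokens to a handler name.
# Agent names and keywords below are illustrative assumptions.
ROUTES = {
    ("book", "flight"): "travel_agent",
    ("file", "claim"): "insurance_agent",
}

def route(command: str) -> str:
    tokens = command.lower().split()
    for keywords, agent in ROUTES.items():
        # Route only if every required keyword appears in the command
        if all(k in tokens for k in keywords):
            return agent
    return "fallback_agent"

print(route("book flight to Pune"))    # travel_agent
print(route("file claim for my car"))  # insurance_agent
```

In a real agent this lookup table would give way to intent classification, but the routing idea stays the same.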
5️⃣ Stopwords – Removing Low-Value Words
Common words with little meaning:
is, the, a, and, but
Why remove them?
Reduce noise
Improve signal clarity
Example
"I want to buy a policy" →
["want", "buy", "policy"]
Connection to Agents
➡️ Helps agents focus on action words
➡️ Improves intent classification accuracy
6️⃣ Lemmatization – Normalizing Meaning
Converting words to their base form.
buying → buy
policies → policy
Why it matters
Same meaning, different forms
Avoids duplication of logic
Connection to Agents
➡️ Agents match intent patterns
➡️ Lemmatization ensures:
“buy”, “buying”, “bought” → same action
Demo: Tokenization, Stopwords, Lemmatization
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Tokenization
tokens = word_tokenize(clean_text)
print(tokens)
#Remove Stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)
#Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized)
Token → word
Stopwords → noise
Lemma → base meaning
7️⃣ POS Tagging
Labeling words as:
Noun
Verb
Adjective
Why it matters
Understanding what the user wants vs what they describe
Example:
"Buy health insurance"
Buy → Verb (action)
insurance → Noun (object)
Connection to Agents
➡️ Helps agents identify:
Action (what to do)
Entity (what to act on)
from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)
| Tag | Meaning | Example from output | Why it matters |
| --- | --- | --- | --- |
| NN | Noun (thing/object) | phone, batterylife, quality | Tells agent what is being talked about |
| VBD | Verb (past tense) | bought | Indicates an action already done |
| VBZ | Verb (present, singular) | is | Describes current state |
| JJ | Adjective | great, poor | Indicates opinion or sentiment |
| DT | Determiner | this | Points to a specific object |
| IN | Preposition | for, at | Shows relationships |
| CD | Cardinal number | 25000 | Used for amounts, pricing |
| CC | Conjunction | but | Shows contrast |
| PRP | Pronoun | me | Refers to a person |
8️⃣ Named Entity Recognition (NER)
Identifying real-world entities:
Names
Dates
Money
Locations
Products
Example
"I bought this phone for ₹25,000"
→ MONEY = 25000
Why it matters
Critical for automation
Enables personalization
Connection to Agents
➡️ Agents use entities as parameters
➡️ Example:
“Create insurance for ₹5L coverage”
Easiest NER: spaCy (Visual & Simple)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
9️⃣ Vectorization – Converting Text to Numbers
Transforming text into numerical form so machines can compare meaning.
Why it matters
Computers cannot compare words
Numbers allow similarity measurement
Example
“battery life is good”
“battery lasts long”
➡️ High similarity score
Connection to Agents
➡️ Used in:
Semantic search
Memory retrieval
RAG pipelines
🔟 Text Similarity – Understanding Meaning, Not Keywords
Measuring how close two sentences are in meaning.
Why it matters
Users phrase the same intent differently.
Connection to Agents
➡️ Enables:
Intent matching
Past conversation recall
Tool selection
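A simple sketch of similarity scoring combines the TF-IDF vectors above with cosine similarity from scikit-learn (note this measures word overlap, not true meaning; semantic matching in production usually relies on embeddings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "battery life is good",        # reference sentence
    "battery lasts long",          # similar topic
    "the camera quality is poor",  # different topic
]

vectors = TfidfVectorizer().fit_transform(sentences)

# Compare the first sentence against the other two
scores = cosine_similarity(vectors[0], vectors[1:])
print(scores)  # the battery sentence scores higher than the camera one
```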
1️⃣1️⃣ Sentiment Analysis – Emotional Context
Detects:
Positive
Negative
Neutral tone
Why it matters
Same intent, different response needed.
Example:
“This plan is terrible” → support
“This plan is okay” → explanation
Connection to Agents
➡️ Agents adjust:
Tone
Escalation
Decision paths
TextBlob Demo
from textblob import TextBlob
review = "The battery life is amazing but the camera is bad"
blob = TextBlob(review)
print(blob.sentiment)
What blob.sentiment Means
Sentiment(polarity=-0.05, subjectivity=0.78)
TextBlob returns two values:
1️⃣ Polarity (Emotional Direction)
Range:
-1.0 (Negative) → 0.0 (Neutral) → +1.0 (Positive)
Your value
polarity = -0.05
Meaning
Slightly negative / near neutral
Mixed emotions cancel each other out
Why?
Sentence contains both:
Positive: “battery life is amazing”
Negative: “camera is bad”
➡️ Result is almost neutral but slightly negative.
📌 Key point
“When a sentence has mixed opinions, polarity moves closer to zero.”
2️⃣ Subjectivity (Opinion vs Fact)
Range:
0.0 (Fact) → 1.0 (Opinion)
Your value
subjectivity = 0.78
Meaning
Highly opinion-based
Contains personal judgement
Why?
Words like:
amazing
bad
➡️ These are subjective adjectives, not facts.
🧠 Simple Interpretation
“The user has a mixed opinion,
mostly expressing personal feelings,
with a slight negative tilt overall.”
Why This Matters for AI Agents (Module 3 Link)
Agents don’t just respond — they decide actions.
Example logic
Polarity < 0 → route to support
Subjectivity high → empathetic response
Mixed sentiment → clarification question
Example Agent Behavior
“I see you like the battery but are unhappy with the camera.
Would you like help comparing alternatives?”
How Module 2 Feeds into Module 3
| NLP Concept | Used in AI Agents For |
| --- | --- |
| Cleaning | Reliable input |
| Regex | Fast rule detection |
| Tokenization | Command extraction |
| Lemmatization | Intent normalization |
| POS | Action-object mapping |
| NER | Parameter extraction |
| Similarity | Intent matching |
| Sentiment | Decision routing |
Assignment
review = "The laptop performance is excellent but the price is too high"
Your task is to:
Clean the text
Tokenize
Remove stopwords
Find sentiment
Starter Code (Easiest Execution)
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
clean = re.sub(r"[^a-zA-Z\s]", "", review.lower())
tokens = word_tokenize(clean)
filtered = [w for w in tokens if w not in stopwords.words("english")]
print("Tokens:", filtered)
print("Sentiment:", TextBlob(review).sentiment)




