
NLP Foundations


I’m Siddhesh, a Microsoft Certified Trainer, cloud architect, and AI practitioner focused on helping developers and organizations adopt AI effectively. As a Pluralsight instructor and speaker, I design and deliver hands-on AI enablement programs covering Generative AI, Agentic AI, Azure AI, and modern cloud architectures.

With a strong foundation in Microsoft .NET and Azure, my work today centers on building real-world AI solutions, agentic workflows, and developer productivity using AI-assisted tools. I share practical insights through workshops, conference talks, online courses, blogs, newsletters, and YouTube—bridging the gap between AI concepts and production-ready implementations.

Why This Module Exists (Big Picture)

Before an AI Agent can act, it must:

  1. Understand what the user said

  2. Extract useful signals

  3. Decide what to do next

NLP is the bridge between raw text and agent decisions.


1️⃣ NLP Foundations – What & Why

Natural Language Processing converts human language into structured signals that machines can reason over.

Why it matters

  • Users never speak in structured JSON

  • Agents rely on interpretable signals (intent, entities, sentiment)

Connection to Module 3

➡️ NLP prepares the input layer for agents
➡️ Agents use NLP outputs to choose tools, actions, or responses


2️⃣ Text Cleaning – Why Noise Removal is Critical

Removing:

  • Special characters

  • Emojis

  • Extra spaces

  • Inconsistent casing

Why it matters

  • Noise reduces accuracy

  • Inconsistent text leads to wrong interpretations

Example

"Camera!!! is GREAT 😍" → "camera is great"

Connection to Agents

➡️ Clean text = reliable intent detection
➡️ Dirty text = agent confusion or wrong tool usage

Create a folder nlp-demo and a virtual environment inside it:

python -m venv labenv
./labenv/Scripts/Activate.ps1   # Windows

Install the libraries and download the language data:

pip install nltk spacy scikit-learn textblob regex
python -m nltk.downloader punkt punkt_tab stopwords wordnet averaged_perceptron_tagger_eng
python -m spacy download en_core_web_sm

Create a Python file nlp.py, starting with the sample text:

text = """
Hello!!! I bought this phone for ₹25,000.
Battery-life is great :) but camera quality is poor!!!
Contact me at user123@email.com
"""


Demo 1: Basic Text Cleaning

import re

# Keep only letters, digits, and whitespace, then lowercase
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text)

Terminal>> python nlp.py

What This Shows

  • Removes special characters

  • Converts to lowercase


3️⃣ Regular Expressions (Regex) – Pattern Detection

Rule-based pattern matching for:

  • Emails

  • Phone numbers

  • IDs

  • Keywords

Real-world relevance

  • Extract order numbers

  • Detect support tickets

  • Identify PII

Connection to Agents

➡️ Regex acts as a pre-filter
➡️ Agent decides:

“I already know this is an email / ID — no need to ask LLM”

Extract Email using Regex

email = re.findall(r"\S+@\S+", text)
print(email)
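The same pattern-matching approach covers the other cases listed above. A small sketch, assuming a made-up order-ID format (`ORD-` plus digits) and a simplified phone pattern — adjust these to your own data:

```python
import re

message = "My order ORD-48213 hasn't arrived. Call me at 555-0173."

# Hypothetical order-ID format: 'ORD-' followed by digits
order_ids = re.findall(r"\bORD-\d+\b", message)

# Simplified phone pattern: three digits, dash, four digits
phones = re.findall(r"\b\d{3}-\d{4}\b", message)

print(order_ids)  # ['ORD-48213']
print(phones)     # ['555-0173']
```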


4️⃣ Tokenization – Breaking Text into Meaningful Units

Splitting text into words or tokens.

"I need health insurance" →

["I", "need", "health", "insurance"]

Why it matters

  • Machines don’t understand sentences

  • They understand tokens

➡️ Tokens help agents:

  • Detect keywords

  • Map commands

  • Route tasks

Example:

  • “book flight” → travel agent

  • “file claim” → insurance agent
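The routing idea above can be sketched with a plain keyword-to-agent map (the agent names and keywords here are illustrative, not a real framework API):

```python
# Illustrative keyword -> agent routing table
ROUTES = {
    "flight": "travel_agent",
    "hotel": "travel_agent",
    "claim": "insurance_agent",
    "policy": "insurance_agent",
}

def route(command: str) -> str:
    # Naive whitespace tokenization; a real pipeline would use an NLP tokenizer
    for token in command.lower().split():
        if token in ROUTES:
            return ROUTES[token]
    return "default_agent"

print(route("book flight"))  # travel_agent
print(route("file claim"))   # insurance_agent
```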


5️⃣ Stopwords – Removing Low-Value Words

Common words with little meaning:

  • is, the, a, and, but

Why remove them?

  • Reduce noise

  • Improve signal clarity

Example

"I want to buy a policy" →

["want", "buy", "policy"]

Connection to Agents

➡️ Helps agents focus on action words
➡️ Improves intent classification accuracy


6️⃣ Lemmatization – Normalizing Meaning

Converting words to their base form.

  • buying → buy

  • policies → policy

Why it matters

  • Same meaning, different forms

  • Avoids duplication of logic

Connection to Agents

➡️ Agents match intent patterns
➡️ Lemmatization ensures:

“buy”, “buying”, “bought” → same action
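One caveat worth knowing: NLTK's WordNetLemmatizer treats every word as a noun by default, so lemmatize("buying") comes back unchanged — you must pass pos="v" to get "buy". The normalization idea itself can be shown with a toy lookup table (hand-written for this example, not a real lemma dictionary):

```python
# Toy lemma table -- real systems use WordNet or spaCy, not a hand-written dict
LEMMAS = {"buying": "buy", "bought": "buy", "buys": "buy", "policies": "policy"}

def lemma(word: str) -> str:
    return LEMMAS.get(word.lower(), word.lower())

# All three surface forms collapse to the same action word
print({lemma(w) for w in ["buy", "buying", "bought"]})  # {'buy'}
```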

Demo: Tokenization, Stopwords, Lemmatization

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Tokenization
tokens = word_tokenize(clean_text)
print(tokens)

# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized)

  • Token → word

  • Stopwords → noise

  • Lemma → base meaning


7️⃣ POS Tagging

Labeling words as:

  • Noun

  • Verb

  • Adjective

Why it matters

Understanding what the user wants vs what they describe

Example:

"Buy health insurance"

Buy → Verb (action)

insurance → Noun (object)

Connection to Agents

➡️ Helps agents identify:

  • Action (what to do)

  • Entity (what to act on)

from nltk import pos_tag

pos_tags = pos_tag(tokens)
print(pos_tags)

| Tag | Meaning | Example from output | Why it matters |
|-----|---------|---------------------|----------------|
| NN | Noun (thing/object) | phone, batterylife, quality | Tells the agent what is being talked about |
| VBD | Verb (past tense) | bought | Indicates an action already done |
| VBZ | Verb (present, singular) | is | Describes current state |
| JJ | Adjective | great, poor | Indicates opinion or sentiment |
| DT | Determiner | this | Points to a specific object |
| IN | Preposition | for, at | Shows relationships |
| CD | Cardinal number | 25000 | Used for amounts, pricing |
| CC | Conjunction | but | Shows contrast |
| PRP | Pronoun | me | Refers to a person |

8️⃣ Named Entity Recognition (NER)

Identifying real-world entities:

  • Names

  • Dates

  • Money

  • Locations

  • Products

Example

"I bought this phone for ₹25,000"

→ MONEY = 25000

Why it matters

  • Critical for automation

  • Enables personalization

Connection to Agents

➡️ Agents use entities as parameters
➡️ Example:

“Create insurance for ₹5L coverage”

Easiest NER: spaCy (Visual & Simple)

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)


9️⃣ Vectorization – Converting Text to Numbers

Transforming text into numerical form so machines can compare meaning.

Why it matters

  • Computers cannot compare words

  • Numbers allow similarity measurement

Example

  • “battery life is good”

  • “battery lasts long”

➡️ High similarity score

Connection to Agents

➡️ Used in:

  • Semantic search

  • Memory retrieval

  • RAG pipelines
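A minimal bag-of-words sketch of this idea, using only the standard library (production code would use scikit-learn's CountVectorizer or TfidfVectorizer, already in the pip install list above):

```python
from collections import Counter

s1 = "battery life is good"
s2 = "battery lasts long"

# Shared vocabulary across both sentences, in a fixed order
vocab = sorted(set(s1.split()) | set(s2.split()))

def vectorize(sentence: str) -> list[int]:
    # Count how often each vocabulary word appears in the sentence
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

print(vocab)          # ['battery', 'good', 'is', 'lasts', 'life', 'long']
print(vectorize(s1))  # [1, 1, 1, 0, 1, 0]
print(vectorize(s2))  # [1, 0, 0, 1, 0, 1]
```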


🔟 Text Similarity – Understanding Meaning, Not Keywords

Measuring how close two sentences are in meaning.

Why it matters

Users phrase the same intent differently.

Connection to Agents

➡️ Enables:

  • Intent matching

  • Past conversation recall

  • Tool selection
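A common measure is cosine similarity between sentence vectors. A self-contained sketch — note that with raw word counts the two battery sentences score low because they share only one surface word; capturing meaning-level similarity needs embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Bag-of-words vectors over ['battery', 'good', 'is', 'lasts', 'life', 'long']
v1 = [1, 1, 1, 0, 1, 0]  # "battery life is good"
v2 = [1, 0, 0, 1, 0, 1]  # "battery lasts long"

print(round(cosine_similarity(v1, v2), 2))  # 0.29 -- only 'battery' overlaps
```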


1️⃣1️⃣ Sentiment Analysis – Emotional Context

Detects:

  • Positive

  • Negative

  • Neutral tone

Why it matters

Same intent, different response needed.

Example:

  • “This plan is terrible” → support

  • “This plan is okay” → explanation

Connection to Agents

➡️ Agents adjust:

  • Tone

  • Escalation

  • Decision paths

TextBlob Demo

from textblob import TextBlob

review = "The battery life is amazing but the camera is bad"
blob = TextBlob(review)
print(blob.sentiment)

What blob.sentiment Means

Sentiment(polarity=-0.05, subjectivity=0.78)

TextBlob returns two values:

1️⃣ Polarity (Emotional Direction)

Range: -1.0 (negative) → 0.0 (neutral) → +1.0 (positive)

Your value

polarity = -0.05

Meaning

  • Slightly negative / near neutral

  • Mixed emotions cancel each other out

Why?

Sentence contains both:

  • Positive: “battery life is amazing”

  • Negative: “camera is bad”

➡️ Result is almost neutral but slightly negative.

📌 Key point

“When a sentence has mixed opinions, polarity moves closer to zero.”
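TextBlob's default analyzer roughly averages per-word scores from a sentiment lexicon, which is why opposing opinions pull the result toward zero. A toy illustration (the lexicon values are made up to mirror this example — they are not TextBlob's real weights):

```python
# Made-up mini-lexicon; TextBlob ships a much larger one
LEXICON = {"amazing": 0.6, "bad": -0.7}

def toy_polarity(sentence: str) -> float:
    # Average the scores of any lexicon words found in the sentence
    scores = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# One strong positive and one slightly stronger negative nearly cancel out
print(round(toy_polarity("the battery life is amazing but the camera is bad"), 2))  # -0.05
```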

2️⃣ Subjectivity (Opinion vs Fact)

Range: 0.0 (fact) → 1.0 (opinion)

Your value

subjectivity = 0.78

Meaning

  • Highly opinion-based

  • Contains personal judgement

Why?

Words like:

  • amazing

  • bad

➡️ These are subjective adjectives, not facts.


🧠 Simple Interpretation

“The user has a mixed opinion,
mostly expressing personal feelings,
with a slight negative tilt overall.”


Why This Matters for AI Agents (Module 3 Link)

Agents don’t just respond — they decide actions.

Example logic

  • Polarity < 0 → route to support

  • Subjectivity high → empathetic response

  • Mixed sentiment → clarification question
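That routing logic can be sketched as a small decision function (the thresholds and action names are illustrative assumptions, not a standard):

```python
def decide_action(polarity: float, subjectivity: float) -> str:
    # Illustrative thresholds -- tune them for your own domain
    if -0.2 < polarity < 0.2 and subjectivity > 0.5:
        return "ask_clarifying_question"  # mixed, opinion-heavy feedback
    if polarity < 0:
        return "route_to_support"
    if subjectivity > 0.5:
        return "empathetic_response"
    return "standard_response"

# Values from the TextBlob demo above: polarity=-0.05, subjectivity=0.78
print(decide_action(-0.05, 0.78))  # ask_clarifying_question
```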

Example Agent Behavior

“I see you like the battery but are unhappy with the camera.
Would you like help comparing alternatives?”


How Module 2 Feeds into Module 3

| NLP Concept | Used in AI Agents For |
|-------------|-----------------------|
| Cleaning | Reliable input |
| Regex | Fast rule detection |
| Tokenization | Command extraction |
| Lemmatization | Intent normalization |
| POS | Action-object mapping |
| NER | Parameter extraction |
| Similarity | Intent matching |
| Sentiment | Decision routing |

Assignment

review = "The laptop performance is excellent but the price is too high"

Your task is to:

  1. Clean the text

  2. Tokenize

  3. Remove stopwords

  4. Find sentiment


Starter Code (Easiest Execution)

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

review = "The laptop performance is excellent but the price is too high"

clean = re.sub(r"[^a-zA-Z\s]", "", review.lower())
tokens = word_tokenize(clean)
filtered = [w for w in tokens if w not in stopwords.words("english")]

print("Tokens:", filtered)
print("Sentiment:", TextBlob(review).sentiment)