Chapter 1: Introduction to Large Language Models
The Era of Pre-trained Models
Learning Objectives
- Understand the fundamentals of large language models
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction to Large Language Models
What are Large Language Models?
Large Language Models (LLMs) are neural networks trained on massive amounts of text data to understand and generate human-like text. They represent a paradigm shift in NLP: instead of training models from scratch for each task, we pre-train on vast corpora and then fine-tune for specific applications.
Think of LLMs like a student who has read everything:
- Pre-training: Like reading millions of books, articles, and websites - learning language patterns, facts, reasoning
- Fine-tuning: Like taking a specialized course - adapting general knowledge to a specific task
- Result: A model that understands language deeply and can be adapted to many tasks
The Pre-training Revolution
Before LLMs (Pre-2018):
- Each task required a separate model
- Training from scratch for every application
- Limited by available labeled data
- Like learning to drive separately for each car model
With LLMs (2018+):
- One pre-trained model for many tasks
- Fine-tune or prompt for specific needs
- Leverage vast unlabeled text data
- Like learning to drive once, then adapting to different vehicles
📚 Evolution Timeline
- 2013 - Word2Vec: Word embeddings (300 dimensions per word)
- 2018 - BERT: Bidirectional encoder, 110M-340M parameters
- 2019 - GPT-2: Decoder-only, 1.5B parameters
- 2020 - GPT-3: 175B parameters, few-shot learning
- 2022 - ChatGPT: Instruction-tuned GPT-3.5
- 2023 - GPT-4: Multimodal, improved reasoning
Key Concepts
Scale and Emergence
The scaling hypothesis: as models grow larger (more parameters), are trained on more data, and use more compute, their loss improves along predictable power-law curves (scaling laws), and at sufficient scale they exhibit emergent capabilities not present in smaller models; a sketch of the power-law form follows the list below.
Emergent abilities include:
- Few-shot learning (performing tasks with just a few examples)
- Chain-of-thought reasoning (step-by-step problem solving)
- Instruction following (understanding and following complex instructions)
- Code generation and understanding
- Mathematical problem solving
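As a sketch of the scaling-law idea (following the power-law form reported in the GPT-3-era scaling studies; the constants \(N_c\) and \(\alpha_N\) are empirically fitted, so this is an approximation rather than an exact law), the pre-training loss as a function of parameter count \(N\) behaves roughly as:
\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
\]
with analogous power laws for dataset size and training compute. Emergent abilities are different in character: they tend to appear abruptly once models cross a scale threshold rather than improving smoothly along these curves.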
Pre-training vs Fine-tuning
Pre-training: Learning general language patterns from massive unlabeled text. This is expensive (weeks/months, many GPUs) but done once.
Fine-tuning: Adapting the pre-trained model to specific tasks using labeled data. This is much faster and cheaper than pre-training and can be repeated for many tasks.
Prompt engineering: Using carefully crafted prompts to guide model behavior without any training. This is the fastest approach, but generally less powerful than fine-tuning.
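A minimal sketch contrasting the two adaptation routes, assuming GPT-2 loaded through the HuggingFace transformers library (the prompt, the toy training example, and the learning rate are illustrative choices, not a training recipe):
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# 1) Prompting: no parameter updates; behavior is steered by the input text alone.
prompt = tokenizer("Translate English to French: cheese ->", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        prompt.input_ids,
        attention_mask=prompt.attention_mask,
        max_new_tokens=5,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))

# 2) Fine-tuning: update the parameters on labeled text using the same LM loss.
batch = tokenizer('Review: "Great film." Sentiment: Positive', return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch, labels=batch.input_ids).loss  # token-level cross-entropy
loss.backward()
optimizer.step()
print(f"fine-tuning loss after one step: {loss.item():.3f}")
In practice, fine-tuning iterates over many labeled batches; this single gradient step only shows where the parameter update happens, while prompting leaves the weights untouched.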
Architecture Types
Encoder-only (BERT): Bidirectional understanding, best for classification, QA, NER
Decoder-only (GPT): Autoregressive generation, best for text generation, completion
Encoder-decoder (T5): Both understanding and generation, best for translation, summarization
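A short sketch, assuming the HuggingFace transformers library and three commonly used public checkpoints (bert-base-uncased, gpt2, t5-small) chosen purely for illustration, showing how each architecture family is loaded through its task-appropriate Auto class:
from transformers import (
    AutoModelForMaskedLM,    # encoder-only, e.g. BERT
    AutoModelForCausalLM,    # decoder-only, e.g. GPT-2
    AutoModelForSeq2SeqLM,   # encoder-decoder, e.g. T5
)

encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Compare parameter counts across the three families
for name, m in [("BERT", encoder_only), ("GPT-2", decoder_only), ("T5", encoder_decoder)]:
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")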
Mathematical Formulations
Language Modeling Objective
\[
P(x_1, \ldots, x_T) = \prod_{i=1}^{T} P(x_i \mid x_1, \ldots, x_{i-1})
\]
Where:
- \(x_i\): Token at position \(i\)
- \(P(x_i \mid x_1, \ldots, x_{i-1})\): Probability of token \(x_i\) given the previous tokens
- The model learns to predict the next token given its context
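As a concrete illustration (treating words as tokens for simplicity, and using the prompt from the generation example later in this chapter), the objective factorizes the probability of a sequence into a product of next-token probabilities:
\[
P(\text{The capital of France is Paris}) = P(\text{The}) \cdot P(\text{capital} \mid \text{The}) \cdot P(\text{of} \mid \text{The capital}) \cdots P(\text{Paris} \mid \text{The capital of France is})
\]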
Training Loss (Cross-Entropy)
\[
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_{<i}; \theta)
\]
Where:
- \(N\): Number of target tokens (training examples)
- \(y_i\): Target token at position \(i\)
- \(x_{<i}\): Context (all tokens before position \(i\))
- \(\theta\): Model parameters
Perplexity
Perplexity measures how well the model predicts a sequence; lower perplexity means better prediction. It is the exponentiated average negative log-likelihood:
\[
\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)
\]
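A minimal sketch (GPT-2 and the example sentence are assumptions chosen for illustration) of computing the average cross-entropy and perplexity of a sentence under a pre-trained language model:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

enc = tokenizer("The capital of France is Paris.", return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the average next-token
    # cross-entropy: -1/N * sum_i log P(x_i | x_<i)
    loss = model(**enc, labels=enc.input_ids).loss

perplexity = torch.exp(loss)
print(f"cross-entropy: {loss.item():.3f}, perplexity: {perplexity.item():.1f}")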
Detailed Examples
Example: How LLMs Generate Text
Input prompt: "The capital of France is"
Step 1: Tokenization
- Input → ["The", "capital", "of", "France", "is"]
- Each token converted to embedding vector
Step 2: Forward Pass
- Model processes sequence through transformer layers
- Creates contextualized representation for "is"
Step 3: Prediction
- Model outputs probability distribution over vocabulary
- P("Paris") = 0.85, P("London") = 0.05, P("Berlin") = 0.03, ...
Step 4: Decoding
- Select "Paris" (greedy decoding takes the highest-probability token; sampling draws from the distribution instead)
- Output: "The capital of France is Paris"
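A minimal sketch (the probabilities quoted above are illustrative; actual GPT-2 values will differ) of inspecting the model's next-token distribution for this prompt:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (batch, seq_len, vocab)

# Probability distribution over the vocabulary for the next token
probs = torch.softmax(logits[0, -1], dim=-1)

# Show the five most likely continuations
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p.item():.3f}")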
Example: Few-Shot Learning
Task: Classify sentiment without training
Prompt with examples:
Review: "This movie was amazing!" Sentiment: Positive Review: "Terrible acting and plot." Sentiment: Negative Review: "It was okay, nothing special." Sentiment: Neutral
The model picks up the pattern from the examples and can classify new reviews without any fine-tuning!
Implementation
Using Pre-trained LLM with HuggingFace
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pre-trained model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# GPT-2 has no pad token; reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Input text
text = "The capital of France is"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Generate a continuation (sampling, so output varies from run to run)
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=20,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
# Example output: "The capital of France is Paris, ..." (varies with sampling)
Few-Shot Prompting Example
def few_shot_classification(review, model, tokenizer):
    """Classify sentiment using few-shot prompting."""
    prompt = f"""Review: "This movie was amazing!"
Sentiment: Positive
Review: "Terrible acting and plot."
Sentiment: Negative
Review: "{review}"
Sentiment:"""

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=inputs.input_ids.shape[1] + 10,
            temperature=0.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the first word generated after the final "Sentiment:"
    completion = result.split("Sentiment:")[-1].strip()
    return completion.split()[0] if completion else completion

# Example usage
review = "It was okay, nothing special."
sentiment = few_shot_classification(review, model, tokenizer)
print(f"Sentiment: {sentiment}")  # Expected: "Neutral" (base GPT-2 may not always get this right)
Real-World Applications
Major LLM Applications
Text Generation:
- Chatbots and conversational AI (ChatGPT, Claude)
- Content creation (articles, stories, marketing copy)
- Code generation (GitHub Copilot, Codex)
- Creative writing assistance
Understanding Tasks:
- Question answering systems
- Text classification and sentiment analysis
- Named entity recognition
- Document summarization
Specialized Applications:
- Translation services
- Educational tutoring systems
- Customer service automation
- Research assistance and information retrieval
Impact on Industry
LLMs are transforming:
- Software Development: Code completion, debugging, documentation
- Content Creation: Writing, editing, translation
- Education: Personalized tutoring, content generation
- Healthcare: Medical documentation, research assistance
- Business: Customer service, data analysis, report generation