Whatssssss up!
Welcome to the guide book of Deep Learning.
Deep learning is summarized simply and accessibly in this book. All the content is from third parties, since this book isn't the documentation of any invention; rather, it's a documented form of already-invented AI concepts.
Enjoy! :)
– Pahari
Bias and Variance
Bias and variance are two basic concepts of ML. If you don't know about them then, as we say in Nepali, I couldn't care less.
Model Types
1. Overfit Model
When your model (say a machine learning method/algorithm, or say a curve) performs very well on the training dataset (imagine the curve covering all the training points), then there is a high chance the model is overfitting.
Note
Predicted values: points on the green curve
Actual values: data points in red
Let’s understand above figure.
There, (high) variance is introduced. Variance can simply be understood as the difference in the fit of the model across different test datasets (how close the predicted value is to the expected value between fits on 2 different test datasets), or, outside the ML context, how scattered the predicted values are.
If you have a model (ML model/equation/ML algorithm: all similar) and you try to fit it on 2 different datasets, and there is a high difference between their test errors, then it is called high variance. Here, the test error varies greatly based on the selection of the training dataset.
2. Underfit Model
When your model (say a machine learning method/algorithm, or say a curve) performs only a little well (imagine the curve not covering most of the training points), then there is a chance the model is underfitting.
Let’s understand above figure.
There, (high) bias is introduced. Bias can simply be understood as how far the predicted values (curve) are from the expected values (red dots).
If you have a model (ML model/equation/ML algorithm: all similar) and you try to fit it on 2 different datasets, and the training error is high in both, then it is called high bias. Here, the training error is still high even though a different training dataset is taken.
3. Balanced Model
When your model (say a machine learning method/algorithm, or say a curve) performs well (imagine the curve covering a good number of training points with a small gap between the curve and the points), then there is a good chance the model is balanced. When that happens, there is a good possibility of low error (good prediction) on a new test dataset.
Low variance and low bias together result in a good model.
Remembering Tips 💡
a. Bias
How much the model fails to capture the true pattern in the training dataset, resulting in an underfit model (consistently wrong predictions on new datasets).
b. Variance
It is the amount by which the prediction would change if we fit the model to a different training data set (Bad prediction in new dataset).
Overfitting is shown above. Variance is sensitivity to the training data.
Note
Blue Dots: Training points
Green Dots: Testing points
References
- Codebasics. Machine Learning Tutorial Python – 20: Bias vs Variance In Machine Learning. YouTube.
- Josh Starmer (StatQuest). Machine Learning Fundamentals: Bias and Variance. YouTube.
Regression & Curve Fitting
Curve fitting refers to drawing a curve through a given dataset to capture the true pattern of the dataset. It is done using regression.
Types of Regression
Regression is about learning a general pattern and predicting unseen data.
Note
Interpolation is not the same as regression. Interpolation is about drawing a curve passing exactly through all data points, whereas regression is about finding a best-fit trend in a given dataset.
1. Linear Regression, a.k.a. Least Squares
It is based on fitting a line. For a set of data points on a graph, a line is drawn (the estimation line for predicting the output of an unknown input), and the distances from the points to that line are squared and summed; this sum is known as the sum of squares.
The minimum sum of squares is what we want to achieve, i.e. the best-fit model.
The final line (best-fit model) minimizes the sum of squares (hence "least squares") between itself and the real data.
In this situation, we optimize the values of the slope (weight) and the intercept (bias) for the best fit.
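The closed-form least squares solution can be sketched in plain Python (the dataset here is made up for illustration):

```python
# Least squares: find the slope (weight) and intercept (bias) that
# minimize the sum of squared distances between the line and the data.
xs = [1.0, 2.0, 3.0, 4.0]          # made-up inputs
ys = [3.1, 4.9, 7.2, 8.8]          # made-up outputs, roughly y = 2x + 1

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Closed-form solution: slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean  # the best-fit line passes through the means

sum_of_squares = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
print(slope, intercept, sum_of_squares)
```

Any other slope/intercept pair gives a larger `sum_of_squares`, which is exactly what "least squares" means.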
Chain Rule and Gradient Descent
Chain Rule
It is about: “If something depends on something else, we calculate the change step by step.”
Gradient Descent
It is about taking small steps downhill to reach the lowest error.
Go on to the next sections for a deeper understanding of both concepts.
Chain Rule
The chain rule can be compared to a chain, where, say, small objects are linked together.
Mathematically, one property relating to a second property, and that second property relating to a third property, implies a new relation between the first and the third.
The images below make the concept of the chain rule clear:
Things to know for the concept of the chain rule in maths are:
a. Derivative
b. Slope
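As a toy sanity check (my own example, not from the figures): if y depends on u and u depends on x, the chain rule says dy/dx = (dy/du) · (du/dx), which we can verify against a finite-difference estimate:

```python
# Chain rule: y = u^2 where u = 2x + 1, so
# dy/dx = (dy/du) * (du/dx) = 2u * 2 = 4 * (2x + 1)
def u(x):
    return 2 * x + 1

def y(x):
    return u(x) ** 2

x = 1.5
dy_du = 2 * u(x)       # derivative of u^2 with respect to u
du_dx = 2              # derivative of 2x + 1 with respect to x
chain = dy_du * du_dx  # the step-by-step changes, chained together

# Finite-difference check: (y(x + h) - y(x - h)) / (2h)
h = 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
print(chain, numeric)  # both ≈ 16
```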
Finding the best-fit model
Step 1:
For a particular model (linear regression in this case, with weight on the x-axis and height on the y-axis), compute each residual, and plot the residuals against candidate intercepts.
Step 2:
Plot the residuals and the squares of the residuals.
Step 3:
Find the relation. The chain of relations we get is:
Weight -> Height -> Intercept (y-axis intercept of the height) -> Residuals -> Squared residuals
Step 4:
The derivative of the squared residuals with respect to the intercept (height) being zero means the squared residuals are minimized (meaning less error).
References
- The Chain Rule, Clearly Explained!!! | By StatQuest with Josh Starmer | https://www.youtube.com/watch?v=wl1myxrtQHQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=55
Gradient Descent
Tip
Get familiar with the concept of Regression & Curve Fitting
It is about optimizing the model, e.g.: Linear Regression (optimize the intercept and slope), Logistic Regression (optimize a squiggle), t-SNE (optimize the clusters), etc.
Gradient means the derivatives of the loss function with respect to the parameters, whereas descent refers to descending along the derivative (slope) until it is near zero.
Loss Functions
Loss functions are the type of functions used to measure how wrong the model's prediction is (they measure the difference between the actual value and the predicted value of a model).
Some types of loss functions are: Sum of Squared Residuals, Mean Absolute Error, Huber Loss, Mean Squared Log Error, etc.
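Two of those loss functions, sketched in plain Python (the toy values are made up for illustration):

```python
# Loss functions measure how wrong the model's predictions are.
def sum_of_squared_residuals(actual, predicted):
    # SSR: big mistakes are punished quadratically
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mean_absolute_error(actual, predicted):
    # MAE: every unit of error counts the same
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [3.0, 5.0, 7.0]
predicted = [2.5, 5.5, 8.0]
print(sum_of_squared_residuals(actual, predicted))  # 0.25 + 0.25 + 1.0 = 1.5
print(mean_absolute_error(actual, predicted))       # (0.5 + 0.5 + 1.0) / 3
```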
Working of Gradient Descent
Gradient Descent finds the optimal value (e.g. the intercept at which the derivative of the sum-of-squared-residuals curve, plotted with the intercept on the x-axis, is near zero) by taking big steps when it is far away and baby steps when it is near the optimal value.
Note
Using the least squares method, we just solve for where the slope of the curve (sum of squared residuals vs. intercept) is zero. In contrast, Gradient Descent finds the minimum value by taking steps from an initial guess until it reaches the best value. Gradient Descent is very useful when it is not possible to solve for where the derivative = 0.
Learning Rate
This is the rate that determines the step size for a parameter (which influences how the optimal solution is found), such as the intercept.
Gradient Descent stops when the step size is very close to 0, and that happens when the slope is near 0.
Formula:
Step Size = Slope * Learning Rate
Estimating Intercept and Slope using Gradient Descent
First, differentiate the loss function (SSR) with respect to the intercept and the slope. Then, using the learning rate, update both parameters, intercept and slope. Once the step size starts getting close to 0, we are getting near the optimal solution for both parameters.
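A minimal sketch of that loop in plain Python, on a made-up dataset where the true line is y = 2x + 1 (the learning rate and stopping threshold are arbitrary choices):

```python
# Gradient Descent for linear regression: prediction = intercept + slope * x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]     # exactly y = 2x + 1

intercept, slope = 0.0, 0.0   # initial guesses
learning_rate = 0.01

for _ in range(100_000):
    # Derivatives of SSR with respect to each parameter (via the chain rule)
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
    d_slope     = sum(-2 * x * (y - (intercept + slope * x)) for x, y in zip(xs, ys))

    # Step Size = Slope (derivative) * Learning Rate
    step_i = d_intercept * learning_rate
    step_s = d_slope * learning_rate
    intercept -= step_i
    slope     -= step_s

    # Stop when the step sizes are very close to 0
    if abs(step_i) < 1e-9 and abs(step_s) < 1e-9:
        break

print(intercept, slope)  # ≈ 1.0 and 2.0
```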
Bird’s Eye View
Regularization
It is a technique in ML where a penalty is added to keep the model simple and avoid overfitting.
Types of Regularization
Ridge (L2) Regression
The main idea behind ridge regression is to find a new line that doesn't fit the training data perfectly (done by introducing bias).
In return, the variance is reduced and overfitting is avoided.
It basically minimizes not only the sum of the squared residuals (like least squares) but also (Lambda * Slope^2).
When the sample size (training size) is relatively small, Ridge Regression (L2 regularization) can improve predictions made from new data (i.e. reduce variance) by making predictions less sensitive to the training data through the penalty it introduces.
This is done by adding the ridge regression penalty (Lambda * Slope^2) to the thing that must be minimized, i.e. the sum of squared residuals (least squares) + (Lambda * Slope^2).
Lambda is determined using cross-validation (testing with the test data). A greater lambda drives the slope asymptotically toward zero.
The main effect is that it makes the predictions less sensitive to a tiny training dataset.
Lasso (L1) Regression
The main idea behind lasso regression is similar to ridge regression, but here the penalty uses the absolute value of the slope, not its square.
As with ridge regression, the variance is reduced and overfitting is avoided, and the main effect is that it makes the predictions less sensitive to a tiny training dataset.
Lasso Regression can exclude useless variables from the equation; thanks to the absolute value, it is a little better than ridge regression at reducing the variance in models that contain a lot of useless variables.
This is done by adding the lasso regression penalty (Lambda * |Slope|) to the thing that must be minimized, i.e. the sum of squared residuals (least squares) + (Lambda * |Slope|).
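To see the shrinking effect concretely, here is a toy sketch: for a one-variable model with no intercept, the ridge solution has a closed form, slope = Σxy / (Σx² + Lambda), so a bigger lambda visibly pulls the slope toward zero (the data and lambda values are made up):

```python
# Ridge (L2): minimize SSR + lambda * slope^2.
# For y ≈ slope * x (no intercept), setting the derivative to zero gives
# slope = sum(x*y) / (sum(x^2) + lambda) -- the penalty shrinks the slope.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]   # roughly y = 2x

def ridge_slope(lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

for lam in [0.0, 1.0, 10.0]:
    print(lam, ridge_slope(lam))  # the slope shrinks toward 0 as lambda grows
```

Lasso has no such closed form (the absolute value isn't differentiable at 0), which is exactly why it can push small slopes all the way to zero and drop useless variables.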
Word Embedding & Word2Vec
Long Short-Term Memory (LSTM)
Tip
Before diving into Long Short-Term Memory (LSTM), get familiar with the concept of Recurrent Neural Network (RNN)
Sequence Transduction Models
Basics of Neural Network
Why is it called a Neural Network? Because its two fundamental components, nodes and connections, are like the brain's neurons and synapses respectively.
Components of Neural Network
Fundamental components of NN are:
Nodes
They can be input nodes, output nodes, and hidden nodes.
Hidden nodes have an activation function. The activation function is the curve from which the y-axis value, for the x-axis value calculated from the layers (including bias and weight), is picked and plugged into the dataset graph to fit the dataset.
Layers
Layers are like a spider web, i.e. connections between nodes. They consist of biases and weights.
Bias is the addition (+) part. In the ML context, bias is how much the model fails to capture the true pattern in the training dataset, resulting in an underfit model (consistently wrong predictions on new datasets).
For deeper understanding, check the blog mentioned just below.
Weight is the multiplication part. It adds importance to a particular factor.
Bias is the addition part. It is like a threshold adjuster; think of it as a shifter/base score/adjustment.
Weights and biases help fit a model (curve) to the dataset (capture its pattern) so it can predict the result for a new case.
Tip
Get familar with the concept of Bias and Variance
Math Behind Neural Network
Here, the y-axis shows how effective the dosage is. The x-axis is the level of dosage (low, medium, high).
Stepwise Maths
It's the stepwise mathematics behind predicting whether a 0.5 dosage (medium) is effective or not.
Step 1
By putting in the input value 0.5 (dosage) and doing all the calculation (the weight is *-34.4 and the bias is +2.14), the result corresponds to an x-axis coordinate of the activation function.
Step 2
For the x-axis value (-15.06) of the activation function obtained from step 1, the y-axis value of the activation function is used to plot a point in the actual dataset to form a curve.
The y-axis value of the activation function is calculated using the activation function's equation. The activation function used here is softplus [ f(x) = ln(1 + e^x), which here gives ln(1 + e^-15.06) ].
Step 3
Do the same for the yellow layer up to that hidden node.
Step 4
Now the y-axis values from the two hidden nodes (blue and orange) are each scaled, summed, and a final bias is added: [ (some small number * -1.30) + (0.71 * 2.28) + (-0.58) ].
This results in 1.03, which is close to 1. So this means a 0.5 dosage is effective.
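The whole walkthrough can be sketched in a few lines of Python. The blue node's weight (-34.4) and bias (+2.14), the output weights (-1.30 and 2.28), the orange node's output (0.71), and the final bias (-0.58) are taken from the steps above; the orange node's own internal weights aren't shown there, so its output is simply hard-coded:

```python
import math

def softplus(x):
    # Activation function: f(x) = ln(1 + e^x)
    return math.log(1 + math.exp(x))

def predict(dosage):
    # Step 1: blue node -> x-axis coordinate on the activation function
    x_blue = dosage * -34.4 + 2.14   # 0.5 -> -15.06
    y_blue = softplus(x_blue)        # "some small number"

    # Step 3: the orange node's internal weights aren't given,
    # so we plug in the 0.71 output reported for a 0.5 dosage.
    y_orange = 0.71

    # Step 4: scale both hidden outputs, sum, and add the final bias
    return y_blue * -1.30 + y_orange * 2.28 + (-0.58)

print(predict(0.5))  # ≈ 1.03, i.e. a 0.5 dosage looks effective
```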
Note
Bias & the activation function are inside the neuron (nodes), and weights are on the connections (lines).
Still, the connection is part of the neuron.
Tip
Now, get familiar with the concept of Regression & Curve Fitting
Hand Written Notes
Pahari’s Notes :)
References:
- The Essential Main Ideas of Neural Networks | By StatQuest with Josh Starmer | https://www.youtube.com/watch?v=CqOfi41LfDw&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=74
- https://www.youtube.com/watch?v=i1G7PXZMnSc | The Perceptron Explained | Alice Heiman
- The Main Ideas of Fitting a Line to Data (The Main Ideas of Least Squares and Linear Regression.) | https://www.youtube.com/watch?v=PaFPbb66DxQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=9 | By StatQuest with Josh Starmer
Backpropagation in Neural Network
Tip
Before diving into Backpropagation, get familiar with the concepts of the Chain Rule and Gradient Descent
The core concept behind backpropagation is using gradient descent to find the optimal value of the bias (b3) using the learning rate, the chain rule, derivatives, and a loss function.
You will understand it better if you have gone through Chain Rule & Gradient Descent properly.
The screenshot below illustrates the basic concept behind backpropagation.
Basic understanding of Backpropagation
Assume the optimal value of b3 in that figure is unknown. What we do next is assume 0 as the initial value of b3 and find the SSR (Sum of Squared Residuals) for the curve obtained with that particular bias value.
Then plot that SSR (y-axis) against the bias (x-axis) in a graph, and take the derivative of the SSR with respect to the bias to find the optimal value of the bias.
Using Gradient Descent, we calculate the optimal value for the bias, which is obtained when the step size (here, the one for calculating the new bias value) is near 0.
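A toy version of that procedure in plain Python: the rest of the network is frozen into a fixed stand-in function, only the final bias b3 is learned, and the data and the stand-in are made up for illustration:

```python
# Backpropagation for the last bias only: prediction = frozen_part(x) + b3.
# By the chain rule, dSSR/db3 = sum(-2 * (observed - predicted)),
# since d(predicted)/d(b3) = 1.
data = [(0.0, 0.4), (0.5, 1.3), (1.0, 0.6)]   # (dosage, observed effectiveness)

def frozen_part(x):
    # Stand-in for the already-trained hidden layers (made up)
    return 1.2 * x * (1 - x)                  # a little bump shape

b3 = 0.0                                      # initial guess
learning_rate = 0.1

for _ in range(1000):
    gradient = sum(-2 * (obs - (frozen_part(x) + b3)) for x, obs in data)
    step = gradient * learning_rate           # Step Size = Slope * Learning Rate
    b3 -= step
    if abs(step) < 1e-9:                      # step size near 0 -> stop
        break

print(b3)
```

Here the optimum is simply the mean residual of the frozen part, which is what the loop converges to.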
Recurrent Neural Network (RNN)
Tip
Before diving into Recurrent Neural Network (RNN), get familiar with the concept of Neural Network Basics
Basics of Quantum Computing
In Quantum Computing, some of the basic terms that need to be understood are:
- Qubits
- Superposition
- Entanglement
- Decoherence
- Interference
- Measurement
- Gates
Qubits
They are similar to the bits in classical computing, but live in quantum computing. A qubit is a superposition of 1 and 0, but not literally both 1 and 0 at once (it is represented as a linear combination of the 1 and 0 possibilities).
It has an amplitude and a phase. But when measured, its value collapses to 1 or 0.
Qubits have a possibility of being 1 or 0, based on the probability computation using |α|^2 and |β|^2.
Superposition
It is a linear combination of the basis states (1 and 0), the output being 1 or 0 after measuring the qubit.
Here, the LHS is called "ket psi", which tells the state of the qubit. This is a wave-like state, not a classical value. And the RHS is the linear combination of amplitudes and basis states that becomes the 0 or 1 bit after measurement.
A qubit can be 1, 0, or a superposition of 1 and 0. If there are 4 qubits, then there are 2^4 basis states, which is 16 in this case: 0000, 0001, 0010, …, 1111.
Before measurement, the equation of state above is what represents the state of the qubits. But after measurement you get only one result (either 0000, 0001, …, 1111): the state collapses to classical bits. It does not mean getting all 16 outputs at the same time.
Measurement is like forcing the quantum system to produce a classical outcome rather than seeing the outcome as a wave.
Superposition can't be observed directly, since any measurement destroys it; this loss is called decoherence.
The amplitudes are α and β, not 1 and 0. They are complex numbers which represent wave strength and phase, not probabilities.
Note
Amplitudes (α and β) -> "how much wave exists"; they are also called probability weights. Probabilities (|α|^2 and |β|^2) -> "chance of outcome".
In the note above, we see the absolute-value sign with a square. The absolute value takes the magnitude of the complex number (α or β), and squaring it gives a real value to use as a probability.
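A quick check in plain Python, using complex amplitudes (the numbers are made up but normalized):

```python
import cmath

# Amplitudes are complex numbers (wave strength + phase), not probabilities.
alpha = complex(0.6, 0.0)                    # amplitude for |0>
beta = cmath.exp(1j * cmath.pi / 4) * 0.8    # amplitude for |1>, with a phase

p0 = abs(alpha) ** 2   # |alpha|^2: magnitude squared -> real probability
p1 = abs(beta) ** 2    # |beta|^2 -- the phase e^(i*pi/4) drops out here

print(p0, p1, p0 + p1)  # 0.36, 0.64, 1.0 (probabilities must sum to 1)
```

Note how the phase on beta vanishes when you take the magnitude: that is exactly the phase information lost at measurement.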
The phase of a qubit's state exists before measurement, but after measurement the phase information is lost; only one classical outcome is revealed.
Manipulating the amplitudes means changing the values of α and β, which is done using quantum gates like Hadamard and CNOT: they increase some amplitudes, decrease others, or change phases.
The idea about storing classical values is that, e.g., a 4-qubit quantum system doesn't store 16 classical values; it stores 16 amplitudes.
Interference
Think of it as waves that can be added.
wave + wave = bigger wave -> constructive interference
wave + (−wave) = 0 (the waves cancel) -> destructive interference
Wrong answers are cancelled out while correct answers get amplified.
If so, then how do we identify the wrong answers and the right answer?
Algorithms exist to do exactly that. Also, phases are not assigned by hand; rather, they are controlled by quantum gates.
e.g.
- Hadamard gate → creates + and − superpositions
- Phase gate → adds phase shift e^iθ
- Oracle (in Grover’s algorithm) → flips phase of “marked” state
Example idea (Grover-style intuition)
Suppose correct answer = 1010
Oracle does:
∣1010⟩→−∣1010⟩
flips sign (phase = π)
In quantum computing, correct/wrong answers are not labeled. Instead, the algorithm manipulates phases so that interference patterns amplify the desired states (correct answers) and suppress the others (wrong answers).
We only have a function that is reflected by the correct state, and this helps in amplifying the correct state.
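A toy, purely classical simulation of that Grover-style intuition: 16 equal amplitudes, an oracle that flips the sign of the marked state, and an “inversion about the mean” step that does the amplification. This is only a sketch of the idea, not a real quantum simulation:

```python
# Grover-style amplitude amplification over 16 basis states (0000..1111).
n_states = 16
marked = 0b1010                            # the "correct answer" 1010
amps = [1 / n_states ** 0.5] * n_states    # uniform superposition, 0.25 each

for _ in range(3):                         # ~ (pi/4) * sqrt(16) iterations
    # Oracle: flip the phase (sign) of the marked state only
    amps[marked] = -amps[marked]
    # Diffusion: reflect every amplitude about the mean
    mean = sum(amps) / n_states
    amps = [2 * mean - a for a in amps]

probs = [a * a for a in amps]              # |amplitude|^2 -> probability
print(probs[marked])                       # ≈ 0.96: the marked state dominates
```

After three iterations the marked state carries almost all the probability, even though the oracle never "named" it; it only flipped its phase.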
Simple analogy
Imagine:
16 people are in a dark room (states) and you cannot see them directly, but you have a rule:
“only the correct person responds to a signal”
You send a wave; the correct person reflects it differently (phase flip) and the waves interfere, which makes the correct one the strongest echo.
Entanglement
This is the ability of qubits to correlate their state with other qubits. When an entangled qubit is measured, its state becomes strongly correlated with the other entangled qubits, which helps in determining information immediately.
Decoherence
It is the process in which a quantum state collapses into a non-quantum state. This happens when the quantum system is measured, or when something makes the system behave as if measured (it loses its quantum behaviour (superposition + phase relations) when it interacts with the environment). It then behaves like a classical object.
Decoherence affects quantum behaviour, not the rule of measurement itself.
Quantum measurement problem
We can't measure the full quantum state. The fundamental rule of nature is that a quantum state evolves smoothly (wave-like), while measurement forces a single outcome.
Think of a spinning coin:
quantum state = coin spinning in air (mixture of heads & tails behavior)
measurement = catching it → only heads OR tails
Interpolation Techniques
Interpolation is an estimation. It's just about estimating an unknown value that lies inside the interval of the known data points.
1. Linear Interpolation
If the points are estimated such that between two end points they form a line, then we call it Linear Interpolation.
Remember there is another word, extrapolation, which means the estimation of values beyond the range of the data.
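A minimal sketch of linear interpolation in plain Python (the case numbers echo the made-up dengue example later in this chapter):

```python
# Linear interpolation: estimate a value on the straight line joining
# two known end points (x0, y0) and (x1, y1).
def lerp(x0, y0, x1, y1, x):
    t = (x - x0) / (x1 - x0)      # how far x sits between x0 and x1
    return y0 + t * (y1 - y0)

# e.g. Jan (month 1) had 1234 cases and Feb (month 2) had 453:
print(lerp(1, 1234, 2, 453, 1.5))   # halfway -> (1234 + 453) / 2 = 843.5
```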
2. Cubic Spline interpolation.
In short, things to note in order to understand Cubic Spline Interpolation:
- Each interval has its own cubic polynomial function.
- Cubic polynomial: ax³+bx²+cx+d
- For two sequential interval functions: the functions are equal at their shared interior point, their first derivatives are equal there, and their second derivatives are equal there. [This means continuity.]
- The extreme first and last points have second derivative zero.
Benefits of Cubic Spline and a Bird's-Eye View
Derivation of Cubic Interpolation
Things to note from the above screenshot:
- 4 unknown coefficients need to be solved for in each interval.
- x0, …, xn are the x-axis values of the points.
- With n+1 points there are n polynomial functions, e.g. above, 3 points give 2 polynomial functions for the intervals. So, with 4 unknown coefficients per polynomial, we need 4 equations per polynomial, i.e. 4n equations in total.
- Interior points: (n+1) − 2 = n − 1 (the extreme points are excluded). Each polynomial must pass through the 2 end points of its interval, so the known points give 2n equations.
To solve we use cubic polynomial in this format:
Si(x)=ai+bi(x−xi)+ci(x−xi)²+di(x−xi)³
The points above appear in the 2 screenshots below, which include the same 4 points as above.
Constraints:
The total number of equations that we can form is:
2n + (n−1) + (n−1) + 2
= 4n − 2 + 2
= 4n
Hence we now have 4n equations for the 4n unknown variables, and we can plug in all the known data and solve the system.
Application - Dengue Outbreak Interpolation
Given:
Nepal’s District 2023 year dengue outbreak data for each month
i.e Jan - 1234 case Feb - 453 case ……………… Dec - 213 case
I want to interpolate in such a way that I can find weekly interpolated data (i.e. weekly estimated cases from the monthly data).
Solution:
Using Cubic Spline,
Here,
week_positions – 52 evenly spaced points, something like [1.00, 1.21, 1.42, 1.63, …, 11.58, 11.79, 12.00]
bc_type="natural" – the second derivatives at Jan and Dec are 0
cs – the polynomial function (to estimate the cases for a given point x)
cs behaves like a mathematical function: cs(x)
Benefits of using a cubic spline in this application (per an LLM):
a. Smooth and Natural Curve
Produces a smooth trend without sharp corners between data points.
b. Realistic Disease Pattern Modeling
Captures gradual rise and fall of outbreaks better than straight-line interpolation.
c. Continuity of Growth Rate
Maintains continuous first and second derivatives (smooth change in infection rate).
d. Accurate Within Data Range
Provides reliable estimates inside the observed time period (Jan–Dec).
e. Converts Monthly to Weekly Easily
Transforms discrete monthly data into a continuous function for weekly estimation.
f. Avoids High-Degree Polynomial Oscillation
Prevents extreme fluctuations seen in single high-order polynomial interpolation.
Rough Notes:
Chebyshev Interpolation
Basic idea of Chebyshev Interpolation:
- It uses nodes with unequal spacing between them, known as Chebyshev nodes.
- The nodes are decided from the unit circle, with equally angled divisions of the semicircle.
- The nodes are taken from a cosine formula.
- Chebyshev polynomials are used to form the approximating function.
- The approximating function is a linear combination of Chebyshev polynomials.
- Chebyshev polynomials are bounded on [-1, 1].
Take a look at the concise note below for the basics.
How to interpolate using Chebyshev Polynomial?
Let’s do it step by step:
Given -> a function f(x), where x is on the bound [-1, 1]
To find -> an approximate polynomial function to approximate the original function f(x)
Step 1:
Finding the Chebyshev nodes using the Chebyshev polynomial,
i.e.
from Tn+1(x) = cos((n+1) arccos x), with roots where Tn+1(x) = 0.
Solving, we get:
The Chebyshev nodes are calculated using the formula xk = cos( (2k + 1)π / (2(n + 1)) ), for k = 0, 1, …, n.
Step 2:
Find the yk values for all Chebyshev nodes (i.e. the x values obtained in step 1).
Step 3:
(Method 1): Use Lagrange interpolation to guarantee that the approximating polynomial Pn(x) passes through all the found nodes.
(Method 2): Use the Chebyshev series representation.
Step 4:
The coefficients are computed using the following formulas and conditions:
(I have not understood the math yet, so work through it yourself; I will do it later.)
Step 5:
The interpolation polynomial (approximating function) will be
(I have not understood the math yet, so work through it yourself; I will do it later.)
But note that after getting the general form of the polynomial in step 2, it is further calculated to get the evaluating formula that is in the LLM reference below.
Step 6 (optional):
Calculating the error using the following formula:
In short:
- Choose a degree n -> n means n+1 coefficients & n+1 nodes
- Compute the Chebyshev nodes, i.e. the xk
- Find the yk
- Find the Chebyshev coefficients cj and construct the polynomial, i.e. Pn(x)
- Approximate: f(x) ≈ Pn(x), using the explicit Chebyshev series formula to evaluate it for any x
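The recipe above, sketched in plain Python. It approximates f(x) = e^x on [-1, 1] (a test function I chose for illustration), using the discrete-orthogonality formula for the coefficients:

```python
import math

def chebyshev_interpolate(f, n):
    """Build Pn(x) ≈ f(x) on [-1, 1] from n+1 Chebyshev nodes."""
    N = n + 1
    # Steps 1-2: nodes xk = cos(pi * (2k + 1) / (2N)) and yk = f(xk)
    nodes = [math.cos(math.pi * (2 * k + 1) / (2 * N)) for k in range(N)]
    ys = [f(x) for x in nodes]
    # Step 4: coefficients cj = (2/N) * sum_k yk * Tj(xk),
    # using Tj(x) = cos(j * arccos(x))
    coeffs = [2 / N * sum(y * math.cos(j * math.acos(x))
                          for x, y in zip(nodes, ys))
              for j in range(N)]

    def p(x):
        # Step 5: Pn(x) = sum_j cj * Tj(x) - c0 / 2
        return sum(c * math.cos(j * math.acos(x))
                   for j, c in enumerate(coeffs)) - coeffs[0] / 2
    return p

p = chebyshev_interpolate(math.exp, 5)
print(p(0.3), math.exp(0.3))   # very close: the degree-5 error is tiny
```

The cosine form of Tj(x) only works on [-1, 1], which is exactly the bound mentioned in the basics above.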
In code form
Both in recursive form. Take reference from the handwritten note.
a. Polynomial Form
b. Cosine Form
Application of Chebyshev Interpolation in the Dengue Case Interpolation
Given:
Nepal’s District 2023 year dengue outbreak data for each month
i.e Jan - 1234 case Feb - 453 case ……………… Dec - 213 case
And things to note:
F**k, this can't be applicable, dude. I'm getting confused myself.
And the reason is below.
Note
Chebyshev interpolation = a method to construct a polynomial that approximates the original function well, especially at the edges, without wild oscillations. The original function's graph is "approximated" by this polynomial, not exactly reconstructed.
Not applicable for our scenario, dude.
Hold on, guys: though Chebyshev nodes can't be generated in our scenario, using Chebyshev polynomials as a basis can still reduce round-off errors and instability in computation, which means using the existing 12 months of case data with Chebyshev polynomials instead of powers of x.
What the f**k again. Dude, my LLM is hallucinating and forcing me to hallucinate too.
OK, let's do one thing: plot the graphs for the normal polynomial (powers of x), linear interpolation, Chebyshev interpolation, and cubic spline, all at once.
Before that, some basics are needed; actually, I am confused.
a. Linear Interpolation
As discussed above, a straight line joining two points is linear interpolation.
b. Polynomial Interpolation
A single polynomial function that passes through a set of points (through each point). The Lagrange polynomial is one example of it. It introduces Runge oscillation.
So, in our case, fixed monthly cases mean equally spaced nodes, so no Chebyshev nodes can be created and the method isn't applicable; even if we find the Chebyshev polynomial and plot it, it behaves more like plain polynomial interpolation. That's why the plots overlap, with clear Runge oscillation at both edges.
3. Rational Spline
It is a foundational concept in computer graphics. Standard parametric splines use standard polynomials to define their shape and can't perfectly represent the conic sections: circles, ellipses, parabolas, and hyperbolas.
Simply put, a rational function is the ratio of two polynomials.
Before diving into the rational spline, let’s have a look into the Bezier curves and B-spline.
Bezier Curves
It is used in computer graphics. The pen tool in Figma uses it.
For n+1 control points, the Bezier curve is defined as
This curve depends on the control points. Changing one point can change the whole structure of the curve.
In points:
- Global control: a single control-point change affects the entire curve
- The curve passes through the first and last control points
- The entire curve lies within the convex hull of the control points
- The degree is n for n+1 control points
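The definition above can be sketched with de Casteljau's algorithm, which evaluates a Bezier curve by repeatedly lerping between neighbouring control points (the control points here are made up):

```python
# De Casteljau's algorithm: evaluate a Bezier curve at parameter t in [0, 1]
# by repeatedly interpolating between neighbouring control points.
def bezier(points, t):
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

control = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]   # 3 points -> degree 2
print(bezier(control, 0.0))   # (0.0, 0.0): passes through the first point
print(bezier(control, 1.0))   # (2.0, 0.0): passes through the last point
print(bezier(control, 0.5))   # (1.0, 1.0): pulled toward the middle point
```

Note how the middle control point pulls the curve without lying on it, and how moving any point would change every interpolation level: that is the "global control" property.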
B-spline
B-spline is also called a basis spline; it is a combination of piecewise polynomial segments (all with the same degree) joined smoothly. It is controlled by control points and a knot vector.
Attention Is All You Need
“One of the most popular and revolutionary research papers out there. Today's AI is almost 70% because of this paper.”
I assume that, based on what I have heard about it.
If that is so, then let's crack this paper and go deeper to understand the core concepts that it introduced to the world.
Warning
This paper is an absolute rabbit hole for beginners.
And yes, I am going deep into the hole.
Cracking “Abstract”
Since an abstract gives a concise summary of what the paper is trying to achieve, or say, the gist of the paper, there is nothing to do but first understand some concepts that are introduced in this section.
And they are:
I. Sequence Transduction Models
Sequence transduction means converting one sequence into another sequence. For example, the seq2seq model is a type of sequence transduction model.
Tip
Before diving deep into seq2seq, get familiar with the concepts of Long Short-Term Memory and Word Embedding & Word2Vec
The seq2seq model consists of two components:
a. Encoder:
b. Decoder:
Go Deeper into Sequence Transduction Models
II. Recurrent Neural Network
III. Convolution Neural Network
IV. Attention Mechanism
V. Transformer
What this paper achieved?
Getting Started
Below are the course contents that will be noted in this particular parent section, “Deep Learning Course Content”.
Course Content
Unit 1: Foundations & Applied Math (8 Hours)
- Introduction and History: Motivation for Deep Learning; Historical trends; Success stories.
- Linear Algebra & Probability: Tensors, Eigendecomposition, Information Theory, and Numerical Optimization.
- Bayesian Decision Theory: Making optimal decisions under uncertainty, inference vs. decision, and loss functions for classification/regression.
- Machine Learning Basics: Capacity, Overfitting/Underfitting, Hyperparameters, and the Bias-Variance tradeoff.
Unit 2: Deep Networks & Training Optimization (12 Hours)
- Deep Feedforward Networks: Multilayer Perceptrons (MLP); Gradient-Based Learning; Backpropagation and the Chain Rule.
- Modern Regularization: L1/L2 penalties, Dropout, Early Stopping, and Dataset Augmentation.
- Optimization & Normalization: SGD, Momentum, Adam Optimizer; Batch Normalization and Layer Normalization.
Unit 3: Convolutional Networks & Computer Vision (10 Hours)
- The Convolution Operation: Motivation, Pooling, and the Neuroscientific basis for CNNs.
- Modern Vision Architectures: Residual Networks (ResNets), Inception, and Deep CNN variants.
- Advanced Vision Tasks: Object Detection (YOLO/SSD), Semantic Segmentation, and the U-Net architecture.
Unit 4: Sequence Modeling & The Attention Revolution (10 Hours)
- Recurrent Neural Networks: RNNs, the Vanishing Gradient problem, and Gated Units (LSTM and GRU).
- The Attention Mechanism: Self-Attention, Multi-head Attention, and the “Attention is All You Need” paradigm.
- The Transformer Blueprint: Encoder-Decoder architecture, Positional Encoding, and scaling to Large Language Models (LLMs).
Unit 5: Frontiers: Generative & Graph Models (8 Hours)
- Autoencoders & Latent Spaces: Undercomplete autoencoders and Representation Learning.
- Generative AI: Variational Autoencoders (VAEs) and Diffusion Models.
- Graph Neural Networks: Message Passing, Node Embeddings, and Graph Convolutional Networks (GCNs).