
Whatssssss up!

Welcome to the guide book of Deep Learning.

Deep learning is summarized simply and clearly in this book. All the content is from third-party sources, since this book isn’t the documentation of any invention; rather, it’s a documented form of already-invented AI concepts.

Enjoy! :)

– Pahari

Bias and Variance

Bias and variance are two basic concepts of ML. If you don’t know about them yet then, as we say in Nepali, malai baal (I couldn’t care less).

Model Types

1. Overfit Model

When your model (say, a machine learning method/algorithm, or say a curve) performs very well on the training dataset (imagine the curve covering all the training points), then there is a high chance the model is overfitting.

Note

Predicted value: points on the green curve
Actual value: data points in red

Let’s understand the above figure.

Here, (high) variance is introduced. Variance can simply be understood as the difference in the fit of the model across different test datasets (how close the predicted values are to the expected values between fits on 2 different test datasets), or, outside the ML context, how scattered the predicted values are.

If you have a model (ML model/equation/ML algorithm … all similar) and you try to fit it to 2 different datasets, and there is a high difference between their test errors, then it is called high variance. Here, the test error varies greatly based on the selection of the training dataset.
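The contrast above can be sketched in a few lines of plain Python with made-up data: a model that memorizes its training points (here, a Lagrange interpolating polynomial standing in for an overfit curve) gets zero training error, while a least-squares line accepts some training error in exchange for stability.

```python
# A tiny sketch (with hypothetical data) of overfitting: the interpolating
# polynomial covers every training point exactly, the line does not.

def lagrange(xs, ys, x):
    """Evaluate the polynomial passing exactly through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Noisy samples of a roughly linear trend (made-up numbers).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.2, 2.3, 3.8, 6.4, 7.9]

slope, intercept = fit_line(train_x, train_y)

# The "overfit" interpolant reproduces every training point exactly...
assert all(abs(lagrange(train_x, train_y, x) - y) < 1e-9
           for x, y in zip(train_x, train_y))
# ...while the line tolerates some training error, since the data are noisy:
train_ssr = sum((y - (slope * x + intercept)) ** 2
                for x, y in zip(train_x, train_y))
assert train_ssr > 0
```

The interpolant's perfect training fit is exactly why its predictions swing wildly when the training set changes, which is the high-variance behaviour described above.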

2. Underfit Model

When your model (say, a machine learning method/algorithm, or say a curve) performs only a little well (imagine the curve not covering most training points), then there is a chance the model is underfitting.

Let’s understand the above figure.

Here, (high) bias is introduced. Bias can simply be understood as how far the predicted values (curve) are from the expected values (red dots).

If you have a model (ML model/equation/ML algorithm … all similar) and you try to fit it to 2 different datasets, and the training error is high in both, then it is called high bias. Here, the training error stays high even though a different training dataset is taken.

3. Balanced Model

When your model (say, a machine learning method/algorithm, or say a curve) performs well (imagine the curve covering a good number of training points with little distance between the curve and the points), then there is a chance the model is balanced. When that happens, there is a possibility of less error (good prediction) on a new test dataset.

Low variance and low bias can result in a good model.

Remembering Tips 💡

a. Bias

How much the model fails to capture the true pattern in the training dataset, resulting in an underfit model (consistently wrong predictions on a new dataset).

b. Variance

It is the amount by which the prediction would change if we fit the model to a different training dataset (bad prediction on a new dataset).

The figure shows overfitting: variance is sensitivity to the training data.

Note

Blue Dots: Training points
Green Dots: Testing points

References

  1. Codebasics. Machine Learning Tutorial Python – 20: Bias vs Variance In Machine Learning. YouTube.

  2. Josh Starmer (StatQuest). Machine Learning Fundamentals: Bias and Variance. YouTube.

Regression & Curve Fitting

Curve fitting refers to drawing a curve through a given dataset to capture the true pattern of the dataset. It is done using regression.

Types of Regression

It is about learning a general pattern and predicting unseen data.

Note

Interpolation is not the same as regression. Interpolation is about drawing a curve passing exactly through all data points, whereas regression is about finding a best-fit trend in a given dataset.

1. Linear Regression Aka Least Squares

It is based on fitting a line. For a set of data points on a graph, a line is drawn (an estimation line for predicting the output of an unknown input), and the distances from the points to that line are squared and summed; this sum is called the least squares.

And the minimum least squares is what we want to achieve, i.e. the best fit model.

The final line (best fit model) minimizes the sum of squares (also known as least squares) between it and the real data.

In this situation, we optimize the values of the slope (weight) and intercept (bias) for the best fit.
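The optimization above has a well-known closed form for a single line; a minimal sketch with made-up points that lie exactly on y = 2x + 1:

```python
# Least squares for a straight line y = slope*x + intercept.
# The closed form below minimizes the sum of squared residuals.

def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = S_xy / S_xx; the intercept makes the line pass through the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points on y = 2x + 1, so the fit should recover slope 2 and intercept 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = least_squares(xs, ys)
assert abs(slope - 2.0) < 1e-9 and abs(intercept - 1.0) < 1e-9
```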


Chain Rule and Gradient Descent

Chain Rule
It is about “If something depends on something else, we calculate change step-by-step.”

Gradient Descent
It is about taking small steps downhill to reach the lowest error.

Go to the next page for a deeper understanding of both concepts. ->

Chain Rule

Chain rule can be simply compared with a chain where, say, small objects are linked together.

Mathematically, if one property relates to a second property, and that second property also relates to a third property, this implies a new relation between the first and the third.

The images below should make the concept of the chain rule clear:

Things to know for the concept of the chain rule in maths are:
a. Derivative
b. Slope

Finding best fit model

Step 1:
For a particular model (linear regression in this case, with weight on the x-axis and height on the y-axis), compute each residual, and plot the residuals against the intercepts.

Step 2:
Plot residuals and squares of residual

Step 3:
Find the relation. Relation we got is
Weight -> Height -> Intercept (y-axis intercept/Height itself) -> Residuals -> Residual squares

Step 4:
Setting the derivative of the squared residual with respect to the intercept (height) to zero minimizes the squared residuals (meaning less error)
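Step 4 can be checked numerically. For a fixed slope m, setting d(SSR)/d(intercept) = 0 gives intercept = mean(y − m·x); a sketch with hypothetical numbers:

```python
# Setting d(SSR)/d(intercept) = 0 analytically, then checking that nudging
# the intercept in either direction only increases the SSR.

m = 0.64                      # a fixed slope (hypothetical)
xs = [0.5, 2.3, 2.9]          # weights (hypothetical)
ys = [1.4, 1.9, 3.2]          # heights (hypothetical)

# d/db sum((y - (m*x + b))^2) = -2 * sum(y - m*x - b); setting it to 0:
b_opt = sum(y - m * x for x, y in zip(xs, ys)) / len(xs)

def ssr(b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Nudging b either way can only increase the SSR, so b_opt is the minimum:
assert ssr(b_opt) <= ssr(b_opt + 0.01) and ssr(b_opt) <= ssr(b_opt - 0.01)
```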

References

  1. StatQuest with Josh Starmer. The Chain Rule, Clearly Explained!!! YouTube. https://www.youtube.com/watch?v=wl1myxrtQHQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=55

Gradient Descent

Tip

Get familiar with the concept of Regression & Curve Fitting

It is about optimizing the model, e.g. Linear Regression (optimize the intercept and slope), Logistic Regression (optimize a squiggle), t-SNE (optimize the clusters), etc.

Gradient means the derivatives of the loss function with respect to the parameters, whereas descent refers to descending along the derivatives (slope) to near zero.

Loss Functions

Loss functions are functions used to measure how wrong the model’s prediction is (they measure the difference between the actual value and the predicted value of a model).

Some types of loss functions are: Sum of Squared Residuals, Mean Absolute Error, Huber Loss, Mean Squared Log Error, etc.

Working of Gradient Descent

Gradient Descent finds the optimal value (e.g. the intercept that brings the derivative of the sum-of-squared-residuals curve, plotted with the intercept on the x-axis, near zero) by taking big steps when it is far away and baby steps when it is near the optimal value.

Note

Using the least squares method, we directly solve for where the slope of the curve (sum of squared residuals vs intercept) is zero. In contrast, Gradient Descent finds the minimum value by taking steps from an initial guess until it reaches the best values. Gradient Descent is very useful when it is not possible to solve for where the derivative = 0.

Learning Rate

This is the rate that determines the step size for updating a parameter (such as the intercept), which influences how the optimal solution is found.

Gradient Descent stops when the step size is very close to 0, and that happens when the slope is near 0.

Formula:

Step Size = Slope * Learning Rate

Estimating Intercept and Slope using Gradient Descent

First, differentiate the loss function (SSR) with respect to the intercept and the slope. Then, using the learning rate, update both parameters, intercept and slope. Once the step size starts getting near 0, we are supposed to be getting near the optimal values of both parameters.
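The whole procedure, including the "Step Size = Slope × Learning Rate" rule, fits in a short loop; a sketch with made-up noise-free data so the answer is known:

```python
# Gradient descent on the intercept and slope of a line, stopping when the
# step size is near 0. Data lie exactly on y = 2x + 1 (hypothetical).

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = 0.0, 0.0                     # initial guesses
lr = 0.01                           # learning rate

for _ in range(10000):
    # Derivatives of SSR = sum((y - (m*x + b))^2), via the chain rule
    d_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
    d_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
    step_b, step_m = lr * d_b, lr * d_m     # step size = slope * learning rate
    b, m = b - step_b, m - step_m
    if max(abs(step_b), abs(step_m)) < 1e-9:  # step size near 0 -> stop
        break

assert abs(m - 2.0) < 1e-3 and abs(b - 1.0) < 1e-3
```

The steps shrink automatically as the minimum approaches, because the gradient itself shrinks; no schedule is needed here.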

Bird’s Eye View

Regularization

It is a technique in ML where a penalty is added to keep the model simple and avoid overfitting.

Types of Regularization

Ridge (L2) Regression

The main idea behind ridge regression is to find a new line that doesn’t fit the training data perfectly (done by introducing a little bias).

In return, the variance is reduced and overfitting is avoided.

It basically minimizes not only the sum of the squared residuals (like least squares) but also (Lambda * Slope^2).

When the sample size (training set) is relatively small, Ridge Regression (L2 regularization) can improve predictions made from new data (i.e. reduce variance) by making predictions less sensitive to the training data through the penalty.

This is done by adding the ridge regression penalty (Lambda * Slope^2) to the quantity that must be minimized, i.e. the sum of squared residuals (least squares) + (Lambda * Slope^2).

Lambda is determined using cross-validation (testing with the test data). A greater lambda shrinks the slope asymptotically toward zero.

The main effect is that it makes the predictions less sensitive to a tiny training dataset.
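For a one-feature line the penalized problem has a closed form (assumed here, derived by setting the derivatives of SSR + λ·slope² to zero): slope = S_xy / (S_xx + λ). A sketch with made-up data showing the shrinking effect:

```python
# Ridge for a single slope: minimize SSR + lam * slope^2.
# Bigger lambda -> smaller slope, asymptotically approaching zero.

def ridge_slope(xs, ys, lam):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    return s_xy / (s_xx + lam)

xs = [1.0, 2.0, 3.0, 4.0]          # hypothetical data
ys = [2.1, 3.9, 6.2, 8.1]

m_ols   = ridge_slope(xs, ys, 0.0)     # lambda = 0 is ordinary least squares
m_ridge = ridge_slope(xs, ys, 10.0)    # penalized slope
m_huge  = ridge_slope(xs, ys, 1e6)     # very large lambda

assert abs(m_ridge) < abs(m_ols)       # the penalty shrinks the slope
assert abs(m_huge) < 1e-3              # ...asymptotically toward zero
```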

Lasso (L1) Regression

The main idea behind lasso regression is similar to ridge regression, but here the penalty uses the absolute value of the slope, not its square.

Similar to ridge regression, in return the variance is reduced and overfitting is avoided. And the main effect is that it makes the predictions less sensitive to a tiny training dataset.

Lasso regression can exclude useless variables from the equation due to the absolute value, so it is a little better than ridge regression at reducing the variance in models that contain a lot of useless variables.

This is done by adding the lasso regression penalty (Lambda * |Slope|) to the quantity that must be minimized, i.e. the sum of squared residuals (least squares) + (Lambda * |Slope|).
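The "exclude useless variables" claim can be made concrete. For a single slope, the lasso solution is a soft threshold (the standard single-feature closed form, assumed here): the slope becomes exactly zero once λ is large enough, while ridge only shrinks it. A sketch with made-up, nearly flat data:

```python
# Lasso vs ridge on one slope. The lasso solution is a soft threshold:
# it zeroes the slope when the penalty overwhelms the signal.

def stats(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    return s_xy, s_xx

def lasso_slope(xs, ys, lam):
    s_xy, s_xx = stats(xs, ys)
    if abs(s_xy) <= lam / 2:          # penalty overwhelms the signal
        return 0.0
    sign = 1.0 if s_xy > 0 else -1.0
    return (s_xy - sign * lam / 2) / s_xx

def ridge_slope(xs, ys, lam):
    s_xy, s_xx = stats(xs, ys)
    return s_xy / (s_xx + lam)

# A weak, nearly useless relationship (hypothetical numbers):
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.1, 5.0, 5.2, 5.15]

lam = 2.0
assert lasso_slope(xs, ys, lam) == 0.0    # lasso excludes the variable
assert ridge_slope(xs, ys, lam) != 0.0    # ridge only shrinks it
```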

Word Embedding & Word2Vec

Long Short-Term Memory (LSTM)

Tip

Before diving into Long Short-Term Memory (LSTM), get familiar with the concept of Recurrent Neural Network (RNN)

Sequence Transduction Models

Basics of Neural Network

Why is it called a neural network? Because its two fundamental components, nodes and connections, are like the brain’s neurons and synapses respectively.

Components of Neural Network

Fundamental components of NN are:

Nodes

They can be input nodes, output nodes, and hidden nodes.

Hidden nodes have an activation function. The activation function is a curve: the x-axis value calculated from the layers (including bias and weight) is plugged in, the corresponding y-axis value is picked, and that value is plotted on the dataset graph to fit the dataset.

Layers

Layers are like a spider web, i.e. connections between nodes. They consist of biases and weights.

Bias is the addition part (+). In the ML context, bias is how much the model fails to capture the true pattern in the training dataset, resulting in an underfit model (consistently wrong predictions on new data).

For deeper understanding, check the blog mentioned just below.

Weight is the multiplication part. It adds importance to a particular factor.

Bias is the addition part. It is like a threshold adjuster; think of it as a shifter/base score/adjustment.

Weights and biases help fit a model (curve) to the dataset (capture the pattern of the dataset) so it can predict the result for a new case.

Tip

Get familiar with the concept of Bias and Variance

Math Behind Neural Network

Here, the y-axis is how effective the dosage is. The x-axis is the level of dosage (low, medium, high).

Stepwise Maths

It’s a stepwise walkthrough of the mathematics behind predicting whether a 0.5 dosage (medium) is effective or not.

Step 1

By putting in the input value 0.5 (dosage) and doing all the calculation (the weight is *-34.4 and the bias is +2.14), the result corresponds to the x-axis coordinate of the activation function.

Step 2

For the x-axis value (-15.06) of the activation function obtained from step 1, the y-axis value of the activation function is used to plot a point in the actual dataset to form the curve.

The y-axis value of the activation function is calculated using the activation function’s equation. The activation function used here is softplus [ f(x) = log(1+e^x), so f(-15.06) = log(1+e^-15.06) ].

Step 3

Do the same for the yellow layer up to that hidden node

Step 4

Now, the y-axis values from the two hidden nodes (blue and orange) are each scaled by a weight, summed, and then the final bias is added [ (some small number * -1.30) + (0.71 * 2.28) + (-0.58) ].

This results in 1.03, which is close to 1, which means the 0.5 dosage is effective.
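Steps 1–4 can be run end to end. The top node's weight (-34.4) and bias (+2.14) are from the text; the bottom node's weight (-2.52) and bias (+1.29) are assumed here (chosen so its softplus output is about 0.71, matching the figure), so treat them as illustrative:

```python
import math

# Forward pass of the two-hidden-node network from Steps 1-4, using
# softplus log(1 + e^x) as the activation. Bottom-node parameters are
# assumptions, not values given in the text.

def softplus(x):
    return math.log1p(math.exp(x))

dosage = 0.5

# Steps 1-2: top (blue) hidden node; its x-axis input is -15.06, y is tiny
top = softplus(-34.4 * dosage + 2.14)

# Step 3: bottom (orange) hidden node (assumed parameters), about 0.71
bottom = softplus(-2.52 * dosage + 1.29)

# Step 4: scale each node's output, sum, and add the final bias
output = top * (-1.30) + bottom * 2.28 + (-0.58)

assert abs(output - 1.03) < 0.01   # close to 1 -> 0.5 dosage is effective
```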

Note

Bias & activation function are inside the neuron (nodes) and weights are on the connections (lines).
Still, the connections are part of the neural network.

Tip

Now, get familiar with the concept of Regression & Curve Fitting

Hand Written Notes

Pahari’s Notes :)

References:

  1. StatQuest with Josh Starmer. The Essential Main Ideas of Neural Networks. YouTube. https://www.youtube.com/watch?v=CqOfi41LfDw&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=74
  2. Alice Heiman. The Perceptron Explained. YouTube. https://www.youtube.com/watch?v=i1G7PXZMnSc
  3. StatQuest with Josh Starmer. The Main Ideas of Fitting a Line to Data (The Main Ideas of Least Squares and Linear Regression). YouTube. https://www.youtube.com/watch?v=PaFPbb66DxQ&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=9

Backpropagation in Neural Network

Tip

Before diving into Backpropagation, get familiar with the concepts of Chain Rule and Gradient Descent

The core concept behind backpropagation is using gradient descent to find the optimal value of the bias (b3) using the learning rate, the chain rule, derivatives, and a loss function.

You will understand it better if you have gone through Chain Rule & Gradient Descent properly.

The screenshot below illustrates the basic concept behind backpropagation.

Basic understanding of Backpropagation

Assume the optimal value of b3 in that figure is unknown. What we do next is assume 0 as the initial value of b3 and find the SSR (Sum of Squared Residuals) for the curve obtained from that particular bias value.

Then plot that SSR (y-axis) against the bias (x-axis) in a graph. Then take the derivative of the SSR with respect to the bias to find the optimal value of the bias.

Using gradient descent, we calculate the optimal value of the bias, which is obtained when the step size (here, for calculating the new bias value) is near 0.
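A minimal sketch of this idea: everything upstream of the final bias b3 is frozen (the hidden-layer outputs below are made-up numbers), and gradient descent tunes b3 alone. By the chain rule, d(SSR)/d(b3) = Σ −2·(observed − predicted):

```python
# Tuning only the final bias b3 by gradient descent, with the rest of the
# network frozen. All numbers are hypothetical.

hidden = [0.2, 1.1, 0.7]        # frozen part of each prediction
observed = [1.0, 2.0, 1.5]      # observed values

b3 = 0.0                        # initial guess, as in the text
lr = 0.1

for _ in range(1000):
    # chain rule: d(SSR)/d(b3) = sum(-2 * (observed - (hidden + b3)))
    grad = sum(-2 * (obs - (h + b3)) for h, obs in zip(hidden, observed))
    step = lr * grad
    b3 -= step
    if abs(step) < 1e-9:        # step size near 0 -> done
        break

# The analytic optimum here is the mean residual of the frozen part:
best = sum(obs - h for h, obs in zip(hidden, observed)) / len(hidden)
assert abs(b3 - best) < 1e-6
```

In real backpropagation the same chain-rule bookkeeping is applied to every weight and bias at once, layer by layer from the output backwards.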

Recurrent Neural Network (RNN)

Tip

Before diving into Recurrent Neural Network (RNN), get familiar with the concept of Neural Network Basics

Basics of Quantum Computing

In quantum computing, some of the basic terms that need to be understood are:

  • Qubits
  • Superposition
  • Entanglement
  • Decoherence
  • Interference
  • Measurement
  • Gates

Qubits

They are similar to bits in classical computing, but in quantum computing. A qubit is a superposition of 1 and 0, but not literally both 1 and 0 at once (it is represented as a linear combination of the possibilities of 1 and 0).

It has an amplitude and a phase. But when measured, its value collapses to 1 or 0.

Qubits have a probability of yielding 1 or 0, computed from ∣α∣^2 and ∣β∣^2
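The amplitude-to-probability rule can be sketched directly with Python's complex numbers (the amplitude values below are arbitrary but normalized):

```python
import cmath
import math

# Amplitudes alpha and beta are complex; |alpha|^2 and |beta|^2 are the
# probabilities of measuring 0 and 1, and they must sum to 1.

alpha = 1 / math.sqrt(2)                              # amplitude for |0>
beta = cmath.exp(1j * math.pi / 4) / math.sqrt(2)     # same magnitude, different phase

p0 = abs(alpha) ** 2
p1 = abs(beta) ** 2

assert abs(p0 - 0.5) < 1e-12 and abs(p1 - 0.5) < 1e-12
assert abs(p0 + p1 - 1.0) < 1e-12   # a valid qubit state is normalized

# Note: changing beta's phase leaves p1 unchanged -- this is why a single
# measurement cannot reveal the phase.
```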

Superposition

It is a linear combination of the basis states (1 and 0), the output being 1 or 0 after measuring the qubit.

Here, the LHS is called ket psi, which tells the state of the qubit. This is a wave-like state, not a classical value. And the RHS is the linear combination of amplitudes corresponding to the basis states, which becomes the 0 or 1 bit after measurement.

A qubit can be in 1, 0, or a superposition of 1 and 0. If there are 4 qubits, then there are 2^4 = 16 basis states: 0000, 0001, 0010, …, 1111.

Before measurement, the above equation of the qubit’s state is what represents the state. But after measurement you get only one result (either 0000, 0001, …, 1111). The state collapses to classical bits. It doesn’t mean giving all 16 outputs at the same time.

Measurement is like forcing the quantum system to produce a classical outcome rather than seeing the outcome as a wave.

Superposition can’t be observed directly, since any measurement destroys it; this loss is called decoherence.

The amplitudes are α and β, not 1 and 0. They are complex numbers that represent wave strength and phase, but they are not probabilities.

Note

amplitude (α and β) → “how much wave exists”; they are also called probability weights
probability (|α|^2 and |β|^2) → “chance of outcome”

In the note above, we see the absolute sign with a square. The absolute value takes the magnitude of the complex number (α or β), and squaring it gives a real value as the probability.

The phase of the qubit’s state exists before measurement, but after measurement the phase information is lost; measurement only reveals one classical outcome.

Manipulating amplitudes means changing the values of α and β, which is done using quantum gates like Hadamard and CNOT: they either increase some amplitudes, decrease others, or change phases.

The contrast with storing classical values is: e.g. a 4-qubit quantum system doesn’t store 16 classical values, it stores 16 amplitudes.

Interference

Think of it as waves that can be added.

wave + wave = bigger wave → constructive interference
wave + (−wave) = 0 (the waves cancel) → destructive interference

Wrong answers are cancelled out while correct answers get amplified.

If so, then how do we identify the wrong answers and the right answers?

Algorithms exist to do exactly that. Also, phases are not assigned manually; rather, they are controlled by quantum gates.
e.g.

  • Hadamard gate → creates + and − superpositions
  • Phase gate → adds phase shift e^iθ
  • Oracle (in Grover’s algorithm) → flips phase of “marked” state

Example idea (Grover-style intuition)

Suppose correct answer = 1010

Oracle does:
∣1010⟩→−∣1010⟩
flips sign (phase = π)

In quantum computing, correct/wrong answers are not labeled. Instead, the algorithm manipulates phases so that interference patterns amplify the desired states (correct answers) and suppress the others (wrong answers).

We only have a function whose phase flip is reflected by the correct state, and this helps in amplifying the correct state.

Simple analogy

Imagine:

16 people in a dark room (states); you cannot see them directly, but you have a rule:

“only the correct person responds to a signal”

You send a wave; the correct person reflects it differently (phase flip), the waves interfere, and the correct one becomes the strongest echo.

Entanglement

This is the ability of qubits to correlate their states with other qubits. When an entangled qubit is measured, its state becomes strongly correlated with the other entangled qubits, which gives an advantage in determining information immediately.

Decoherence

It is the process in which the quantum state collapses into a non-quantum state. This happens when the quantum system is measured, or when something makes the system behave as if measured (it loses quantum behaviour (superposition + phase relations) when it interacts with the environment). It then behaves like a classical object.

Decoherence affects quantum behavior, not the rule of measurement itself.

Quantum measurement problem

We can’t measure the full quantum state. The fundamental rule of nature is that a quantum state evolves smoothly (wave-like), while measurement forces a single outcome.

Think of a spinning coin:

quantum state = coin spinning in air (mixture of heads & tails behavior)
measurement = catching it → only heads OR tails

Interpolation Techniques

Interpolation is an estimation. It’s about estimating an unknown value that lies within the interval of the data (estimating beyond the interval is extrapolation).

1. Linear Interpolation

If the points are estimated such that between two end points they form a line, then we call it linear interpolation.

Remember, there is another word, extrapolation, which means estimating values beyond the range of the data.
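Both ideas fit in one small function; a sketch with made-up points:

```python
# Linear interpolation: estimate y at a query point x between two known
# end points (x0, y0) and (x1, y1).

def linear_interpolate(x0, y0, x1, y1, x):
    t = (x - x0) / (x1 - x0)       # how far x sits between x0 and x1
    return y0 + t * (y1 - y0)      # the straight line through both points

# Halfway between (1, 10) and (3, 30) the line gives 20:
assert linear_interpolate(1.0, 10.0, 3.0, 30.0, 2.0) == 20.0
# The same formula with x outside [x0, x1] is extrapolation:
assert linear_interpolate(1.0, 10.0, 3.0, 30.0, 4.0) == 40.0
```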

2. Cubic Spline Interpolation

In short points, things to note in order to understand Cubic Spline Interpolation :

  • Each interval has its own cubic polynomial mapping.
  • Cubic polynomial: ax³+bx²+cx+d
  • For two sequential interval functions at an interior point: the functions are equal, their first derivatives are equal, and their second derivatives are equal. [Means continuity]
  • At the extreme first and last points, the second derivative is zero.

Benefits of Cubic Spline and Bird’s Eye View

Derivation of Cubic Interpolation

Things to note from the above screenshot:

  • 4 unknown coefficients per interval need to be solved
  • x0, …, xn are the x-axis values of the points
  • n+1 points give n polynomial functions (e.g. above, 3 points give 2 interval polynomials). So, with 4 unknown coefficients each, we need n*4 ~ 4n equations.
  • Interior points: (n+1) − 2 (excluding the extreme points) = n − 1. Each interior point belongs to 2 polynomials, so the known points give 2*n equations.

To solve we use cubic polynomial in this format:

Si(x) = ai + bi(x − xi) + ci(x − xi)² + di(x − xi)³

The points above appear in the 2 screenshots below, which include the same 4 points as above.

Constraints:

Total equation that we can form are :

2n+(n-1)+(n-1)+2
= 4n-2+2
= 4n

Hence we can now plug in all the calculated known data to get 4n equations for the 4n unknown variables. And we solve it.

Application – Dengue Outbreak Interpolation

Given:

Nepal’s district-level dengue outbreak data for each month of 2023,

i.e. Jan – 1234 cases, Feb – 453 cases, ……, Dec – 213 cases

I want to interpolate in such a way that I find weekly interpolated data (i.e. weekly estimated cases from the monthly data)

Solution:

Using cubic spline.

Here,

week_positions – 52 evenly spaced points, something like [1.00, 1.21, 1.42, 1.63, …, 11.58, 11.79, 12.00]
bc_type = “natural” – the second derivative at Jan and Dec is 0
cs – the polynomial function (to estimate the cases for a given point x)

cs behaves like a mathematical function: cs(x)
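A runnable sketch of this setup, assuming SciPy's `CubicSpline` (the `bc_type="natural"` option named above is SciPy's API). Only the Jan and Dec case counts come from the text; the other ten months are made-up numbers:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Monthly dengue counts: Jan=1234 and Dec=213 are from the text,
# the rest are hypothetical filler values.
months = np.arange(1, 13)                       # Jan=1 ... Dec=12
cases = np.array([1234, 453, 320, 280, 350, 600,
                  900, 1500, 1800, 1100, 500, 213])

cs = CubicSpline(months, cases, bc_type="natural")

# 52 evenly spaced week positions across the year, like week_positions above
week_positions = np.linspace(1, 12, 52)
weekly_estimates = cs(week_positions)

# The spline passes exactly through every monthly data point...
assert np.allclose(cs(months), cases)
# ...and "natural" means zero second derivative at Jan and Dec:
assert abs(cs(1, 2)) < 1e-6 and abs(cs(12, 2)) < 1e-6
```

`cs(x, 2)` asks SciPy for the second derivative at x, which is a quick way to check the natural boundary condition.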

Benefits of using cubic spline in this application: (LLM)

a. Smooth and Natural Curve
Produces a smooth trend without sharp corners between data points.

b. Realistic Disease Pattern Modeling
Captures gradual rise and fall of outbreaks better than straight-line interpolation.

c. Continuity of Growth Rate
Maintains continuous first and second derivatives (smooth change in infection rate).

d. Accurate Within Data Range
Provides reliable estimates inside the observed time period (Jan–Dec).

e. Converts Monthly to Weekly Easily
Transforms discrete monthly data into a continuous function for weekly estimation.

f. Avoids High-Degree Polynomial Oscillation
Prevents extreme fluctuations seen in single high-order polynomial interpolation.

Rough Notes: 

Chebyshev Interpolation

Basic idea of Chebyshev Interpolation:

  • Uses nodes with unequal spacing, known as Chebyshev nodes.
  • Nodes are decided from the unit circle, dividing the semicircle into equal angles
  • Nodes are taken from a cosine formula
  • A Chebyshev polynomial is used to form an approximating function
  • The approximating function is a linear combination of Chebyshev polynomials
  • Chebyshev polynomials are bounded on [-1,1]

Take a look at the concise note below for the basics.

How to interpolate using Chebyshev Polynomial?

Let’s do it step by step:

Given -> a function f(x) where x is on the bound [-1,1]
To find -> an approximating polynomial for the original function f(x)

Step 1:

Finding Chebyshev nodes using the Chebyshev polynomial,

i.e.
from Tn+1(x) = cos((n+1) arccos x), with roots where Tn+1(x) = 0

Solving, we get:

Chebyshev nodes are calculated using the formula xk = cos((2k+1)π / (2(n+1))), k = 0, …, n
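Step 1 in code, checking that each node really is a root of T_{n+1}:

```python
import math

# Chebyshev nodes: the roots of T_{n+1}(x) = cos((n+1) * arccos x),
# i.e. x_k = cos((2k + 1) * pi / (2 * (n + 1))) for k = 0..n.

def chebyshev_nodes(n):
    return [math.cos((2 * k + 1) * math.pi / (2 * (n + 1)))
            for k in range(n + 1)]

n = 4
nodes = chebyshev_nodes(n)

# Every node is a root of T_{n+1}:
for x in nodes:
    t_next = math.cos((n + 1) * math.acos(x))
    assert abs(t_next) < 1e-12

# The nodes all lie strictly inside [-1, 1], crowding toward the ends:
assert all(-1 < x < 1 for x in nodes)
```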

Step 2:

Find the yk values for all Chebyshev nodes (i.e. the x values obtained from step 1)

Step 3:

(Method 1): Using Lagrange interpolation to guarantee that the approximating polynomial passes through all the found nodes, Pn(xk) = yk

(Method 2): Using Chebyshev series representation

Step 4:

Coefficients are computed using the following formulas and conditions:

( I have not understood the math so understand yourself, I will do later)

Step 5:

Interpolation polynomial ( approx function) will be

(I have not understood the math so understand yourself, I will do later)

But note that after getting the general form of the polynomial in step 2, it is further calculated to get the evaluating formula that is in the LLM reference below.

Step 6 (optional):

Calculating the error using the following formula

In short:

  • Choose degree n -> n means n+1 coefficients & n+1 nodes
  • Compute the Chebyshev nodes, i.e. xk
  • Find yk
  • Find the Chebyshev coefficients cj and construct the polynomial, i.e. Pn(x)
  • Approx function: f(x) ≈ Pn(x), using the explicit Chebyshev series form formula to calculate the value for any x
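The whole recipe in the cosine form, as a sketch. Since the note's own coefficient formula was not written out, the standard discrete version is assumed here: with N = n+1 nodes x_k = cos(θ_k), θ_k = π(k+0.5)/N, take c_j = (2/N)·Σ_k f(x_k)·cos(j·θ_k) and evaluate P(x) = Σ_j c_j·T_j(x) − c_0/2:

```python
import math

# Chebyshev interpolation in the cosine form (standard discrete formulas).

def chebyshev_interpolate(f, n):
    N = n + 1
    thetas = [math.pi * (k + 0.5) / N for k in range(N)]       # node angles
    ys = [f(math.cos(t)) for t in thetas]                      # Step: y_k
    coeffs = [2.0 / N * sum(y * math.cos(j * t)
                            for y, t in zip(ys, thetas))
              for j in range(N)]                               # Step: c_j

    def p(x):
        # T_j(x) = cos(j * arccos x) on [-1, 1]; c_0 enters with weight 1/2
        a = math.acos(x)
        return sum(c * math.cos(j * a)
                   for j, c in enumerate(coeffs)) - coeffs[0] / 2

    return p

# Degree-2 interpolation reproduces f(x) = x^2 exactly on [-1, 1]:
p = chebyshev_interpolate(lambda x: x * x, 2)
for x in [-0.8, -0.3, 0.0, 0.5, 0.9]:
    assert abs(p(x) - x * x) < 1e-9
```

Exact reproduction of low-degree polynomials is a quick sanity check; for general f the result is only an approximation whose error shrinks as n grows.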

In code form

Both in recursive form. Take reference from the handwritten note

a. Polynomial Form

b. Cosine Form

Application of Chebyshev Interpolation in the Dengue Case Interpolation

Given:

Nepal’s District 2023 year dengue outbreak data for each month 

i.e  Jan - 1234 case Feb - 453 case ……………… Dec - 213 case

And things to note:

F**k, this can’t be applicable, dude. I myself am getting confused.

And the reason is below

Note

Chebyshev interpolation = method to construct a polynomial that approximates the original function well, especially at edges, without wild oscillations.  The original function graph is “approximated” by this polynomial, not exactly reconstructed.

Not applicable for our scenario, dude.

Hold on guys, though Chebyshev nodes can’t be generated in our scenario, using Chebyshev polynomials as a basis in our scenario can only reduce round-off errors and instability in computation, which means using the existing 12 months of case data and using Chebyshev polynomials instead of powers of x.

What the F**k again. Dude, my LLM is hallucinating, yaar, and forcing me to hallucinate too.

Ok, let’s do one thing: plot graphs for the normal polynomial built from powers of x, linear interpolation, Chebyshev interpolation, and cubic spline too, all at once.

Before that, some basics are needed; actually, I am confused.

a. Linear Interpolation

As discussed above, a straight line joining two points is linear interpolation.

b. Polynomial Interpolation

A single polynomial function that passes through a set of points (through each point). The Lagrange polynomial is one example of it. It introduces Runge oscillation.

So, in our case, fixed monthly cases mean equally spaced nodes, so no Chebyshev nodes can be created and the method is not applicable. Even if we find the Chebyshev polynomial and plot it, it behaves more like polynomial interpolation. That’s why the plots are overlapping, with clear Runge oscillation at both edges.

3. Rational Spline

It is a foundational concept in computer graphics. Standard parametric splines use standard polynomials to define their shape and can’t perfectly represent the conic sections: circles, ellipses, parabolas, and hyperbolas.

Simply, a rational function is the ratio of two polynomials.

Before diving into the rational spline, let’s have a look at Bezier curves and B-splines.

Bezier Curves

It is used in computer graphics. The pen tool in Figma uses it.

For n+1 control points P0, …, Pn, the Bezier curve is defined as B(t) = Σ (i=0 to n) C(n,i) (1−t)^(n−i) t^i Pi, for t in [0,1].

This curve depends on the control points. Changing one point can change the whole structure of the curve.

In points:

  • Global control: a single control-point change affects the whole curve
  • The curve passes through the first and last points
  • The entire curve lies within the convex hull of the control points
  • The degree is n for n+1 control points
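The properties above can be checked by evaluating the curve with de Casteljau's algorithm (repeated linear interpolation between neighbouring control points); the 2D control points below are made up:

```python
# Evaluate a Bezier curve at parameter t with de Casteljau's algorithm:
# repeatedly lerp between neighbouring points until one point remains.

def bezier_point(points, t):
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

control = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 0.0)]   # n+1 = 4 points

# The curve passes through the first and last control points:
assert bezier_point(control, 0.0) == (0.0, 0.0)
assert bezier_point(control, 1.0) == (4.0, 0.0)

# Interior control points generally do NOT lie on the curve; they only
# pull on it, and the curve stays inside their convex hull:
x, y = bezier_point(control, 0.5)
assert 0.0 < x < 4.0 and 0.0 < y < 3.0
```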

B-spline

B-spline is also called basis spline; it is a combination of piecewise polynomial segments (all of the same degree) joined smoothly. It is controlled by control points and a knot vector.

Attention Is All You Need

“One of the most popular and revolutionary research papers out there. Today’s AI is almost 70% because of this paper.”

I assume that, because of what I have heard about it.

If that is so, then let’s crack this paper and go deeper to understand the core concept that this paper has introduced to the world.  

Warning

This paper is an absolute rabbit hole for beginners.

And yes, I am going deep into the hole.

Cracking “Abstract”

Since an abstract gives a concise summary of what the paper is trying to achieve, or say, the gist of the paper, there is nothing to do but first understand some concepts that are introduced in this section.

And they are:

I. Sequence Transduction Models

Sequence transduction means converting one sequence into another sequence. For example, the seq2seq model is a type of sequence transduction model.

Tip

Before diving deep into seq2seq, get familiar with the concepts of Long Short-Term Memory and Word Embedding & Word2Vec

seq2seq model consists of two components:

a. Encoder: 

b. Decoder:

Go Deeper into Sequence Transduction Models

II. Recurrent Neural Network

III. Convolution Neural Network

IV. Attention Mechanism

V. Transformer

What did this paper achieve?

Getting Started

Below are the course contents that will be noted in this particular parent section “Deep Learning Course Content”

Course Content


Unit 1: Foundations & Applied Math (8 Hours)

  • Introduction and History: Motivation for Deep Learning; Historical trends; Success stories.
  • Linear Algebra & Probability: Tensors, Eigendecomposition, Information Theory, and Numerical Optimization.
  • Bayesian Decision Theory: Making optimal decisions under uncertainty, inference vs. decision, and loss functions for classification/regression.
  • Machine Learning Basics: Capacity, Overfitting/Underfitting, Hyperparameters, and the Bias-Variance tradeoff.

Unit 2: Deep Networks & Training Optimization (12 Hours)

  • Deep Feedforward Networks: Multilayer Perceptrons (MLP); Gradient-Based Learning; Backpropagation and the Chain Rule.
  • Modern Regularization: L1/L2 penalties, Dropout, Early Stopping, and Dataset Augmentation.
  • Optimization & Normalization: SGD, Momentum, Adam Optimizer; Batch Normalization and Layer Normalization.

Unit 3: Convolutional Networks & Computer Vision (10 Hours)

  • The Convolution Operation: Motivation, Pooling, and the Neuroscientific basis for CNNs.
  • Modern Vision Architectures: Residual Networks (ResNets), Inception, and Deep CNN variants.
  • Advanced Vision Tasks: Object Detection (YOLO/SSD), Semantic Segmentation, and the U-Net architecture.

Unit 4: Sequence Modeling & The Attention Revolution (10 Hours)

  • Recurrent Neural Networks: RNNs, the Vanishing Gradient problem, and Gated Units (LSTM and GRU).
  • The Attention Mechanism: Self-Attention, Multi-head Attention, and the “Attention is All You Need” paradigm.
  • The Transformer Blueprint: Encoder-Decoder architecture, Positional Encoding, and scaling to Large Language Models (LLMs).

Unit 5: Frontiers: Generative & Graph Models (8 Hours)

  • Autoencoders & Latent Spaces: Undercomplete autoencoders and Representation Learning.
  • Generative AI: Variational Autoencoders (VAEs) and Diffusion Models.
  • Graph Neural Networks: Message Passing, Node Embeddings, and Graph Convolutional Networks (GCNs).

Notation

Unit 1: Foundations & Applied Math