Review 2#

These notes were completed with the assistance of ChatGPT.

ToC#

  • Kernel Trick

  • Common Kernels

  • Proving a Valid Kernel

  • Discriminative Neural Network Models

  • Perceptron

  • Feedforward Neural Networks

  • Convolutional Neural Networks

  • Recurrent Neural Networks

  • Attention

SVM vs NN#

Question

What is the difference between Feedforward Neural Networks and SVMs in their approaches to handling non-linearly separable data?

Question

What are the advantages and disadvantages of manual feature engineering versus automatic data transformation in neural network models?

Answer

The advantages and disadvantages of manual feature engineering and automatic data transformation in neural network models:

| Aspect | Manual Feature Engineering | Automatic Data Transformation in Neural Network Models |
| --- | --- | --- |
| **Advantages** | | |
| Domain Knowledge Application | ✅ Uses domain expertise to create meaningful features. | ❌ Lacks domain-specific insights. |
| Model Interpretability | ✅ Leads to transparent and interpretable models. | ❌ Often results in “black box” models. |
| Efficiency | ✅ Simpler models might suffice. | ❌ Requires complex models and more computational resources. |
| Adaptability | ❌ Might not generalize across tasks. | ✅ Adaptable to various tasks without manual intervention. |
| End-to-end Learning | ❌ Typically not end-to-end. | ✅ Allows optimization from raw data to output. |
| Reduced Human Bias | ❌ Risk of introducing biases based on intuitions. | ✅ Reduces human biases in feature creation. |
| **Disadvantages** | | |
| Time Consumption | ❌ Requires significant time and effort. | ✅ Faster adaptation to new data/tasks. |
| Risk of Overfitting | ❌ Possible if features don’t generalize well. | ❌ Can overfit if not regularized or if data is limited. |
| Scalability | ❌ Needs repeated efforts for new tasks/data. | ✅ Scales better to different tasks without manual changes. |

Note: “✅” marks a strength and “❌” a weakness of the corresponding approach for that aspect.

Partial Derivative of Loss Function#

Question

Given the following neural network for binary classification, if we use binary cross-entropy as the loss function, compute the partial derivative of the loss \( L \) with respect to \( v_{ij} \), written as a formula that contains only \( x_i \), \( u_j \), \( w_j \), \( z \), and \( y \), where \( y \) is the correct label of the data. Assume we use the sigmoid activation function everywhere, shown as \( f \) in the figure below.

[Figure: nn1 — neural network for binary classification]
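A sketch of the answer, under the layer structure suggested by the variable names (an assumption, since the figure is not reproduced here): hidden units \( u_j = f\left(\sum_i v_{ij} x_i\right) \), output \( z = f\left(\sum_j w_j u_j\right) \), and binary cross-entropy \( L = -\left[ y \log z + (1 - y) \log (1 - z) \right] \). Applying the chain rule:

\[ \frac{\partial L}{\partial z} = \frac{z - y}{z(1 - z)}, \qquad \frac{\partial z}{\partial u_j} = z(1 - z)\, w_j, \qquad \frac{\partial u_j}{\partial v_{ij}} = u_j (1 - u_j)\, x_i \]

\[ \frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u_j} \cdot \frac{\partial u_j}{\partial v_{ij}} = (z - y)\, w_j\, u_j (1 - u_j)\, x_i \]

which contains only \( x_i, u_j, w_j, z, y \), using the sigmoid derivative \( f'(x) = f(x)(1 - f(x)) \) derived below.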

Derivative of Sigmoid Function#

Given:

\[f(x) = \frac{1}{1 + e^{-x}} \]

To find \( f'(x) \), differentiate with respect to \( x \):

\[f'(x) = \frac{-1 \cdot (-e^{-x})}{(1 + e^{-x})^2} \]
\[= \frac{e^{-x}}{(1 + e^{-x})^2} \]

Notice:

\[f(x) = \frac{1}{1 + e^{-x}} \]

So:

\[e^{-x} = \frac{1}{f(x)} - 1 \]

Replacing \( e^{-x} \) in the derivative expression and simplifying:

\[f'(x) = \frac{\frac{1}{f(x)} - 1}{\left(\frac{1}{f(x)}\right)^2} = f(x)^2 \left(\frac{1}{f(x)} - 1\right) = f(x) - f(x)^2 \]

So, the derivative of the sigmoid function is:

\[f'(x) = f(x)(1 - f(x)) \]

Tip

Use the quotient rule of differentiation: If you have a function in the form \( \frac{u(x)}{v(x)} \), then its derivative is:

\[ \frac{u'v - uv'}{v^2} \]
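As a quick numerical sanity check (my addition, not part of the original notes), the identity \( f'(x) = f(x)(1 - f(x)) \) can be compared against a central finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central finite-difference approximation of f'(x)
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - sigmoid_grad(xs))))  # ~1e-11, i.e. the two agree
```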

CNN#

Pooling#

Calculations#

(a) Size of Output Map

\[ O = \lceil \frac{W - K + 2P}{S} + 1 \rceil \]

Where:

  • \( O \) is the output size.

  • \( W \) is the width (or height) of the input.

  • \( K \) is the kernel (or filter) size.

  • \( P \) is the padding.

  • \( S \) is the stride.
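A small Python helper (my addition) that evaluates the formula above. It follows the ceiling convention written here; some frameworks (e.g. PyTorch's default convolution and pooling settings) floor this quantity instead:

```python
import math

def output_size(W: int, K: int, S: int = 1, P: int = 0) -> int:
    """Output width/height of a conv or pooling layer: ceil((W - K + 2P) / S) + 1."""
    return math.ceil((W - K + 2 * P) / S) + 1

# 64x64 input, 3x3 kernel, stride 1, padding 1 -> 64x64 output
print(output_size(64, 3, S=1, P=1))  # 64
# 64x64 input, 2x2 max-pooling, stride 2 -> 32x32 output
print(output_size(64, 2, S=2, P=0))  # 32
```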

(b) Number of parameters in each layer:

  • Parameters = \( K \times K \times C_{in} \times C_{out} \) (plus \( C_{out} \) bias terms if the layer has a bias); a small counting sketch follows this list.

    • \( K \) is the kernel size.

    • \( C_{in} \) is the number of input channels.

    • \( C_{out} \) is the number of output channels, i.e. the number of filters. E.g., for an input of 64 * 64 * 64 * 3, the last dimension (3) is the number of input channels.

  • Pooling: no learnable parameters in a max-pooling layer.

  • Fully connected (\( n \) inputs, \( m \) outputs):

    • \( n \times m \) (no bias)

    • \( (n + 1) \times m \) (with bias)
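A minimal sketch of these counts in Python (the layer sizes in the examples are made up for illustration, not taken from the assignment):

```python
def conv_params(K: int, C_in: int, C_out: int, bias: bool = True) -> int:
    """Learnable parameters of a K x K convolution: K*K*C_in*C_out (+ C_out biases)."""
    return K * K * C_in * C_out + (C_out if bias else 0)

def fc_params(n: int, m: int, bias: bool = True) -> int:
    """Learnable parameters of a fully connected layer with n inputs and m outputs."""
    return (n + 1) * m if bias else n * m

print(conv_params(3, 3, 64, bias=False))  # 3*3*3*64      = 1,728
print(conv_params(3, 3, 64, bias=True))   # 1,728 + 64    = 1,792
print(fc_params(1024, 10, bias=True))     # (1024 + 1)*10 = 10,250
```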

(c) To keep the receptive field of conv2 the same size while using fewer parameters:

  1. Use 1x1 Convolution (a.k.a. Network in Network): applying a 1x1 convolution reduces the depth (number of channels) before the 9x9 convolution is applied. For example, use a 1x1 convolution to reduce the channels to 32, then apply the 9x9 convolution.

  2. Use Dilated Convolution: Instead of the regular convolution, dilated (or atrous) convolution can be used with a smaller kernel size but with increased dilation rate to achieve the same receptive field.

  3. Factorized Convolution: Decompose the 9x9 convolution into two separate convolutions: one 9x1 convolution followed by a 1x9 convolution.

Note: Each of the above methods reduces the number of parameters while maintaining the same receptive field, but the exact effects on performance and on the number of parameters saved will vary; a rough comparison is sketched below.
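A rough back-of-the-envelope comparison in Python (the channel counts 64 -> 64, the bottleneck width 32, and the 5x5/dilation-2 choice are assumptions for illustration only; biases omitted):

```python
def conv_params(K_h: int, K_w: int, C_in: int, C_out: int) -> int:
    """Weights of a K_h x K_w convolution (bias omitted)."""
    return K_h * K_w * C_in * C_out

C_in, C_out, bottleneck = 64, 64, 32

baseline = conv_params(9, 9, C_in, C_out)                # plain 9x9 conv2
option_1 = (conv_params(1, 1, C_in, bottleneck)          # 1x1 bottleneck to 32 channels
            + conv_params(9, 9, bottleneck, C_out))      # followed by the 9x9 conv
option_2 = conv_params(5, 5, C_in, C_out)                # 5x5 kernel, dilation 2 -> 9x9 field
option_3 = (conv_params(9, 1, C_in, C_out)               # 9x1 conv ...
            + conv_params(1, 9, C_out, C_out))           # ... followed by 1x9 conv

print(baseline)  # 331,776
print(option_1)  # 167,936
print(option_2)  # 102,400
print(option_3)  #  73,728
```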