- If W is even slightly greater than 1 (elementwise), y_hat grows exponentially with the depth of the network — the activations and gradients explode.
- If W is even slightly smaller than 1, y_hat shrinks exponentially with depth — the activations and gradients vanish.
Either way, training becomes unstable and difficult.
- Setting W smaller than 1 causes the activations and gradients to decrease exponentially with depth.
- With vanishingly small gradients, Gradient Descent takes tiny steps, making the learning process extremely slow.
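A minimal NumPy sketch can make the exponential behavior concrete. Assuming (as a simplification not in the notes) a deep linear network where every layer uses the same scaled identity weight matrix, the output norm is multiplied by the scale at every layer:

```python
import numpy as np

def forward_norm(scale, depth=50, width=100, seed=0):
    """Push a random input through `depth` linear layers whose weights
    are a scaled identity, and return the norm of the output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    W = scale * np.eye(width)   # W slightly above or below the identity
    for _ in range(depth):
        x = W @ x               # norm is multiplied by `scale` each layer
    return np.linalg.norm(x)

print(forward_norm(1.5))   # grows like 1.5**50 — explodes
print(forward_norm(0.5))   # shrinks like 0.5**50 — vanishes
```

With only 50 layers, a scale of 1.5 already blows the output up by a factor of about 10^8, while 0.5 drives it below machine precision — which is exactly why depth amplifies small deviations of W from 1.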
There is no complete solution to this problem, but a partial one is to choose the initial scale of the weights carefully.
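One common such initialization is He initialization (scaling each weight matrix by sqrt(2 / n_in), suited to ReLU layers); the sketch below, with an illustrative 50-layer ReLU network not taken from the notes, shows that the activation magnitude then stays roughly stable across depth:

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He initialization: variance 2/n_in keeps the magnitude of
    # ReLU activations roughly constant from layer to layer.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
for _ in range(50):                 # 50-layer ReLU network
    W = he_init(x.size, x.size, rng)
    x = np.maximum(0.0, W @ x)      # neither explodes nor vanishes

print(np.linalg.norm(x))  # stays on the order of the input's norm
```

This does not eliminate the problem — it only starts training in a regime where activations and gradients are of reasonable size — which is why it is a partial rather than a complete solution.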