Wow, I’m just starting to learn about neural networks, and your post really helped me approach the topic from a historical perspective on how we got here. Thank you so much for this!
I struggled with two important parts of this post:
- Why introduce multiple parallel convolutions and then concatenate their outputs? I followed the part about reducing feature depth with bottlenecks, but not the motivation for the parallelism itself.
- What does the input bypass (skip connection) introduced with ResNet actually do? I didn’t understand how f(x) + x translates into a better network.
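To show where my understanding currently stands on the first question: a 1x1 convolution is just a per-pixel matrix multiply across channels, so the shape arithmetic of an Inception-style block can be sketched in plain NumPy. The channel counts below are made up for illustration, and the 1x1 multiplies only stand in for the real 3x3/5x5 branches:

```python
import numpy as np

def conv1x1(x, out_channels, rng):
    # A 1x1 conv mixes channels at each spatial position:
    # (H, W, C_in) -> (H, W, out_channels)
    w = rng.standard_normal((x.shape[-1], out_channels))
    return x @ w

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 64))  # input feature map, 64 channels

# Parallel branches, each bottlenecked to a smaller depth first:
b1 = conv1x1(x, 16, rng)                   # "1x1 branch"
b2 = conv1x1(conv1x1(x, 8, rng), 24, rng)  # stand-in for 1x1 -> 3x3
b3 = conv1x1(conv1x1(x, 8, rng), 24, rng)  # stand-in for 1x1 -> 5x5

# Concatenation works because every branch keeps the spatial size,
# so the outputs can simply be stacked along the channel axis.
out = np.concatenate([b1, b2, b3], axis=-1)
print(out.shape)  # (8, 8, 64): 16 + 24 + 24 channels
```

Is the idea that each branch sees the input at a different receptive-field size, and concatenation lets the next layer pick whichever scale is useful?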
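And for the second question, here is the toy picture I have of f(x) + x so far (the functions are made up; I’m only trying to capture the arithmetic, not a real layer):

```python
import numpy as np

def residual_block(x, f):
    # output = f(x) + x: the block only has to learn the *change*
    # to apply to x. If f outputs zero, the block degrades to an
    # identity map instead of destroying the signal.
    return f(x) + x

x = np.array([1.0, 2.0, 3.0])

# If the learned transformation is zero, the block is a no-op:
zero_f = lambda v: np.zeros_like(v)
print(residual_block(x, zero_f))   # [1. 2. 3.]

# A small learned correction nudges x rather than replacing it:
small_f = lambda v: 0.1 * v
print(residual_block(x, small_f))  # [1.1 2.2 3.3]
```

Is that the point, that learning a small correction (or nothing at all) is easier than learning the full mapping from scratch, so very deep stacks stop getting worse?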
Just sharing my thoughts as a beginner – it would be great if you could elaborate on these two points in the post.
Thank you so much for putting this together!