What's wrong with convnets
Summary
A talk by Dr. Geoffrey Hinton listing the shortcomings of current convolutional networks and alluding to his work on capsule nets. Recording can be found here
- our neural nets have very little structure (unlike our brain!)
- neurons, layers, the whole net - that's it
- these nets are also lacking a notion of entities
- one way to define entities is through a group of neurons
- aka capsule
- aka mini-column
- one entity per capsule
- capsule? - a group of neurons that represents:
- probability of presence of a multi-dimensional entity it has been designed to search for
- and that entity's instantiation params
- these could be the pose of the object - location, orientation, velocity, deformation, etc
- these capsules are then connected to form multiple layers
- coincidence filtering
- a capsule receives a batch of multi-dimensional vectors from the capsules beneath it
- it looks for tight clusters in that batch
- if one is found, it outputs
- a high probability that an entity of "its" type exists in the batch
- the center of gravity of the cluster
- this works because, in high dimensions, coincidences very rarely happen by chance (see the sketch after this list)
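A minimal NumPy sketch of this idea. The Gaussian-style tightness score is my own assumption, not from the talk; the point is only the interface: pose votes in, (presence probability, centre of gravity) out.

```python
import numpy as np

def coincidence_filter(votes, sigma=0.1):
    """Toy capsule: given pose-vector votes from lower capsules,
    measure how tightly they cluster and return (probability, pose)."""
    center = votes.mean(axis=0)                      # centre of gravity
    spread = np.mean(np.sum((votes - center) ** 2, axis=1))
    prob = np.exp(-spread / (2 * sigma ** 2))        # tight cluster -> prob near 1
    return prob, center

rng = np.random.default_rng(0)
agreeing = rng.normal([1.0, 2.0, 0.5], 0.02, size=(8, 3))  # tight cluster
random_v = rng.uniform(-3.0, 3.0, size=(8, 3))             # chance votes

print(coincidence_filter(agreeing))  # high probability, pose ~ [1.0, 2.0, 0.5]
print(coincidence_filter(random_v))  # probability near 0
```

The gap between the two cases only widens as the dimensionality grows, which is exactly the point about coincidences.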
- he believes in convolutions, but not pooling. He provides 4 arguments:
- Point 1: pooling doesn't model the psychology of shape perception
- humans use rectangular coordinate frames to perceive shapes.
- we probably even have a hierarchy of such frames used for the final perception
- This plays a vital role in our perception!
- convnets have no notion of this
- Hinton demonstrated this using an experiment with a tetrahedron sliced into two pieces!
- another demonstration was the use of mental rotation
- eg: a tilted letter 'R' and deciding its correct handedness
- the relation between object and viewer is represented by a bunch of active neurons
- Point 2: pooling doesn't solve the right problem
- convnets aim for invariance
- this is understandable, since the label is invariant to viewpoint
- however, it's better to aim for equivariance
- changes in viewpoint lead to corresponding changes in neural activities
- place-coded equivariance - PCE
- a different capsule represents the object as it translates
- eg: convnets without pooling (w.r.t. translated images; see the sketch after this list)
- rate-coded equivariance - RCE
- for very slight translations, the same capsule represents it
- but its instantiation params change
- lower-level PCE gets converted into higher-level RCE
- at lower capsules
- most of it is PCE
- only small changes cause RCE
- at higher capsules
- most of it is RCE
- only very large changes cause PCE
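A tiny 1-D demonstration of place-coded equivariance versus pooling-induced invariance (a made-up example, not from the talk): shifting the input shifts the conv feature map by the same amount, while a global max-pool throws that position away.

```python
import numpy as np

def conv1d_valid(x, k):
    # plain 'valid' cross-correlation, stride 1
    return np.array([np.dot(x[i:i + len(k)], k)
                     for i in range(len(x) - len(k) + 1)])

x = np.zeros(12)
x[3] = 1.0                         # an "edge" at position 3
k = np.array([1.0, -1.0])          # a simple edge filter

out = conv1d_valid(x, k)
out_shifted = conv1d_valid(np.roll(x, 2), k)

# place-coded equivariance: the response moves with the input (2 -> 4)
print(np.argmax(np.abs(out)), np.argmax(np.abs(out_shifted)))
# global max pooling: both give the same value -- position is lost
print(out.max(), out_shifted.max())
```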
- Point 3: it fails to take advantage of the underlying linear manifold (eg: computer graphics)
- eg: comp-graphics
- it already represents objects in the rectangular coordinate frame
- that manifold of object representation is globally linear.
- convnets don't exploit this property!
- instead, they collect and train on data of objects under many different viewpoints/poses
- thus need a lot of data to train such models
- figuring out the manifold through the objects' pose, location, translation, deformation, etc. is a much better approach than the convnet one
- it also means we'll need much less data to train our models
- basically, we should design our nets to perform inverse graphics!
- literally the inverse of what graphics pipelines do
- this then exploits the underlying linear manifold (see the sketch after this list)
- obviously, applies only to computer vision problems
- this approach of coincidence filtering is more akin to the Hough transform
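To make the linearity claim concrete, here is a small sketch (my own illustration): in graphics a pose is a matrix, and relating a part to a whole, or changing the viewpoint, is a single matrix product, i.e. exactly linear at any viewpoint.

```python
import numpy as np

def pose2d(tx, ty, theta):
    """2-D pose as a homogeneous transformation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  -s,  tx],
                     [s,   c,  ty],
                     [0.0, 0.0, 1.0]])

whole = pose2d(2.0, 1.0, np.pi / 6)      # pose of the whole object
part_rel = pose2d(0.5, 0.0, 0.0)         # part's pose relative to the whole

part = whole @ part_rel                  # part's pose in the image frame

# a viewpoint change multiplies every pose by the same matrix --
# the part/whole relation (part_rel) is untouched, no retraining needed
view = pose2d(-1.0, 0.0, np.pi / 4)
assert np.allclose((view @ whole) @ part_rel, view @ part)
```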
- Point 4: pooling is a primitive way to do routing
- convnets handle this through pooling by picking the most active neuron
- certainly a primitive way of doing routing!
- a much better approach:
- route the info in images to the neurons that can best make sense of it!
- route info dynamically based on agreement provided by upper capsules
- an upper capsule requests more input from the lower capsules that vote for its cluster, and less from the others (see the sketch after this list)
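A simplified sketch of routing by agreement (my reconstruction of the idea; the details differ from any published algorithm): coupling coefficients start uniform, and lower capsules whose votes agree with an upper capsule's consensus get routed more strongly to it.

```python
import numpy as np

def route(votes, n_iter=3):
    """votes[i, j] = lower capsule i's predicted pose for upper capsule j."""
    n_lower, n_upper, _ = votes.shape
    logits = np.zeros((n_lower, n_upper))
    for _ in range(n_iter):
        # each lower capsule spreads its output over the upper capsules
        c = np.exp(logits)
        c /= c.sum(axis=1, keepdims=True)
        # upper capsule pose = coupling-weighted mean of its incoming votes
        pose = np.einsum('ij,ijd->jd', c, votes) / c.sum(axis=0)[:, None]
        # agreement (dot product) increases future routing to that capsule
        logits += np.einsum('ijd,jd->ij', votes, pose)
    return c, pose

rng = np.random.default_rng(1)
votes = rng.normal(size=(6, 2, 4))
votes[:4, 0] = [1.0, 0.0, 0.0, 0.0]   # four lower capsules agree on capsule 0
c, pose = route(votes)
print(c.round(2))                     # agreeing capsules route mostly to 0
```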
- proof of concept (on the MNIST dataset)
- pixel intensities to primary capsules
- i/p -> conv layer patch -> logistic layer -> capsules
- this is done for each patch in the i/p image
- all patches, however, share the same weights (similar to conv layer)
- 2nd layer
- poses of each capsule from each patch vote for poses of each o/p class
- these layers are linear (see the globally linear manifold argument)
- to take translation into account, the x,y coordinates of the patch are added to the first 2 pose params
- no. of transformation matrices = #types x #classes (see the sketch below)
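A sketch of this voting step (all sizes hypothetical; the patch-offset detail is from the talk): each capsule type holds one learned matrix per output class, and the patch's coordinates are added to the first two pose params before voting.

```python
import numpy as np

n_types, n_classes, d = 32, 10, 16                 # hypothetical sizes
W = np.random.randn(n_types, n_classes, d, d)      # #types x #classes matrices

def votes_for_patch(poses, patch_xy):
    """poses: (n_types, d) capsule poses extracted from one patch.
    Returns votes of shape (n_types, n_classes, d)."""
    p = poses.copy()
    p[:, :2] += patch_xy                           # account for patch translation
    return np.einsum('tcij,tj->tci', W, p)         # linear, per the manifold argument

votes = votes_for_patch(np.random.randn(n_types, d), patch_xy=(3.0, 5.0))
```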
- how to detect agreements?
- model the votes with a mixture of a Gaussian and a uniform distribution
- use EM to estimate the mean/variance of the Gaussian
- typically converges in a few iterations, it seems
- the score is the log-probability of all votes under the mixture minus their log-probability under the uniform alone
- apply a softmax over the per-class scores for the final prediction (a sketch of this scoring follows the list)
- our brain doesn't do such clustering to find agreements!
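A sketch of the agreement test (one Gaussian plus a uniform background; the box bounds and initialization are my assumptions): EM fits the mixture to a class's votes, then the score compares the mixture's log-likelihood against the uniform-only log-likelihood.

```python
import numpy as np

def agreement_score(votes, n_iter=10, lo=-5.0, hi=5.0):
    """Fit a Gaussian + uniform mixture to votes via EM; return
    log p(votes | mixture) - log p(votes | uniform only)."""
    n, d = votes.shape
    u = (1.0 / (hi - lo)) ** d                     # uniform density on the box
    mu, var, w = votes.mean(0), votes.var(0) + 1e-3, 0.5

    def gauss_pdf(mu, var):
        z = np.sqrt((2.0 * np.pi * var).prod())
        return np.exp(-0.5 * (((votes - mu) ** 2) / var).sum(1)) / z

    for _ in range(n_iter):                        # typically converges quickly
        g = gauss_pdf(mu, var)
        r = w * g / (w * g + (1.0 - w) * u)        # E-step: responsibilities
        mu = (r[:, None] * votes).sum(0) / r.sum() # M-step: mean, variance,
        var = (r[:, None] * (votes - mu) ** 2).sum(0) / r.sum() + 1e-3
        w = r.mean()                               # ... and mixing weight

    mix_ll = np.log(w * gauss_pdf(mu, var) + (1.0 - w) * u).sum()
    return mix_ll - n * np.log(u)                  # large score = tight cluster
```

Computing this score once per class and applying a softmax over the scores gives the prediction, matching the bullets above.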
- his prediction: if we can use unsupervised learning to come up with the primary capsules, then we will need much less data
- aka "derendering stage"
- this has to be highly non-linear
- one idea is to use the autoencoder approach (sketched after this list)
- decoder tries to reconstruct the image based on each of the capsules
- encoder then tries to learn how to map pixel intensities to capsules!
- the outputs of these primary capsules are concatenated into a single vector
- then N factor analyzers are applied to this vector
- we'll get a mixture of factor analyzers
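A minimal PyTorch sketch of the autoencoder idea (all sizes hypothetical, and it uses a single shared decoder rather than one renderer per capsule as the talk suggests): the encoder maps pixels to capsule probabilities and poses, the decoder "renders" the image back, and reconstruction error trains both ends.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; only the overall scheme is from the talk
N_CAPS, POSE_D = 30, 6

class CapsuleAutoencoder(nn.Module):
    def __init__(self, n_pix=28 * 28):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_pix, 256), nn.ReLU(),
            nn.Linear(256, N_CAPS * (1 + POSE_D)))   # prob + pose per capsule
        self.decoder = nn.Sequential(                # the "rendering" half
            nn.Linear(N_CAPS * POSE_D, 256), nn.ReLU(),
            nn.Linear(256, n_pix), nn.Sigmoid())

    def forward(self, x):
        caps = self.encoder(x).view(-1, N_CAPS, 1 + POSE_D)
        prob = torch.sigmoid(caps[..., :1])          # presence probability
        pose = caps[..., 1:] * prob                  # gate pose by presence
        return self.decoder(pose.flatten(1))

model = CapsuleAutoencoder()
x = torch.rand(4, 28 * 28)
loss = nn.functional.mse_loss(model(x), x)           # reconstruction loss
loss.backward()
```

The mixture of factor analyzers would then be fit on the concatenated capsule outputs; that stage isn't sketched here.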