Summary

A talk given by Dr. Geoffrey Hinton listing the shortcomings of current convolutional networks and then alluding to his work on Capsule nets. Recording can be found here

  • our neural nets have very little structure (unlike our brain!)
    • neurons, layers, the whole net - that's it
  • these nets also lack a notion of entities
    • one way to define entities is through a group of neurons
      • aka capsule
      • aka mini-column
      • one entity per capsule
  • capsule? - a group of neurons that represents:
    • the probability that the multi-dimensional entity it was designed to detect is present
    • and that entity's instantiation params (see the sketch below)
      • these could be the object's pose - location, orientation, velocity, deformation, etc
  • these capsules are then connected to form multiple layers
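A minimal sketch of what a single capsule's output could look like, in NumPy; the class name, field names, and the example numbers are all illustrative, not from the talk:

```python
import numpy as np

class CapsuleOutput:
    """What one capsule emits: a presence probability plus instantiation params."""
    def __init__(self, probability: float, pose: np.ndarray):
        self.probability = probability  # p(entity of this capsule's type is present)
        self.pose = pose                # e.g. [x, y, orientation, scale, ...]

# hypothetical values, just to show the shape of the output
cap = CapsuleOutput(probability=0.93, pose=np.array([12.0, 7.5, 0.3, 1.1]))
print(cap.probability, cap.pose)
```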
  • coincidence filtering
    • a capsule receives a batch of multi-dimensional vectors from the capsules beneath it
    • it looks for tight clusters in that batch
    • if one is found, it outputs
      • a high probability that an entity of "its" type exists in the batch
      • the center of gravity of the cluster
    • because, in high dimensions, coincidences very rarely happen by chance (a sketch follows this list)
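A hedged sketch of coincidence filtering over the votes arriving at one capsule. The talk only says "tight cluster implies high probability"; the specific spread-to-probability mapping used here (exponential of the negative mean squared distance) is an assumption for illustration:

```python
import numpy as np

def coincidence_filter(votes: np.ndarray):
    """votes: (n_votes, dim) array of pose vectors from lower-level capsules."""
    center = votes.mean(axis=0)                   # center of gravity of the cluster
    spread = np.mean(np.sum((votes - center) ** 2, axis=1))
    probability = np.exp(-spread)                 # tight cluster -> close to 1
    return probability, center

rng = np.random.default_rng(0)
tight = rng.normal(loc=[1.0, 2.0, 0.5], scale=0.05, size=(10, 3))  # real agreement
loose = rng.normal(loc=[1.0, 2.0, 0.5], scale=2.0, size=(10, 3))   # chance scatter
print(coincidence_filter(tight)[0])   # ~1: an entity of this type is present
print(coincidence_filter(loose)[0])   # ~0: no coincidence found
```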
  • he believes in convolutions, but not pooling. He provides 4 arguments:
  • Point 1: doesn't model the psychology of shape perception
    • humans impose rectangular coordinate frames on objects to perceive their shapes
    • we probably even have a hierarchy of such frames used for the final perception
    • This plays a vital role in our perception!
    • convnets have no notion of this
    • Hinton demonstrated this with a puzzle: a tetrahedron sliced into two pieces that people find surprisingly hard to reassemble!
    • another demonstration was mental rotation
      • eg: deciding the correct handedness of a tilted letter 'R'
    • the relationship between an object and the viewer is represented by a set of active neurons
  • Point 2: doesn't solve the right problem
    • convnets aim for invariance
      • that's somewhat reasonable, since the label is invariant to viewpoint
      • however, it's better to aim for equivariance
      • ie: changes in viewpoint lead to corresponding changes in neural activities (contrasted in a sketch after this list)
    • place-coded equivariance - PCE
      • a different capsule represents the object as it translates
      • eg: convnets without pooling (wrt translated images)
    • rate-coded equivariance - RCE
      • for very slight translations, the same capsule keeps representing the object
      • but its instantiation params change
    • lower-level PCE is translated into higher-level RCE
      • at lower capsules
        • most of it is PCE
        • only small changes cause RCE
      • at higher capsules
        • most of it is RCE
        • only very large changes cause PCE
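A toy demo of the invariance-vs-equivariance distinction on a 1-D signal (a made-up setup, not from the talk): max pooling throws away the entity's position, while a pose readout changes in step with the input, which is the equivariance capsules aim for:

```python
import numpy as np

signal = np.zeros(16)
signal[5] = 1.0               # an "entity" at position 5
shifted = np.roll(signal, 1)  # the same entity, translated by one step

# invariance: global max pooling gives identical outputs, the position is lost
print(signal.max(), shifted.max())        # 1.0 1.0

# equivariance: a pose readout (the entity's position) changes in step
# with the input, instead of being discarded
print(signal.argmax(), shifted.argmax())  # 5 6
```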
  • Point 3: fails to take advantage of the underlying linear manifold
    • eg: computer graphics
      • it already represents objects in the rectangular coordinate frame
      • that manifold of object representations is globally linear (sketched in code after this list)
      • convnets don't exploit this property!
      • they instead collect and train on data of objects seen under many different viewpoints
      • and thus need a lot of data to train such models
    • figuring out the manifold through objects' pose, location, translation, deformation, etc is a much better approach than the convnet one
    • it also means we'll need much less data to train our models
    • basically, we should design our nets to perform inverse graphics!
      • literally the inverse of the process that graphics pipelines perform
      • this then exploits the underlying linear manifold
      • obviously, applies only to computer vision problems
    • this coincidence-filtering approach is more similar to a Hough transform
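A small sketch of the globally linear manifold argument, using 2-D homogeneous coordinates as a graphics pipeline would; the specific pose values below are arbitrary. The viewer-to-part transform is just the matrix product of viewer-to-whole and whole-to-part, so a viewpoint change updates every part's pose by the same linear rule:

```python
import numpy as np

def pose(tx, ty, theta):
    """2-D pose as a homogeneous transformation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0., 0., 1.]])

viewer_to_whole = pose(3.0, 1.0, 0.4)  # where the whole object sits for this viewer
whole_to_part = pose(0.5, 0.2, 0.1)    # fixed part-whole relation, viewpoint-independent

# a new viewpoint only changes viewer_to_whole; every part's pose follows by
# the same matrix product, which is exactly the linearity convnets ignore
viewer_to_part = viewer_to_whole @ whole_to_part
print(viewer_to_part)
```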
  • Point 4: a primitive way to do routing
    • convnets handle routing through pooling, by picking the most active neuron
      • certainly a primitive way of doing routing!
    • much better approach
      • route the information in the image to the neurons that can best make sense of it!
      • route info dynamically, based on the agreement computed by upper capsules
      • an upper capsule requests more input from the lower capsules whose votes fall in its cluster, and less from the others (see the sketch after this list)
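A hedged sketch of routing-by-agreement for a single upper capsule: weights on the lower capsules' votes are repeatedly re-estimated from how close each vote lies to the weighted cluster mean. The exponential weighting and the fixed iteration count are simplifications; the published capsule papers use softmax- and EM-based variants:

```python
import numpy as np

def route(votes: np.ndarray, n_iters: int = 3):
    """votes: (n_lower, dim) pose votes arriving at one upper capsule."""
    weights = np.ones(len(votes)) / len(votes)  # start with uniform routing
    for _ in range(n_iters):
        mean = weights @ votes                  # weighted cluster center
        dist2 = np.sum((votes - mean) ** 2, axis=1)
        weights = np.exp(-dist2)                # agreeing votes get more say
        weights /= weights.sum()
    return weights @ votes, weights

rng = np.random.default_rng(1)
votes = np.vstack([rng.normal([1.0, 1.0], 0.05, (8, 2)),   # an agreeing cluster
                   rng.normal([5.0, -3.0], 1.0, (2, 2))])  # two outliers
mean, w = route(votes)
print(mean)  # ~[1, 1]; the outliers' routing weights shrink each iteration
```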
  • proof of concept (on mnist dataset)
    • pixel intensities to primary capsules
      • i/p -> conv layer patch -> logistic layer -> capsules
      • this is done for each patch in the i/p image
      • all patches, however, share the same weights (similar to conv layer)
    • 2nd layer
      • the pose from each capsule type in each patch votes for the pose of each o/p class
      • these layers are linear (see the globally linear manifold argument)
      • to take translation into account, the patch's x,y coordinates are added to the first 2 pose params
      • no. of transformation matrices = #capsule-types x #classes (see the sketch below)
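A minimal sketch of how one primary capsule in one patch could vote for an output class's pose, assuming a 6-D pose vector and a randomly initialized (in practice, learned) transformation matrix; all shapes and names here are illustrative:

```python
import numpy as np

POSE_DIM = 6
rng = np.random.default_rng(2)
W = rng.normal(size=(POSE_DIM, POSE_DIM))  # one learned type->class transform

def vote(capsule_pose: np.ndarray, patch_xy: tuple) -> np.ndarray:
    p = capsule_pose.copy()
    p[0] += patch_xy[0]  # add the patch's x,y coordinates to the first
    p[1] += patch_xy[1]  # two pose params, so votes account for translation
    return W @ p         # a purely linear vote for the class's pose

print(vote(rng.normal(size=POSE_DIM), (3, 5)))
```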
    • how to detect agreements?
      • use a mixture of a gaussian and a uniform distribution
      • use EM to estimate the mean/var of the gaussian
        • typically converges in a few iterations, it seems
      • the agreement score is the log-probability of all the votes under the mixture minus their log-probability under the uniform alone
      • apply a softmax over the per-class scores to make the final prediction (a sketch follows this list)
    • our brain doesn't do such clustering to find agreements!
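A hedged sketch of the agreement score described above: fit a mixture of one Gaussian and a uniform to one class's votes with a few EM iterations, then score the class by the log-probability of the votes under the mixture minus their log-probability under the uniform alone. The uniform density value, mixing weight, and variance floor are assumptions:

```python
import numpy as np

def gaussian_density(votes, mu, var):
    g = np.exp(-0.5 * np.sum((votes - mu) ** 2 / var, axis=1))
    return g / np.sqrt(np.prod(2 * np.pi * var))

def agreement_score(votes, uniform_density=0.01, p_gauss=0.5, n_iters=5):
    """Log-prob of votes under gaussian+uniform mixture minus under uniform alone."""
    mu, var = votes.mean(0), votes.var(0) + 1e-3
    for _ in range(n_iters):
        g = gaussian_density(votes, mu, var)
        r = p_gauss * g / (p_gauss * g + (1 - p_gauss) * uniform_density)  # E-step
        mu = (r[:, None] * votes).sum(0) / r.sum()                         # M-step: mean
        var = (r[:, None] * (votes - mu) ** 2).sum(0) / r.sum() + 1e-3     # M-step: var
    g = gaussian_density(votes, mu, var)
    mix_ll = np.log(p_gauss * g + (1 - p_gauss) * uniform_density).sum()
    uni_ll = len(votes) * np.log(uniform_density)
    return mix_ll - uni_ll  # softmax over the classes' scores gives the prediction

rng = np.random.default_rng(3)
print(agreement_score(rng.normal([0.0, 0.0], 0.05, (20, 2))))  # high: tight agreement
print(agreement_score(rng.uniform(-5, 5, (20, 2))))            # low: no agreement
```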
  • his prediction is that if we can use unsupervised learning to come up with the primary capsules, then we will need much less labeled data
    • aka "derendering stage"
    • this has to be highly non-linear
    • one idea is to use the autoencoder approach
      • the decoder tries to reconstruct the image from each of the capsules' outputs
      • the encoder then learns how to map pixel intensities to capsules (sketched below)
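A hedged PyTorch sketch of the autoencoder idea for learning primary capsules: the encoder maps pixel intensities to per-capsule (probability, pose) blocks and a decoder tries to re-render the image from them. The layer sizes, capsule count, and the purely linear decoder are assumptions; the talk only gives the high-level idea:

```python
import torch
import torch.nn as nn

N_CAPS, POSE_DIM, IMG = 8, 4, 28 * 28   # assumed sizes for an MNIST-like input

class CapsuleAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: pixel intensities -> one (prob + pose) block per capsule
        self.encoder = nn.Sequential(
            nn.Linear(IMG, 128), nn.ReLU(),
            nn.Linear(128, N_CAPS * (1 + POSE_DIM)))
        # decoder: try to re-render the image from all capsule outputs
        self.decoder = nn.Linear(N_CAPS * (1 + POSE_DIM), IMG)

    def forward(self, x):
        caps = self.encoder(x).view(-1, N_CAPS, 1 + POSE_DIM)
        probs = torch.sigmoid(caps[..., :1])   # presence probabilities
        poses = caps[..., 1:]                  # instantiation params
        out = torch.cat([probs, poses], dim=-1).flatten(1)
        return self.decoder(out)               # reconstruction of the input

model = CapsuleAutoencoder()
recon = model(torch.rand(2, IMG))  # would be trained with e.g. MSE to the input
print(recon.shape)                 # torch.Size([2, 784])
```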
  • the outputs of these primary capsules are concatenated into a single vector
    • then 'N' factor analyzers are applied to these vectors
    • giving us a mixture of factor analyzers
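A minimal sketch of this last step using scikit-learn. There is no mixture-of-factor-analyzers class in scikit-learn, so this fits a single factor analyzer to made-up concatenated capsule vectors; N such analyzers, fit jointly with EM over soft cluster assignments, would form the mixture the talk describes:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
capsule_vectors = rng.normal(size=(100, 40))  # stand-in for concatenated outputs

fa = FactorAnalysis(n_components=5)  # one analyzer; N of these form the mixture
fa.fit(capsule_vectors)
print(fa.components_.shape)          # (5, 40) factor loading matrix
```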