Filip Piekniewski, Patryk Laurent, Csaba Petre, Micah Richert, Dimitry Fisher, Todd Hylton
(Submitted on 22 Jul 2016 (v1), last revised 1 Aug 2016 (this version, v2))
Understanding visual reality involves acquiring common-sense knowledge about countless regularities in the visual world, e.g., how illumination alters the appearance of objects in a scene, and how motion changes their apparent spatial relationship. These regularities are hard to label for training supervised machine learning algorithms; consequently, algorithms need to learn these regularities from the real world in an unsupervised way. We present a novel network meta-architecture that can learn world dynamics from raw, continuous video. The components of this network can be implemented using any algorithm that possesses certain key characteristics. The highly-parallelized architecture is scalable, with localized connectivity, processing, and learning. We demonstrate an implementation of this architecture where the components are built from multi-layer perceptrons. We use this implementation to create a system capable of stable and robust visual tracking of objects as seen by a moving camera. Results show performance on par with or exceeding state-of-the-art tracking algorithms. The tracker can be trained in either fully supervised or unsupervised-then-briefly-supervised regimes. Success of the briefly-supervised regime suggests that the unsupervised portion of the model extracts useful information about visual reality. The results suggest a new class of AI algorithms that can learn from and act within the real world.