Visual Feature Learning

Dissertation by Justus H. Piater, University of Massachusetts Amherst, February 2001

Abstract:

Humans learn robust and efficient strategies for visual tasks through interaction with their environment. In contrast, most current computer vision systems have no such learning capabilities. Motivated by insights from psychology and neurobiology, I combine machine learning and computer vision techniques to develop algorithms for visual learning in open-ended tasks. Learning is incremental and makes only weak assumptions about the task environment.

I begin by introducing an infinite feature space that contains combinations of local edge and texture signatures not unlike those represented in the human visual cortex. Such features can express distinctions over a wide range of specificity or generality. The learning objective is to select a small number of highly useful features from this space in a task-driven manner. Features are learned by general-to-specific random sampling. I demonstrate this approach on two different tasks, using closely related learning algorithms built on the same principles and the same feature space.
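To make the sampling idea concrete, here is a minimal sketch in Python. It assumes a compound feature is a set of primitive responses at relative offsets; the primitive names, offset range, and growth probability are illustrative assumptions, not the dissertation's actual representation.

```python
import random

# Illustrative primitive signatures; the actual feature space combines
# local edge and texture responses (the names here are assumptions).
PRIMITIVES = ["edge_0", "edge_45", "edge_90", "edge_135",
              "texel_a", "texel_b"]

def sample_feature(max_parts=4, grow_prob=0.5):
    """Draw one compound feature by general-to-specific sampling:
    start from a single primitive (most general) and, with some
    probability, compose it with further primitives at random
    relative offsets (increasingly specific)."""
    parts = [(random.choice(PRIMITIVES), (0, 0))]
    while len(parts) < max_parts and random.random() < grow_prob:
        offset = (random.randint(-8, 8), random.randint(-8, 8))
        parts.append((random.choice(PRIMITIVES), offset))
    return parts
```

Sampling begins with the most general hypothesis, a single primitive, and stochastically composes more specific compounds, mirroring the general-to-specific search described above.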

The first system incrementally learns to discriminate visual scenes. Whenever it fails to recognize a scene, new features are sought that improve discrimination. Highly distinctive features are incorporated into dynamically updated Bayesian network classifiers. Even after all recognition errors have been eliminated, the system can continue to learn better features, resembling mechanisms underlying human visual expertise. This tends to improve classification accuracy on independent test images, while reducing the number of features used for recognition.
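The error-driven loop can be summarized as follows. This is a schematic sketch only: classifier, sample_feature, and distinctiveness stand in for the dissertation's Bayesian network classifiers and feature evaluation, and the candidate count and threshold are assumed values.

```python
def learn_incrementally(images, labels, classifier, sample_feature,
                        distinctiveness, n_candidates=100, threshold=0.9):
    """Error-driven feature learning: on each recognition failure,
    sample candidate features and keep the most distinctive one."""
    for image, label in zip(images, labels):
        if classifier.predict(image) != label:
            # Seek new features only when recognition fails.
            candidates = [sample_feature() for _ in range(n_candidates)]
            best = max(candidates,
                       key=lambda f: distinctiveness(f, image, label))
            if distinctiveness(best, image, label) >= threshold:
                classifier.add_feature(best)   # assumed classifier API
        classifier.update(image, label)        # dynamic Bayesian update
```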

In the second task, the visual system learns to anticipate useful hand configurations for a haptically guided, dextrous robotic grasping system, much like humans pre-shape their hand during a reach. Visual features are learned that correlate reliably with the orientation of the hand. A finger configuration is then recommended based on the expected grasp quality of each candidate configuration.
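As a rough illustration of the recommendation step, the following sketch selects the configuration with the highest expected grasp quality under a predicted distribution over hand orientations; orientation_belief, configurations, and quality are hypothetical placeholders rather than the system's actual interfaces.

```python
def recommend_configuration(orientation_belief, configurations, quality):
    """Choose the finger configuration with the highest expected grasp
    quality under a belief (distribution) over hand orientations."""
    def expected_quality(config):
        # Average the quality of `config` over the predicted orientations.
        return sum(p * quality(config, theta)
                   for theta, p in orientation_belief.items())
    return max(configurations, key=expected_quality)
```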

The results demonstrate how a largely uncommitted visual system can adapt and specialize to solve particular visual tasks. Such visual learning systems have great potential in application scenarios that are hard to model in advance, e.g., autonomous robots operating in natural environments. Moreover, this dissertation contributes to our understanding of human visual learning by providing a computational model of task-driven development of feature detectors.