Summary of Presentation by Prof. James Hoffman on High-Level Vision

Prepared by Tim Simpson (additions and comments by Frawley)

Vision can be understood at various levels: low-level vision, such as the computation of lines, surfaces, and edges, and high-level vision, such as the computation of recognizable objects. Hoffman's presentation is concerned with the latter -- in particular, basic-level objects. Following Rosch's work, these are objects whose categories sit in the middle of category hierarchies: e.g., in the hierarchy of animals, `dog' is a basic-level category, somewhere between the broad category `animal' and the narrower category `poodle.'
Vision is such a large part of brain activity that 33 areas in the brain have been connected to it. Moreover, visual processing accounts for 50% of cortical activity. Clearly vision is an essential function of the brain's hardware.

The primary function of visual processing is object recognition. This is made difficult by a number of factors: multiple appearances of the object, differences in lighting conditions and viewing angle, occlusion, and so on. The question in cognitive science is: how can the brain find a method so that an object can be recognized under almost all conditions?

Note: this question has other versions for other domains: how can a brain find a method so that a word, smell, sound, person, number, melody, etc. can be recognized under almost all conditions? Answers: store all instances and retrieve for a match? store a schematic and retrieve that but fill it in for a match? and so on.

One theory is that the brain stores an image, or template, of every object it sees. It then attempts to match objects in the visual field to these templates. This theory fails in a couple of places. One is the possibility of memory limitations. As there are more possible neural connections in a single human brain than there are hydrogen atoms in the universe, this may or may not be a problem. A larger failing of this theory is that it does not account for the recognition of a known object in a unique setting.
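
The weakness of pure template matching can be seen in a toy sketch. Everything here is an assumption made for illustration (the 3x3 binary images, the names mismatch and recognize, and the max_mismatch tolerance); it is not a model given in the presentation.

```python
# Toy illustration of template matching on tiny binary images.
# Images are tuples of strings; all names and data are hypothetical.

def mismatch(image, template):
    """Count the cells where two equally sized grids differ."""
    return sum(
        a != b
        for image_row, template_row in zip(image, template)
        for a, b in zip(image_row, template_row)
    )


def recognize(image, templates, max_mismatch=0):
    """Return the label of the closest stored template, or None if none is close enough."""
    best_label, best_score = None, None
    for label, template in templates.items():
        score = mismatch(image, template)
        if best_score is None or score < best_score:
            best_label, best_score = label, score
    if best_score is not None and best_score <= max_mismatch:
        return best_label
    return None


TEMPLATES = {
    "vertical bar": ("010", "010", "010"),
    "horizontal bar": ("000", "111", "000"),
}

SEEN_BEFORE = ("010", "010", "010")
SHIFTED = ("100", "100", "100")  # the same bar, moved one column to the left

print(recognize(SEEN_BEFORE, TEMPLATES))              # vertical bar
print(recognize(SHIFTED, TEMPLATES))                  # None
print(recognize(SHIFTED, TEMPLATES, max_mismatch=4))  # horizontal bar
```

An exact matcher fails on the shifted bar, and loosening the tolerance produces the wrong label, because a pixel-wise comparison has no notion of the same object appearing in a new position.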

A second theory is that the brain decomposes objects into invariant pieces and stores these along with their relationship to each other.

That is, vision is a feature-detection complex.
Visual neurons are organized in a hierarchy. An individual visual neuron monitors only a very small area of the visual field. Within this area it is sensitive to contrast: if the whole area is uniformly light or dark, it will not respond. Further, cells have been found to be feature-sensitive. For example, some cells respond only to horizontal lines, and others respond only to motion in a single direction. The higher up the hierarchy of cells, the larger the area being monitored. The next level breaks objects into parts (what Biederman calls geons) that are viewpoint invariant. This system allows the brain to recover a 3D image from the 2D image on the retina and also to create a long-term memory model of an object.
It has been theorized that there are thirty of these geons and that they can be assembled into 30,000 basic-level objects.
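
The contrast sensitivity and orientation selectivity of low-level cells can be illustrated with a small sketch. The 3x3 weights, the rectification at zero, and the name response are assumptions chosen for illustration, not measurements or a model from the presentation.

```python
# Toy sketch of a contrast-sensitive, orientation-selective unit.
# The weights sum to zero, so a uniformly lit patch produces no response.

HORIZONTAL_LINE_WEIGHTS = (
    (-1, -1, -1),
    (2, 2, 2),
    (-1, -1, -1),
)


def response(patch, weights=HORIZONTAL_LINE_WEIGHTS):
    """Weighted sum over a 3x3 patch, clipped at zero like a firing rate."""
    total = sum(
        pixel * weight
        for patch_row, weight_row in zip(patch, weights)
        for pixel, weight in zip(patch_row, weight_row)
    )
    return max(0, total)


UNIFORM_LIGHT = ((1, 1, 1), (1, 1, 1), (1, 1, 1))
UNIFORM_DARK = ((0, 0, 0), (0, 0, 0), (0, 0, 0))
HORIZONTAL_LINE = ((0, 0, 0), (1, 1, 1), (0, 0, 0))
VERTICAL_LINE = ((0, 1, 0), (0, 1, 0), (0, 1, 0))

print(response(UNIFORM_LIGHT))    # 0
print(response(UNIFORM_DARK))     # 0
print(response(HORIZONTAL_LINE))  # 6
print(response(VERTICAL_LINE))    # 0
```

The unit is silent for uniform light or dark and for a line of the wrong orientation, and responds only when a horizontal line falls in its small patch.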

Note: this is different from face recognition, which, though visual, is not "object visual." For one thing, faces are apparently recognized by parts and clusters and have a preferred order of presentation (upright). For another, face and object processing can be dissociated: i.e., disrupted differentially by brain trauma. There are face recognition disorders that preserve object recognition and vice versa. See the paper by Farah in the Osherson volume on vision.
At the highest level of the hierarchy, there are thought to be clusters of cells that respond to particular objects. For example, experiments on primate brains have found clusters of cells that respond only to a vertical hand.

One of the most influential current theories of object recognition is the geon theory (Biederman). According to this theory, the visual system constructs ideal schematic objects out of 2D features and uses these objects as the invariant core for recognizing the variety of objects presented to the visual system (e.g., a generalized cylinder might underlie the recognition of body parts, lamps, etc.). It has been theorized that thirty of these geons can be assembled into 30,000 basic-level objects.
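
A rough count shows why a vocabulary of thirty geons could plausibly cover 30,000 objects. The figure of 10 spatial relations and the assumption that parts are joined in a simple chain are hypothetical choices for this sketch; only the thirty geons and the 30,000 objects come from the presentation.

```python
# Back-of-the-envelope count of possible geon arrangements.
# N_RELATIONS and the chain arrangement are assumptions, not figures from the talk.

N_GEONS = 30
N_RELATIONS = 10  # hypothetical number of ways two adjacent parts can be joined
TARGET_OBJECTS = 30_000


def structural_descriptions(n_geons, n_relations, n_parts):
    """Count ordered chains of n_parts geons with one relation between neighbours."""
    return n_geons ** n_parts * n_relations ** (n_parts - 1)


two_part = structural_descriptions(N_GEONS, N_RELATIONS, 2)
three_part = structural_descriptions(N_GEONS, N_RELATIONS, 3)

print(two_part)                     # 9000
print(three_part)                   # 2700000
print(three_part > TARGET_OBJECTS)  # True
```

Under these assumptions, two-part objects alone fall short of 30,000, but three-part objects exceed it by a wide margin.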

Question: is this part of underdetermined visual knowledge? Is this part of the minimal visual core that can be scaled up into full objects, just as there is underdetermined and minimal linguistic, musical, mathematical, and face knowledge?
One of the problems with the geon theory is that, so far, it is not known how the brain divides objects into geons. Another problem goes back to the heart of the visual processing problem: whether object recognition actually is viewpoint independent, as the geon theory suggests.

One clear example of where viewpoint does heavily influence recognition times is face recognition. Humans are better at recognizing faces that are upright rather than inverted. This implies that there is another system at work in face recognition. It also suggests another theory that has its basis in templates: that every view of an object is stored in memory. When a unique situation is experienced, the brain tries to find the closest image already stored. This implies that the more experience someone has with an object, the more easily they recognize it in a new setting.
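
The view-based account can be sketched by treating each stored view as a viewing angle and taking the distance to the nearest stored view as a stand-in for recognition difficulty. The angle representation, the function names, and the example view sets are assumptions for illustration only.

```python
# Toy sketch of view-based recognition: difficulty grows with the angular
# distance between a novel view and the closest view already in memory.

def angular_distance(a, b):
    """Smallest rotation, in degrees, between two viewing angles."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def nearest_view_distance(novel_angle, stored_angles):
    """Distance from a novel view to the closest stored view."""
    return min(angular_distance(novel_angle, s) for s in stored_angles)


LITTLE_EXPERIENCE = [0]               # only the upright view has been stored
MORE_EXPERIENCE = [0, 90, 180, 270]   # several views have been stored

print(nearest_view_distance(180, LITTLE_EXPERIENCE))  # 180
print(nearest_view_distance(135, LITTLE_EXPERIENCE))  # 135
print(nearest_view_distance(135, MORE_EXPERIENCE))    # 45
```

With only an upright view stored, an inverted view (180 degrees) is as far from memory as possible, and storing more views shrinks the distance to any new one, which matches the claim that experience makes recognition in new settings easier.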

Currently, template/prototype and feature-complex theories are the major competitors for explanation.