Mike's computer vision presentation

This project is an attempt to relate some techniques in machine vision to those of cognitive science.


The creation of machine vision systems allows rigorous testing of whether some strategy hypothesized for a biological system really works at solving an important problem in vision. We want to get from an intensity array (a scanned/digitized photograph or a CCD image) to a representation that will support highly flexible visual cognition and behavior.

Some simple facts about the human eye we want to keep in mind

Take a look at pictures of the human visual system, the structure of the human eye, and the retina.

Rod and cone cells of the retina, when exposed to light, trigger chemical reactions that in turn generate nerve impulses sent on to the brain.
Rods work better at low light intensities (and sense only brightness, not color), while cones require higher intensities (and can sense color).
The visual cortex is where the actual image processing takes place, involving things such as edge detection, texture analysis, motion analysis, and image enhancement.
The resolution of the human eye is finite; as discussed under depth perception below, it is set by the diameter of the pupil and the wavelength of light.
Human image processing is highly parallel (unlike the algorithms we have developed in this class thus far!).

Two levels in the vision process.

We want to concentrate on the low-level vision process, because it relates most directly to what we can do in computer vision.
It turns out that early visual computation (the low-level vision process) is highly parallel and local. (The retina and primary visual cortex seem wired up to perform these computations.)
Most people assume a great deal of information can be extracted by bottom-up processing (possibly guided by top-down expectations about what is present in an image, but this concept is controversial).

Why do we "think" a Gaussian "filter" is useful? (Because we believe we have a priori, or "top-down", knowledge about the image: for example, that intensity varies smoothly over most surfaces, so local averaging suppresses noise without destroying structure.)
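
To make this concrete, here is a minimal sketch of Gaussian smoothing, assuming the image is a grayscale 2-D NumPy array (the sigma value is an arbitrary choice of mine, not from the notes):

    # Minimal sketch: Gaussian smoothing of a grayscale intensity array.
    # The built-in "top-down" assumption is that nearby pixels usually
    # belong to the same surface, so local averaging suppresses noise
    # more than it destroys structure.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smooth(image, sigma=2.0):
        """Blur with a Gaussian kernel; sigma sets the spatial scale."""
        return gaussian_filter(image.astype(float), sigma=sigma)

    noisy = np.random.rand(64, 64)        # stand-in for a digitized photo
    smoothed = smooth(noisy, sigma=2.0)
    print(noisy.std(), smoothed.std())    # the smoothed image varies less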


Depth perception

Stereopsis (binocular disparity): an object within a certain distance appears at a different place on the retina of each eye, and this disparity gives us a sensation of depth. As well, the resolving power of any light-collecting aperture (an eye or a telescope alike) is proportional to its diameter divided by the wavelength of the light.
Movement toward the eye makes an object appear to grow larger, and movement away makes it appear to shrink. We implemented growing and shrinking operations in class for character recognition, but they could just as well emulate the approaching and receding movements of objects in our field of view. Check these examples out: first is the original, second is after a shrinking operation, third is after a growing operation.
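
The class's exact growing and shrinking operations are not reproduced here, but a minimal sketch using morphological dilation and erosion, one common way to implement them on a binary image, might look like this:

    # Sketch: "grow" and "shrink" as morphological dilation and erosion.
    # This is one plausible reading of the class operations, not
    # necessarily the exact ones; it acts on a binary (0/1) image.
    import numpy as np
    from scipy.ndimage import binary_dilation, binary_erosion

    def grow(image, steps=1):
        return binary_dilation(image, iterations=steps)   # object expands

    def shrink(image, steps=1):
        return binary_erosion(image, iterations=steps)    # object contracts

    img = np.zeros((9, 9), dtype=bool)
    img[3:6, 3:6] = True                       # a 3x3 "object"
    print(img.sum(), grow(img).sum(), shrink(img).sum())
    # 9 pixels grow to 21 and shrink to 1 (with the default cross-shaped
    # neighborhood), mimicking an object approaching or receding.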

The physics of image formation constrains the structure of images, so bottom-up processes can be informative. For example, few objects in our visual world undergo frequent smooth inflation or deflation like a balloon; this implies that optical flow gives reliable information about depth and can be built into a visual system (expanding and shrinking operations again).
Binocular disparity gives us reliable information about the distances to surfaces, as long as we can resolve objects at that distance. Once again, resolving power is proportional to the diameter of the collecting aperture (be it eye or telescope) divided by the wavelength of the incident radiation (for humans, the visible spectrum). There has to be a disparity between the positions of corresponding features in the two retinal images, and that disparity depends not only on where the objects are located but also on where the eyes are fixated.
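
Neither relation is spelled out in the notes, but both are standard; here is a hedged sketch with illustrative numbers (the focal length, baseline, and pupil diameter below are assumptions, not measurements from the text):

    # Sketch: the standard pinhole-stereo relation depth = f * B / d,
    # which formalizes "disparity tells you distance".
    def depth_from_disparity(focal_px, baseline_m, disparity_px):
        """Focal length in pixels, baseline in meters, disparity in
        pixels -> depth in meters."""
        return focal_px * baseline_m / disparity_px

    # Roughly human-like values: ~6.5 cm between the eyes.
    print(depth_from_disparity(focal_px=800, baseline_m=0.065,
                               disparity_px=10))        # 5.2 m
    # Halving the disparity doubles the estimated depth.

    # The resolution limit mentioned above, for a circular aperture:
    # smallest resolvable angle ~ 1.22 * wavelength / diameter.
    theta = 1.22 * 550e-9 / 0.005      # green light, ~5 mm pupil
    print(theta)                        # ~1.3e-4 radians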


Intensity changes - Edge Detection

These are a basic source of information for low-level vision. Once again, both rapid and gradual intensity changes must be taken into account.

The earliest visual processes locate and represent the intensity changes in the image using local, parallel computations, which recall the windows we have used for edge detection and Gaussian filtering.

We can use a second-order difference operator (the gradient of the gradient) to detect edges, even gradual ones: peaks and troughs in the gradient appear as "zero-crossings", places where the second difference changes sign.
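
A minimal 1-D sketch of the idea (my own toy example):

    # Sketch: a 1-D second-difference operator and its zero-crossing.
    # An intensity ramp (a blurred step edge) gives a peak in the first
    # difference and a sign change in the second difference.
    import numpy as np

    I = np.array([10, 10, 10, 12, 16, 20, 20, 20], float)  # blurred step
    d1 = np.diff(I)         # first difference: the intensity gradient
    d2 = np.diff(I, n=2)    # second difference: gradient of the gradient
    # The edge is where the gradient peaks, i.e. where d2 crosses
    # from positive to non-positive.
    zc = np.where((d2[:-1] > 0) & (d2[1:] <= 0))[0]
    print(d1)   # [0. 0. 2. 4. 4. 0. 0.]
    print(d2)   # [0. 2. 2. 0. -4. 0.]
    print(zc)   # [2] -> the zero-crossing sits in the middle of the ramp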

We can use a 1-D operator to get information about intensity changes in one direction. Here is a very useful graphic.
One can use two or more 1-D operators to measure intensity change at two or more orientations.
One can instead use the Laplacian, which is sensitive to zero-crossings at all orientations. The Laplacian applied to a Gaussian yields a "Mexican hat" profile. For a given window size, this finds more zero-crossings and assigns them more accurate spatial locations than oriented operators do.
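
A hedged sketch of Laplacian-of-Gaussian edge detection by zero-crossings (the sigma and the small threshold are arbitrary choices of mine):

    # Sketch: Laplacian-of-Gaussian ("Mexican hat") edge detection.
    # Edges are marked where the filtered response changes sign between
    # two non-negligible values.
    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def log_zero_crossings(image, sigma=2.0, thresh=1e-6):
        r = gaussian_laplace(image.astype(float), sigma=sigma)
        s = np.signbit(r)
        strong = np.abs(r) > thresh   # ignore numerically flat regions
        zc = np.zeros(r.shape, dtype=bool)
        zc[:, :-1] |= (s[:, :-1] != s[:, 1:]) & strong[:, :-1] & strong[:, 1:]
        zc[:-1, :] |= (s[:-1, :] != s[1:, :]) & strong[:-1, :] & strong[1:, :]
        return zc

    img = np.zeros((32, 32)); img[:, 16:] = 1.0   # a vertical step edge
    edges = log_zero_crossings(img)
    print(np.where(edges[16])[0])   # edge marked near column 15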

The parallel processing mentioned earlier implies, in this case, that the window operations on each pixel are performed simultaneously. Networks of neurons appear to work in this way.

The eye versus computer edge detection

Some boundaries cannot be captured by zero-crossings or other intensity-based edge detection schemes.
Texture edges: the two sides of a boundary differ in texture rather than in average intensity.

Motion

There is considerable evidence for the existence of specialized neurophysiological circuitry for motion processing.
If you stare at a waterfall for a while and then look at the surrounding scenery, the scenery will appear to drift upward.
Try staring at a pinwheel for a while and then stopping it abruptly: its pattern will appear to move in the direction opposite to the one in which it was spinning.

This implies that there are direction-sensitive cells in the visual cortex.


Computational analysis of visual motion

First we need to measure the 2-D motion in the image.
Next we need to interpret this 2-D motion as 3-D structure. This can be partially done through expanding and shrinking methods.
Now we compute the 2-D vector (velocity) field V(x,y,t) from the changing image I(x,y,t).
Then we make initial local measurements at the zero-crossings detected by Mexican-hat-type spatial filters. Because the zero-crossings are correlated with physical features of the world, the motion measurements are correlated with them too. [Remember: zero-crossings are where intensity changes are steepest in the image.]
BUT we face an aperture problem: each cell looks through a small window, so it cannot recover the whole velocity field from local measurements alone. In fact it is well known that the human eye has channels of different sizes; this plot demonstrates that. The data points are the contrast sensitivity at different spatial frequencies, the arrow points to the frequency the subject was adapted to, and the heavy line is the normal limit of the human contrast sensitivity function. Adaptation selectively desensitizes the eye to particular spatial frequencies, which is strong evidence that the eye has channels of different sizes.
We must invoke the smoothness constraint: surfaces of objects are smooth relative to their distance from the observer, so smooth surfaces in motion lead to smooth velocity fields in the image. This constraint can produce motion illusions, such as the barber pole, and makes a combination of horizontal and vertical motion appear as diagonal motion.
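
The notes name the smoothness constraint but no particular algorithm; Horn and Schunck's classic scheme is one standard way to impose it. A minimal sketch (alpha and the iteration count are illustrative):

    # Sketch: Horn-Schunck optical flow, one classic way to impose the
    # smoothness constraint on the velocity field V(x, y, t).
    import numpy as np

    def horn_schunck(I1, I2, alpha=1.0, iters=100):
        I1, I2 = I1.astype(float), I2.astype(float)
        Ix = np.gradient(I1, axis=1)     # spatial derivatives
        Iy = np.gradient(I1, axis=0)
        It = I2 - I1                     # temporal derivative
        u = np.zeros_like(I1); v = np.zeros_like(I1)
        for _ in range(iters):
            # Local average of the flow: this is the smoothness term.
            u_bar = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                     np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
            v_bar = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                     np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
            # How badly the averaged flow violates brightness constancy.
            p = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
            u = u_bar - Ix * p           # pull toward constancy while
            v = v_bar - Iy * p           # staying close to the average
        return u, v

    f1 = np.zeros((20, 20)); f1[8:12, 8:12] = 1.0
    f2 = np.roll(f1, 1, axis=1)          # the square shifts one pixel right
    u, v = horn_schunck(f1, f2)
    print(u[9:11, 8:13].round(2))        # mostly positive: rightward flow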


Example: Barber Pole

The stripes appear to move downward. (a)
Each point is in fact moving horizontally. (b)
The smoothest velocity field turns out to be vertical. (d)
This implies that the motion computation involved is primitive and isolated from other information.

Example: diagonal motion

Horizontal moving stripes overlaid with vertical moving stripes give a plaid pattern that appears to move diagonally.

Primal Sketch

People interested in integrating results from psychology, AI, and neurophysiology consider the primal sketch to be one of the most interesting proposals concerning the earliest visual processes.
It gives an account of the way the physical properties of surfaces and reflected light determine the information in images that can be extracted quickly using low-level processes.
It contains a detailed theory of the very earliest visual processes which compute what is called the raw primal sketch.
The raw primal sketch is a first description of the zero-crossings detected by the operators, or channels, of different sizes. For example, a gradual intensity change may not be detected by the smallest channel, but it will show up in two or more larger channels (see the sketch after this list). There is physiological evidence for channels of different sizes.
It also contains a theory of grouping processes that operate on the raw primal sketch to produce the full primal sketch.
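
A hedged sketch of the channel idea: the same Mexican-hat operator at several scales. The sigma**2 factor is my addition (a standard scale normalization, not from the notes) so responses at different scales are comparable:

    # Sketch: "channels of different sizes" as the same Mexican-hat
    # (Laplacian-of-Gaussian) operator applied at several scales.
    import numpy as np
    from scipy.ndimage import gaussian_laplace

    # A gradual ramp edge: intensity climbs from 0 to 1 over 24 pixels.
    row = np.concatenate([np.zeros(20), np.linspace(0, 1, 24), np.ones(20)])
    img = np.tile(row, (16, 1))

    for sigma in (1, 2, 4, 8):
        # sigma**2 scale-normalizes the response so channels compare fairly.
        r = (sigma ** 2) * gaussian_laplace(img, sigma=sigma)
        print(sigma, float(np.abs(r).max()))
    # The normalized response grows with channel size: the gradual change
    # is barely visible to the smallest channel but clear in larger ones.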

A few notes on High-Level vision processes.

These complete the job of delivering a coherent interpretation of the image. It is assumed that low- and intermediate-level processes deliver a usefully segmented representation of the 2-D and 3-D structure of the image.
They determine what objects are present and how they are interrelated.
Do high-level processes assist the lower-level ones through a top-down flow of hypotheses about what is present in the image? (Many computer vision systems make extensive use of this kind of model-driven, or hypothesis-driven, top-down processing.)

References

Here are a few references I have used in producing this document.

  • "A guided tour of computer vision";(1993); Vishvjit S. Nalwa; ISBN:1-201-54853-4
  • "Machine Vision"; Ramesh Jain, Rangachar Kasturi, Brian G. Schunck;(1995); ISBN: 0-07-032018-7
  • "Visual Perception"; Tom N. Cornsweet;(1970)
  • "Cognitive Science: An Introduction";(1995); Neil A. Stillings, Steven E. Weisler, Christopher H. Chase, Mark H. Feinstein, Jay L. Garfield, and Edwina L. Rissland; ISBN:0-262-19353-1; Chapter 12.
  • "Foundations of Cognitive Science"; (1989); Ed: Michael I. Posner; ISBN 0-262-16112-5; Chapter 15.