Artificial Vision with Digital Retinas

CMOS Imaging

For about 30 years, CCD has been the prevailing technology in image capture. Independently, CMOS has grown into the leading general-purpose solid-state technology, accounting for 90% of all chips manufactured today, from powerful microprocessors to RAM and ROM memory chips. CCD and CMOS technologies are both based on silicon, a semiconductor that is naturally photosensitive in the visible light spectrum. At the beginning of the nineties, the MOS transistor became small enough to fit discretely within a 10 µm pixel, thus turning CMOS into a potential alternative imaging technology. Thanks to novel circuit techniques developed during the nineties, in particular to deal with noise issues, the quality of CMOS imagers has improved to nearly match that of CCDs.
Unlike CCDs, however, CMOS imagers feature an architecture similar to that of DRAMs, with random access to pixels. One key advantage of CMOS over CCD imagers is that the complicated driving of large-swing clocks is no longer necessary, which saves much power and avoids specific external drivers. In the pixel of CMOS imagers, MOS transistors were first used as read/reset pass gates, then as amplifying devices to improve noise performance. But they have also proven very useful alongside the imaging array, to implement other functions on the same chip. In fact, CMOS technology allows the implementation of smart imaging systems on a single chip, combining all the functions needed from photon capture to the output of digital bits. Today, with the MOS transistor in the deep-submicron range, CMOS imagers with a 5 µm pixel pitch are invading the low-end imaging market, e.g. webcams, and are even found in professional digital cameras. More information is available from the web sites of companies that have pioneered the field, in particular VVL and Photobit.

From CMOS imaging to CMOS vision

Today, images are mainly captured to be presented to human observers, remotely or later in time. From this perspective, CMOS imaging will certainly have a large societal impact. But the need for digital images can only be as large as the human ability to exploit and absorb them (not to mention storing and communicating them). Human beings might actually be saturated soon. In that sense, the true imaging revolution is yet to come, and it is that of real-time artificial vision. There is a need for compact (on a chip) and low-power vision systems able to understand what they are looking at. And the corresponding market is much larger than that of image production. Typical applications are related to automatic surveillance, identification, human assistance, and autonomous robotics.
A CMOS circuit is precisely a place where light sensing (the photodiode) and intelligence (the MOS transistor) can meet. A practical way to build a CMOS vision system on a chip could be to combine a digital CMOS imager (including ADC) with some microprocessor or DSP cores. But is it the best architecture? Our academic role is to deal with this issue in the most fundamental way.

CMOS vision with CMOS retinas

We note that one of the main difficulties faced today by computer architects lies in the fracture between memory and computing. Looking at nature, a remarkable feature is that animal vision always intimately combines sensing and processing. These clues (and others) make us believe that the above architecture is not so good. They rather argue for "smart pixels" able to process pixel data on-site, to some extent. Arrays of such smart pixels deserve to be called "artificial retinas", owing to their similarity with biological retinas in this intimate coupling of sensing and processing. However, whereas biological retinas are dedicated to specific visual tasks, CMOS artificial retinas can be made versatile by using simple yet universal digital computing resources in the pixel, iteratively controlled through external programming. For the past ten years, we have been investigating programmable digital artificial retinas and their use in vision systems. Henceforth, programmable digital artificial retinas will simply be called "digital retinas".

Digital CMOS retinas

A digital retina is thus an imaging array in which each pixel contains an analog-to-digital converter and a tiny digital processor. Yet analog circuitry remains very useful in digital retinas, not only for pixel-level ADC, but also for the compact and low-power implementation of the tiny digital processor, and possibly even for some purely analog processing that would be too impractical digitally. A digital retina is essentially a periodic array, where the periodic cell is the pixel itself, or possibly a small cluster of pixels sharing common hardware resources.
Without any high-frequency, long-distance, high-capacitance, power-consuming data transfer, a digital retina is meant (a) to capture images if placed in the focal plane of a lens, (b) to convert them into digital format, (c) to store some of them, (d) to perform various elementary computations on the corresponding arrays of pixel data and (e) to aggregate these huge data sets into compact - possibly scalar - forms, called image descriptors, to be used externally. All these tasks are performed on external request, that is, through external programming.
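Purely as an illustration, the following Python/NumPy sketch emulates this chain of tasks in software; the function and array names, the 1-bit quantization and the simple gradient measure are hypothetical stand-ins for operations that the retina would carry out in place, pixel by pixel:

    import numpy as np

    def emulate_retina_frame(analog_image, threshold=0.5):
        # (a)-(b) capture and pixel-level A/D conversion, here a 1-bit quantization
        binary_frame = (analog_image > threshold).astype(np.uint8)

        # (c) store the frame in local pixel memory (one register plane)
        register_plane = binary_frame.copy()

        # (d) elementary per-pixel computation: a crude gradient against the
        # East neighbor, obtained by shifting the register plane
        east = np.roll(register_plane, -1, axis=1)
        edges = register_plane ^ east

        # (e) aggregate the whole array into a compact scalar descriptor
        return int(edges.sum())

    # usage: a random "analog" frame standing in for the optical input
    frame = np.random.rand(128, 128)
    print(emulate_retina_frame(frame))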
The tiny digital processor must therefore be universal enough to execute the most general class of computations. Under drastic area constraints - it sits inside the pixel - it must definitely be a "highly reduced instruction set" processor. This tiny digital processor combines memory, computing and communication resources. Communications are either (a) local, among neighboring pixels (the NEWS network: North, East, West, South), or (b) global, for controlling the retina array and extracting compact image descriptors, or (c) regional, to allow the efficient manipulation of objects in middle-level vision.
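As a rough software analogy of the NEWS network, a local communication step amounts to every pixel simultaneously reading the bit held by its North, East, West or South neighbor; the sketch below (names are illustrative, and array edges are simply wrapped) emulates such a read with an array shift:

    import numpy as np

    def news_read(plane, direction):
        # Emulate a SIMD "read neighbor" instruction over a register plane:
        # every pixel simultaneously receives the bit held by its neighbor
        # in the given direction (edges wrap here for simplicity).
        shifts = {"N": (1, 0), "S": (-1, 0), "W": (0, 1), "E": (0, -1)}
        dy, dx = shifts[direction]
        return np.roll(np.roll(plane, dy, axis=0), dx, axis=1)

    plane = np.random.randint(0, 2, (128, 128), dtype=np.uint8)
    north_values = news_read(plane, "N")   # each pixel now sees its North neighbor's bit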
The basic control mode of digital retinas is SIMD (Single Instruction Multiple Data), with the same instruction performed at the same time in every pixel. This type of massive parallelism was extensively investigated in the seventies and eighties, then abandoned in favor of more flexible schemes that ease the use of off-the-shelf components. However, using SIMD in a digital retina is a completely different story. In particular, there is no need for a high-bandwidth off-chip communication link. Besides, more flexible control modes can easily be implemented, in particular sub-resolution SIMD modes, a windowing mode, a single-pixel access mode, etc.
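To make these control modes concrete, the hypothetical sketch below applies one and the same single-bit instruction to every pixel (plain SIMD), then restricts it to a window of active pixels by masking, which is one way a windowing or sub-resolution mode can be emulated in software:

    import numpy as np

    def simd_invert(plane, activity_mask=None):
        # Plain SIMD: the same 1-bit instruction (here, a logical NOT) is
        # applied simultaneously to every pixel of the register plane.
        result = plane ^ 1
        if activity_mask is not None:
            # Windowing / sub-resolution mode: only pixels whose activity
            # bit is set actually take the instruction into account.
            result = np.where(activity_mask == 1, result, plane)
        return result

    plane = np.random.randint(0, 2, (128, 128), dtype=np.uint8)
    window = np.zeros_like(plane)
    window[32:96, 32:96] = 1          # a 64x64 window of active pixels
    inverted_window = simd_invert(plane, window)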
Image descriptors extracted from a digital retina are typically scalar measures obtained through global summation over all pixels, or lists of coordinates of pixels of interest. Whole images are not supposed to be output from a digital retina, except for test purposes.
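For illustration, these two kinds of descriptors might be emulated as follows; this is a software analogy only, the actual retina obtaining the global sum through dedicated on-chip summation rather than a software loop:

    import numpy as np

    def global_count(plane):
        # Scalar descriptor: global summation of a binary register plane,
        # e.g. the number of pixels currently marked as "of interest".
        return int(plane.sum())

    def interest_coordinates(plane):
        # List descriptor: coordinates of the pixels of interest,
        # to be handed over to the cortex for higher-level processing.
        ys, xs = np.nonzero(plane)
        return list(zip(ys.tolist(), xs.tolist()))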
How large and complex can a digital retina be? Using an outdated 0.8 µm CMOS technology, we have successfully designed and operated a 128x128 version with 5 binary registers per pixel. Using a not-so-far-ahead technology that could pack a billion transistors on a single die, a 512x512 retina could be built with a local pixel memory of several hundred bits.
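As a rough sanity check on the latter figure, assuming about six transistors per static memory bit: one billion transistors spread over 512 x 512 = 262,144 pixels leaves on the order of 3,800 transistors per pixel, so that devoting most of this budget to storage does yield a local memory of several hundred bits per pixel.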

Vision systems including a digital retina

In brief, a digital retina is an array processor with integrated optical input and external control. As such, it is a specialized computing unit, in the same way as a floating-point unit in a microprocessor. To pursue the analogy between biological and artificial vision, let us call "cortex" the set of resources that must be associated with an artificial retina to turn it into a vision system able to observe and understand scenes, up to action-oriented decision making. Functionally speaking, there is a master-slave relationship between cortex and retina: the cortex controls the digital retina in order to make it produce image descriptors useful for decision making. Computationally speaking, the cortex+retina association is a hybrid parallel system where the digital retina supports low- to middle-level vision - as long as the data retain a two-dimensional structure - while the cortex is in charge of middle- to high-level vision, the part of vision that is closer to artificial intelligence than to signal processing, and where data structures are no longer images but scalar values, lists, vectors, graphs, etc. The cortex is typically implemented using a standard microprocessor with enhanced I/O.
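This master-slave relationship can be caricatured in software as follows; it is a deliberately simplified sketch, and the class, method and descriptor names are hypothetical rather than an actual interface:

    import numpy as np

    class ToyRetina:
        # A toy stand-in for the digital retina: it holds a binary frame
        # and answers a few descriptor requests issued by the cortex.
        def __init__(self, size=128):
            self.plane = np.random.randint(0, 2, (size, size), dtype=np.uint8)

        def execute(self, request):
            if request == "count":          # global-summation descriptor
                return int(self.plane.sum())
            if request == "centroid":       # coordinate-based descriptor
                ys, xs = np.nonzero(self.plane)
                return (float(ys.mean()), float(xs.mean()))
            raise ValueError(request)

    def cortex_loop(retina, n_frames=10):
        # The cortex (master) programs the retina (slave) and works only on
        # the compact descriptors it gets back, never on full images.
        for _ in range(n_frames):
            count = retina.execute("count")
            if count > 0:
                y, x = retina.execute("centroid")
                # ... high-level vision and decision making would go here ...

    cortex_loop(ToyRetina())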
The point of using a digital retina in a vision system is that the volume of data manipulated is much larger at low level than at high level. To a lesser extent, this is also true of the computational load. A vision system incorporating a digital retina is therefore expected to largely inherit the retina's high performance in terms of volume, weight, speed and power consumption. From the architect's viewpoint, the cortex+retina association is fairly insensitive to the memory/computing fracture mentioned earlier, at the retina level as well as at the system level.

Retinal algorithms

Thanks to the versatility of its components, the cortex+retina system is able to support a vast class of vision tasks. However, algorithms have to fit the architectural specificities of the system:
- Computations must be partitioned between cortex and digital retina so as to make the most of their respective computational characteristics, including those of the specific operators built into the retina to produce image descriptors.
- A digital retina is a massively parallel array processor, subject to the SIMD control mode or to its extensions. SIMD implies fully parallel algorithms, while the sub-resolution extensions favor scale-space frameworks.
- A digital retina does not use any external data memory, so with the CMOS technologies presently available, only a small number of bits can be stored in each pixel for use by the tiny digital processor. This puts much pressure on the algorithm designer to make procedures concise.
- Asynchronous circuits have a key role to play in the energy-efficient support of middle-level vision in digital retinas, in particular for regional communication purposes. This implies mixed synchronous/asynchronous algorithms, a new line of research.
- The availability of a random bit generator at the pixel level (derived from electronic noise) opens the door to low-power stochastic algorithms, such as those based on Markov random fields (a toy illustration follows this list).
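As a toy illustration of the last point (not one of the algorithms cited below), a per-pixel random bit plane can drive the stochastic relaxation of a binary field, each pixel adopting the majority value of its NEWS neighbors only when its random bit allows it:

    import numpy as np

    def stochastic_relaxation_step(plane, p=0.5):
        # Sum of the four NEWS neighbors for every pixel (edges wrap).
        news_sum = (np.roll(plane, 1, 0) + np.roll(plane, -1, 0) +
                    np.roll(plane, 1, 1) + np.roll(plane, -1, 1))
        majority = (news_sum >= 3).astype(np.uint8)      # local majority vote
        # Per-pixel random bit, as delivered by an in-pixel noise source.
        random_bits = (np.random.rand(*plane.shape) < p).astype(np.uint8)
        # Pixels whose random bit is set adopt the majority value; the others
        # keep their current state, which makes the update stochastic.
        return np.where(random_bits == 1, majority, plane)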
Finally, the retinal context imposes harsh programming constraints, but these constraints are also great opportunities, forcing the researcher's mind into new and fruitful dimensions of algorithmic research. In the retinal crucible, new concepts emerge that are often valuable on any standard computer. They are nevertheless targeted at the retinal context, where their real-time execution allows an accurate assessment of their contribution to the visual process. In the past years, our retina-inspired algorithmic research has focused on digital halftoning, structural pattern recognition, skeletonization, mathematical morphology and Markov-based motion detection, to mention only the most successful.
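Among these, mathematical morphology maps particularly directly onto the retina. As a hedged illustration only, a binary erosion by the cross-shaped NEWS structuring element reduces to each pixel ANDing its own bit with the bits of its four neighbors, i.e. a handful of SIMD instructions:

    import numpy as np

    def erode_news(plane):
        # Binary erosion by the 3x3 cross (NEWS) structuring element:
        # a pixel stays at 1 only if it and its four NEWS neighbors are 1.
        out = plane.copy()
        for axis, shift in ((0, 1), (0, -1), (1, 1), (1, -1)):
            out &= np.roll(plane, shift, axis=axis)
        return out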


Author: T. Bernard
Last update: December 17th, 2001