How does the image recognition work

Image understanding

... is one of the most complex computer science areas. Image processing and image understanding are closely related. Image understanding is more likely to be assigned to artificial intelligence. Understanding images is a process that outputs a series of descriptions. Of course, these depend extremely on the picture and the associated question.

Image understanding is the reconstruction and interpretation of a scene using images so that at least one of the following operational services can be achieved:

  • Output of a verbal description of the scene
  • Answering verbal queries to the system
  • navigate robots in the scene
  • systematic gripping and manipulation of objects in the scene
  • Output of warning signals in dangerous situations

The system model according to Marr


picturedigital raster image with radiometric properties of each pixel
primary sketchfirst impression - sensibly reduce the amount of data to the essentials
2 1/2 D sketchGeometric and photometric properties of the surface, such as depth information, contours or normal vectors, partial shape and geometry construction
3D modelIntegration of several 2 1/2 D sketches, decomposition into generalized cylinders, statements about hidden parts, description of scenes

Levels of representation

worldphysical objects with attributes, object configuration, movement of objects
scene3D section of the world and a specific point in time
picture2D image of a scene
Image descriptionBottum-up, 2D image elements
Scene descriptionInterpretation of the scene elements
Description of the worldTop - down with prior knowledge

... is the basis for feature extraction and image segmentation. Errors in the image should be corrected or the image processed or improved.

There are Point operations, local operations, and global operations. In the case of local operations, only a partial area is considered, which is usually chosen as a square. In contrast, in global operations, an output pixel is dependent on all input pixels, i.e. the entire original image.

Point operations

Stretching the gray scaleMultiplying a contrast and adding a brightness constant (+/-), thus inversion)
Linear expansion of the gray scaleThe gray scale is modified via contrast and brightness in such a way that the entire spectrum is used.
Binarization / threshold value generationSeparate certain objects from the background
Color transformationthe three color separations R, G, B are modified
Background subtractiona second image, where only the background was recorded, is subtracted)
MasqueradingExtraction of semantically significant image parts through a bit mask)
Geometric transformationwhen comparing images of different sizes, spatial coordinate transformation or similar.

local operations

Linear folding

A large number of filters (high and low pass filters) can be implemented with the convolution. An LxM matrix is ​​used to calculate the current image point depending on the surrounding points.

The linear convolution calculates the values ​​of the output image from a rectangular (the matrix H) environment of the input image. The properties of the convolution operation are determined by the convolution kernel (the matrix H)

In image processing, rows and columns of the matrix are always smaller than those of the input image. Points outside of the input image are filled according to various methods (area yes necessary for matrix operation)

Input image grayscale

Mean value operator (low-pass filter)

3x3 matrix filled with 1/9 for smoothing the gray values ​​.. reinforces low spatial frequencies and suppresses high ones. Thus, small gray value peaks disappear. Evenly bright parts of the image remain unchanged, but edges are blurred. This operator can be used to reduce the noise. However, other relevant information is lost in the process.

Output image through low-pass filter

Laplace operator (high pass filter)

The Laplace lifts Gray value differences and serves to highlight the edges). It suppresses low spatial frequencies and emphasizes high frequencies. This means that evenly bright parts of the image disappear. Edges are thus emphasized.

Output image through high-pass filter

Raising edges in a preferred direction

.. with a convolution matrix each for N, S, NE and SE directions

Sob operator

Difference formation for the next but one line in order to avoid small disturbances.

... small disturbances in neighboring rows / columns are not included in the result. The Sobel operator is also used to highlight edges.


The gradient is another measure of the non-uniformity in the local gray value distribution. The gradient operator is a formula that includes a convolution matrix (e.g. a Soebel operator). The gradient operator is thus a direction-independent operator, but it is not linear.

Precedence operations

Ranking operations are based on the fact that the N pixels in a given local environment are sorted into a pixel as a function of their gray value.

Minimal surgery

Removes peaks of high gray values ​​and blemishes, but enlarges spots of lower gray values)

Maximum operation

Removes small spots of low gray values, but enlarges peaks of high gray values

Median operation

Removes point-like structures without blurring, there are no new gray values, thin lines can disappear)

global operations

In order to solve a problem it is sometimes necessary to transform a mathematical object into another object and solve a related problem on that newly transformed object. This is used in particular in image processing, as there are different approaches that make calculations many times easier (discrete 2D Fourier transformation).

Mathematical morphology (study of form)

Example image for erosion and dilation

The input image is the binarized original image (threshold value generation)

Each raster image can be represented as a subset of Z ^ n. Binary images are described as subsets of Z ^ 2, consisting of the spatial coordinates of the image points. In contrast, gray value or time-invariant binary images are formed in Z ^ 3 from location coordinates and the time or gray value. Many complex image processing algorithms can be traced back to the following elementary operations.

Dilation (X + B)

The dilation adds a structural element B to an image X. Therefore it is also an image intensifying or image thickening operation. It is possible to combine groups of particles, to fill holes or to close cracks. A very simple type of edge detection is (X + B) \ X.

Image after multiple use of the dilatation (note the increase in pixels)

Erosion (X-B)

The erosion is the counterpart of the dilation and removes the matching pixels. The erosion thus reduces the image data.

In other words, the erosion finds pixels that are surrounded by elements with a special pattern. With de erosion, narrow areas and small objects, the size of which is smaller than that of structural element B, are completely eliminated. The erosion is also used to find edges.

Image after multiple applications of erosion (note the pixel reduction)

Multiple applied closing (note missing guitar neck)

The opening ((X-B) + B) and the closing ((X + B) -B) are derived from these two operations. The opening results in the elimination of subsets of image X that are small in relation to structural element B. That is, narrow connections or single set elements are deleted. On the other hand, small incisions and gaps are closed during the closing.


  • All or Nothing - Transformation (XxB)
  • Emaciation or scaling (X / (XxB))
  • thickening

These are iterating operations in which the same operation is used with several structure elements one after the other. B


Segmentation means dividing the image into meaningful image parts. The goals of segmentation include

  • Finding segmentation objects (objects with homogeneous properties)
  • Finding segment boundaries (differently light gray value areas so distinguishable)

Types of segmentation

  • point-oriented methods (foreground is separated by simple threshold value operations)
  • edge-oriented processes
  • region-oriented procedures
  • rule-based procedures (here the segment shape is known in advance)

Point-oriented segmentation

The simplest form of segmentation is that Threshold method. A pixel of the output image is set to 1 if the pixel of the input image exceeds a certain value. One or more extreme values ​​can often be seen in the histogram. The most sensible threshold values ​​lie between the extreme values, since the local minima represent the places which separate the foreground and background from one another. However, since many extreme values ​​(multimodal histograms) are visible in most images, overlaps naturally occur. Therefore, the generation of the threshold value is only sufficient for an acceptable segmentation in the rarest of cases.

Mathematical basics of image segmentation

Neighborhood relations

  • 4 - neighborhood (north, south, west, east)
  • Diagonal neighborhood (north-south, south-west, north-west, south-east)
  • 8 - neighborhood (4 - neighborhood and diagonal neighborhood), i.e. all neighboring points)
  • Neighborhood in a hexagonal grid (circles and lines can be better represented here)

Two points are connected if there is one path between these there. A non-empty subset of a neighborhood structure is called contiguous (area, region)if any two points of the subset are connected.

A set of areas of the neighborhood structure that are discrete from one another is called Disassemblywhen the union of all areas gives the whole picture again. The equivalence classes generated by the connectedness relation are called Components. The components are the maximally connected subsets of the image.

Determination of components

An algorithm is reminiscent of depth-first search. The components are found one after the other. Another is the line coincidence method.

Area-based segmentation

With the Growing region Procedure, the components are determined. Several starting points (seed cells) i, image can be defined from which the segmentation is to take place. If two areas should touch in the course of the algorithm, the homogeneity criterion is checked in order to possibly combine the areas into one.

Another approach is that Split and Merge Algorithm. With this algorithm, the image is recursively broken down into sub-areas in the split phase until each sub-area meets the homogeneity criterion. In the merge phase, those sub-areas that meet the same homogeneity criterion and are adjacent are combined into one area.

Edge-oriented segmentation

The aim of this type of segmentation is to find marginal edges (marginal paths or contours) of areas. This includes the following tasks:

  • Edge detection (you get a lot of suspicious points)
  • Edge thinning / skeletonization (the amount is reduced to a one-dimensional amount)
  • Edge tracking (to extend or close found edges)

A neighborhood graph shows in the form of a graph how the segments are in which hierarchy in the image. The surrounding graph, on the other hand, shows wonderfully which areas are included and how.

Model-dependent segmentation

  • Matching (comparing the picture with a template)
  • Hough transformation (finds straight lines by specifying vectors)

For the automatic recognition of images, methods are necessary that can extract the features of objects. These features are used for the characteristic description of objects and the essential support in object recognition. The prerequisite is a segmented image where the features can be calculated for individual segments. With the help of the characteristics obtained, one wants to be able to assign the segments to certain objects. To achieve this, two almost incompatible criteria are necessary:

  • they must be calculable, i.e. based on concrete images of objects
  • they must be independent of the peculiarities of a specific image, i.e. invariant with regard to displacement, rotation, expansion, perspective, lighting or obscuring

Geometric features

  • Area, circumference, length, width
  • Compactness, roundness (circumference to the power of 2 by 4Pi area)
  • Object circumscribing or inscribed n-corner
  • smallest circumscribing axially parallel rectangle (ferret box)
  • circumscribing rectangle of minimal area (minimum bounding rectangle - MBR)
  • Aspect Ratio (AR) - ratio height - width of a ferret box or MBR as a logaritm
  • Degree of filling, indicates in percentage how much area in the rectangle is filled with the object
  • convex hull as a rectangle
  • octagonal shell allows steps from rectangular angles (is thus eight-sided)
  • convex core


... are used to find invariant features. Moments of the first order are the features that are not translation-invariant, i.e. that are dependent on the position of the object in the image.

First order moments

  • Number of lines of the object
  • Number of columns of the object (are required to calculate the object's center of gravity)
  • Centroid coordinates (Rows or columns by area)

Centered moment (p, q) - th order are invariant to translation. The normalized centered moment (p, q) - th order is invariant to similarity transformation.

Run length coding

A Run is a triple (s, z, n), where S is the column, Z is the row and n is the length of a contiguous area of ​​horizontally adjoining object points. Under one Run length coding one understands a lot of such runs. With the help of the run length coding, the moments can be calculated very easily.

Euler number

The Euler number E is defined by E = B - L. B is the number of objects and L is the number of holes (background surrounded by an object). The Euler number is determined by moving several 2x2 masks over the image and using the respective hit frequencies to calculate the Euler number with the help of a formula.

Relational structures

... allow a shape description on the basis of local descriptive elements and their relationship to each other and are shown as graphs.

The aim is to find the objects found by segmentation, Meanings assign. Individual objects, groups or relationships between objects can be the basis. The basic principle is simple. Features are extracted from the calculated objects or areas, which are classified through training and known knowledge and given a specific meaning.

Numerical classification

An object becomes through a Feature vector described. In the case of a numerical classification, each vector is assigned to a specific class K.

Linear classification

Classification by a decision-making function

Distance classifiers

Classification by "geometric distance" of the object to the classesMinimum distance - Classifier as well Nearest - neighbor - classifier

statistical classification

Here the probability calculation with conditional probabilities is used. A classifier from this group is the Maximum likelihood classifier.

Context-dependent classification

An object is not classified here on its own, but in combination or in context with other objects. Thus a classification of structures (graphs with objects or edges as nodes and neighborhood relations) takes place here. There are two procedures:

  • Relaxation: Since the variety of possible graphs is large, an attempt is made by iteration to classify the objects so that the given relations are fulfilled.
  • Graph matching: There are a small number of known graphs. A graph to be classified is simply assigned to the closest one.

Discrete relaxation

It is made up of a segmentation of a Neighborhood structure and one Area neighborhood graphs went out. An iterative process tries to assign the meanings B in a meaningful way, taking the edges into account.

Continuous relaxation

Here, each object is assigned a probability that is intended to specify the affiliation to the corresponding classes. It is then iterated and these probabilities are adjusted in each pass. The neighboring objects are considered and checked whether the classification is compatible with the current object or not. If the neighboring objects are compatible, the corresponding probabilities of the class assignments are increased. If they are incompatible, the probabilities are reduced. When a certain condition is met (termination condition), the iteration is ended. the Shape reconstruction three-dimensional objects. The basis is visual input data from visual sensors (one or more cameras) that depict a static or dynamic scene.

  • static scenes (Light changes possible, but no object movements during image recording)
  • dynamic scenes (Object movements possible during recording)
  • static image acquisition (Cameras are fixed and immobile during recording)
  • dynamic scenes (Cameras can be moved freely)

Generation of a 2 1/2 D sketch

This sketch is an important intermediate step on the way to reconstructing the shape. It describes incomplete spatial information on the visible surfaces of an object. There are different approaches for creating the sketch. Some are closely oriented towards human perception. Most of these principles can be viewed in isolation, the term has become Shape from X (X stands for the respective method).

Possibilities of 2 1/2 D - representation

  • Depth images (the distance to the camera is coded in each pixel)
  • Surface patches (the orientation of the surface is coded for each pixel)
  • Discontinuities in the distance (e.g. as space curves)
  • Approximation (of continuous surfaces e.g. by B-splines)
  • Combinations of the forms of representation mentioned

Formal specification

The vector to the point to be recorded is the to measuring scene value. It is recorded via the three-dimensional environment of the sensor. This represents a mapping of a vector (x, y, z) into the real numbers. Scene values ​​can be gray or color values ​​for cameras or distances for lasers, among other things.

In simplified terms, an optical image A is nothing more than a projection of the scene S onto an image E, that is to say E = A (S).


Scene objects are now to be represented again as three-dimensional bodies on the basis of the images. So it is necessary to examine the inverse mapping of A. A clear assignment is, however, generally not possible. Because of this, the problem is almost always only limited solvable

  • only interesting and relevant information from the scene is considered
  • only a narrower selection of depth points of objects (depth analysis)

Three obvious limits of shape reconstruction

  • Is only possible for visible object surfaces.
  • Since the objects are not always recorded from all directions, only a 2 1/2 D sketch is usually possible.
  • Due to the image rasterization and the distance between the camera and the object, the spatial resolution of the object surface is limited.


Any three-dimensional objects can be recognized and localized on the basis of their contours. The prerequisite for this is that it is sufficient geometric models of the objects available. The Walz algorithm tries to interpret two-dimensional line drawings in three dimensions.

The Line drawing is a undirected graphwhich from node and Sections consists. During the 3d interpretation, the nodes and sections are assigned different meanings. (Marked by marks on the respective picture elements)

Image elements with meaning cannot be assigned indiscriminately. The existing relations (physical facts) between the meanings must be taken into account. The task now is to find consistent markings for a 2D line drawing or at least to greatly limit the initial set of markings by removing the inconsistent markings.


  • no shadows or break lines
  • all corner points are intersections of exactly three colliding surfaces of an object. The upper corner points of pyramids are not permitted.
  • General observation point, i.e. that with little movement of the eye (or camera) no intersection points change their type

Edge marks

There are four options:

  • "-"concave Edge with two visible faces
  • "+"convex Edge with two visible faces
  • "->" convex edge with a visible face in the direction of the arrow (here right)
  • "<-" convex edge with a visible face in the direction of the arrow (here left)

Node marks

Due to the incoming edges, nodes can also be provided with marks. However, only a small part (18) of the combinatorial set of the theoretically possible number is possible.

Since exactly three surfaces collide at each node, a node divides the 3D space into eight octants. If you now imagine one, two ... or seven octants filled in and the viewer in the free octant, then you get all possible node types. The case that only 2, 4 or 6 octants are filled in cannot occur.

There are four types of nodes:
  • "<"L.
  • "Y"fork
  • "->"arrow
  • "T"T (only possible with partial coverage)

However, the markers of the edges that converge in the respective nodes have not yet been taken into account. Physically, only 18 of the purely combinatorial possible 208 are possible.

Walz algorithm

Aim: Find one for the line drawing Assignment of brands and nodes so that each knot is one permissible marking in the sense of the 18 possesses and the brands of the compound Edge pieces agree.

Another assumption should be that every object in the room is only suspended, i.e. the background boundary lines are only arrow marks.

Notes on marking

  • There is only one type of arrow with arrow marks on the left and right edges. For such an arrow, the middle edge must be provided with a "+".
  • There is only one type of fork with some "+". For this fork, all edges must be marked with "+".

The knot marking algorithm Waltz

  • Make one List L with all nodes (intersections)
  • Remove until the list is empty ...
  • the first item in the list. Name the element "current node"
  • if the current node has not yet been visited, create one for it Set of knot markscontaining all possible labels for the node type. A Quantity change has thus taken place.
  • If any node label from the set of the current node with all Node in the set of any neighboring node is incompatible, eliminate that incompatible mark from the set of the current node. A Quantity change has thus taken place.
  • If a Quantity change took place, put every neighboring node that has a token set and is not in the list L at the beginning of the List L.

Without the assumption that objects hang freely in space, the interpretations would no longer be unambiguous. By introducing shadow such ambiguities can be eliminated. To do this, however, you need further edge markings for the shadows and break lines. The number of permissible markings then increases by leaps and bounds!

Source: Script by Dr. Johannes Steinmüller at the Technical University of Chemnitz

I wrote the program to be able to try out some filters. All of the above images are screenshots of the program. It is used to try out. A small text file with help is enclosed.

Image processing tool