Recent advances in machine learning, in particular deep learning, have revolutionized not only all kinds of image understanding problems in computer vision, but also the approach to general pattern detection problems for various signal processing tasks. Deep learning methods [1] can be applied to data that originates from almost any type of sensor, including image data from arbitrary modalities and most time-dependent data. Roughly speaking, in machine learning a very general computational model with a large number of free parameters is fitted to a specific problem during a training phase. The parameters are iteratively adjusted such that the computation performed by the model deviates minimally from a desired result. In the case of supervised learning, the desired computation is specified by a finite set of input-output pairs, the training data. The machine learning model attempts to interpolate or extrapolate between the training data points using some concept of smoothness, so that reasonable output can be predicted for data not in the training set. This is generally referred to as the ability of the model to generalize, which is evaluated on a second set of input-output pairs, the test data. The particular success of supervised learning approaches in computer vision primarily stems from the tremendous advances in the achievable accuracy for classification tasks. Provided that the computational model has sufficient capacity and the training data set is large enough, hierarchical neural networks in particular can approximate a very wide range of functions and can be trained efficiently using backpropagation [2].
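To make the supervised setting concrete, the following minimal sketch (in Python, using NumPy; all names and values are illustrative rather than taken from any particular system) fits a small parametric model to input-output pairs by iteratively minimizing the deviation from the desired output, and then evaluates generalization on held-out test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth: y = 3x + 1 plus noise; split into training and test pairs.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=x.shape)
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# A general model with free parameters (here simply a line) is fitted by
# iteratively minimizing the deviation from the desired output (mean squared error).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x_train + b
    grad_w = 2.0 * np.mean((pred - y_train) * x_train)
    grad_b = 2.0 * np.mean(pred - y_train)
    w, b = w - lr * grad_w, b - lr * grad_b

# Generalization is estimated on input-output pairs not seen during training.
test_mse = np.mean((w * x_test + b - y_test) ** 2)
print(f"fitted w={w:.2f}, b={b:.2f}, test MSE={test_mse:.4f}")
```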
However, the availability of training data is the main bottleneck of deep learning methods. For the task of general image understanding in computer vision, several standardized databases with millions of labelled images exist [3,4,5]. These databases have been created by joint efforts of the computer vision research community and constitute a considerable investment in machine learning research. For the application of deep learning methods to more specific problems, whether scientific or industrial, labelled training data from in-vivo sources (see textbox for definition) generally does not exist. In-situ creation of the data can be problematic for numerous reasons. (1) If the data acquisition involves expensive measurement equipment or sample preparation, the cost of generating sufficient quantities of training data can be prohibitive. (2) In many applications, there are ethical questions involved in acquiring training data, for example radiation exposure of patients or data from traffic accidents with human casualties. (3) Particularly for the case of semantic segmentation (per-pixel classification), labelling the training set can constitute a tremendous effort. (4) In many scientific applications, a phenomenon that was predicted from theory, but not yet observed, should be detected. In such cases, in-vivo and in-vitro training data is unavailable as a matter of principle.
An additional concern with in-vivo training data relates to the clustering of data around common phenomena. In most scenarios, certain situations occur far more frequently than others. In a production environment, most data will show undamaged parts, while actual defects are rare. Even defective parts typically have a heterogeneous distribution in which some defect types are common and others highly uncommon. Often, precisely the situations that are most relevant to detect occur only rarely.
Definition: in-vivo, in-vitro, and in-silico data
In-vivo data is captured from real-life situations that were not primarily created or modified for the purpose of capturing the data. Examples are video streams from autonomous vehicles driving through a city, black-box data of accidents, and images of product defects from production.
In-vitro data is captured using physical sensors under lab conditions. Examples are footage of crash-tests, images of products that were intentionally damaged in the lab to capture the data, or images of surface materials taken in the lab under controlled lighting conditions.
In-silico data is generated without the use of physical sensors by software simulations. Examples are renderings of traffic scenes from a driving simulator, rendered images of defect products, or virtual crash-tests performed by simulations using the finite element method.
Consequently, in-vivo training data sets typically consist of large quantities of relatively uninteresting situations with rare instances of exceptional, but highly relevant situations. If used to train a machine learning system, this imbalance immediately translates into a class imbalance problem. In principle, the problem can be mitigated to some degree by manually filtering or selecting training data, and by some computational compensation for class imbalance. Nevertheless, the rate of occurrence of rare phenomena might be so low that hardly any examples can be captured at all.
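As a simple illustration of such computational compensation, the following sketch computes class weights that are inversely proportional to class frequency; the label distribution is invented for the example, and a real system would use its own training labels and loss formulation.

```python
import numpy as np

# Hypothetical label vector: 0 = "intact part" (common), 1 = "defect" (rare).
labels = np.array([0] * 990 + [1] * 10)

# One simple compensation: weight each class inversely to its frequency so that
# rare but highly relevant classes contribute comparably to the training loss.
classes, counts = np.unique(labels, return_counts=True)
class_weights = {int(c): len(labels) / (len(classes) * n) for c, n in zip(classes, counts)}
print(class_weights)  # approximately {0: 0.51, 1: 50.0}

# Per-sample weights that could be passed to a weighted loss during training.
sample_weights = np.array([class_weights[int(label)] for label in labels])
```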
A vivid example of an exceptional situation is a child running in front of an autonomously driving vehicle. Cars in Germany drove 7.3·10¹¹ km in 2016 [6] and were involved in 4195 severe accidents with children [7]. Consider subdividing the total driven distance into chunks of 5 m length to obtain individual training data samples. One can estimate that approximately three out of 10¹¹ such chunks contain images of children prior to a severe accident. For obvious reasons, in-vitro generation of the data is not possible. Resolving the class imbalance problem by capturing enough data and normalizing the class balance by manual sorting is not a valid option either. Even if one could afford the sheer amount of work, the ethical implication of this approach is that one would need to wait for these 4195 severe accidents to happen in order to record the data required to avoid them, rather than using in-silico data generation and preventing the accidents from happening in the first place.
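The estimate quoted above can be reproduced with a back-of-the-envelope computation:

```python
# Back-of-the-envelope estimate behind the numbers quoted in the text.
km_driven = 7.3e11        # km driven in Germany in 2016 [6]
severe_accidents = 4195   # severe accidents involving children [7]
chunk_length_m = 5.0      # length of one training-data chunk in meters

chunks = km_driven * 1000.0 / chunk_length_m   # about 1.5e14 chunks
rate = severe_accidents / chunks               # about 2.9e-11
print(f"{chunks:.2e} chunks, {rate * 1e11:.1f} relevant chunks per 10^11")
```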
The core contribution of this position paper is the introduction of a concept called “Digital Reality” that solves these issues.
Figure 1 displays the generic blueprint of how machine learning models can be trained and validated using such synthetic training data. The approach applies to all data-driven methods, in particular supervised or unsupervised learning with deep neural networks and deep reinforcement learning [8].
The process starts by (1) creating partial models of reality by modeling, capturing, or learning individual aspects such as geometry, behavior, or physical properties including materials or lighting. (2) The partial models are composed into parametric scenarios by manual configuration, data fitting, or machine learning. (3) Setting all parameters of such a scenario to fixed values creates a concrete instance of the scenario, corresponding to a simulation-ready 3D scene. (4) The scene is then rendered by a forward simulation of the imaging process. (5) The resulting synthetic images are used to train a machine learning system.
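The five steps can be summarized in the following skeletal sketch of a data-generation loop; every function, parameter name, and value below is a placeholder standing in for a full modeling, simulation, or training system rather than an actual implementation of Digital Reality.

```python
import random

# Skeletal sketch of the five steps; all names and values are placeholders.

def build_partial_models():                 # (1) model, capture, or learn individual aspects
    return {"geometry": {"box_size_m": (0.5, 2.0)},
            "lighting": {"intensity_lux": (100.0, 1000.0)}}

def compose_scenario(partial_models):       # (2) compose partial models into a parametric scenario
    return dict(partial_models)             # here simply the union of all parameter ranges

def instantiate(scenario, rng):             # (3) fix all parameters -> simulation-ready scene
    return {aspect: {name: rng.uniform(lo, hi) for name, (lo, hi) in params.items()}
            for aspect, params in scenario.items()}

def render(scene):                          # (4) forward-simulate the imaging process
    return {"image": f"synthetic image of {scene}", "labels": scene}

def train(dataset):                         # (5) train a machine learning system on the synthetic data
    print(f"training on {len(dataset)} synthetic samples")

rng = random.Random(42)
scenario = compose_scenario(build_partial_models())
dataset = [render(instantiate(scenario, rng)) for _ in range(100)]
train(dataset)
```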
The remainder of this paper is organized as follows: We first present a more detailed description of the individual steps of the Digital Reality concept. In section 2, we describe how partial models are obtained, and in section 3 we elaborate on how training data can be generated from parametric models using sensor simulations. In section 4, we discuss how considering the parametric models from a sampling perspective can provide useful insights into data generation. In section 5, we present several use cases from different application areas to illustrate the Digital Reality concept with concrete examples and to give some evidence that the concept is feasible.
Parametric models of the real world
The first step of the Digital Reality concept is the creation of partial models of the real world. Each of these partial models covers one specific aspect of reality in the context of a narrow field of application.
For example, in the context of defect detection in a production environment, partial models can cover the shape properties of the products, the shape and characteristics of the defects, the material properties of the product surfaces, the lighting setup at the production site, or the properties and physics of the imaging system. In the context of autonomous driving vehicles, a much longer list of partial models is conceivable. The list includes, among others, geometry and material properties of individual elements of a traffic scene such as roads, buildings, traffic signs, plants, vehicles, or pedestrians [9]; the layout of traffic scenes; behavior models of vehicles, traffic lights, and pedestrians; models of lighting conditions and weather; as well as models of the sensory systems available to the car, including optical cameras, lidar, and the physics of the imaging process of each modality. A partial model is called parametric because it is controlled by a set of input parameters and describes a part of a scene in a simulation environment as a function of these parameters.
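As a minimal illustration of what "parametric" means here, the sketch below describes one hypothetical partial model, the lighting setup at a production site, as a function of a few input parameters; the class and parameter names are assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative sketch of one partial model: a parametric description of the
# lighting setup at a production site. Names and parameters are assumptions.

@dataclass
class LightingModel:
    azimuth_deg: float      # light direction around the part
    elevation_deg: float    # light height above the conveyor
    intensity_lux: float    # overall brightness

    def to_scene_description(self) -> dict:
        # A partial model maps its input parameters to one aspect of a scene.
        return {"type": "directional_light",
                "azimuth": self.azimuth_deg,
                "elevation": self.elevation_deg,
                "intensity": self.intensity_lux}

print(LightingModel(30.0, 60.0, 800.0).to_scene_description())
```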
Clearly, creating partial models of reality closely relates to science. However, there are differences between models created for the purpose of Digital Reality and general scientific models. Science aims to understand one aspect of reality in the most general way possible. Therefore, models are only accepted if they are described in a form that is interpretable by humans. The value of a model depends strongly on the range of its applicability: a model is considered valuable if it can be applied to a wide variety of situations and explains one aspect of reality. Consequently, capturing data about a phenomenon without generating an abstract insight and interpretation is considered incomplete science.
In the context of Digital Reality, neither understanding nor generality of partial models are primary concerns. Instead, for the immediate purpose of training a machine learning system, a generative model with a narrow applicability to the problem is sufficient. The model does not necessarily need to be formulated in a way that is particularly amenable to human interpretation. Rather, any parametric model that is capable of generating the desired output is sufficient. Obviously, the partial models can still be created manually. The manual approach is ideal when obtaining a deeper understanding of aspects of reality is of interest for reasons beyond machine learning. In other cases, capturing or learning a phenomenon is often more effective.
If a model has only little manually created structure but a large number of parameters that are automatically fitted against data, we refer to the process of creating the model as capturing. Typical examples of captured models are object geometries, surface properties of materials, emission properties of light sources, animation snippets, and many more. In the special case that the architecture used for capturing the model is a neural network, we refer to the process as learning the model. In the following, we give some examples of recent progress in capturing various types of models.
Capturing of appearance models
The most commonly captured type of partial model is the geometry of objects. Surface geometry is traditionally captured by 3D laser scanners, which generate an unstructured set of points on the surface of an object. A model is then fitted to these points to establish connectivity and create a mesh. A viable alternative to laser scanners that is increasingly used in the computer game and movie industry is photogrammetry [10]. Apart from the obvious advantage that a digital camera is sufficient to perform a scan, photogrammetry captures surface color and texture along with the shape. However, the captured textures include lighting information that must be removed in a non-trivial post-processing step called delighting [11]. Most 3D scanning approaches are limited to strictly diffuse objects; a fully automatic solution for glass and mirror surfaces has only recently been presented [12].
Geometry alone is insufficient for photorealistic rendering, as the appearance of objects strongly depends on their optical properties. The most basic model of the optical properties of material surfaces for rendering images is the Bidirectional Reflectance Distribution Function (BRDF). Capturing BRDFs has been a topic in computer graphics research for a long time, and various measurement devices and algorithms of different levels of complexity have been developed for this purpose [13]. In practice, most renderers use lower-dimensional parametric models that are fitted against measured BRDF data. Small details in the surface geometry and in the BRDF are stored as a set of texture images for diffuse color, position (also called displacement), surface normal direction, reflectivity, roughness, and so on.
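A minimal sketch of such a low-dimensional parametric reflectance model is given below, combining a Lambertian diffuse term with a simple Blinn-Phong-style specular lobe; the parameters (albedo, specular strength, roughness exponent) are illustrative, and real renderers use considerably richer models.

```python
import numpy as np

# Minimal sketch of a low-dimensional parametric BRDF: Lambertian diffuse plus
# a Blinn-Phong-style specular lobe. Parameter names and values are assumptions.

def brdf(normal, light_dir, view_dir, albedo=0.8, specular=0.3, roughness_exp=32.0):
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / np.linalg.norm(l + v)       # half vector between light and view
    diffuse = albedo / np.pi                   # Lambertian term
    spec = specular * max(np.dot(n, h), 0.0) ** roughness_exp
    return diffuse + spec

# Example evaluation for one light/view configuration.
print(brdf(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0]), np.array([1.0, 0.0, 1.0])))
```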
An interesting observation from many entertainment applications is that the characteristic features in color variation and the small geometric features captured in a surface normal map are more important for the human perception of materials than the precise modelling of reflectance characteristics. A common workflow for capturing materials therefore consists of generating a high-resolution elevation model of the surface using photogrammetry. Material textures are then generated from this model, and the remaining free parameters of the material model are set manually to match the appearance of the real material.
Once partial aspects of the real world are modelled, the partial models can be composed into parametric scenarios in a simulation environment (Fig. 2). The scenario can be configured via a parameter space that consists of all parameters of the partial models and potentially additional, scenario-specific parameters. If all parameters are set to concrete values, the generative model produces a concrete instance of the scenario that corresponds to a simulation-ready scene.
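The composition of parameter spaces and the instantiation of a concrete scene can be sketched as follows; the partial models and parameter ranges are invented for the example.

```python
import random

# Sketch of composing partial-model parameter spaces into one scenario space.
# Partial-model names, parameters, and ranges are illustrative assumptions.

partial_model_spaces = {
    "product_shape": {"length_mm": (40, 60), "width_mm": (20, 30)},
    "lighting":      {"elevation_deg": (20, 80), "intensity_lux": (200, 1200)},
}
scenario_specific = {"camera_distance_mm": (150, 400)}

# The scenario parameter space is the union of all partial-model parameters
# plus the scenario-specific ones.
scenario_space = {f"{model}.{param}": value_range
                  for model, params in partial_model_spaces.items()
                  for param, value_range in params.items()}
scenario_space.update(scenario_specific)

# Fixing every parameter to a concrete value yields one instance of the
# scenario, i.e. a simulation-ready scene description.
rng = random.Random(7)
scene_instance = {name: rng.uniform(lo, hi) for name, (lo, hi) in scenario_space.items()}
print(scene_instance)
```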
Behavior model generation
When moving from static images to video, parametric models must include time-dependent aspects, including behavior models. The behavior of digital human models (DHM) is of particular importance because of its high level of variability and because of the importance of correctly detecting humans. The control of DHMs can be separated into at least two major aspects: on the one hand, a controller that drives the basic mechanics of the artifact representing the body of a human (not necessarily making use of physics for this purpose), and on the other hand, an intelligent agent that drives the high-level behavior of the DHM.
In principle, human motion synthesis can be addressed with varying approaches and levels of detail, depending on the requirements of the specific domain. A recent overview of motion synthesis approaches is given in [14]. Current motion generation approaches for full-body animation can be classified as either analytical or data-driven. Analytical motion synthesis aims to generate realistic motions based on intrinsic mathematical or physics models [15]. In particular, inverse-kinematics-based approaches [16] are often utilized to manipulate motions of articulated avatars. In contrast, data-driven or example-based motion synthesis approaches rely strongly on reference and example datasets, which are predominantly recorded by means of motion capture. These approaches can be further subdivided into different categories: motion blending, which interpolates example clips [17, 18]; motion graphs, which concatenate discrete segments or poses [19]; and machine learning, in particular deep learning-based approaches, which approximate a function or statistical model [20, 21]. Recently, machine learning approaches using deep neural networks have shown promising results with comparatively little manual preprocessing effort [22, 23].
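As a toy illustration of example-based motion blending, the sketch below linearly interpolates two hypothetical, time-aligned motion clips; production systems additionally use time warping and quaternion-based blending of joint rotations.

```python
import numpy as np

# Toy illustration of example-based motion blending: two hypothetical motion
# clips (frames x joint angles) are interpolated to synthesize a new motion.

clip_walk = np.random.default_rng(0).uniform(-1, 1, size=(60, 15))  # 60 frames, 15 joint angles
clip_run = np.random.default_rng(1).uniform(-1, 1, size=(60, 15))

def blend(clip_a, clip_b, weight):
    # Linear interpolation between time-aligned example clips.
    return (1.0 - weight) * clip_a + weight * clip_b

jog = blend(clip_walk, clip_run, weight=0.5)
print(jog.shape)  # (60, 15): a new, intermediate motion clip
```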
One level of abstraction above animation is the high-level control of a DHM. Work on the modelling of intelligent behavior goes back to the 1950s. At that time, researchers concentrated on developing general problem solvers, which worked in any given environment as long as it could be described in a formal manner. However, the success of such systems was rather limited because of the computational complexity of the problems presented to them. In the current state of the art on DHMs, behavior trees or belief-desire-intention (BDI) reasoning are mostly used in complex applications. To what extent it is possible to combine these approaches with the first-principles planning of a general problem solver is an open research question. In fact, whether and how it is possible to learn basic behavior, possibly using deep learning techniques, and combine it with symbolic reasoning approaches remains an open problem.
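To illustrate the flavor of behavior-tree control at this level of abstraction, the following minimal sketch decides whether a pedestrian DHM crosses a road or waits; the node types and actions are assumptions made for the example, not part of any cited system.

```python
# Minimal behavior-tree sketch for high-level DHM control. Node types and the
# pedestrian actions below are illustrative assumptions.

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        # Succeeds only if all children succeed, evaluated left to right.
        return all(child.tick(state) for child in self.children)

class Selector:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        # Succeeds as soon as one child succeeds.
        return any(child.tick(state) for child in self.children)

class Condition:
    def __init__(self, fn): self.fn = fn
    def tick(self, state): return self.fn(state)

class Action:
    def __init__(self, name): self.name = name
    def tick(self, state):
        state["actions"].append(self.name)
        return True

# "Cross the road if the traffic light is green, otherwise wait."
tree = Selector(
    Sequence(Condition(lambda s: s["light"] == "green"), Action("cross_road")),
    Action("wait_at_curb"),
)

state = {"light": "red", "actions": []}
tree.tick(state)
print(state["actions"])  # ['wait_at_curb']
```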
For autonomous driving, the behavior of pedestrians and bicyclists is the most difficult part to model, as in-vivo data is only partially available and cannot be obtained in many situations due to ethical concerns and effort (Fig. 3). Current synthetic driving simulators either do not include pedestrians at all [24] or only display default game-engine animations with predefined trajectories [25,26,27]. A review of models of pedestrian behavior in urban areas is presented in [28], with a focus on route choice and crossing behavior. Most importantly for a Digital Reality, the authors propose a multi-level behavior model very similar to the definition of multi-agent systems.
Shallow models in two dimensions
So far, we have considered the case in which partial models are built close to the real world. In this case, models exist in a three-dimensional world space, and object properties are modelled based on a relatively deep physical understanding of the measurement process. This allows the generation of in-silico images using low-level physical simulation of the measurement process, such as physics-based rendering or radar simulation. Such an approach is conceptually very clean and has clear advantages in terms of generality.
However, capturing all required models can constitute a tremendous effort, and the low-level sensor simulations can have very high computational cost. In many situations, though, in-silico data of sufficient quality can be generated from shallower models. In this approach, a typically two-dimensional model is generated purely from in-vitro or in-vivo data, without the need to integrate a deeper physical understanding of the real world. Such image-based models are typically expressed in image processing terms such as intensities, frequencies, and pixel distances.
An example of such a shallow model is the modeling of cracks in microchips, which can be used to train an optical inspection system. The model consists of a background texture, generated from a texture atlas using an exemplar-based inpainting approach for texture synthesis, with a crack model painted over it. The crack model itself consists of a polygon line of random width that extends in a primary direction but deviates from that direction at random steps by random angles. The intensity profile of the crack is modeled by superimposing several semi-transparent lines with identical corner positions but different transparencies and line widths. All random parameters of the model are drawn from statistical distributions that were generated by manually measuring a set of 80 in-vitro images of cracks. The overall parametric model is depicted in Fig. 4.
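A simplified sketch of the generative part of this crack model is given below; the distributions and parameter ranges are placeholders, whereas the actual model draws them from the statistics measured on the 80 in-vitro crack images.

```python
import numpy as np

# Illustrative sketch of the parametric crack model: a polyline that extends in
# a primary direction but deviates at random steps by random angles, later drawn
# as several superimposed semi-transparent lines of different widths.
# The distributions below are placeholders, not the measured ones.

rng = np.random.default_rng(3)

def sample_crack_polyline(n_segments=20, step_px=8.0, primary_angle=0.0, angle_jitter_deg=25.0):
    points = [np.array([0.0, 0.0])]
    for _ in range(n_segments):
        angle = primary_angle + np.deg2rad(rng.normal(0.0, angle_jitter_deg))
        points.append(points[-1] + step_px * np.array([np.cos(angle), np.sin(angle)]))
    return np.array(points)

def sample_intensity_profile(n_layers=3):
    # Several lines with identical corner positions but different widths and
    # transparencies approximate the intensity profile across the crack.
    return [{"width_px": float(rng.uniform(1.0, 4.0)),
             "alpha": float(rng.uniform(0.2, 0.8))} for _ in range(n_layers)]

polyline = sample_crack_polyline()
layers = sample_intensity_profile()
print(polyline.shape, layers)
```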
Scientific model generation
Typical length scales relevant to production or traffic environments are the millimeter to meter scale. Many scientific applications, however, concern much smaller or much larger length scales and sensory systems suitable for these scales. Paradoxically, our quantitative understanding of both matter and the imaging thereof is much more precise on microscopic and astronomical scales than in the everyday environment we live in. Using, for example, force field simulations as a consistency check, we can model microstructure at the atomic level much more reliably than we can model objects at the everyday scale, such as cars, buildings, furniture, or humans. It is much easier to achieve a quantitatively correct simulation of an electron microscopy image than a quantitatively correct rendering of, for example, a human face in the visible spectrum. As Richard Feynman put it [29], it is possible to know everything about matter; all one would have to do is look at it and see where the atoms are. With recent advances in imaging and analytical characterization techniques, it is nowadays possible to generate a good description of the atomic structure of materials.
For example, with the advent of aberration corrected transmission electron microscopy (TEM) [30] and increasingly sensitive detectors, it is possible to create a two-dimensional projection of the atomic structure of a thin object. This can be extended by in-situ TEM characterization to image the structural dynamics at the atomic level in response to external stimuli such as heat, electrical currents, strain, or specific gas environments [31,32,33]. The main challenge is to create a three-dimensional model from the atomic scale projections.
However, using convolutional neural networks, significant advances have been achieved, for example by identifying and tracking atomic positions in a metal nanoparticle exposed to a defined gas environment to follow the structural response of the nanoparticle [34]. In atom probe tomography (APT) [35, 36], tremendous improvements have also been achieved, which enable the determination of the three-dimensional coordinates of around 50% of the atoms in nanoscale needle-shaped samples. With this progress, state-of-the-art analytical techniques are getting closer to fulfilling Feynman's vision.