To complete the Bachelor’s program in Germany, one has to hand in a final thesis. My topic of choice was “Photorealistic Rendering of Training Data for Object Detection and Pose Estimation with a Physics Engine”. The final version (long read!) can be found in the Papers section; this post is merely a rough outline of the topic and results.
Excursus: Pose Estimation
To understand the relevance and contribution of my thesis, it is necessary to be familiar with pose estimation first. Pose estimation is a problem in computer vision that infers 3D poses (translation & rotation) from 2D data such as RGB images and optionally depth information.
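To make "3D pose" concrete: a pose is commonly represented as a rigid transform, i.e. a rotation plus a translation, often packed into a 4×4 homogeneous matrix. The sketch below (plain Python, rotation restricted to a single yaw angle for brevity; not code from the thesis) shows how such a pose maps a point from object space into camera space:

```python
import math

def pose_matrix(yaw, t):
    """Build a 4x4 homogeneous pose from a yaw angle (rotation
    about the z-axis) and a translation vector t = (x, y, z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c,  -s,  0.0, t[0]],
        [s,   c,  0.0, t[1]],
        [0.0, 0.0, 1.0, t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]

def apply_pose(M, p):
    """Transform a 3D point p from object space into camera space."""
    x, y, z = p
    return tuple(M[i][0] * x + M[i][1] * y + M[i][2] * z + M[i][3]
                 for i in range(3))

# Rotate a point 90 degrees about z, then shift it by (1, 0, 0).
M = pose_matrix(math.pi / 2, (1.0, 0.0, 0.0))
print(apply_pose(M, (1.0, 0.0, 0.0)))  # ~ (1, 1, 0)
```

A pose estimation network predicts exactly these six degrees of freedom (three for rotation, three for translation) per visible object.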
This task is often solved using convolutional neural networks (CNNs), which require vast amounts of labeled data for training. Labeled data are, in this case, images annotated with at least the 3D poses of the visible objects. Acquiring training data proves difficult: few real datasets are available, and generating synthetic datasets remains an open problem and an active field of research.
The goal of my thesis was to develop a generation pipeline that combines the authenticity of real data with the speed and volume of synthetic approaches. This was achieved by rendering virtual objects and blending them onto existing photographs.
The virtual objects are scans of real objects, the same ones used to create the LineMOD dataset. The real images stem from the 3RScan dataset, a collection of 400+ 3D scenes reconstructed from RGB-D images. This enables us to include scene context when rendering the objects and to simulate their interactions with the scene in a physically accurate way.
The generator first spawns objects in a randomly chosen scene and simulates physical interactions using NVIDIA PhysX. After the objects reach a steady state, the simulation is stopped and the rendering process starts. Blurry images are filtered out, and the remaining images are used for generating new, synthetic ones. Rendering is done using Blender & Appleseed, as it was a requirement to use only open source software: Toyota, which uses the generator for robotic vision networks, had no funds available for software licenses for student projects.
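A common way to detect that a simulation has settled is to step it frame by frame until every body's linear and angular speed drops below a small threshold. The sketch below is an illustration of that idea in plain Python, with hypothetical `step`/`read_velocities` callbacks standing in for the actual PhysX calls, not the thesis implementation:

```python
def is_steady(velocities, lin_eps=1e-3, ang_eps=1e-2):
    """True once every simulated object's linear and angular speed
    has dropped below the given thresholds.
    `velocities` is a list of (linear_speed, angular_speed) pairs,
    as one might read back from the physics engine each frame."""
    return all(lin < lin_eps and ang < ang_eps for lin, ang in velocities)

def simulate_until_steady(step, read_velocities, max_frames=1000):
    """Advance the simulation until all objects come to rest, with a
    frame budget as a guard against objects that jitter forever."""
    for frame in range(max_frames):
        step()
        if is_steady(read_velocities()):
            return frame
    return max_frames
```

The frame budget matters in practice: thin objects resting on uneven scene geometry can oscillate indefinitely, so the loop must be allowed to give up.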
Since Blender does not have a C++ interface and cannot be embedded, I instead use embedded Python and Blender as a Python module to create render commands. This is done using a flexible, custom JSON input structure to keep the Python and C++ parts mostly separated.
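To illustrate the idea of a JSON hand-off between the two halves: the C++ side serializes one self-contained render command, and the embedded Python side parses it and drives Blender accordingly. The field names below are purely illustrative, not the actual schema from the thesis:

```python
import json

# Hypothetical render command as the C++ pipeline might assemble it;
# all keys and values here are made up for illustration.
command = {
    "scene": "scenes/example_scan.ply",
    "objects": [
        {"mesh": "models/ape.ply",
         "pose": {"translation": [0.1, 0.0, 0.6],
                  "rotation": [0.0, 0.0, 0.0, 1.0]}},
    ],
    "passes": ["depth", "label", "ao", "pbr"],
    "output": "out/frame_0001",
}

payload = json.dumps(command)   # serialized on the C++ side
received = json.loads(payload)  # parsed inside the embedded Python
print(received["passes"])       # ['depth', 'label', 'ao', 'pbr']
```

The benefit of such a scheme is that neither side needs to know the other's internal data structures; the JSON document is the entire contract between them.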
It should also be noted that, at the time, Appleseed was only available on Windows, not on Linux. I was able to create a Docker container that compiles a Blender-compatible Appleseed build for Linux, which is why Linux is now officially supported again!
The rendering process is split into several steps, which are briefly outlined below. First, the depth of both objects and scene at a given camera pose is rendered and used to create an object visibility mask. This determines whether to proceed further, as the pipeline terminates if too few (or no) objects are visible.
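The visibility test boils down to a per-pixel depth comparison: an object pixel counts as visible where the object is closer to the camera than the scene mesh. A minimal sketch of that idea, using plain nested lists as depth maps (the real pipeline works on rendered depth buffers):

```python
def visibility_mask(object_depth, scene_depth, background=float("inf")):
    """Per-pixel visibility: an object pixel is visible where the
    object lies in front of the reconstructed scene mesh.
    `background` marks pixels the object does not cover at all."""
    return [
        [obj != background and obj <= scn
         for obj, scn in zip(obj_row, scn_row)]
        for obj_row, scn_row in zip(object_depth, scene_depth)
    ]

def visible_fraction(mask):
    """Fraction of image pixels covered by visible object surface;
    images below some threshold can be discarded early."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total
```

Terminating on this cheap test before any expensive PBR rendering happens is what keeps the pipeline's per-image cost down.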
If an image passes this test, a label image is rendered next (object IDs as shades of gray). Then, the ambient occlusion of the objects is rendered, with the scene included only indirectly (the scene mesh does not receive camera rays). Next, the objects are rendered in full PBR fashion, while the scene mesh again only receives indirect diffuse / specular rays. The difference is quite noticeable and can be observed in the image below.
Finally, the generated ambient occlusion & PBR renders are combined with the occlusion mask and blended with the real image; the result is stored on disk together with annotations and the corresponding combined depth map. The entire pipeline runs on the CPU and takes ~15s per image on my system (i7 6700k @ 4.4GHz).
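Conceptually, the compositing step selects per pixel between the rendered object and the photograph, using the mask, with the AO pass darkening the render. The sketch below assumes a simple multiplicative AO model on plain Python tuples; the thesis combines these passes inside the render pipeline itself, so treat this as an illustration, not the actual compositor:

```python
def blend_pixel(render_rgb, ao, real_rgb, masked):
    """Composite one pixel: where the object mask is set, take the
    PBR render darkened by its ambient-occlusion term; elsewhere
    keep the original photograph unchanged."""
    if not masked:
        return real_rgb
    return tuple(min(255, int(c * ao)) for c in render_rgb)

def blend_image(render, ao_map, real, mask):
    """Apply blend_pixel over whole images stored as nested lists."""
    return [
        [blend_pixel(render[y][x], ao_map[y][x], real[y][x], mask[y][x])
         for x in range(len(real[0]))]
        for y in range(len(real))
    ]
```

Because unmasked pixels pass through untouched, the photograph's natural noise, lighting, and context are preserved everywhere except where a virtual object actually sits.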
The outputs were in many instances comparable to fully artificial rendering, but fell behind in some key aspects. Lighting, the lack of shadows, and clipping issues are yet to be solved and will be the focus of another student who is to continue my work. Overall, the results are very promising and were extremely well received.
We also trained Mask R-CNN on a dataset of 10k images and evaluated the resulting network. While training was certainly not long enough, as the network often did not detect anything at all, the instances where it did detect objects were spot-on. Examples of the outputs (masks & bounding boxes) can be seen below.
The project was overall a success and promising enough to spawn an additional thesis building on my work. While there were many obstacles and it required far more work (2000+ h) than a thesis should (~400 h), it was an interesting and fun project. If you wish to build the project or take a look at implementation details, it is available on GitHub. Download links for the thesis and a precompiled version for Windows are provided below.