This week I made my first forays into running generative networks for synthesizing 3D models for video games. My goal was to take an existing model and try training it with a dataset of 3D game assets to see what sorts of outputs it would produce.
For my first attempt here I used the model of Soltani et al. from their paper “Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks”. This is one of the papers I highlighted in my academic literature review, and I chose this one to start with over the others because I was curious to see how its unique point cloud output would work as I had not worked with this data type before. I obtained the group’s project code from GitHub at (Link).
The authors of the paper used data from the ShapeNet Core dataset, which includes 3D models of standard objects such as cars, chairs, guitars, etc. As my project is interested in video game models I obtained assets from models-resource.com to use as training data. As the network is designed for learning a particular category of objects, I decided to test it with 100 different Pokémon models from the 3DS games, a good type of game model with lots of samples in the same “category”.
The models I downloaded came in several formats as well as including extra data such as textures that I discarded. I imported a COLLADA .dae of each model into Blender, and joined separate any separate meshes into one single mesh before exporting it as an STL file. I repeated this (tedious) process for each of the 100 Pokémon meshes.
Next, I used a program included with the GitHub repo called
renderDepth to create depth maps of each mesh, the way that this network takes in inputs. This was somewhat awkward because while all other operations for running the network were designed to take place in Unix, this program was a .EXE for some reason so I had to transfer files to Windows, run the program, and transfer the results back. However the results were very cool-looking depth maps for each of the models, with one image for each of twenty different camera viewpoints.
Next came the main step of actually training the network. The repo included several Lua scripts that processed input arguments and then made the appropriate calls to PyTorch to perform training, validation, testing, and the different experiments included. In order to figure out how to call the training program with the arguments I needed, I had to wade through the scant documentation on the GitHub page, in a few READMEs, as well as the scripts themselves. The model included three different network variations: AllVPNet, which is the standard model that takes 20 viewpoints, DropoutNet, which also takes 20 viewpoints but randomly discards 15 to 18 of them before the encoder stage, and SingleVPNet, where only one image is taken randomly from the 20 viewpoints. From the paper I gathered that AllVPNet had the highest performance of the three across several metrics, and thus decided to work with that model.
With the network set, I ran training for 80 epochs, which took around 15 minutes on my GTX 1060. As I tried to run training again later, I ran into some errors about running out of memory, which seems possible given my average graphics card and only 6GB of VRAM. However in those cases when I restarted my computer I was able to complete training.
When the included script finished training it also performed several sample experiments that gave me samples from the learned manifold of shapes. The results, however, were many egg-shaped blobs of slightly different shapes and sizes. This could mean that the model only learned the notion of some vague “torso” for each of the Pokémon models, or given that the Pokémon models’ torsos were all centered this could just be a product of general averaging of the shapes as well. The results were somewhat disappointing, but I also wasn’t particularly surprised since I trained with only 100 samples and chose Pokémon with widely different body shapes and features. For example in the images above the Teddiursa is bipedal, has round ears and round geometry, while the Arcanine is a quadruped with more detailed and spiky geometry from its fur.
To make sure that these results were not from incorrect setup of the model, I ran the same training and manifold-sampling procedure on 50 models of headphones from the ShapeNet Core dataset. These results were by no means perfect either but did show recognizable learning by the model of the general shape of headphones, which indicated my result came from either poor choice of data, or inherent deficiencies of the model itself.
With my Pokémon blobs, I ran an additional included program to visualize the separate viewpoints into one 3D point cloud for each model. This took several attempts to get it to work but eventually I was able to get out PLY files which I imported into Blender. Here I was able to confirm that the point clouds had cavities in their centers, which I took as a good sign that the network had learned some sort of body cavity concept rather than just averaging points at the center. I also attempted to transform these point clouds into real meshes using MeshLab, but having not used this software before the meshes had fairly jagged geometry and I was unable to get a satisfactory result this week.
Moving forward, I am more aware of the challenges of my problem space, but based upon the still somewhat successful results I got this week given the circumstances, I am still hopeful about the future trajectory of the project. I can continue to work with this multi-depth-map model on an improved dataset, or I could try one of the voxel models I researched. However as the voxel output has such low resolution I am doubtful of that approaches’ usefulness for my application (which is a reason why I chose the point cloud model this week). Ultimately I need to also read through the paper(s) deeper to understand how the models are actually built and, which a bit of self-study likely, start to theorize how I may be able to improve the networks for my task.