This week I began to investigate alternative options to the multi-view depth map model of Soltani et al. However, I still continued some limited experimentation with the model, which did allow me to gain some more insight into what might have been going wrong last week. This side effort was also motivated by the fact that alternative models I found either do not seem promising enough, or do not provide adequate enough documentation for me to parse through them.
Experimentation on multi-view depth network
I began by posting an issue on the GitHub page for the Soltani et al. project describing the issues I faced last week when I attempted to train the model on swords but received a bunch of identical diamond shapes. I received a response back fairly quickly, but the author merely suggested I try training either with more data or more epochs.
Meanwhile, I wanted to confirm that the issue was still not due to too wide of geometry to sample from, so I created twin toy datasets of many (~20) copies of one of the Final Fantasy sword models I had, as well as another one composed of many copies of a Larvitar model from the Pokemon dataset. After training on the standard 80 epochs, I received the same diamond shapes for the sword model, but was interested to find that the Larvitar model actually displayed reasonable similarities to the models’ geometry, as visible below.
From these results, and my conclusion from last week that the issue was somewhere in the data, I took a closer look at the input sword depth maps. Looking at them closer, I formed a hypothesis that out of all of the models, the only parts that ever appear white (i.e. close to the camera and strong signal) in the sword depth maps are the tips of the swords in certain of the views; everything else appears gray. Looking at the Larvitar results, I noticed a similar pattern that the reproduced features seemed to mostly be only the strongest ones (colored white) in the corresponding depth maps for each view. Looking at the sword depth maps again, I then theorized that the diamond-like shape is due to that being the general shape of the tip of the sword, apparently the only part providing a strong enough activation to be learned by the network.
From this theory, I might conclude that the Soltani model will only work well for data that are roughly spherical in nature so that the sphere-based depth-map rendering system can capture each part equally. The sword dataset clearly does not fit this criteria, but the Pokemon dataset theoretically does. Though as my original goal was to try other models, I moved on try that instead of trying to reconfigure the Pokemon dataset. However I did at least make a follow-up post on GitHub with the paper author communicating this theory and asking if he had any tips on getting around it.
Investigating an alternative generative network
As before, I did not think that voxel models would be the best choice, since they have such low resolution and thus cannot capture as great of detail. The only remaining choice of the studies I highlighted before then was the AtlasNet project, which uses a VAE to take point cloud or image inputs and encode them into meshes using a parametric surface atlas approach. I found the project’s GitHub and was able to run a pre-trained model on a plane example to get this output mesh, which is of pretty good quality:
This made me excited to try the model, but after some work I could not find any documentation on how to add my own data, nor perform random sampling instead of straight encoding. This is in contrast to the Soltani et al. project which very helpfully provided much more detailed documentation on this. For now, I posted an issue on the AtlasNet GitHub asking for pointers on using my own data. Along the way I also discovered a repo for AtlasNetv2, a newer and more powerful version, which I would want to use if I ended up going with this network.
Interview and alternative approach
Finally, on Monday I was able to get an interview with an industry technical artist who proposed the idea of instead of using generative networks to directly generate 3D content, use 2D networks to create inputs for a photogrammetry-like technique. This seems promising based upon the more lifelike state-of-the art for 2D networks vs. 3D ones, but even in our discussion we already identified issues of synchronizing output across multiple networks.
Thinking about this prompted me to go back and try to understand the Soltani et al. model architecture a bit more, since from what I can tell their model more or less does this. That is, rather than actually learn 3D representations, their model is simply mapping the 20 2D depth maps to the same number of outputs via a VAE.
From here, I’m going to see if any of the issues I posted about get replied to, and in the mean time try and understand more about the Soltani et al. architecture. Given that it seems to align with this photogrammetry-like approach (in fact that is what is seems to be doing), I think it warrants another look. To that end I think I might right out to the authors themselves so that I can ask them more detailed questions (maybe in an interview-like call!).