MIT Works on 2D to 3D Conversion

MIT researchers, together with a team from the Qatar Computing Research Institute (QCRI), have developed a system that automatically converts 2D video of football games to 3D. The footage can then be played back on any 3D device, including VR headsets.

While many displays today can show 3D, content is lacking, said MIT’s Wojciech Matusik. Sporting content is especially difficult to translate into 3D, as it must be done in real-time.

Previous systems for 3D conversion have produced visual artefacts. Matusik says that MIT’s system avoids these because its conversion pipeline is not a general-purpose one: it was developed for a specific sport, and it leverages gaming technology.

Video games today store detailed 3D maps of their virtual environments. When the player initiates a move, the game adjusts the map accordingly and instantly generates a 2D projection of the 3D scene, corresponding to a particular viewing angle.
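That projection step can be illustrated with a minimal pinhole-camera sketch (this is a toy model; real game engines use full view and projection matrices, and all names and parameters here are hypothetical):

```python
import math

def project_point(point, yaw_deg, fov_deg=60.0, screen_w=1280, screen_h=720):
    """Project a 3D world point to 2D screen coordinates for a camera at
    the origin, rotated by yaw_deg about the vertical axis.
    Toy pinhole model, not any engine's actual pipeline."""
    x, y, z = point
    # Rotate the world so the camera looks down the +z axis.
    yaw = math.radians(yaw_deg)
    cx = x * math.cos(yaw) - z * math.sin(yaw)
    cz = x * math.sin(yaw) + z * math.cos(yaw)
    if cz <= 0:
        return None  # point is behind the camera
    # Perspective divide, then map to pixel coordinates.
    f = (screen_w / 2) / math.tan(math.radians(fov_deg) / 2)
    u = screen_w / 2 + f * cx / cz
    v = screen_h / 2 - f * y / cz
    return (u, v)
```

A point straight ahead of the camera lands at the screen centre; changing `yaw_deg` yields the projection for a different viewing angle.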

The MIT/QCRI teams essentially ran this video game process in reverse. They set the game FIFA 13 to play on a loop and used Microsoft’s video game analysis tool, PIX, to continuously store screenshots of the action, extracting a corresponding 3D map for each screenshot.

Using an algorithm, the teams gauged the visual difference between screenshots. They winnowed the set to keep only the screenshots that best captured the range of possible viewing angles and player configurations the game presented (although this still left tens of thousands). Each screenshot and its associated 3D map were stored in a database.
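One simple way to winnow a stream of screenshots is a greedy pass that keeps a frame only if it differs enough from everything already kept. This is a sketch under assumptions, not the researchers' actual algorithm or difference metric; frames are represented as flat lists of grayscale intensities:

```python
def frame_difference(a, b):
    """Mean absolute pixel difference between two equal-sized grayscale
    frames (flat lists of 0-255 intensities). A stand-in for whatever
    metric the researchers actually used."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def select_representatives(frames, threshold):
    """Greedily keep a frame only if it differs from every frame already
    kept by at least `threshold`, pruning near-duplicate screenshots."""
    kept = []
    for frame in frames:
        if all(frame_difference(frame, k) >= threshold for k in kept):
            kept.append(frame)
    return kept
```

Near-duplicate frames (consecutive screenshots of almost the same play) are dropped, while distinct camera angles and player configurations survive into the database.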

The resulting system utilises this database, searching for the 10 screenshots that best correspond to each frame of 2D video from a real football game. It then decomposes the images, seeking the best matches between smaller regions of the video feed and of the screenshots. Once those matches have been found, the depth information from the screenshots is imposed onto the corresponding sections of the video feed, and the system stitches the pieces back together.
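The decompose-match-transfer-stitch loop can be sketched as follows. This is a heavily simplified illustration, not the published method: images and depth maps are small 2D grids of numbers, tiles are compared by absolute difference, and every function name is hypothetical:

```python
def split_into_blocks(image, block):
    """Split a 2D grid (list of rows) into block x block tiles,
    returned with their top-left coordinates."""
    tiles = []
    for r in range(0, len(image), block):
        for c in range(0, len(image[0]), block):
            tile = [row[c:c + block] for row in image[r:r + block]]
            tiles.append(((r, c), tile))
    return tiles

def tile_distance(a, b):
    """Sum of absolute differences between two tiles."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def transfer_depth(frame, candidates, block=2):
    """For each tile of the live frame, find the closest tile among the
    candidate (screenshot, depth_map) pairs, copy that tile's depth
    values, and stitch the results into a full depth map."""
    depth = [[0] * len(frame[0]) for _ in frame]
    for (r, c), tile in split_into_blocks(frame, block):
        best = None
        for shot, shot_depth in candidates:
            for (sr, sc), stile in split_into_blocks(shot, block):
                d = tile_distance(tile, stile)
                if best is None or d < best[0]:
                    best = (d, sr, sc, shot_depth)
        _, sr, sc, shot_depth = best
        for i in range(block):
            for j in range(block):
                depth[r + i][c + j] = shot_depth[sr + i][sc + j]
    return depth
```

A frame that exactly matches a stored screenshot simply inherits that screenshot's depth map; in general each region borrows depth from whichever candidate region resembles it most.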

MIT says that the result is ‘a very convincing 3D effect’, without artefacts. The system currently takes about 0.33 seconds to process each frame of video. However, successive frames can be processed in parallel, so the delay is incurred only once, at the start of playback.
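The parallelism argument can be demonstrated with a small concurrency sketch, assuming each frame converts independently (the per-frame work here is a placeholder sleep, not the real conversion, and all names are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

PER_FRAME_SECONDS = 0.05  # stand-in for the ~0.33 s per-frame latency

def convert_frame(frame_id):
    """Placeholder for the 2D-to-3D conversion of one frame."""
    time.sleep(PER_FRAME_SECONDS)
    return frame_id  # would return the frame plus its depth map

def convert_stream(frame_ids, workers=8):
    """Convert frames concurrently. Executor.map preserves input order,
    so the output stream stays in sync with the video while the
    per-frame latency is paid largely in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_frame, frame_ids))
```

With enough workers, eight frames finish in roughly the time of one, so the viewer experiences only a single start-up delay rather than a per-frame stall.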

“One of the main insights of the paper is that domain-specific methods are able to yield bigger improvements than more general approaches”, said Hanspeter Pfister, a professor at Harvard University. “This is an important lesson that will have ramifications for other domains”.

A video of the system in action can be found at