Model Details
While most methods use multi-view stereo only to extract features of a scene, we additionally predict depth from these features and design two depth-guided sampling strategies, each customized for one of the two rendering tasks.
Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely takes image semantics into the synthesis process, so that both novel-view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual Rendering. The former observes multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performance. Our experiments not only confirm that GSNeRF performs favorably against prior works on both novel-view image synthesis and semantic segmentation, but also verify the effectiveness of our depth-guided sampling strategy for visual rendering.
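To make the joint rendering concrete, the sketch below shows standard NeRF-style volume compositing extended to per-point semantic logits, one common way to render an image and a segmentation map from the same ray samples. This is a minimal illustration, not the authors' implementation; the function and tensor names (`composite`, `sem_logits`, etc.) are hypothetical.

```python
import torch

# Minimal NeRF-style compositing sketch (assumed, not the authors' code):
# the same per-ray weights blend per-sample colors and semantic logits.
def composite(sigmas, rgbs, sem_logits, t_vals):
    """sigmas: (N, S) densities; rgbs: (N, S, 3) colors;
    sem_logits: (N, S, C) class logits; t_vals: (N, S) sorted sample depths."""
    deltas = torch.diff(t_vals, dim=-1)
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]                               # accumulated transmittance
    weights = alpha * trans                             # (N, S) ray weights
    rgb = (weights.unsqueeze(-1) * rgbs).sum(dim=1)         # rendered color
    sem = (weights.unsqueeze(-1) * sem_logits).sum(dim=1)   # rendered logits
    depth = (weights * t_vals).sum(dim=1)                   # expected depth
    return rgb, sem, depth
```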
Overview of GSNeRF. Given multi-view images of a scene, the Semantic Geo-Reasoner $G_S$ predicts the depth map of each input view; these depth maps are aggregated to estimate the target-view depth map $D_t$. With $D_t$ as key geometric guidance, we design Depth-Guided Visual Rendering to render the target-view image and segmentation map, respectively.
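The caption does not spell out how $D_t$ guides sampling, so the following is a hedged sketch of one plausible depth-guided strategy: drawing ray samples from a Gaussian band around the predicted depth rather than uniformly between the near and far planes. All names (`depth_guided_samples`, `sigma`) are illustrative assumptions, and the paper's two task-specific strategies may differ, e.g., in band width or sample count per task.

```python
import torch

# Hypothetical depth-guided sampling sketch (assumed, not the paper's exact
# strategy): concentrate ray samples near the predicted target-view depth.
def depth_guided_samples(rays_o, rays_d, pred_depth, n_samples=32,
                         sigma=0.1, near=0.1, far=10.0):
    """rays_o, rays_d: (N, 3) ray origins and directions;
    pred_depth: (N,) per-ray depth from the estimated target-view depth map."""
    offsets = sigma * torch.randn(pred_depth.shape[0], n_samples,
                                  device=pred_depth.device)
    t_vals = (pred_depth.unsqueeze(-1) + offsets).clamp(near, far)
    t_vals, _ = torch.sort(t_vals, dim=-1)  # keep samples ordered along the ray
    pts = rays_o.unsqueeze(1) + t_vals.unsqueeze(-1) * rays_d.unsqueeze(1)
    return t_vals, pts                      # (N, S) distances, (N, S, 3) points
```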
Table 1: Quantitative results on ScanNet & Replica. Note that the methods in the first four rows take ground-truth (GT) depth as input or as training supervision, while the methods in the last six rows do not observe GT depth during training or testing.
Table 2: Results of finetuning on unseen scenes of ScanNet.
Figure 1: Qualitative evaluation. We compare the visual quality of the rendered novel-view images (first three columns) and semantic segmentation maps (last three columns) with S-Ray.
@inproceedings{Chou2024gsnerf,
  author    = {Zi-Ting Chou* and Sheng-Yu Huang* and I-Jieh Liu and Yu-Chiang Frank Wang},
  title     = {GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding},
  booktitle = {CVPR},
  year      = {2024},
  arxiv     = {2403.03608},
}