GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

CVPR 2024

National Taiwan University, NVIDIA


Abstract

Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely incorporates image semantics into the synthesis process so that both novel-view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual Rendering. The former observes the multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performance. Our experiments confirm not only that GSNeRF performs favorably against prior works on both novel-view image synthesis and semantic segmentation, but also the effectiveness of our depth-guided sampling strategy for visual rendering.

Framework

Overview of GSNeRF. Given multi-view images of a scene, the Semantic Geo-Reasoner $G_\theta$ predicts the depth map of each input view; these per-view depths are then aggregated to estimate the target-view depth map $D_t$. With $D_t$ as the key geometric guidance, our Depth-Guided Visual Rendering stage renders the target-view image and segmentation map, respectively.
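For readers who prefer code, the two-stage pipeline can be summarized as follows. This is a minimal PyTorch sketch under assumed naming: SemanticGeoReasoner, aggregate_target_depth, and the plain-average fusion are illustrative stand-ins, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class SemanticGeoReasoner(nn.Module):
        # Stage 1: extract semantic/geometry features and predict a depth map per view.
        def __init__(self, feat_dim=32):
            super().__init__()
            self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)  # stand-in for the 2D backbone
            self.depth_head = nn.Conv2d(feat_dim, 1, 1)          # per-view depth regression

        def forward(self, views):                                # views: (V, 3, H, W)
            feats = self.encoder(views)                          # (V, C, H, W) per-view features
            depth = self.depth_head(feats).squeeze(1)            # (V, H, W) per-view depth
            return feats, depth

    def aggregate_target_depth(src_depths):
        # Placeholder for warping source-view depths into the target view and
        # fusing them; a plain average stands in for the real aggregation here.
        return src_depths.mean(dim=0)                            # (H, W) target depth D_t

    views = torch.rand(3, 3, 64, 64)                             # toy 3-view input
    feats, src_depths = SemanticGeoReasoner()(views)
    d_t = aggregate_target_depth(src_depths)                     # guidance for Stage 2 rendering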

Model Details

While most methods use multi-view stereo only to extract features of a scene, we further predict depth from these features and devise two depth-guided sampling strategies, each customized for one of the two tasks (image rendering and semantic segmentation); see the sketch below.

Figure: the Semantic Geo-Reasoner (left) and Depth-Guided Visual Rendering (right).
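A hedged sketch of what depth-guided sampling might look like: rather than placing samples uniformly between the near and far planes, samples are concentrated in a band around the estimated target-view depth of each ray. The band width and parameterization below are illustrative assumptions, not the paper's exact scheme.

    import torch

    def depth_guided_samples(ray_depth, n_samples=32, band=0.1):
        # ray_depth: (R,) per-ray depths read from the estimated target depth map D_t.
        # Returns (R, n_samples) sample depths clustered around the predicted surface,
        # instead of uniform samples over the full [near, far] range.
        offsets = torch.linspace(-band, band, n_samples)
        return ray_depth.unsqueeze(-1) + offsets

    ray_depth = torch.tensor([1.2, 2.5, 3.1])       # toy depths for three rays
    z_vals = depth_guided_samples(ray_depth)        # (3, 32) sample depths near surfaces

Clustering samples near the predicted surface lets image rendering spend its sample budget where radiance actually changes, while the semantic branch can use a different band or sample count; this is the sense in which the two strategies are task-specific.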

Experiments

Table 1: Quantitative results on ScanNet & Replica. Note that the methods in the first four rows take GT depth as input or as training supervision, while the methods in the last six rows do not observe GT depth during training or testing.

Table 2: Results of finetuning on unseen scenes of ScanNet.

Figure 1: Qualitative evaluation. We compare the visual quality of the rendered novel view images (the first three columns) and semantic segmentation maps (the last three columns) with S-Ray.

BibTeX

If you find this useful for your research, please consider citing:
@inproceedings{Chou2024gsnerf,
      author    = {Zi-Ting Chou* and Sheng-Yu Huang* and I-Jieh Liu and Yu-Chiang Frank Wang},
      title     = {GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding},
      booktitle = {CVPR},
      year      = {2024},
      arxiv     = {2403.03608},
    }