[Paper Review] DUSt3R: Geometric 3D Vision Made Easy + MASt3R: Grounding Image Matching in 3D

Previous 3D vision techniques

SfM (Structure from Motion)

Reconstructs a sparse 3D map while jointly estimating camera parameters from a set of images.
Pixel correspondences obtained by keypoint matching across multiple images determine the geometric relationships between views; bundle adjustment then jointly optimizes the 3D coordinates and the camera parameters.

However, the SfM pipeline remains sequential, which makes it vulnerable to noise and errors in each individual component.
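The quantity that bundle adjustment minimizes is usually the reprojection error: the pixel distance between an observed keypoint and the projection of its estimated 3D point. Below is a minimal NumPy sketch of that error for a single pinhole camera. The intrinsics `K`, the pose `R`, `t`, and the two 3D points are made-up illustration values, and a real pipeline would hand this residual to a nonlinear least-squares solver over all cameras and points.

```python
import numpy as np

def project(K, R, t, X):
    """Project Nx3 world points to pixel coordinates with a pinhole camera."""
    Xc = X @ R.T + t              # world frame -> camera frame
    uv = Xc @ K.T                 # apply intrinsics (homogeneous pixels)
    return uv[:, :2] / uv[:, 2:3]

def reprojection_error(K, R, t, X, observed_uv):
    """Mean squared pixel distance between projected and observed keypoints."""
    residual = project(K, R, t, X) - observed_uv
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Hypothetical camera and scene, for illustration only.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
X = np.array([[0.0, 0.0, 2.0],
              [0.5, -0.2, 4.0]])

observed = project(K, R, t, X)    # noise-free observations
err_exact = reprojection_error(K, R, t, X, observed)
# A slightly wrong camera translation makes the error grow.
err_noisy = reprojection_error(K, R, t + np.array([0.05, 0.0, 0.0]), X, observed)
```

Comparing `err_exact` with `err_noisy` shows how an error in one component (here the camera pose) appears as a pixel-level residual that later stages have to absorb.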

Multi-view stereo reconstruction (MVS)

The task of densely reconstructing visible surfaces, achieved via triangulation between multiple viewpoints.
All camera parameters are assumed to be provided as inputs.
In the wild, this means the camera parameters must be estimated first.

However, inaccuracy in the pre-estimated camera parameters can be detrimental to these algorithms.
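To make the dependence on camera parameters concrete, here is a small sketch of linear (DLT) triangulation of one point from two views. It is a textbook method used for illustration, not the procedure of any specific MVS system. The projection matrices `P1`, `P2`, the wrong matrix `P2_bad`, and the point `X_true` are hypothetical values.

```python
import numpy as np

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                   # null-space direction = homogeneous point
    return Xh[:3] / Xh[3]

# Hypothetical intrinsics and a horizontal stereo pair.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
X_true = np.array([0.3, -0.1, 3.0])
uv1, uv2 = project(P1, X_true), project(P2, X_true)

X_est = triangulate(P1, P2, uv1, uv2)          # correct cameras

# Same pixels, but the second camera's baseline is mis-estimated.
P2_bad = K @ np.hstack([np.eye(3), np.array([[-0.25], [0.0], [0.0]])])
X_bad = triangulate(P1, P2_bad, uv1, uv2)
```

With the correct cameras the point is recovered, while a 5 cm error in the assumed baseline moves the triangulated point by a large fraction of its depth.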

Direct RGB-to-3D

Methods that aim to predict 3D geometry directly from a single RGB image have also been proposed.

However, they are intrinsically limited by the quality of the depth estimate, which is arguably ill-posed in the monocular setting.

Pointmaps

Using a collection of pointmaps as shape representation is quite counter-intuitive for MVS, but its usage is widespread for Visual Localization tasks, either in scene-dependent optimization approaches or scene-agnostic inference methods.

Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R)

Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R) operates without prior information about camera calibration or viewpoint poses (camera parameters).

When an unconstrained image collection is passed to DUSt3R, it produces pointmaps (2D-to-3D mappings): for every image pixel it densely predicts a coordinate in 3D space. These pointmaps are then used for a variety of vision tasks.
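As an illustration of what a pointmap is, the sketch below builds one from a depth map and known intrinsics by back-projecting every pixel. Note the assumption: DUSt3R predicts pointmaps directly from images without being given `K`; this code only shows the shape and meaning of the representation (an H x W x 3 array of 3D coordinates). The intrinsics and the constant depth are made-up values.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Back-project an HxW depth map to an HxWx3 pointmap in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grids, shape (H, W)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T                    # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]                        # scale each ray by its depth

# Hypothetical tiny image: 3 rows, 4 columns, flat wall at depth 2.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
pointmap = depth_to_pointmap(depth, K)
```

Each entry `pointmap[v, u]` holds the 3D point seen at pixel `(u, v)`, which is the one-to-one pixel-to-3D mapping the text refers to.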

Architecture of the network

1. Two views of a scene (I1, I2) are encoded in a Siamese manner by a ViT encoder with shared weights.

2. The encoder produces a token representation for each view, F1 and F2 (these are features, not yet pointmaps).

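3. F1 and F2 are passed to the Transformer decoders.

4. Each decoder has its own head, which outputs a pointmap (the predicted 3D point coordinates for each pixel) and a confidence map (a value indicating how reliable each point is).

5. The camera of I1 is fixed at the origin and defines the coordinate frame, while the relative position of the camera of I2 starts out unknown. The pointmaps and confidences predicted for the two images are used to align the two camera positions and the 3D structure in this common coordinate frame.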

From left to right: RGB, depth map, confidence map, reconstruction.

By default the network takes two images as input, but through post-processing the poses of N cameras can be optimized simultaneously.
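The alignment idea in step 5 can be illustrated with a confidence-weighted rigid alignment (the Kabsch / Procrustes method) between two 3D point sets that describe the same pixels in two different frames. This is a generic sketch under that assumption, not the exact optimization used by DUSt3R: the function name `weighted_rigid_align` and all values are hypothetical, and scale is ignored.

```python
import numpy as np

def weighted_rigid_align(src, dst, weights):
    """Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2 (Kabsch)."""
    w = weights / weights.sum()
    mu_src = (w[:, None] * src).sum(axis=0)               # weighted centroids
    mu_dst = (w[:, None] * dst).sum(axis=0)
    cov = (w[:, None] * (src - mu_src)).T @ (dst - mu_dst)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t

# Hypothetical data: flattened pointmap in frame A, same points in frame B.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true
conf = rng.uniform(0.5, 2.0, size=50)                     # per-point confidence

R_est, t_est = weighted_rigid_align(src, dst, conf)
```

Weighting by confidence lets unreliable points (sky, reflections, occlusions) contribute less to the estimated pose. Extending this to N cameras means optimizing all pairwise alignments together, which is what the post-processing step does.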

Grounding Image Matching in 3D with MASt3R

MASt3R adds one more head to the Transformer decoder of DUSt3R (for image matching).

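*Head: in multi-head attention, each head attends to different parts of the input sequence, so the model can handle more complex relationships between input tokens and represent more information. The head added in MASt3R is a prediction (output) head: a small network on top of the decoder that maps its tokens to a per-pixel output.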

MASt3R adds a new head (the Local Features Head) that extracts the per-pixel local features F^1,1 and F^2,1.

The local features form a pixel-wise descriptor that is used for image matching.
As a result, MASt3R supports not only matching based on 3D pointmaps but also matching based on feature descriptors.
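Descriptor-based matching can be sketched as mutual (reciprocal) nearest-neighbor search: pixel i in image 1 and pixel j in image 2 are matched only if each is the other's most similar descriptor. The version below is a brute-force illustration with cosine similarity. The function name, the 16-dimensional descriptors, and the permutation are hypothetical, and a real dense matcher would need something faster than an all-pairs similarity matrix.

```python
import numpy as np

def mutual_nearest_neighbors(desc1, desc2):
    """Return (i, j) index pairs that are each other's nearest neighbor."""
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    sim = d1 @ d2.T                        # cosine similarity, shape (N1, N2)
    nn12 = sim.argmax(axis=1)              # best match in image 2 for each i
    nn21 = sim.argmax(axis=0)              # best match in image 1 for each j
    idx1 = np.arange(len(d1))
    keep = nn21[nn12] == idx1              # keep only reciprocal pairs
    return np.stack([idx1[keep], nn12[keep]], axis=1)

# Hypothetical descriptors: image 2 sees the same 5 features, shuffled and noisy.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(5, 16))
perm = np.array([2, 0, 4, 1, 3])           # desc2[j] corresponds to desc1[perm[j]]
desc2 = desc1[perm] + 0.01 * rng.normal(size=(5, 16))

matches = mutual_nearest_neighbors(desc1, desc2)
```

The reciprocity check discards one-sided matches, which is a common way to reduce false correspondences before using them for pose estimation.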

*Feature Descriptor: a vector that quantitatively represents a particular feature (designed to be invariant to changes in viewing angle, illumination, scale, and so on).

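+ Zero-shot / One-shot / Few-shot learning: learning approaches for dealing with data scarcity in ML

Zero-shot learning (ZSL): the model performs on a new task/class without having seen any labeled data for it during training

The model uses previously learned knowledge to make inferences about tasks/classes it has never seen.
General knowledge is learned from a very large amount of data during the pretraining stage.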

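One-shot learning (OSL): learning or inferring from only one sample per class in order to solve a new task

The model must be able to generalize to a class after seeing just a single example.
After pretraining, it generalizes even when data for the new class is minimal.
Example: after seeing a single photo of a person's face, correctly recognizing that person in other photos.
Siamese Network: measures the similarity between two inputs to compare new data against existing classes.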

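Few-shot learning (FSL): solving a new task using only a few samples per class as training data

Example: after seeing only a few translated sentences in a new language, the model can operate in that language.
Pretraining on existing tasks is leveraged to adapt to a new task from only a handful of examples.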
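One common way to realize one-shot and few-shot classification is to compare embeddings from a pretrained encoder: average the few support embeddings of each class into a prototype and assign a query to the most similar prototype. The sketch below follows that prototype-style idea using cosine similarity. The 2D embeddings are hand-written stand-ins for encoder outputs, and `classify_by_prototype` is a hypothetical helper.

```python
import numpy as np

def classify_by_prototype(support_emb, support_labels, query_emb):
    """Assign each query to the class whose mean support embedding is most similar."""
    classes = np.unique(support_labels)
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in classes])
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sim = q @ p.T                          # cosine similarity, shape (Q, C)
    return classes[sim.argmax(axis=1)]

# Hypothetical embeddings: class 0 points along x, class 1 along y.
support_emb = np.array([[1.0, 0.1], [0.9, -0.1],
                        [0.1, 1.0], [-0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([[0.8, 0.2], [0.2, 0.7]])

# Few-shot: two support samples per class.
few_shot_pred = classify_by_prototype(support_emb, support_labels, query_emb)

# One-shot: a single support sample per class.
one_shot_pred = classify_by_prototype(support_emb[[0, 2]],
                                      support_labels[[0, 2]], query_emb)
```

With one support sample per class the prototype is just that sample, so this reduces to the pairwise similarity comparison a Siamese network performs.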
