[Paper Review] RoMa : Robust Dense Feature Matching

공부/Deep Learning

[Paper Review] RoMa : Robust Dense Feature Matching

유스베리이 2025. 1. 8. 15:58

Image Matching

Feature Matching is an important computer computer vision task that involves estimating correspndences between two images of a 3D scene, and dense methods estimate all such correspondences

Dense Method (coarse-to-fine approach): Dense feature matching methods aim to find all matching pixel-pairs between the images
coarse feature는 3D supervision을 통해 학습 -> 3D dataset은 매우 비싸고 양이 제한, overfitting 위험

따라서 RoMa는 overfitting을 방지하기 위해 frozen pretrained features using Masked-Image-Modeling(MIM)/DINOv2사용

Propose such a model, leaveraging frozen pretained feature from the fonddation model DINOv2

Combine them with specialized Convnet fine features, creating a precisely localizable feature pyramid.

RoMa는 이미지 간 매칭 및 변형 추정을 통해 장면 재구성과 Localization을 수행한다.
- RoMa 는 MegaDepth, WxBS 등 벤치마크에서 최고성능 (SotA) 를 달
Frozen DINOv2 encoder for coarse features, while using a proposed specialized ConvNet encoder for the fine features

- Dense Method (coarse-to-fine approach): Dense feature matching methods aim to find all matching pixel-pairs between the images
- Coarse feature는 3D supervision을 통해 학습 -> 3D dataset은 매우 비싸고 양이 제한, overfitting 위험
- 따라서 RoMa는 frozen pretrained features를 위해 Masked-Image-Modeling(MIM) / DINOv2 사용
- Frozen DINOv2 encoder for coarse features.
- DINOv2 features are significantly more robust to changes in viewpoint than both ResNet and VGG19(ConvNet)
- VGG19 are worse than the ResNet feature for coarse matching but, better suited for fine localized features
- 따라서 RoMa는 fine features에는 VGG19(ConvNet) 를 사용한다.
기존 ConvNet 대신 Transformer match decoder를 도입하여 anchor probability를 정밀하게 예측하고 multi modal 분포를 모델링한다.
- ConvNet coarse match decoder overfit to the training resolution and tend to be overfit to the training resolution.
- Transformer matcher decoder consists of 5 ViT blocks, with 8 heads, hidden size D 1024, MLP size 4096
- 따라서 transformer decoder without using position encodings (More robust)
Improved loss formulation through regression-by-classification with subseqent robust regression
- Coarse 단계의 multimodal 분포 처리
  - Regression-by-Classification (분류 기반 회귀) : 분포를 격자로 나누고 각 격자의 anchor를 예측하여 매칭
- Refinement 단계의 unimodal(단일모드) 분포 처리
  - Robust Regression Loss : Charbonnier 손실 기반 (작은 오차에 대해 L2와 유사하게 작동, 큰 오차에서는 부드럽게 감소하여 노이즈에 민감하지 않음
- 전체 손실 함수

Model Freezing

: frozen pretrained feature (데이터의 특징 변수는 동결하고), fullty connected layer 변수만 업데이트 (Fine-tuning)

ConvNet ( Convolutional Neural Network )

1. Input layer

Feature Map : (높이, 넓이, 채널) 의 크기를 갖는 3차원의 크기를 가짐

2. Convolutional layer

Kernel
- Kernel(Filter)라는 정사각 행렬을 적용하여 합성곱 연산 수행
- Stride 라는 지정된 간격에 따라 순차적으로 이동
Padding
- 합성곱 연산을 수행할 경우, kernel과 stride의 작용으로 원본의 크기가 줄어듦. feature map의 크기가 작아지는 것을 방지하기 위해 Padding이랑 기법을 사용함.
- 단순히 원본 이미지에 0이라는 padding값을 채워넣어 이미지를 확장한 후 합성곱 연산을 적용
ReLu Activation Function( 활성화 함수)
Pooling layer