Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
Intel Labs

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at this https URL.

1. Motivation

2. Method

Figure 1. Left: Architecture overview. The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token (red) is added. The tokens are passed through multiple transformer stages. We reassemble tokens from different stages into an image-like representation at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample the representations to generate a fine-grained prediction. Center: Overview of the Reassemble operation. Tokens are assembled into feature maps with 1/s the spatial resolution of the input image. Right: Fusion blocks combine features using residual convolutional units [25] and upsample the feature maps.
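The token-to-image step in the caption can be illustrated with a minimal pure-Python sketch. It covers only the spatial rearrangement of patch tokens into a grid. Several details are assumptions made for illustration and are not stated in the caption: the readout token is stored first and is simply dropped, patch tokens are in row-major order, and the function name `reassemble_grid` and its parameters are hypothetical. The subsequent resampling to 1/s of the input resolution is not shown.

```python
from typing import List, Sequence

Token = Sequence[float]


def reassemble_grid(
    tokens: Sequence[Token],
    image_h: int,
    image_w: int,
    patch_size: int,
    has_readout: bool = True,
) -> List[List[Token]]:
    """Arrange a flat token sequence into an (H/p) x (W/p) grid of feature vectors.

    Assumptions (illustrative only): the readout token, if present, is the
    first token and is discarded; patch tokens follow in row-major order.
    """
    if image_h % patch_size or image_w % patch_size:
        raise ValueError("image size must be divisible by the patch size")

    grid_h = image_h // patch_size
    grid_w = image_w // patch_size

    # Drop the patch-independent readout token, which has no spatial position.
    patch_tokens = list(tokens[1:] if has_readout else tokens)
    if len(patch_tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")

    # Slice the flat sequence into rows of grid_w tokens each.
    return [
        patch_tokens[row * grid_w:(row + 1) * grid_w]
        for row in range(grid_h)
    ]


# Usage: a 32 x 48 image with 16 x 16 patches gives a 2 x 3 grid of tokens.
tokens = [[-1.0]] + [[float(i)] for i in range(6)]  # readout token + 6 patch tokens
grid = reassemble_grid(tokens, image_h=32, image_w=48, patch_size=16)
```

In a real model each grid cell would hold a D-dimensional feature vector, and the grid would then be resampled with convolutions to the target resolution.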

3. Experiments

During training, we use 36 RGB-D images (6, 21, and 9 images from the 2001, 2006, and 2014 datasets, respectively) from the Middlebury dataset. To evaluate the performance of our PMBANet, we test on 6 standard depth maps (Art, Books, Moebius, Dolls, Laundry, Reindeer) from Middlebury 2005 and 4 standard depth maps (Tsukuba, Venus, Teddy, Cones) from Middlebury 2003.

We augment the training dataset with 180° rotations and randomly extract more than 10,000 depth patches of a fixed size of 16 × 16 from the LR depth maps. The corresponding HR depth patches are squares of side 32, 64, 128, and 256 for upscaling factors of 2, 4, 8, and 16, respectively. As in other works, we use the Mean Absolute Difference (MAD), the Root Mean Square Error (RMSE), and the percentage of error pixels (PE) to measure the difference between the predicted depth map and the corresponding ground truth. Lower MAD and PE values indicate better performance.
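The three metrics named above can be written out as a short pure-Python sketch. The text does not give the exact definitions used in the evaluation, so the forms below are the common ones and should be read as assumptions. In particular, PE is taken here to be the percentage of pixels whose absolute error exceeds a threshold, and the default threshold of 1.0 is a hypothetical choice. The function names are illustrative.

```python
import math
from typing import List, Sequence

DepthMap = Sequence[Sequence[float]]


def _abs_diffs(pred: DepthMap, gt: DepthMap) -> List[float]:
    """Flatten two equally shaped depth maps into per-pixel absolute errors."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("depth maps must have the same shape")
    diffs = [abs(p - g) for prow, grow in zip(pred, gt) for p, g in zip(prow, grow)]
    if not diffs:
        raise ValueError("depth maps must not be empty")
    return diffs


def mad(pred: DepthMap, gt: DepthMap) -> float:
    """Mean Absolute Difference over all pixels."""
    diffs = _abs_diffs(pred, gt)
    return sum(diffs) / len(diffs)


def rmse(pred: DepthMap, gt: DepthMap) -> float:
    """Root Mean Square Error over all pixels."""
    diffs = _abs_diffs(pred, gt)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def percent_error_pixels(pred: DepthMap, gt: DepthMap, threshold: float = 1.0) -> float:
    """Percentage of pixels whose absolute error exceeds `threshold`.

    The threshold value is an assumption; the source does not specify it.
    """
    diffs = _abs_diffs(pred, gt)
    return 100.0 * sum(d > threshold for d in diffs) / len(diffs)


# Usage: one of four pixels is off by 3 depth units.
pred = [[1.0, 2.0], [3.0, 7.0]]
gt = [[1.0, 2.0], [3.0, 4.0]]
print(mad(pred, gt), rmse(pred, gt), percent_error_pixels(pred, gt))
```

All three metrics are error measures, so lower values mean the predicted depth map is closer to the ground truth.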

Quantitative depth SR results (in MAD) on Middlebury 2005 dataset.

| Method | Art ×4 | Art ×8 | Art ×16 | Books ×4 | Books ×8 | Books ×16 | Dolls ×4 | Dolls ×8 | Dolls ×16 | Laundry ×4 | Laundry ×8 | Laundry ×16 | Moebius ×4 | Moebius ×8 | Moebius ×16 | Reindeer ×4 | Reindeer ×8 | Reindeer ×16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLMF | 0.76/8.12 | 1.44/17.28 | 2.87/33.25 | 0.28/3.27 | 0.51/7.25 | 1.02/16.09 | 0.34/4.40 | 0.60/8.76 | 1.01/18.32 | 0.50/5.50 | 0.80/12.67 | 1.67/25.40 | 0.29/4.13 | 0.51/8.42 | 0.97/17.27 | 0.51/4.65 | 0.84/9.96 | 1.55/18.34 |
| JGF | 0.47/3.25 | 0.78/7.39 | 1.54/14.31 | 0.24/2.14 | 0.43/5.41 | 0.81/12.05 | 0.33/3.23 | 0.59/7.29 | 1.06/15.87 | 0.36/2.60 | 0.64/4.54 | 1.20/8.69 | 0.25/3.36 | 0.46/6.45 | 0.80/12.33 | 0.38/2.27 | 0.64/5.17 | 1.09/11.84 |
| EDGE | 0.65/6.82 | 1.03/13.49 | 2.11/25.90 | 0.30/3.35 | 0.56/8.50 | 1.03/19.32 | 0.31/2.90 | 0.56/6.84 | 1.05/17.97 | 0.32/2.82 | 0.54/5.46 | 1.14/13.57 | 0.29/3.72 | 0.51/7.36 | 1.10/14.05 | 0.37/2.67 | 0.63/6.22 | 1.28/16.80 |
| TGV | 0.65/5.14 | 1.17/10.51 | 2.30/21.37 | 0.27/2.48 | 0.42/4.65 | 0.82/11.20 | 0.33/4.45 | 0.70/11.12 | 2.20/45.54 | 0.55/6.99 | 1.22/16.32 | 3.37/53.61 | 0.29/3.68 | 0.49/6.84 | 0.90/14.09 | 0.49/4.67 | 1.03/11.22 | 3.05/43.48 |
| KSVD | 0.64/3.46 | 0.81/5.18 | 1.47/8.39 | 0.23/2.13 | 0.52/3.97 | 0.76/8.76 | 0.34/4.53 | 0.56/6.18 | 0.82/12.98 | 0.35/2.19 | 0.52/3.89 | 1.08/8.79 | 0.28/2.08 | 0.48/4.86 | 0.81/8.97 | 0.47/2.19 | 0.57/5.76 | 0.99/12.67 |
| CDLLC | 0.53/2.86 | 0.76/4.59 | 1.41/7.53 | 0.19/1.34 | 0.46/3.67 | 0.75/8.12 | 0.31/4.61 | 0.53/5.94 | 0.79/12.64 | 0.30/2.08 | 0.48/3.77 | 0.96/8.25 | 0.27/1.98 | 0.46/4.59 | 0.79/7.89 | 0.43/2.09 | 0.55/5.39 | 0.98/11.49 |
| PB | 0.79/3.12 | 0.93/6.18 | 1.98/12.34 | 0.16/1.39 | 0.43/3.34 | 0.79/8.12 | 0.53/3.99 | 0.83/6.22 | 0.99/12.86 | 1.13/2.68 | 1.89/5.62 | 2.87/11.76 | 0.17/1.95 | 0.47/4.12 | 0.82/8.32 | 0.56/6.04 | 0.97/12.17 | 1.89/21.35 |
| EG | 0.48/2.48 | 0.71/3.31 | 1.35/5.88 | 0.15/1.23 | 0.36/3.09 | 0.70/7.58 | 0.27/2.72 | 0.49/5.59 | 0.74/12.06 | 0.28/1.62 | 0.45/2.86 | 0.92/7.87 | 0.23/1.88 | 0.42/4.29 | 0.75/7.63 | 0.36/1.97 | 0.51/4.31 | 0.95/9.27 |
| SRCNN | 0.63/7.61 | 1.21/14.54 | 2.34/23.65 | 0.25/2.88 | 0.52/7.98 | 0.97/15.24 | 0.29/3.93 | 0.58/8.34 | 1.03/16.13 | 0.40/6.25 | 0.87/13.63 | 1.74/24.84 | 0.25/3.63 | 0.43/7.28 | 0.87/14.53 | 0.35/3.84 | 0.75/7.98 | 1.47/14.78 |
| DSP | 0.73/7.83 | 1.56/15.21 | 3.03/31.32 | 0.28/3.19 | 0.61/8.52 | 1.31/16.73 | 0.32/4.