[Paper] DETR: End-to-End Object Detection with Transformers (ECCV 2020)


Paper: https://arxiv.org/abs/2005.12872

Code: https://github.com/facebookresearch/detr

Reference: https://wikidocs.net/145910

DETR (DEtection TRansformer) is an object detection model released by the Facebook AI team in 2020.

Network architecture

DETR consists of a CNN backbone + Transformer + FFN. It can be implemented in PyTorch with only a small amount of code.

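CNN backbone

Passing the image through the CNN backbone produces a feature map.

This feature map has to be reshaped before it can go into the transformer.

Because ResNet50 is used, the feature map is C x H x W (C = 2048, H = H_0/32, W = W_0/32). A 1x1 Conv is applied to it,

which changes the shape to d x H x W (C > d).

The transformer expects a sequence of vectors, so the spatial dimensions are flattened and the shape becomes d x HW.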
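The reshaping steps above can be sketched in a few lines of PyTorch. This is only an illustration, not the repository code: the hidden size d = 256, the input size 640 x 480, and the name `input_proj` are assumptions chosen for the example, and a random tensor stands in for the real backbone output.

```python
import torch
from torch import nn

# Assumed sizes for illustration: C = 2048 (ResNet50 output), d = 256.
C, d = 2048, 256
H0, W0 = 640, 480            # hypothetical input image size
H, W = H0 // 32, W0 // 32    # the backbone downsamples by a factor of 32

features = torch.randn(1, C, H, W)            # stand-in for the backbone output
input_proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: C channels -> d channels

projected = input_proj(features)   # (1, d, H, W)
sequence = projected.flatten(2)    # (1, d, H*W)
# nn.MultiheadAttention (default batch_first=False) expects (sequence, batch, d).
sequence = sequence.permute(2, 0, 1)   # (H*W, 1, d)
```

Each of the H*W positions of the feature map becomes one d-dimensional token of the sequence.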

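# Excerpt from the backbone code of the DETR repository.
# BackboneBase, FrozenBatchNorm2d, Joiner, is_main_process and
# build_position_encoding are defined elsewhere in that repository.
class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        # Look up the ResNet constructor by name (e.g. 'resnet50') in torchvision.
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
        # ResNet18/34 end with 512 channels, the deeper variants with 2048.
        num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
        super().__init__(backbone, train_backbone, num_channels, return_interm_layers)

def build_backbone(args):
    position_embedding = build_position_encoding(args)
    # The backbone is trained only when its learning rate is positive.
    train_backbone = args.lr_backbone > 0
    # Intermediate layers are returned only when masks are requested.
    return_interm_layers = args.masks
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    # Joiner combines the backbone with the positional encoding module.
    model = Joiner(backbone, position_embedding)
    model.num_channels = backbone.num_channels
    return model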

Transformer

-Encoder

In the encoder (the blue box on the left of the architecture figure), positional encoding information is added to the d x HW feature matrix, which is then passed through multi-head self-attention.
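A minimal sketch of that self-attention step, using `nn.MultiheadAttention` as the encoder layer does. The sizes (d_model = 256, 8 heads, 300 tokens) are assumptions, and random tensors stand in for both the features and the positional encoding. Here the positional encoding is added to the queries and keys while the values stay as the plain features, which is one way to realize the description above.

```python
import torch
from torch import nn

d_model, nhead = 256, 8      # assumed sizes
seq_len, batch = 300, 1      # H*W tokens, one image

src = torch.randn(seq_len, batch, d_model)   # stand-in for the flattened features
pos = torch.randn(seq_len, batch, d_model)   # placeholder for the positional encoding

self_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)
norm = nn.LayerNorm(d_model)

# Positions are added to the queries and keys; the values are the raw features.
q = k = src + pos
attn_out, attn_weights = self_attn(q, k, value=src)

# Residual connection followed by layer normalization.
out = norm(src + attn_out)
```

`attn_weights` has shape (batch, HW, HW): every position of the feature map attends to every other position, which is what gives the encoder a global view of the image.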

class TransformerEncoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        # Multi-head self-attention over the image feature sequence
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        # One LayerNorm and one Dropout per sub-layer (attention, feedforward)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        # Whether LayerNorm is applied before (pre-norm) or after each sub-layer
        self.normalize_before = normalize_before

def build_transformer(args):
    # Builds the full encoder-decoder from the command-line arguments.
    return Transformer(
        d_model=args.hidden_dim,
        dropout=args.dropout,
        nhead=args.nheads,
        dim_feedforward=args.dim_feedforward,
        num_encoder_layers=args.enc_layers,
        num_decoder_layers=args.dec_layers,
        normalize_before=args.pre_norm,
        return_intermediate_dec=True,
    )

-Decoder

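N object queries are created, one for each of the N bounding boxes to predict. The object queries are learned embeddings, and the decoder input they are added to is initialized to 0. The decoder takes these N object queries, passes them through multi-head self-attention, and outputs N processed units.

These N units then act as the Query, and the encoder output units act as the Key and Value, in the encoder-decoder multi-head attention.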
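The two attention steps of the decoder can be sketched as follows. This is a simplified illustration under assumed sizes (N = 100 queries, d_model = 256, 300 encoder tokens): LayerNorm, dropout, the feedforward block, and the positional encoding on the encoder keys are left out, and a random tensor stands in for the encoder output.

```python
import torch
from torch import nn

d_model, nhead = 256, 8                      # assumed sizes
num_queries, batch, seq_len = 100, 1, 300    # N queries, one image, HW tokens

# Learned object queries, one d_model-dimensional vector per query.
query_embed = nn.Embedding(num_queries, d_model)
query_pos = query_embed.weight.unsqueeze(1).repeat(1, batch, 1)   # (N, batch, d)
tgt = torch.zeros_like(query_pos)            # the decoder input starts at zero

memory = torch.randn(seq_len, batch, d_model)   # stand-in for the encoder output

self_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)
cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)

# 1) Self-attention among the N object queries.
q = k = tgt + query_pos
tgt = tgt + self_attn(q, k, value=tgt)[0]

# 2) Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
tgt = tgt + cross_attn(query=tgt + query_pos, key=memory, value=memory)[0]
```

The output keeps the shape (N, batch, d_model), so each object query ends up as one vector that summarizes the image region it attended to.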

class TransformerDecoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        # Self-attention among the N object queries
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        # One LayerNorm and one Dropout per sub-layer (two attentions, feedforward)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

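The N units pass through an FFN, which outputs a class and a bounding box for each unit. Bipartite matching assigns each ground-truth object to exactly one prediction, so the model is pushed away from producing duplicate bounding boxes for the same object.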
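A rough sketch of such prediction heads is given below. The layout (a single linear layer for the class and a small MLP followed by a sigmoid for the box) and the sizes (d_model = 256, 91 classes, N = 100) are assumptions for illustration; `class_head` and `box_head` are hypothetical names.

```python
import torch
from torch import nn

d_model, num_classes, num_queries = 256, 91, 100   # assumed sizes

# Class head: one extra output for the "no object" class.
class_head = nn.Linear(d_model, num_classes + 1)

# Box head: a small MLP that outputs 4 numbers per query.
box_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),
)

decoder_out = torch.randn(1, num_queries, d_model)   # stand-in for the decoder output

logits = class_head(decoder_out)            # (1, N, num_classes + 1)
boxes = box_head(decoder_out).sigmoid()     # (1, N, 4), each value in [0, 1]
```

The sigmoid keeps the box values in [0, 1], so they can be read as coordinates relative to the image size. Queries that find no object are expected to predict the extra "no object" class.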

Loss function & Training

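σ: a permutation of N elements that assigns prediction σ(i) to ground-truth object i
σ_hat: the permutation that minimizes the total L_match
y: the ground-truth set of objects (padded with "no object" up to size N) // y_hat: the set of N predicted objects
c: class label // p(c): predicted probability of belonging to class c
b: position and size of the bounding box (x, y, w, h)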

L_match

L_match takes a low value when the ground-truth bounding box and the predicted bounding box are well matched.
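For reference, the matching cost for one ground-truth object y_i = (c_i, b_i) and the prediction with index σ(i) can be written as follows, using the symbols defined above (∅ denotes "no object", and the hat marks predicted quantities):

```latex
\mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)
  = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\sigma(i)}\bigr)
```

Both terms vanish for padded "no object" entries, so only real ground-truth objects contribute to the matching cost.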

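1-1. p(c): the predicted probability of the ground-truth class. Because the term carries a minus sign, a higher probability gives a smaller L_match.
1-2. L_box: the loss between the ground-truth bounding box and the predicted bounding box. The more similar the two boxes are, the smaller L_match becomes.
The L1 loss grows with the size of the bounding box, so a scale-invariant generalized IoU loss between the boxes is added to compensate.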
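The box loss described above can be sketched in plain Python. This is an illustration, not the repository code: boxes are assumed to be (center x, center y, width, height), and the weights `w_l1 = 5.0` and `w_giou = 2.0` are assumed values for the two terms.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def generalized_iou(a, b):
    """Generalized IoU of two boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Intersection (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest box that encloses both boxes
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (enclose - union) / enclose

def l1_distance(gt, pred):
    return sum(abs(g - p) for g, p in zip(gt, pred))

def box_loss(gt, pred, w_l1=5.0, w_giou=2.0):
    """Weighted sum of the L1 loss and the generalized IoU loss."""
    return w_l1 * l1_distance(gt, pred) + w_giou * (1.0 - generalized_iou(gt, pred))

# The same relative offset at two scales: L1 doubles, generalized IoU does not change.
small_gt, small_pred = (0.5, 0.5, 0.2, 0.2), (0.55, 0.5, 0.2, 0.2)
large_gt, large_pred = (0.5, 0.5, 0.4, 0.4), (0.6, 0.5, 0.4, 0.4)
```

Comparing `l1_distance` and `generalized_iou` on the small and large pairs shows why the two terms are combined: the L1 term penalizes the large pair twice as much for the same relative error, while the generalized IoU term treats both pairs the same.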

σ_hat

Find the ordering σ_hat of the predicted bounding boxes that minimizes the total L_match.
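To make the search for σ_hat concrete, here is a brute-force sketch that tries every assignment on a tiny, made-up cost matrix. It swaps in exhaustive search for the Hungarian algorithm, which reaches the same minimum far more efficiently; `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations

def best_assignment(cost):
    """Return (sigma, total) minimizing sum(cost[i][sigma[i]]).

    cost[i][j] is the matching cost between ground-truth object i and
    prediction j. Each prediction is used at most once.
    """
    num_gt, num_pred = len(cost), len(cost[0])
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(num_pred), num_gt):
        total = sum(cost[i][sigma[i]] for i in range(num_gt))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total

# Made-up costs: 2 ground-truth objects (rows), 3 predictions (columns).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.3, 0.8],
]
sigma_hat, total = best_assignment(cost)
```

Here `sigma_hat` is `(1, 0)`: ground-truth object 0 is matched to prediction 1 and object 1 to prediction 0. The remaining prediction stays unmatched and is treated as "no object".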

Hungarian loss

Once σ_hat is found, the bipartite matching is complete and the loss can be computed. The loss used is the Hungarian loss, which is similar to the losses used in typical object detection models. Training proceeds by minimizing this loss.
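With the symbols defined at the start of this section, the Hungarian loss over the matched pairs can be written as:

```latex
\mathcal{L}_{\text{Hungarian}}(y, \hat{y})
  = \sum_{i=1}^{N} \Bigl[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\hat{\sigma}(i)}\bigr) \Bigr]
```

The classification term is a negative log-likelihood computed for every one of the N slots, including the "no object" ones, while the box term only applies to slots matched to a real object.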
