[2021] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

<aside> 💫 Press ctrl+alt+t to expand or collapse all toggles at once. Viewing this page in light mode is recommended.

This review first gives a general overview of CNNs and Transformers, then looks at ViT in detail.

</aside>

<aside> 📢

<Contents>

</aside>

0. Summary

An overview of what this review covers.

1. CNN

1-1. CNN Introduction

1-2. CNN Application

1-3. Max-pooling & Stride

1-4. Design CNN Architecture

2. Transformer

2-1. Introduction

2-2. Multi-head Attention

2-3. Encoder block

2-4. Decoder with Masking

2-5. Positional Encoding

2-6. Learning rate warm-up and linear decay

2-7. Appendix: Beyond the paper

3. Paper Review

<aside> 💡 Reading sections 1 and 2 first should make the paper below easier to follow.

</aside>

3-1. Introduction

3-2. Related Work

3-3. Method

3-4. Experiments

3-5. Conclusion

4. Code

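The originally provided link is written in JAX, which I am not familiar with, so I explored a PyTorch-based implementation instead.
I did not have enough time to analyze and write the code in detail, so I matched the code to the model architecture and edited it according to my own understanding.
Code link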
Full Architecture

Patch Embeddings
Transformer Encoder
Head
ViT = Patch Embeddings + Transformer Encoder + Head
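
The composition above can be made concrete with some shape bookkeeping. The following is a minimal pure-Python sketch, not the linked PyTorch code: `patchify` and `encoder_input_shape` are hypothetical helper names, the image is assumed to be a nested list in H x W x C layout, and the 224x224 input, 16x16 patch size, and hidden size of 768 correspond to the ViT-Base/16 configuration from the paper. It covers only the patch-splitting step and the resulting sequence length; the learned linear projection, position embeddings, encoder layers, and head are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Returns a list of N = (H // P) * (W // P) patches, each a flat list of
    length P * P * C, in row-major patch order. (Hypothetical helper.)
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append all C channel values
            patches.append(patch)
    return patches


def encoder_input_shape(height, width, channels, patch_size, hidden_dim):
    """Shape bookkeeping for the token sequence fed to the Transformer encoder."""
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels  # length of one flattened patch
    seq_len = num_patches + 1  # +1 for the prepended [class] token
    return {
        "num_patches": num_patches,
        "patch_dim": patch_dim,
        "seq_len": seq_len,
        "token_shape": (seq_len, hidden_dim),
    }


# Toy 4x4 image with 3 channels; each pixel stores its (row, col) position.
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))

# Assumed ViT-Base/16 setting: 224x224 RGB input, hidden size 768.
vit_base = encoder_input_shape(224, 224, 3, patch_size=16, hidden_dim=768)
print(vit_base)
```

For the ViT-Base/16 setting this gives 196 patches of flattened length 768, and a sequence of 197 tokens once the [class] token is prepended. The Head then reads only the [class] token's final representation.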

Fine-tuning example code

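5. Future Directions

CLIP is one of the models you need to understand on the way to recent txt→img and txt+img→img generative models.
The key question is how to connect the embeddings of text data and image data.
I have used CLIP before, but to understand it in depth I plan to read further papers after ViT.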

6. Reference Links

Original paper
Other links I referred to are attached as bookmarks throughout the content.

7. Papers Worth Reading Before ViT

If you have read "Attention Is All You Need", ViT is easy to follow, so I have not listed any additional papers here.

8. Related Papers After ViT, for CLIP

Multimodal Learning
- Paper: "Multimodal Deep Learning" (link)
- Further reading: "A Survey on Multimodal Learning" (link)
Contrastive Learning
- Paper: "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR) (link)
- Further reading: "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" (link)
Zero-Shot Learning
- Paper: "Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly" (link)
- Further reading: "Zero-Shot Learning in Modern NLP" (link)
Text Embedding
- Paper: "Efficient Estimation of Word Representations in Vector Space" (Word2Vec) (link)
- Further reading: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (link), T5
Image Retrieval
- Paper: "Learning to Rank for Content-Based Image Retrieval" (link)
- Further reading: "Large-Scale Image Retrieval with Attentive Deep Local Features" (link)

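9. Closing Thoughts

<aside> 💡 I had mostly worked on NLP before moving on to image generation models, so this was my first time reading a pure vision paper in detail. Because the paper tries to keep the original Transformer as intact as possible, my existing understanding of the Transformer made it easy to read. I could not write up the experiments in detail, so I will finish that part later.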

</aside>