# MMML-reading-list

**Repository Path**: hehehelloworld/mmml-reading-list

## Basic Information

- **Project Name**: MMML-reading-list
- **Description**: A list of Multi-Modal Machine Learning (MMML) papers that I read every day.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2023-05-19
- **Last Updated**: 2023-05-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# MMML reading list

## Object Detection

* [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/pdf/1506.02640.pdf)
* [Fast R-CNN](https://arxiv.org/pdf/1504.08083.pdf)
* [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/pdf/1506.01497.pdf)
* [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)
* [Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection](https://arxiv.org/pdf/1912.02424.pdf) [[code]](https://github.com/sfzhang15/ATSS)
* [Dynamic Head: Unifying Object Detection Heads with Attentions](https://arxiv.org/pdf/2106.08322v1.pdf)
* [End-to-End Object Detection with Transformers](https://arxiv.org/pdf/2005.12872.pdf) [[code]](https://github.com/facebookresearch/detr)
* [Rethinking Transformer-based Set Prediction for Object Detection](https://arxiv.org/pdf/2011.10881.pdf)
* [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/pdf/2108.02404.pdf) [[code]](https://github.com/gaopengcuhk/SMCA-DETR)
* [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094) [[code]](https://github.com/dddzg/up-detr)
* [End-to-End Semi-Supervised Object Detection with Soft Teacher](https://arxiv.org/pdf/2106.09018v3.pdf) [[code]](https://github.com/microsoft/SoftTeacher)

## Vision-Language Pretraining (VLP)

* [SimVLM: Simple Visual Language Model Pretraining with Weak Supervision](https://arxiv.org/pdf/2108.10904.pdf)
  * What: A VLP model focused mainly on language-generation tasks (e.g., VQA, image captioning). It beats VinVL by about 4 points on VQA.
  * Why: To compensate for the inherent weakness of MLM-style pretraining on generation tasks.
  * How: Simple and direct: feed the image together with a prefix of the caption into the encoder, then train an auto-regressive language model to generate the rest of the caption (the PrefixLM objective). A minimal sketch of this objective follows the list below.
* [Align before Fuse: Vision and Language Representation Learning with Momentum Distillation](https://arxiv.org/pdf/2107.07651.pdf) [[code]](https://github.com/salesforce/ALBEF)
  * What: A new VLP model that learns more efficiently from large, noisy datasets.
  * Why: Earlier VLP models never learned to align text and image at the embedding level before fusion, which lowers efficiency. In addition, because web-scale datasets are relatively noisy, training directly on MLM-style targets hurts generalization.
  * How: Encode the image and text separately (ViT/BERT); before any cross-modal interaction, align the text and image embeddings with a contrastive objective (MoCo-style momentum queues). Then minimize the KL divergence between the model's predictions and those of a momentum copy of the model. Results show better learning on noisy datasets. See the sketch after this list.
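The two "How" notes above describe concrete training objectives, so here are two small, self-contained sketches. They are illustrative approximations, not the authors' implementations: every class name, dimension, and hyperparameter below is an assumption made for the example.

First, a minimal PrefixLM-style objective in the spirit of SimVLM: image patch features and a caption prefix form a bidirectional context, and the model is trained auto-regressively to generate the remaining caption tokens (positional embeddings are omitted for brevity).

```python
# Minimal, illustrative PrefixLM-style objective (not the SimVLM code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPrefixLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(768, d_model)  # project image patch features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, caption_ids, prefix_len):
        img = self.patch_proj(patch_feats)        # (B, P, D)
        txt = self.tok_emb(caption_ids)           # (B, T, D)
        x = torch.cat([img, txt], dim=1)          # (B, P+T, D)
        L, P = x.size(1), img.size(1)
        # Attention mask: the [image; caption prefix] block is fully visible,
        # the caption suffix attends causally.
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        mask[:, : P + prefix_len] = 0.0
        h = self.transformer(x, mask=mask)
        logits = self.lm_head(h[:, P:-1])         # position t predicts token t+1
        targets = caption_ids[:, 1:]
        # Only score targets that fall in the to-be-generated suffix.
        keep = torch.arange(targets.size(1)) >= (prefix_len - 1)
        return F.cross_entropy(
            logits[:, keep].reshape(-1, logits.size(-1)),
            targets[:, keep].reshape(-1),
        )

# Toy usage with random data.
model = ToyPrefixLM()
patch_feats = torch.randn(2, 16, 768)           # e.g. ViT patch features
caption_ids = torch.randint(0, 1000, (2, 12))   # tokenized caption
loss = model(patch_feats, caption_ids, prefix_len=4)
loss.backward()
```

Second, a sketch of ALBEF's "align before fuse" step: an image-text contrastive loss whose targets are softened by a momentum (EMA) copy of the encoders. For simplicity this version uses in-batch negatives instead of ALBEF's momentum queues.

```python
# Illustrative image-text contrastive loss with momentum distillation
# (in-batch negatives; not the ALBEF implementation).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for ViT / BERT: projects pre-extracted features to a shared space."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

img_enc, txt_enc = TinyEncoder(768), TinyEncoder(512)
# Momentum copies: updated by EMA, never by gradients.
img_enc_m, txt_enc_m = copy.deepcopy(img_enc), copy.deepcopy(txt_enc)
for p in list(img_enc_m.parameters()) + list(txt_enc_m.parameters()):
    p.requires_grad = False

def ema_update(model, model_m, m=0.995):
    for p, p_m in zip(model.parameters(), model_m.parameters()):
        p_m.data = m * p_m.data + (1.0 - m) * p.data

def itc_loss_with_distillation(img_feats, txt_feats, temp=0.07, alpha=0.4):
    v, t = img_enc(img_feats), txt_enc(txt_feats)     # (B, D) embeddings
    with torch.no_grad():                             # momentum model gives soft targets
        v_m, t_m = img_enc_m(img_feats), txt_enc_m(txt_feats)
        soft_i2t = F.softmax(v_m @ t_m.T / temp, dim=1)
        soft_t2i = F.softmax(t_m @ v_m.T / temp, dim=1)
    hard = torch.eye(v.size(0))                       # matched pairs on the diagonal
    tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard
    tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard
    logp_i2t = F.log_softmax(v @ t.T / temp, dim=1)
    logp_t2i = F.log_softmax(t @ v.T / temp, dim=1)
    # Cross-entropy against the mixed targets; the distillation part equals the
    # KL between online and momentum predictions up to a constant.
    return -(tgt_i2t * logp_i2t).sum(1).mean() - (tgt_t2i * logp_t2i).sum(1).mean()

# Toy usage with random pre-extracted features.
loss = itc_loss_with_distillation(torch.randn(8, 768), torch.randn(8, 512))
loss.backward()
ema_update(img_enc, img_enc_m); ema_update(txt_enc, txt_enc_m)
```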
## Contrastive Learning

* [Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/pdf/1911.05722.pdf) [[code]](https://github.com/facebookresearch/moco)
* [Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere](https://arxiv.org/pdf/2005.10242.pdf) [[code]](https://github.com/SsnL/align_uniform)
* [Multimodal Few-Shot Learning with Frozen Language Models](https://arxiv.org/pdf/2106.13884.pdf)

## Embodied Vision-Language Planning

* (survey) [Core Challenges in Embodied Vision-Language Planning](https://arxiv.org/pdf/2106.13948.pdf)
* [The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation](https://arxiv.org/pdf/2010.12639.pdf)
* [Sim-to-Real Transfer for Vision-and-Language Navigation](https://arxiv.org/pdf/2011.03807.pdf) [[code]](https://github.com/batra-mlp-lab/vln-sim2real)

## Scene Graph Generation

* (survey) [Scene Graphs: A Survey of Generations and Applications](https://arxiv.org/pdf/2104.01111.pdf)
  * What:
  * Why:
  * How:

## Prompt Tuning

* (survey) [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf)
* [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)
* [PPT: Pre-trained Prompt Tuning for Few-shot Learning](https://arxiv.org/pdf/2109.04332.pdf)

## Datasets

* [The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale](https://arxiv.org/pdf/1811.00982.pdf)
* [Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations](https://visualgenome.org/static/paper/Visual_Genome.pdf)
* [Microsoft COCO: Common Objects in Context](https://arxiv.org/pdf/1405.0312.pdf)
* [LVIS: A Dataset for Large Vocabulary Instance Segmentation](https://arxiv.org/pdf/1908.03195.pdf)