# MMML-reading-list

**Repository Path**: hehehelloworld/mmml-reading-list

## Basic Information

- **Project Name**: MMML-reading-list
- **Description**: A list of Multi-Modal Machine Learning (MMML) papers that I read every day.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2023-05-19
- **Last Updated**: 2023-05-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# MMML reading list

## Object Detection

* [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/pdf/1506.02640.pdf)
* [Fast R-CNN](https://arxiv.org/pdf/1504.08083.pdf)
* [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/pdf/1506.01497.pdf)
* [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf)
* [Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection](https://arxiv.org/pdf/1912.02424.pdf) [[code]](https://github.com/sfzhang15/ATSS)
* [Dynamic Head: Unifying Object Detection Heads with Attentions](https://arxiv.org/pdf/2106.08322v1.pdf)
* [End-to-End Object Detection with Transformers](https://arxiv.org/pdf/2005.12872.pdf) [[code]](https://github.com/facebookresearch/detr)
* [Rethinking Transformer-based Set Prediction for Object Detection](https://arxiv.org/pdf/2011.10881.pdf)
* [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/pdf/2108.02404.pdf) [[code]](https://github.com/gaopengcuhk/SMCA-DETR)
* [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094) [[code]](https://github.com/dddzg/up-detr)
* [End-to-End Semi-Supervised Object Detection with Soft Teacher](https://arxiv.org/pdf/2106.09018v3.pdf) [[code]](https://github.com/microsoft/SoftTeacher)

## Vision-Language Pretraining (VLP)

* [SimVLM: Simple Visual Language Model Pretraining with Weak Supervision](https://arxiv.org/pdf/2108.10904.pdf)
  * What: A VLP model focused mainly on language-generation tasks (e.g., VQA, image captioning). It beats VinVL by about 4 points on VQA.
  * Why: To compensate for the inherent weakness of MLM-style pretraining on generation tasks.
  * How: Simple and direct: feed the image together with a prefix of the caption into the encoder, then train an auto-regressive language model to generate the rest of the caption (the PrefixLM objective). A minimal sketch of this objective follows the list below.
* [Align before Fuse: Vision and Language Representation Learning with Momentum Distillation](https://arxiv.org/pdf/2107.07651.pdf) [[code]](https://github.com/salesforce/ALBEF)
  * What: A new VLP model that learns more efficiently from large, noisy datasets.
  * Why: Earlier VLP models never learned to align text and image at the embedding level before fusion, which lowers efficiency. In addition, because web-scale datasets are relatively noisy, training directly on MLM-style targets hurts generalization.
  * How: Encode the image and text separately (ViT/BERT); before any cross-modal interaction, align the text and image embeddings with a contrastive objective (MoCo-style momentum queues). Then minimize the KL divergence between the model's predictions and those of a momentum copy of the model. Results show better learning on noisy datasets. See the sketch after this list.
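The two "How" notes above describe concrete training objectives, so here are two small, self-contained sketches. They are illustrative approximations, not the authors' implementations: every class name, dimension, and hyperparameter below is an assumption made for the example.

First, a minimal PrefixLM-style objective in the spirit of SimVLM: image patch features and a caption prefix form a bidirectional context, and the model is trained auto-regressively to generate the remaining caption tokens (positional embeddings are omitted for brevity).

```python
# Minimal, illustrative PrefixLM-style objective (not the SimVLM code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPrefixLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(768, d_model)  # project image patch features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, caption_ids, prefix_len):
        img = self.patch_proj(patch_feats)        # (B, P, D)
        txt = self.tok_emb(caption_ids)           # (B, T, D)
        x = torch.cat([img, txt], dim=1)          # (B, P+T, D)
        L, P = x.size(1), img.size(1)
        # Attention mask: the [image; caption prefix] block is fully visible,
        # the caption suffix attends causally.
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        mask[:, : P + prefix_len] = 0.0
        h = self.transformer(x, mask=mask)
        logits = self.lm_head(h[:, P:-1])         # position t predicts token t+1
        targets = caption_ids[:, 1:]
        # Only score targets that fall in the to-be-generated suffix.
        keep = torch.arange(targets.size(1)) >= (prefix_len - 1)
        return F.cross_entropy(
            logits[:, keep].reshape(-1, logits.size(-1)),
            targets[:, keep].reshape(-1),
        )

# Toy usage with random data.
model = ToyPrefixLM()
patch_feats = torch.randn(2, 16, 768)           # e.g. ViT patch features
caption_ids = torch.randint(0, 1000, (2, 12))   # tokenized caption
loss = model(patch_feats, caption_ids, prefix_len=4)
loss.backward()
```

Second, a sketch of ALBEF's "align before fuse" step: an image-text contrastive loss whose targets are softened by a momentum (EMA) copy of the encoders. For simplicity this version uses in-batch negatives instead of ALBEF's momentum queues.

```python
# Illustrative image-text contrastive loss with momentum distillation
# (in-batch negatives; not the ALBEF implementation).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for ViT / BERT: projects pre-extracted features to a shared space."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

img_enc, txt_enc = TinyEncoder(768), TinyEncoder(512)
# Momentum copies: updated by EMA, never by gradients.
img_enc_m, txt_enc_m = copy.deepcopy(img_enc), copy.deepcopy(txt_enc)
for p in list(img_enc_m.parameters()) + list(txt_enc_m.parameters()):
    p.requires_grad = False

def ema_update(model, model_m, m=0.995):
    for p, p_m in zip(model.parameters(), model_m.parameters()):
        p_m.data = m * p_m.data + (1.0 - m) * p.data

def itc_loss_with_distillation(img_feats, txt_feats, temp=0.07, alpha=0.4):
    v, t = img_enc(img_feats), txt_enc(txt_feats)     # (B, D) embeddings
    with torch.no_grad():                             # momentum model gives soft targets
        v_m, t_m = img_enc_m(img_feats), txt_enc_m(txt_feats)
        soft_i2t = F.softmax(v_m @ t_m.T / temp, dim=1)
        soft_t2i = F.softmax(t_m @ v_m.T / temp, dim=1)
    hard = torch.eye(v.size(0))                       # matched pairs on the diagonal
    tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard
    tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard
    logp_i2t = F.log_softmax(v @ t.T / temp, dim=1)
    logp_t2i = F.log_softmax(t @ v.T / temp, dim=1)
    # Cross-entropy against the mixed targets; the distillation part equals the
    # KL between online and momentum predictions up to a constant.
    return -(tgt_i2t * logp_i2t).sum(1).mean() - (tgt_t2i * logp_t2i).sum(1).mean()

# Toy usage with random pre-extracted features.
loss = itc_loss_with_distillation(torch.randn(8, 768), torch.randn(8, 512))
loss.backward()
ema_update(img_enc, img_enc_m); ema_update(txt_enc, txt_enc_m)
```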
## Contrastive Learning

* [Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/pdf/1911.05722.pdf) [[code]](https://github.com/facebookresearch/moco)
* [Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere](https://arxiv.org/pdf/2005.10242.pdf) [[code]](https://github.com/SsnL/align_uniform)
* [Multimodal Few-Shot Learning with Frozen Language Models](https://arxiv.org/pdf/2106.13884.pdf)

## Embodied Vision-Language Planning

* (survey) [Core Challenges in Embodied Vision-Language Planning](https://arxiv.org/pdf/2106.13948.pdf)
* [The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation](https://arxiv.org/pdf/2010.12639.pdf)
* [Sim-to-Real Transfer for Vision-and-Language Navigation](https://arxiv.org/pdf/2011.03807.pdf) [[code]](https://github.com/batra-mlp-lab/vln-sim2real)

## Scene Graph Generation

* (survey) [Scene Graphs: A Survey of Generations and Applications](https://arxiv.org/pdf/2104.01111.pdf)
  * What:
  * Why:
  * How:

## Prompt Tuning

* (survey) [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf)
* [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)
* [PPT: Pre-trained Prompt Tuning for Few-shot Learning](https://arxiv.org/pdf/2109.04332.pdf)

## Datasets

* [The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale](https://arxiv.org/pdf/1811.00982.pdf)
* [Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations](https://visualgenome.org/static/paper/Visual_Genome.pdf)
* [Microsoft COCO: Common Objects in Context](https://arxiv.org/pdf/1405.0312.pdf)
* [LVIS: A Dataset for Large Vocabulary Instance Segmentation](https://arxiv.org/pdf/1908.03195.pdf)