# Arxiv-Daily

**Repository Path**: bit212/Arxiv-Daily

## Basic Information

- **Project Name**: Arxiv-Daily
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-05-14
- **Last Updated**: 2021-05-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Arxiv-Daily

My daily arxiv reading notes.

[2021 March](202103.md)

## CV (Daily)

#### 20210429

##### Vision Transformer

* [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/pdf/2104.13840.pdf)
  Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
  Benchmarked against Swin Transformer, with comparable performance. [code](https://github.com/Meituan-AutoML/Twins) (Zhi Tian, Chunhua Shen)
* [ConTNet: Why not use convolution and transformer at the same time?](https://arxiv.org/pdf/2104.13497.pdf)
  In this work, we innovatively propose ConTNet (Convolution-Transformer Network), combining transformer with ConvNet architectures to provide large receptive fields.
  The novelty is ordinary; the paper can only pile up a list of “advantages” through experiments. [code](https://github.com/yan-hao-tian/ConTNet)
* [HOTR: End-to-End Human-Object Interaction Detection with Transformers](https://arxiv.org/pdf/2104.13682.pdf)
  TASK: Human-Object Interaction (HOI) detection is a task of identifying “a set of interactions” in an image, which involves i) the localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels.
  PROBLEM: Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances.
  METHOD: In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of ⟨human, object, interaction⟩ triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing, which is the main bottleneck of existing methods.
  Applies DETR to HOI; the storyline is still end-to-end: simplify the pipeline and eliminate post-processing.
* [Point Cloud Learning with Transformer](https://arxiv.org/pdf/2104.13636.pdf)
  In this paper, we introduce a novel framework, called Multi-level Multi-scale Point Transformer (MLMSPT), that works directly on the irregular point clouds for representation learning.
  Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales we defined, followed by a multi-level transformer module to aggregate contextual information from different levels of each scale and enhance their interactions.
* [Medical Transformer: Universal Brain Encoder for 3D MRI Analysis](https://arxiv.org/pdf/2104.13633.pdf)
* [Inpainting Transformer for Anomaly Detection](https://arxiv.org/pdf/2104.13897.pdf)

##### Others

* [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/pdf/2104.13921.pdf)
  MOTIVATION: Zero-shot image classification has made promising progress by training the aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP [33]) into a two-stage detector (e.g., Mask R-CNN [17]).
  RESULT: We benchmark the performance on LVIS dataset [15] by holding out all rare categories as novel categories. ViLD obtains 16.1 mask APr with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP50, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. (Tsung-Yi Lin)
* [Shot Contrastive Self-Supervised Learning for Scene Boundary Detection](https://arxiv.org/pdf/2104.13537.pdf)
  TASK: We presented a self-supervised learning approach to learn a shot representation for long-form videos using unlabeled video data.
  MOTIVATION: Our approach is based on the key observation that nearby shots in movies and TV episodes tend to have the same set of actors enacting a cohesive story arc, and are therefore in expectation more similar to each other than a set of randomly selected shots.
  METHOD: We used this observation to consider nearby similar shots as augmented versions of each other and demonstrated that when used in a contrastive learning setting, this augmentation scheme can encode the scene structure more effectively than existing augmentation schemes that are primarily geared towards images and short videos.
* [Efficient Pre-trained Features and Recurrent Pseudo-Labeling in Unsupervised Domain Adaptation](https://arxiv.org/pdf/2104.13486.pdf)
  In this paper, we show how to efficiently opt for the best pre-trained features from seventeen well-known ImageNet models in unsupervised DA problems. In addition, we propose a recurrent pseudo-labeling model using the best pre-trained features (termed PRPL) to improve classification performance.
* [Exploring Relational Context for Multi-Task Dense Prediction](https://arxiv.org/pdf/2104.13874.pdf)
  TASK: We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads. Our goal is to find the most efficient way to refine each task prediction by capturing cross-task contexts dependent on tasks’ relations.
  METHOD: Empirical findings confirm that different source-target task pairs benefit from different context types. To automate the selection process, we propose an Adaptive Task-Relational Context (ATRC) module, which samples the pool of all available contexts for each task pair using neural architecture search and outputs the optimal configuration for deployment.
  (Luc Van Gool)
* [Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation](https://arxiv.org/pdf/2104.13613.pdf)
  Uses self-supervised depth estimation to strengthen domain-adaptive semantic segmentation (the two tasks reinforce each other). [code](https://github.com/qinenergy/corda) (Dengxin Dai, Luc Van Gool)
* [Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank](https://arxiv.org/pdf/2104.13415.pdf)
  Pixel-level contrastive learning for semi-supervised semantic segmentation.
  This module enforces the segmentation network to yield similar pixel-level feature representations for same-class samples across the whole dataset. To achieve this, we maintain a memory bank continuously updated with feature vectors from labeled data. These features are selected based on their quality and relevance for the contrastive learning.
* [FrameExit: Conditional Early Exiting for Efficient Video Recognition](https://arxiv.org/pdf/2104.13400.pdf)
  While existing works focus on selecting a subset of salient frames to reduce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition.
* [AdvHaze: Adversarial Haze Attack](https://arxiv.org/pdf/2104.13673.pdf)
  MOTIVATION: Previous attack methods have mainly focused on applying ℓp-norm-bounded noise perturbations. In this paper, we instead introduce a novel adversarial attack method based on haze, which is a common phenomenon in real-world scenery. Our method can synthesize potentially adversarial haze into an image based on the atmospheric scattering model with high realism and mislead classifiers to predict an incorrect class.
  SIGNIFICANCE: We hope this work can boost the development of non-noise-based adversarial attacks and help evaluate and improve the robustness of DNNs.
* [Contrastive Spatial Reasoning on Multi-View Line Drawings](https://arxiv.org/pdf/2104.13433.pdf)
  Spatial reasoning on multi-view line drawings by state-of-the-art supervised deep networks has recently been shown to yield puzzlingly low performance on the SPARE3D dataset.
  To study the reason behind the low performance and to further our understanding of these tasks, we design controlled experiments on both input data and network designs. Guided by the insights from these experimental results, we propose a simple contrastive learning approach along with other network modifications to improve the baseline performance.
* [LambdaUNet: 2.5D Stroke Lesion Segmentation of Diffusion-weighted MR Images](https://arxiv.org/pdf/2104.13917.pdf)
  Applies LambdaNet to medical image processing.
* [MOD: Benchmark for Military Object Detection](https://arxiv.org/pdf/2104.13763.pdf)

#### 20210428

* [Multimodal Contrastive Training for Visual Representation Learning](https://arxiv.org/pdf/2104.12836.pdf)
  METHOD: Different from VirTex [10], our method not only learns the cross-modal correlation between images and captions, but also exploits intrinsic data properties in a self-supervised manner within each modality.
  RESULT: For example, the visual representations pre-trained on COCO by our method achieve state-of-the-art top-1 validation accuracy of 55.3% on ImageNet classification, under the common transfer protocol.
* [Explaining in Style: Training a GAN to explain a classifier in StyleSpace](https://arxiv.org/pdf/2104.13369.pdf)
  Classifier explainability built on StyleGAN.
  Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties. Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions. (Phillip Isola)
* [BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment](https://arxiv.org/pdf/2104.13371.pdf)
  Three first places at NTIRE 2021.
  BACKGROUND: A recurrent structure is a popular framework choice for the task of video super-resolution.
  The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video.
  METHOD: In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with the enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively.
  RESULTS: In addition to video super-resolution, BasicVSR++ generalizes well to other video restoration tasks such as compressed video enhancement. In NTIRE 2021, BasicVSR++ wins three first places and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges. (Chen Change Loy)
* [Unsupervised 3D Shape Completion through GAN Inversion](https://arxiv.org/pdf/2104.13366.pdf) (Chen Change Loy)
* [Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification](https://arxiv.org/pdf/2104.13298.pdf)
  BACKGROUND: The recent studies of knowledge distillation have discovered that ensembling the “dark knowledge” from multiple teachers or students contributes to creating better soft targets for training, but at the cost of significantly more computations and/or parameters.
  METHOD: In this work, we present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images by propagating and ensembling the knowledge of the other samples in the same mini-batch. Specifically, for each sample of interest, the propagation of knowledge is weighted in accordance with the inter-sample affinities, which are estimated on-the-fly with the current network. (Does this depend on a larger batch size?)
  RESULT: Extensive experiments demonstrate that the lightweight yet effective BAKE consistently boosts the classification performance of various architectures on multiple datasets, e.g., a significant +1.2% gain of ResNet-50 on ImageNet with only +3.7% computational overhead and zero additional parameters. (Hongsheng Li)
* [Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?](https://arxiv.org/pdf/2104.13343.pdf)
  SIGNIFICANCE: Our results show that the winning lottery tickets of FCNs display the key features of CNNs. The ability of such automatic network-simplifying procedure to recover the key features “hand-crafted” in the design of CNNs suggests interesting applications to other datasets and tasks, in order to discover new and efficient architectural inductive biases.
  Funny perspective. (Physician)
* [Adapting ImageNet-scale models to complex distribution shifts with self-learning](https://arxiv.org/pdf/2104.12928.pdf)
  Investigates self-training for domain adaptation on ImageNet-scale datasets.
  EMPIRICAL: While self-learning methods are an important component in many recent domain adaptation techniques, they are not yet comprehensively evaluated on ImageNet-scale datasets common in robustness research. In extensive experiments on ResNet and EfficientNet models, we find that three components are crucial for increasing performance with self-learning: (i) using short update times between the teacher and the student network, (ii) fine-tuning only a few affine parameters distributed across the network, and (iii) leveraging methods from robust classification to counteract the effect of label noise.
  DATASET: We therefore re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 and use a subset of it as a new robustness benchmark (ImageNet-D) which proves to be a more challenging dataset for all current state-of-the-art models (58.2% error) to guide future research efforts at the intersection of robustness and domain adaptation on ImageNet scale.
* [Graphical Modeling for Multi-Source Domain Adaptation](https://arxiv.org/pdf/2104.13057.pdf)
  MOTIVATION: In Multi-Source Domain Adaptation, it is essential to utilize the labeled source data and the unlabeled target data to approach the conditional distribution of semantic label on target domain, which requires the joint modeling across different domains and also an effective domain combination scheme. The graphical structure among different domains is useful to tackle these challenges, in which the interdependency among various instances/categories can be effectively modeled.
  METHOD: In this work, we propose two types of graphical models, i.e., Conditional Random Field for MSDA (CRF-MSDA) and Markov Random Field for MSDA (MRF-MSDA), for cross-domain joint modeling and learnable domain combination. (Bingbing Ni)
* [Unsupervised Multi-Source Domain Adaptation for Person Re-Identification](https://arxiv.org/pdf/2104.12961.pdf)
  TASK: To make full use of the valuable labeled data, we introduce the multi-source concept into UDA person re-ID field, where multiple source datasets are used during training.
  METHOD: In this paper, we try to address this problem from two perspectives, i.e., domain-specific view and domain-fusion view.
  RESULT: The proposed method outperforms state-of-the-art UDA person re-ID methods by a large margin, and even achieves comparable performance to the supervised approaches without any post-processing techniques.
* [Width transfer: on the (in)variance of width optimization](https://arxiv.org/pdf/2104.13255.pdf)
  Optimizing the channel counts for different layers of a CNN has shown great promise in improving the efficiency of CNNs at test-time. In this work, we propose width transfer, a technique that harnesses the assumptions that the optimized widths (or channel counts) are regular across sizes and depths.
* [Dual Transformer for Point Cloud Analysis](https://arxiv.org/pdf/2104.13044.pdf)
* [Every Annotation Counts: Multi-label Deep Supervision for Medical Image Segmentation](https://arxiv.org/pdf/2104.13243.pdf)
  Medical image segmentation.
  Pixel-wise segmentation is one of the most data- and annotation-hungry tasks in our field. Providing representative and accurate annotations is often mission-critical, especially for challenging medical applications.
  METHOD: Our approach is based on a new formulation of deep supervision and student-teacher model and allows for easy integration of different supervision signals. In contrast to previous work, we show that care has to be taken in how deep supervision is integrated in lower layers and we present multi-label deep supervision as the most important secret ingredient for success.
  RESULT: We are able to cut the requirement for expensive labels by 94.22%, narrowing the gap to the best fully supervised baseline to only 5% mean IoU. Our approach is validated by extensive experiments on retinal fluid segmentation and we provide an in-depth analysis of the anticipated effect each annotation type can have in boosting segmentation performance.
  (CVPR'21)
* [Underwater Image Enhancement via Medium Transmission-Guided Multi-Color Space Embedding](https://arxiv.org/pdf/2104.13015.pdf) (Wenqi Ren, TIP)

#### 20210427

##### Vision Transformer

* [Improve Vision Transformers Training by Suppressing Over-smoothing](https://arxiv.org/pdf/2104.12753.pdf)
  ANALYSIS: This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to an over-smoothing problem: the self-attention layers tend to map the different patches from the input image into a similar latent representation, hence yielding the loss of information and degeneration of performance, especially when the number of layers is large.
  TECHNICALLY: We then propose a number of techniques to alleviate this problem, including introducing additional loss functions to encourage diversity, prevent loss of information, and discriminate different patches by additional patch classification loss for Cutmix.
  RESULT: We show that our proposed techniques stabilize the training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on ImageNet validation set without introducing extra teachers or additional convolution layers. [code](https://github.com/ChengyueGongR/PatchVisionTransformer)
* [MDETR - Modulated Detection for End-to-End Multi-Modal Understanding](https://arxiv.org/pdf/2104.12763.pdf)
  PROBLEM: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text.
  Existing cross-modal pre-training treats the detector as a black box and does not exploit it effectively.
  METHOD: In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
  RESULT: In addition to reaching SOTA on downstream cross-modal tasks, we also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. [code](https://github.com/ashkamath/mdetr) (Yann LeCun, Nicolas Carion)
* [Visformer: The Vision-friendly Transformer](https://arxiv.org/pdf/2104.12533.pdf)
  This paper offers an empirical study by performing step-by-step operations to gradually transition a Transformer-based model to a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer.
  RESULT: With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. [code](https://github.com/danczs/Visformer) (Qi Tian)
* [Playing Lottery Tickets with Vision and Language](https://arxiv.org/pdf/2104.11832.pdf)
  Models such as LXMERT, ViLBERT and UNITER have significantly lifted the state of the art over a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice.
  EMPIRICAL: In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models.
  We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments.
  FINDINGS: Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks (i.e., the tickets) that strictly match the performance of the full UNITER model. However, it is encouraging to confirm that we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Adversarial training can be further used to enhance the performance of the found lottery tickets. (Jingjing Liu)
* [M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers](https://arxiv.org/pdf/2104.11896.pdf)
  We present a novel architecture for 3D object detection, M3DETR, which combines different point cloud representations (raw, voxels, bird's-eye view) with different feature scales based on multi-scale feature pyramids.
  SIGNIFICANCE: M3DETR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transformers.
  RESULT: Our method achieves state-of-the-art performance on the KITTI 3D object detection dataset and Waymo Open Dataset. Results show that M3DETR improves the baseline significantly by 1.48% mAP for all classes on Waymo Open Dataset.
  In particular, our approach ranks 1st on the well-known KITTI 3D Detection Benchmark for both car and cyclist classes, and ranks 1st on Waymo Open Dataset with single-frame point cloud input.
* [Diverse Image Inpainting with Bidirectional and Autoregressive Transformers](https://arxiv.org/pdf/2104.12335.pdf) (Shijian Lu)
* [GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization](https://arxiv.org/pdf/2104.12465.pdf)
  TASK: When multi-modal video summarization is used to help video exploration, a text-based query is considered as one of the main drivers of video summary generation, as it is user-defined.
  METHOD: In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
  RESULT: Experimental results show that the proposed model is effective, with an increase of +5.88% in accuracy and +4.06% in F1-score compared with the state-of-the-art method. [code](https://github.com/Jhhuangkay/GPT2MVS-Generative-Pre-trained-Transformer-2-for-Multi-modal-Video-Summarization)
* [Visual Saliency Transformer](https://arxiv.org/pdf/2104.12099.pdf) (Ling Shao)
* [RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory](https://arxiv.org/pdf/2104.11934.pdf)
  TASK: Visual relationship recognition (VRR).
  PROBLEM: Several recent studies showed that the long-tail problem in VRR is even more critical than that in object recognition due to the compositional complexity and structure.
  METHOD: To overcome this limitation, we propose a novel transformer-based framework, dubbed RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels.

##### Others

* [How Well Self-Supervised Pre-Training Performs with Streaming Data?](https://arxiv.org/pdf/2104.12081.pdf)
  Considers streaming data in real-world scenarios.
  CONCEPT: The common self-supervised pre-training practice requires collecting massive unlabeled data together and then trains a representation model, dubbed joint training. However, in real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming. A more efficient alternative is to train a model continually with streaming data, dubbed sequential training.
  PURPOSE: Nevertheless, it is unclear how well sequential self-supervised pre-training performs with streaming data. In this paper, we conduct thorough experiments to investigate self-supervised pre-training with streaming data. Specifically, we evaluate the transfer performance of sequential self-supervised pre-training with four different data sequences on three different downstream tasks and make comparisons with joint self-supervised pre-training.
  FINDINGS: Surprisingly, we find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild. Even for data sequences with large distribution shifts, sequential self-supervised training with simple techniques, e.g., parameter regularization or data replay, still performs comparably to joint training.
  CONCLUSION: Based on our findings, we recommend using sequential self-supervised training as a more efficient yet performance-competitive representation learning practice for real-world applications.
  (Jiashi Feng)
* [Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data](https://arxiv.org/pdf/2104.12673.pdf)
  This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data.
  RESULT: We thoroughly evaluate our framework on large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results.
* [Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos](https://arxiv.org/pdf/2104.12671.pdf)
  This paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities.
  Self-supervised pre-training over three modalities, with no need for paired data! (Shih-Fu Chang)
* [Mutual Contrastive Learning for Visual Representation Learning](https://arxiv.org/pdf/2104.12565.pdf)
  We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of models.
  RESULT: Experimental results on supervised and self-supervised image classification, transfer learning and few-shot learning show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to learn better feature representations.
* [2.5D Visual Relationship Detection](https://arxiv.org/pdf/2104.12727.pdf)
  PROBLEM: Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics.
  NEW TASK: To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is egocentric, using the camera’s viewpoint as a common reference for all 2.5D relationships. Unlike depth estimation, 2.5VRD is object-centric and not only focuses on depth.
  DATASET: To enable progress on this task, we create a new dataset consisting of 220K human-annotated 2.5D relationships among 512K objects from 11K images. Our results show that existing models largely rely on semantic cues and simple heuristics to solve 2.5VRD, motivating further research on models for 2.5D perception.
  (Ming-Hsuan Yang, Boqing Gong)
* [Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets](https://arxiv.org/pdf/2104.12690.pdf)
  A new dataset annotation method for collecting multi-class classification labels for a large collection of images, combining self-supervision with a machine labeler. (CVPR'21, Oral)
* [Carrying out CNN Channel Pruning in a White Box](https://arxiv.org/pdf/2104.11883.pdf) (Rongrong Ji)
* [CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment](https://arxiv.org/pdf/2104.12642.pdf) (ICLR'21)
* [Practical Wide-Angle Portraits Correction with Deep Structured Models](https://arxiv.org/pdf/2104.12464.pdf) (CVPR'21, Haoqiang Fan)
* [Delving into Data: Effectively Substitute Training for Black-box Attack](https://arxiv.org/pdf/2104.12378.pdf) (CVPR'21)
* [StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings](https://arxiv.org/pdf/2104.12290.pdf)
  Funny.
* [Piggyback GAN: Efficient Lifelong Learning for Image Conditioned Generation](https://arxiv.org/pdf/2104.11939.pdf)
* [Clean Images are Hard to Reblur: A New Clue for Deblurring](https://arxiv.org/pdf/2104.12665.pdf)
* [Rich Semantics Improve Few-shot Learning](https://arxiv.org/pdf/2104.12709.pdf)

#### 20210426

* [VidTr: Video Transformer Without Convolutions](https://arxiv.org/pdf/2104.11746.pdf)
  We introduce Video Transformer (VidTr) with separable-attention for video classification. Compared with commonly used 3D networks, VidTr is able to aggregate spatiotemporal information via stacked attentions and provide better performance with higher efficiency.
* [Learning to Cluster Faces via Transformer](https://arxiv.org/pdf/2104.11502.pdf)
  PROBLEM: The main challenge is that it is difficult to cluster images from the same identity with different face poses, occlusions, and image quality.
  METHOD: In this paper, we repurpose the well-known Transformer and introduce a Face Transformer for supervised face clustering.
  In Face Transformer, we decompose the face clustering into two steps: relation encoding and linkage predicting.
* [Skeletor: Skeletal Transformers for Robust Body-Pose Estimation](https://arxiv.org/pdf/2104.11712.pdf)
  Rather than tracking body parts and trying to temporally smooth them, we propose a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion.
* [Deep Lucas-Kanade Homography for Multimodal Image Alignment](https://arxiv.org/pdf/2104.11693.pdf)
  TASK: Estimating homography to align image pairs captured by different sensors or image pairs with large appearance changes is an important and general challenge for many computer vision applications.
  METHOD: In contrast to others, we propose a generic solution to pixel-wise align multimodal image pairs by extending the traditional Lucas-Kanade algorithm with networks.
  Funny.
* [A Closer Look at Self-training for Zero-Label Semantic Segmentation](https://arxiv.org/pdf/2104.11692.pdf)
  PROBLEM: Prior zero-label semantic segmentation works approach this task by learning visual-semantic embeddings or generative models. However, they are prone to overfitting on the seen classes because there is no training signal for them.
  METHOD: We assume that pixels of unseen classes could be present in the training images but without being annotated. Our idea is to capture the latent information on unseen classes by supervising the model with self-produced pseudo-labels for unlabeled pixels. We propose a consistency regularizer to filter out noisy pseudo-labels by taking the intersections of the pseudo-labels generated from different augmentations of the same image. Our framework generates pseudo-labels and then retrains the model with human-annotated and pseudo-labelled data.
* [Skip-Convolutions for Efficient Video Processing](https://arxiv.org/pdf/2104.11487.pdf) We propose Skip-Convolutions to leverage the large amount of redundancies in video streams and save computations. Each video is represented as a series of changes across frames and network activations, denoted as residuals. We reformulate standard convolution to be efficiently computed on residual frames: each layer is coupled with a binary gate deciding whether a residual is important to the model prediction, e.g. foreground regions, or it can be safely skipped, e.g. background regions. RESULT: By replacing all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet, we reduce their computational cost consistently by a factor of 3∼4× for two different tasks, without any accuracy drop.
* [H2O: A Benchmark for Visual Human-human Object Handover Analysis](https://arxiv.org/pdf/2104.11466.pdf) Object handover is a common human collaboration behavior that attracts attention from researchers in Robotics and Cognitive Science. Though visual perception plays an important role in the object handover task, the whole handover process has been specifically explored. In this work, we propose a novel rich-annotated dataset, H2O, for visual analysis of human-human object handovers. The H2O, which contains 18K video clips involving 15 people who hand over 30 objects to each other, is a multi-purpose benchmark. Funny task.
* [Motion Representations for Articulated Animation](https://arxiv.org/pdf/2104.11280.pdf) We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes.
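
The residual formulation in the Skip-Convolutions entry above rests on the linearity of convolution: `conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1})`. Below is a minimal NumPy sketch of that idea, not the paper's implementation. The names `conv2d_valid` and `skip_conv_video` are hypothetical, and the fixed magnitude `threshold` is an assumed stand-in for the binary gate, whose exact form these notes do not record.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """Plain 2D cross-correlation with 'valid' padding and no bias."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out


def skip_conv_video(frames, kernel, threshold=0.0):
    """Convolve the first frame fully, then update with gated residuals.

    Assumption: the gate keeps a residual pixel when its magnitude exceeds
    `threshold`. This sketch only zeroes the skipped pixels; a real
    implementation would avoid computing at those locations.
    """
    kernel = np.asarray(kernel, dtype=float)
    # `reference` is the input the current output corresponds to.
    reference = np.asarray(frames[0], dtype=float)
    output = conv2d_valid(reference, kernel)
    outputs = [output]
    for frame in frames[1:]:
        residual = np.asarray(frame, dtype=float) - reference
        gate = np.abs(residual) > threshold  # binary gate per pixel
        if gate.any():
            kept = residual * gate
            # Linearity: conv(reference + kept) = conv(reference) + conv(kept)
            output = output + conv2d_valid(kept, kernel)
            reference = reference + kept
        # If no pixel passes the gate, the previous output is reused as-is.
        outputs.append(output)
    return outputs
```

With `threshold=0.0` the outputs match a full convolution of every frame. A larger threshold lets small changes accumulate in the residual until they pass the gate, trading accuracy for skipped work.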
#### 20210423

##### Vision Transformer

* [VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text](https://arxiv.org/pdf/2104.11178.pdf) Our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. (Shih-Fu Chang, Boqing Gong)
* [Multiscale Vision Transformers](https://arxiv.org/pdf/2104.11227.pdf) METHOD: We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. RESULT: We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. A very straightforward idea, but an effective one. (Haoqi Fan)
* [ImageNet-21K Pretraining for the Masses](https://arxiv.org/pdf/2104.10972.pdf) This paper aims to close this gap, and make high-quality efficient pretraining on ImageNet-21K available for everyone. Via a dedicated preprocessing stage, utilizing WordNet hierarchies, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks.
We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT. [code](https://github.com/Alibaba-MIIL/ImageNet21K) TODO
* [So-ViT: Mind Visual Tokens for Vision Transformer](https://arxiv.org/pdf/2104.10935.pdf) PROBLEM: However, the high performance of the original ViT heavily depends on pretraining using ultra large-scale datasets, and it significantly underperforms on ImageNet-1K if trained from scratch. METHOD: (1) This paper makes efforts toward addressing this problem, by carefully considering the role of visual tokens. First, for the classification head, existing ViT only exploits the class token while entirely neglecting the rich semantic information inherent in high-level visual tokens. Therefore, we propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification. (2) Second, the original ViT employs the naive embedding of fixed-size image patches, lacking the ability to model translation equivariance and locality. To alleviate this problem, we develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
* [Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet](https://arxiv.org/pdf/2104.10858.pdf) This paper aims to give vision transformers a better trade-off between accuracy and model complexity; Figure 1 provides a fairly comprehensive comparison. PROBLEM: While recent vision transformers have demonstrated promising results in ImageNet classification, their performance still lags behind powerful convolutional neural networks (CNNs) with approximately the same model size. METHOD: In this work, instead of describing a novel transformer architecture, we explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, a new training objective, our models are able to achieve better results than the CNN counterparts and other transformer-based classification models with a similar amount of training parameters and computations. [code](https://github.com/zihangJiang/TokenLabeling) (Jiashi Feng)
* [KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control](https://arxiv.org/pdf/2104.11224.pdf)

##### CVPR21

* [Hierarchical Motion Understanding via Motion Programs](https://arxiv.org/pdf/2104.11216.pdf) (Jiajun Wu)
* [Distilling Audio-Visual Knowledge by Compositional Contrastive Learning](https://arxiv.org/pdf/2104.10955.pdf) Cross-modal contrastive learning. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. [code](https://github.com/yanbeic/CCL) TODO
* [Heterogeneous Grid Convolution for Adaptive, Efficient, and Controllable Computation](https://arxiv.org/pdf/2104.11176.pdf) This paper proposes a novel heterogeneous grid convolution that builds a graph-based image representation by exploiting heterogeneity in the image content, enabling adaptive, efficient, and controllable computations in a convolutional architecture. We have evaluated the proposed approach on four image understanding tasks: semantic segmentation, object localization, road extraction, and salient object detection.
The idea is similar to Efficient Segmentation: Learning Downsampling Near Semantic Boundaries (ICCV'19), but the story is told in terms of graphs.
* [Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/pdf/2104.11116.pdf) In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. (Ziwei Liu)
* [DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation](https://arxiv.org/pdf/2104.10834.pdf) SETTING: It employs an adversarial training with a labeled daytime dataset and an unlabeled dataset that contains coarsely aligned day-night image pairs. METHOD: Specifically, for the unlabeled day-night image pairs, we use the pixel-level predictions of static object categories on a daytime image as a pseudo supervision to segment its counterpart nighttime image. We further design a re-weighting strategy to handle the inaccuracy caused by misalignment between day-night image pairs and wrong predictions of daytime images, as well as boost the prediction accuracy of small objects. RESULT: Extensive experiments on Dark Zurich and Nighttime Driving datasets show that our method achieves state-of-the-art performance for nighttime semantic segmentation.
* [ManipulaTHOR: A Framework for Visual Object Manipulation](https://arxiv.org/pdf/2104.11213.pdf) (Oral)

##### Others

* [Pri3D: Can 3D Priors Help 2D Representation Learning?](https://arxiv.org/pdf/2104.11225.pdf) (Saining Xie)
* [Domain Adaptation for Semantic Segmentation via Patch-Wise Contrastive Learning](https://arxiv.org/pdf/2104.11056.pdf) Unlike many earlier methods that rely on adversarial learning for feature alignment, we leverage contrastive learning to bridge the domain gap by aligning the features of structurally similar label patches across domains. As a result, the networks are easier to train and deliver better performance, particularly with a small number of target domain annotations.
It can also be naturally extended to weakly-supervised domain adaptation, where only a minor drop in accuracy can save up to 75% of annotation cost.
* [Lighting the Darkness in the Deep Learning Era](https://arxiv.org/pdf/2104.10729.pdf) A survey of low-light image enhancement that also provides a new dataset and benchmark for future research. (Ming-Ming Cheng, Chen Change Loy)
* [On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation](https://arxiv.org/pdf/2104.11222.pdf) We investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries. FID score is widely used to evaluate generative models, but each FID implementation uses a different low-level image processing pipeline. OBSERVATION: We observe that numerous subtle choices need to be made for FID calculation and a lack of consistency in these choices can lead to vastly different FID scores. In particular, we show that the following choices are significant: (1) selecting what image resizing library to use, (2) choosing what interpolation kernel to use, (3) what encoding to use when representing images. CONTRIBUTION: We additionally outline numerous common pitfalls that should be avoided and provide recommendations for computing the FID score accurately. We provide an easy-to-use optimized implementation of our proposed recommendations in the accompanying code. (Richard Zhang, Junyan Zhu)
* [FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection](https://arxiv.org/pdf/2104.10956.pdf) However, it is non-trivial to make a general adapted 2D detector work in this 3D task. In this technical report, we study this problem with a practice built on fully convolutional single-stage detector and propose a general framework FCOS3D. RESULT: Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
(Dahua Lin)
* [Fully Convolutional Line Parsing](https://arxiv.org/pdf/2104.11207.pdf) We present a one-stage Fully Convolutional Line Parsing network (F-Clip) that detects line segments from images. F-Clip detects line segments in an end-to-end fashion by predicting them with each line’s center position, length, and angle. [code](https://github.com/Delay-Xili/F-Clip) (Yi Ma)

#### 20210422

* [MetricOpt: Learning to Optimize Black-Box Evaluation Metrics](https://arxiv.org/pdf/2104.10631.pdf) We study the problem of directly optimizing arbitrary non-differentiable task evaluation metrics such as misclassification rate and recall. Our method, named MetricOpt, operates in a black-box setting where the computational details of the target metric are unknown. We achieve this by learning a differentiable value function, which maps compact task-specific model parameters to metric observations. Result: MetricOpt achieves state-of-the-art performance on a variety of metrics for (image) classification, image retrieval and object detection. Solid benefits are found over competing methods, which often involve complex loss design or adaptation. MetricOpt also generalizes well to new tasks and model architectures. (CVPR21, Oral)
* [Temporal Modulation Network for Controllable Space-Time Video Super-Resolution](https://arxiv.org/pdf/2104.10642.pdf) Space-time video super-resolution (STVSR) aims to increase the spatial and temporal resolutions of low-resolution and low-frame-rate videos. (Ming-Ming Cheng)
* [Visualizing Adapted Knowledge in Domain Transfer](https://arxiv.org/pdf/2104.10602.pdf) To understand the adaptation process, we portray their knowledge difference with image translation. Specifically, we feed a translated image and its original version to the two models respectively, formulating two branches. Through updating the translated image, we force similar outputs from the two branches.
When such requirements are met, differences between the two images can compensate for and hence represent the knowledge difference between models. (why?) Funny idea
* [Balanced Knowledge Distillation for Long-tailed Learning](https://arxiv.org/pdf/2104.10510.pdf)
* [Camouflaged Object Segmentation with Distraction Mining](https://arxiv.org/pdf/2104.10475.pdf) We develop a bio-inspired framework, termed Positioning and Focus Network (PFNet), which mimics the process of predation in nature. (Deng-Ping Fan)
* [Fourier Contour Embedding for Arbitrary-Shaped Text Detection](https://arxiv.org/pdf/2104.10442.pdf) One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures.
* [PP-YOLOv2: A Practical Object Detector](https://arxiv.org/pdf/2104.10419.pdf) [code](https://github.com/PaddlePaddle/PaddleDetection)
* [Comprehensive Multi-Modal Interactions for Referring Image Segmentation](https://arxiv.org/pdf/2104.10412.pdf) We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description.
* [Towards Corruption-Agnostic Robust Domain Adaptation](https://arxiv.org/pdf/2104.10376.pdf) In this paper, we investigate a new task, Corruption-agnostic Robust Domain Adaptation (CRDA): to be accurate on original data and robust against unavailable-for-training corruptions on target domains.
TODO
* [Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps](https://arxiv.org/pdf/2104.10386.pdf) [code](https://github.com/yuk6heo/GIS-RAmap) We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. Brings an active-learning-like idea into interactive video segmentation, using reliability to recommend to the user which frames to annotate. (CVPR21 Oral)
* [SRWarp: Generalized Image Super-Resolution under Arbitrary Transformation](https://arxiv.org/pdf/2104.10386.pdf) Recent approaches extend the scope to real-valued upsampling factors, even with varying aspect ratios to handle the limitation. In this paper, we propose the SRWarp framework to further generalize the SR tasks toward an arbitrary image transformation. Compared with previous methods, we do not constrain the SR model on a regular grid but allow numerous possible deformations for flexible and diverse image editing. (CVPR21) Funny task.
* [Invertible Denoising Network: A Light Solution for Real Noise Removal](https://arxiv.org/pdf/2104.10546.pdf) Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion.
* [Auto-FedAvg: Learnable Federated Averaging for Multi-Institutional Medical Image Segmentation](https://arxiv.org/pdf/2104.10195.pdf) Federated learning (FL) enables collaborative model training while preserving each participant’s privacy, which is particularly beneficial to the medical field.
(Alan Yuille)

#### 20210421

* [VideoGPT: Video Generation using VQ-VAE and Transformers](https://arxiv.org/pdf/2104.10157.pdf) We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Funny: views each frame as a word in GPT. [code](https://github.com/wilson1yan/VideoGPT)
* [Understanding Synonymous Referring Expressions via Contrastive Features](https://arxiv.org/pdf/2104.10156.pdf) Task: Referring expression comprehension aims to localize objects identified by natural language descriptions. Motivation: One nature is that each object can be described by synonymous sentences with paraphrases, and such varieties in languages have critical impact on learning a comprehension model. While prior work usually treats each sentence and attends it to an object separately, we focus on learning a referring expression comprehension model that considers the property in synonymous sentences. Method: To this end, we develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels, where features extracted from synonymous sentences to describe the same object should be closer to each other after mapping to the visual domain. Funny Story (Yi-Hsuan Tsai, Ming-Hsuan Yang) [code](https://github.com/wenz116/RefContrast)
* [Variational Relational Point Completion Network](https://arxiv.org/pdf/2104.10154.pdf) (Ziwei Liu)
* [Transformer Transforms Salient Object Detection and Camouflaged Object Detection](https://arxiv.org/pdf/2104.10127.pdf) In this paper, we conduct research on applying the transformer networks for salient object detection (SOD).
Specifically, we adopt the dense transformer backbone for fully supervised RGB image based SOD, RGB-D image pair based SOD, and weakly supervised SOD via scribble supervision. As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation. (Deng-Ping Fan)
* [Contrastive Learning for Sports Video: Unsupervised Player Classification](https://arxiv.org/pdf/2104.10068.pdf) Task: We address the problem of unsupervised classification of players in a team sport according to their team affiliation, when jersey colours and design are not known a priori. Method: We adopt a contrastive learning approach in which an embedding network learns to maximize the distance between representations of players on different teams relative to players on the same team, in a purely unsupervised fashion, without any labelled data. Funny application.
* [Style-Aware Normalized Loss for Improving Arbitrary Style Transfer](https://arxiv.org/pdf/2104.10064.pdf) Neural Style Transfer (NST) has quickly evolved from single-style to infinite-style models, also known as Arbitrary Style Transfer (AST). Problem: more than 50% of the time, AST stylized images are not acceptable to human users, typically due to under- or over-stylization. Insight: Our studies show that this imbalanced style transferability (IST) issue is related to the conventional AST style loss, and reveal that the root cause is the equal weightage of training samples irrespective of the properties of their corresponding style images, which biases the model towards certain styles. Method: Through investigation of the theoretical bounds of the AST style loss, we propose a new loss that largely overcomes IST. Funny story. Long-tailed style distribution?
* [VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization](https://arxiv.org/pdf/2104.10036.pdf)
* [T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval](https://arxiv.org/pdf/2104.10054.pdf) Task: Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions. Problem: The key to this problem is to measure text-video similarities in a joint embedding space. However, most existing methods only consider the global cross-modal similarity and overlook the local details. Method: In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video feature and text feature within the same center. (Yi Yang)
* [Posterior Sampling for Image Restoration using Explicit Patch Priors](https://arxiv.org/pdf/2104.09895.pdf) In this paper, we show how to combine explicit priors on patches of natural images in order to sample from the posterior probability of a full image given a degraded image. Unlike previous approaches that computed a single restoration using MAP or MMSE, our method makes explicit the uncertainty in the restored images and guarantees that all patches in the restored images will be typical given the patch prior.
* [Lighting, Reflectance and Geometry Estimation from 360° Panoramic Stereo](https://arxiv.org/pdf/2104.09886.pdf) Our model takes advantage of the 360° input to observe the entire scene with geometric detail, then jointly estimates the scene’s properties with physical constraints.
* [SelfReg: Self-supervised Contrastive Regularization for Domain Generalization](https://arxiv.org/pdf/2104.09841.pdf) In recent studies, contrastive learning-based domain generalization approaches have been proposed and achieved high performance.
Problem: However, the performance of contrastive learning fundamentally depends on quality and quantity of negative data pairs. (the problem is not stated clearly enough) To address this issue, we propose a new regularization method for domain generalization based on contrastive learning, self-supervised contrastive regularization (SelfReg). The proposed approach uses only positive data pairs, thus it resolves various problems caused by negative pair sampling. (BYOL?) Moreover, we propose a class-specific domain perturbation layer (CDPL), which makes it possible to effectively apply mixup augmentation even when only positive data pairs are used.
* [Detector-Free Weakly Supervised Grounding by Separation](https://arxiv.org/pdf/2104.09829.pdf) Task: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations.
* [CTNet: Context-based Tandem Network for Semantic Segmentation](https://arxiv.org/pdf/2104.09805.pdf) This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information, which can discover the semantic context for semantic segmentation. (Jinhui Tang)
* [SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud](https://arxiv.org/pdf/2104.09804.pdf) [code](https://github.com/Vegeta2020/SE-SSD)
* [Does enhanced shape bias improve neural network robustness to common corruptions](https://arxiv.org/pdf/2104.09789.pdf) It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness.
However, this relationship is only hypothesized. We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct. (ICLR21)
* [Learning Semantic-Aware Dynamics for Video Prediction](https://arxiv.org/pdf/2104.09762.pdf) We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video. (CVPR21)
* [Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation](https://arxiv.org/pdf/2104.09757.pdf) We propose a novel loss for generative models, dubbed as GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. TODO
* [Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information](https://arxiv.org/pdf/2104.09580.pdf) Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.

#### 20210420

* [Does language help generalization in vision models?](https://arxiv.org/pdf/2104.08313.pdf) Questions a claim made in CLIP. PROBLEM: One might assume that these abilities are derived, at least in part, from a “semantic grounding” of the visual feature space, learning meaningful structure by mirroring the space of linguistic representations. FIND1: Contrary to this intuition, we show that a visual model (BiT-M) trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization (few-shot learning, unsupervised clustering) as its multimodal counterpart (CLIP).
FIND2: When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as “linguistic” as those of CLIP. CONCLUSION: Overall, these findings suggest that the main factor driving improvements of generalization in current models is the size of the training dataset, not (solely) the multimodal grounding property.
* [TransVG: End-to-End Visual Grounding with Transformers](https://arxiv.org/pdf/2104.08541.pdf) TASK: In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. METHOD: We propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we re-formulate the visual grounding as a direct coordinates regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). (Wengang Zhou, Houqiang Li)
* [Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training](https://arxiv.org/pdf/2104.09411.pdf) (Houqiang Li) TODO
* [Visual Transformer Pruning](https://arxiv.org/pdf/2104.08500.pdf) The pipeline for visual transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning channels; 3) finetuning. (Yunhe Wang) CVPR21
* [Temporal Query Networks for Fine-grained Video Understanding](https://arxiv.org/pdf/2104.09496.pdf) Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
(Andrew Zisserman)
* [One More Check: Making “Fake Background” Be Tracked Again](https://arxiv.org/pdf/2104.09441.pdf) Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet will no longer be maintained, as shown in Fig. 1. In this paper, we set out to restore the misclassified bounding boxes, i.e., fake background, by proposing a re-check network. Good title, nice story. [code](https://github.com/JudasDie/SOTS)
* [Cross-Domain Adaptive Clustering for Semi-Supervised Domain Adaptation](https://arxiv.org/pdf/2104.09415.pdf) TASK: In semi-supervised domain adaptation, a few labeled samples per class in the target domain guide features of the remaining target samples to aggregate around them. PROBLEM: However, the trained model cannot produce a highly discriminative feature representation for the target domain because the training data is dominated by labeled samples from the source domain. This could lead to disconnection between the labeled and unlabeled target samples as well as misalignment between unlabeled target samples and the source domain. (the problem statement is worth learning from)
* [Contrastive Learning for Compact Single Image Dehazing](https://arxiv.org/pdf/2104.09367.pdf) Uses contrastive learning for image dehazing. PROBLEM: (1) existing deep learning based dehazing methods only adopt clear images as positive samples to guide the training of dehazing network while negative information is unexploited; (2) most of them focus on strengthening the dehazing network with an increase of depth and width, leading to a significant requirement of computation and memory. METHOD: (1) we propose a novel contrastive regularization (CR) built upon contrastive learning to exploit both the information of hazy images and clear images as negative and positive samples, respectively; (2) we develop a compact dehazing network based on autoencoder-like (AE) framework.
It involves an adaptive mixup operation and a dynamic feature enhancement module. [code](https://github.com/GlassyWu/AECR-Net)
* [Multi-person Implicit Reconstruction from a Single Image](https://arxiv.org/pdf/2104.09283.pdf)
* [Multi-Modal Fusion Transformer for End-to-End Autonomous Driving](https://arxiv.org/pdf/2104.09224.pdf) Funny
* [Surrogate Gradient Field for Latent Space Manipulation](https://arxiv.org/pdf/2104.09065.pdf)
* [Distilling Knowledge via Knowledge Review](https://arxiv.org/pdf/2104.09044.pdf) (Jiaya Jia)
* [Self-Supervised Pillar Motion Learning for Autonomous Driving](https://arxiv.org/pdf/2104.08683.pdf) Autonomous driving can benefit from motion behavior comprehension when interacting with diverse traffic participants in highly dynamic environments. To this end, we propose a learning framework that leverages free supervisory signals from point clouds and paired camera images to estimate motion purely via self-supervision.
* [RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features](https://arxiv.org/pdf/2104.08569.pdf) PROBLEM: However, the segmented masks are still very coarse due to the downsampling operations in both the feature pyramid and the instance-wise pooling process, especially for large objects. METHOD: In this work, we propose a new method called RefineMask for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features during the instance-wise segmenting process in a multi-stage manner. RESULT: Without bells and whistles, RefineMask yields significant gains of 2.6, 3.4, 3.8 AP over Mask R-CNN on COCO, LVIS, and Cityscapes benchmarks respectively at a small amount of additional computational cost.
* [Few-Shot Model Adaptation for Customized Facial Landmark Detection, Segmentation, Stylization and Shadow Removal](https://arxiv.org/pdf/2104.09457.pdf)
* [Learning To Count Everything](https://arxiv.org/pdf/2104.08391.pdf) Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is to count objects from any category given only a few annotated instances from that category. TODO
* [Attention in Attention Network for Image Super-Resolution](https://arxiv.org/pdf/2104.09497.pdf) In this work, we attempt to quantify and visualize the static attention mechanisms and show that not all attention modules are equally beneficial. We then propose attention in attention network (A2N) for highly accurate image SR. Specifically, our A2N consists of a non-attention branch and a coupling attention branch.

#### 20210419

CVPR21:

* [Fusing the Old with the New: Learning Relative Camera Pose with Geometry-Guided Uncertainty](https://arxiv.org/pdf/2104.08278.pdf)
* [Divide-and-Conquer for Lane-Aware Diverse Trajectory Prediction](https://arxiv.org/pdf/2104.08277.pdf) Our work addresses two key challenges in trajectory prediction, learning multimodal outputs, and better predictions by imposing constraints using driving knowledge.
* [Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos](https://arxiv.org/pdf/2104.07905.pdf) We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Funny

Others:

* [Deep Stable Learning for Out-Of-Distribution Generalization](https://arxiv.org/pdf/2104.07876.pdf) PROBLEM SETTING: explores a more open domain adaptation problem. Conventional methods assume either the known heterogeneity of training data (e.g.
domain labels) or the approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. Adopts a causality-like solution: We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels.
* [“BNN - BN = ?”: Training Binary Neural Networks without Batch Normalization](https://arxiv.org/pdf/2104.08215.pdf) PROBLEM: However, the BN layer is costly to calculate and is typically implemented with non-binary parameters, leaving a hurdle for the efficient implementation of BNN training. It also introduces undesirable dependence between samples within each batch. WORK: Inspired by the latest advance on Batch Normalization Free (BN-Free) training [7], we extend their framework to training BNNs, and for the first time demonstrate that BNs can be completely removed from BNN training and inference regimes. (Zhangyang Wang) (CVPRW)
* [Dual Contrastive Learning for Unsupervised Image-to-Image Translation](https://arxiv.org/pdf/2104.07689.pdf) BACKGROUND: Contrastive learning for Unpaired image-to-image Translation (CUT) yields state-of-the-art results in modeling unsupervised image-to-image translation by maximizing mutual information between input and output patches using only one encoder for both domains. CONTRIBUTION: In this paper, we propose a novel method based on contrastive learning and a dual learning setting (exploiting two encoders) to infer an efficient mapping between unpaired data. Additionally, while CUT suffers from mode collapse, a variant of our method efficiently addresses this issue.
* [Contrastive Learning with Stronger Augmentations](https://arxiv.org/pdf/2104.07713.pdf) PROBLEM with existing contrastive learning: However, those carefully designed transformations limited us to further explore the novel patterns exposed by other transformations.
Meanwhile, as found in our experiments, the strong augmentations distorted the images’ structures, resulting in difficult retrieval. METHOD: Thus, we propose a general framework called Contrastive Learning with Stronger Augmentations (CLSA) to complement current contrastive learning approaches. Here, the distribution divergence between the weakly and strongly augmented images over the representation bank is adopted to supervise the retrieval of strongly augmented queries from a pool of instances. (Guojun Qi)
* [Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment](https://arxiv.org/pdf/2104.07719.pdf) We propose a meta-learning based few-shot object detection method by transferring meta-knowledge learned from data-abundant base classes to data-scarce novel classes. To improve proposal generation for few-shot novel classes, we propose to learn a lightweight matching network to measure the similarity between each spatial position in the query image feature map and spatially-pooled class features, instead of the traditional object/non-object classifier, thus generating category-specific proposals and improving proposal recall for novel classes. (Shih-Fu Chang)
* [Pareto Self-Supervised Training for Few-Shot Learning](https://arxiv.org/pdf/2104.07841.pdf) Explores combining few-shot learning with self-supervised learning. PROBLEM: Previous works benefit from sharing inductive bias between the main task (FSL) and auxiliary tasks (SSL), where the shared parameters of tasks are optimized by minimizing a linear combination of task losses. However, it is challenging to select a proper weight to balance tasks and reduce task conflict. METHOD: To handle the problem as a whole, we propose a novel approach named as Pareto self-supervised training (PSST) for FSL.
PSST explicitly decomposes the few-shot auxiliary problem into multiple constrained multi-objective subproblems with different trade-off preferences, and here a preference region in which the main task achieves the best performance is identified. Then, an effective preferred Pareto exploration is proposed to find a set of optimal solutions in such a preference region.
* [Weakly Supervised Object Localization and Detection: A Survey](https://arxiv.org/pdf/2104.07918.pdf) (Ming-Hsuan Yang)
* [Self-supervised Video Retrieval Transformer Network](https://arxiv.org/pdf/2104.07993.pdf) TASK and applications: Content-based video retrieval aims to find videos from a large video database that are similar to or even near-duplicate of a given query video. It plays an important role in many video related applications, including copyright protection, recommendation, filtering, etc. METHOD: We propose a novel video retrieval system, termed SVRTN. It first applies self-supervised training to effectively learn video representation from unlabeled data to avoid the expensive cost of manual annotation. Then, it exploits transformer structure to aggregate frame-level features into clip-level to reduce both storage space and search complexity. It can learn the complementary and discriminative information from the interactions among clip frames, as well as acquire the frame permutation and missing invariant ability to support more flexible retrieval manners.
* [Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos](https://arxiv.org/pdf/2104.08241.pdf) The key factor for video person re-identification is to effectively exploit both spatial and temporal clues from video sequences. In this work, we propose a novel Spatial-Temporal Correlation and Topology Learning framework (CTL) to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
(Jiawei Liu, Zheng-Jun Zha, Kecheng Zheng)

#### 20210405

CVPR21:

* [Group Collaborative Learning for Co-Salient Object Detection](https://arxiv.org/pdf/2104.01108.pdf) [code](https://github.com/fanq15/GCoNet) (Deng-Ping Fan, Ling Shao)
* [MOST: A Multi-Oriented Scene Text Detector with Localization Refinement](https://arxiv.org/pdf/2104.01070.pdf) (Xiang Bai)
* [Visual Semantic Role Labeling for Video Understanding](https://arxiv.org/pdf/2104.00990.pdf)
* [UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles](https://arxiv.org/pdf/2104.00946.pdf)
* [Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning](https://arxiv.org/pdf/2104.00924.pdf)
* [Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation](https://arxiv.org/pdf/2104.00905.pdf)
* [Network Quantization with Element-wise Gradient Scaling](https://arxiv.org/pdf/2104.00903.pdf)
* [HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection](https://arxiv.org/pdf/2104.00902.pdf)
* [Adaptive Class Suppression Loss for Long-Tail Object Detection](https://arxiv.org/pdf/2104.00885.pdf)
* [S2R-DepthNet: Learning a Generalizable Depth-specific Structural Representation](https://arxiv.org/pdf/2104.00877.pdf) (Xuejin Chen, Wenjun Zeng)
* [Self-supervised Video Representation Learning by Context and Motion Decoupling](https://arxiv.org/pdf/2104.00862.pdf)
* [Fully Understanding Generic Objects: Modeling, Segmentation, and Reconstruction](https://arxiv.org/pdf/2104.00858.pdf)
* [Towards High Fidelity Face Relighting with Realistic Shadows](https://arxiv.org/pdf/2104.00825.pdf)
* [Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation](https://arxiv.org/pdf/2104.00808.pdf)
* [FESTA: Flow Estimation via Spatial-Temporal Attention for Scene Point Clouds](https://arxiv.org/pdf/2104.00798.pdf)

Vision Transformer:

* [LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference](https://arxiv.org/pdf/2104.01136.pdf) Frames the combination of CNNs and ViT as a speed-accuracy trade-off story, and proposes the attention bias, a new way to integrate positional information in vision transformers: We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. For example, at 80% ImageNet top-1 accuracy, LeViT is 3.3 times faster than EfficientNet on the CPU.
* [Language-based Video Editing via Multi-Modal Multi-Level Transformer](https://arxiv.org/pdf/2104.01122.pdf)
* [AAformer: Auto-Aligned Transformer for Person Re-Identification](https://arxiv.org/pdf/2104.00921.pdf)
* [TubeR: Tube-Transformer for Action Detection](https://arxiv.org/pdf/2104.00969.pdf)
* [TFill: Image Completion via a Transformer-Based Architecture](https://arxiv.org/pdf/2104.00845.pdf) [code](https://github.com/lyndonzheng/TFill) (Jianfei Cai)
* [VisQA: X-raying Vision and Language Reasoning in Transformers](https://arxiv.org/pdf/2104.00926.pdf)

Others:

* [Scene Graphs: A Survey of Generations and Applications](https://arxiv.org/pdf/2104.01111.pdf)

#### 20210402

TOP:

* [Group-Free 3D Object Detection via Transformers](https://arxiv.org/pdf/2104.00678.pdf) In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers, where the contribution of each point is automatically learned in the network training.
[code](https://github.com/zeliu98/Group-Free-3D) (Ze Liu, Yue Cao, Han Hu)
* [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/pdf/2104.00298.pdf) Considers training efficiency. (1) To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. (2) we propose an improved method of progressive learning, which adaptively adjusts regularization (e.g., dropout and data augmentation) along with image size. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. [code](https://github.com/google/automl/efficientnetv2) (Mingxing Tan, Quoc V. Le)
* [UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training](https://arxiv.org/pdf/2104.00332.pdf) To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. (1) augment existing English-only datasets with other languages via machine translation (MT) (2) shared visual context (i.e., using image as pivot) (3) To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data.
* [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https://arxiv.org/pdf/2104.00650.pdf) Our objective in this work is video-text retrieval – in particular a joint embedding that enables efficient text-to-video retrieval.
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. It is trained with a curriculum learning schedule that begins by treating images as ‘frozen’ snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. (Andrew Zisserman)
* [Jigsaw Clustering for Unsupervised Visual Representation Learning](https://arxiv.org/pdf/2104.00323.pdf) Interesting pretext task design. We propose a new jigsaw clustering pretext task in this paper, which only needs to forward each training batch itself, and reduces the training cost. Our method makes use of information from both intra- and inter-images, and outperforms previous single-batch based ones by a large margin. It is even comparable to the contrastive learning methods when only half of training batches are used. Our method indicates that multiple batches during training are not necessary, and opens the door for future research of single-batch unsupervised methods. [code](https://github.com/Jia-Research-Lab/JigsawClustering) (Jiaya Jia, CVPR21)
* [Unsupervised Sound Localization via Iterative Contrastive Learning](https://arxiv.org/pdf/2104.00315.pdf) Sound localization aims to find the source of the audio signal in the visual scene. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the 1) localization results in images predicted in the previous iteration, and 2) semantic relationships inferred from the audio signals as the pseudo-labels. Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. (How can we be sure that iterating on pseudo-labels makes the results better rather than worse?)
(Ming-Hsuan Yang)
* [In&Out : Diverse Image Outpainting via GAN Inversion](https://arxiv.org/pdf/2104.00675.pdf) GAN inversion is gradually becoming a mainstream direction in GAN research; this paper uses GAN inversion for image outpainting. Image outpainting seeks for a semantically consistent extension of the input image beyond its available content. In this work, we formulate the problem from the perspective of inverting generative adversarial networks. Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. [code](https://github.com/yccyenchicheng/InOut) (Ming-Hsuan Yang)

CVPR21:

* [Online Multiple Object Tracking with Cross-Task Synergy](https://arxiv.org/pdf/2104.00380.pdf) [code](https://github.com/songguocode/TADAM) (Dacheng Tao)
* [Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition](https://arxiv.org/pdf/2104.00232.pdf) (Tao Mei)
* [Divergence Optimization for Noisy Universal Domain Adaptation](https://arxiv.org/pdf/2104.00246.pdf)
* [FAPIS: A Few-shot Anchor-free Part-based Instance Segmenter](https://arxiv.org/pdf/2104.00073.pdf)
* [Self-supervised Motion Learning from Static Images](https://arxiv.org/pdf/2104.00240.pdf)
* [Learning to Track Instances without Video Annotations](https://arxiv.org/pdf/2104.00287.pdf) Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches.
This paper uses self-supervision to build a single-stage approach with only a labeled image dataset and unlabeled video sequences.
* [Improving Calibration for Long-Tailed Recognition](https://arxiv.org/pdf/2104.00466.pdf) [code](https://github.com/Jia-Research-Lab/MiSLAS) (Jiaya Jia)
* [Towards Evaluating and Training Verifiably Robust Neural Networks](https://arxiv.org/pdf/2104.00447.pdf) (Dahua Lin)
* [One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking](https://arxiv.org/pdf/2104.00597.pdf) (Jianlong Fu)
* [Unsupervised Degradation Representation Learning for Blind Super-Resolution](https://arxiv.org/pdf/2104.00416.pdf) Funny. Builds images with different degrees of degradation for contrastive learning. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. [code](https://github.com/LongguangWang/DASR)
* [Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation](https://arxiv.org/pdf/2104.00308.pdf) PROBLEM: long-tailed class distribution and large intra-class variation. To address these issues, we introduce a novel confidence-aware bipartite graph neural network with adaptive message propagation mechanism for unbiased scene graph generation. In addition, we propose an efficient bi-level data resampling strategy to alleviate the imbalanced data distribution problem in training our graph network.
* [A Realistic Evaluation of Semi-Supervised Learning for Fine-Grained Classification](https://arxiv.org/pdf/2104.00679.pdf)
* [RGB-D Local Implicit Function for Depth Completion of Transparent Objects](https://arxiv.org/pdf/2104.00622.pdf)
* [SimPoE: Simulated Character Control for 3D Human Pose Estimation](https://arxiv.org/pdf/2104.00683.pdf)
* [NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video](https://arxiv.org/pdf/2104.00681.pdf)
* [PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting](https://arxiv.org/pdf/2104.00674.pdf)
* [LED2-Net: Monocular 360° Layout Estimation via Differentiable Depth Rendering](https://arxiv.org/pdf/2104.00568.pdf) Towards reconstructing the room layout in 3D, we formulate the task of 360° layout estimation as a problem of predicting depth on the horizon line of a panorama.
* [Reconstructing 3D Human Pose by Watching Humans in the Mirror](https://arxiv.org/pdf/2104.00340.pdf) In this paper, we introduce the new task of reconstructing 3D human pose from a single image in which we can see the person and the person’s image through a mirror. [code](https://github.com/zju3dv/Mirrored-Human)
* [Wide-Depth-Range 6D Object Pose Estimation in Space](https://arxiv.org/pdf/2104.00337.pdf) Interesting application. [code](https://github.com/cvlab-epfl/wide-depth-range-pose)
* [Fostering Generalization in Single-view 3D Reconstruction by Learning a Hierarchy of Local and Global Shape Priors](https://arxiv.org/pdf/2104.00476.pdf)
* [Deep Two-View Structure-from-Motion Revisited](https://arxiv.org/pdf/2104.00556.pdf)

Vision Transformer:

* [Group-Free 3D Object Detection via Transformers](https://arxiv.org/pdf/2104.00678.pdf) In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud.
Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers, where the contribution of each point is automatically learned in the network training. [code](https://github.com/zeliu98/Group-Free-3D) (Ze Liu, Yue Cao, Han Hu)
* [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https://arxiv.org/pdf/2104.00650.pdf)
* [Spatial-Temporal Graph Transformer for Multiple Object Tracking](https://arxiv.org/pdf/2104.00194.pdf) Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named Spatial-Temporal Graph Transformer (STGT), which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects.
* [Latent Variable Nested Set Transformers & AutoBots](https://arxiv.org/pdf/2104.00563.pdf) We validate the Nested Set Transformer for autonomous driving settings which we refer to as (“AutoBot”), where we model the trajectory of an ego-agent based on the sequential observations of key attributes of multiple agents in a scene.
* [LoFTR: Detector-Free Local Feature Matching with Transformers](https://arxiv.org/pdf/2104.00680.pdf) (CVPR21)
* [Mesh Graphormer](https://arxiv.org/pdf/2104.00272.pdf)

Others:

* [The surprising impact of mask-head architecture on novel class segmentation](https://arxiv.org/pdf/2104.00613.pdf) We address the partially supervised instance segmentation problem in which one can train on (significantly cheaper) bounding boxes for all categories but use masks only for a subset of categories. [code](https://google.github.io/deepmac/)
* [In&Out : Diverse Image Outpainting via GAN Inversion](https://arxiv.org/pdf/2104.00675.pdf) GAN inversion is gradually becoming a mainstream direction in GAN research; this paper uses GAN inversion for image outpainting.
Image outpainting seeks for a semantically consistent extension of the input image beyond its available content. In this work, we formulate the problem from the perspective of inverting generative adversarial networks. Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. [code](https://github.com/yccyenchicheng/InOut) (Ming-Hsuan Yang)
* [Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study](https://arxiv.org/pdf/2104.00676.pdf) (ICLR21)
* [CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning](https://arxiv.org/pdf/2104.00285.pdf)
* [Composable Augmentation Encoding for Video Representation Learning](https://arxiv.org/pdf/2104.00616.pdf) To overcome this limitation, we propose an ‘augmentation aware’ contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning.
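
A note on the CATE entry above: the quoted abstract says augmentation parameters (for example time shifts) are given to the model when projecting video representations. The sketch below is only one simplified reading of that sentence, not the paper's architecture. The names `encode_time_shift`, `compose_encodings`, `project`, the `max_shift` scaling, the flip flag, and the weights are all hypothetical.

```python
from typing import List, Sequence


def encode_time_shift(shift_seconds: float, max_shift: float = 10.0) -> List[float]:
    """Hypothetical encoding: the time shift scaled by an assumed maximum."""
    return [shift_seconds / max_shift]


def compose_encodings(encodings: Sequence[Sequence[float]]) -> List[float]:
    """Compose per-augmentation encodings by concatenating them in order."""
    composed: List[float] = []
    for enc in encodings:
        composed.extend(enc)
    return composed


def project(features: Sequence[float],
            aug_encoding: Sequence[float],
            weights: Sequence[Sequence[float]]) -> List[float]:
    """Linear projection head applied to [features ; augmentation encoding]."""
    x = list(features) + list(aug_encoding)
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Toy video representation (assumed 2-dimensional for illustration).
video_features = [0.5, -1.0]
# A time shift of 2 seconds plus a hypothetical horizontal-flip flag.
aug = compose_encodings([encode_time_shift(2.0), [1.0]])
# Assumed projection weights: 2 outputs over 2 features + 2 encoding values.
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
embedding = project(video_features, aug, weights)
print(embedding)
```

The point of the sketch is that the projection sees how each view was created, so the embedding used in the contrastive loss can depend on the augmentation rather than being forced to ignore it.
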
* [Text to Image Generation with Semantic-Spatial Aware GAN](https://arxiv.org/pdf/2104.00567.pdf)
* [Linear Semantics in Generative Adversarial Networks](https://arxiv.org/pdf/2104.00487.pdf)
* [Unsupervised Foreground-Background Segmentation with Equivariant Layered GANs](https://arxiv.org/pdf/2104.00483.pdf)
* [Improved Image Generation via Sparse Modeling](https://arxiv.org/pdf/2104.00464.pdf)
* [Exploiting Relationship for Complex-scene Image Generation](https://arxiv.org/pdf/2104.00356.pdf) (Tao Mei)
* [MeanShift++: Extremely Fast Mode-Seeking With Applications to Segmentation and Object Tracking](https://arxiv.org/pdf/2104.00303.pdf)
* [SCALoss: Side and Corner Aligned Loss for Bounding Box Regression](https://arxiv.org/pdf/2104.00462.pdf) IoU-based loss has the gradient vanish problem in the case of low overlapping bounding boxes, and the model could easily ignore these simple cases. In this paper, we propose Side Overlap (SO) loss by maximizing the side overlap of two bounding boxes, which puts more penalty for low overlapping bounding box cases.
* [Anchor Pruning for Object Detection](https://arxiv.org/pdf/2104.00432.pdf) This paper proposes anchor pruning for object detection in one-stage anchor-based detectors. In this work, we show that many anchors in the object detection head can be removed without any loss in accuracy. With additional retraining, anchor pruning can even lead to improved accuracy. Does not cite DETR or Sparse R-CNN. (Deng Cai)
* [Modular Adaptation for Cross-Domain Few-Shot Learning](https://arxiv.org/pdf/2104.00619.pdf)
* [A Survey on Natural Language Video Localization](https://arxiv.org/pdf/2104.00234.pdf)

#### 20210401

TOP:

* [Going deeper with Image Transformers](https://arxiv.org/pdf/2103.17239.pdf) However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification.
This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.3% top-1 accuracy on Imagenet when training with no external data. (Facebook, DeiT team)
* [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/pdf/2103.17249.pdf) However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. [code](https://github.com/orpatashnik/StyleCLIP)
* [PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering](https://arxiv.org/pdf/2103.17070.pdf)

CVPR21:

* [Scale-aware Automatic Augmentation for Object Detection](https://arxiv.org/pdf/2103.17220.pdf) [code](https://github.com/Jia-Research-Lab/SA-AutoAug) (Jiaya Jia)
* [Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark](https://arxiv.org/pdf/2103.16746.pdf) Tracking by natural language specification is a new rising research topic that aims at locating the target object in the video sequence based on its language description. In this work, we propose a new benchmark specifically dedicated to the tracking-by-language, including a large scale dataset, strong and diverse baseline methods. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch.
(Feng Wu)
* [SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification](https://arxiv.org/pdf/2103.16725.pdf)
* [Denoise and Contrast for Category Agnostic Shape Completion](https://arxiv.org/pdf/2103.16671.pdf)
* [DAP: Detection-Aware Pre-training with Weak Supervision](https://arxiv.org/pdf/2103.16651.pdf) we transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps to directly pre-train a detector, making the pre-trained model location-aware and capable of predicting bounding boxes.
* [Unsupervised Disentanglement of Linear-Encoded Facial Semantics](https://arxiv.org/pdf/2103.16605.pdf)
* [ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows](https://arxiv.org/pdf/2103.16877.pdf) (Jiebo Luo)
* [Online Learning of a Probabilistic and Adaptive Scene Representation](https://arxiv.org/pdf/2103.16832.pdf) (Hongbin Zha)
* [Convolutional Hough Matching Networks](https://arxiv.org/pdf/2103.16831.pdf)
* [Rectification-based Knowledge Retention for Continual Learning](https://arxiv.org/pdf/2103.16597.pdf)
* [Learning Scalable l∞-constrained Near-lossless Image Compression via Joint Lossy Image and Residual Compression](https://arxiv.org/pdf/2103.17015.pdf)
* [Mask-ToF: Learning Microlens Masks for Flying Pixel Correction in Time-of-Flight Imaging](https://arxiv.org/pdf/2103.16693.pdf)
* [Neural Response Interpretation through the Lens of Critical Pathways](https://arxiv.org/pdf/2103.16886.pdf) (VGG)
* [Prototypical Cross-domain Self-supervised Learning for Few-shot Unsupervised Domain Adaptation](https://arxiv.org/pdf/2103.16765.pdf)
* [Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection](https://arxiv.org/pdf/2103.17115.pdf)
* [ReMix: Towards Image-to-Image Translation with Limited Data](https://arxiv.org/pdf/2103.16835.pdf)
* [DER: Dynamically Expandable Representation for Class Incremental Learning](https://arxiv.org/pdf/2103.16788.pdf)
* [GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection](https://arxiv.org/pdf/2103.17202.pdf)
* [A Closer Look at Fourier Spectrum Discrepancies for CNN-generated Images Detection](https://arxiv.org/pdf/2103.17195.pdf)
* [Semi-supervised Synthesis of High-Resolution Editable Textures for 3D Humans](https://arxiv.org/pdf/2103.17266.pdf)
* [VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization](https://arxiv.org/pdf/2103.16874.pdf) While an increasing number of studies have been conducted, the resolution of synthesized images is still limited to low (e.g., 256×192), which acts as the critical limitation against satisfying online consumers. To address the challenges, we propose a novel virtual try-on method called VITON-HD that successfully synthesizes 1024×768 virtual try-on images.
* [Learning Camera Localization via Dense Scene Matching](https://arxiv.org/pdf/2103.16792.pdf)
* [Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding](https://arxiv.org/pdf/2103.16848.pdf)
* [Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors](https://arxiv.org/pdf/2103.17265.pdf) We introduce (HPS) Human POSEitioning System, a method to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment using wearable sensors.
* [Learning by Aligning Videos in Time](https://arxiv.org/pdf/2103.17260.pdf)
* [Dogfight: Detecting Drones from Drones Videos](https://arxiv.org/pdf/2103.17242.pdf)
* [Rainbow Memory: Continual Learning with a Memory of Diverse Samples](https://arxiv.org/pdf/2103.17230.pdf)
* [Layout-Guided Novel View Synthesis from a Single Indoor Panorama](https://arxiv.org/pdf/2103.17022.pdf)

Vision Transformer:

* [Going deeper with Image Transformers](https://arxiv.org/pdf/2103.17239.pdf) However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.3% top-1 accuracy on Imagenet when training with no external data. (Facebook, DeiT team)
* [Learning Spatio-Temporal Transformer for Visual Tracking](https://arxiv.org/pdf/2103.17154.pdf) The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. The whole method is end-to-end, does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. [code](https://github.com/researchmm/Stark) (Jianlong Fu, Huchuan Lu)
* [Robust Facial Expression Recognition with Convolutional Visual Transformers](https://arxiv.org/pdf/2103.16854.pdf) Different from previous pure CNNs based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective. (Shutao Li)
* [DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention](https://arxiv.org/pdf/2103.17084.pdf) Domain-adaptive object detection based on Deformable DETR.

## NLP (Weekly)