We will be open sourcing a tool called FARSI (Facebook AR system investigator), a design space exploration framework. FARSI enables an agile and automated search of optimal hardware allocation and software-to-hardware mapping solutions.
A fast sketching-based solver for large-scale ridge regression
Code accompanying Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos (CVPR 2021)
Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality, and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final approach outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context.
This repo contains the data (question/answer pairs and their associated passages from Wikipedia) collected and used in the following paper: Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, Wen-tau Yih. "On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study." In ACL-2021.
Code for DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Code + pre-trained models for the paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Code to reproduce experiments in "Antipodes of Label Differential Privacy: PATE and ALIBI"
Recurring batch data pipelines are a staple of the modern enterprise-scale data warehouse. As a data warehouse scales to support more products and services, a growing number of interdependent pipelines running at various cadences can give rise to periodic resource bottlenecks for a cluster. This resource contention results in pipelines starting at unpredictable times each day and consequently variable landing times for the data artifacts they produce. The variability gets compounded by the dependency structure of the workload, and the resulting unpredictability can disrupt the project workstreams which consume this data. We present Clockwork, a delay-based global scheduling framework for data pipelines which improves landing time stability by spreading out tasks throughout the day. Whereas most scheduling algorithms optimize for makespan or average job completion times, Clockwork’s execution plan optimizes for stability in task completion times while also targeting predefined pipeline. Online experiments comparing this novel scheduling algorithm and a previously proposed greedy procrastinating heuristic show tasks complete almost an hour earlier on average, while exhibiting lower landing time variance and producing significantly less competition for resources in a target cluster.
Official repository for the paper "Instance-Conditioned GAN" by Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michał Drożdżal, Adriana Romero-Soriano.
IMGUR5K handwriting set. It is an in-the-wild handwriting dataset that contains challenging real-world handwritten samples from different writers. The dataset is shared as a set of image URLs with annotations. This code downloads the images and verifies the hash of each image to avoid data contamination.
Code for SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations
Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.
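The IMGUR5K entry above describes downloading images from URLs and checking each one against a stored hash. A minimal sketch of that idea follows. It is not the repository's actual code: the function names, the choice of SHA-256, and the timeout are all assumptions made for illustration, and the hash algorithm the dataset really uses may differ.

```python
import hashlib
import urllib.request
from typing import Optional


def file_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of raw image bytes.

    The algorithm is an assumption; use whichever one the annotations specify.
    """
    return hashlib.new(algorithm, data).hexdigest()


def verify_image(data: bytes, expected_hash: str, algorithm: str = "sha256") -> bool:
    """Check downloaded bytes against the expected hex digest (case-insensitive)."""
    return file_digest(data, algorithm) == expected_hash.lower()


def download_and_verify(
    url: str,
    expected_hash: str,
    algorithm: str = "sha256",
    timeout: float = 10.0,
) -> Optional[bytes]:
    """Fetch `url` and return its bytes only if the hash matches.

    Returns None on a mismatch so the caller can skip an image whose
    content has changed since the dataset was annotated.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    return data if verify_image(data, expected_hash, algorithm) else None
```

Rejecting mismatched files rather than keeping them is the design choice that matters here: an image replaced at the same URL would otherwise be paired with annotations that no longer describe it.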