# stable-retraining-conversational-agents

**Repository Path**: mirrors_google/stable-retraining-conversational-agents

## Basic Information

- **Project Name**: stable-retraining-conversational-agents
- **Default Branch**: main

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-07-24
- **Last Updated**: 2025-09-06

## README

# Random and Systematic Noisy Data for Stable Re-training of Conversational Agents

This repository contains the synthetic data for the paper:

*   Reducing Model Churn: Stable Re-training of Conversational Agents

The data is derived from the following datasets:

*   [Semantic Parsing for Task Oriented Dialog using Hierarchical
    Representations](https://research.fb.com/publications/semantic-parsing-for-task-oriented-dialog-using-hierarchical-representations/)
*   [Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing](https://fb.me/TOPv2Dataset)
*   [MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark](https://fb.me/mtop_dataset)
*   [Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces](https://github.com/sonos/nlu-benchmark)

## Data

The generated synthetic data is in the following directories, listed in the same order as the source datasets above:

    top_data/
    topv2_data/
    mtop_data/
    snips_data/
    
Each dataset directory has two subdirectories:

    swap_top/
    distant_top/

These subdirectories contain the random and systematic noisy datasets described in Section 6 of the paper.

Each dataset contains a train.tsv file with two columns: "query" and "label".
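
As a minimal sketch of how such a file might be read, the Python snippet below parses tab-separated rows into (query, label) pairs. It assumes that the query comes first and the label second, and that train.tsv sits directly inside a noise subdirectory (for example `top_data/swap_top/train.tsv`). Whether the files begin with a header row is not stated here, so the hypothetical `has_header` flag leaves that choice to the caller. The names `parse_rows` and `load_train` are illustrative and are not part of this repository.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Tuple


def parse_rows(lines: Iterable[str], has_header: bool = False) -> List[Tuple[str, str]]:
    """Parse tab-separated rows into (query, label) pairs.

    Assumption: column 0 is the query and column 1 is the label.
    """
    # QUOTE_NONE keeps any quote characters in the text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows: List[Tuple[str, str]] = []
    for index, row in enumerate(reader):
        if index == 0 and has_header:
            continue  # skip the header row when the caller says there is one
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)} at row {index}")
        rows.append((row[0], row[1]))
    return rows


def load_train(noise_dir: str, has_header: bool = False) -> List[Tuple[str, str]]:
    """Read train.tsv from a noise subdirectory (the path layout is an assumption)."""
    path = Path(noise_dir) / "train.tsv"
    with path.open(encoding="utf-8", newline="") as handle:
        return parse_rows(handle, has_header=has_header)
```

A call such as `load_train("top_data/swap_top")` would then return the training pairs as a list, provided the file follows the assumed layout.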

The dev and test sets are identical to those released with the original datasets and are therefore not included here.

The distant_top directories contain an additional file named distant_labeled_train.tsv, which holds the held-out 10\% of the training data that was labeled by a model.
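
The layout described above can be checked with a short script. The sketch below lists the files one would expect after cloning and reports any that are missing. It assumes that the subdirectory names `swap_top` and `distant_top` are used literally inside every dataset directory, as listed in this README. The helper names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path
from typing import List

# Directory names as listed in this README.
DATASET_DIRS = ("top_data", "topv2_data", "mtop_data", "snips_data")
NOISE_DIRS = ("swap_top", "distant_top")


def expected_files(root: str) -> List[Path]:
    """List the data files this README describes, relative to the repository root."""
    base = Path(root)
    files: List[Path] = []
    for dataset in DATASET_DIRS:
        for noise in NOISE_DIRS:
            files.append(base / dataset / noise / "train.tsv")
            if noise == "distant_top":
                # Only the distant_top directories carry the model-labeled file.
                files.append(base / dataset / noise / "distant_labeled_train.tsv")
    return files


def missing_files(root: str) -> List[Path]:
    """Return the expected files that do not exist under root."""
    return [path for path in expected_files(root) if not path.is_file()]


if __name__ == "__main__":
    for path in missing_files("."):
        print(f"missing: {path}")
```

Run from the repository root, the script prints nothing when every expected file is present.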