# ml-code-switched-speech-translation

**Repository Path**: mirrors_apple/ml-code-switched-speech-translation

## Basic Information

- **Project Name**: ml-code-switched-speech-translation
- **Description**: This repository contains the code and instructions needed to reproduce the dataset splits for our paper "Speech Translation for Code-Switched Speech".
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-04-10
- **Last Updated**: 2026-02-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Overview
This repository contains the code and instructions needed to reproduce the dataset splits for ["Speech Translation for Code-Switched Speech"](LINK_TODO).

You can create both datasets by running `bash create_datasets.sh`, following the steps in the [Instructions Section](#Instructions). The `fisher` and `miami` directories contain the scripts for each dataset that are used by `bash create_datasets.sh`.

A mapping between the original data and the new code-switched and monolingual splits used in the paper can be found in `mapping_files`. Note that running `bash create_datasets.sh` will create these mappings.

## Instructions
0. Install the prerequisite tools for Linux/macOS. These include `ffmpeg`, `sox`, `wget`, and `python` (e.g. `apt-get install sox`).
1. Run `pip install -r requirement.txt` to set up the Python environment.
2. Collect the data needed for the Fisher corpus ([LDC2010T04](https://catalog.ldc.upenn.edu/LDC2010T04) and [LDC2010S01](https://catalog.ldc.upenn.edu/LDC2010S01)) and export their locations as environment variables: `export LDC2010S01={path_to_LDC2010S01}` and `export LDC2010T04={path_to_LDC2010T04}/fisher_spa_tr`.
3. Run `bash create_datasets.sh` to generate both the Miami and Fisher datasets.
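
Before starting the build, it can help to confirm that the prerequisites from steps 0 and 2 are in place. The following is a minimal sketch and not part of this repository: the helper names `missing_tools` and `missing_env_dirs` are hypothetical, and the check assumes that both `LDC2010S01` and `LDC2010T04` should point to existing directories.

```python
import os
import shutil

# Tools and environment variables named in the instructions above.
REQUIRED_TOOLS = ("ffmpeg", "sox", "wget", "python")
REQUIRED_ENV_VARS = ("LDC2010S01", "LDC2010T04")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_env_dirs(names=REQUIRED_ENV_VARS, environ=None):
    """Return variables that are unset or do not point to an existing directory.

    Assumption: each variable is expected to hold a directory path.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not os.path.isdir(environ.get(name, ""))]


if __name__ == "__main__":
    problems = missing_tools() + missing_env_dirs()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites found; ready to run create_datasets.sh")
```

Run it in the same shell session in which you exported the two variables, since exported values are only visible to processes started from that session.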


## Example

Example utterance:
- (Audio clip)
- Transcript (code-switched): *y ti bueno tiene dos papás **which can be a little can be a little challenging**.*
- Translation (English only): *and she has two fathers which can be a little, can be a little challenging.*

The data files are composed of three parts:
1. The transcript for the dataset split (in `{dataset_name}.translation`)
2. The translation for the dataset split (in `{dataset_name}.translation`)
3. The audio for the dataset split (in `{dataset_name}.yaml` and `{dataset_name}/clips/*.wav` or `{dataset_name}/clips.zip`)
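
As an illustration of how the text files might be consumed, the sketch below pairs each transcript line with its translation. It is not code from this repository: `read_parallel` is a hypothetical helper, and it assumes that both files are UTF-8 text with one utterance per line in the same order. Check the generated files to confirm this holds, and pass in the actual paths produced by `create_datasets.sh`.

```python
from pathlib import Path


def read_parallel(transcript_path, translation_path):
    """Pair each transcript line with the translation line at the same index.

    Assumption: the two files are line-aligned, one utterance per line.
    """
    src = Path(transcript_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(translation_path).read_text(encoding="utf-8").splitlines()
    if len(src) != len(tgt):
        # A mismatch suggests the files are not line-aligned after all.
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return list(zip(src, tgt))
```

Raising an error on a line-count mismatch is a deliberate choice here: silently truncating to the shorter file would hide a misaligned split.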

## Citation
If you found this repository helpful in your research, please consider citing:
```
Orion Weller, Matthias Sperber, Telmo Pessoa Pires, Hendra Setiawan, Christian Gollan, Dominic Telaar, Matthias Paulik: End-to-End Speech Translation for Code Switched Speech (Findings of the Association for Computational Linguistics: ACL 2022)
```