# vertex-parameter-server-training-demo

**Repository Path**: mirrors_GoogleCloudPlatform/vertex-parameter-server-training-demo

## Basic Information

- **Project Name**: vertex-parameter-server-training-demo
- **License**: Apache-2.0
- **Default Branch**: main
- **Created**: 2024-07-16
- **Last Updated**: 2026-03-21

## README

# Vertex AI Custom Training with TensorFlow ParameterServerStrategy demo

This project demonstrates how to run asynchronous model training on Cloud Vertex AI with TensorFlow ParameterServerStrategy.

## Folders

- `scripts`: Scripts to build and push Docker images, and run model training
- `trainer`: The model training application

## Prerequisites

- Python 3
- Cloud SDK (i.e. `gcloud`)

## Usage

### Configure GCP project

Make sure you have a Google Cloud project and have authenticated your credentials.

```
gcloud config set project <PROJECT_ID>
gcloud auth login
```

You will need to create at least one bucket to store the data and the models. Separate buckets for datasets and models are preferable. For example:

```
gs://my-gcp-project-models
gs://my-gcp-project-datasets
```

### Prepare the dataset

The dataset you will use is `horses_or_humans` from the TensorFlow Datasets (TFDS) catalog. If you prefer, you can use another image classification dataset from the same catalog.

Run the following script to upload the dataset to your GCS bucket:

```
python scripts/prepare_dataset.py \
  horses_or_humans \
  gs://my-gcp-project-datasets
```

Check that the dataset was uploaded:

```
gsutil ls gs://my-gcp-project-datasets/horses_or_humans
```

The above path may contain a directory specifying the dataset version, e.g. `3.0.0`. Under that directory should be the data files in TFRecord format, e.g.
`horses_or_humans-train.tfrecord*`.

### Train the model locally

Before running model training on Vertex AI, it may be helpful to check that training works on your local machine.

Define the `MODELS_BUCKET` and `DATASETS_BUCKET` environment variables to specify the GCS buckets for your models and datasets respectively. For example:

```
export MODELS_BUCKET='gs://my-gcp-project-models'
export DATASETS_BUCKET='gs://my-gcp-project-datasets'
```

The following script runs a local cluster with TensorFlow ParameterServerStrategy. The different task servers (i.e. chief, worker, ps) run as separate processes on the local machine.

```bash
bash scripts/local_train.sh
```

The training logic can be found in `trainer/task.py`.

### Train the model in Vertex AI

#### Build the trainer image

You need to package the model training application as a Docker image and upload it to Artifact Registry. Make sure that the Artifact Registry API is enabled.

Run the following command to create a repository for Docker images.

```bash
gcloud artifacts repositories create vertex-pss-demo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Container image repository."
```

Run the following script to build and push the image to Artifact Registry.

```bash
bash scripts/build.sh
```

If you want to change the repository name and/or the location (region), make sure to also change the `AR_REPOSITORY` and `REGION` variables in the script respectively. Note that by default, the project ID and default region will be inferred from your environment.
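
For reference, Docker images in Artifact Registry are addressed as `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The minimal sketch below composes such a tag in Python, using the repository name and region from the command above. The helper name `artifact_registry_image_uri`, the image name `trainer`, the tag `latest`, and the project ID `my-gcp-project` are illustrative assumptions; the actual values are whatever `scripts/build.sh` uses.

```python
def artifact_registry_image_uri(
    project_id: str,
    region: str = "us-central1",          # location used when creating the repository
    repository: str = "vertex-pss-demo",  # repository name from the gcloud command
    image: str = "trainer",               # hypothetical image name (assumption)
    tag: str = "latest",                  # hypothetical tag (assumption)
) -> str:
    """Compose an Artifact Registry Docker image tag from its parts."""
    parts = {
        "project_id": project_id,
        "region": region,
        "repository": repository,
        "image": image,
        "tag": tag,
    }
    # Reject empty components early so a malformed tag is never produced.
    for name, value in parts.items():
        if not value:
            raise ValueError(f"{name} must not be empty")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


if __name__ == "__main__":
    # "my-gcp-project" is a placeholder project ID.
    print(artifact_registry_image_uri("my-gcp-project"))
```

If you change the repository name or region, the same change has to appear in every place this tag is used, which is why it can help to derive it from one set of values.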

#### Configure training parameters

Make a copy of `pss_config.yaml.template` and save it as `pss_config.yaml`. Make sure to update the following placeholders:

- `PROJECT`: Your GCP project ID
- `MODEL_BUCKET`: Name of the GCS bucket for models
- `DATASET_BUCKET`: Name of the GCS bucket for datasets

If you changed the region where your Docker repository was created, make sure to change that in the config file as well (default is `us-central1`).

Verify that the following fields are correct for your environment.

- `imageUri`: Docker image tag of your trainer image.
- `--model_dir`: App-specific flag indicating the GCS path where model checkpoints will be stored.
- `--train_pattern`: App-specific flag indicating the GCS path where the training dataset will be read from.
- `--val_pattern`: App-specific flag indicating the GCS path where the validation dataset will be read from.

#### Run training on Vertex

After setting up your `pss_config.yaml` file, run the following script to execute model training on Vertex AI:

```bash
bash scripts/vertex_train.sh
```

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details.

## License

Apache 2.0; see [`LICENSE`](LICENSE) for details.