# Cog-vLLM: Run vLLM on Replicate

[Cog](https://github.com/replicate/cog) is an open-source tool that lets you package machine learning models in a standard, production-ready container. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving.

You can deploy your packaged model to your own infrastructure, or to [Replicate].

## Highlights

* 🚀 **Run vLLM in the cloud with an API**. Deploy any [vLLM-supported language model] at scale on Replicate.

* 🏭 **Support multiple concurrent requests**. Continuous batching works out of the box.

* 🐢 **Open Source, all the way down**. Look inside, take it apart, make it do exactly what you need.

## Quickstart

Go to [replicate.com/replicate/vllm](https://replicate.com/replicate/vllm) and create a new vLLM model from a [supported Hugging Face repo][vLLM-supported language model], such as [google/gemma-2b](https://huggingface.co/google/gemma-2b).

> [!IMPORTANT]
> Gated models require a [Hugging Face API token](https://huggingface.co/settings/tokens),
> which you can set in the `hf_token` field of the model creation form.

*[Screenshot: Create a new vLLM model on Replicate]*

Replicate downloads the model files, packages them into a `.tar` archive, and pushes a new version of your model that's ready to use.

*[Screenshot: Trained vLLM model on Replicate]*

From here, you can either use your model as-is, or customize it and push up your changes.

## Local Development

If you're on a machine or VM with a GPU, you can try out changes before pushing them to Replicate.

Start by [installing or upgrading Cog](https://cog.run/#install). You'll need Cog [v0.10.0-alpha11](https://github.com/replicate/cog/releases/tag/v0.10.0-alpha11):

```console
$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.10.0-alpha11/cog_$(uname -s)_$(uname -m)"
$ sudo chmod +x /usr/local/bin/cog
```

Then clone this repository:

```console
$ git clone https://github.com/replicate/cog-vllm
$ cd cog-vllm
```

Go to the [Replicate dashboard](https://replicate.com/trainings) and navigate to the training for your vLLM model. From that page, copy the weights URL from the "Download weights" button.

*[Screenshot: Copy weights URL from Replicate training]*

Set the `COG_WEIGHTS` environment variable to the copied value:

```console
$ export COG_WEIGHTS="..."
```

Now make your first prediction against the model locally:

```console
$ cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" \
    -i prompt="Hello!"
```

The first time you run this command, Cog downloads the model weights and saves them to the `models` subdirectory.

To make multiple predictions, start up the HTTP server and send it `POST /predictions` requests:

```console
# Start the HTTP server
$ cog run -p 5000 -e "COG_WEIGHTS=$COG_WEIGHTS" python -m cog.server.http

# In a different terminal session, send requests to the server
$ curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "Hello!"}}'
```
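If you'd rather script against the local server than use `curl`, here is a minimal Python sketch. It assumes the server from the previous step is running on port 5000 and uses the third-party `requests` library (`pip install requests`); the payload shape mirrors the `curl` example above.

```python
import requests

# Send the same payload as the curl example to the local Cog HTTP server.
# requests sets the Content-Type: application/json header for us via json=.
resp = requests.post(
    "http://localhost:5000/predictions",
    json={"input": {"prompt": "Hello!"}},
    timeout=300,
)
resp.raise_for_status()
prediction = resp.json()

# The response is a prediction object; its "output" field holds the result.
# vLLM predictors typically stream tokens, so output may be a list of strings.
output = prediction["output"]
print("".join(output) if isinstance(output, list) else output)
```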
When you're finished working, you can push your changes to Replicate. Grab your API token from [replicate.com/account](https://replicate.com/account) and set it as an environment variable:

```shell
export REPLICATE_API_TOKEN=<your-token>
```

```console
$ echo $REPLICATE_API_TOKEN | cog login --token-stdin
$ cog push r8.im/<your-username>/<your-model-name>
--> ...
--> Pushing image 'r8.im/...'
```

After you push your model, you can try running it on Replicate.

Install the [Replicate Python SDK][replicate-python]:

```console
$ pip install replicate
```

Create a prediction and stream its output:

```python
import replicate

model = replicate.models.get("<your-username>/<your-model-name>")
prediction = replicate.predictions.create(
    version=model.latest_version,
    input={"prompt": "Hello"},
    stream=True,
)

for event in prediction.stream():
    print(str(event), end="")
```

[Replicate]: https://replicate.com
[vLLM-supported language model]: https://docs.vllm.ai/en/latest/models/supported_models.html
[replicate-python]: https://github.com/replicate/replicate-python
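The snippet above streams tokens as they're generated. If you only need the final output, the SDK's blocking `replicate.run()` helper is a simpler alternative; here's a minimal sketch, assuming `REPLICATE_API_TOKEN` is set in your environment and the model identifier below is replaced with your own.

```python
import replicate

# A minimal blocking call: replicate.run() waits for the prediction to
# finish and returns its output. The identifier is a placeholder; you can
# also pin a version, e.g. "user/model:<version-id>".
output = replicate.run(
    "<your-username>/<your-model-name>",
    input={"prompt": "Hello"},
)

# vLLM models generally return the generated text as a list of string
# chunks; join them to recover the full completion.
print("".join(output))
```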