From 6056c4e094e326dc802e04f9f3bc231786cec0bb Mon Sep 17 00:00:00 2001 From: "mingjiang.li" Date: Wed, 21 May 2025 16:49:15 +0800 Subject: [PATCH] add ixrt 25.06 models to reamde model list --- README.md | 25 +++++++++++-------- README_en.md | 25 +++++++++++-------- .../resnext101_32x8d/ixrt/README.md | 8 ++++-- .../resnext101_64x4d/ixrt/README.md | 7 ++++-- .../shufflenetv2_x0_5/ixrt/README.md | 2 +- .../shufflenetv2_x1_0/ixrt/README.md | 2 +- .../shufflenetv2_x1_5/ixrt/README.md | 2 +- .../shufflenetv2_x2_0/ixrt/README.md | 2 +- .../object_detection/yolov10/ixrt/README.md | 2 +- .../object_detection/yolov11/ixrt/README.md | 2 +- .../cv/object_detection/yolov9/ixrt/README.md | 2 +- .../vision_language_model/aria/vllm/README.md | 5 ++-- .../chameleon_7b/vllm/README.md | 2 +- .../fuyu_8b/vllm/README.md | 6 +++-- .../h2vol/vllm/README.md | 10 +++++--- .../idefics3/vllm/README.md | 9 ++++--- .../intern_vl/vllm/README.md | 7 ++++-- .../{mllama => llama-3.2}/vllm/README.md | 9 ++++--- .../{mllama => llama-3.2}/vllm/ci/prepare.sh | 0 .../vllm/offline_inference_vision_language.py | 0 .../llava/vllm/README.md | 9 +++++-- .../llava_next_video_7b/vllm/README.md | 7 ++++-- .../minicpm_v/vllm/README.md | 8 +++--- .../pixtral/vllm/README.md | 2 +- 24 files changed, 98 insertions(+), 55 deletions(-) rename models/multimodal/vision_language_model/{mllama => llama-3.2}/vllm/README.md (73%) rename models/multimodal/vision_language_model/{mllama => llama-3.2}/vllm/ci/prepare.sh (100%) rename models/multimodal/vision_language_model/{mllama => llama-3.2}/vllm/offline_inference_vision_language.py (100%) diff --git a/README.md b/README.md index 2dc67b0e..88b38054 100644 --- a/README.md +++ b/README.md @@ -123,14 +123,14 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | ResNetV1D50 | FP16 | [✅](models/cv/classification/resnetv1d50/igie) | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | ResNeXt50_32x4d | FP16 | [✅](models/cv/classification/resnext50_32x4d/igie) | [✅](models/cv/classification/resnext50_32x4d/ixrt) | 4.2.0 | -| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | | 4.2.0 | -| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | | 4.2.0 | +| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | [✅](models/cv/classification/resnext101_64x4d/ixrt) | 4.2.0 | +| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | [✅](models/cv/classification/resnext101_32x8d/ixrt) | 4.2.0 | | SEResNet50 | FP16 | [✅](models/cv/classification/se_resnet50/igie) | | 4.2.0 | | ShuffleNetV1 | FP16 | | [✅](models/cv/classification/shufflenet_v1/ixrt) | 4.2.0 | -| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | | 4.2.0 | -| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | | 4.2.0 | -| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | | 4.2.0 | -| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | | 4.2.0 | +| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | [✅](models/cv/classification/shufflenetv2_x0_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | [✅](models/cv/classification/shufflenetv2_x1_0/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | 
[✅](models/cv/classification/shufflenetv2_x1_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | [✅](models/cv/classification/shufflenetv2_x2_0/ixrt) | 4.2.0 | | SqueezeNet 1.0 | FP16 | [✅](models/cv/classification/squeezenet_v1_0/igie) | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | SqueezeNet 1.1 | FP16 | [✅](models/cv/classification/squeezenet_v1_1/igie) | [✅](models/cv/classification/squeezenet_v1_1/ixrt) | 4.2.0 | @@ -181,9 +181,9 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | | INT8 | [✅](models/cv/object_detection/yolov7/igie) | [✅](models/cv/object_detection/yolov7/ixrt) | 4.2.0 | | YOLOv8 | FP16 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | -| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | | 4.2.0 | -| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | | 4.2.0 | -| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | | 4.2.0 | +| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | [✅](models/cv/object_detection/yolov9/ixrt) | 4.2.0 | +| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | [✅](models/cv/object_detection/yolov10/ixrt) | 4.2.0 | +| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | [✅](models/cv/object_detection/yolov11/ixrt) | 4.2.0 | | YOLOv12 | FP16 | [✅](models/cv/object_detection/yolov12/igie) | | 4.2.0 | | YOLOX | FP16 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | @@ -236,13 +236,18 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | Model | vLLM | IxFormer | IXUCA SDK | |---------------------|-----------------------------------------------------------------------|------------------------------------------------------------|-----------| +| Aria | [✅](models/multimodal/vision_language_model/aria/vllm) | | 4.2.0 | | Chameleon-7B | [✅](models/multimodal/vision_language_model/chameleon_7b/vllm) | | 4.2.0 | | CLIP | | [✅](models/multimodal/vision_language_model/clip/ixformer) | 4.2.0 | | Fuyu-8B | [✅](models/multimodal/vision_language_model/fuyu_8b/vllm) | | 4.2.0 | +| H2OVL Mississippi | [✅](models/multimodal/vision_language_model/h2vol/vllm) | | 4.2.0 | +| Idefics3 | [✅](models/multimodal/vision_language_model/idefics3/vllm) | | 4.2.0 | | InternVL2-4B | [✅](models/multimodal/vision_language_model/intern_vl/vllm) | | 4.2.0 | | LLaVA | [✅](models/multimodal/vision_language_model/llava/vllm) | | 4.2.0 | | LLaVA-Next-Video-7B | [✅](models/multimodal/vision_language_model/llava_next_video_7b/vllm) | | 4.2.0 | -| MiniCPM V2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Llama-3.2 | [✅](models/multimodal/vision_language_model/llama-3.2/vllm) | | 4.2.0 | +| MiniCPM-V 2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Pixtral | [✅](models/multimodal/vision_language_model/pixtral/vllm) | | 4.2.0 | ### 自然语言处理(NLP) diff --git a/README_en.md b/README_en.md index 6450df59..b35b14e7 100644 --- a/README_en.md +++ b/README_en.md @@ -133,14 +133,14 @@ inference to be expanded in the future. 
| ResNetV1D50 | FP16 | [✅](models/cv/classification/resnetv1d50/igie) | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | ResNeXt50_32x4d | FP16 | [✅](models/cv/classification/resnext50_32x4d/igie) | [✅](models/cv/classification/resnext50_32x4d/ixrt) | 4.2.0 | -| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | | 4.2.0 | -| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | | 4.2.0 | +| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | [✅](models/cv/classification/resnext101_64x4d/ixrt) | 4.2.0 | +| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | [✅](models/cv/classification/resnext101_32x8d/ixrt) | 4.2.0 | | SEResNet50 | FP16 | [✅](models/cv/classification/se_resnet50/igie) | | 4.2.0 | | ShuffleNetV1 | FP16 | | [✅](models/cv/classification/shufflenet_v1/ixrt) | 4.2.0 | -| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | | 4.2.0 | -| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | | 4.2.0 | -| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | | 4.2.0 | -| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | | 4.2.0 | +| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | [✅](models/cv/classification/shufflenetv2_x0_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | [✅](models/cv/classification/shufflenetv2_x1_0/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | [✅](models/cv/classification/shufflenetv2_x1_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | [✅](models/cv/classification/shufflenetv2_x2_0/ixrt) | 4.2.0 | | SqueezeNet 1.0 | FP16 | [✅](models/cv/classification/squeezenet_v1_0/igie) | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | SqueezeNet 1.1 | FP16 | [✅](models/cv/classification/squeezenet_v1_1/igie) | [✅](models/cv/classification/squeezenet_v1_1/ixrt) | 4.2.0 | @@ -191,9 +191,9 @@ inference to be expanded in the future. 
| | INT8 | [✅](models/cv/object_detection/yolov7/igie) | [✅](models/cv/object_detection/yolov7/ixrt) | 4.2.0 | | YOLOv8 | FP16 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | -| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | | 4.2.0 | -| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | | 4.2.0 | -| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | | 4.2.0 | +| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | [✅](models/cv/object_detection/yolov9/ixrt) | 4.2.0 | +| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | [✅](models/cv/object_detection/yolov10/ixrt) | 4.2.0 | +| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | [✅](models/cv/object_detection/yolov11/ixrt) | 4.2.0 | | YOLOv12 | FP16 | [✅](models/cv/object_detection/yolov12/igie) | | 4.2.0 | | YOLOX | FP16 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | @@ -246,13 +246,18 @@ inference to be expanded in the future. | Model | vLLM | IxFormer | IXUCA SDK | |---------------------|-----------------------------------------------------------------------|------------------------------------------------------------|-----------| +| Aria | [✅](models/multimodal/vision_language_model/aria/vllm) | | 4.2.0 | | Chameleon-7B | [✅](models/multimodal/vision_language_model/chameleon_7b/vllm) | | 4.2.0 | | CLIP | | [✅](models/multimodal/vision_language_model/clip/ixformer) | 4.2.0 | | Fuyu-8B | [✅](models/multimodal/vision_language_model/fuyu_8b/vllm) | | 4.2.0 | +| H2OVL Mississippi | [✅](models/multimodal/vision_language_model/h2vol/vllm) | | 4.2.0 | +| Idefics3 | [✅](models/multimodal/vision_language_model/idefics3/vllm) | | 4.2.0 | | InternVL2-4B | [✅](models/multimodal/vision_language_model/intern_vl/vllm) | | 4.2.0 | | LLaVA | [✅](models/multimodal/vision_language_model/llava/vllm) | | 4.2.0 | | LLaVA-Next-Video-7B | [✅](models/multimodal/vision_language_model/llava_next_video_7b/vllm) | | 4.2.0 | -| MiniCPM V2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Llama-3.2 | [✅](models/multimodal/vision_language_model/llama-3.2/vllm) | | 4.2.0 | +| MiniCPM-V 2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Pixtral | [✅](models/multimodal/vision_language_model/pixtral/vllm) | | 4.2.0 | ### NLP diff --git a/models/cv/classification/resnext101_32x8d/ixrt/README.md b/models/cv/classification/resnext101_32x8d/ixrt/README.md index aec82d02..5859d915 100644 --- a/models/cv/classification/resnext101_32x8d/ixrt/README.md +++ b/models/cv/classification/resnext101_32x8d/ixrt/README.md @@ -1,8 +1,12 @@ -# ResNext101_32x8d (IXRT) +# ResNext101_32x8d (IxRT) ## Model Description -ResNeXt101_32x8d is a deep convolutional neural network introduced in the paper "Aggregated Residual Transformations for Deep Neural Networks." It enhances the traditional ResNet architecture by incorporating group convolutions, offering a new dimension for scaling network capacity through "cardinality" (the number of groups) rather than merely increasing depth or width.The model consists of 101 layers and uses a configuration of 32 groups, each with a width of 8 channels. 
This design improves feature extraction while maintaining computational efficiency.
+ResNeXt101_32x8d is a deep convolutional neural network introduced in the paper "Aggregated Residual Transformations for
+Deep Neural Networks." It enhances the traditional ResNet architecture by incorporating group convolutions, offering a
+new dimension for scaling network capacity through "cardinality" (the number of groups) rather than merely increasing
+depth or width. The model consists of 101 layers and uses a configuration of 32 groups, each with a width of 8 channels.
+This design improves feature extraction while maintaining computational efficiency.
 
 ## Supported Environments
 
diff --git a/models/cv/classification/resnext101_64x4d/ixrt/README.md b/models/cv/classification/resnext101_64x4d/ixrt/README.md
index bba444ee..cc647490 100644
--- a/models/cv/classification/resnext101_64x4d/ixrt/README.md
+++ b/models/cv/classification/resnext101_64x4d/ixrt/README.md
@@ -1,8 +1,11 @@
-# ResNext101_64x4d (IGIE)
+# ResNext101_64x4d (IxRT)
 
 ## Model Description
 
-The ResNeXt101_64x4d is a deep learning model based on the deep residual network architecture, which enhances performance and efficiency through the use of grouped convolutions. With a depth of 101 layers and 64 filter groups, it is particularly suited for complex image recognition tasks. While maintaining excellent accuracy, it can adapt to various input sizes
+The ResNeXt101_64x4d is a deep learning model based on the deep residual network architecture, which enhances
+performance and efficiency through the use of grouped convolutions. With a depth of 101 layers and 64 filter groups, it
+is particularly suited for complex image recognition tasks. While maintaining excellent accuracy, it can adapt to
+various input sizes.
 
 ## Supported Environments
 
diff --git a/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md b/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md
index 1929de59..405523a5 100644
--- a/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md
+++ b/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md
@@ -1,4 +1,4 @@
-# ShuffleNetV2 x0_5 (IxRT)
+# ShuffleNetV2_x0_5 (IxRT)
 
 ## Model Description
 
diff --git a/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md b/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md
index e4fb84a3..b2fd0085 100644
--- a/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md
+++ b/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md
@@ -1,4 +1,4 @@
-# ShuffleNetV2_x1_0 (IXRT)
+# ShuffleNetV2_x1_0 (IxRT)
 
 ## Model Description
 
diff --git a/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md b/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md
index 1e86d054..34bb7cbe 100644
--- a/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md
+++ b/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md
@@ -1,4 +1,4 @@
-# ShuffleNetV2_x1_5 (IXRT)
+# ShuffleNetV2_x1_5 (IxRT)
 
 ## Model Description
 
diff --git a/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md b/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md
index c8650d9f..ca8b5212 100644
--- a/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md
+++ b/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md
@@ -1,4 +1,4 @@
-# ShuffleNetV2_x2_0 (IXRT)
+# ShuffleNetV2_x2_0 (IxRT)
 
 ## Model Description
 
diff --git a/models/cv/object_detection/yolov10/ixrt/README.md b/models/cv/object_detection/yolov10/ixrt/README.md
index 274c59fc..6fade83d 100644
--- a/models/cv/object_detection/yolov10/ixrt/README.md
+++ b/models/cv/object_detection/yolov10/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv10 (IXRT) +# YOLOv10 (IxRT) ## Model Description diff --git a/models/cv/object_detection/yolov11/ixrt/README.md b/models/cv/object_detection/yolov11/ixrt/README.md index 1f7993b1..3172be85 100644 --- a/models/cv/object_detection/yolov11/ixrt/README.md +++ b/models/cv/object_detection/yolov11/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv11 (IGIE) +# YOLOv11 (IxRT) ## Model Description diff --git a/models/cv/object_detection/yolov9/ixrt/README.md b/models/cv/object_detection/yolov9/ixrt/README.md index e74bd516..806be63a 100644 --- a/models/cv/object_detection/yolov9/ixrt/README.md +++ b/models/cv/object_detection/yolov9/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv9 (IXRT) +# YOLOv9 (IxRT) ## Model Description diff --git a/models/multimodal/vision_language_model/aria/vllm/README.md b/models/multimodal/vision_language_model/aria/vllm/README.md index eb8924c1..10ef24f8 100644 --- a/models/multimodal/vision_language_model/aria/vllm/README.md +++ b/models/multimodal/vision_language_model/aria/vllm/README.md @@ -1,8 +1,9 @@ -# Aria +# Aria (vLLM) ## Model Description Aria is a multimodal native MoE model. It features: + - State-of-the-art performance on various multimodal and language tasks, superior in video and document understanding; - Long multimodal context window of 64K tokens; - 3.9B activated parameters per token, enabling fast inference speed and low fine-tuning cost. @@ -45,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/ python3 offline_inference_vision_language.py --model data/Aria --max-tokens 256 -tp 4 --trust-remote-code --temperature 0.0 --dtype bfloat16 --tokenizer-mode slow ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md b/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md index d2b61966..1ed7c911 100755 --- a/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md +++ b/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md @@ -1,4 +1,4 @@ -# Chameleon +# Chameleon (vLLM) ## Model Description diff --git a/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md b/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md index b03fa5d4..f751f8c4 100755 --- a/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md +++ b/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md @@ -1,10 +1,12 @@ -# Fuyu-8B +# Fuyu-8B (vLLM) ## Model Description Fuyu-8B is a multi-modal text and image transformer trained by Adept AI. -Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the transformer decoder like an image transformer (albeit with no pooling and causal attention). +Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead +linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the +transformer decoder like an image transformer (albeit with no pooling and causal attention). 
## Supported Environments diff --git a/models/multimodal/vision_language_model/h2vol/vllm/README.md b/models/multimodal/vision_language_model/h2vol/vllm/README.md index 80c4c329..671410ea 100644 --- a/models/multimodal/vision_language_model/h2vol/vllm/README.md +++ b/models/multimodal/vision_language_model/h2vol/vllm/README.md @@ -1,8 +1,12 @@ -# H2ovl +# H2OVL Mississippi (vLLM) ## Model Description -The H2OVL-Mississippi-800M is a compact yet powerful vision-language model from H2O.ai, featuring 0.8 billion parameters. Despite its small size, it delivers state-of-the-art performance in text recognition, excelling in the Text Recognition segment of OCRBench and outperforming much larger models in this domain. Built upon the robust architecture of our H2O-Danube language models, the Mississippi-800M extends their capabilities by seamlessly integrating vision and language tasks. +The H2OVL-Mississippi-800M is a compact yet powerful vision-language model from H2O.ai, featuring 0.8 billion +parameters. Despite its small size, it delivers state-of-the-art performance in text recognition, excelling in the Text +Recognition segment of OCRBench and outperforming much larger models in this domain. Built upon the robust architecture +of our H2O-Danube language models, the Mississippi-800M extends their capabilities by seamlessly integrating vision and +language tasks. ## Supported Environments @@ -42,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/ python3 offline_inference_vision_language.py --model data/h2ovl-mississippi-800m -tp 1 --max-tokens 256 --trust-remote-code --temperature 0.0 --disable-mm-preprocessor-cache ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/idefics3/vllm/README.md b/models/multimodal/vision_language_model/idefics3/vllm/README.md index d89c20d9..721d7f0a 100644 --- a/models/multimodal/vision_language_model/idefics3/vllm/README.md +++ b/models/multimodal/vision_language_model/idefics3/vllm/README.md @@ -1,8 +1,11 @@ -# Idefics3 +# Idefics3 (vLLM) ## Model Description -Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, significantly enhancing capabilities around OCR, document understanding and visual reasoning. +Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text +outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple +images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, +significantly enhancing capabilities around OCR, document understanding and visual reasoning. 
 ## Supported Environments
 
@@ -42,4 +45,4 @@ export VLLM_ASSETS_CACHE=../vllm/
 python3 offline_inference_vision_language.py --model data/Idefics3-8B-Llama3 -tp 4 --max-tokens 256 --trust-remote-code --temperature 0.0 --disable-mm-preprocessor-cache
 ```
 
-## Model Results
\ No newline at end of file
+## Model Results
diff --git a/models/multimodal/vision_language_model/intern_vl/vllm/README.md b/models/multimodal/vision_language_model/intern_vl/vllm/README.md
index be75164b..78bb8d1b 100644
--- a/models/multimodal/vision_language_model/intern_vl/vllm/README.md
+++ b/models/multimodal/vision_language_model/intern_vl/vllm/README.md
@@ -1,8 +1,11 @@
-# InternVL2-4B
+# InternVL2-4B (vLLM)
 
 ## Model Description
 
-InternVL2-4B is a large-scale multimodal model developed by WeTab AI, designed to handle a wide range of tasks involving both text and visual data. With 4 billion parameters, it is capable of understanding and generating complex patterns in data, making it suitable for applications such as image recognition, natural language processing, and multimodal learning.
+InternVL2-4B is a large-scale multimodal model developed by OpenGVLab, designed to handle a wide range of tasks involving
+both text and visual data. With 4 billion parameters, it is capable of understanding and generating complex patterns in
+data, making it suitable for applications such as image recognition, natural language processing, and multimodal
+learning.
 
 ## Supported Environments
 
diff --git a/models/multimodal/vision_language_model/mllama/vllm/README.md b/models/multimodal/vision_language_model/llama-3.2/vllm/README.md
similarity index 73%
rename from models/multimodal/vision_language_model/mllama/vllm/README.md
rename to models/multimodal/vision_language_model/llama-3.2/vllm/README.md
index 70d9574b..b6aab078 100644
--- a/models/multimodal/vision_language_model/mllama/vllm/README.md
+++ b/models/multimodal/vision_language_model/llama-3.2/vllm/README.md
@@ -1,8 +1,11 @@
-# Mllama
+# Llama-3.2 (vLLM)
 
 ## Model Description
 
-The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.
+The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and
+instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only
+models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They
+outperform many of the available open source and closed chat models on common industry benchmarks.
 ## Supported Environments
 
@@ -43,4 +46,4 @@ export VLLM_FORCE_NCCL_COMM=1
 python3 offline_inference_vision_language.py --model data/LLamaV3.2 --max-tokens 256 -tp 2 --trust-remote-code --temperature 0.0 --max-model-len 8192 --max-num-seqs 16
 ```
 
-## Model Results
\ No newline at end of file
+## Model Results
diff --git a/models/multimodal/vision_language_model/mllama/vllm/ci/prepare.sh b/models/multimodal/vision_language_model/llama-3.2/vllm/ci/prepare.sh
similarity index 100%
rename from models/multimodal/vision_language_model/mllama/vllm/ci/prepare.sh
rename to models/multimodal/vision_language_model/llama-3.2/vllm/ci/prepare.sh
diff --git a/models/multimodal/vision_language_model/mllama/vllm/offline_inference_vision_language.py b/models/multimodal/vision_language_model/llama-3.2/vllm/offline_inference_vision_language.py
similarity index 100%
rename from models/multimodal/vision_language_model/mllama/vllm/offline_inference_vision_language.py
rename to models/multimodal/vision_language_model/llama-3.2/vllm/offline_inference_vision_language.py
diff --git a/models/multimodal/vision_language_model/llava/vllm/README.md b/models/multimodal/vision_language_model/llava/vllm/README.md
index f60207df..599b66f0 100644
--- a/models/multimodal/vision_language_model/llava/vllm/README.md
+++ b/models/multimodal/vision_language_model/llava/vllm/README.md
@@ -1,8 +1,13 @@
-# LLava
+# LLaVA (vLLM)
 
 ## Model Description
 
-LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.The LLaVA-NeXT model was proposed in LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon LLaVa-1.5 by increasing the input image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense reasoning.
+LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following
+data. It is an auto-regressive language model, based on the transformer architecture. The LLaVA-NeXT model was proposed
+in LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan
+Zhang, Sheng Shen, Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon LLaVa-1.5 by increasing the input
+image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense
+reasoning.
 
 ## Supported Environments
 
diff --git a/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md b/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md
index d705d550..31b5622f 100755
--- a/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md
+++ b/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md
@@ -1,8 +1,11 @@
-# LLaVA-Next-Video-7B
+# LLaVA-Next-Video-7B (vLLM)
 
 ## Model Description
 
-LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The model is buit on top of LLaVa-NeXT by tuning on a mix of video and image data to achieves better video understanding capabilities. The videos were sampled uniformly to be 32 frames per clip. The model is a current SOTA among open-source models on VideoMME bench. Base LLM: lmsys/vicuna-7b-v1.5
+LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The
+model is built on top of LLaVa-NeXT by tuning on a mix of video and image data to achieve better video understanding
+capabilities. The videos were sampled uniformly to be 32 frames per clip. The model is a current SOTA among open-source
+models on VideoMME bench. Base LLM: lmsys/vicuna-7b-v1.5
 
 ## Supported Environments
 
diff --git a/models/multimodal/vision_language_model/minicpm_v/vllm/README.md b/models/multimodal/vision_language_model/minicpm_v/vllm/README.md
index bc60e34a..a404f6ec 100644
--- a/models/multimodal/vision_language_model/minicpm_v/vllm/README.md
+++ b/models/multimodal/vision_language_model/minicpm_v/vllm/README.md
@@ -1,8 +1,10 @@
-# MiniCPM-V-2
+# MiniCPM-V 2 (vLLM)
 
 ## Model Description
 
-MiniCPM V2 is a compact and efficient language model designed for various natural language processing (NLP) tasks. Building on its predecessor, MiniCPM-V-1, this model integrates advancements in architecture and optimization techniques, making it suitable for deployment in resource-constrained environments.s
+MiniCPM V2 is a compact and efficient language model designed for various natural language processing (NLP) tasks.
+Building on its predecessor, MiniCPM-V-1, this model integrates advancements in architecture and optimization
+techniques, making it suitable for deployment in resource-constrained environments.
 
 ## Supported Environments
 
@@ -44,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/
 PT_SDPA_ENABLE_HEAD_DIM_PADDING=1 python3 offline_inference_vision_language.py --model data/MiniCPM-V-2 --max-tokens 256 -tp 2 --trust-remote-code --temperature 0.0
 ```
 
-## Model Results
\ No newline at end of file
+## Model Results
diff --git a/models/multimodal/vision_language_model/pixtral/vllm/README.md b/models/multimodal/vision_language_model/pixtral/vllm/README.md
index 904b90c8..bb3abd99 100644
--- a/models/multimodal/vision_language_model/pixtral/vllm/README.md
+++ b/models/multimodal/vision_language_model/pixtral/vllm/README.md
@@ -1,4 +1,4 @@
-# Pixtral
+# Pixtral (vLLM)
 
 ## Model Description
 
--
Gitee