diff --git a/README.md b/README.md index 2dc67b0ec4576122b8bb6700d700553c675fa564..88b380540d4cf759133891913e95156c82c125fb 100644 --- a/README.md +++ b/README.md @@ -123,14 +123,14 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | ResNetV1D50 | FP16 | [✅](models/cv/classification/resnetv1d50/igie) | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | ResNeXt50_32x4d | FP16 | [✅](models/cv/classification/resnext50_32x4d/igie) | [✅](models/cv/classification/resnext50_32x4d/ixrt) | 4.2.0 | -| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | | 4.2.0 | -| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | | 4.2.0 | +| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | [✅](models/cv/classification/resnext101_64x4d/ixrt) | 4.2.0 | +| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | [✅](models/cv/classification/resnext101_32x8d/ixrt) | 4.2.0 | | SEResNet50 | FP16 | [✅](models/cv/classification/se_resnet50/igie) | | 4.2.0 | | ShuffleNetV1 | FP16 | | [✅](models/cv/classification/shufflenet_v1/ixrt) | 4.2.0 | -| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | | 4.2.0 | -| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | | 4.2.0 | -| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | | 4.2.0 | -| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | | 4.2.0 | +| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | [✅](models/cv/classification/shufflenetv2_x0_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | [✅](models/cv/classification/shufflenetv2_x1_0/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | [✅](models/cv/classification/shufflenetv2_x1_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | [✅](models/cv/classification/shufflenetv2_x2_0/ixrt) | 4.2.0 | | SqueezeNet 1.0 | FP16 | [✅](models/cv/classification/squeezenet_v1_0/igie) | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | SqueezeNet 1.1 | FP16 | [✅](models/cv/classification/squeezenet_v1_1/igie) | [✅](models/cv/classification/squeezenet_v1_1/ixrt) | 4.2.0 | @@ -181,9 +181,9 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | | INT8 | [✅](models/cv/object_detection/yolov7/igie) | [✅](models/cv/object_detection/yolov7/ixrt) | 4.2.0 | | YOLOv8 | FP16 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | -| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | | 4.2.0 | -| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | | 4.2.0 | -| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | | 4.2.0 | +| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | [✅](models/cv/object_detection/yolov9/ixrt) | 4.2.0 | +| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | [✅](models/cv/object_detection/yolov10/ixrt) | 4.2.0 | +| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | [✅](models/cv/object_detection/yolov11/ixrt) | 4.2.0 | | YOLOv12 | FP16 | 
[✅](models/cv/object_detection/yolov12/igie) | | 4.2.0 | | YOLOX | FP16 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | @@ -236,13 +236,18 @@ DeepSparkInference将按季度进行版本更新,后续会逐步丰富模型 | Model | vLLM | IxFormer | IXUCA SDK | |---------------------|-----------------------------------------------------------------------|------------------------------------------------------------|-----------| +| Aria | [✅](models/multimodal/vision_language_model/aria/vllm) | | 4.2.0 | | Chameleon-7B | [✅](models/multimodal/vision_language_model/chameleon_7b/vllm) | | 4.2.0 | | CLIP | | [✅](models/multimodal/vision_language_model/clip/ixformer) | 4.2.0 | | Fuyu-8B | [✅](models/multimodal/vision_language_model/fuyu_8b/vllm) | | 4.2.0 | +| H2OVL Mississippi | [✅](models/multimodal/vision_language_model/h2vol/vllm) | | 4.2.0 | +| Idefics3 | [✅](models/multimodal/vision_language_model/idefics3/vllm) | | 4.2.0 | | InternVL2-4B | [✅](models/multimodal/vision_language_model/intern_vl/vllm) | | 4.2.0 | | LLaVA | [✅](models/multimodal/vision_language_model/llava/vllm) | | 4.2.0 | | LLaVA-Next-Video-7B | [✅](models/multimodal/vision_language_model/llava_next_video_7b/vllm) | | 4.2.0 | -| MiniCPM V2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Llama-3.2 | [✅](models/multimodal/vision_language_model/llama-3.2/vllm) | | 4.2.0 | +| MiniCPM-V 2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Pixtral | [✅](models/multimodal/vision_language_model/pixtral/vllm) | | 4.2.0 | ### 自然语言处理(NLP) diff --git a/README_en.md b/README_en.md index 6450df59157f0af7cab860bb2da129986b1497cb..b35b14e7f71e7d73cf68f0a5b09639fe3e3afe4d 100644 --- a/README_en.md +++ b/README_en.md @@ -133,14 +133,14 @@ inference to be expanded in the future. 
| ResNetV1D50 | FP16 | [✅](models/cv/classification/resnetv1d50/igie) | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/resnetv1d50/ixrt) | 4.2.0 | | ResNeXt50_32x4d | FP16 | [✅](models/cv/classification/resnext50_32x4d/igie) | [✅](models/cv/classification/resnext50_32x4d/ixrt) | 4.2.0 | -| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | | 4.2.0 | -| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | | 4.2.0 | +| ResNeXt101_64x4d | FP16 | [✅](models/cv/classification/resnext101_64x4d/igie) | [✅](models/cv/classification/resnext101_64x4d/ixrt) | 4.2.0 | +| ResNeXt101_32x8d | FP16 | [✅](models/cv/classification/resnext101_32x8d/igie) | [✅](models/cv/classification/resnext101_32x8d/ixrt) | 4.2.0 | | SEResNet50 | FP16 | [✅](models/cv/classification/se_resnet50/igie) | | 4.2.0 | | ShuffleNetV1 | FP16 | | [✅](models/cv/classification/shufflenet_v1/ixrt) | 4.2.0 | -| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | | 4.2.0 | -| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | | 4.2.0 | -| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | | 4.2.0 | -| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | | 4.2.0 | +| ShuffleNetV2_x0_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x0_5/igie) | [✅](models/cv/classification/shufflenetv2_x0_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_0/igie) | [✅](models/cv/classification/shufflenetv2_x1_0/ixrt) | 4.2.0 | +| ShuffleNetV2_x1_5 | FP16 | [✅](models/cv/classification/shufflenetv2_x1_5/igie) | [✅](models/cv/classification/shufflenetv2_x1_5/ixrt) | 4.2.0 | +| ShuffleNetV2_x2_0 | FP16 | [✅](models/cv/classification/shufflenetv2_x2_0/igie) | [✅](models/cv/classification/shufflenetv2_x2_0/ixrt) | 4.2.0 | | SqueezeNet 1.0 | FP16 | [✅](models/cv/classification/squeezenet_v1_0/igie) | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | | INT8 | | [✅](models/cv/classification/squeezenet_v1_0/ixrt) | 4.2.0 | | SqueezeNet 1.1 | FP16 | [✅](models/cv/classification/squeezenet_v1_1/igie) | [✅](models/cv/classification/squeezenet_v1_1/ixrt) | 4.2.0 | @@ -191,9 +191,9 @@ inference to be expanded in the future. 
| | INT8 | [✅](models/cv/object_detection/yolov7/igie) | [✅](models/cv/object_detection/yolov7/ixrt) | 4.2.0 | | YOLOv8 | FP16 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolov8/igie) | [✅](models/cv/object_detection/yolov8/ixrt) | 4.2.0 | -| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | | 4.2.0 | -| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | | 4.2.0 | -| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | | 4.2.0 | +| YOLOv9 | FP16 | [✅](models/cv/object_detection/yolov9/igie) | [✅](models/cv/object_detection/yolov9/ixrt) | 4.2.0 | +| YOLOv10 | FP16 | [✅](models/cv/object_detection/yolov10/igie) | [✅](models/cv/object_detection/yolov10/ixrt) | 4.2.0 | +| YOLOv11 | FP16 | [✅](models/cv/object_detection/yolov11/igie) | [✅](models/cv/object_detection/yolov11/ixrt) | 4.2.0 | | YOLOv12 | FP16 | [✅](models/cv/object_detection/yolov12/igie) | | 4.2.0 | | YOLOX | FP16 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | | | INT8 | [✅](models/cv/object_detection/yolox/igie) | [✅](models/cv/object_detection/yolox/ixrt) | 4.2.0 | @@ -246,13 +246,18 @@ inference to be expanded in the future. | Model | vLLM | IxFormer | IXUCA SDK | |---------------------|-----------------------------------------------------------------------|------------------------------------------------------------|-----------| +| Aria | [✅](models/multimodal/vision_language_model/aria/vllm) | | 4.2.0 | | Chameleon-7B | [✅](models/multimodal/vision_language_model/chameleon_7b/vllm) | | 4.2.0 | | CLIP | | [✅](models/multimodal/vision_language_model/clip/ixformer) | 4.2.0 | | Fuyu-8B | [✅](models/multimodal/vision_language_model/fuyu_8b/vllm) | | 4.2.0 | +| H2OVL Mississippi | [✅](models/multimodal/vision_language_model/h2vol/vllm) | | 4.2.0 | +| Idefics3 | [✅](models/multimodal/vision_language_model/idefics3/vllm) | | 4.2.0 | | InternVL2-4B | [✅](models/multimodal/vision_language_model/intern_vl/vllm) | | 4.2.0 | | LLaVA | [✅](models/multimodal/vision_language_model/llava/vllm) | | 4.2.0 | | LLaVA-Next-Video-7B | [✅](models/multimodal/vision_language_model/llava_next_video_7b/vllm) | | 4.2.0 | -| MiniCPM V2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Llama-3.2 | [✅](models/multimodal/vision_language_model/llama-3.2/vllm) | | 4.2.0 | +| MiniCPM-V 2 | [✅](models/multimodal/vision_language_model/minicpm_v/vllm) | | 4.2.0 | +| Pixtral | [✅](models/multimodal/vision_language_model/pixtral/vllm) | | 4.2.0 | ### NLP diff --git a/models/cv/classification/resnext101_32x8d/ixrt/README.md b/models/cv/classification/resnext101_32x8d/ixrt/README.md index aec82d029de1fd36d6235ad0fac9908987f55b1a..5859d9157db83776129c5ecdd2a36814726ffbae 100644 --- a/models/cv/classification/resnext101_32x8d/ixrt/README.md +++ b/models/cv/classification/resnext101_32x8d/ixrt/README.md @@ -1,8 +1,12 @@ -# ResNext101_32x8d (IXRT) +# ResNext101_32x8d (IxRT) ## Model Description -ResNeXt101_32x8d is a deep convolutional neural network introduced in the paper "Aggregated Residual Transformations for Deep Neural Networks." 
It enhances the traditional ResNet architecture by incorporating group convolutions, offering a new dimension for scaling network capacity through "cardinality" (the number of groups) rather than merely increasing depth or width.The model consists of 101 layers and uses a configuration of 32 groups, each with a width of 8 channels. This design improves feature extraction while maintaining computational efficiency. +ResNeXt101_32x8d is a deep convolutional neural network introduced in the paper "Aggregated Residual Transformations for +Deep Neural Networks." It enhances the traditional ResNet architecture by incorporating group convolutions, offering a +new dimension for scaling network capacity through "cardinality" (the number of groups) rather than merely increasing +depth or width. The model consists of 101 layers and uses a configuration of 32 groups, each with a width of 8 channels. +This design improves feature extraction while maintaining computational efficiency. ## Supported Environments diff --git a/models/cv/classification/resnext101_64x4d/ixrt/README.md b/models/cv/classification/resnext101_64x4d/ixrt/README.md index bba444eec1d6dfd3d566e14cb3ec6c2ccdae5a21..cc6474908af271a1dbc23e5b77a4d712121daa57 100644 --- a/models/cv/classification/resnext101_64x4d/ixrt/README.md +++ b/models/cv/classification/resnext101_64x4d/ixrt/README.md @@ -1,8 +1,11 @@ -# ResNext101_64x4d (IGIE) +# ResNext101_64x4d (IxRT) ## Model Description -The ResNeXt101_64x4d is a deep learning model based on the deep residual network architecture, which enhances performance and efficiency through the use of grouped convolutions. With a depth of 101 layers and 64 filter groups, it is particularly suited for complex image recognition tasks. While maintaining excellent accuracy, it can adapt to various input sizes +The ResNeXt101_64x4d is a deep learning model based on the deep residual network architecture, which enhances +performance and efficiency through the use of grouped convolutions. With a depth of 101 layers and 64 filter groups, it +is particularly suited for complex image recognition tasks. 
While maintaining excellent accuracy, it can adapt to +various input sizes ## Supported Environments diff --git a/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md b/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md index 1929de59e8c2d32c19c4c58df0b919659010c9d2..405523a5ee5c4977f59337435fb7837344d544da 100644 --- a/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md +++ b/models/cv/classification/shufflenetv2_x0_5/ixrt/README.md @@ -1,4 +1,4 @@ -# ShuffleNetV2 x0_5 (IxRT) +# ShuffleNetV2_x0_5 (IxRT) ## Model Description diff --git a/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md b/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md index e4fb84a3d05c3bd860adac009fe8452ebd7ef897..b2fd0085eb927d475470170a20a1482d78d98f03 100644 --- a/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md +++ b/models/cv/classification/shufflenetv2_x1_0/ixrt/README.md @@ -1,4 +1,4 @@ -# ShuffleNetV2_x1_0 (IXRT) +# ShuffleNetV2_x1_0 (IxRT) ## Model Description diff --git a/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md b/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md index 1e86d0542dd69b09c3ac9299caf8b4bd2c5684e2..34bb7cbe77bfa1eeae9a59bebd3cfdac7e5de070 100644 --- a/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md +++ b/models/cv/classification/shufflenetv2_x1_5/ixrt/README.md @@ -1,4 +1,4 @@ -# ShuffleNetV2_x1_5 (IXRT) +# ShuffleNetV2_x1_5 (IxRT) ## Model Description diff --git a/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md b/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md index c8650d9f3eef27ef738493d9dfd17f3448ce0b82..ca8b5212b55955a670517bf45e4181b7806ba316 100644 --- a/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md +++ b/models/cv/classification/shufflenetv2_x2_0/ixrt/README.md @@ -1,4 +1,4 @@ -# ShuffleNetV2_x2_0 (IXRT) +# ShuffleNetV2_x2_0 (IxRT) ## Model Description diff --git a/models/cv/object_detection/yolov10/ixrt/README.md b/models/cv/object_detection/yolov10/ixrt/README.md index 274c59fc6119ad6e551ba30ffea39bc3fd74cee7..6fade83d11496a0b3206ca89347ffb368b17ae83 100644 --- a/models/cv/object_detection/yolov10/ixrt/README.md +++ b/models/cv/object_detection/yolov10/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv10 (IXRT) +# YOLOv10 (IxRT) ## Model Description diff --git a/models/cv/object_detection/yolov11/ixrt/README.md b/models/cv/object_detection/yolov11/ixrt/README.md index 1f7993b1ba409da3f535b39a6360013d4c045f94..3172be8544a51c291fd4bad291761ca88908a827 100644 --- a/models/cv/object_detection/yolov11/ixrt/README.md +++ b/models/cv/object_detection/yolov11/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv11 (IGIE) +# YOLOv11 (IxRT) ## Model Description diff --git a/models/cv/object_detection/yolov9/ixrt/README.md b/models/cv/object_detection/yolov9/ixrt/README.md index e74bd51633812ec45c9f174b9cef1e54364a2c3f..806be63ac0fbb630f0a891920453fdaa0b8a7157 100644 --- a/models/cv/object_detection/yolov9/ixrt/README.md +++ b/models/cv/object_detection/yolov9/ixrt/README.md @@ -1,4 +1,4 @@ -# YOLOv9 (IXRT) +# YOLOv9 (IxRT) ## Model Description diff --git a/models/multimodal/vision_language_model/aria/vllm/README.md b/models/multimodal/vision_language_model/aria/vllm/README.md index eb8924c10500647d8911f2868a2eb8080c53a7b9..10ef24f8ce7d2c2e6db5cb9a34fc9c181df7e9a6 100644 --- a/models/multimodal/vision_language_model/aria/vllm/README.md +++ b/models/multimodal/vision_language_model/aria/vllm/README.md @@ -1,8 +1,9 @@ -# Aria +# Aria (vLLM) ## Model Description Aria is a multimodal native MoE model. 
It features: + - State-of-the-art performance on various multimodal and language tasks, superior in video and document understanding; - Long multimodal context window of 64K tokens; - 3.9B activated parameters per token, enabling fast inference speed and low fine-tuning cost. @@ -45,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/ python3 offline_inference_vision_language.py --model data/Aria --max-tokens 256 -tp 4 --trust-remote-code --temperature 0.0 --dtype bfloat16 --tokenizer-mode slow ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md b/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md index d2b61966dc1a27d5a5201e53ea5d47acb853dcdc..1ed7c9116c970df30b47800496835aac9a0016c9 100755 --- a/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md +++ b/models/multimodal/vision_language_model/chameleon_7b/vllm/README.md @@ -1,4 +1,4 @@ -# Chameleon +# Chameleon (vLLM) ## Model Description diff --git a/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md b/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md index b03fa5d48aa029896b8ea992b87137ada165fa98..f751f8c4db94a5b7c1e170ead59ec7ad40fcfc9c 100755 --- a/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md +++ b/models/multimodal/vision_language_model/fuyu_8b/vllm/README.md @@ -1,10 +1,12 @@ -# Fuyu-8B +# Fuyu-8B (vLLM) ## Model Description Fuyu-8B is a multi-modal text and image transformer trained by Adept AI. -Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the transformer decoder like an image transformer (albeit with no pooling and causal attention). +Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead +linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the +transformer decoder like an image transformer (albeit with no pooling and causal attention). ## Supported Environments diff --git a/models/multimodal/vision_language_model/h2vol/vllm/README.md b/models/multimodal/vision_language_model/h2vol/vllm/README.md index 80c4c329ada1686e4ead4e739365932cf149d37c..671410ea7e2080fe193bcceabd3542535d5d42d6 100644 --- a/models/multimodal/vision_language_model/h2vol/vllm/README.md +++ b/models/multimodal/vision_language_model/h2vol/vllm/README.md @@ -1,8 +1,12 @@ -# H2ovl +# H2OVL Mississippi (vLLM) ## Model Description -The H2OVL-Mississippi-800M is a compact yet powerful vision-language model from H2O.ai, featuring 0.8 billion parameters. Despite its small size, it delivers state-of-the-art performance in text recognition, excelling in the Text Recognition segment of OCRBench and outperforming much larger models in this domain. Built upon the robust architecture of our H2O-Danube language models, the Mississippi-800M extends their capabilities by seamlessly integrating vision and language tasks. +The H2OVL-Mississippi-800M is a compact yet powerful vision-language model from H2O.ai, featuring 0.8 billion +parameters. Despite its small size, it delivers state-of-the-art performance in text recognition, excelling in the Text +Recognition segment of OCRBench and outperforming much larger models in this domain. 
Built upon the robust architecture +of the H2O-Danube language models, the Mississippi-800M extends their capabilities by seamlessly integrating vision and +language tasks. ## Supported Environments @@ -42,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/ python3 offline_inference_vision_language.py --model data/h2ovl-mississippi-800m -tp 1 --max-tokens 256 --trust-remote-code --temperature 0.0 --disable-mm-preprocessor-cache ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/idefics3/vllm/README.md b/models/multimodal/vision_language_model/idefics3/vllm/README.md index d89c20d9cc4f54901d79a8d631388ae2853f0822..721d7f0af8f1de2b604198e174c1abc19a59d119 100644 --- a/models/multimodal/vision_language_model/idefics3/vllm/README.md +++ b/models/multimodal/vision_language_model/idefics3/vllm/README.md @@ -1,8 +1,11 @@ -# Idefics3 +# Idefics3 (vLLM) ## Model Description -Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, significantly enhancing capabilities around OCR, document understanding and visual reasoning. +Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text +outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple +images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, +significantly enhancing capabilities around OCR, document understanding and visual reasoning. ## Supported Environments @@ -42,4 +45,4 @@ export VLLM_ASSETS_CACHE=../vllm/ python3 offline_inference_vision_language.py --model data/Idefics3-8B-Llama3 -tp 4 --max-tokens 256 --trust-remote-code --temperature 0.0 --disable-mm-preprocessor-cache ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/intern_vl/vllm/README.md b/models/multimodal/vision_language_model/intern_vl/vllm/README.md index be75164b9862761879bcc680d62223b623f6b26d..78bb8d1b1297bba864816057c0192193e85f8849 100644 --- a/models/multimodal/vision_language_model/intern_vl/vllm/README.md +++ b/models/multimodal/vision_language_model/intern_vl/vllm/README.md @@ -1,8 +1,11 @@ -# InternVL2-4B +# InternVL2-4B (vLLM) ## Model Description -InternVL2-4B is a large-scale multimodal model developed by WeTab AI, designed to handle a wide range of tasks involving both text and visual data. With 4 billion parameters, it is capable of understanding and generating complex patterns in data, making it suitable for applications such as image recognition, natural language processing, and multimodal learning. +InternVL2-4B is a large-scale multimodal model developed by OpenGVLab, designed to handle a wide range of tasks involving +both text and visual data. With 4 billion parameters, it is capable of understanding and generating complex patterns in +data, making it suitable for applications such as image recognition, natural language processing, and multimodal +learning. 
## Supported Environments diff --git a/models/multimodal/vision_language_model/mllama/vllm/README.md b/models/multimodal/vision_language_model/llama-3.2/vllm/README.md similarity index 73% rename from models/multimodal/vision_language_model/mllama/vllm/README.md rename to models/multimodal/vision_language_model/llama-3.2/vllm/README.md index 70d9574bf128b463bab143415278493d0fada1ba..b6aab0789255ee31da3817ea962dacbf0b797fa7 100644 --- a/models/multimodal/vision_language_model/mllama/vllm/README.md +++ b/models/multimodal/vision_language_model/llama-3.2/vllm/README.md @@ -1,8 +1,11 @@ -# Mllama +# Llama-3.2 (vLLM) ## Model Description -The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. +The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and +instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only +models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They +outperform many of the available open source and closed chat models on common industry benchmarks. ## Supported Environments @@ -43,4 +46,4 @@ export VLLM_FORCE_NCCL_COMM=1 python3 offline_inference_vision_language.py --model data/LLamaV3.2 --max-tokens 256 -tp 2 --trust-remote-code --temperature 0.0 --max-model-len 8192 --max-num-seqs 16 ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/mllama/vllm/ci/prepare.sh b/models/multimodal/vision_language_model/llama-3.2/vllm/ci/prepare.sh similarity index 100% rename from models/multimodal/vision_language_model/mllama/vllm/ci/prepare.sh rename to models/multimodal/vision_language_model/llama-3.2/vllm/ci/prepare.sh diff --git a/models/multimodal/vision_language_model/mllama/vllm/offline_inference_vision_language.py b/models/multimodal/vision_language_model/llama-3.2/vllm/offline_inference_vision_language.py similarity index 100% rename from models/multimodal/vision_language_model/mllama/vllm/offline_inference_vision_language.py rename to models/multimodal/vision_language_model/llama-3.2/vllm/offline_inference_vision_language.py diff --git a/models/multimodal/vision_language_model/llava/vllm/README.md b/models/multimodal/vision_language_model/llava/vllm/README.md index f60207dfa600e629964763f0b2ec0c495e6f7b14..599b66f04af0d6d093fd96be05febbad896292fd 100644 --- a/models/multimodal/vision_language_model/llava/vllm/README.md +++ b/models/multimodal/vision_language_model/llava/vllm/README.md @@ -1,8 +1,13 @@ -# LLava +# LLava (vLLM) ## Model Description -LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.The LLaVA-NeXT model was proposed in LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee. 
LLaVa-NeXT (also called LLaVa-1.6) improves upon LLaVa-1.5 by increasing the input image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense reasoning. +LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following +data. It is an auto-regressive language model, based on the transformer architecture. The LLaVA-NeXT model was proposed +in LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan +Zhang, Sheng Shen, Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon LLaVa-1.5 by increasing the input +image resolution and training on an improved visual instruction tuning dataset to improve OCR and common sense +reasoning. ## Supported Environments diff --git a/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md b/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md index d705d5505b0092316f50bc79e768a609e056643c..31b5622fc6e6cd7e62af94f71d20aaf0da78581b 100755 --- a/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md +++ b/models/multimodal/vision_language_model/llava_next_video_7b/vllm/README.md @@ -1,8 +1,11 @@ -# LLaVA-Next-Video-7B +# LLaVA-Next-Video-7B (vLLM) ## Model Description -LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The model is buit on top of LLaVa-NeXT by tuning on a mix of video and image data to achieves better video understanding capabilities. The videos were sampled uniformly to be 32 frames per clip. The model is a current SOTA among open-source models on VideoMME bench. Base LLM: lmsys/vicuna-7b-v1.5 +LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The +model is built on top of LLaVa-NeXT by tuning on a mix of video and image data to achieve better video understanding +capabilities. The videos were sampled uniformly to be 32 frames per clip. The model is a current SOTA among open-source +models on VideoMME bench. Base LLM: lmsys/vicuna-7b-v1.5 ## Supported Environments diff --git a/models/multimodal/vision_language_model/minicpm_v/vllm/README.md b/models/multimodal/vision_language_model/minicpm_v/vllm/README.md index bc60e34a4bd80833373f7f734ca175243fd6898a..a404f6ec2cb73151184612fbfa89bee0d5ce26ca 100644 --- a/models/multimodal/vision_language_model/minicpm_v/vllm/README.md +++ b/models/multimodal/vision_language_model/minicpm_v/vllm/README.md @@ -1,8 +1,10 @@ -# MiniCPM-V-2 +# MiniCPM-V 2 (vLLM) ## Model Description -MiniCPM V2 is a compact and efficient language model designed for various natural language processing (NLP) tasks. Building on its predecessor, MiniCPM-V-1, this model integrates advancements in architecture and optimization techniques, making it suitable for deployment in resource-constrained environments.s +MiniCPM V2 is a compact and efficient language model designed for various natural language processing (NLP) tasks. 
+Building on its predecessor, MiniCPM-V-1, this model integrates advancements in architecture and optimization +techniques, making it suitable for deployment in resource-constrained environments. ## Supported Environments @@ -44,4 +46,4 @@ export VLLM_ASSETS_CACHE=../vllm/ PT_SDPA_ENABLE_HEAD_DIM_PADDING=1 python3 offline_inference_vision_language.py --model data/MiniCPM-V-2 --max-tokens 256 -tp 2 --trust-remote-code --temperature 0.0 ``` -## Model Results \ No newline at end of file +## Model Results diff --git a/models/multimodal/vision_language_model/pixtral/vllm/README.md b/models/multimodal/vision_language_model/pixtral/vllm/README.md index 904b90c82517454b22b9b4bb2f23d593528ac9eb..bb3abd99e2f14eb82f410568c7573c40818cf154 100644 --- a/models/multimodal/vision_language_model/pixtral/vllm/README.md +++ b/models/multimodal/vision_language_model/pixtral/vllm/README.md @@ -1,4 +1,4 @@ -# Pixtral +# Pixtral (vLLM) ## Model Description