diff --git a/README.md b/README.md index 41ef4c7b95d6e1f0b05f873db7c272d3ff2143c7..ea7faddf5f87860512436b1276f09d0db1cd3453 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ ## 1. Project Overview -Moark Skills is a repository of skills (Skill) that provide extension capabilities for [OpenClaw](https://github.com/nicepkg/openclaw). This project addresses the limited abilities of AI agents on specific tasks: by providing standardized, self-contained functional modules, it lets AI agents easily **call external APIs or run local scripts to complete complex tasks**. +Moark Skills is a repository of skills (Skill) that provide extension capabilities for [OpenClaw](https://docs.openclaw.ai/zh-CN). This project addresses the limited abilities of AI agents on specific tasks: by providing standardized, self-contained functional modules, it lets AI agents easily **call external APIs or run local scripts to complete complex tasks**. ## 2. Quick Start @@ -89,12 +89,12 @@ OpenClaw automatically reads `SKILL.md`, matches the corresponding skill, constructs the arguments, and executes ### moark-image-gen (AI image generation) - **Capability**: Generates high-quality images from a text description. -- **Input**: positive prompt (`--prompt`), image size (`--size`), negative prompt (`--negative-prompt`), number of inference steps (`--num-inference-steps`), adherence to the user's prompt (`--guidance-scale`), API Key (`--api-key`). +- **Input**: model to use (`--model`), positive prompt (`--prompt`), image size (`--size`), negative prompt (`--negative-prompt`), number of inference steps (`--num-inference-steps`), adherence to the user's prompt (`--guidance-scale`), API Key (`--api-key`). - **Output**: a link to the generated image. ### moark-ocr (optical character recognition) -- **Capability**: Recognizes and extracts the text content of an image file. -- **Input**: local path of the image file (`--image`), the user's specific extraction requirements (`--prompt`), API Key (`--api-key`). +- **Capability**: Recognizes and extracts the text content or other information in an image file. +- **Input**: local path of the image file (`--image`), the user's specific extraction requirements (`--prompt`), model to use (`--model`), API Key (`--api-key`). - **Output**: the recognized text plus output based on the user's requirements. ### moark-tts (text-to-speech) @@ -130,10 +130,6 @@ OpenClaw automatically reads `SKILL.md`, matches the corresponding skill, constructs the arguments, and executes python skills/moark-doc-extraction/scripts/perform_doc_extraction.py \ --file /path/to/document.pdf \ --api-key "YOUR_GITEEAI_API_KEY" - -# Or pass the key via an environment variable -export GITEEAI_API_KEY="YOUR_GITEEAI_API_KEY" -python skills/moark-doc-extraction/scripts/perform_doc_extraction.py --file /path/to/document.pdf ``` **Example 2: AI Image Generation** diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md index 6e44aca281b1bb608439e91600b7c5a2bc3434ed..e6392227b5125bf2a9db2f4fb5efd878f5fd5d39 100644 --- a/skills/moark-image-gen/SKILL.md +++ b/skills/moark-image-gen/SKILL.md @@ -59,7 +59,7 @@ python
{baseDir}/scripts/perform_image_gen.py --prompt "your image description" **Additional flags:** - `--model` - Specify the model to use. Options include `Qwen-Image` (default), `Kolors`, `GLM-Image`, `FLUX.2-dev`, `HunyuanDiT-v1.2-Diffusers-Distilled`. -- `--negative-prompt` - Specify what elements users want to avoid in the generated image(default: "低分辨率,低画质,肢体畸形,手指畸形,画面过饱和,蜡像感,人脸无细节,过度光滑,画面具有AI感。构图混乱。文字模糊,扭曲。"). +- `--negative-prompt` - Specify what elements users want to avoid in the generated image (default: "Low resolution, low image quality, distorted limbs and fingers, oversaturated image, wax figure appearance, lack of facial detail, excessive smoothing, AI-like appearance. Chaotic composition. Blurry and distorted text."). - `--size` - Specify the size of the generated image. Options include `256x256`, `512x512`, `1024x1024` (default), `1024x576`, `576x1024`, `1024x768`, `768x1024`, `1024x640`, `640x1024`, `2048x2048`. - `--guidance-scale` - Float value to control how closely the model adheres to the prompt (default depends on model). - `--num-inference-steps` - Integer for denoise steps (default depends on model). Higher values typically increase quality but take longer. @@ -79,13 +79,16 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description" 4. Display the image to the user using markdown syntax: `🖼️[Generated Image](URL)`. ## Notes -- You should not only return the image URL but also describe the image based on the user's prompt, and claim the hyperparameters used for generation. -- You should always wait for the script to finish executing, don't shut it down prematurely. +- You should not only return the image URL but also describe the image based on the user's prompt, and ask the user if the generated image meets their requirements. +- You should always wait for the script to finish executing; don't shut it down prematurely before it has printed its output. +- When one model fails to work, you can try other models.
If all the models fail, you can inform the user that the image generation failed and suggest that they modify the prompt or try again later. - The language of your answer should be consistent with the user's question. - By default, return the image URL directly without downloading. -- If GITEEAI_API_KEY is none, the user must provide --api-key argument. +- If GITEEAI_API_KEY is unset, you should stop and ask the user to provide the --api-key argument. - The script prints `IMAGE_URL:` in the output - extract this URL and display it using markdown image syntax: `🖼️[Generated image](URL)`. - Always look for the line starting with `IMAGE_URL:` in the script output and render the image for the user. - You should honestly repeat the user's description of the image without adding any embellishments. - **Handling User Feedback on Quality**: If the user states the image quality is low or lacks details, you should retry generating with a higher `--num-inference-steps` (e.g. 25 → 30). -- **Handling User Feedback on Prompt Adherence**: If the user states the image doesn't follow the prompt closely enough or ignores details, increase the `--guidance-scale` parameter (e.g. 7.5 → 15). If they say it's oversaturated or distorted, decrease it. \ No newline at end of file +- **Handling User Feedback on Prompt Adherence**: If the user states the image doesn't follow the prompt closely enough or ignores details, increase the `--guidance-scale` parameter (e.g. 7.5 → 15). If they say it's oversaturated or distorted, decrease it.
+- When you increase `--num-inference-steps` or `--guidance-scale`, you must stay within the hyperparameter's allowed range for the specific model, as listed under **Model Specific Defaults:**. For example, if the user is using the `Kolors` model and you want to increase `--num-inference-steps` from 25 to 30, you can do that because it's within the range of 20-30. But if you want to increase it to 35, you should not, because it's out of range. In that case, inform the user that you have increased `--num-inference-steps` to the maximum value of 30 for the `Kolors` model. \ No newline at end of file diff --git a/skills/moark-image-gen/scripts/perform_image_gen.py b/skills/moark-image-gen/scripts/perform_image_gen.py index 5ba278e82a81e475b567db8a6efed58ea9bf13ec..34da06b401ed0322fe92f3227df05f5f79cbe289 100644 --- a/skills/moark-image-gen/scripts/perform_image_gen.py +++ b/skills/moark-image-gen/scripts/perform_image_gen.py @@ -50,7 +50,8 @@ def main(): ) parser.add_argument( "--negative-prompt", "-n", - default="低分辨率,低画质,肢体畸形,手指畸形,画面过饱和,蜡像感,人脸无细节,过度光滑,画面具有AI感。构图混乱。文字模糊,扭曲。", + default="Low resolution, low image quality, distorted limbs and fingers, oversaturated image, wax figure appearance, " \ + "lack of facial detail, excessive smoothing, AI-like appearance. Chaotic composition. 
Blurry and distorted text.", help="Negative prompt to avoid unwanted elements" ) parser.add_argument( @@ -95,6 +96,14 @@ def main(): "HunyuanDiT-v1.2-Diffusers-Distilled": {"num_inference_steps": 25, "guidance_scale": 5.0}, } + RANGE_LIMITS = { + "Qwen-Image": {"num_inference_steps": (4, 50)}, + "Kolors": {"num_inference_steps": (20, 30), "guidance_scale": (0, 100)}, + "GLM-Image": {"num_inference_steps": (10, 50), "guidance_scale": (0, 10)}, + "FLUX.2-dev": {"num_inference_steps": (10, 50), "guidance_scale": (0, 100)}, + "HunyuanDiT-v1.2-Diffusers-Distilled": {"num_inference_steps": (25, 50), "guidance_scale": (0, 20)}, + } + allowed_params = SUPPORTED_KWARGS.get(args.model, []) model_defaults = MODEL_DEFAULTS.get(args.model, {}) @@ -105,11 +114,19 @@ def main(): if "guidance_scale" in allowed_params: gs_val = args.guidance_scale if args.guidance_scale is not None else model_defaults.get("guidance_scale") + limits = RANGE_LIMITS.get(args.model, {}).get("guidance_scale") + if gs_val is not None and limits: + lo, hi = limits + gs_val = max(lo, min(hi, gs_val)) if gs_val is not None: extra_body["guidance_scale"] = gs_val - + if "num_inference_steps" in allowed_params: nis_val = args.num_inference_steps if args.num_inference_steps is not None else model_defaults.get("num_inference_steps") + limits = RANGE_LIMITS.get(args.model, {}).get("num_inference_steps") + if nis_val is not None and limits: + lo, hi = limits + nis_val = max(lo, min(hi, nis_val)) if nis_val is not None: extra_body["num_inference_steps"] = nis_val diff --git a/skills/moark-ocr/SKILL.md b/skills/moark-ocr/SKILL.md index ad852c1a50af875bac7a41bd8700f70177e03b6d..6e3b0d9285b6e24d9eac1b93d2ae5c2b3c4a5436 100644 --- a/skills/moark-ocr/SKILL.md +++ b/skills/moark-ocr/SKILL.md @@ -13,29 +13,57 @@ metadata: --- # OCR (Optical Character Recognition) -This skill allows users to extract and recognize text from images using an external GITEE AI API. 
+This skill allows users to extract and recognize text from images using an external GITEE AI API. Supported local image formats: .jpg, .jpeg, .png, .webp. For URLs, ensure they point to one of these formats. Additionally, `GOT-OCR2_0`, `HunyuanOCR`, and `MinerU2.5` also accept .gif. ## Usage -Ensure you have installed the required dependencies (`pip install openai`). Use the bundled script to perform OCR on an image. +Ensure you have installed the required dependencies (`pip install openai requests requests-toolbelt`). Use the bundled script to perform OCR. +**MinerU2.5 (default)** ```bash -python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --prompt "Users requirements" --api-key YOUR_API_KEY +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model MinerU2.5 --api-key YOUR_API_KEY +``` + +**PaddleOCR-VL-1.5** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model PaddleOCR-VL-1.5 --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY +``` + +**GOT-OCR2_0** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model GOT-OCR2_0 --api-key YOUR_API_KEY +``` + +**HunyuanOCR** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.gif --model HunyuanOCR --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY +``` + +**DeepSeek-OCR-2** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepSeek-OCR-2 --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY +``` ## Options -No additional parameters are required for this skill. +- `--image` - Path to an image file or an image URL (supports .jpg, .jpeg, .png, .webp; `.gif` is supported for `GOT-OCR2_0`, `HunyuanOCR`, and `MinerU2.5`). +- `--model` - Available models: `MinerU2.5` (default), `PaddleOCR-VL-1.5`, `DeepSeek-OCR-2`, `GOT-OCR2_0`, `HunyuanOCR`.
+- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0` and `MinerU2.5` don't need a prompt; the other models accept an optional prompt). +- `--api-key` - Gitee AI API key (overrides the env var `GITEEAI_API_KEY`). ## Workflow 1. Execute the perform_ocr.py script with the parameters from the user. 2. Parse the script output and find the line starting with `OCR_RESULT:`. -3. Extract the OCR result from that line (format: `OCR_RESULT: ...`). +3. Extract the OCR result from the line starting with `OCR_RESULT:`. 4. Display the OCR result to the user using markdown syntax: `📷[OCR Result]`. ## Notes -- If GITEEAI_API_KEY is none, you should remind user to provide --api-key argument -- You should not only return the OCR result but also provide a brief summary of the recognized text based on the user's prompt. +- If GITEEAI_API_KEY is unset, you should stop and remind the user to provide the --api-key argument. +- You should always wait for the script to finish executing; don't shut it down prematurely before it has printed its output. +- When the user requests OCR on a GIF image, first check whether the model they specified supports the GIF format. If it doesn't, inform the user that the chosen model doesn't support GIF images and suggest that they either choose a model that supports GIF or convert the GIF to a supported format before performing OCR. +- You should not only return the OCR result honestly but also provide a brief summary of the recognized text. +- When one model fails to work, you can try other models. If all the models fail, you can inform the user that the OCR failed and suggest that they modify the prompt or try again later. - When you add a prompt, you should honestly repeat the user's requirements without adding anything extra. - The script prints `OCR_RESULT:` in the output - extract this result and display it using markdown syntax: `📷[OCR Result]`. -- Always look for the line starting with `OCR_RESULT:` in the script output.
\ No newline at end of file +- Always look for the line starting with `OCR_RESULT:` in the script output. +- Supported local image formats: .jpg, .jpeg, .png, .webp, and .gif (`.gif` only for `GOT-OCR2_0`, `HunyuanOCR`, and `MinerU2.5`). \ No newline at end of file diff --git a/skills/moark-ocr/scripts/perform_ocr.py b/skills/moark-ocr/scripts/perform_ocr.py index 31903cdeb43602641444958b5c48b400c3083bc4..90730cab2ef363e710542071808488a5dc100c55 100644 --- a/skills/moark-ocr/scripts/perform_ocr.py +++ b/skills/moark-ocr/scripts/perform_ocr.py @@ -2,7 +2,9 @@ # /// script # requires-python = ">=3.10" # dependencies = [ -# "openai" +# "openai", +# "requests", +# "requests-toolbelt" # ] # /// @@ -10,15 +12,32 @@ Perform OCR on images using Gitee AI Vision API. Usage: - python perform_ocr.py --image /path/to/image.jpg --prompt "Users requirements" [--api-key KEY] + python perform_ocr.py --image /path/to/image.jpg --model MODEL_NAME --prompt "User's requirements" [--api-key KEY] """ import argparse -import os -import sys import base64 +import contextlib import mimetypes +import os +import sys +import time +from typing import Tuple + +import requests from openai import OpenAI +from requests_toolbelt import MultipartEncoder + +API_BASE = "https://ai.gitee.com/v1" +IMAGE_OCR_URL = f"{API_BASE}/images/ocr" +DOC_PARSE_URL = f"{API_BASE}/async/documents/parse" +TASK_STATUS_URL = f"{API_BASE}/task/{{task_id}}" + +CHAT_MODELS = {"PaddleOCR-VL-1.5", "DeepSeek-OCR-2"} +IMAGE_OCR_MODELS = {"GOT-OCR2_0", "HunyuanOCR", "MinerU2.5"} +CHAT_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp") +IMAGE_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp", ".gif") + def get_api_key(provided_key: str | None) -> str | None: """Get API key from argument, config, or environment.""" @@ -26,6 +45,191 @@ return provided_key return os.environ.get("GITEEAI_API_KEY") + +def build_image_url(filepath: str, allowed_exts: tuple[str, ...]) -> str: + """Return http(s) url directly, otherwise 
convert local image to data URL.""" + if filepath.startswith(("http://", "https://")): + return filepath + + if not os.path.exists(filepath): + raise FileNotFoundError(f"Image file not found at {filepath}") + _, ext = os.path.splitext(filepath) + if ext.lower() not in allowed_exts: + raise ValueError( + f"Unsupported image format: {ext}. Supported: {', '.join(allowed_exts)}" + ) + + mime_type, _ = mimetypes.guess_type(filepath) + mime_type = mime_type or "image/jpeg" + + with open(filepath, "rb") as f: + base64_image = base64.b64encode(f.read()).decode("utf-8") + return f"data:{mime_type};base64,{base64_image}" + + +def load_file_field(filepath: str, stack: contextlib.ExitStack, allowed_exts: tuple[str, ...]) -> Tuple[str, object, str]: + """ + Prepare a multipart file tuple (name, data/handle, mime) for requests-toolbelt. + + Caller is responsible for providing an ExitStack to manage open file handles. + """ + name = os.path.basename(filepath) + if filepath.startswith(("http://", "https://")): + resp = requests.get(filepath, timeout=10) + resp.raise_for_status() + content_type = resp.headers.get("Content-Type", "application/octet-stream") + return name, resp.content, content_type + + mime_type, _ = mimetypes.guess_type(filepath) + mime_type = mime_type or "application/octet-stream" + _, ext = os.path.splitext(filepath) + + if ext.lower() not in allowed_exts: + raise ValueError( + f"Unsupported image format: {ext}. Supported: {', '.join(allowed_exts)}" + ) + file_handle = stack.enter_context(open(filepath, "rb")) + + return name, file_handle, mime_type + + +def run_chat_ocr(model: str, prompt: str, image_path: str, client: OpenAI): + image_url = build_image_url(image_path, CHAT_ALLOWED_EXTS) # Chat models require a URL or data URL + response = client.chat.completions.create( + messages=[ + { + "role": "system", + "content": "You are a helpful and harmless assistant. 
You should think step-by-step.", + }, + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": {"url": image_url}, + }, + { + "type": "text", + "text": prompt, + }, + ], + }, + ], + model=model, + stream=False, + max_tokens=512, + temperature=0.7, + top_p=1, + extra_body={"top_k": 1}, + frequency_penalty=0, + ) + + print("\nOCR_RESULT:") + if response.choices: + result_text = response.choices[0].message.content + if result_text: + print(result_text.strip()) + else: + print("No text recognized.") + else: + print("No response choices returned.") + + +def run_image_ocr_multipart(model: str, prompt: str, image_path: str, api_key: str): + with contextlib.ExitStack() as stack: + name, file_obj, mime_type = load_file_field(image_path, stack, IMAGE_ALLOWED_EXTS) # Validate and open image + base_fields = { + "model": model, + } + if prompt: + base_fields["prompt"] = prompt + fields = list(base_fields.items()) + fields.append(("image", (name, file_obj, mime_type))) + + encoder = MultipartEncoder(fields) + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": encoder.content_type, + } + resp = requests.post(IMAGE_OCR_URL, headers=headers, data=encoder, timeout=30) # Send multipart request + + resp.raise_for_status() + data = resp.json() + + print("\nOCR_RESULT:") + print(data) + + +def poll_task(task_id: str, api_key: str, interval: float = 2.0, timeout: float = 120.0): + """Poll the task status until it completes or fails.""" + headers = {"Authorization": f"Bearer {api_key}"} + start_time = time.time() + while True: + resp = requests.get(TASK_STATUS_URL.format(task_id=task_id), headers=headers, timeout=15) + resp.raise_for_status() + payload = resp.json() + status = payload.get("task_status") or payload.get("status") + + if status in {"success"}: + return payload + elif status in {"failed", "cancelled"}: + raise RuntimeError(f"Task {task_id} failed with status: {status}") + + if time.time() - start_time > timeout: + raise 
TimeoutError(f"Task {task_id} polling timed out") + time.sleep(interval) + + +def format_mineru_output(payload: dict) -> str | dict: + """Extract readable text from MinerU2.5 payload; fall back to raw payload if missing.""" + output = payload.get("output") or {} + segments = output.get("segments") if isinstance(output, dict) else None + if segments and isinstance(segments, list): + contents = [seg.get("content") for seg in segments if isinstance(seg, dict) and seg.get("content")] + if contents: + return "\n\n".join(contents) + return payload + + +def run_mineru_async(image_path: str, api_key: str): + with contextlib.ExitStack() as stack: + name, file_obj, mime_type = load_file_field(image_path, stack, IMAGE_ALLOWED_EXTS) + fields = [ + ("model", "MinerU2.5"), + ("is_ocr", "true"), + ("include_image_base64", "true"), + ("formula_enable", "true"), + ("table_enable", "true"), + ("layout_model", "doclayout_yolo"), + ("output_format", "md"), + ("file", (name, file_obj, mime_type)), + ] + + encoder = MultipartEncoder(fields) + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": encoder.content_type, + } + # Send multipart request to submit OCR task to model + resp = requests.post(DOC_PARSE_URL, headers=headers, data=encoder, timeout=30) + + resp.raise_for_status() + data = resp.json() + if data.get("error"): + raise RuntimeError(f"Error from MinerU2.5 submission: {data['error']}: {data.get('message', 'No message')}") + + task_id = data.get("task_id") + if not task_id: + raise RuntimeError(f"No task_id returned from MinerU2.5: {data}") + + print(f"Submitted MinerU2.5 task: {task_id}, polling for result...") + # Because we use async OCR model, we need to continuously poll for the result after submission + result_payload = poll_task(task_id, api_key) + + print("\nOCR_RESULT:") + print(format_mineru_output(result_payload)) + + def main(): parser = argparse.ArgumentParser( description="Perform OCR on images using Gitee AI Vision API" @@ -35,9 +239,15 @@ def 
main(): required=True, help="Path to the local image file" ) + parser.add_argument( + "--model", "-m", + default="MinerU2.5", + choices=sorted(list(CHAT_MODELS | IMAGE_OCR_MODELS)), + help="Model to use for OCR" + ) parser.add_argument( "--prompt", "-p", - default="请提取这张图片中的所有文字内容。", + default="Please extract and recognize text from the image", help="Prompt/Instructions for the OCR task" ) parser.add_argument( @@ -46,7 +256,6 @@ def main(): ) args = parser.parse_args() - # Get API key api_key = get_api_key(args.api_key) @@ -57,76 +266,22 @@ def main(): print(" 2. Set GITEEAI_API_KEY environment variable", file=sys.stderr) sys.exit(1) - # Initialize OpenAI client - client = OpenAI( - base_url="https://ai.gitee.com/v1", - api_key=api_key, - ) - - print(f"Processing image for OCR...") - print(f"Image: {args.image}") + print(f"Processing for OCR...") + print(f"Model: {args.model}") + print(f"Input: {args.image}") print(f"Prompt: {args.prompt}") try: - filepath = args.image - - # Prepare image_url (support both local file paths and URLs) - if filepath.startswith(("http://", "https://")): - image_url = filepath - else: - if not os.path.exists(filepath): - print(f"Error: Image file not found at {filepath}", file=sys.stderr) - sys.exit(1) - mime_type, _ = mimetypes.guess_type(filepath) - mime_type = mime_type or "image/jpeg" - with open(filepath, "rb") as f: - base64_image = base64.b64encode(f.read()).decode('utf-8') - image_url = f"data:{mime_type};base64,{base64_image}" - - response = client.chat.completions.create( - messages=[ - { - "role": "system", - "content": "You are a helpful and harmless assistant. You should think step-by-step." 
- }, - { - "role": "user", - "content": [ - { - "type": "image_url", - "image_url": { - "url": image_url - } - }, - { - "type": "text", - "text": args.prompt - } - ] - } - ], - model="PaddleOCR-VL-1.5", - stream=False, - max_tokens=512, - temperature=0.7, - top_p=1, - extra_body={ - "top_k": 1, - }, - frequency_penalty=0, - ) - - print("\nOCR_RESULT:") - - # Extract and print the final content directly - if response.choices and len(response.choices) > 0: - result_text = response.choices[0].message.content - if result_text: - print(result_text.strip()) + if args.model in CHAT_MODELS: + client = OpenAI(base_url=API_BASE, api_key=api_key) # Chat-style OCR flow + run_chat_ocr(args.model, args.prompt, args.image, client) + elif args.model in IMAGE_OCR_MODELS: + if args.model == "MinerU2.5": + run_mineru_async(args.image, api_key) else: - print("No text recognized.") + run_image_ocr_multipart(args.model, args.prompt, args.image, api_key) # Multipart image OCR flow else: - print("No response choices returned.") + raise ValueError(f"Unsupported model: {args.model}") except Exception as e: print(f"\nError performing OCR: {e}", file=sys.stderr)
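Reviewer note: the hyperparameter clamp this diff adds to `perform_image_gen.py` can be checked in isolation. The sketch below reproduces the `RANGE_LIMITS` table and the `max(lo, min(hi, value))` clamp from the diff; the helper name `clamp_param` is mine, not part of the patch:

```python
# Minimal reproduction of the per-model hyperparameter clamp added in
# perform_image_gen.py: a value outside a model's supported range is
# clamped to the nearest bound instead of being sent to the API as-is.
RANGE_LIMITS = {
    "Qwen-Image": {"num_inference_steps": (4, 50)},
    "Kolors": {"num_inference_steps": (20, 30), "guidance_scale": (0, 100)},
    "GLM-Image": {"num_inference_steps": (10, 50), "guidance_scale": (0, 10)},
    "FLUX.2-dev": {"num_inference_steps": (10, 50), "guidance_scale": (0, 100)},
    "HunyuanDiT-v1.2-Diffusers-Distilled": {"num_inference_steps": (25, 50), "guidance_scale": (0, 20)},
}


def clamp_param(model: str, param: str, value: float | None) -> float | None:
    """Clamp value into the model's allowed range; pass through if no limit is defined."""
    if value is None:
        return None
    limits = RANGE_LIMITS.get(model, {}).get(param)
    if not limits:
        return value
    lo, hi = limits
    return max(lo, min(hi, value))


print(clamp_param("Kolors", "num_inference_steps", 35))   # → 30 (capped at Kolors' max)
print(clamp_param("Kolors", "num_inference_steps", 25))   # → 25 (already in range)
print(clamp_param("Qwen-Image", "guidance_scale", 7.5))   # → 7.5 (no limit defined)
```

This matches the SKILL.md guidance: a request to raise `Kolors` steps to 35 is capped at 30 by the script, so the agent should tell the user the maximum value was used.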
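Similarly, the MinerU2.5 path in `perform_ocr.py` submits an async document-parse task and polls `/task/{task_id}` until it succeeds, fails, or times out. Here is a network-free sketch of that loop; the `poll` helper and the fake `fetch_status` callables are illustrative, not part of the patch:

```python
import time


def poll(fetch_status, interval: float = 0.01, timeout: float = 1.0) -> dict:
    """Poll fetch_status() until completion, mirroring poll_task() in the diff:
    return the payload on success, raise on failure/cancellation or timeout."""
    start = time.time()
    while True:
        payload = fetch_status()
        # The diff checks both keys, since the API may report either one.
        status = payload.get("task_status") or payload.get("status")
        if status == "success":
            return payload
        if status in {"failed", "cancelled"}:
            raise RuntimeError(f"task failed with status: {status}")
        if time.time() - start > timeout:
            raise TimeoutError("task polling timed out")
        time.sleep(interval)


# Simulate a task that completes on the third status check.
responses = iter([
    {"status": "in_progress"},
    {"status": "in_progress"},
    {"status": "success", "output": "done"},
])
print(poll(lambda: next(responses))["output"])  # → done
```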