From a95919652edd0f83c5f6f1fd3f6139dde7b3ede9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Thu, 5 Mar 2026 14:06:54 +0800 Subject: [PATCH 1/9] =?UTF-8?q?=E4=BC=98=E5=8C=96=E7=94=9F=E5=9B=BESKILL?= =?UTF-8?q?=E7=9A=84=E6=B3=A8=E6=84=8F=E4=BA=8B=E9=A1=B9?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-image-gen/SKILL.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md index 6e44aca..9bb662c 100644 --- a/skills/moark-image-gen/SKILL.md +++ b/skills/moark-image-gen/SKILL.md @@ -80,10 +80,11 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description" ## Notes - You should not only return the image URL but also describe the image based on the user's prompt, and state the hyperparameters used for generation. -- You should always wait for the script to finish executing, don't shut it down prematurely. +- You should always wait for the script to finish executing; don't shut it down prematurely before it outputs its messages. +- When one model fails to generate images, you can try other models. If all the models fail, you can inform the user that the image generation failed and suggest that they modify the prompt or try again later. - The language of your answer should be consistent with the user's question. - By default, return image URL directly without downloading. -- If GITEEAI_API_KEY is none, the user must provide --api-key argument. +- If GITEEAI_API_KEY is not set, you should stop and ask the user to provide the --api-key argument. - The script prints `IMAGE_URL:` in the output - extract this URL and display it using markdown image syntax: `🖼️[Generated image](URL)`. - Always look for the line starting with `IMAGE_URL:` in the script output and render the image for the user. 
- You should honestly repeat the description of the image from the user without any additional imagination. -- Gitee From cb172fa7cae152b476700ff3a3742cea73776e63 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Thu, 5 Mar 2026 16:45:14 +0800 Subject: [PATCH 2/9] =?UTF-8?q?=E4=B8=BAOCR=20skill=E6=8F=90=E4=BE=9B?= =?UTF-8?q?=E6=9B=B4=E5=A4=9A=E5=8F=AF=E9=80=89=E6=8B=A9=E7=9A=84=E6=A8=A1?= =?UTF-8?q?=E5=9E=8B?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-image-gen/SKILL.md | 2 +- skills/moark-ocr/SKILL.md | 39 ++++- skills/moark-ocr/scripts/perform_ocr.py | 218 ++++++++++++++++-------- 3 files changed, 179 insertions(+), 80 deletions(-) diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md index 9bb662c..b104944 100644 --- a/skills/moark-image-gen/SKILL.md +++ b/skills/moark-image-gen/SKILL.md @@ -81,7 +81,7 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description" ## Notes - You should not only return the image URL but also describe the image based on the user's prompt, and state the hyperparameters used for generation. - You should always wait for the script to finish executing; don't shut it down prematurely before it outputs its messages. -- When one model fails to generate images, you can try other models. If all the models fail, you can inform the user that the image generation failed and suggest that they modify the prompt or try again later. +- When one model fails to work, you can try other models. If all the models fail, you can inform the user that the image generation failed and suggest that they modify the prompt or try again later. - The language of your answer should be consistent with the user's question. - By default, return image URL directly without downloading. - If GITEEAI_API_KEY is not set, you should stop and ask the user to provide the --api-key argument. 
diff --git a/skills/moark-ocr/SKILL.md b/skills/moark-ocr/SKILL.md index ad852c1..ecd66d1 100644 --- a/skills/moark-ocr/SKILL.md +++ b/skills/moark-ocr/SKILL.md @@ -1,6 +1,6 @@ --- name: moark-ocr -description: Perform Optical Character Recognition (OCR) to extract and recognize text from images. +description: Perform Optical Character Recognition (OCR) to extract and recognize text and other information from images. metadata: { "openclaw": @@ -13,29 +13,50 @@ metadata: --- # OCR (Optical Character Recognition) -This skill allows users to extract and recognize text from images using an external GITEE AI API. +This skill allows users to extract and recognize text and other information from images using an external GITEE AI API. Supported local image formats: .jpg, .jpeg, .png, .webp. For URLs, ensure they point to one of these formats. Additionally, `GOT-OCR2_0` and `HunyuanOCR` also accept .gif. ## Usage -Ensure you have installed the required dependencies (`pip install openai`). Use the bundled script to perform OCR on an image. +Ensure you have installed the required dependencies (`pip install openai requests requests-toolbelt`). Use the bundled script to perform OCR. 
+**PaddleOCR-VL-1.5 (default)** ```bash -python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --prompt "Users requirements" --api-key YOUR_API_KEY +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model PaddleOCR-VL-1.5 --prompt "提取图片文字" --api-key YOUR_API_KEY +``` + +**GOT-OCR2_0** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model GOT-OCR2_0 --api-key YOUR_API_KEY +``` + +**HunyuanOCR** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.gif --model HunyuanOCR --prompt "识别图片文字并给出坐标" --api-key YOUR_API_KEY +``` + +**DeepSeek-OCR-2** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepSeek-OCR-2 --prompt "请逐步提取并回答" --api-key YOUR_API_KEY ``` ## Options -No additional parameters are required for this skill. +- `--image` - Path to image file or image URL (supports .jpg, .jpeg, .png, .webp; `.gif` is supported for `GOT-OCR2_0` and `HunyuanOCR`). +- `--model` - Available models: `PaddleOCR-VL-1.5` (default), `DeepSeek-OCR-2`, `GOT-OCR2_0`, `HunyuanOCR`. +- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0` doesn't need a prompt, so you can omit this parameter when using the `GOT-OCR2_0` model). +- `--api-key` - Gitee AI API key (overrides env `GITEEAI_API_KEY`). ## Workflow 1. Execute the perform_ocr.py script with the parameters from the user. 2. Parse the script output and find the line starting with `OCR_RESULT:`. -3. Extract the OCR result from that line (format: `OCR_RESULT: ...`). +3. Extract the OCR result from the line starting with `OCR_RESULT:`. 4. Display the OCR result to the user using markdown syntax: `📷[OCR Result]`. ## Notes -- If GITEEAI_API_KEY is none, you should remind user to provide --api-key argument -- You should not only return the OCR result but also provide a brief summary of the recognized text based on the user's prompt. 
+- If GITEEAI_API_KEY is not set, you should stop and remind the user to provide the --api-key argument +- You should not only return the OCR result honestly but also provide a brief summary of the recognized text. +- When one model fails to work, you can try other models. If all the models fail, you can inform the user that the OCR failed and suggest that they modify the prompt or try again later. - When you add a prompt, you should honestly repeat the requirements from the user without any additional imagination. - The script prints `OCR_RESULT:` in the output - extract this result and display it using markdown image syntax: `📷[OCR Result]`. -- Always look for the line starting with `OCR_RESULT:` in the script output. \ No newline at end of file +- Always look for the line starting with `OCR_RESULT:` in the script output. +- Supported local image formats: .jpg, .jpeg, .png, .webp and .gif. \ No newline at end of file diff --git a/skills/moark-ocr/scripts/perform_ocr.py b/skills/moark-ocr/scripts/perform_ocr.py index 31903cd..fc51dd4 100644 --- a/skills/moark-ocr/scripts/perform_ocr.py +++ b/skills/moark-ocr/scripts/perform_ocr.py @@ -2,7 +2,9 @@ # /// script # requires-python = ">=3.10" # dependencies = [ -# "openai" +# "openai", +# "requests", +# "requests-toolbelt" # ] # /// @@ -10,15 +12,29 @@ Perform OCR on images using Gitee AI Vision API. 
Usage: - python perform_ocr.py --image /path/to/image.jpg --prompt "Users requirements" [--api-key KEY] + python perform_ocr.py --image /path/to/image.jpg --model "Default Model or Specific Model" --prompt "Users requirements" [--api-key KEY] """ import argparse -import os -import sys import base64 +import contextlib import mimetypes +import os +import sys +from typing import Tuple + +import requests from openai import OpenAI +from requests_toolbelt import MultipartEncoder + +API_BASE = "https://ai.gitee.com/v1" +IMAGE_OCR_URL = f"{API_BASE}/images/ocr" + +CHAT_MODELS = {"PaddleOCR-VL-1.5", "DeepSeek-OCR-2"} +IMAGE_OCR_MODELS = {"GOT-OCR2_0", "HunyuanOCR"} +CHAT_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp") +IMAGE_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp", ".gif") + def get_api_key(provided_key: str | None) -> str | None: """Get API key from argument, config, or environment.""" @@ -26,6 +42,119 @@ def get_api_key(provided_key: str | None) -> str | None: return provided_key return os.environ.get("GITEEAI_API_KEY") + +def build_image_url(filepath: str, allowed_exts: tuple[str, ...]) -> str: + """Return http(s) url directly, otherwise convert local image to data URL.""" + if filepath.startswith(("http://", "https://")): + return filepath + + if not os.path.exists(filepath): + raise FileNotFoundError(f"Image file not found at {filepath}") + _, ext = os.path.splitext(filepath) + if ext.lower() not in allowed_exts: + raise ValueError( + f"Unsupported image format: {ext}. Supported: {', '.join(allowed_exts)}" + ) + + mime_type, _ = mimetypes.guess_type(filepath) + mime_type = mime_type or "image/jpeg" + + with open(filepath, "rb") as f: + base64_image = base64.b64encode(f.read()).decode("utf-8") + return f"data:{mime_type};base64,{base64_image}" + + +def load_file_field(filepath: str, stack: contextlib.ExitStack, allowed_exts: tuple[str, ...]) -> Tuple[str, object, str]: + """ + Prepare a multipart file tuple (name, data/handle, mime) for requests-toolbelt. 
+ + Caller is responsible for providing an ExitStack to manage open file handles. + """ + name = os.path.basename(filepath) + if filepath.startswith(("http://", "https://")): + resp = requests.get(filepath, timeout=10) + resp.raise_for_status() + content_type = resp.headers.get("Content-Type", "application/octet-stream") + return name, resp.content, content_type + + mime_type, _ = mimetypes.guess_type(filepath) + mime_type = mime_type or "application/octet-stream" + _, ext = os.path.splitext(filepath) + + if ext.lower() not in allowed_exts: + raise ValueError( + f"Unsupported image format: {ext}. Supported: {', '.join(allowed_exts)}" + ) + file_handle = stack.enter_context(open(filepath, "rb")) + + return name, file_handle, mime_type + + +def run_chat_ocr(model: str, prompt: str, image_path: str, client: OpenAI): + image_url = build_image_url(image_path, CHAT_ALLOWED_EXTS) # Chat models require a URL or data URL + response = client.chat.completions.create( + messages=[ + { + "role": "system", + "content": "You are a helpful and harmless assistant. 
You should think step-by-step.", + }, + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": {"url": image_url}, + }, + { + "type": "text", + "text": prompt, + }, + ], + }, + ], + model=model, + stream=False, + max_tokens=512, + temperature=0.7, + top_p=1, + extra_body={"top_k": 1}, + frequency_penalty=0, + ) + + print("\nOCR_RESULT:") + if response.choices: + result_text = response.choices[0].message.content + if result_text: + print(result_text.strip()) + else: + print("No text recognized.") + else: + print("No response choices returned.") + + +def run_image_ocr_multipart(model: str, prompt: str, image_path: str, api_key: str): + with contextlib.ExitStack() as stack: + name, file_obj, mime_type = load_file_field(image_path, stack, IMAGE_ALLOWED_EXTS) # Validate and open image + fields = [ + ("model", model), + ("prompt", prompt) if prompt else ("response_format", "format"), # Prompt optional for API + ("image", (name, file_obj, mime_type)), + ] + + encoder = MultipartEncoder(fields) + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": encoder.content_type, + } + resp = requests.post(IMAGE_OCR_URL, headers=headers, data=encoder, timeout=30) # Send multipart request + + resp.raise_for_status() + data = resp.json() + + print("\nOCR_RESULT:") + print(data) + + def main(): parser = argparse.ArgumentParser( description="Perform OCR on images using Gitee AI Vision API" @@ -35,6 +164,12 @@ def main(): required=True, help="Path to the local image file" ) + parser.add_argument( + "--model", "-m", + default="PaddleOCR-VL-1.5", + choices=sorted(list(CHAT_MODELS | IMAGE_OCR_MODELS)), + help="Model to use for OCR" + ) parser.add_argument( "--prompt", "-p", default="请提取这张图片中的所有文字内容。", @@ -57,76 +192,19 @@ def main(): print(" 2. 
Set GITEEAI_API_KEY environment variable", file=sys.stderr) sys.exit(1) - # Initialize OpenAI client - client = OpenAI( - base_url="https://ai.gitee.com/v1", - api_key=api_key, - ) - - print(f"Processing image for OCR...") - print(f"Image: {args.image}") + print(f"Processing for OCR...") + print(f"Model: {args.model}") + print(f"Input: {args.image}") print(f"Prompt: {args.prompt}") try: - filepath = args.image - - # Prepare image_url (support both local file paths and URLs) - if filepath.startswith(("http://", "https://")): - image_url = filepath - else: - if not os.path.exists(filepath): - print(f"Error: Image file not found at {filepath}", file=sys.stderr) - sys.exit(1) - mime_type, _ = mimetypes.guess_type(filepath) - mime_type = mime_type or "image/jpeg" - with open(filepath, "rb") as f: - base64_image = base64.b64encode(f.read()).decode('utf-8') - image_url = f"data:{mime_type};base64,{base64_image}" - - response = client.chat.completions.create( - messages=[ - { - "role": "system", - "content": "You are a helpful and harmless assistant. You should think step-by-step." 
- }, - { - "role": "user", - "content": [ - { - "type": "image_url", - "image_url": { - "url": image_url - } - }, - { - "type": "text", - "text": args.prompt - } - ] - } - ], - model="PaddleOCR-VL-1.5", - stream=False, - max_tokens=512, - temperature=0.7, - top_p=1, - extra_body={ - "top_k": 1, - }, - frequency_penalty=0, - ) - - print("\nOCR_RESULT:") - - # Extract and print the final content directly - if response.choices and len(response.choices) > 0: - result_text = response.choices[0].message.content - if result_text: - print(result_text.strip()) - else: - print("No text recognized.") + if args.model in CHAT_MODELS: + client = OpenAI(base_url=API_BASE, api_key=api_key) # Chat-style OCR flow + run_chat_ocr(args.model, args.prompt, args.image, client) + elif args.model in IMAGE_OCR_MODELS: + run_image_ocr_multipart(args.model, args.prompt, args.image, api_key) # Multipart image OCR flow else: - print("No response choices returned.") + raise ValueError(f"Unsupported model: {args.model}") except Exception as e: print(f"\nError performing OCR: {e}", file=sys.stderr) -- Gitee From dcecf2d75c42a9c38aa97dda230112601cd0a3e5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Thu, 5 Mar 2026 16:53:09 +0800 Subject: [PATCH 3/9] =?UTF-8?q?=E7=BB=9F=E4=B8=80=E9=BB=98=E8=AE=A4?= =?UTF-8?q?=E6=8F=90=E7=A4=BA=E8=AF=8D=E7=9A=84=E8=AF=AD=E8=A8=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-image-gen/SKILL.md | 2 +- skills/moark-image-gen/scripts/perform_image_gen.py | 3 ++- skills/moark-ocr/SKILL.md | 6 +++--- skills/moark-ocr/scripts/perform_ocr.py | 2 +- 4 files changed, 7 insertions(+), 6 deletions(-) diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md index b104944..9cd67a7 100644 --- a/skills/moark-image-gen/SKILL.md +++ b/skills/moark-image-gen/SKILL.md @@ -59,7 +59,7 @@ python 
{baseDir}/scripts/perform_image_gen.py --prompt "your image description" **Additional flags:** - `--model` - Specify the model to use. Options include `Qwen-Image` (default), `Kolors`, `GLM-Image`, `FLUX.2-dev`, `HunyuanDiT-v1.2-Diffusers-Distilled`. -- `--negative-prompt` - Specify what elements users want to avoid in the generated image(default: "低分辨率,低画质,肢体畸形,手指畸形,画面过饱和,蜡像感,人脸无细节,过度光滑,画面具有AI感。构图混乱。文字模糊,扭曲。"). +- `--negative-prompt` - Specify what elements users want to avoid in the generated image(default: "Low resolution, low image quality, distorted limbs and fingers, oversaturated image, wax figure appearance, lack of facial detail, excessive smoothing, AI-like appearance. Chaotic composition. Blurry and distorted text."). - `--size` - Specify the size of the generated image. Options include `256x256`, `512x512`, `1024x1024` (default), `1024x576`, `576x1024`, `1024x768`, `768x1024`, `1024x640`, `640x1024`, `2048x2048`. - `--guidance-scale` - Float value to control how closely the model adheres to the prompt (default depends on model). - `--num-inference-steps` - Integer for denoise steps (default depends on model). Higher values typically increase quality but take longer. diff --git a/skills/moark-image-gen/scripts/perform_image_gen.py b/skills/moark-image-gen/scripts/perform_image_gen.py index 5ba278e..6f7031a 100644 --- a/skills/moark-image-gen/scripts/perform_image_gen.py +++ b/skills/moark-image-gen/scripts/perform_image_gen.py @@ -50,7 +50,8 @@ def main(): ) parser.add_argument( "--negative-prompt", "-n", - default="低分辨率,低画质,肢体畸形,手指畸形,画面过饱和,蜡像感,人脸无细节,过度光滑,画面具有AI感。构图混乱。文字模糊,扭曲。", + default="Low resolution, low image quality, distorted limbs and fingers, oversaturated image, wax figure appearance, " \ + "lack of facial detail, excessive smoothing, AI-like appearance. Chaotic composition. 
Blurry and distorted text.", help="Negative prompt to avoid unwanted elements" ) parser.add_argument( diff --git a/skills/moark-ocr/SKILL.md b/skills/moark-ocr/SKILL.md index ecd66d1..7d390b8 100644 --- a/skills/moark-ocr/SKILL.md +++ b/skills/moark-ocr/SKILL.md @@ -21,7 +21,7 @@ Ensure you have installed the required dependencies (`pip install openai request **PaddleOCR-VL-1.5(default)** ```bash -python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model PaddleOCR-VL-1.5 --prompt "提取图片文字" --api-key YOUR_API_KEY +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model PaddleOCR-VL-1.5 --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY ``` **GOT-OCR2_0** @@ -31,12 +31,12 @@ python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model GOT-O **HunyuanOCR** ```bash -python {baseDir}/scripts/perform_ocr.py --image /path/to/image.gif --model HunyuanOCR --prompt "识别图片文字并给出坐标" --api-key YOUR_API_KEY +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.gif --model HunyuanOCR --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY ``` **DeepSeek-OCR-2** ```bash -python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepSeek-OCR-2 --prompt "请逐步提取并回答" --api-key YOUR_API_KEY +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepSeek-OCR-2 --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY ``` ## Options diff --git a/skills/moark-ocr/scripts/perform_ocr.py b/skills/moark-ocr/scripts/perform_ocr.py index fc51dd4..6f26fe1 100644 --- a/skills/moark-ocr/scripts/perform_ocr.py +++ b/skills/moark-ocr/scripts/perform_ocr.py @@ -172,7 +172,7 @@ def main(): ) parser.add_argument( "--prompt", "-p", - default="请提取这张图片中的所有文字内容。", + default="Please extract and recognize text from the image", help="Prompt/Instructions for the OCR task" ) parser.add_argument( -- Gitee From 
af955ce1d45a8910306e713b3f0319811609ad78 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Thu, 5 Mar 2026 16:57:47 +0800 Subject: [PATCH 4/9] =?UTF-8?q?=E6=9B=B4=E6=96=B0README=E4=B8=ADocr?= =?UTF-8?q?=E7=9A=84=E6=8A=80=E8=83=BD=E6=8F=8F=E8=BF=B0?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- README.md | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 41ef4c7..ea7fadd 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ ## 1. 项目简介 -Moark Skills 是一个为 [OpenClaw](https://github.com/nicepkg/openclaw)提供扩展能力的技能(Skill)集合仓库。本项目旨在解决 AI 代理在处理特定任务时能力受限的问题,通过提供标准化的独立功能模块,使 AI 代理能够轻松**调用外部 API 或执行本地脚本,从而完成复杂任务**。 +Moark Skills 是一个为 [OpenClaw](https://docs.openclaw.ai/zh-CN) 提供扩展能力的技能(Skill)集合仓库。本项目旨在解决 AI 代理在处理特定任务时能力受限的问题,通过提供标准化的独立功能模块,使 AI 代理能够轻松**调用外部 API 或执行本地脚本,从而完成复杂任务**。 ## 2. 快速开始 @@ -89,12 +89,12 @@ OpenClaw 会自动读取 `SKILL.md`,匹配对应的技能,构造参数并执 ### moark-image-gen (AI 图像生成) - **能力**:根据文本描述生成高质量图像。 -- **输入**:正向提示词 (`--prompt`)、图像尺寸 (`--size`)、负面提示词 (`--negative-prompt`),生成时的迭代次数 (`--num-inference-steps`),对用户提示词的遵从程度 (`--guidance-scale`),API Key (`--api-key`)。 +- **输入**:所使用的模型(`--model`)、正向提示词 (`--prompt`)、图像尺寸 (`--size`)、负面提示词 (`--negative-prompt`),生成时的迭代次数 (`--num-inference-steps`),对用户提示词的遵从程度 (`--guidance-scale`),API Key (`--api-key`)。 - **输出**:生成的图像链接。 ### moark-ocr (光学字符识别) -- **能力**:识别并提取图像文件中的文字内容。 -- **输入**:图像文件的本地路径 (`--image`)、用户的具体提取要求 (`--prompt`),API Key (`--api-key`)。 +- **能力**:识别并提取图像文件中的文字内容或其他信息。 +- **输入**:图像文件的本地路径 (`--image`)、用户的具体提取要求 (`--prompt`)、所使用的模型(`--model`)、API Key (`--api-key`)。 - **输出**:识别出的文本内容及基于用户要求的输出。 ### moark-tts (文本转语音) @@ -130,10 +130,6 @@ OpenClaw 会自动读取 `SKILL.md`,匹配对应的技能,构造参数并执 python skills/moark-doc-extraction/scripts/perform_doc_extraction.py \ --file /path/to/document.pdf \ --api-key "你的_GITEEAI_API_KEY" - -# 或者通过环境变量传入 -export 
GITEEAI_API_KEY="你的_GITEEAI_API_KEY" -python skills/moark-doc-extraction/scripts/perform_doc_extraction.py --file /path/to/document.pdf ``` **示例 2:AI 图像生成** -- Gitee From 3ce6025a23d692de7596074c2b54f7535b2d0dc8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Thu, 5 Mar 2026 19:05:56 +0800 Subject: [PATCH 5/9] =?UTF-8?q?=E9=99=90=E5=88=B6=E7=94=9F=E5=9B=BEskill?= =?UTF-8?q?=E6=A8=A1=E5=9E=8B=E7=9A=84=E8=B6=85=E5=8F=82=E6=95=B0=E8=8C=83?= =?UTF-8?q?=E5=9B=B4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-image-gen/SKILL.md | 3 ++- .../scripts/perform_image_gen.py | 18 +++++++++++++++++- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md index 9cd67a7..451642e 100644 --- a/skills/moark-image-gen/SKILL.md +++ b/skills/moark-image-gen/SKILL.md @@ -89,4 +89,5 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description" - Always look for the line starting with `IMAGE_URL:` in the script output and render the image for the user. - You should honestly repeat the description of the image from the user without any additional imagination. - **Handling User Feedback on Quality**: If the user states the image quality is low or lacks details, you should retry generating with a higher `--num-inference-steps` (e.g. 25 → 30). -- **Handling User Feedback on Prompt Adherence**: If the user states the image doesn't follow the prompt closely enough or ignores details, increase the `--guidance-scale` parameter (e.g. 7.5 → 15). 
If they say it's oversaturated or distorted, decrease it. +- When you increase `--num-inference-steps` or `--guidance-scale`, you must stay within the hyperparameter range for the specific model given in **Model Specific Defaults**. For example, if the user is using the `Kolors` model and you want to increase `--num-inference-steps` from 25 to 30, you can do that because it's within the range of 20-30. But if you want to increase it to 35, you should not do that because it's out of range. In that case, you can inform the user that you have increased `--num-inference-steps` to the maximum value of 30 for the `Kolors` model. \ No newline at end of file diff --git a/skills/moark-image-gen/scripts/perform_image_gen.py b/skills/moark-image-gen/scripts/perform_image_gen.py index 6f7031a..34da06b 100644 --- a/skills/moark-image-gen/scripts/perform_image_gen.py +++ b/skills/moark-image-gen/scripts/perform_image_gen.py @@ -96,6 +96,14 @@ def main(): "HunyuanDiT-v1.2-Diffusers-Distilled": {"num_inference_steps": 25, "guidance_scale": 5.0}, } + RANGE_LIMITS = { + "Qwen-Image": {"num_inference_steps": (4, 50)}, + "Kolors": {"num_inference_steps": (20, 30), "guidance_scale": (0, 100)}, + "GLM-Image": {"num_inference_steps": (10, 50), "guidance_scale": (0, 10)}, + "FLUX.2-dev": {"num_inference_steps": (10, 50), "guidance_scale": (0, 100)}, + "HunyuanDiT-v1.2-Diffusers-Distilled": {"num_inference_steps": (25, 50), "guidance_scale": (0, 20)}, + } + allowed_params = SUPPORTED_KWARGS.get(args.model, []) model_defaults = MODEL_DEFAULTS.get(args.model, {}) @@ -106,11 +114,19 @@ def main(): if "guidance_scale" in allowed_params: gs_val = args.guidance_scale if args.guidance_scale is not None else model_defaults.get("guidance_scale") + limits = RANGE_LIMITS.get(args.model, {}).get("guidance_scale") + if gs_val is not None and limits: + lo, hi = limits + gs_val = max(lo, min(hi, gs_val)) if gs_val is not None: extra_body["guidance_scale"] = gs_val - + if "num_inference_steps" in allowed_params: 
nis_val = args.num_inference_steps if args.num_inference_steps is not None else model_defaults.get("num_inference_steps") + limits = RANGE_LIMITS.get(args.model, {}).get("num_inference_steps") + if nis_val is not None and limits: + lo, hi = limits + nis_val = max(lo, min(hi, nis_val)) if nis_val is not None: extra_body["num_inference_steps"] = nis_val -- Gitee From 5d57d22622710ec940be5d92d5d33f0c470ecc7c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Fri, 6 Mar 2026 15:37:58 +0800 Subject: [PATCH 6/9] =?UTF-8?q?=E4=B8=BAocr=20skill=E6=B7=BB=E5=8A=A0?= =?UTF-8?q?=E5=BC=82=E6=AD=A5=E6=A8=A1=E5=9E=8BMinerU2.5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-ocr/SKILL.md | 18 +++-- skills/moark-ocr/scripts/perform_ocr.py | 93 ++++++++++++++++++++++--- 2 files changed, 97 insertions(+), 14 deletions(-) diff --git a/skills/moark-ocr/SKILL.md b/skills/moark-ocr/SKILL.md index 7d390b8..98f5afc 100644 --- a/skills/moark-ocr/SKILL.md +++ b/skills/moark-ocr/SKILL.md @@ -1,6 +1,6 @@ --- name: moark-ocr -description: Perform Optical Character Recognition (OCR) to extract and recognize text and other information from images. +description: Perform Optical Character Recognition (OCR) to extract and recognize text from images. metadata: { "openclaw": @@ -13,13 +13,18 @@ metadata: --- # OCR (Optical Character Recognition) -This skill allows users to extract and recognize text and other information from images using an external GITEE AI API. Supported local image formats: .jpg, .jpeg, .png, .webp. For URLs, ensure they point to one of these formats. Additionally, `GOT-OCR2_0` and `HunyuanOCR` also accept .gif. +This skill allows users to extract and recognize text from images using an external GITEE AI API. Supported local image formats: .jpg, .jpeg, .png, .webp. For URLs, ensure they point to one of these formats. 
Additionally, `GOT-OCR2_0` and `HunyuanOCR` also accept .gif. ## Usage Ensure you have installed the required dependencies (`pip install openai requests requests-toolbelt`). Use the bundled script to perform OCR. -**PaddleOCR-VL-1.5 (default)** +**MinerU2.5 (default)** +```bash +python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model MinerU2.5 --api-key YOUR_API_KEY +``` + +**PaddleOCR-VL-1.5** ```bash python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model PaddleOCR-VL-1.5 --prompt "Please extract and recognize text from the image" --api-key YOUR_API_KEY ``` @@ -40,9 +45,9 @@ python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepS ``` ## Options -- `--image` - Path to image file or image URL (supports .jpg, .jpeg, .png, .webp; `.gif` is supported for `GOT-OCR2_0` and `HunyuanOCR`). -- `--model` - Available models: `PaddleOCR-VL-1.5` (default), `DeepSeek-OCR-2`, `GOT-OCR2_0`, `HunyuanOCR`. -- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0` doesn't need a prompt, so you can omit this parameter when using the `GOT-OCR2_0` model). +- `--image` - Path to image file or image URL (supports .jpg, .jpeg, .png, .webp; `.gif` is supported for `GOT-OCR2_0`, `HunyuanOCR` and `MinerU2.5`). +- `--model` - Available models: `MinerU2.5` (default), `PaddleOCR-VL-1.5`, `DeepSeek-OCR-2`, `GOT-OCR2_0`, `HunyuanOCR`. +- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0` can omit prompt; others accept an optional prompt). - `--api-key` - Gitee AI API key (overrides env `GITEEAI_API_KEY`). ## Workflow @@ -54,6 +59,7 @@ python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepS ``` ## Notes - If GITEEAI_API_KEY is not set, you should stop and remind the user to provide the --api-key argument +- You should always wait for the script to finish executing; don't shut it down prematurely before it outputs its messages. 
- You should not only return the OCR result honestly but also provide a brief summary of the recognized text. - When one model fails to work, you can try other models. If all the models fail, you can inform the user that the OCR failed and suggest that they modify the prompt or try again later. - When you add a prompt, you should honestly repeat the requirements from the user without any additional imagination. diff --git a/skills/moark-ocr/scripts/perform_ocr.py b/skills/moark-ocr/scripts/perform_ocr.py index 6f26fe1..d6524e7 100644 --- a/skills/moark-ocr/scripts/perform_ocr.py +++ b/skills/moark-ocr/scripts/perform_ocr.py @@ -21,6 +21,7 @@ import contextlib import mimetypes import os import sys +import time from typing import Tuple import requests @@ -29,9 +30,11 @@ from requests_toolbelt import MultipartEncoder API_BASE = "https://ai.gitee.com/v1" IMAGE_OCR_URL = f"{API_BASE}/images/ocr" +DOC_PARSE_URL = f"{API_BASE}/async/documents/parse" +TASK_STATUS_URL = f"{API_BASE}/task/{{task_id}}" CHAT_MODELS = {"PaddleOCR-VL-1.5", "DeepSeek-OCR-2"} -IMAGE_OCR_MODELS = {"GOT-OCR2_0", "HunyuanOCR"} +IMAGE_OCR_MODELS = {"GOT-OCR2_0", "HunyuanOCR", "MinerU2.5"} CHAT_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp") IMAGE_ALLOWED_EXTS = (".jpg", ".jpeg", ".png", ".webp", ".gif") @@ -135,11 +138,13 @@ def run_chat_ocr(model: str, prompt: str, image_path: str, client: OpenAI): def run_image_ocr_multipart(model: str, prompt: str, image_path: str, api_key: str): with contextlib.ExitStack() as stack: name, file_obj, mime_type = load_file_field(image_path, stack, IMAGE_ALLOWED_EXTS) # Validate and open image - fields = [ - ("model", model), - ("prompt", prompt) if prompt else ("response_format", "format"), # Prompt optional for API - ("image", (name, file_obj, mime_type)), - ] + base_fields = { + "model": model, + } + if prompt: + base_fields["prompt"] = prompt + fields = list(base_fields.items()) + fields.append(("image", (name, file_obj, mime_type))) encoder = 
MultipartEncoder(fields) headers = { @@ -155,6 +160,76 @@ def run_image_ocr_multipart(model: str, prompt: str, image_path: str, api_key: s print(data) +def poll_task(task_id: str, api_key: str, interval: float = 2.0, timeout: float = 120.0): + """Poll the task status until it completes or fails.""" + headers = {"Authorization": f"Bearer {api_key}"} + start_time = time.time() + while True: + resp = requests.get(TASK_STATUS_URL.format(task_id=task_id), headers=headers, timeout=15) + resp.raise_for_status() + payload = resp.json() + status = payload.get("task_status") or payload.get("status") + + if status in {"success"}: + return payload + elif status in {"failed", "cancelled"}: + raise RuntimeError(f"Task {task_id} failed with status: {status}, response: {payload}") + + if time.time() - start_time > timeout: + raise TimeoutError(f"Task {task_id} polling timed out") + time.sleep(interval) + + +def format_mineru_output(payload: dict) -> str | dict: + """Extract readable text from MinerU2.5 payload; fall back to raw payload if missing.""" + output = payload.get("output") or {} + segments = output.get("segments") if isinstance(output, dict) else None + if segments and isinstance(segments, list): + contents = [seg.get("content") for seg in segments if isinstance(seg, dict) and seg.get("content")] + if contents: + return "\n\n".join(contents) + return payload + + +def run_mineru_async(image_path: str, api_key: str): + with contextlib.ExitStack() as stack: + name, file_obj, mime_type = load_file_field(image_path, stack, IMAGE_ALLOWED_EXTS) + fields = [ + ("model", "MinerU2.5"), + ("is_ocr", "true"), + ("include_image_base64", "true"), + ("formula_enable", "true"), + ("table_enable", "true"), + ("layout_model", "doclayout_yolo"), + ("output_format", "md"), + ("file", (name, file_obj, mime_type)), + ] + + encoder = MultipartEncoder(fields) + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": encoder.content_type, + } + # Send multipart request to submit 
OCR task to model + resp = requests.post(DOC_PARSE_URL, headers=headers, data=encoder, timeout=30) + + resp.raise_for_status() + data = resp.json() + if data.get("error"): + raise RuntimeError(f"Error from MinerU2.5 submission: {data['error']}: {data.get('message', 'No message')}") + + task_id = data.get("task_id") + if not task_id: + raise RuntimeError(f"No task_id returned from MinerU2.5: {data}") + + print(f"Submitted MinerU2.5 task: {task_id}, polling for result...") + # Because we use async OCR model, we need to continuously poll for the result after submission + result_payload = poll_task(task_id, api_key) + + print("\nOCR_RESULT:") + print(format_mineru_output(result_payload)) + + def main(): parser = argparse.ArgumentParser( description="Perform OCR on images using Gitee AI Vision API" @@ -181,7 +256,6 @@ def main(): ) args = parser.parse_args() - # Get API key api_key = get_api_key(args.api_key) @@ -202,7 +276,10 @@ def main(): client = OpenAI(base_url=API_BASE, api_key=api_key) # Chat-style OCR flow run_chat_ocr(args.model, args.prompt, args.image, client) elif args.model in IMAGE_OCR_MODELS: - run_image_ocr_multipart(args.model, args.prompt, args.image, api_key) # Multipart image OCR flow + if args.model == "MinerU2.5": + run_mineru_async(args.image, api_key) + else: + run_image_ocr_multipart(args.model, args.prompt, args.image, api_key) # Multipart image OCR flow else: raise ValueError(f"Unsupported model: {args.model}") -- Gitee From 077f666df94f22798b3bbd9461cdbc84f03be574 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E8=B0=A2=E6=81=92=E4=B9=8B?= <16585759+xie-hengzhi@user.noreply.gitee.com> Date: Fri, 6 Mar 2026 15:56:28 +0800 Subject: [PATCH 7/9] =?UTF-8?q?=E5=A4=84=E7=90=86=E6=97=A0=E9=9C=80prompt?= =?UTF-8?q?=E4=BB=A5=E5=8F=8A=E6=97=A0=E6=B3=95=E5=A4=84=E7=90=86gif?= =?UTF-8?q?=E6=A0=BC=E5=BC=8F=E7=9A=84ocr=E6=A8=A1=E5=9E=8B?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- skills/moark-ocr/SKILL.md | 3 
++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/skills/moark-ocr/SKILL.md b/skills/moark-ocr/SKILL.md index 98f5afc..6e3b0d9 100644 --- a/skills/moark-ocr/SKILL.md +++ b/skills/moark-ocr/SKILL.md @@ -47,7 +47,7 @@ python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepS ## Options - `--image` - Path to image file or image URL (supports .jpg, .jpeg, .png, .webp; `.gif` is supported for `GOT-OCR2_0`, `HunyuanOCR` and `MinerU2.5`). - `--model` - Available models: `MinerU2.5` (default), `PaddleOCR-VL-1.5`, `DeepSeek-OCR-2`, `GOT-OCR2_0`, `HunyuanOCR`. -- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0` can omit prompt; others accept an optional prompt). +- `--prompt` - Instructions for the OCR task (`GOT-OCR2_0`, `MinerU2.5` don't need a prompt; others accept an optional prompt). - `--api-key` - Gitee AI API key (overrides env `GITEEAI_API_KEY`). ## Workflow @@ -60,6 +60,7 @@ python {baseDir}/scripts/perform_ocr.py --image /path/to/image.jpg --model DeepS ## Notes - If GITEEAI_API_KEY is none, you should shut down and remind user to provide --api-key argument - You should always wait for the script to finish executing, don't shut it down prematurely until it output messages. +- When user require OCR on a GIF image, you should first check if the model they specified supports GIF format. If it doesn't, you should inform the user that the model they chose doesn't support GIF images and suggest them to choose another model that supports GIF or convert the GIF to a supported format before performing OCR. - You should not only return the OCR result honestly but also provide a brief summary of the recognized text. - When one model fails to work, you can try to use other models. If all the models fail, you can inform the user that the image generation failed and suggest them to modify the prompt or try again later. - When you add prompt, you should honestly repeat the requirements from user without any additional imaginations. 
--
Gitee

From d3664755beec84aec43b20c91585f44a42ebb3ca Mon Sep 17 00:00:00 2001
From: Xie Hengzhi <16585759+xie-hengzhi@user.noreply.gitee.com>
Date: Fri, 6 Mar 2026 16:54:31 +0800
Subject: [PATCH 8/9] Unify the default model used for OCR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 skills/moark-ocr/scripts/perform_ocr.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/skills/moark-ocr/scripts/perform_ocr.py b/skills/moark-ocr/scripts/perform_ocr.py
index d6524e7..90730ca 100644
--- a/skills/moark-ocr/scripts/perform_ocr.py
+++ b/skills/moark-ocr/scripts/perform_ocr.py
@@ -173,7 +173,7 @@ def poll_task(task_id: str, api_key: str, interval: float = 2.0, timeout: float
         if status in {"success"}:
             return payload
         elif status in {"failed", "cancelled"}:
-            raise RuntimeError(f"Task {task_id} failed with status: {status}, response: {payload}")
+            raise RuntimeError(f"Task {task_id} failed with status: {status}")

         if time.time() - start_time > timeout:
             raise TimeoutError(f"Task {task_id} polling timed out")
@@ -241,7 +241,7 @@ def main():
     parser.add_argument(
         "--model", "-m",
-        default="PaddleOCR-VL-1.5",
+        default="MinerU2.5",
         choices=sorted(list(CHAT_MODELS | IMAGE_OCR_MODELS)),
         help="Model to use for OCR"
     )
--
Gitee

From 249696af8eecda9b598941bd6686b221eb840979 Mon Sep 17 00:00:00 2001
From: Xie Hengzhi <16585759+xie-hengzhi@user.noreply.gitee.com>
Date: Fri, 6 Mar 2026 17:09:27 +0800
Subject: [PATCH 9/9] Improve output and add user interaction to the image-gen skill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 skills/moark-image-gen/SKILL.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/skills/moark-image-gen/SKILL.md b/skills/moark-image-gen/SKILL.md
index 451642e..e639222 100644
--- a/skills/moark-image-gen/SKILL.md
+++ b/skills/moark-image-gen/SKILL.md
@@ -79,7 +79,7 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description"
 4. Display the image to the user using markdown syntax: `🖼️[Generated Image](URL)`.

 ## Notes
-- You should not only return the image URL but also describe the image based on the user's prompt, and claim the hyperparameters used for generation.
+- You should not only return the image URL but also describe the image based on the user's prompt, and ask the user whether the generated image meets their requirements.
 - You should always wait for the script to finish executing; don't shut it down prematurely before it outputs messages.
 - When one model fails to work, you can try to use other models. If all the models fail, you can inform the user that the image generation failed and suggest that they modify the prompt or try again later.
 - The language of your answer should be consistent with the user's question.
@@ -90,4 +90,5 @@ python {baseDir}/scripts/perform_image_gen.py --prompt "your image description"
 - You should honestly repeat the description of the image from the user without any additional imagination.
 - **Handling User Feedback on Quality**: If the user states the image quality is low or lacks details, you should retry generating with a higher `--num-inference-steps` (e.g. 25 → 30).
 - **Handling User Feedback on Prompt Adherence**: If the user states the image doesn't follow the prompt closely enough or ignores details, increase the `--guidance-scale` parameter (e.g. 7.5 → 15). If they say it's oversaturated or distorted, decrease it.
+- If the user says the image quality is low, you can try to increase the `--num-inference-steps` parameter and regenerate the image. If they say the image doesn't follow the prompt closely enough, you can try to increase the `--guidance-scale` parameter and regenerate the image. If they say it's oversaturated or distorted, you can try to decrease the `--guidance-scale` parameter and regenerate the image.
 - When you increase `--num-inference-steps` or `--guidance-scale`, you must obey the range of the hyperparameter for the specific model in **Model Specific Defaults:**. For example, if the user is using the `Kolors` model and you want to increase `--num-inference-steps` from 25 to 30, you can do that because it's within the range of 20-30. But if you want to increase it to 35, you should not do that because it's out of range. In that case, you can inform the user that you have increased the `--num-inference-steps` to the maximum value of 30 for the `Kolors` model.
\ No newline at end of file
--
Gitee
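
The `poll_task` helper these patches introduce follows a generic submit-then-poll pattern: POST the job, read back a task id, then GET a status endpoint until it reports a terminal state or a timeout elapses. Below is a minimal standalone sketch of that polling loop, with the HTTP call replaced by an injected `fetch_status` callable so it runs without the Gitee AI endpoint; the function and payload names here are illustrative, not part of the patch:

```python
import time


def poll_until_done(fetch_status, interval: float = 0.01, timeout: float = 5.0) -> dict:
    """Call fetch_status() repeatedly until the task reaches a terminal state.

    fetch_status is assumed to return a dict with a "status" key, mirroring
    the task payloads the script reads. Raises RuntimeError on failure and
    TimeoutError when the deadline passes without a terminal status.
    """
    deadline = time.time() + timeout
    while True:
        payload = fetch_status()
        status = payload.get("status")
        if status == "success":
            return payload
        if status in {"failed", "cancelled"}:
            raise RuntimeError(f"task ended with status: {status}")
        if time.time() > deadline:
            raise TimeoutError("polling timed out")
        time.sleep(interval)  # back off between polls


# Simulated task that succeeds on the third poll
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "success", "output": "done"},
])
result = poll_until_done(lambda: next(responses))
print(result["output"])  # → done
```

In the real script the callable would wrap `requests.get` on the task-status URL; injecting it as a parameter is what makes the success, failure, and timeout branches easy to exercise without a live service.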