# jsoup

**Repository Path**: openharmony-sig/jsoup

## Basic Information

- **Project Name**: jsoup
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 8
- **Forks**: 25
- **Created**: 2022-04-16
- **Last Updated**: 2025-05-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# jsoup

## Introduction

A fast and forgiving HTML parser that can:

- Scrape and parse HTML from a URL, a file, or a string;
- Convert an HTML document into a DOM structure and extract attributes and text from its elements;
- Manipulate HTML elements, attributes, and text;
- Clean user-submitted HTML, keeping only the whitelisted elements and, for each element, the whitelisted attributes;
- Output tidy HTML or XHTML.

## Installation

Install the package that matches the feature you need.

Scenario 1: HTML operations (parse, extract from, and clean HTML documents)

```
ohpm install @ohos/sanitize-html
```

Scenario 2: convert HTML to tidy XHTML

```
ohpm install @ohos/htmltoxml
```

Scenario 3: convert HTML to JSON

```
ohpm install parser-html-json
```

For OpenHarmony ohpm environment setup and further details, see [How to install OpenHarmony ohpm packages](https://gitee.com/openharmony-tpc/docs/blob/master/OpenHarmony_har_usage.md).

## Usage

### HTML operations

#### Parse HTML and extract element attributes and text

- Configure GlobalContext in src/main/ets/entryability/EntryAbility.ts

```
GlobalContext.getContext().setValue("resManager", this.context.resourceManager);
GlobalContext.getContext().setValue("filesPath", this.context.filesDir);
GlobalContext.getContext().setValue("context", this.context);
```

- Create a `Partial<Handler>` (helper.ts)

```
import type { Parser } from "htmlparser2";
import { Handler } from 'htmlparser2/src/main/ets/esm/Parser';

interface Event {
  $event: string;
  data: unknown[];
  startIndex: number;
  endIndex: number;
}

/**
 * Creates a handler that calls the supplied callback with simplified events on
 * completion.
 *
 * @internal
 * @param callback Function to call with all events.
 */
export function getEventCollector(
  callback: (error: Error | null, data?: ESObject) => void,
): Partial<Handler> {
  const events: Event[] = [];
  let parser: Parser;

  function handle(event: string, data: unknown[]): void {
    switch (event) {
      case "onerror": {
        callback(data[0] as Error);
        break;
      }
      case "onend": {
        callback(null, {
          $event: event.slice(2),
          startIndex: parser.startIndex,
          endIndex: parser.endIndex,
          data,
        });
        break;
      }
      case "onreset": {
        events.length = 0;
        break;
      }
      case "onparserinit": {
        // Keep a reference to the parser so that indices can be read later
        parser = data[0] as Parser;
        break;
      }
      case "onopentag": {
        callback(null, {
          $event: event.slice(2),
          startIndex: parser.startIndex,
          endIndex: parser.endIndex,
          data,
        });
        break;
      }
      case "ontext": {
        callback(null, {
          $event: event.slice(2),
          startIndex: parser.startIndex,
          endIndex: parser.endIndex,
          data: data[0],
        });
        break;
      }
      case "onclosetag": {
        if (data[0] === "script") {
          console.info("htmlparser2--That's it?!");
        }
        break;
      }
      default: {
        // All other events are only recorded in `events`; they are not passed to the callback
        const last = events[events.length - 1];
        if (event === "ontext" && last && last.$event === "text") {
          (last.data[0] as string) += data[0];
          last.endIndex = parser.endIndex;
          break;
        }
        if (event === "onattribute" && data[2] === undefined) {
          data.pop();
        }
        if (!(parser.startIndex <= parser.endIndex)) {
          throw new Error(
            `Invalid start/end index ${parser.startIndex} > ${parser.endIndex}`,
          );
        }
        events.push({
          $event: event.slice(2),
          startIndex: parser.startIndex,
          endIndex: parser.endIndex,
          data,
        });
      }
    }
  }

  // Any property read on the returned object yields a function that forwards to handle()
  return new Proxy(
    {},
    {
      get: (_, event: string) => (...data: unknown[]) => handle(event, data),
    },
  );
}
```

- Build a Parser with a Handler

```
import { Parser } from 'htmlparser2'
import * as helper from './helper' // assumed path to the helper.ts created above

let parser = new Parser(helper.getEventCollector((error, actual: ESObject) => {
  if (error) {
    // On error the collector passes no event data
    console.error('jsoup parse error: ' + error.message);
    return;
  }
  if (actual.$event === "opentag") {
    this.addLog(this.parserContent, `jsoup-- onopentag name --> ${actual.data[0]} attributes --> ${JSON.stringify(actual.data[1])}`);
  }
  if (actual.$event === "text") {
    this.addLog(this.parserContent, "jsoup-- text -->" + actual.data);
  }
  if (actual.$event === "opentagname") {
    this.addLog(this.parserContent, "jsoup-- tagName -->" + actual.data);
  }
  if (actual.$event === "attribute") {
    this.addLog(this.parserContent, `jsoup-- attribName name --> ${actual.data[0]} value --> ${actual.data[1]}`);
  }
  if (actual.$event === "closetag") {
    this.addLog(this.parserContent, "jsoup-- closeTag --> " + actual.data);
  }
  if (actual.$event === "end") {
    this.showResult(this.parserContent.join('\n'))
    this.parserContent = [];
  }
}));
parser.write(html);
parser.end();
```

- Build a Parser with a DomHandler

```
import { Parser } from 'htmlparser2'
import { DomHandler } from 'domhandler'
import * as DomUtils from 'domutils'

const handler = new DomHandler((error, dom) => {
  if (error) {
    // Handle error
  } else {
    // Parsing completed, do something
    console.info('jsoup dom.toString()=' + dom + "");
    let elements = DomUtils.getElementsByTagName('style', dom)
    console.info('jsoup elements.length=', elements.length);
    let element = elements[0]
    console.info('jsoup element=', Object.keys(element));
    let text = DomUtils.getText(elements)
    console.info('jsoup text=', text);
  }
});
const parser = new Parser(handler, { decodeEntities: true });
parser.write(html);
parser.end();
```

- Parse with parseDocument

```
import { parseDocument } from 'htmlparser2'
import type { Document } from 'domhandler'
import * as DomUtils from 'domutils'

let dom: Document = parseDocument(html)
// Use DomUtils to work with the parsed DOM object
// Get elements by tag name
let element = DomUtils.getElementsByTagName('style', dom)
// Get the text
let text = DomUtils.getText(element)
// Check whether the node is a tag
let isTag = DomUtils.isTag(element[0])
// Check whether the node is CDATA
let isCDATA = DomUtils.isCDATA(element[0])
// Check whether the node is Text
let isText = DomUtils.isText(element[0])
// Check whether the node is a Comment
let isComment = DomUtils.isComment(element[0])
// Get the children of a given element
let body = DomUtils.getElementsByTagName('body', dom)
let children = DomUtils.getChildren(body[0])
```

#### Get the HTML text

- Get the HTML text from a URL

```
import http from '@ohos.net.http';

let httpRequest = http.createHttp()
httpRequest.request('http://106.15.92.248/share/html.txt')
  .then((data) => {
    console.log("jsoup url html=" + JSON.stringify(data))
    // TODO do something
    if (data.result && typeof data.result === 'string') {
      parser.write(data.result);
      parser.end();
    }
  })
  .catch((err) => {
    console.error('jsoup connect error:' + JSON.stringify(err));
  })
```

- Get the HTML text from a file stream

```
import fileio from '@ohos.fileio';

// `stream` and `html` are assumed to be defined elsewhere
let buf = new ArrayBuffer(html.length)
stream.readSync(buf, { offset: 0, length: html.length, position: 0 })
// Maps one byte to one character, so this only suits single-byte content
let dom = String.fromCharCode.apply(null, Array.from(new Uint8Array(buf)))
// TODO do something
parser.write(dom);
parser.end();
```

- Get the HTML text from a rawfile

```
import util from '@ohos.util';
import resmgr from '@ohos.resourceManager';

// Note: this value must first be assigned in MainAbility
let resourceManager = GlobalContext.getContext().getValue("resManager") as resmgr.ResourceManager
if (!resourceManager) {
  console.log('jsoup resourceManager is undefined');
  return;
}
resourceManager.getRawFile(filePath)
  .then((data) => {
    let textDecoder = new util.TextDecoder("utf-8", { ignoreBOM: true })
    let result: string = textDecoder.decode(data, { stream: false })
    // TODO do something
    parser.write(result);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromRawFile err=" + err)
  })
```

- Get the HTML text from a file path

```
import fileio from '@ohos.fileio';

let filesPath = GlobalContext.getContext().getValue("filesPath") as string
if (!filesPath) {
  console.log('jsoup filesPath is undefined');
  return;
}
let filePath = filesPath + '/jsoup.html';
fileio.readText(filePath)
  .then((data) => {
    console.log("jsoup getHtmlFromFilePath text=" + data);
    // TODO do something
    parser.write(data);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromFilePath err=" + err)
  })
```

#### Clean HTML and manipulate HTML elements, attributes, and text

- Import the module

```
import SanitizeHtml from 'sanitize-html'
```

- Clean HTML

Use the default lists of tags and attributes:

```
const clean = SanitizeHtml(dirty);
```

Allow specific tags and attributes, which are then not removed:

```
const clean = SanitizeHtml(dirty, {
  allowedTags: [ 'b', 'i', 'em', 'strong', 'a' ],
  allowedAttributes: {
    'a': [ 'href' ]
  },
  allowedIframeHostnames: ['www.youtube.com']
});
```

Add tags on top of the default list:

```
const clean = SanitizeHtml(dirty, {
  allowedTags:
    SanitizeHtml.defaults.allowedTags.concat([ 'img' ])
});
```

Escape disallowed tags instead of removing them:

```
const clean = SanitizeHtml('before <img src="test.png" /> after', {
  disallowedTagsMode: 'escape',
  allowedTags: [],
  allowedAttributes: false
})
```

Allow all tags or all attributes:

```
allowedTags: false,
allowedAttributes: false
```

Allow no tags at all:

```
allowedTags: [],
allowedAttributes: {}
```

Allow specific CSS classes on specific elements:

```
const clean = SanitizeHtml(dirty, {
  allowedTags: [ 'p', 'em', 'strong' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ]
  }
});
```

Allow specific CSS styles on specific elements:

```
const clean = SanitizeHtml(dirty, {
  allowedTags: ['p'],
  allowedAttributes: {
    'p': ["style"],
  },
  allowedStyles: {
    '*': {
      // Match HEX and RGB
      'color': [/^#(0x)?[0-9a-f]+$/i, /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/],
      'text-align': [/^left$/, /^right$/, /^center$/],
      // Match any number with px, em, or %
      'font-size': [/^\d+(?:px|em|%)$/]
    },
    'p': {
      'font-size': [/^\d+rem$/]
    }
  }
});
```

- Transform tags

```
const dirty = '<ol><li>Hello world</li></ol>';
const clean = SanitizeHtml(dirty, {
  transformTags: {
    'ol': 'ul',
  }
});
```

Transform a tag and add attributes:

```
const dirty = '<ol><li>Hello world</li></ol>';
const clean = SanitizeHtml(dirty, {
  transformTags: {
    ol: SanitizeHtml.simpleTransform('ul', { class: 'foo' })
  },
  allowedAttributes: {
    ul: ['foo', 'bar', 'class']
  }
});
```

- Add or modify the text content of a tag

```
const clean = SanitizeHtml(dirty, {
  transformTags: {
    'a': function(tagName, attribs) {
      return {
        tagName: 'a',
        attribs: attribs,
        text: 'Some text'
      };
    }
  }
});
```

For example, you can transform a link element that has no anchor text:

```
<a href="http://somelink.com"></a>
```

into a link with anchor text:

```
<a href="http://somelink.com">Some text</a>
```

- Provide a filter function to remove unwanted tags

```
const dirty = '<p>This is <a href="http://www.linux.org"></a><br/>Linux</p>';
const clean = SanitizeHtml(dirty, {
  // Drop every <a> element that has no text
  exclusiveFilter: function (frame) {
    return frame.tag === 'a' && !frame.text.trim();
  }
});
```

### Convert HTML to tidy XHTML

```
import { XMLWriter } from '@ohos/htmltoxml'

let property = [{
  key: XMLWriter.DOCTYPE_PUBLIC,
  value: '-//W3C//DTD XHTML 1.1//EN'
}, {
  key: XMLWriter.DOCTYPE_SYSTEM,
  value: 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'
}]
const xml = new XMLWriter(html, property);
xml.convertToXML((content, error) => {
  // `content` holds the XHTML on success; `error` is set on failure
})
```

### Extract CSS

```
import * as ParserHTMLJson from 'parser-html-json'

let parserJson = new ParserHTMLJson.default(html);
let result = parserJson.getClassStyleJson();
console.info("jsoup css=" + JSON.stringify(result));
```

## API reference

Type definitions:

```
// Parser handler callbacks
interface Handler {
    onparserinit(parser: Parser): void;
    onreset(): void;
    onend(): void;
    onerror(error: Error): void;
    onclosetag(name: string): void;
    onopentagname(name: string): void;
    onattribute(name: string, value: string, quote?: string | undefined | null): void;
    onopentag(name: string, attribs: {
        [s: string]: string;
    }): void;
    ontext(data: string): void;
    oncomment(data: string): void;
    oncdatastart(): void;
    oncdataend(): void;
    oncommentend(): void;
    onprocessinginstruction(name: string, data: string): void;
}

// Parser options
interface ParserOptions {
    decodeEntities?: boolean;
    lowerCaseTags?: boolean;
    lowerCaseAttributeNames?: boolean;
    recognizeCDATA?: boolean;
}

// Clean HTML to defend against XSS attacks
declare namespace sanitize {
  interface Attributes { [attr: string]: string; }

  interface Tag { tagName: string; attribs: Attributes; text?: string; }

  type Transformer = (tagName: string, attribs: Attributes) => Tag;

  type AllowedAttribute = string | { name: string; multiple?: boolean; values: string[] };

  type DisallowedTagsModes = 'discard' | 'escape' | 'recursiveEscape';

  interface IDefaults {
    allowedAttributes: Record<string, AllowedAttribute[]>;
    allowedSchemes: string[];
    allowedSchemesByTag: { [index: string]: string[] };
    allowedSchemesAppliedToAttributes: string[];
    allowedTags: string[];
    allowProtocolRelative: boolean;
    disallowedTagsMode: DisallowedTagsModes;
    enforceHtmlBoundary: boolean;
    selfClosing: string[];
  }

  interface IFrame {
    tag: string;
    attribs: { [index: string]: string };
    text: string;
    tagPosition: number;
  }

  interface IOptions {
    allowedAttributes?: Record<string, AllowedAttribute[]> | false;
    allowedStyles?: { [index: string]: { [index: string]: RegExp[] } };
    allowedClasses?: { [index: string]: boolean | Array<string | RegExp> };
    allowIframeRelativeUrls?: boolean;
    allowedSchemes?: string[] | boolean;
    allowedSchemesByTag?: { [index: string]: string[] } | boolean;
    allowedSchemesAppliedToAttributes?: string[];
    allowProtocolRelative?: boolean;
    allowedTags?: string[] | false;
    allowVulnerableTags?: boolean;
    textFilter?: ((text: string, tagName: string) => string);
    exclusiveFilter?: ((frame: IFrame) => boolean);
    nonTextTags?: string[];
    selfClosing?: string[];
    transformTags?: { [tagName: string]: string | Transformer };
    parser?: ParserOptions;
    disallowedTagsMode?: DisallowedTagsModes;
    enforceHtmlBoundary?: boolean;
  }

  const defaults: IDefaults;
  const options: IOptions;

  function simpleTransform(tagName: string, attribs: Attributes, merge?: boolean): Transformer;
}
```

Interface definitions:

| Method | Parameters | Description |
| ------------------------------------------------------------ | ------------------------ | -------------------------------------------------------- |
| new Parser(cbs: Partial\<Handler\> \| null, options?: ParserOptions) | handler, ParserOptions | Creates an HTML parser. |
| write(chunk: string): void | string | Writes data to the HTML parser: parses a chunk of data and invokes the corresponding callbacks. |
| end(chunk?: string): void | string | Parses the end of the buffer, clears the stack, and calls onend. |
| parseComplete(data: string): void | string | Resets the parser, then parses a complete document and pushes it to the handler. |
| parseDocument(data: string, options?: ParserOptions): Document | string, ParserOptions | Parses the data and returns the resulting document. |
| SanitizeHtml(dirty: string, options?: sanitize.IOptions): string | string, sanitize.IOptions | Cleans HTML so that it can be trusted. |
| new XMLWriter(html: string, property?: Array) | string, Array | Creates an XHTML converter object. |
| convertToXML(callback: (content: string \| null, error?: Error) => void): void | callback | Converts HTML to XHTML. |
| new ParserHTMLJson.default(html: string) | html | Creates an HTML-to-JSON parser. |
| getClassStyleJson() | none | Extracts the CSS. |
| getHtmlJson() | none | Returns the HTML as a JSON-format string. |

For the DomUtils interfaces, see [DomUtils](https://domutils.js.org/modules.html).

## Constraints and limitations

Verified with the following versions:

DevEco Studio: 4.1 Canary (4.1.3.317), OpenHarmony SDK: API11 (4.1.0.36)

## Directory structure

````
|---- jsoup
|     |---- entry  # sample code folder
|           |---- src/main/ets
|                 |---- pages
|                       |---- addTag.ets
|                       |---- index.ets
|                       |---- showResult.ets
|     |---- library  # library that converts HTML to XHTML
|     |---- README.md  # installation and usage
|     |---- README_zh.md  # installation and usage
````

## About obfuscation

- For code obfuscation, see [Introduction to code obfuscation](https://docs.openharmony.cn/pages/v5.0/zh-cn/application-dev/arkts-utils/source-obfuscation.md)
- To keep the htmltoxml library from being obfuscated, add the corresponding exclusion rule to the obfuscation rules file obfuscation-rules.txt:

```
-keep
./oh_modules/@ohos/htmltoxml
```

## Contributing

If you run into any problem while using the component, you are welcome to file an [Issue](https://gitee.com/openharmony-sig/jsoup/issues). [PRs](https://gitee.com/openharmony-sig/jsoup/pulls) are also very welcome.

## License

This project is licensed under the [MIT](https://gitee.com/openharmony-sig/jsoup/blob/master/LICENSE) license. Feel free to enjoy and take part in open source.
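
## Supplementary example: how the event collector forwards callbacks

The `getEventCollector` helper in the usage section returns a `Proxy`, so whatever `onXxx` callback the parser invokes is routed to a single function. The block below is a minimal, self-contained sketch of that pattern only. It does not use htmlparser2: `CollectedEvent`, `AnyHandler`, `createCollector`, and `emitSample` are hypothetical names introduced for illustration, and `emitSample` merely stands in for a real parser by calling a few callbacks by hand.

```typescript
// Shape of a recorded event; loosely mirrors the `Event` interface in helper.ts.
interface CollectedEvent {
  $event: string;
  data: unknown[];
}

// A handler on which any property can be called as a callback.
type AnyHandler = Record<string, (...data: unknown[]) => void>;

// Returns an object whose every property is a function that records its call in `sink`.
function createCollector(sink: CollectedEvent[]): AnyHandler {
  return new Proxy({} as AnyHandler, {
    get: (_target, prop: string | symbol) =>
      (...data: unknown[]): void => {
        const name = String(prop);
        // Strip the leading "on" so that "onopentag" is stored as "opentag".
        sink.push({ $event: name.startsWith("on") ? name.slice(2) : name, data });
      },
  });
}

// Hypothetical stand-in for a parser: it only calls the callbacks in order.
function emitSample(handler: AnyHandler): void {
  handler.onopentag("p", { class: "fancy" });
  handler.ontext("Hello");
  handler.onclosetag("p");
  handler.onend();
}

const events: CollectedEvent[] = [];
emitSample(createCollector(events));
console.log(events.map((e) => e.$event).join(",")); // prints "opentag,text,closetag,end"
```

Because the proxy answers every property read, the collector needs no explicit list of callback names. The trade-off is that a misspelled callback name is recorded silently rather than rejected at compile time.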