# caijitest

**Repository Path**: infinitezz/caijitest

## Basic Information

- **Project Name**: caijitest
- **Description**: Website scraping test
- **Primary Language**: PHP
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2019-12-10
- **Last Updated**: 2021-11-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# caiji

Scraping website: assets

### Environment setup

PHP 7.0 NTS or later

#### PHP extensions

- php_curl
- php_openssl
- php_sockets
- php_mbstring
- php_exif
- php_fileinfo
- php_gd2
- php_imagick
- php_gettext
- php_mysqli

### ThinkPHP5.0.24

`release/thinkphp/helper.php`

#### RBAC permission management

https://www.cnblogs.com/caicaizi/p/7797357.html

`composer require gmars/tp5-rbac`

### Test site for scraping

JD Auction (京东拍卖)

- https://auction.jd.com/home.html
- https://auction.jd.com/paimai_list.html?t=1&limit=40&page=2

### Scraping

#### QueryList scraping

http://querylist.cc/docs/guide/v4/integration

`composer require jaeger/querylist`

Use PhantomJS to scrape pages that are rendered dynamically by JavaScript.

`composer require jaeger/querylist-phantomjs`

- Binary download: https://phantomjs.org/download.html
- Notes: `doc/PhantomJS/readme.md`
- Configuration:
  - `release/data/conf/config.php`
  - `release/vendor/jaeger/querylist/src/QueryList.php`
  - `release/vendor/jaeger/querylist/src/Dom/Elements.php`

#### Python crawler

To be considered at a later stage.

Latest crawler (September 2019) for China Judgments Online (裁判文书网): https://blog.csdn.net/Since_you/article/details/100566633

### PDF to text

#### imagick

- https://windows.php.net/downloads/pecl/deps/
- https://windows.php.net/downloads/pecl/releases/imagick/3.4.3/

Points to note:

- PHP Version: your PHP version
- compiler: MSVC11
- Architecture: x86
- Thread Safety: disabled (non-thread-safe, i.e. NTS; the opposite is thread-safe, TS)

Ghostscript installation: https://www.ghostscript.com/download/gsdnld.html

How to convert a PDF to images and stitch images together with PHP (with code): https://www.php.cn/php-weizijiaocheng-406701.html

`release/extend/pdf/Handler.php`

### OCR

Baidu API, with rotated-image recognition: http://ai.baidu.com/tech/ocr

`release/extend/baidu/Ocr.php`

### DOC to text

#### antiword => doc (may occasionally fail)

- http://get.ftqq.com/9122.get
- https://blog.csdn.net/sheqianweilong/article/details/88081958

#### PHPWord => docx

- https://segmentfault.com/a/1190000019479817
- https://github.com/PHPOffice/PHPWord

`composer require phpoffice/phpword`
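
The JD Auction list URL given for the test site is paginated through its query string (`t=1&limit=40&page=2`). As a minimal sketch of how the list pages could be enumerated before handing them to a scraper, the helper below rebuilds that URL for a range of pages. The function name `buildListUrls` is hypothetical, and the meaning of the `t` and `limit` parameters is an assumption based only on the sample URL; nothing here is taken from the repository's own code.

```php
<?php
/**
 * Hypothetical helper (not part of this repository): build the list-page
 * URLs for a range of pages, reusing the query parameters seen in the
 * sample URL. What "t" and "limit" mean on the site is an assumption.
 */
function buildListUrls(int $firstPage, int $lastPage, int $limit = 40, int $type = 1): array
{
    $base = 'https://auction.jd.com/paimai_list.html';
    $urls = [];

    for ($page = $firstPage; $page <= $lastPage; $page++) {
        // http_build_query() URL-encodes the values and joins them with "&".
        $urls[] = $base . '?' . http_build_query([
            't'     => $type,
            'limit' => $limit,
            'page'  => $page,
        ]);
    }

    return $urls;
}

// Usage: print the URLs for pages 1 to 3.
foreach (buildListUrls(1, 3) as $url) {
    echo $url, PHP_EOL;
}
```

Each generated URL can then be fetched one at a time, which keeps paging logic separate from the extraction rules.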