# html2markdown
**Repository Path**: chiamzhang/html2markdown
## Basic Information
- **Project Name**: html2markdown
- **Description**: No description available
- **Primary Language**: Python
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-03
- **Last Updated**: 2023-12-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: HTML, Python, Markdown
## README
# html2markdown
`html2markdown` is a Python script that converts HTML pages into clean, readable plain ASCII text. That text is also valid Markdown, a text-to-HTML formatting syntax.
To use it, provide the URL of a page and a few simple parameters. The script converts the content to Markdown and saves it locally as `title.md`, named after the article title.
- [html2markdown](#html2markdown)
  - [English](#english)
    - [Introduction](#introduction)
      - [Functionality](#functionality)
      - [Demo](#demo)
  - [Chinese](#中文)
    - [Introduction](#介绍)
      - [Functionality](#功能)
      - [Demo](#demo-1)
## English
### Introduction
This directory consists of two files:
- `html2text.py`: Derived from the GitHub project `html2text`. I lightly refactored it, changed the modules that only work on Python 2, and removed unused code.
- `html2markdown.py`: This is the core code responsible for implementing the functionality.
#### Functionality
1. **Web Scraping and Markdown Conversion:**
   - Uses the `requests` library to retrieve the webpage content.
   - Uses `BeautifulSoup` to parse the HTML content of the webpage.
   - Converts HTML tables on the webpage to Markdown format.
   - Extracts the title and main content from the HTML and converts them to plain text with the `html2text` library.
   - Combines the title and content into a Markdown-formatted string.
2. **File Operations:**
   - Defines functions that sanitize filenames by removing invalid characters.
   - Generates a random filename from alphanumeric characters.
   - Checks that the target directory for Markdown files exists and creates it if it does not.
   - Writes the Markdown content to a file in that directory, using the sanitized or randomly generated filename.
   - Prints a message confirming that the file was downloaded.
3. **Main Functionality:**
   - The `getHttpResponse` function takes a URL, a title ID, a content ID, and cookie data as parameters.
   - It sends an HTTP GET request to the URL with the provided headers and cookies.
   - It extracts the title and HTML content from the response using the given title and content IDs.
   - It converts the title and HTML content to plain text using `html2text`.
   - It determines the filename, either sanitized from the title or randomly generated.
   - It combines the title and content and writes them to a Markdown file.
4. **Usage:**
   ```python
   from html2markdown import html2markdown

   # url, title element ID, content element ID, cookie JSON (see the parameter list below)
   html2markdown(url, title_id, content_id, cookie_json)
   ```
   - Parameter 1: URL of the page to convert (cannot be empty).
   - Parameter 2: ID of the title element in the page's HTML, such as `articleContentId` on CSDN (if empty, the output file gets a random name).
   - Parameter 3: ID of the content element in the page's HTML, such as `article_content` on CSDN (if empty, the entire page is converted).
   - Parameter 4: Cookie information in JSON format (if empty, no cookie is used).
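The file-handling steps listed under File Operations could look roughly like the sketch below. It is not the code from `html2markdown.py`: the function names (`sanitize_filename`, `random_filename`, `save_markdown`), the set of rejected characters, the eight-character random name, and the default `markdown` directory are all assumptions made for illustration.

```python
import random
import re
import string
from pathlib import Path

# Assumed set of characters to strip; Windows rejects these in filenames.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def sanitize_filename(name):
    """Remove characters that are invalid in filenames and trim whitespace."""
    return _INVALID_CHARS.sub("", name).strip()


def random_filename(length=8):
    """Build a fallback name from random letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def save_markdown(markdown, title="", directory="markdown"):
    """Write `markdown` to `<directory>/<name>.md` and return the path."""
    # Fall back to a random name when the title is empty after sanitizing.
    name = sanitize_filename(title) or random_filename()
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path = target_dir / f"{name}.md"
    path.write_text(markdown, encoding="utf-8")
    print(f"Downloaded: {path}")
    return path
```

Falling back to a random name only after sanitizing covers both an empty title and a title made up entirely of invalid characters.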
#### Demo
The directory includes a `test.py` with a sample call for reference. To find the element IDs to pass as parameters, open the browser's developer tools (F12) and inspect the page elements.
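The cookie parameter can be collected in the same developer tools: the Network tab shows the `Cookie` request header as a `name=value; name=value` string. The sketch below turns such a string into JSON using only the standard library. The helper name `cookie_header_to_json` is hypothetical, and the flat name-to-value object it produces is an assumption, since the README does not spell out the exact JSON shape the script expects.

```python
import json
from http.cookies import SimpleCookie


def cookie_header_to_json(header):
    """Turn a raw `Cookie:` header value into a JSON object string."""
    jar = SimpleCookie()
    jar.load(header)
    # Each entry is a Morsel; `.value` holds the decoded cookie value.
    return json.dumps({name: morsel.value for name, morsel in jar.items()})


# Example with made-up cookie names and values.
print(cookie_header_to_json("sessionid=abc123; theme=dark"))
```

`SimpleCookie` is strict about unusual characters in cookie names, so a header copied from a real site may need manual cleanup before it parses completely.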
## 中文
### 介绍
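This directory consists of two files:
- `html2text.py`: Derived from the GitHub project `html2text`. I did a simple refactoring, changed the modules that were only used on Python 2, and removed some unused code.
- `html2markdown.py`: The core code, which implements the functionality.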
#### 功能
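1. **Web Scraping and Markdown Conversion:**
   - Uses the `requests` library to fetch the webpage content.
   - Uses `BeautifulSoup` to parse the HTML content of the webpage.
   - Converts HTML tables on the webpage to Markdown format.
   - Extracts the title and main content from the HTML and converts them to plain text with the `html2text` library.
   - Combines the title and content into a Markdown-formatted string.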
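2. **File Operations:**
   - Defines functions that clean filenames by removing invalid characters.
   - Generates a random filename from alphanumeric characters.
   - Checks that the target directory for Markdown files exists and creates it if it does not.
   - Writes the Markdown content to a file in the specified directory, using the cleaned or randomly generated filename.
   - Prints a message indicating that the file was downloaded successfully.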
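3. **Main Functionality:**
   - The `getHttpResponse` function takes a URL, a title ID, a content ID, and cookie data as parameters.
   - It sends an HTTP GET request to the URL, carrying the provided headers and cookies.
   - It extracts the title and HTML content from the response using the given title and content IDs.
   - It converts the title and HTML content to plain text using `html2text`.
   - It determines the filename, either cleaned from the title or randomly generated.
   - It combines the title and content and writes them to a Markdown file.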
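4. **Usage:**
   ```python
   from html2markdown import html2markdown

   # url, title element ID, content element ID, cookie JSON (see the parameter list below)
   html2markdown(url, title_id, content_id, cookie_json)
   ```
   - Parameter 1: URL of the page to convert (cannot be empty).
   - Parameter 2: ID of the article title element on the site, such as `articleContentId` on CSDN (if empty, the file gets a random name).
   - Parameter 3: ID of the article content element on the site, such as `article_content` on CSDN (if empty, the entire page is converted).
   - Parameter 4: Cookie information in JSON format, which is converted automatically into a usable form (if empty, no cookie is used).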
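One step in the list above, converting HTML tables to Markdown, can be sketched in isolation. The script itself is described as using `BeautifulSoup` for parsing; the version below swaps in the standard library's `html.parser` so that it runs without extra dependencies. The class and function names are hypothetical, and the sketch assumes a simple table whose first row is the header, with no nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_markdown(html):
    """Render the rows found in `html` as a Markdown table string."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad short rows so every line has at least as many cells as the header.
        row = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

Treating the first row as the header keeps the sketch short; a table without `<th>` cells would still get its first data row promoted to the header line.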
#### demo
The directory contains a `test.py` with a sample you can use. To find the parameter values, use the browser's F12 developer tools.