# html2markdown

**Repository Path**: chiamzhang/html2markdown

## Basic Information

- **Project Name**: html2markdown
- **Description**: No description available
- **Primary Language**: Python
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-03
- **Last Updated**: 2023-12-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: HTML, Python, Markdown

## README

# html2markdown

`html2markdown` is a Python script that transforms HTML pages into clean, readable plain ASCII text. What's even better is that this ASCII text conveniently conforms to valid Markdown, a text-to-HTML formatting syntax.

To use it, one only needs to provide the URL of a website along with a few simple parameters. The script then converts the content into Markdown and downloads it locally, naming the file as `title.md`.

`html2markdown` 是一个将 HTML 页面转换为干净、易读的纯 ASCII 文本的 Python 脚本。更好的是，该 ASCII 文本也恰好是有效的 Markdown（一种文本到 HTML 的格式）。

我们只需要输入一个网站的url和简单的参数就可以进行使用，将markdown按照title.md命名下载到本地。

- [html2markdown](#html2markdown)
  - [English](#english)
    - [Introduction](#introduction)
      - [Functionality](#functionality)
      - [Demo](#demo)
  - [中文](#中文)
    - [介绍](#介绍)
      - [功能](#功能)
      - [demo](#demo-1)

## English

### Introduction

This directory consists of two files:

- `html2text.py`: This is derived from the GitHub project "html2text." I have performed a simple refactoring, modifying modules specific to Python 2, and eliminating unnecessary code.
  
- `html2markdown.py`: This is the core code responsible for implementing the functionality.

#### Functionality

1. **Web Scraping and Markdown Conversion:**
   - Utilizes the `requests` library to retrieve webpage content.
   - Utilizes `BeautifulSoup` to parse the HTML content of the webpage.
   - Converts HTML tables on the webpage to Markdown format.
   - Extracts the title and main content from HTML, converting them to plain text using the `html2text` library.
   - Combines the title and content into a Markdown-formatted string.

2. **File Operations:**
   - Defines functions to sanitize filenames by removing invalid characters.
   - Generates a random filename using alphanumeric characters.
   - Ensures the existence of a target directory for saving Markdown files and creates the directory if it doesn't exist.
   - Writes Markdown content to a file in the specified directory, using the sanitized or randomly generated filename.
   - Prints a message indicating the successful download of the file.

3. **Main Functionality:**
   - The `getHttpResponse` function takes a URL, title ID, content ID, and cookie data as parameters.
   - Sends an HTTP GET request to the specified URL with the provided headers and cookies.
   - Extracts the title and HTML content from the response using the specified title and content IDs.
   - Converts the title and HTML content to plain text using `html2text`.
   - Determines the sanitized or randomly generated filename.
   - ==Combines the title and content and writes them to a Markdown file.==

4. **Usage:**

```python
from html2markdown import html2markdown
html2Markdown(arg1, arg2, arg3, arg4)
```
- Parameter 1: URL to be converted (cannot be empty).
- Parameter 2: ID of the title element in the website's HTML, such as `articleContentId` on CSDN (if empty, it will be randomly named).
- Parameter 3: ID of the content element in the website's HTML, such as `article_content` on CSDN (if empty, the entire page will be converted).
- Parameter 4: Cookie information in JSON format (if empty, no cookie will be used).

#### Demo

There is a `test.py` in the directory containing a sample for users to reference. When obtaining parameters, you can use the browser's F12 to inspect elements.

## 中文

### 介绍

本目录下由两个文件组成：
    - html2text.py 这个源自于github项目html2text，本人进行了简单重构，代码对于一些只用于python2的模块进行了修改，并且删除了一些无用的代码。
    - html2markdown.py 这个是核心代码，主要用于功能的实现

#### 功能

1. **网页抓取与Markdown转换：**
   - 使用`requests`库获取网页内容。
   - 使用`BeautifulSoup`解析网页的HTML内容。
   - 将网页中的HTML表格转换为Markdown格式。
   - 提取HTML的标题和主要内容，使用`html2text`库将它们转换为纯文本。
   - 将标题和内容组合成Markdown格式的字符串。

2. **文件操作：**
   - 定义了一些函数来清理文件名，去除无效字符。
   - 使用字母数字字符生成一个随机文件名。
   - 确保存在用于保存Markdown文件的目标目录，如果不存在则创建该目录。
   - 将Markdown内容写入指定目录的文件中，使用清理后或随机生成的文件名。
   - 打印一条消息，指示文件成功下载。

3. **主功能：**
   - `getHttpResponse`函数以URL、标题ID、内容ID和Cookie数据作为参数。
   - 使用指定的标题和内容ID向URL发送HTTP GET请求，并携带提供的头信息和Cookie。
   - 使用指定的标题和内容ID从响应中提取标题和HTML内容。
   - 使用`html2text`将标题和HTML内容转换为纯文本。
   - 确定清理后或随机生成的文件名。
   - ==将标题和内容组合并写入Markdown文件==。

4. **用法：**

```python
from html2markdown import html2markdown
html2Markdown(参数1，参数2，参数3，参数4)
```

- 参数1:所要转化的网址（不能为空）
- 参数2:网站中文章的title的id，如在CSDN中为articleContentId （为空值则随机命名）
- 参数3:网站中文章的内容的id，如在CSDN中为article_content (为空则转化整个页面)
- 参数4:Cookie信息，Json格式即可，会自动转为可以使用的格式。(为空则不使用cookie)


#### demo

目录下有一个test.py,里面有一个样例供大家使用，获取参数的时候可以使用浏览器F12去寻找。