# MDfromHTML
**Repository Path**: mirrors_ibm/MDfromHTML
## Basic Information
- **Project Name**: MDfromHTML
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-11-23
- **Last Updated**: 2025-08-23
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# MDfromHTML
Generate Markdown from HTML using filters to remove noise from web pages (e.g., headers, footers, advertisements, sidebars). The tools capture the provenance of the generated Markdown back to the original HTML content and explain the filtering that occurred. The repository also includes tools to generate formatted text from the generated Markdown. It contains multiple Eclipse Maven Java projects, including REST web services that generate Markdown from HTML.
Release 2.0 exposed the choice of filter file (`./properties/HTML_Filters.json`) as a parameter when running the GetMarkdownFromHTML utility, so different sets of filters can be used to produce different types of output.
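The idea of passing the filter file as a parameter can be sketched as follows. This is an illustration only, not the project's actual code: the class name `FilterFileSelection`, the method `resolveFilterFile`, the use of the first command-line argument, and the fallback to `./properties/HTML_Filters.json` when no argument is given are all assumptions made for the example. Check the GetMarkdownFromHTML utility itself for its real parameters.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: not part of MDfromHTML.
public class FilterFileSelection {

    // Filter file named in this README; treating it as a fallback is an assumption.
    static final String DEFAULT_FILTERS = "./properties/HTML_Filters.json";

    // Returns the filter file to use: the first argument when one is supplied
    // and non-blank, otherwise the assumed default path.
    static Path resolveFilterFile(String[] args) {
        if (args != null && args.length > 0 && !args[0].trim().isEmpty()) {
            return Paths.get(args[0].trim());
        }
        return Paths.get(DEFAULT_FILTERS);
    }

    public static void main(String[] args) {
        // A real utility would load the JSON filters from this path next.
        System.out.println("Using filters from: " + resolveFilterFile(args));
    }
}
```

Keeping the path resolution in one small method means a different filter set can be selected per run without changing code, which is the behavior the release note describes.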
## Project Components
* html_extractor: Python project for HTML capture. It provides a web server, usable through cURL requests or interactively, that captures rendered web pages via Selenium and Chromium
* MDfromHTMLBase: Maven Java Eclipse project providing common utility methods used by other projects
* Remark: upgraded Maven Java Eclipse project providing code from https://bitbucket.org/OverZealous/remark/src/default/ for HTML parsing and conversion to Markdown
* MarkdownGenerator: Maven Java Eclipse project providing utilities and services to perform Markdown generation from HTML
* MDfromHTMLWebServices: Maven Java Eclipse project providing WAR file generation of REST web services to generate Markdown from HTML
## Building Projects
Each project can be built from the command line by running **mvn clean install** in the project directory, which writes the jar or war files to the target subdirectory. Alternatively, right-clicking the pom.xml file in Eclipse, selecting Run As... Maven build..., and specifying **clean install** as the goals will build the project in Eclipse.
### JDK Version
Content has been built using OpenJDK version 1.8.0_242_b08, available for download from https://adoptopenjdk.net/
### Eclipse Version
Projects were developed in Eclipse 2020-03, available from https://www.eclipse.org/downloads/, with the Java EE profile selected during installation.
## License
The code in this repository is licensed under the Apache 2.0 License.
## Support
It is best to open an issue in this repository. You may also contact Nathaniel Mills at wnm3@us.ibm.com.