# cleanvision

**Repository Path**: data_factory/cleanvision

## Basic Information

- **Project Name**: cleanvision
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: AGPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-04-16
- **Last Updated**: 2024-04-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

CleanVision automatically detects potential issues in image datasets like images that are: blurry, under/over-exposed, (near) duplicates, etc. This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning. CleanVision is super simple -- run the same couple lines of Python code to audit any image dataset!

[![Read the Docs](https://readthedocs.org/projects/cleanvision/badge/?version=latest)](https://cleanvision.readthedocs.io/en/latest/)
[![pypi](https://img.shields.io/pypi/v/cleanvision?color=blue)](https://pypi.org/pypi/cleanvision/)
[![os](https://img.shields.io/badge/platform-noarch-lightgrey)](https://pypi.org/pypi/cleanvision/)
[![py\_versions](https://img.shields.io/badge/python-3.7%2B-blue)](https://pypi.org/pypi/cleanvision/)
[![codecov](https://codecov.io/github/cleanlab/cleanvision/branch/main/graph/badge.svg?token=y1N6MluN9H)](https://codecov.io/gh/cleanlab/cleanvision)
[![Slack Community](https://img.shields.io/static/v1?logo=slack&style=flat&color=white&label=slack&message=community)](https://cleanlab.ai/slack)
[![Twitter](https://img.shields.io/twitter/follow/CleanlabAI?style=social)](https://twitter.com/CleanlabAI)
[![Cleanlab Studio](https://raw.githubusercontent.com/cleanlab/assets/master/shields/cl-studio-shield.svg)](https://cleanlab.ai/studio/?utm_source=github&utm_medium=readme&utm_campaign=clostostudio)

## Installation

```shell
pip install cleanvision
```

## Quickstart

Download an example dataset (optional). Or just use any collection of image files you have.

```shell
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
```

1. Run CleanVision to audit the images.
```python
from cleanvision import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()
```

2. CleanVision diagnoses many types of issues, but you can also check for only specific issues.

```python
issue_types = {"dark": {}, "blurry": {}}

imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
```

## More resources on how to use CleanVision

- [Tutorial](https://cleanvision.readthedocs.io/en/latest/tutorials/tutorial.html)
- [Run CleanVision on a HuggingFace dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/huggingface_dataset.html)
- [Run CleanVision on a Torchvision dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/torchvision_dataset.html)
- [Example script](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/run.py) that can be run with: `python examples/run.py --path `
- [Additional example notebooks](https://github.com/cleanlab/cleanvision-examples)
- [Documentation](https://cleanvision.readthedocs.io/)
- [Blog Post](https://cleanlab.ai/blog/cleanvision/)
- [FAQ](https://cleanvision.readthedocs.io/en/latest/faq.html)

## *Clean* your data for better Computer *Vision*

The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.
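
The `issue_types` dictionary used in the Quickstart maps each issue key to a settings dict (empty in that example). As a minimal sketch, the helper below builds such a dictionary from a list of keys and rejects any key that is not among the issue keys listed in this README, so a typo can surface before an audit starts. `build_issue_types` and `KNOWN_ISSUE_KEYS` are hypothetical names introduced here for illustration; they are not part of CleanVision.

```python
# Issue keys copied from the "Issue Key" column of this README's table.
# This set is an illustrative assumption, not something exported by CleanVision.
KNOWN_ISSUE_KEYS = frozenset({
    "exact_duplicates", "near_duplicates", "blurry", "low_information",
    "dark", "light", "grayscale", "odd_aspect_ratio", "odd_size",
})


def build_issue_types(keys):
    """Return a dict shaped like the Quickstart's `issue_types`: {key: {}}.

    Hypothetical helper, not part of CleanVision. Raises ValueError for
    any key that is not listed in KNOWN_ISSUE_KEYS.
    """
    keys = list(keys)
    unknown = sorted(set(keys) - KNOWN_ISSUE_KEYS)
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    # One empty settings dict per key, mirroring the Quickstart example.
    return {key: {} for key in keys}


issue_types = build_issue_types(["dark", "blurry"])
print(issue_types)
```

The resulting dictionary can then be passed as `imagelab.find_issues(issue_types=issue_types)`, as in step 2 of the Quickstart.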

This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision task such as: classification, segmentation, object detection, pose estimation, keypoint detection, [generative modeling](https://openai.com/research/dall-e-2-pre-training-mitigations), etc. To detect issues in the labels of your image data, you can instead use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.

In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:

|   | Issue Type | Description | Issue Key | Example |
|---|------------|-------------|-----------|---------|
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png) |
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png) |
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png) |
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png) |
| 5 | Dark | Irregularly dark images (*under*exposed) | dark | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg) |
| 6 | Light | Irregularly bright images (*over*exposed) | light | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg) |
| 7 | Grayscale | Images lacking color | grayscale | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg) |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg) |
| 9 | Odd Size | Images that are abnormally large or small compared to the rest of the dataset | odd_size | |

CleanVision supports Linux, macOS, and Windows and runs on Python 3.7+.

## Join our community

* The best place to learn is [our Slack community](https://cleanlab.ai/slack). Join the discussion there to see how folks are using this library, discuss upcoming features, or ask for private support.
* Need professional help with CleanVision? Join our [\#help Slack channel](https://cleanlab.ai/slack) and message us there, or reach out via email: team@cleanlab.ai
* Interested in contributing? See the [contributing guide](CONTRIBUTING.md). An easy starting point is to consider [issues](https://github.com/cleanlab/cleanvision/labels/good%20first%20issue) marked `good first issue` or simply reach out in [Slack](https://cleanlab.ai/slack). We welcome your help building a standard open-source library for data-centric computer vision!
* Ready to start adding your own code? See the [development guide](DEVELOPMENT.md).
* Have an issue? [Search existing issues](https://github.com/cleanlab/cleanvision/issues?q=is%3Aissue) or [submit a new issue](https://github.com/cleanlab/cleanvision/issues/new/choose).
* Have ideas for the future of data-centric computer vision? Check out [our active/planned Projects and what we could use your help with](https://github.com/cleanlab/cleanvision/projects).

## License

Copyright (c) 2022 Cleanlab Inc.

cleanvision is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

cleanvision is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See [GNU Affero General Public LICENSE](https://github.com/cleanlab/cleanvision/blob/main/LICENSE) for details.

Commercial licensing is available for enterprise teams that want to use CleanVision in production workflows, but are unable to open-source their code [as is required by the current license](https://github.com/cleanlab/cleanvision/blob/main/LICENSE). Please email us: team@cleanlab.ai

[issue]: https://github.com/cleanlab/cleanvision/issues/new