# crossing_docking

**Repository Path**: wangqxw/cross_docking

## Basic Information

- **Project Name**: crossing_docking
- **Description**: Retrieve protein structures with the necessary ligands and a proper resolution, then conduct molecular docking using a knime workflow. Finally, a data analysis script can be used to analyze the results.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 1
- **Created**: 2024-10-25
- **Last Updated**: 2025-05-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Cross Docking Via Schrodinger&Knime

Retrieve protein structures with the necessary ligands and a proper resolution, then conduct molecular docking using a knime workflow. Finally, a data analysis script can be used to analyze the results.

><b>NOTE</b>: This is a highly task-specific workflow designed for a protein screening project at [Dr. Huijie Pan's lab](https://www.x-mol.com/groups/pan_huijie) at Nanjing University, so it can only handle that specific task. For other work, downloading and using it directly is **NOT RECOMMENDED**.
>
>P.S. I know this workflow is somewhat messy and not easy to use for general work, even with modifications. Optimization of the code is ongoing.


### Installation

* Download the zip file and extract it to a suitable directory, or run the following command in a suitable directory:

```bash
git clone https://gitee.com/wangqxw/cross_docking.git
```

* Create a conda environment using *env.yaml*.

```bash
# Create a conda environment
# Default name is 'docking', you can use a -n flag to set a different name
conda env create -f env.yaml
# Activate the conda environment
conda activate docking
```

<div class="alert alert-block alert-info">

<b>NOTE</b>: If you are not familiar with [git](https://git-scm.com/doc) or [conda](https://docs.conda.io/en/latest/), follow the links to their official documentation.
</div>

### How to Use?

There are 2 python scripts (*getpdb.py* and *getlig.py*) for docking preprocessing and 2 knime workflows (*grid_gen* and *docking_with_grid*) that call glide to run the docking. There is also 1 python script (*data_analysis.py*) for post-analysis of the docking results.

These files are described below:

***getpdb.py***
This is a python script to get protein structures with a specific native ligand (for example 'NAD') from the [RCSB](https://www.rcsb.org/) database. Each protein is then split into single chains (with the ligand). The retrieved raw pdb files and the chain-split structures are stored in *./structures/pdb* and *./structures/pdb_clean* respectively.

The arguments of *getpdb.py* are:

* '--ligand': Name of the ligand to search for (e.g. 'NAD').
* '--resolution': Maximum resolution limit in Å for X-ray diffraction structures (default: 4.0).
* '--output_path': Path to store the modified PDB structures.

```bash
# Example usage 
# Search for structures with NAD as the native ligand and a resolution of 3.0 or better (value <= 3.0)
python getpdb.py --ligand NAD --resolution 3.0
```
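
The chain-splitting step described above can be illustrated with a minimal sketch. It is not the code in *getpdb.py*: the function name `split_pdb_by_chain` is hypothetical, and it assumes that only `ATOM` records plus the `HETATM` records of the requested ligand are kept, and that chains without a copy of the ligand are dropped. It relies only on the fixed-column PDB layout (residue name in columns 18-20, chain ID in column 22).

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text, ligand):
    """Group ATOM records, and HETATM records of `ligand`, by chain ID.

    Hypothetical helper for illustration; returns {chain_id: pdb_text}.
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip headers, waters are filtered below
        res_name = line[17:20].strip()  # columns 18-20: residue name
        chain_id = line[21:22]          # column 22: chain identifier
        # Keep every protein atom, but only HETATM records of the ligand
        if record == "ATOM" or res_name == ligand:
            chains[chain_id].append(line)
    # Assumption: a chain is only useful if it carries the ligand
    return {
        cid: "\n".join(lines + ["END"]) + "\n"
        for cid, lines in chains.items()
        if any(l.startswith("HETATM") for l in lines)
    }
```

Each value of the returned dictionary could then be written to its own file, for example one file per chain under *./structures/pdb_clean*.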

***getlig.py***
This is a python script to retrieve molecules from the [PubChem](https://pubchem.ncbi.nlm.nih.gov) database that meet a given similarity threshold relative to the input molecule. The retrieved molecules are stored in the *./ligand* directory.

The arguments of *getlig.py* are:

* '--smiles': The canonical SMILES string of the reference compound.
* '--output_dir': The directory to save output files.
* '--threshold': The minimum similarity threshold (%) compared with the reference.

```bash
# Example usage 
# Search for molecules with at least 75% similarity to NAD
python getlig.py --smiles 'NC(=O)C1=CN(C=CC1)[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OC[C@H]2O[C@H]([C@H](O)[C@@H]2O)n2cnc3c(N)ncnc23)[C@@H](O)[C@H]1O' --threshold 75
```
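
To illustrate what a percentage threshold means in a 2D similarity search, here is a minimal sketch of a Tanimoto comparison between fingerprints. It is an assumption-laden illustration, not the code in *getlig.py*: the similarity search itself is performed by PubChem, and the function names and the representation of a fingerprint as a set of on-bit indices are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)


def filter_by_threshold(reference, candidates, threshold_percent):
    """Return the names of candidates at least `threshold_percent` similar to the reference.

    `candidates` maps a (hypothetical) molecule name to its fingerprint bits.
    """
    cutoff = threshold_percent / 100.0
    return [
        name
        for name, bits in candidates.items()
        if tanimoto(reference, bits) >= cutoff
    ]
```

With `--threshold 75`, only molecules whose coefficient with the reference is 0.75 or higher would be kept under this scheme.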

***grid_gen.knwf***
This is a knime workflow used for docking grid generation with glide. The workflow can run grid generation in parallel. One shortcoming is that all the pdb structures are read into local memory. On my computer with **64GB** of memory, over **2000 pdb files** can be read without running out of memory.

Another important note is that this workflow runs glide grid generation in parallel, so a suitable number should be set in the *parallel Chunk Start* node. For my computer, **16** is a suitable number and **24** is the maximum.

This workflow will output the processed protein structure in pdb format (*./structures/pdb_processed*) and the grid file in zip format (*./structures/pdb_grid*).

***docking_with_grid.knwf***
This is a knime workflow used for cross docking with glide. Like the *grid_gen* workflow, this workflow can also run in parallel, so setting a suitable number in the *parallel Chunk Start* node is important. For my computer, **8** is a suitable number and **12** is the maximum for standard precision (sp); **3** and **6** are the minimum and maximum for extra precision (xp). **WE HIGHLY RECOMMEND THAT YOU START WITH A SMALL NUMBER!**

This workflow outputs the protein and docked ligand in mae format (*./outputs/protein_names*). A summary of the docking descriptors is also exported as a csv file (*./summary/docking_summary.csv*).

> <b>NOTE</b>:  
> You can open a knime workspace in the current folder by running the following command:
>
> ```bash
> # Open knime in current workspace and without 'Example Workflow' folder
> knime -data . --nosplash
> ```
>Then click *File* and import the necessary workflows.

***data_analysis.py***
This is a python script to analyze the results output by the *docking_with_grid* workflow. Because the script reads those results directly, it does not take any arguments. The docking results are visualized as the docking score distribution of each ligand (*./summary/docking_scores_distribution.png*): the x-axis represents the docking score, the y-axis represents the ligands, and each data point shown in a box is a protein target. In addition, an output.txt file (*./summary/output.txt*) records the top ten protein targets of each ligand, as well as a final summary of the proteins in the top 25% and the bottom 25% across all ligands.

```bash
# Example usage 
python data_analysis.py
```
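
The "top ten protein targets of each ligand" step can be sketched as follows. This is a minimal illustration, not the code in *data_analysis.py*: the function name and the column names `ligand`, `protein` and `docking_score` are hypothetical and would need to match the real header of *docking_summary.csv*. It assumes that a more negative docking score indicates a better pose, as is the convention for glide.

```python
import csv
from collections import defaultdict


def top_targets_per_ligand(csv_file, n=10):
    """Return {ligand: [(protein, score), ...]} with the n best scores first."""
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Hypothetical column names; adjust them to the real CSV header
        scores[row["ligand"]].append(
            (row["protein"], float(row["docking_score"]))
        )
    # Sort ascending so the most negative (best) score comes first
    return {
        ligand: sorted(pairs, key=lambda pair: pair[1])[:n]
        for ligand, pairs in scores.items()
    }
```

The function takes an open file object, so it could be called on the summary file opened with `open("./summary/docking_summary.csv", newline="")`.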

***file_converter.py***
This is a python script that takes the output of ***docking_with_grid.knwf*** as input and converts each poseviewer file into a single pdb file. These pdb files can be used for secondary docking.

```bash
# Example usage 
python file_converter.py
```