QSAR Model Development Using DTC Lab. Software Tools




Jump to the Tool

1. Normalization
*NormalizeTheData 1.0

2. Data Pre-Treatment
*Data PreTreatment 1.2

3. Dataset Division
*Dataset Division GUI 1.2
Clustering Tool
*Modified k-Medoids GUI

4. Model Development
*Double Cross-Validation
*Partial Least Squares
*StepWise MLR
*Genetic Algorithm (MAE-based Fitness Function)
*MLR-BestSubsetSelection

5. Model Validation
*MLRplusValidation
*XternalValidation
*MLR Y-Randomization
*Intelligent Consensus Predictor

6. Applicability Domain
*AD using std. approach
*AD-MDI

7. Nano-Profiling
*NanoProfiler

Note: If Java is not installed on your computer, please install Java (click here) before using the following software tools.
**To become a registered user (free of charge) of this site for academic/commercial purposes, kindly download (click here to download) and sign a one-time License Agreement Form and send it to kunalroy_in@yahoo.com
**If you have any queries regarding the software tools, please feel free to contact us at kunalroy_in@yahoo.com
**List of research articles citing this website: click here.
**Please cite the reference article(s) of the respective tools, along with the website link.
**If your input file is very large (e.g., >20,000 rows) or a large output file may be generated, prefer the .csv file type over .xlsx and .xls. The xls/xlsx file types have a memory limit that may cause incomplete execution of the program (the program will stop and throw a Java heap-space error); this can be avoided (up to a limit) by using a .csv file.




**The DTC Lab Tools Supplementary Site (tools developed in 2021 onwards) is accessible here: Website Link.


DTC-QSAR is a complete modeling package providing a user-friendly, easy-to-use GUI to develop regression-based (MLR, PLS) and classification-based (LDA and Random Forest) QSAR models. It includes two well-known variable selection techniques, the genetic algorithm and best subset selection. It also provides a 'screening module' to screen or predict the response values/classes of query compounds using already developed QSAR models. All the major steps (data pre-treatment, dataset division, feature selection, model development, validation and applicability domain determination) are included in this software package.


The main purpose of this software is to provide a user-friendly, easy-to-use GUI to develop classification-based QSAR models (Genetic Algorithm-LDA, Best Subset Selection-LDA and Random Forest). It also provides a 'screening module' to screen or predict the response class of query compounds using already developed LDA or RF models.
Important Note: If you have already downloaded the DTC-QSAR software, you do not need to download this one, since it provides no additional functionality.


As the name suggests, this tool is dedicated to QSAR modeling of small datasets. It employs a (modified) double cross-validation approach and a set of optimum model selection techniques for small-dataset QSAR modeling. It performs four basic steps: i) data pre-treatment, ii) model development using the double cross-validation approach, iii) selection of the optimum model and iv) model validation (both internal and external).


The "Prediction Reliability Indicator" tool indicates or categorizes the quality of predictions for the test set (known experimental response) or an external set (unknown experimental response) into three groups: good (composite score 3), moderate (composite score 2) and bad (composite score 1).


This tool judges the performance of three 'new consensus predictions' and compares them with the prediction quality obtained from the individual (MLR) models and the 'original consensus' predictions.


The Partial Least Squares 1.0 tool can be used to develop QSAR models using the partial least squares (PLS) technique, and further to validate the developed PLS model by computing various internal and external validation metrics. The tool follows the Non-linear Iterative Partial Least Squares (NIPALS) algorithm as described in the literature [Ref. 1]. The optimal number of components for the PLS model can be defined by the user or decided by the tool based on the Q^2 (LOO) value judged after the addition of each component: the program stops when adding another component does not increase the Q^2 (LOO) value for the training set by at least 5%.
Note that although a single PLS model can be built for multiple correlated Y (dependent) variables, the present tool allows only a single Y variable. This limitation will be addressed in the next version of the tool.
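
The stopping rule can be illustrated with a short sketch (a minimal example, not taken from the tool's source; the Q^2 values below are hypothetical, and reading "at least 5%" as a relative improvement is our assumption):

// Sketch of the Q^2(LOO)-based stopping rule described above. The q2 values
// are hypothetical; in a real run they would come from NIPALS-PLS models
// with 1, 2, 3, ... components.
public class PlsComponentSelection {
    public static void main(String[] args) {
        // Hypothetical Q^2(LOO) values for PLS models with 1, 2, 3, 4 components.
        double[] q2 = {0.52, 0.61, 0.63, 0.62};
        int chosen = 1;
        for (int c = 2; c <= q2.length; c++) {
            double previous = q2[c - 2];
            double current = q2[c - 1];
            // Keep adding components only while Q^2(LOO) improves by at least 5%.
            if (current < previous * 1.05) break;
            chosen = c;
        }
        System.out.println("Optimal number of PLS components: " + chosen); // -> 2
    }
}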

Reference:
1. Wold, Svante, Michael Sjöström, and Lennart Eriksson. "PLS-regression: a basic tool of chemometrics." Chemometrics and Intelligent Laboratory Systems 58.2 (2001): 109-130.


Massive screening of large chemical libraries against panels of biological targets has led to the rapid expansion of publicly available databases such as ChEMBL, PubChem and BindingDB. A basic assumption of any cheminformatics study is the accuracy of the input data available in these databases. However, one should be concerned about the poor quality and irreproducibility of both the chemical and the biological records present in such databases. Curating both chemical and biological data, i.e., verifying the accuracy, consistency and reproducibility of the reported experimental data, is critical to the success of any cheminformatics study, including Quantitative Structure-Activity Relationship (QSAR) modeling.


The double cross-validation process comprises two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all data objects are divided into two subsets referred to as the training and test sets. The training set is used in the inner (internal) loop for model building and model selection, while the test set is exclusively used for model assessment. In the internal loop, the training set is repeatedly split into calibration and validation sets: the calibration objects are used to develop different models, whereas the validation objects are used to estimate the models' errors. The model with the lowest prediction errors on the validation sets is selected, and the test objects in the outer loop are then employed to assess the predictive performance of the selected model. These multiple splits of the training set into calibration and validation sets avoid the bias that variable selection would acquire from a single training set of fixed composition.
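
A minimal sketch of this nested scheme, under a simplified setting we assume for illustration (the candidate "models" are single-descriptor linear regressions on synthetic data; fit and rmse are our own helper names):

import java.util.Random;

// Double cross-validation sketch: the inner loop repeatedly splits the
// training set into calibration/validation sets to select the best candidate
// model; the outer test set is used only for the final assessment.
public class DoubleCrossValidation {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 30, d = 4;
        double[][] x = new double[n][d];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {                 // synthetic data: y depends on descriptor 2
            for (int j = 0; j < d; j++) x[i][j] = rnd.nextDouble();
            y[i] = 3.0 * x[i][2] + 0.1 * rnd.nextGaussian();
        }
        int nTest = 6, nTrain = n - nTest;            // outer loop: one train/test split
        int bestDesc = -1; double bestErr = Double.MAX_VALUE;
        for (int j = 0; j < d; j++) {                 // inner loop: model selection
            double err = 0;
            for (int rep = 0; rep < 20; rep++) {      // repeated calibration/validation splits
                boolean[] inVal = new boolean[nTrain];
                for (int k = 0; k < nTrain / 4; k++) inVal[rnd.nextInt(nTrain)] = true;
                double[] fit = fit(x, y, j, inVal, false, nTrain);
                err += rmse(x, y, j, fit, inVal, true, nTrain);
            }
            if (err < bestErr) { bestErr = err; bestDesc = j; }
        }
        boolean[] none = new boolean[nTrain];
        double[] finalFit = fit(x, y, bestDesc, none, false, nTrain);
        double testErr = 0;                           // assess on the held-out test objects
        for (int i = nTrain; i < n; i++) {
            double p = finalFit[0] + finalFit[1] * x[i][bestDesc];
            testErr += (y[i] - p) * (y[i] - p);
        }
        System.out.printf("selected descriptor %d, test RMSE %.3f%n",
                          bestDesc, Math.sqrt(testErr / nTest));
    }

    // Least-squares fit of y = a + b*x[j] over rows where inVal[i] == useVal.
    static double[] fit(double[][] x, double[] y, int j, boolean[] inVal, boolean useVal, int nTrain) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0; int m = 0;
        for (int i = 0; i < nTrain; i++) {
            if (inVal[i] != useVal) continue;
            sx += x[i][j]; sy += y[i]; sxx += x[i][j] * x[i][j]; sxy += x[i][j] * y[i]; m++;
        }
        double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
        return new double[]{(sy - b * sx) / m, b};
    }

    static double rmse(double[][] x, double[] y, int j, double[] fit, boolean[] inVal, boolean useVal, int nTrain) {
        double s = 0; int m = 0;
        for (int i = 0; i < nTrain; i++) {
            if (inVal[i] != useVal) continue;
            double p = fit[0] + fit[1] * x[i][j];
            s += (y[i] - p) * (y[i] - p); m++;
        }
        return m == 0 ? 0 : Math.sqrt(s / m);
    }
}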


A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Genetic algorithms belong to the larger class of evolutionary algorithms (EAs), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation and selection. The Genetic Algorithm tool applies a genetic algorithm to select significant variables (descriptors) during QSAR model development, with a fitness function based on recently reported MAE-based criteria. Note that in version 4.1 and above, you can also perform process validation.
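
The following skeleton shows the GA mechanics (tournament selection, one-point crossover, bit-flip mutation) on descriptor bit strings. Note that the tool's actual MAE-based fitness function is replaced here by a toy stand-in so the sketch stays self-contained:

import java.util.Arrays;
import java.util.Random;

// GA skeleton for descriptor selection. Chromosomes are bit strings
// (1 = descriptor selected). The toy fitness rewards matching a hidden
// "informative" subset; a real run would instead score each subset by the
// MAE of the regression model built from it.
public class GaVariableSelection {
    static final int DESCRIPTORS = 12, POP = 20, GENERATIONS = 40;
    static final double CROSSOVER = 0.8, MUTATION = 0.05;
    static final Random RND = new Random(7);
    static final boolean[] INFORMATIVE = new boolean[DESCRIPTORS];

    public static void main(String[] args) {
        INFORMATIVE[1] = INFORMATIVE[4] = INFORMATIVE[9] = true;  // hidden target subset
        boolean[][] pop = new boolean[POP][DESCRIPTORS];
        for (boolean[] c : pop) for (int g = 0; g < DESCRIPTORS; g++) c[g] = RND.nextBoolean();
        for (int gen = 0; gen < GENERATIONS; gen++) {
            boolean[][] next = new boolean[POP][];
            for (int k = 0; k < POP; k++) {
                boolean[] a = tournament(pop), b = tournament(pop);
                boolean[] child = a.clone();
                if (RND.nextDouble() < CROSSOVER) {               // one-point crossover
                    int cut = RND.nextInt(DESCRIPTORS);
                    for (int g = cut; g < DESCRIPTORS; g++) child[g] = b[g];
                }
                for (int g = 0; g < DESCRIPTORS; g++)             // bit-flip mutation
                    if (RND.nextDouble() < MUTATION) child[g] = !child[g];
                next[k] = child;
            }
            pop = next;
        }
        boolean[] best = pop[0];
        for (boolean[] c : pop) if (fitness(c) > fitness(best)) best = c;
        System.out.println("best subset: " + Arrays.toString(best));
    }

    // Toy fitness: +1 per correctly included/excluded descriptor.
    static int fitness(boolean[] c) {
        int score = 0;
        for (int g = 0; g < DESCRIPTORS; g++) if (c[g] == INFORMATIVE[g]) score++;
        return score;
    }

    static boolean[] tournament(boolean[][] pop) {                // size-2 tournament selection
        boolean[] a = pop[RND.nextInt(POP)], b = pop[RND.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }
}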


Bias-Variance Estimator is a software tool for understanding the contributions of the two components of prediction error, namely the bias (systematic) error and the variance (random) error, for a developed model. The tool employs bootstrapping as a re-sampling technique. A model with high bias (systematic error) should be discarded, and one should try to reduce such bias by redeveloping the model with a more appropriate functional form.
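
A toy sketch of the bootstrap-based decomposition, assuming a deliberately misspecified constant-mean "model" on data with a linear trend so that the bias component is clearly visible:

import java.util.Random;

// Bootstrap sketch of the bias/variance decomposition of prediction error:
// the model is refitted on bootstrap resamples; the spread of its predictions
// at a query point gives the variance component, and the gap between its
// average prediction and the true value gives the (squared) bias.
public class BiasVarianceBootstrap {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        int n = 50, boots = 1000;
        double[] x = new double[n], y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = i / (double) n;
            y[i] = 2.0 * x[i] + 0.1 * rnd.nextGaussian();   // true relationship: y = 2x
        }
        double xQuery = 0.9, yTrue = 2.0 * xQuery;          // query point for the decomposition
        double[] preds = new double[boots];
        for (int b = 0; b < boots; b++) {                   // refit on each bootstrap resample
            double mean = 0;
            for (int i = 0; i < n; i++) mean += y[rnd.nextInt(n)];
            preds[b] = mean / n;                            // constant-mean "model" prediction
        }
        double avg = 0;
        for (double p : preds) avg += p;
        avg /= boots;
        double variance = 0;
        for (double p : preds) variance += (p - avg) * (p - avg);
        variance /= boots;
        double bias2 = (avg - yTrue) * (avg - yTrue);
        System.out.printf("bias^2 = %.4f (systematic), variance = %.4f (random)%n", bias2, variance);
    }
}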


MLR BestSubsetSelection selects the best descriptor combination out of a set of descriptors by evaluating all possible combinations of descriptors in the input file. Along with conventional parameters such as R^2, Q^2, Q^2F1 and Q^2F2, the prediction quality for the training as well as the test set is judged using recently reported MAE-based criteria (see the reference article for the Xternal Validation Plus tool below). Further, using the MAE-based metrics, a QSAR score is computed that can be used to select the best QSAR models in terms of prediction quality. The user can define an r^2 cut-off and an inter-correlation cut-off, which are useful for reducing the computational time and for removing models with inter-correlated descriptors, respectively.
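
A compact sketch of the exhaustive search, restricted for brevity to one- and two-descriptor subsets and ranked by plain R^2 instead of the tool's MAE-based QSAR score (the two-predictor R^2 is computed from pairwise correlations via the standard closed-form formula):

import java.util.Random;

// Best-subset sketch: every descriptor combination is enumerated, combinations
// containing an inter-correlated descriptor pair are skipped, and the rest are
// ranked by R^2.
public class BestSubsetSelection {
    public static void main(String[] args) {
        Random rnd = new Random(3);
        int n = 40, d = 5;
        double[][] x = new double[n][d];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) x[i][j] = rnd.nextDouble();
            x[i][4] = x[i][0] + 0.01 * rnd.nextGaussian();      // descriptor 4 duplicates descriptor 0
            y[i] = 1.5 * x[i][0] - 2.0 * x[i][3] + 0.1 * rnd.nextGaussian();
        }
        double interCorrCut = 0.9;
        String best = ""; double bestR2 = -1;
        for (int a = 0; a < d; a++) {                           // single-descriptor models
            double r2 = Math.pow(corrXY(x, y, a), 2);
            if (r2 > bestR2) { bestR2 = r2; best = "{" + a + "}"; }
        }
        for (int a = 0; a < d; a++)                             // two-descriptor models
            for (int b = a + 1; b < d; b++) {
                double rab = corrXX(x, a, b);
                if (Math.abs(rab) > interCorrCut) continue;     // inter-correlated pair: skip
                double ra = corrXY(x, y, a), rb = corrXY(x, y, b);
                double r2 = (ra*ra + rb*rb - 2*ra*rb*rab) / (1 - rab*rab);
                if (r2 > bestR2) { bestR2 = r2; best = "{" + a + "," + b + "}"; }
            }
        System.out.printf("best subset %s with R^2 = %.3f%n", best, bestR2);
    }

    static double corrXY(double[][] x, double[] y, int a) {
        double[] col = new double[x.length];
        for (int i = 0; i < x.length; i++) col[i] = x[i][a];
        return pearson(col, y);
    }
    static double corrXX(double[][] x, int a, int b) {
        double[] ca = new double[x.length], cb = new double[x.length];
        for (int i = 0; i < x.length; i++) { ca[i] = x[i][a]; cb[i] = x[i][b]; }
        return pearson(ca, cb);
    }
    static double pearson(double[] u, double[] v) {
        int n = u.length; double su=0, sv=0, suu=0, svv=0, suv=0;
        for (int i = 0; i < n; i++) { su+=u[i]; sv+=v[i]; suu+=u[i]*u[i];
                                      svv+=v[i]*v[i]; suv+=u[i]*v[i]; }
        return (n*suv - su*sv) / Math.sqrt((n*suu - su*su) * (n*svv - sv*sv));
    }
}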


Xternal Validation Plus is a tool that computes all the required external validation parameters and further judges the prediction quality of a QSAR model based on MAE-based criteria.
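
A short sketch of typical external validation metrics (Q^2_F1, Q^2_F2 and MAE) on hypothetical test-set predictions; the tool's MAE-based classification of prediction quality is reduced here to a plain MAE report:

// Q^2_F1 scales the prediction error by deviations from the training-set mean,
// Q^2_F2 by deviations from the test-set mean.
public class ExternalValidation {
    public static void main(String[] args) {
        double[] yTrain = {5.1, 5.9, 6.4, 7.0, 7.7};            // experimental, training set
        double[] yTest  = {5.5, 6.2, 7.3};                      // experimental, test set
        double[] yPred  = {5.4, 6.5, 7.1};                      // predicted for the test set
        double trainMean = mean(yTrain), testMean = mean(yTest);
        double press = 0, sdF1 = 0, sdF2 = 0, mae = 0;
        for (int i = 0; i < yTest.length; i++) {
            press += Math.pow(yTest[i] - yPred[i], 2);
            sdF1  += Math.pow(yTest[i] - trainMean, 2);
            sdF2  += Math.pow(yTest[i] - testMean, 2);
            mae   += Math.abs(yTest[i] - yPred[i]);
        }
        System.out.printf("Q2_F1 = %.3f, Q2_F2 = %.3f, MAE = %.3f%n",
                          1 - press / sdF1, 1 - press / sdF2, mae / yTest.length);
    }
    static double mean(double[] v) { double s = 0; for (double x : v) s += x; return s / v.length; }
}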


Applicability Domain (using the standardization approach) is a tool to find compounds (test set/query compounds) that fall outside the applicability domain of the built QSAR model; it also detects outliers among the training set compounds.
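
A minimal sketch of the standardization rule as commonly described for this approach; the 3-sigma cut-off and the mean + 1.28 x SD handling of the borderline case reflect our reading and may differ in detail from the tool:

// Descriptors of a query compound are standardized with the training-set mean
// and SD; the compound is flagged as outside the domain when its standardized
// values exceed the cut-off.
public class AdStandardization {
    public static void main(String[] args) {
        double[][] train = {{1.0, 10.0}, {1.2, 12.0}, {0.8, 11.0}, {1.1, 9.5}};
        double[] query = {1.05, 25.0};                          // descriptor values of a query compound
        int d = query.length, n = train.length;
        double[] s = new double[d];
        for (int j = 0; j < d; j++) {
            double mean = 0, sd = 0;
            for (double[] row : train) mean += row[j];
            mean /= n;
            for (double[] row : train) sd += (row[j] - mean) * (row[j] - mean);
            sd = Math.sqrt(sd / (n - 1));
            s[j] = Math.abs(query[j] - mean) / sd;              // standardized deviation
        }
        double max = 0, min = Double.MAX_VALUE, avg = 0;
        for (double v : s) { max = Math.max(max, v); min = Math.min(min, v); avg += v; }
        avg /= d;
        boolean inside;
        if (max <= 3) inside = true;                            // every descriptor within 3 SD
        else if (min > 3) inside = false;                       // every descriptor beyond 3 SD
        else {
            double var = 0;
            for (double v : s) var += (v - avg) * (v - avg);
            double sNew = avg + 1.28 * Math.sqrt(var / (d - 1)); // borderline case: mean + 1.28*SD
            inside = sNew <= 3;
        }
        System.out.println("query inside AD? " + inside);
    }
}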


NanoProfiler (endpoint-dependent analogue identification software) is a tool to predict different properties of nanoparticles using nano-QSAR models already reported in the literature (the nano-QSAR models are stored in a database file supplied with the tool); it then performs clustering to find analogues based on the predicted property. Three clustering methods are included for analogue identification: the k-medoids algorithm (slow and exhaustive; searches for the best 'k' medoids), the modified k-medoids algorithm (fast; searches for optimum 'k' medoids) and a Euclidean distance-based method.


The Applicability Domain-Model Disturbance Index (AD-MDI) program is a tool to define the applicability domain (AD) of unknown samples based on a concept proposed by Yan et al. (see reference below). For more information on defining the AD and finding the AD of query compounds, please read the article (see reference below). This program is an updated version of a previous release (now removed); the only difference is that the updated version can optionally determine the AD of query compounds, which the previous version could not.

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder contains three sub-folders, "Data", "Lib" and "Output". For convenience, input files may be kept in the "Data" folder and output files saved in the "Output" folder; the "Lib" folder contains the library files required to run the program. Check the format of the training and test set input files (.xlsx/.xls/.csv) and the query file (.xlsx/.xls/.csv) before using the program (sample files are provided in the Data folder). *A manual is provided in the program folder.

File Format: Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Articles for AD-MDI Tool
1. Yan, Jun, et al. "A Combinational Strategy of Model Disturbance and Outlier Comparison to Define Applicability Domain in Quantitative Structural Activity Relationship." Molecular Informatics (2014).
2. Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). "NanoBRIDGES" software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems, 147, 1-13. doi:10.1016/j.chemolab.2015.07.007


The Stepwise MLR tool performs stepwise multiple linear regression using two methods: 1) using an alpha value, 2) using an F value. The user can also select a data pre-treatment option to remove constant and inter-correlated descriptors prior to performing stepwise MLR. Three output files are generated: 1) LogFile.txt, listing the names of the descriptors (constant and/or inter-correlated) removed based on the variance and correlation-coefficient cut-offs; 2) SMLR.txt, with information on the descriptors selected/removed at each step, along with validation parameters, based on the F-value (F-to-Enter, F-to-Remove) or alpha-value (alpha-to-Enter, alpha-to-Remove) cut-offs; 3) an xlsx/xls/csv file containing the set of descriptors selected (along with the activity/property column) after performing stepwise MLR.
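
A simplified sketch of the forward-selection step with an F-to-enter cut-off (the backward F-to-remove pass and the alpha-value mode of the tool are omitted for brevity; the data and cut-off are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Forward stepwise MLR: at each step the descriptor giving the largest drop in
// the residual sum of squares (SSE) is added, provided its partial F value
// exceeds the F-to-enter cut-off.
public class StepwiseMlrForward {
    public static void main(String[] args) {
        Random rnd = new Random(5);
        int n = 40, d = 5;
        double[][] x = new double[n][d];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) x[i][j] = rnd.nextDouble();
            y[i] = 2.0 * x[i][1] - 1.5 * x[i][3] + 0.1 * rnd.nextGaussian();
        }
        double fEnter = 4.0;
        List<Integer> selected = new ArrayList<>();
        double sseOld = sse(x, y, selected);
        while (selected.size() < d) {
            int bestJ = -1; double bestSse = Double.MAX_VALUE;
            for (int j = 0; j < d; j++) {                        // try each remaining descriptor
                if (selected.contains(j)) continue;
                selected.add(j);
                double s = sse(x, y, selected);
                selected.remove(selected.size() - 1);
                if (s < bestSse) { bestSse = s; bestJ = j; }
            }
            int p = selected.size() + 1;                         // predictor count if bestJ enters
            double f = (sseOld - bestSse) / (bestSse / (n - p - 1)); // partial F-to-enter
            if (f < fEnter) break;
            selected.add(bestJ);
            sseOld = bestSse;
        }
        System.out.println("descriptors selected by forward stepwise MLR: " + selected);
    }

    // Residual sum of squares of the OLS fit on the given descriptor subset.
    static double sse(double[][] x, double[] y, List<Integer> cols) {
        int n = x.length, p = cols.size() + 1;                   // +1 for intercept
        double[][] xs = new double[n][p];
        for (int i = 0; i < n; i++) {
            xs[i][0] = 1.0;
            for (int c = 0; c < cols.size(); c++) xs[i][c + 1] = x[i][cols.get(c)];
        }
        double[][] a = new double[p][p]; double[] b = new double[p];
        for (int i = 0; i < n; i++)
            for (int r = 0; r < p; r++) {
                b[r] += xs[i][r] * y[i];
                for (int c = 0; c < p; c++) a[r][c] += xs[i][r] * xs[i][c];
            }
        double[] w = solve(a, b);                                // normal equations (X^T X) w = X^T y
        double s = 0;
        for (int i = 0; i < n; i++) {
            double pred = 0;
            for (int r = 0; r < p; r++) pred += w[r] * xs[i][r];
            s += (y[i] - pred) * (y[i] - pred);
        }
        return s;
    }

    static double[] solve(double[][] a, double[] b) {            // Gaussian elimination
        int p = b.length;
        for (int col = 0; col < p; col++) {
            int piv = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            double[] t = a[col]; a[col] = a[piv]; a[piv] = t;
            double tb = b[col]; b[col] = b[piv]; b[piv] = tb;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < p; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] w = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < p; c++) s -= a[r][c] * w[c];
            w[r] = s / a[r][r];
        }
        return w;
    }
}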


Modified k-Medoids is a simple and fast algorithm for k-medoids clustering (see reference below). The algorithm is a local heuristic that updates the medoids much as k-means clustering updates its centroids. It selects the k most central ("middle") objects as initial medoids, calculates the distance matrix once and reuses it to find new medoids at every iterative step.
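
A minimal sketch of the algorithm as we read Ref. 1 below (distance matrix computed once, "most middle" initialization via normalized distance sums, k-means-style medoid updates):

import java.util.Arrays;

// Park-Jun-style k-medoids: initial medoids are the k objects with the
// smallest normalized distance sums; medoids are then updated to the most
// central member of each cluster until the clustering stabilizes.
public class ModifiedKMedoids {
    public static void main(String[] args) {
        double[][] pts = {{1,1},{1.2,0.9},{0.8,1.1},{5,5},{5.2,4.9},{4.8,5.1}};
        int n = pts.length, k = 2;
        double[][] dist = new double[n][n];                    // distance matrix, computed once
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                dist[i][j] = Math.hypot(pts[i][0]-pts[j][0], pts[i][1]-pts[j][1]);
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) for (int l = 0; l < n; l++) rowSum[i] += dist[i][l];
        double[] v = new double[n];                            // v_j = sum_i d_ij / sum_l d_il
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) v[j] += dist[i][j] / rowSum[i];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        int[] medoids = new int[k];                            // k most "middle" objects
        for (int c = 0; c < k; c++) medoids[c] = order[c];
        int[] assign = new int[n];
        boolean changed = true;
        while (changed) {
            for (int i = 0; i < n; i++) {                      // assign to nearest medoid
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist[i][medoids[c]] < dist[i][medoids[best]]) best = c;
                assign[i] = best;
            }
            changed = false;
            for (int c = 0; c < k; c++) {                      // update: most central member
                int best = medoids[c]; double bestCost = Double.MAX_VALUE;
                for (int cand = 0; cand < n; cand++) {
                    if (assign[cand] != c) continue;
                    double cost = 0;
                    for (int i = 0; i < n; i++) if (assign[i] == c) cost += dist[cand][i];
                    if (cost < bestCost) { bestCost = cost; best = cand; }
                }
                if (best != medoids[c]) { medoids[c] = best; changed = true; }
            }
        }
        System.out.println("medoids: " + Arrays.toString(medoids)
                           + ", assignment: " + Arrays.toString(assign));
    }
}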

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder contains three sub-folders, "Data", "Lib" and "Output". For convenience, input files may be kept in the "Data" folder and output files saved in the "Output" folder; the "Lib" folder contains the library files required to run the program. Check the format of the input file (.xlsx/.xls/.csv) before using the program (a sample file is provided in the Data folder). *A manual is provided in the program folder.

File Format: Compound number (first column), Descriptors (Subsequent Columns)

Reference Article for Modified k-Medoid Tool
1. Park, Hae-Sang, and Chi-Hyuck Jun. "A simple and fast algorithm for K-medoids clustering." Expert Systems with Applications 36.2 (2009): 3336-3341.
2. Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). "NanoBRIDGES" software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems, 147, 1-13. doi:10.1016/j.chemolab.2015.07.007


Data PreTreatment 1.2 (using the V-WSP algorithm): removes constant and highly correlated descriptors based on user-specified variance and correlation cut-off values, using the V-WSP algorithm (see reference below).
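
A simplified sketch in the spirit of this pre-treatment, i.e., a plain variance/correlation filter; the published V-WSP procedure itself is more elaborate (see the references below):

import java.util.ArrayList;
import java.util.List;

// Pre-treatment filter: near-constant columns (variance below a cut-off) are
// dropped, then each remaining descriptor is removed if it is too correlated
// with an already-retained one.
public class PreTreatmentFilter {
    public static void main(String[] args) {
        double[][] x = {
            {1.0, 5.0, 2.0, 5.1}, {1.0, 6.0, 2.9, 6.2},
            {1.0, 7.0, 4.1, 7.1}, {1.0, 8.0, 5.0, 7.9}};       // column 0 constant, column 3 ~ column 1
        double varCut = 0.0001, corrCut = 0.95;
        int d = x[0].length;
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < d; j++) {
            if (variance(x, j) < varCut) continue;              // drop (near-)constant descriptor
            boolean redundant = false;
            for (int r : kept)
                if (Math.abs(pearson(x, j, r)) > corrCut) { redundant = true; break; }
            if (!redundant) kept.add(j);                        // drop inter-correlated descriptor
        }
        System.out.println("retained descriptor columns: " + kept);
    }

    static double variance(double[][] x, int j) {
        double m = 0; for (double[] row : x) m += row[j]; m /= x.length;
        double v = 0; for (double[] row : x) v += (row[j]-m)*(row[j]-m);
        return v / (x.length - 1);
    }

    static double pearson(double[][] x, int a, int b) {
        int n = x.length; double sa=0, sb=0, saa=0, sbb=0, sab=0;
        for (double[] row : x) { sa+=row[a]; sb+=row[b]; saa+=row[a]*row[a];
                                 sbb+=row[b]*row[b]; sab+=row[a]*row[b]; }
        return (n*sab - sa*sb) / Math.sqrt((n*saa - sa*sa) * (n*sbb - sb*sb));
    }
}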

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder contains three sub-folders, "Data", "Lib" and "Output". For convenience, input files may be kept in the "Data" folder and output files saved in the "Output" folder; the "Lib" folder contains the library files required to run the program. Check the format of the input file (.xlsx/.xls/.csv) before using the program (a sample file is provided in the Data folder). *A manual is provided in the program folder.

File Format: Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Article for Data Pretreatment (using V-WSP algorithm) Tool
1. Ballabio, Davide, et al. "A novel variable reduction method adapted from space-filling designs." Chemometrics and Intelligent Laboratory Systems (2014).
2. Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). "NanoBRIDGES" software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems, 147, 1-13. doi:10.1016/j.chemolab.2015.07.007


Dataset Division 1.2 is a user-friendly application tool that offers three different methods for dividing a dataset into training and test sets: Kennard-Stone-based, Euclidean distance-based (or diversity-based) and activity/property-based division.
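
A minimal sketch of the Kennard-Stone selection (Ref. 1 below) on toy 2-D descriptor data: seed with the two most distant compounds, then repeatedly add the compound farthest from the already-selected set:

import java.util.ArrayList;
import java.util.List;

// Kennard-Stone: max-min selection of a representative training set; the
// compounds not selected form the test set.
public class KennardStone {
    public static void main(String[] args) {
        double[][] x = {{0,0},{1,0},{0,1},{5,5},{6,5},{5,6},{3,3}};
        int nTrain = 4;
        List<Integer> train = new ArrayList<>();
        int a = 0, b = 1; double dMax = -1;                     // seed: the two most distant points
        for (int i = 0; i < x.length; i++)
            for (int j = i + 1; j < x.length; j++) {
                double dd = dist(x[i], x[j]);
                if (dd > dMax) { dMax = dd; a = i; b = j; }
            }
        train.add(a); train.add(b);
        while (train.size() < nTrain) {                         // max-min selection step
            int best = -1; double bestMin = -1;
            for (int i = 0; i < x.length; i++) {
                if (train.contains(i)) continue;
                double minD = Double.MAX_VALUE;
                for (int t : train) minD = Math.min(minD, dist(x[i], x[t]));
                if (minD > bestMin) { bestMin = minD; best = i; }
            }
            train.add(best);
        }
        System.out.println("training set compounds: " + train + " (rest form the test set)");
    }
    static double dist(double[] p, double[] q) { return Math.hypot(p[0]-q[0], p[1]-q[1]); }
}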

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder contains three sub-folders, "Data", "Lib" and "Output". For convenience, input files may be kept in the "Data" folder and output files saved in the "Output" folder; the "Lib" folder contains the library files required to run the program. Check the format of the input file (.xlsx/.xls/.csv) before using the program (a sample file is provided in the Data folder). *A manual is provided in the program folder.

File Format: Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Article for Dataset Division Tool
1. Kennard, Ronald W., and Larry A. Stone. "Computer aided design of experiments." Technometrics 11.1 (1969): 137-148.
2. "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?" published in J. Chem. Inf. Model.


The MLR plus Validation tool develops a QSAR model using the MLR technique and calculates internal and external validation parameters for the developed model. It further judges the test set predictions as GOOD, MODERATE or BAD based on the actual prediction errors. It also checks the Golbraikh and Tropsha model acceptability criteria and can optionally determine the applicability domain (AD) employing two available methods, i.e., the standardization approach and a Euclidean distance-based method. The user may also perform a Y-randomization test.


MLR Y-Randomization Test: This test checks the robustness of a QSAR model by building several random models in which the dependent variable is shuffled while the independent variables are kept as they are. To pass the test, the resultant random models are expected to have significantly low r^2 and q^2 values over several trials. Another parameter, cRp^2, is also calculated, which should be more than 0.5 for the model to pass.
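
A small sketch of the test, assuming a one-descriptor regression in place of the full MLR model and one common formulation of cRp^2 = r * sqrt(r^2 - r^2_rand); the tool's exact definition may differ:

import java.util.Random;

// Y-randomization: the dependent variable is repeatedly shuffled, the model is
// refitted each time, and the r^2 of these random models (which should stay
// near zero) is compared with that of the original model.
public class YRandomization {
    public static void main(String[] args) {
        Random rnd = new Random(11);
        int n = 25, trials = 100;
        double[] x = new double[n], y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = rnd.nextDouble();
            y[i] = 4.0 * x[i] + 0.2 * rnd.nextGaussian();
        }
        double r2Orig = r2(x, y);
        double r2RandSum = 0;
        double[] yShuffled = y.clone();
        for (int t = 0; t < trials; t++) {
            for (int i = n - 1; i > 0; i--) {                   // Fisher-Yates shuffle of y
                int j = rnd.nextInt(i + 1);
                double tmp = yShuffled[i]; yShuffled[i] = yShuffled[j]; yShuffled[j] = tmp;
            }
            r2RandSum += r2(x, yShuffled);
        }
        double r2Rand = r2RandSum / trials;
        double cRp2 = Math.sqrt(r2Orig) * Math.sqrt(r2Orig - r2Rand);
        System.out.printf("r2(model) = %.3f, mean r2(random) = %.3f, cRp^2 = %.3f%n",
                          r2Orig, r2Rand, cRp2);
    }

    static double r2(double[] x, double[] y) {                  // squared Pearson correlation
        int n = x.length; double sx=0, sy=0, sxx=0, syy=0, sxy=0;
        for (int i = 0; i < n; i++) { sx+=x[i]; sy+=y[i]; sxx+=x[i]*x[i];
                                      syy+=y[i]*y[i]; sxy+=x[i]*y[i]; }
        double r = (n*sxy - sx*sy) / Math.sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
        return r * r;
    }
}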

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: Keep your input file in the same folder as the yRandomization.jar file. Sample input files (.xlsx/.xls/.csv) are provided in the program folder. *A manual is provided in the program folder.

File Format: Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)


Normalization: Normalizes the data by scaling each column between 0 and 1.
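
A minimal sketch of the 0-1 (min-max) scaling applied per column (toy data for illustration):

// Every column is rescaled so that its smallest value maps to 0 and its
// largest to 1.
public class NormalizeTheData {
    public static void main(String[] args) {
        double[][] data = {{2.0, 100.0}, {4.0, 150.0}, {6.0, 300.0}};
        int cols = data[0].length;
        for (int j = 0; j < cols; j++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] row : data) { min = Math.min(min, row[j]); max = Math.max(max, row[j]); }
            for (double[] row : data)
                row[j] = (row[j] - min) / (max - min);          // maps column to [0, 1]
        }
        for (double[] row : data)
            System.out.printf("%.2f  %.2f%n", row[0], row[1]);
    }
}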

To Download and Run: Click on the download button (it will direct you to Google Drive), then press Ctrl+S (Windows) or Cmd+S (Mac) to save the .zip file. Extract the .zip file and click on the .jar file to run the program.

Note: Keep your input file (only the .csv file type is allowed) in the same folder as the normalizeTheData1.0.jar file. A sample input file is provided in the program folder.

File Format: All data columns undergo scaling.

All the above programs have been developed in Java and are validated on known data sets.

The website and the software tools are developed by:
Dr. Pravin Ambure (ambure.pharmait@gmail.com)
and
Prof. Kunal Roy (kunalroy_in@yahoo.com)
Drug Theoretics and Cheminformatics (DTC) Laboratory
Department of Pharmaceutical Technology
Jadavpur University
Kolkata -700032

Acknowledgment:
*The programmer is highly grateful to the Department of Biotechnology, Government of India, for providing financial assistance (2013-2015).

*The following software tools were developed during six months (March-August 2014) of participation in the international project "NanoBridges" at Gdansk University, Poland (click here), which received funding from the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme:
1. Stepwise MLR
2. Modified k-Medoid
3. vWSP
4. AD-MDI
5. Genetic Algorithm 1.4
6. Nano Profiler

Please send your feedback at: kunalroy_in@yahoo.com

Back To The Top