Quantitative structure–activity relationship

Quantitative structure–activity relationship models (QSAR models) are regression or classification models used in the chemical and biological sciences and engineering. Like other regression models, QSAR regression models relate a set of "predictor" variables (X) to the potency of the response variable (Y), while classification QSAR models relate the predictor variables to a categorical value of the response variable.

In QSAR modeling, the predictors consist of physico-chemical properties or theoretical molecular descriptors{{cite book |last1=Todeschini |first1=Roberto |last2=Consonni |first2=Viviana |title=Molecular Descriptors for Chemoinformatics |series=Methods and Principles in Medicinal Chemistry |year=2009 |volume=41 |publisher=Wiley |doi=10.1002/9783527628766 |isbn=978-3-527-31852-0 |url=https://onlinelibrary.wiley.com/doi/book/10.1002/9783527628766 |language=en}}{{cite book |last1=Mauri |first1=Andrea |last2=Consonni |first2=Viviana |last3=Todeschini |first3=Roberto |title=Handbook of Computational Chemistry |publisher=Springer International Publishing |isbn=978-3-319-27282-5 |pages=2065–2093 |url=https://link.springer.com/referenceworkentry/10.1007/978-3-319-27282-5_51 |language=en |chapter=Molecular Descriptors|year=2017 |doi=10.1007/978-3-319-27282-5_51 }} of chemicals; the QSAR response-variable could be a biological activity of the chemicals. QSAR models first summarize a supposed relationship between chemical structures and biological activity in a data-set of chemicals. Second, QSAR models predict the activities of new chemicals.{{cite book |last1=Roy |first1=Kunal |last2=Kar |first2=Supratik |last3=Das |first3=Rudra Narayan | name-list-style = vanc | chapter = Chapter 1.2: What is QSAR? Definitions and Formulism |title=A primer on QSAR/QSPR modeling: Fundamental Concepts |date=2015 |publisher=Springer-Verlag Inc |location=New York |isbn=978-3-319-17281-1 |pages=2–6 |chapter-url=https://books.google.com/books?id=FFcSCAAAQBAJ&pg=PA4 }}{{cite journal |last1=Ghasemi |first1=Pérez-Sánchez|last2=Mehri |first2= Pérez-Garrido |year=2018 |title= Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks |journal=Drug Discovery Today |volume=23 |issue=10 |pages=1784–1790 |doi=10.1016/j.drudis.2018.06.016|pmid=29936244|s2cid=49418479}}

Related terms include quantitative structure–property relationships (QSPR) when a chemical property is modeled as the response variable.{{cite journal | vauthors = Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V | title = A practical overview of quantitative structure-activity relationship | journal = Excli Journal | volume = 8 | pages = 74–88 | year = 2009 | doi = 10.17877/DE290R-690 }}{{cite journal | vauthors = Nantasenamat C, Isarankura-Na-Ayudhya C, Prachayasittikul V | title = Advances in computational methods to predict the biological activity of compounds | journal = Expert Opinion on Drug Discovery | volume = 5 | issue = 7 | pages = 633–54 | date = Jul 2010 | pmid = 22823204 | doi = 10.1517/17460441.2010.492827 | s2cid = 17622541 }}

"Different properties or behaviors of chemical molecules have been investigated in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs), quantitative structure–chromatography relationships (QSCRs) and, quantitative structure–toxicity relationships (QSTRs), quantitative structure–electrochemistry relationships (QSERs), and quantitative structure–biodegradability relationships (QSBRs)."{{cite journal | vauthors = Yousefinejad S, Hemmateenejad B | title = Chemometrics tools in QSAR/QSPR studies: A historical perspective | journal = Chemometrics and Intelligent Laboratory Systems | volume = 149, Part B | pages = 177–204 | year = 2015| doi = 10.1016/j.chemolab.2015.06.016}}

As an example, biological activity can be expressed quantitatively as the concentration of a substance required to give a certain biological response. Additionally, when physicochemical properties or structures are expressed by numbers, one can find a mathematical relationship, or quantitative structure–activity relationship, between the two. The mathematical expression, if carefully validated,{{cite journal | vauthors = Tropsha A, Gramatica P, Gombar VJ | author-link1 = Alexander Tropsha | title = The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models | journal = QSAR Comb. Sci. | volume = 22 | pages = 69–77 | year = 2003 | doi = 10.1002/qsar.200390007 }}{{cite journal | vauthors = Gramatica P | title = Principles of QSAR models validation: internal and external | journal = QSAR Comb. Sci. | volume = 26 | issue = 5 | pages = 694–701| year = 2007 | doi = 10.1002/qsar.200610151 | hdl = 11383/1668881 | hdl-access = free }}{{cite journal |last1=Ruusmann |first1=V. |last2=Sild |first2=S. |last3=Maran |first3=U. |date=2015 |title=QSAR DataBank repository: open and linked qualitative and quantitative structure–activity relationship models |journal=Journal of Cheminformatics |volume=7 |pages=32 | pmid=26110025 | doi=10.1186/s13321-015-0082-6|pmc=4479250 | doi-access=free}}{{cite journal | vauthors = Chirico N, Gramatica P | title = Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection | journal = Journal of Chemical Information and Modeling | volume = 52 | issue = 8 | pages = 2044–58 | date = Aug 2012 | pmid = 22721530 | doi = 10.1021/ci300084j }} can then be used to predict the modeled response of other chemical structures.{{cite journal|last1=Tropsha|first1=Alexander|author-link=Alexander Tropsha|title=Best Practices for QSAR Model Development, Validation, and Exploitation|journal=Molecular Informatics|volume=29|issue=6–7|year=2010|pages=476–488|issn=1868-1743|doi=10.1002/minf.201000061|pmid=27463326|s2cid=23564249}}

A QSAR has the form of a mathematical model:

Activity = f{{tsp}}(physicochemical properties and/or structural properties) + error

The error includes model error (bias) and observational variability, that is, the variability in observations even on a correct model.
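As a minimal sketch of this model form, the simplest choice of f is a straight line in a single descriptor, fitted by ordinary least squares. The descriptor and activity values below are hypothetical numbers chosen for illustration, and `fit_line` is an illustrative helper, not part of any QSAR package; real models usually involve many descriptors and other forms of f.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: one descriptor value and one measured activity per chemical.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [4.1, 4.9, 6.2, 6.8, 8.0]

intercept, slope = fit_line(descriptor, activity)

# The "error" term of the model: what the fitted f leaves unexplained.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(descriptor, activity)]
```

The residuals mix both sources of error named above; a fit of this kind cannot by itself separate model bias from observational variability.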

Essential steps in QSAR studies

The principal steps of QSAR/QSPR include:

Selection of data set and extraction of structural/empirical descriptors
Variable selection
Model construction
Validation and evaluation

SAR and the SAR paradox

The basic assumption for all molecule-based hypotheses is that similar molecules have similar activities. This principle is also called Structure–Activity Relationship (SAR). The underlying problem is therefore how to define a small difference on a molecular level, since each kind of activity, e.g. reaction ability, biotransformation ability, solubility, target activity, and so on, might depend on another difference. Examples were given in the bioisosterism reviews by Patani/LaVoie{{cite journal | vauthors = Patani GA, LaVoie EJ | title = Bioisosterism: A Rational Approach in Drug Design | journal = Chemical Reviews | volume = 96 | issue = 8 | pages = 3147–3176 | date = Dec 1996 | pmid = 11848856 | doi = 10.1021/cr950066q }} and Brown.{{cite book | last1 = Brown | first1 = Nathan | name-list-style = vanc | title = Bioisosteres in Medicinal Chemistry | date = 2012 | publisher = Wiley-VCH | location = Weinheim | isbn = 978-3-527-33015-7 }}

In general, one is more interested in finding strong trends. Created hypotheses usually rely on a finite number of chemicals, so care must be taken to avoid overfitting: the generation of hypotheses that fit training data very closely but perform poorly when applied to new data.

The SAR paradox refers to the fact that not all similar molecules have similar activities.{{Citation needed|date=April 2024}}

Types

= Fragment based (group contribution) =

The "partition coefficient" (a measurement of differential solubility and itself a component of QSAR predictions) can be predicted either by atomic methods (known as "XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP" and other variations). It has been shown that the logP of a compound can be determined by the sum of its fragments; fragment-based methods are generally accepted as better predictors than atomic-based methods.{{cite journal | vauthors = Thompson SJ, Hattotuwagama CK, Holliday JD, Flower DR | title = On the hydrophobicity of peptides: Comparing empirical predictions of peptide log P values | journal = Bioinformation | volume = 1 | issue = 7 | pages = 237–41 | year = 2006 | pmid = 17597897 | pmc = 1891704 | doi = 10.6026/97320630001237 }} Fragmentary values have been determined statistically, based on empirical data for known logP values. This method gives mixed results and is generally not trusted to be accurate to better than ±0.1 units.{{Cite journal | title = Prediction of physicochemical parameters by atomic contributions |vauthors=Wildman SA, Crippen GM | doi = 10.1021/ci990307l | year = 1999 | journal = J. Chem. Inf. Comput. Sci. | pages = 868–873 | volume = 39 | issue = 5 }}
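The additive idea behind fragment methods can be sketched in a few lines: a molecule is described by counts of its fragments, and the estimate is the sum of per-fragment contributions. The contribution values below are made up for illustration and are not taken from CLogP or any published table; real schemes also apply correction factors that this sketch omits.

```python
# Hypothetical per-fragment contributions (illustrative numbers only).
FRAGMENT_CONTRIBUTION = {
    "CH3": 0.9,
    "CH2": 0.5,
    "OH": -1.1,
}


def estimate_logp(fragment_counts):
    """Sum fragment contributions, weighted by how often each fragment occurs."""
    return sum(
        FRAGMENT_CONTRIBUTION[fragment] * count
        for fragment, count in fragment_counts.items()
    )


# A molecule decomposed by hand into one CH3, one CH2 and one OH fragment.
fragment_counts = {"CH3": 1, "CH2": 1, "OH": 1}
estimate = estimate_logp(fragment_counts)
```

In practice the contribution table is obtained by regression against measured logP values, as described above, rather than written by hand.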

Group or fragment-based QSAR is also known as GQSAR. GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in a congeneric set of molecules or could be defined on the basis of pre-defined chemical rules in the case of non-congeneric sets. GQSAR also considers cross-term fragment descriptors, which could be helpful in identifying key fragment interactions that determine variation of activity.{{citation | vauthors = Ajmani S, Jadhav K, Kulkarni SA | title = Group-Based QSAR (G-QSAR)}}

Lead discovery using fragnomics is an emerging paradigm. In this context FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours.{{cite journal | vauthors = Manoharan P, Vijayan RS, Ghoshal N | title = Rationalizing fragment based drug discovery for BACE1: insights from FB-QSAR, FB-QSSR, multi objective (MO-QSPR) and MIF studies | journal = Journal of Computer-Aided Molecular Design | volume = 24 | issue = 10 | pages = 843–64 | date = Oct 2010 | pmid = 20740315 | doi = 10.1007/s10822-010-9378-9 | bibcode = 2010JCAMD..24..843M | s2cid = 1171860 }}

An advanced approach to fragment or group-based QSAR, based on the concept of pharmacophore similarity, has been developed. This method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological pharmacophoric descriptors to develop QSAR models. The resulting activity prediction may help assess the contribution of certain pharmacophore features, encoded by the respective fragments, toward activity improvement and/or detrimental effects.{{cite journal | vauthors = Prasanth Kumar S, Jasrai YT, Pandya HA, Rawal RM | title = Pharmacophore-similarity-based QSAR (PS-QSAR) for group-specific biological activity predictions | journal = Journal of Biomolecular Structure & Dynamics | volume = 33 | issue = 1 | pages = 56–69 | date = November 2013 | pmid = 24266725 | doi = 10.1080/07391102.2013.849618 | s2cid = 45364247 | url = https://figshare.com/articles/dataset/Pharmacophore_similarity_based_QSAR_PS_QSAR_for_group_specific_biological_activity_predictions/861021 | url-access = subscription }}

= 3D-QSAR =

The acronym 3D-QSAR or 3-D QSAR refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) by either experimental data (e.g. based on ligand-protein crystallography) or molecule superimposition software. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields{{cite book | vauthors = Leach AR | title = Molecular modelling: principles and applications | publisher = Prentice Hall | location = Englewood Cliffs, N.J | year = 2001 | isbn = 978-0-582-38210-7 }} which were correlated by means of partial least squares regression (PLS).

The resulting data space is then usually reduced by subsequent feature extraction (see also dimensionality reduction). The subsequent learning method can be any machine learning method, e.g. support vector machines.{{cite book | vauthors = Vert JP, Schölkopf B, Tsuda K | title = Kernel methods in computational biology | publisher = MIT Press | location = Cambridge, Mass | year = 2004 | isbn = 978-0-262-19509-6 }} An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule).{{cite journal | vauthors = Dietterich TG, Lathrop RH, Lozano-Pérez T | title = Solving the multiple instance problem with axis-parallel rectangles | journal = Artificial Intelligence | volume = 89 | issue = 1–2 | year = 1997 | pages = 31–71 | doi = 10.1016/S0004-3702(96)00034-3}}
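The multiple-instance assumption can be illustrated with a small sketch: each molecule is a "bag" of conformations, and the bag is labeled active if at least one conformation falls inside an axis-parallel rectangle in descriptor space. The rectangle bounds and conformer descriptors below are hypothetical; in a real application the bounds would be learned from labeled bags rather than fixed by hand.

```python
def in_rectangle(instance, lower, upper):
    """True if every descriptor of the instance lies within its [lower, upper] bound."""
    return all(lo <= value <= hi for value, lo, hi in zip(instance, lower, upper))


def bag_is_active(bag, lower, upper):
    """A bag (molecule) is active if at least one instance (conformation) is inside."""
    return any(in_rectangle(instance, lower, upper) for instance in bag)


# Hypothetical rectangle in a two-descriptor space.
LOWER = (0.0, 1.0)
UPPER = (1.0, 2.0)

# Each molecule is a list of conformations; each conformation is a descriptor tuple.
molecule_a = [(2.5, 0.3), (0.4, 1.5)]  # second conformation lies inside the rectangle
molecule_b = [(2.5, 0.3), (1.8, 2.9)]  # no conformation lies inside

labels = [bag_is_active(m, LOWER, UPPER) for m in (molecule_a, molecule_b)]
```

The use of `any` encodes the "at least one instance" assumption directly: a single matching conformation is enough to label the whole molecule active.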

On June 18, 2011, the Comparative Molecular Field Analysis (CoMFA) patent dropped any restriction on the use of GRID and partial least-squares (PLS) technologies.{{citation needed|date=March 2018}}

= Chemical descriptor based =

In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR.{{cite journal | vauthors = Caruthers JM, Lauterbach JA, Thomson KT, Venkatasubramanian V, Snively CM, Bhan A, Katare S, Oskarsdottir G | title = Catalyst design: knowledge extraction from high-throughput experimentation | journal = J. Catal. | year = 2003 | volume = 216 | issue = 1–2 | pages = 3776–3777 | doi = 10.1016/S0021-9517(02)00036-2}} This approach differs from the fragment (or group contribution) approach in that the descriptors are computed for the system as a whole rather than from the properties of individual fragments. It differs from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields.

An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.{{cite journal | vauthors = Manz TA, Phomphrai K, Medvedev G, Krishnamurthy BB, Sharma S, Haq J, Novstrup KA, Thomson KT, Delgass WN, Caruthers JM, Abu-Omar MM | title = Structure-activity correlation in titanium single-site olefin polymerization catalysts containing mixed cyclopentadienyl/aryloxide ligation | journal = Journal of the American Chemical Society | volume = 129 | issue = 13 | pages = 3776–7 | date = Apr 2007 | pmid = 17348648 | doi = 10.1021/ja0640849 }}{{cite journal | vauthors = Manz TA, Caruthers JM, Sharma S, Phomphrai K, Thomson KT, Delgass WN, Abu-Omar MM | title = Structure–Activity Correlation for Relative Chain Initiation to Propagation Rates in Single-Site Olefin Polymerization Catalysis | journal = Organometallics | year = 2012 | volume = 31 | pages = 602–618 | doi = 10.1021/om200884x | issue = 2}}

= String based =

It has been shown that activity prediction is even possible based purely on the SMILES string.{{cite arXiv |last1=Jastrzębski |first1=Stanisław |last2=Leśniak |first2=Damian |last3=Czarnecki |first3=Wojciech Marian |title=Learning to SMILE(S) |date=8 March 2018 |class=cs.CL |eprint=1602.06289}}{{cite arXiv |last1=Bjerrum |first1=Esben Jannik |title=SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules |date=17 May 2017 |class=cs.LG |eprint=1703.07076}}{{cite journal |last1=Mayr |first1=Andreas |last2=Klambauer |first2=Günter |last3=Unterthiner |first3=Thomas |last4=Steijaert |first4=Marvin |last5=Wegner |first5=Jörg K. |last6=Ceulemans |first6=Hugo |last7=Clevert |first7=Djork-Arné |last8=Hochreiter |first8=Sepp |title=Large-scale comparison of machine learning methods for drug target prediction on ChEMBL |journal=Chemical Science |date=20 June 2018 |volume=9 |issue=24 |pages=5441–5451 |doi=10.1039/c8sc00148k|pmid=30155234 |pmc=6011237 }}
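To make the idea of a string-only input concrete, the sketch below turns SMILES strings into integer token sequences of the kind a sequence model could consume. The tokenizer is a deliberate simplification (it handles bracket atoms and the two-letter halogens Cl and Br, and treats everything else as single characters), and the function names are illustrative; it is not the preprocessing used in the cited studies.

```python
import re

# Simplified tokenizer: bracket atoms, Br, Cl, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(smiles_list):
    """Map every token seen in the data set to an integer index."""
    tokens = sorted({token for s in smiles_list for token in tokenize(s)})
    return {token: index for index, token in enumerate(tokens)}


def encode(smiles, vocabulary):
    """Encode a SMILES string as a list of token indices."""
    return [vocabulary[token] for token in tokenize(smiles)]


molecules = ["CCO", "CCCl", "c1ccccc1Br"]
vocabulary = build_vocabulary(molecules)
encoded = [encode(s, vocabulary) for s in molecules]
```

No chemical descriptors are computed at any point; any structure-activity information has to be learned by the model from the token sequences themselves.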

= Graph based =

Similarly to string-based methods, the molecular graph can directly be used as input for QSAR models,{{cite journal |last1=Merkwirth |first1=Christian |last2=Lengauer |first2=Thomas |title=Automatic Generation of Complementary Descriptors with Molecular Graph Networks |journal=Journal of Chemical Information and Modeling |date=1 September 2005 |volume=45 |issue=5 |pages=1159–1168 |doi=10.1021/ci049613b|pmid=16180893 }}{{cite journal |last1=Kearnes |first1=Steven |last2=McCloskey |first2=Kevin |last3=Berndl |first3=Marc |last4=Pande |first4=Vijay |last5=Riley |first5=Patrick |title=Molecular graph convolutions: moving beyond fingerprints |journal=Journal of Computer-Aided Molecular Design |date=1 August 2016 |volume=30 |issue=8 |pages=595–608 |doi=10.1007/s10822-016-9938-8|pmid=27558503 |pmc=5028207 |arxiv=1603.00856 |bibcode=2016JCAMD..30..595K }} but such models usually yield inferior performance compared to descriptor-based QSAR models.{{cite journal |last1=Jiang |first1=Dejun |last2=Wu |first2=Zhenxing |last3=Hsieh |first3=Chang-Yu |last4=Chen |first4=Guangyong |last5=Liao |first5=Ben |last6=Wang |first6=Zhe |last7=Shen |first7=Chao |last8=Cao |first8=Dongsheng |last9=Wu |first9=Jian |last10=Hou |first10=Tingjun |title=Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models |journal=Journal of Cheminformatics |date=17 February 2021 |volume=13 |issue=1 |pages=12 |doi=10.1186/s13321-020-00479-8|pmid=33597034 |pmc=7888189 |doi-access=free }}{{cite journal |last1=van Tilborg |first1=Derek |last2=Alenicheva |first2=Alisa |last3=Grisoni |first3=Francesca |title=Exposing the Limitations of Molecular Machine Learning with Activity Cliffs |journal=Journal of Chemical Information and Modeling |date=12 December 2022 |volume=62 |issue=23 |pages=5938–5951 |doi=10.1021/acs.jcim.2c01073|pmid=36456532 |pmc=9749029 }}

=q-RASAR=

QSAR has been merged with the similarity-based read-across technique to develop a new field of q-RASAR. The [https://sites.google.com/site/kunalroyindia/home/the-dtc-laboratory?authuser=0 DTC Laboratory] at Jadavpur University has developed this hybrid method and the details are available at their [https://sites.google.com/site/kunalroyindia/home/rasar?authuser=0 laboratory page]. Recently, the q-RASAR framework has been improved by its integration with the ARKA descriptors in QSAR.

Modeling

The literature often shows that chemists have a preference for partial least squares (PLS) methods,{{citation needed|date=April 2012}} since they apply feature extraction and induction in one step.

= Data mining approach =

Computer SAR models typically calculate a relatively large number of features. Because those lack structural interpretation ability, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining.

A typical data-mining-based prediction uses, for example, support vector machines, decision trees, or artificial neural networks to induce a predictive learning model.

Molecule mining approaches, a special case of structured data mining approaches, apply a similarity-matrix-based prediction or an automatic fragmentation scheme into molecular substructures. There are also approaches that use maximum common subgraph searches or graph kernels.{{cite book | vauthors = Gusfield D | title = Algorithms on strings, trees, and sequences: computer science and computational biology | publisher = Cambridge University Press | location = Cambridge, UK | year = 1997 | isbn = 978-0-521-58519-4 }}{{cite book | vauthors = Helma C | title = Predictive toxicology | publisher = Taylor & Francis | location = Washington, DC | year = 2005 | isbn = 978-0-8247-2397-2 }}


=Matched molecular pair analysis=

Typically, QSAR models derived from non-linear machine learning are seen as a "black box", which fails to guide medicinal chemists. A relatively new concept is matched molecular pair analysis{{cite journal | vauthors = Dossetter AG, Griffen EJ, Leach AG | title = Matched molecular pair analysis in drug discovery | journal = Drug Discovery Today | volume = 18 | issue = 15–16 | pages = 724–31 | year = 2013 | pmid = 23557664 | doi = 10.1016/j.drudis.2013.03.003 }} or prediction-driven MMPA, which is coupled with a QSAR model in order to identify activity cliffs.{{cite journal | vauthors = Sushko Y, Novotarskyi S, Körner R, Vogt J, Abdelaziz A, Tetko IV | title = Prediction-driven matched molecular pairs to interpret QSARs and aid the molecular optimization process | journal = Journal of Cheminformatics | volume = 6 | issue = 1 | pages = 48 | year = 2014 | pmid = 25544551 | pmc = 4272757 | doi = 10.1186/s13321-014-0048-0 | doi-access = free }}

Evaluation of the quality of QSAR models

QSAR modeling produces predictive models derived from application of statistical tools correlating biological activity (including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure or properties. QSARs are being applied in many disciplines, for example: risk assessment, toxicity prediction, and regulatory decisions{{cite journal | vauthors = Tong W, Hong H, Xie Q, Shi L, Fang H, Perkins R | title = Assessing QSAR Limitations – A Regulatory Perspective | journal = Current Computer-Aided Drug Design | volume = 1 | issue = 2 | pages = 195–205 |date=April 2005 | doi = 10.2174/1573409053585663 | url = https://zenodo.org/record/1235864 }} in addition to drug discovery and lead optimization.{{cite journal | vauthors = Dearden JC | title = In silico prediction of drug toxicity | journal = Journal of Computer-Aided Molecular Design | volume = 17 | issue = 2–4 | pages = 119–27 | year = 2003 | pmid = 13677480 | doi = 10.1023/A:1025361621494 | bibcode = 2003JCAMD..17..119D | s2cid = 21518449 }} Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.

For validation of QSAR models, usually various strategies are adopted:{{cite book | vauthors = Wold S, Eriksson L | editor = Waterbeemd, Han van de | title = Chemometric methods in molecular design | publisher = VCH | location = Weinheim | year = 1995 | pages = 309–318 | chapter = Statistical validation of QSAR results | isbn = 978-3-527-30044-0 }}

internal validation or cross-validation (cross-validation is a measure of model robustness: the more robust a model is, i.e. the higher its q2, the less the removal of data perturbs the original model);
external validation by splitting the available data set into a training set for model development and a prediction set for checking model predictivity;
blind external validation by applying the model to new external data; and
data randomization or Y-scrambling to verify the absence of chance correlation between the response and the modeling descriptors.
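Two of the strategies listed above, leave-one-out cross-validation and Y-scrambling, can be sketched for a one-descriptor linear model. The data are hypothetical, and q2 is computed here as 1 − PRESS / SS with SS taken around the mean of all responses, which is one common convention among several; the helper names are illustrative.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def loo_q2(x, y):
    """Leave-one-out q2: refit without each compound, then predict it."""
    press = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - press / ss_total


def y_scrambled_q2(x, y, n_rounds=20, seed=0):
    """q2 values obtained after randomly permuting the responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(loo_q2(x, shuffled))
    return scores


# Hypothetical descriptor values and responses with a strong linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

q2_true = loo_q2(x, y)
q2_scrambled = y_scrambled_q2(x, y)
```

If the scrambled q2 values were comparable to `q2_true`, that would suggest the original correlation could have arisen by chance.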

The success of any QSAR model depends on accuracy of the input data, selection of appropriate descriptors and statistical tools, and most importantly validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models validation must be mainly for robustness, prediction performances and applicability domain (AD) of the models.{{cite journal | vauthors = Roy K | title = On some aspects of validation of predictive quantitative structure-activity relationship models | journal = Expert Opinion on Drug Discovery | volume = 2 | issue = 12 | pages = 1567–77 | date = Dec 2007 | pmid = 23488901| doi = 10.1517/17460441.2.12.1567 | s2cid = 21305783 }}{{cite journal |last1=Sahigara |first1=Faizan |last2=Mansouri |first2=Kamel |last3=Ballabio |first3=Davide |last4=Mauri |first4=Andrea |last5=Consonni |first5=Viviana |last6=Todeschini |first6=Roberto |title=Comparison of Different Approaches to Define the Applicability Domain of QSAR Models |journal=Molecules |date=2012 |volume=17 |issue=5 |pages=4791–4810 |doi=10.3390/molecules17054791|pmid=22534664 |pmc=6268288 |doi-access=free }}

Some validation methodologies can be problematic. For example, leave-one-out cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published.

Different aspects of validation of QSAR models that need attention include methods of selection of training set compounds,{{cite journal | vauthors = Leonard JT, Roy K | title = On selection of training and test sets for the development of predictive QSAR models | journal = QSAR & Combinatorial Science | volume = 25 | issue = 3 | pages = 235–251| year = 2006 | doi = 10.1002/qsar.200510161 }} setting training set size{{cite journal | vauthors = Roy PP, Leonard JT, Roy K | title = Exploring the impact of size of training sets for the development of predictive QSAR models | journal = Chemometrics and Intelligent Laboratory Systems | volume = 90 | issue = 1 | pages = 31–42 | year = 2008 | doi = 10.1016/j.chemolab.2007.07.004 }} and impact of variable selection{{cite journal | vauthors = Put R, Vander Heyden Y | title = Review on modelling aspects in reversed-phase liquid chromatographic quantitative structure-retention relationships | journal = Analytica Chimica Acta | volume = 602 | issue = 2 | pages = 164–72 | date = Oct 2007 | pmid = 17933600 | doi = 10.1016/j.aca.2007.09.014 }} for training set models for determining the quality of prediction. Development of novel validation parameters for judging quality of QSAR models is also important.{{cite journal | vauthors = Pratim Roy P, Paul S, Mitra I, Roy K | title = On two novel parameters for validation of predictive QSAR models | journal = Molecules | volume = 14 | issue = 5 | pages = 1660–701 | year = 2009 | pmid = 19471190 | pmc = 6254296 | doi = 10.3390/molecules14051660 | doi-access = free }}{{cite journal | vauthors = Chirico N, Gramatica P | title = Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient | journal = Journal of Chemical Information and Modeling | volume = 51 | issue = 9 | pages = 2320–35 | date = Sep 2011 | pmid = 21800825 | doi = 10.1021/ci200211n }}

Application

= Chemical =

One of the first historical QSAR applications was to predict boiling points.{{cite book | vauthors = Rouvray DH, Bonchev D | title = Chemical graph theory: introduction and fundamentals | publisher = Abacus Press | location = Tunbridge Wells, Kent, England | year = 1991 | isbn = 978-0-85626-454-2 }}

It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. The boiling point clearly increases with the number of carbons, and this trend serves as a means of predicting the boiling points of higher alkanes.
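The alkane example can be sketched as a one-descriptor fit. The boiling points below are approximate literature values for the straight-chain alkanes pentane to octane, and the straight-line form is an assumption made for simplicity; the true trend flattens with chain length, so a linear extrapolation to nonane (which boils at roughly 151 °C) overshoots by a few degrees.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Number of carbons and approximate boiling points (degrees Celsius)
# for n-pentane, n-hexane, n-heptane and n-octane.
carbons = [5, 6, 7, 8]
boiling_point_c = [36.1, 68.7, 98.4, 125.7]

intercept, slope = fit_line(carbons, boiling_point_c)

# Extrapolate one carbon beyond the training data (n-nonane).
predicted_nonane = intercept + slope * 9
```

The gap between the extrapolated and the measured value is a small-scale illustration of why predictions outside the range of the training data deserve extra caution.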

Applications of continuing interest include the Hammett equation, the Taft equation and pKa prediction methods.{{cite encyclopedia | last = Fraczkiewicz | first = R | encyclopedia = Reference Module in Chemistry, Molecular Sciences and Chemical Engineering [Online] | editor-last = Reedijk | editor-first = J | volume = 5 | publisher = Elsevier | location = Amsterdam, the Netherlands | year = 2013 | doi = 10.1016/B978-0-12-409547-2.02610-X | title = Reference Module in Chemistry, Molecular Sciences and Chemical Engineering | isbn = 9780124095472 | chapter = In Silico Prediction of Ionization }}

= Biological =

The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low toxicity (non-specific activity). Of special interest is the prediction of partition coefficient log P, which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.{{cn|date=March 2024}}

While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.{{cite journal | vauthors = Freyhult EK, Andersson K, Gustafsson MG | title = Structural modeling extends QSAR analysis of antibody-lysozyme interactions to 3D-QSAR | journal = Biophysical Journal | volume = 84 | issue = 4 | pages = 2264–72 | date = Apr 2003 | pmid = 12668435 | pmc = 1302793 | doi = 10.1016/S0006-3495(03)75032-2 | bibcode = 2003BpJ....84.2264F }}

Part of the machine learning task is to reduce the risk of a SAR paradox, especially given that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding{{cite book | vauthors = Timmerman H, Todeschini R, Consonni V, Mannhold R, Kubinyi H | title = Handbook of Molecular Descriptors | publisher = Wiley-VCH | location = Weinheim | year = 2002 | isbn = 978-3-527-29913-3 }} and learning.{{cite book |vauthors=Duda RO, Hart PW, Stork DG | title = Pattern classification | publisher = John Wiley & Sons | location = Chichester | year = 2001 | isbn = 978-0-471-05669-0 }}

= Applications =

(Q)SAR models have been used for risk management. QSARs are suggested by regulatory authorities; in the European Union, QSARs are suggested by the REACH regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals". Regulatory application of QSAR methods includes in silico toxicological assessment of genotoxic impurities.{{Cite journal|last1=Fioravanzo|first1=E.|last2=Bassan|first2=A.|last3=Pavan|first3=M.|last4=Mostrag-Szlichtyng|first4=A.|last5=Worth|first5=A. P.|date=2012-04-01|title=Role of in silico genotoxicity tools in the regulatory assessment of pharmaceutical impurities|journal=SAR and QSAR in Environmental Research|volume=23|issue=3–4|pages=257–277|doi=10.1080/1062936X.2012.657236|issn=1062-936X|pmid=22369620|s2cid=2714861}} Commonly used QSAR assessment software such as DEREK or CASE Ultra (MultiCASE) is used to assess the genotoxicity of impurities according to ICH M7.[https://www.ema.europa.eu/en/ich-m7-assessment-control-dna-reactive-mutagenic-impurities-pharmaceuticals-limit-potential ICH M7 Assessment and control of DNA reactive (mutagenic) impurities in pharmaceuticals to limit potential carcinogenic risk - Scientific guideline]

The region of chemical descriptor space enclosed by the convex hull of a particular training set of chemicals is called the training set's applicability domain. Predicting properties of novel chemicals that lie outside the applicability domain requires extrapolation, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic.{{cn|date=March 2024}}
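The convex-hull definition of the applicability domain can be illustrated with a minimal sketch, assuming only two descriptors so that the hull is a polygon. The function names (`convex_hull`, `in_applicability_domain`) and all descriptor values are hypothetical and chosen for illustration; real QSAR descriptor spaces are usually high-dimensional, where other domain estimates are often more practical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (descriptor_1, descriptor_2), hypothetical descriptors


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def in_applicability_domain(hull: List[Point], query: Point) -> bool:
    """True if the query lies inside or on the boundary of the hull."""
    n = len(hull)
    if n < 3:
        raise ValueError("need at least three non-collinear training points")
    # For a counter-clockwise hull, an interior point is never to the right of an edge.
    return all(cross(hull[i], hull[(i + 1) % n], query) >= 0 for i in range(n))


# Made-up descriptor values for five training chemicals (illustration only).
training = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(training)

inside = in_applicability_domain(hull, (1.0, 1.0))   # interpolation
outside = in_applicability_domain(hull, (5.0, 1.0))  # extrapolation
```

In this sketch a prediction for the second query would be flagged as an extrapolation, since its first descriptor exceeds every value seen in the training set.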

The QSAR equations can be used to predict biological activities of new molecules before their synthesis.
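As a rough illustration of predicting from a QSAR equation, the sketch below fits a one-descriptor linear model by ordinary least squares and applies it to a descriptor value for a molecule that has not been made. The function names and every number are invented for illustration; they are not measured data, and practical QSAR models typically use many descriptors and require validation.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_linear_qsar(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of activity = slope * descriptor + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_activity(slope: float, intercept: float, descriptor: float) -> float:
    """Evaluate the fitted QSAR equation for one descriptor value."""
    return slope * descriptor + intercept


# Invented training data: one descriptor value and one activity per chemical.
descriptor_values = [1.0, 2.0, 3.0, 4.0]
activities = [4.5, 5.0, 5.5, 6.0]

slope, intercept = fit_linear_qsar(descriptor_values, activities)

# Descriptor value computed for a hypothetical, not yet synthesized molecule.
predicted = predict_activity(slope, intercept, 2.5)
```

Because the query value 2.5 lies between the smallest and largest training descriptor values, this prediction is an interpolation rather than an extrapolation.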

Examples of machine learning tools for QSAR modeling include:{{cite journal | vauthors = Lavecchia A | title = Machine-learning approaches in drug discovery: methods and applications | journal = Drug Discovery Today | volume = 20 | issue = 3 | pages = 318–31 | date = Mar 2015 | pmid = 25448759 | doi = 10.1016/j.drudis.2014.10.012 }}

{| class="wikitable"
! S.No. !! Name !! Algorithms !! External link
|-
| 1. || R || RF, SVM, Naïve Bayesian, and ANN || {{cite web | url = http://www.r-project.org/ | title = R: The R Project for Statistical Computing }}
|-
| 2. || libSVM || SVM || {{cite web | url = https://www.csie.ntu.edu.tw/~cjlin/libsvm/ | title = LIBSVM -- A Library for Support Vector Machines }}
|-
| 3. || Orange || RF, SVM, and Naïve Bayesian || {{cite web | url = http://www.ailab.si/orange/ | title = Orange Data Mining }}
|-
| 4. || RapidMiner || SVM, RF, Naïve Bayes, DT, ANN, and k-NN || {{cite web | url = http://rapid-i.com/ | title = RapidMiner {{!}} #1 Open Source Predictive Analytics Platform }}
|-
| 5. || Weka || RF, SVM, and Naïve Bayes || {{cite web | url = http://www.cs.waikato.ac.nz/ml/weka/ | title = Weka 3 - Data Mining with Open Source Machine Learning Software in Java | access-date = 2016-03-24 | archive-date = 2011-10-28 | archive-url = https://web.archive.org/web/20111028090649/http://www.cs.waikato.ac.nz/ml/weka/ | url-status = dead }}
|-
| 6. || KNIME || DT, Naïve Bayes, and SVM || {{cite web | url = http://www.knime.org/ | title = KNIME {{!}} Open for Innovation }}
|-
| 7. || AZOrange{{cite journal | vauthors = Stålring JC, Carlsson LA, Almeida P, Boyer S | title = AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment | journal = Journal of Cheminformatics | volume = 3 | pages = 28 | year = 2011 | pmid = 21798025 | pmc = 3158423 | doi = 10.1186/1758-2946-3-28 | doi-access = free }} || RT, SVM, ANN, and RF || {{cite web | url = https://github.com/AZcompTox/AZOrange | title = AZCompTox/AZOrange: AstraZeneca add-ons to Orange. | work = GitHub | date = 2018-09-19 }}
|-
| 8. || Tanagra || SVM, RF, Naïve Bayes, and DT || {{cite web | url = http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html | title = TANAGRA - A free DATA MINING software for teaching and research | access-date = 2016-03-24 | archive-date = 2017-12-19 | archive-url = https://web.archive.org/web/20171219194223/http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html | url-status = dead }}
|-
| 9. || ELKI || k-NN || {{cite web | url = http://elki.dbs.ifi.lmu.de/++ | title = ELKI Data Mining Framework | archive-url = https://web.archive.org/web/20161119100656/http://elki.dbs.ifi.lmu.de/ | archive-date = 2016-11-19 | url-status = dead }}
|-
| 10. || MALLET || || {{cite web | url = http://mallet.cs.umass.edu/ | title = MALLET homepage }}
|-
| 11. || MOA || || {{cite web | url = http://moa.cms.waikato.ac.nz/+ | title = MOA Massive Online Analysis {{!}} Real Time Analytics for Data Streams | archive-url = https://web.archive.org/web/20170619113241/http://moa.cms.waikato.ac.nz/ | archive-date = 2017-06-19 | url-status = dead }}
|-
| 12. || DeepChem || Logistic Regression, Naive Bayes, RF, ANN, and others || {{cite web|title=DeepChem|url=https://deepchem.io/|website=deepchem.io|access-date=20 October 2017}}
|-
| 13. || alvaModel{{cite journal |last1=Mauri |first1=Andrea |last2=Bertola |first2=Matteo| title = Alvascience: A New Software Suite for the QSAR Workflow Applied to the Blood–Brain Barrier Permeability | journal = International Journal of Molecular Sciences | volume = 23 | issue= 12882 | year = 2022 |page=12882 | doi = 10.3390/ijms232112882 |pmid=36361669 |pmc=9655980 |doi-access=free }} || Regression (OLS, PLS, k-NN, SVM and Consensus) and Classification (LDA/QDA, PLS-DA, k-NN, SVM and Consensus) || {{cite web | title=alvaModel: a software tool to create QSAR/QSPR models | url=https://www.alvascience.com/alvamodel/ | website=alvascience.com}}
|-
| 14. || scikit-learn (Python){{cite journal |author1=Fabian Pedregosa |author2=Gaël Varoquaux |author3=Alexandre Gramfort |author4=Vincent Michel |author5=Bertrand Thirion |author6=Olivier Grisel |author7=Mathieu Blondel |author8=Peter Prettenhofer |author9=Ron Weiss |author10=Vincent Dubourg |author11=Jake Vanderplas |author12=Alexandre Passos |author13=David Cournapeau |author14=Matthieu Perrot |author15=Édouard Duchesnay |title=scikit-learn: Machine Learning in Python |journal=Journal of Machine Learning Research |year=2011 |volume=12 |pages=2825–2830 |url=http://jmlr.org/papers/v12/pedregosa11a.html }} || Logistic Regression, Naive Bayes, kNN, RF, SVM, GP, ANN, and others || {{cite web|title=SciKit-Learn|url=https://scikit-learn.org/stable/index.html#|website=scikit-learn.org|access-date=13 August 2023}}
|-
| 15. || Scikit-Mol{{Citation |last=Bjerrum |first=Esben Jannik |title=Scikit-Mol brings cheminformatics to Scikit-Learn |date=2023-12-06 |url=https://chemrxiv.org/engage/chemrxiv/article-details/60ef0fc58825826143a82cc0 |access-date=2025-01-17 |language=en |doi=10.26434/chemrxiv-2023-fzqwd |last2=Bachorz |first2=Rafał Adam |last3=Bitton |first3=Adrien |last4=Choung |first4=Oh-hyeon |last5=Chen |first5=Ya |last6=Esposito |first6=Carmen |last7=Ha |first7=Son Viet |last8=Poehlmann |first8=Andreas}} || Integration of Scikit-learn models and RDKit featurization || [https://pypi.org/project/scikit-mol/ scikit-mol] on pypi.org
|-
| 16. || scikit-fingerprints{{cite journal |last1=Adamczyk |first1=J. |last2=Ludynia |first2=P. |title=Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python |journal=SoftwareX |volume=28 |page=101944 |year=2024 |doi=10.1016/j.softx.2024.101944 }} || Molecular fingerprints, API compatible with Scikit-learn models || {{cite web|title=scikit-fingerprints|url=https://github.com/scikit-fingerprints/scikit-fingerprints|access-date=29 December 2024}}
|-
| 17. || DTC Lab Tools || Multiple Linear Regression, Partial Least Squares, Applicability Domain, Validation, and others || {{cite web|title=DTCLab Tools|url=https://teqip.jdvu.ac.in/QSAR_Tools/|access-date=12 May 2025}}
|-
| 18. || DTC Lab Supplementary Tools || Quantitative Read-across, q-RASAR, ARKA, Regression and Classification-based ML tools, and others || {{cite web|title=DTCLab Supplementary Tools|url=https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/|access-date=12 May 2025}}
|}

= References =

= External links =

{{cite web | url = http://www.qsar.org/ | title = The Cheminformatics and QSAR Society | access-date = 2009-05-11}}
{{cite web | url = http://www.3d-qsar.com/ | title = The 3D QSAR Server | access-date = 2011-06-18}}
{{cite journal | url = http://www.natureprotocols.com/2007/03/05/development_of_qsar_models_usi_1.php | title = Nature Protocols: Development of QSAR models using C-QSAR program | journal = Protocol Exchange| archive-url =https://web.archive.org/web/20070501144332/http://www.natureprotocols.com/2007/03/05/development_of_qsar_models_usi_1.php| archive-date =2007-05-01| quote = A regression program that has dual databases of over 21,000 QSAR models | doi = 10.1038/nprot.2007.125 | url-status = dead | access-date = 2009-05-11| year = 2007 | last1 = Verma | first1 = Rajeshwar P. | last2 = Hansch | first2 = Corwin | doi-access = free }}
{{cite web| url =http://www.qsarworld.com| title =QSAR World| archive-url =https://web.archive.org/web/20090425113637/http://www.qsarworld.com/| archive-date =2009-04-25| quote =A comprehensive web resource for QSAR modelers| access-date =2009-05-11| url-status =dead}}
[http://dtclab.webs.com/software-tools Chemoinformatics Tools] {{Webarchive|url=https://web.archive.org/web/20170704100534/http://dtclab.webs.com/software-tools |date=2017-07-04 }}, Drug Theoretics and Cheminformatics Laboratory
[https://zenodo.org/record/854760 Multiscale Conceptual Model Figures for QSARs in Biological and Environmental Science]
