Gower's distance

In statistics, Gower's distance between two mixed-type objects is a similarity measure that can handle different types of data within the same dataset; it is particularly useful in cluster analysis and other multivariate statistical techniques. The variables can be binary, ordinal, or continuous. It works by normalizing the difference between the two objects on each variable and then computing a weighted average of these normalized differences. The distance was defined in 1971 by Gower{{cite journal |last1=Gower|first1=John C|date=1971|title=A general coefficient of similarity and some of its properties|url=https://www.jstor.org/stable/2528823|journal=Biometrics|volume=27|issue=4|pages=857–871|doi=10.2307/2528823 |jstor=2528823 |access-date=2024-06-03}} and takes values between 0 and 1, with smaller values indicating higher similarity.

Definition

For two objects $i$ and $j$ having $p$ descriptors, the similarity $S$ is defined as:

$S_{ij} = \frac{\sum_{k=1}^pw_{ijk}s_{ijk}}{\sum_{k=1}^pw_{ijk}},$

where the $w_{ijk}$ are non-negative weights usually set to $1${{cite book |last1=Borg |first1=Ingwer |last2=Groenen |first2=Patrick J. F. |title=Modern multidimensional scaling: theory and applications |date=2005 |publisher=Springer |location=New York [Heidelberg] |isbn=978-0387-25150-9 |pages=124–125 |edition=2}} and $s_{ijk}$ is the similarity between the two objects regarding their $k$-th variable. If the variable is binary or ordinal, the values of $s_{ijk}$ are 0 or 1, with 1 denoting equality. If the variable is continuous, $s_{ijk} = 1 - \frac{|x_i - x_j|}{R_k}$, with $R_k$ being the range of the $k$-th variable, thus ensuring $0 \leq s_{ijk} \leq 1$. As a result, the overall similarity $S_{ij}$ between two objects is the weighted average of the similarities calculated for all their descriptors.{{cite book |last1=Legendre |first1=Pierre |last2=Legendre |first2=Louis |title=Numerical ecology |date=2012 |publisher=Elsevier |location=Amsterdam |isbn=978-0-444-53868-0 |pages=278–280 |edition=Third English}}
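The definition can be illustrated with a minimal Python sketch. The function name `gower_similarity`, the `kinds` labels, and the convention of passing each range $R_k$ explicitly are assumptions made for this example only; they do not come from any particular library.

```python
def gower_similarity(a, b, kinds, ranges, weights=None):
    """Weighted average of per-variable similarities s_ijk for two objects.

    a, b    : sequences holding the p descriptor values of the two objects
    kinds   : "continuous" for continuous variables; any other label is
              compared by equality (binary or ordinal, as in the definition)
    ranges  : R_k for each continuous variable (ignored for the others)
    weights : non-negative weights w_ijk; defaults to 1 for every variable
    """
    if weights is None:
        weights = [1.0] * len(kinds)
    numerator = 0.0
    denominator = 0.0
    for x, y, kind, r, w in zip(a, b, kinds, ranges, weights):
        if kind == "continuous":
            # s_ijk = 1 - |x_i - x_j| / R_k; a zero range means every
            # value is identical, which is treated as similarity 1 here.
            s = 1.0 - abs(x - y) / r if r else 1.0
        else:
            # Binary or ordinal: 1 if the values are equal, 0 otherwise.
            s = 1.0 if x == y else 0.0
        numerator += w * s
        denominator += w
    if denominator == 0:
        raise ValueError("at least one weight must be positive")
    return numerator / denominator


# Two objects with one binary, one ordinal and one continuous descriptor.
obj_i = (1, "low", 20.0)
obj_j = (1, "high", 30.0)
kinds = ("binary", "ordinal", "continuous")
ranges = (None, None, 40.0)  # assumed range of the continuous variable

similarity = gower_similarity(obj_i, obj_j, kinds, ranges)
# Per-variable similarities are 1, 0 and 0.75, so S_ij = 1.75 / 3.
```

With unit weights the result is simply the mean of the per-variable similarities. Setting a weight to 0 drops that variable from both the numerator and the denominator.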

In its original exposition, the distance does not treat ordinal variables in a special manner. In the 1990s, first Kaufman and Rousseeuw{{cite book |last1=Kaufman |first1=Leonard |last2=Rousseeuw |first2=Peter J. |title=Finding groups in data: an introduction to cluster analysis |date=1990|pages=35–36 |publisher=Wiley |location=New York |isbn=9780471878766}} and later Podani{{cite journal |last1=Podani |first1=János |title=Extending Gower's general coefficient of similarity to ordinal characters |journal=Taxon |date=May 1999 |jstor=1224438|volume=48 |issue=2 |pages=331–340|doi=10.2307/1224438 }} suggested extensions where the ordering of an ordinal feature is used. For example, Podani obtains relative rank differences as $s_{ijk} = 1 - \frac{|r_i - r_j|}{\max{\{r\}} - \min{\{r\}}}$, with $r$ being the ranks corresponding to the ordered categories of the $k$-th variable.
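As a rough illustration of the relative rank difference, the following Python sketch evaluates the formula as stated above for a three-level ordinal variable. The function name `ordinal_rank_similarity` and the rank coding are hypothetical, and any further refinements in Podani's paper are not reproduced here.

```python
def ordinal_rank_similarity(r_i, r_j, ranks):
    """s_ijk = 1 - |r_i - r_j| / (max{r} - min{r}) for one ordinal variable.

    r_i, r_j : ranks of the two objects on this variable
    ranks    : all ranks observed for this variable
    """
    span = max(ranks) - min(ranks)
    if span == 0:
        # Only one category is present, so all objects agree on it.
        return 1.0
    return 1.0 - abs(r_i - r_j) / span


# Assumed coding for the example: low = 1, medium = 2, high = 3.
ranks = [1, 2, 3]
low_vs_medium = ordinal_rank_similarity(1, 2, ranks)  # 0.5
low_vs_high = ordinal_rank_similarity(1, 3, ranks)    # 0.0
```

Unlike the 0 or 1 matching used in the original definition, adjacent categories receive a higher similarity than categories at opposite ends of the scale.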

Software implementations

Implementations of Gower's distance are available for many programming languages and statistical packages, such as R and Python. The implementations may follow Kaufman and Rousseeuw's extension, which changes the similarity for continuous variables to $s_{ijk} = \frac{|x_i-x_j|}{R_k}$, a term that is 0 for identical values and 1 for values at opposite ends of the range.{{Cite web| url = https://search.r-project.org/CRAN/refmans/StatMatch/html/gower.dist.html| title = gower.dist {StatMatch}| last = D'Orazio| first = Marcello | website = The R Project for Statistical Computing| access-date = 31 October 2024}}

{| class="wikitable sortable"
! Language/program !! Function !! Ref.
|-
| R || StatMatch::gower.dist(X) || [https://search.r-project.org/CRAN/refmans/StatMatch/html/gower.dist.html]
|-
| Python || gower.gower_matrix(X) || [https://pypi.org/project/gower/]
|}
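To show the kind of pairwise matrix such functions produce, the following pure-Python sketch builds one from the per-variable term $\frac{|x_i-x_j|}{R_k}$ for continuous columns. It is not the code of either package listed above. The function name `gower_distance_matrix`, the unit weights, the use of the observed column range as $R_k$, and the 0 or 1 mismatch term for non-continuous columns are all assumptions made for this example.

```python
def gower_distance_matrix(rows, continuous):
    """Pairwise distance matrix for a small mixed-type dataset.

    rows       : list of equal-length tuples, one per object
    continuous : set of column indices treated as continuous
    """
    n = len(rows)
    p = len(rows[0])
    # R_k: observed range of each continuous column (an assumption; some
    # implementations let the caller supply the ranges instead).
    ranges = {}
    for k in continuous:
        column = [row[k] for row in rows]
        ranges[k] = max(column) - min(column)

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                if k in continuous:
                    r = ranges[k]
                    # |x_i - x_j| / R_k; a constant column contributes 0.
                    total += abs(rows[i][k] - rows[j][k]) / r if r else 0.0
                else:
                    # Assumed mismatch term: 0 if equal, 1 otherwise.
                    total += 0.0 if rows[i][k] == rows[j][k] else 1.0
            # Unit weights, so the average is taken over all p variables.
            dist[i][j] = dist[j][i] = total / p
    return dist


rows = [(20.0, "a"), (30.0, "a"), (60.0, "b")]
matrix = gower_distance_matrix(rows, continuous={0})
# matrix[0][1] is (10/40 + 0) / 2 = 0.125
# matrix[0][2] is (40/40 + 1) / 2 = 1.0
```

The matrix is symmetric with a zero diagonal, which is the form typically passed on to clustering routines.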

References
