Distances and Similarities between Probability Density Functions

This function computes the distance/dissimilarity between two probability density functions.

distance(
  x,
  method = "euclidean",
  p = NULL,
  test.na = TRUE,
  unit = "log",
  epsilon = 1e-05,
  est.prob = NULL,
  use.row.names = FALSE,
  as.dist.obj = FALSE,
  diag = FALSE,
  upper = FALSE,
  mute.message = FALSE
)

Arguments

x

a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).

method

a character string indicating the distance measure that should be computed.

p

power of the Minkowski distance.

test.na

a boolean value indicating whether input vectors should be tested for NA values. Faster computations if test.na = FALSE.

unit

a character string specifying the logarithm unit that should be used to compute distances that depend on log computations.

epsilon

a small value to address cases in the distance computation where division by zero occurs. In these cases, x / 0 or 0 / 0 will be replaced by epsilon. The default is epsilon = 0.00001. However, we recommend choosing a custom epsilon value depending on the size of the input vectors, the expected similarity between the compared probability density functions, and whether many 0 values are present within the compared vectors. As a rough rule of thumb, when dealing with very large input vectors that are very similar and contain many 0 values, epsilon should be set even smaller (e.g. epsilon = 0.000000001), whereas for small vectors or very divergent distributions larger epsilon values may also be appropriate (e.g. epsilon = 0.01). Addressing this epsilon issue is important to avoid cases where distance metrics return negative values, which are not defined and arise only as a technical artifact of computing the x / 0 or 0 / 0 cases.

est.prob

method to estimate probabilities from input count vectors (i.e. non-probability vectors). Default: est.prob = NULL. Options are:

est.prob = "empirical": The relative frequencies of each vector are computed internally. For example, an input matrix rbind(1:10, 11:20) will be transformed to the probability matrix rbind(1:10 / sum(1:10), 11:20 / sum(11:20)).

use.row.names

a logical value indicating whether or not row names from the input matrix shall be used as rownames and colnames of the output distance matrix. Default value is use.row.names = FALSE.

as.dist.obj

shall the return value or matrix be an object of class dist (see stats::dist)? Default is as.dist.obj = FALSE.

diag

if as.dist.obj = TRUE, then this value indicates whether the diagonal of the distance matrix should be printed. Default is diag = FALSE.

upper

if as.dist.obj = TRUE, then this value indicates whether the upper triangle of the distance matrix should be printed. Default is upper = FALSE.

mute.message

a logical value indicating whether or not messages printed by distance shall be muted. Default is mute.message = FALSE.

Value

The following results are returned depending on the dimension of x:

in case nrow(x) = 2 : a single distance value.
in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.
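To make the two return shapes concrete, the following base R sketch builds the pairwise result by hand for the Euclidean case. It does not call distance(); the helpers euclid and pairwise_matrix are hypothetical names introduced only for illustration, and the v1, v2, ... labels merely mimic the default labels shown in the Examples section.

```r
# Hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(p, q) sqrt(sum((p - q)^2))

# Hypothetical helper: apply a two-vector function to every pair of rows
pairwise_matrix <- function(x, f) {
  n <- nrow(x)
  labels <- paste0("v", seq_len(n))
  out <- matrix(0, n, n, dimnames = list(labels, labels))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- f(x[i, ], x[j, ])
    }
  }
  out
}

ProbMatrix <- rbind(1:10 / sum(1:10), 20:29 / sum(20:29), 30:39 / sum(30:39))

# Two rows: a single distance value
single_value <- euclid(ProbMatrix[1, ], ProbMatrix[2, ])

# More than two rows: a symmetric matrix with a zero diagonal
dist_matrix <- pairwise_matrix(ProbMatrix, euclid)

# Base R 'dist' object; diag and upper only control how it is printed
dist_obj <- stats::as.dist(dist_matrix, diag = FALSE, upper = FALSE)
```

The last line uses the base R conversion stats::as.dist, whose diag and upper arguments play the same printing role as the arguments of the same name described above.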

Details

Here a distance is defined as a quantitative degree of how far two mathematical objects are apart from each other (Cha, 2007).

This function implements the following distance/similarity measures to quantify the distance between probability density functions:

L_p Minkowski family
- Euclidean : \(d = sqrt( \sum | P_i - Q_i |^2)\)
- Manhattan : \(d = \sum | P_i - Q_i |\)
- Minkowski : \(d = ( \sum | P_i - Q_i |^p)^{1/p}\)
- Chebyshev : \(d = max | P_i - Q_i |\)
L_1 family
- Sorensen : \(d = \sum | P_i - Q_i | / \sum (P_i + Q_i)\)
- Gower : \(d = 1/n * \sum | P_i - Q_i |\), where n is the number of elements in each vector
- Soergel : \(d = \sum | P_i - Q_i | / \sum max(P_i , Q_i)\)
- Kulczynski d : \(d = \sum | P_i - Q_i | / \sum min(P_i , Q_i)\)
- Canberra : \(d = \sum | P_i - Q_i | / (P_i + Q_i)\)
- Lorentzian : \(d = \sum ln(1 + | P_i - Q_i |)\)
Intersection family
- Intersection : \(s = \sum min(P_i , Q_i)\)
- Non-Intersection : \(d = 1 - \sum min(P_i , Q_i)\)
- Wave Hedges : \(d = \sum | P_i - Q_i | / max(P_i , Q_i)\)
- Czekanowski : \(d = \sum | P_i - Q_i | / \sum | P_i + Q_i |\)
- Motyka : \(d = \sum min(P_i , Q_i) / (P_i + Q_i)\)
- Kulczynski s : \(s = 1 / ( \sum | P_i - Q_i | / \sum min(P_i , Q_i) )\)
- Tanimoto : \(d = \sum (max(P_i , Q_i) - min(P_i , Q_i)) / \sum max(P_i , Q_i)\) ; equivalent to Soergel
- Ruzicka : \(s = \sum min(P_i , Q_i) / \sum max(P_i , Q_i)\) ; equivalent to 1 - Tanimoto = 1 - Soergel
Inner Product family
- Inner Product : \(s = \sum P_i * Q_i\)
- Harmonic mean : \(s = 2 * \sum (P_i * Q_i) / (P_i + Q_i)\)
- Cosine : \(s = \sum (P_i * Q_i) / ( sqrt(\sum P_i^2) * sqrt(\sum Q_i^2) )\)
- Kumar-Hassebrook (PCE) : \(s = \sum (P_i * Q_i) / (\sum P_i^2 + \sum Q_i^2 - \sum (P_i * Q_i))\)
- Jaccard : \(d = 1 - \sum (P_i * Q_i) / (\sum P_i^2 + \sum Q_i^2 - \sum (P_i * Q_i))\) ; equivalent to 1 - Kumar-Hassebrook
- Dice : \(d = \sum (P_i - Q_i)^2 / (\sum P_i^2 + \sum Q_i^2)\)
Squared-chord family
- Fidelity : \(s = \sum sqrt(P_i * Q_i)\)
- Bhattacharyya : \(d = - ln \sum sqrt(P_i * Q_i)\)
- Hellinger : \(d = 2 * sqrt( 1 - \sum sqrt(P_i * Q_i))\)
- Matusita : \(d = sqrt( 2 - 2 * \sum sqrt(P_i * Q_i))\)
- Squared-chord : \(d = \sum ( sqrt(P_i) - sqrt(Q_i) )^2\)
Squared L_2 family (\(\chi^2\) squared family)
- Squared Euclidean : \(d = \sum ( P_i - Q_i )^2\)
- Pearson \(\chi^2\) : \(d = \sum ( (P_i - Q_i )^2 / Q_i )\)
- Neyman \(\chi^2\) : \(d = \sum ( (P_i - Q_i )^2 / P_i )\)
- Squared \(\chi^2\) : \(d = \sum ( (P_i - Q_i )^2 / (P_i + Q_i) )\)
- Probabilistic Symmetric \(\chi^2\) : \(d = 2 * \sum ( (P_i - Q_i )^2 / (P_i + Q_i) )\)
- Divergence : \(d = 2 * \sum ( (P_i - Q_i )^2 / (P_i + Q_i)^2 )\)
- Clark : \(d = sqrt ( \sum (| P_i - Q_i | / (P_i + Q_i))^2 )\)
- Additive Symmetric \(\chi^2\) : \(d = \sum ( ((P_i - Q_i)^2 * (P_i + Q_i)) / (P_i * Q_i) )\)
Shannon's entropy family
- Kullback-Leibler : \(d = \sum P_i * log(P_i / Q_i)\)
- Jeffreys : \(d = \sum (P_i - Q_i) * log(P_i / Q_i)\)
- K divergence : \(d = \sum P_i * log(2 * P_i / (P_i + Q_i))\)
- Topsoe : \(d = \sum ( P_i * log(2 * P_i / (P_i + Q_i)) + Q_i * log(2 * Q_i / (P_i + Q_i)) )\)
- Jensen-Shannon : \(d = 0.5 * ( \sum P_i * log(2 * P_i / (P_i + Q_i)) + \sum Q_i * log(2 * Q_i / (P_i + Q_i)) )\)
- Jensen difference : \(d = \sum ( (P_i * log(P_i) + Q_i * log(Q_i)) / 2 - ((P_i + Q_i) / 2) * log((P_i + Q_i) / 2) )\)
Combinations
- Taneja : \(d = \sum ((P_i + Q_i) / 2) * log( (P_i + Q_i) / (2 * sqrt(P_i * Q_i)) )\)
- Kumar-Johnson : \(d = \sum (P_i^2 - Q_i^2)^2 / (2 * (P_i * Q_i)^{1.5})\)
- Avg(L_1, L_n) : \(d = ( \sum | P_i - Q_i | + max | P_i - Q_i | ) / 2\)
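
A few of the formulas listed above can be spot-checked directly in base R. The sketch below is illustrative only and does not call distance(); all variable names are assumptions made for this example. The input vectors contain no zeros, so none of the zero-handling conventions are involved.

```r
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# L_p Minkowski family
euclidean <- sqrt(sum(abs(P - Q)^2))
manhattan <- sum(abs(P - Q))
minkowski <- function(p) sum(abs(P - Q)^p)^(1 / p)
chebyshev <- max(abs(P - Q))

# Soergel (a distance) and Ruzicka (a similarity)
soergel <- sum(abs(P - Q)) / sum(pmax(P, Q))
ruzicka <- sum(pmin(P, Q)) / sum(pmax(P, Q))

# Shannon's entropy family, using the natural logarithm
PQ <- P + Q
topsoe <- sum(P * log(2 * P / PQ) + Q * log(2 * Q / PQ))
jensen_shannon <- 0.5 * (sum(P * log(2 * P / PQ)) + sum(Q * log(2 * Q / PQ)))

# Relations that follow from the formulas
all.equal(minkowski(2), euclidean)     # Minkowski with p = 2 is Euclidean
all.equal(minkowski(1), manhattan)     # Minkowski with p = 1 is Manhattan
all.equal(ruzicka, 1 - soergel)        # Ruzicka = 1 - Soergel
all.equal(topsoe, 2 * jensen_shannon)  # Topsoe is twice Jensen-Shannon
```

For these two vectors the Euclidean value agrees with the 0.1280713 printed in the Examples section.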
In cases where x specifies a count matrix, the argument est.prob can be set to first estimate probability vectors from the input count vectors and then compute the corresponding distance measure based on the estimated probability vectors.

The following probability estimation methods are implemented in this function:
- est.prob = "empirical" : relative frequencies of counts.
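
As a minimal sketch of what relative-frequency estimation means, the following base R lines normalise each row of a count matrix by its own total. This illustrates the transformation described for est.prob = "empirical" and is not the package's internal code; the variable names are chosen for this example.

```r
CountMatrix <- rbind(1:10, 11:20)

# rowSums() has one entry per row and is recycled down the columns,
# so every element of row i is divided by the total of row i
ProbMatrix <- CountMatrix / rowSums(CountMatrix)

# Each row now sums to 1
rowSums(ProbMatrix)

# Base R offers the same row-wise normalisation via prop.table()
all.equal(ProbMatrix, prop.table(CountMatrix, margin = 1))
```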

Note

According to the reference (Cha, 2007), invalid computations can occur in some distance measures when dealing with 0 probabilities.

In these cases the following conventions are applied:

division by zero - case 0/0: when both the divisor and the dividend are zero, 0/0 is treated as 0.
division by zero - case n/0: when only the divisor is 0, the corresponding 0 is replaced by a small \(\epsilon = 0.00001\).
log of zero - case 0 * log(0): is treated as 0.
log of zero - case log(0): zero is replaced by a small \(\epsilon = 0.00001\).
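
The conventions above can be sketched for the Kullback-Leibler formula as follows. This is an illustration under stated assumptions rather than the package's implementation: kl_with_conventions is a hypothetical helper, and the way the cases are ordered here is one possible reading of the rules.

```r
# Hypothetical helper: KL divergence with explicit zero handling
kl_with_conventions <- function(P, Q, epsilon = 1e-05) {
  terms <- numeric(length(P))
  for (i in seq_along(P)) {
    if (P[i] == 0) {
      # 0 * log(0 / Q_i), including the 0/0 case: contribution treated as 0
      terms[i] <- 0
    } else if (Q[i] == 0) {
      # n/0: the zero divisor is replaced by epsilon
      terms[i] <- P[i] * log(P[i] / epsilon)
    } else {
      terms[i] <- P[i] * log(P[i] / Q[i])
    }
  }
  sum(terms)
}

P <- c(0.5, 0.5, 0.0)
Q <- c(0.5, 0.0, 0.5)

# Without any guard, the zero entries produce Inf and NaN terms
naive <- sum(P * log(P / Q))

# With the conventions, the result is finite: 0.5 * log(0.5 / 1e-05)
guarded <- kl_with_conventions(P, Q)
```

Because the n/0 term scales with log(1 / epsilon), the chosen epsilon directly affects the result whenever such terms occur, which is why the choice of epsilon matters for vectors with many zeros.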

References

Sung-Hyuk Cha. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences 4: 1.

Author

Hajk-Georg Drost

Examples

# Simple Examples

# retrieve a list of implemented probability distance measures
getDistMethods()
#>  [1] "euclidean"         "manhattan"         "minkowski"        
#>  [4] "chebyshev"         "sorensen"          "gower"            
#>  [7] "soergel"           "kulczynski_d"      "canberra"         
#> [10] "lorentzian"        "intersection"      "non-intersection" 
#> [13] "wavehedges"        "czekanowski"       "motyka"           
#> [16] "kulczynski_s"      "tanimoto"          "ruzicka"          
#> [19] "inner_product"     "harmonic_mean"     "cosine"           
#> [22] "hassebrook"        "jaccard"           "dice"             
#> [25] "fidelity"          "bhattacharyya"     "hellinger"        
#> [28] "matusita"          "squared_chord"     "squared_euclidean"
#> [31] "pearson"           "neyman"            "squared_chi"      
#> [34] "prob_symm"         "divergence"        "clark"            
#> [37] "additive_symm"     "kullback-leibler"  "jeffreys"         
#> [40] "k_divergence"      "topsoe"            "jensen-shannon"   
#> [43] "jensen_difference" "taneja"            "kumar-johnson"    
#> [46] "avg"              

## compute the euclidean distance between two probability vectors
distance(rbind(1:10/sum(1:10), 20:29/sum(20:29)), method = "euclidean")
#> Metric: 'euclidean'; comparing: 2 vectors.
#> euclidean 
#> 0.1280713 

## compute the euclidean distance between all pairwise comparisons of probability vectors
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
distance(ProbMatrix, method = "euclidean")
#> Metric: 'euclidean'; comparing: 3 vectors.
#>           v1         v2         v3
#> v1 0.0000000 0.12807130 0.13881717
#> v2 0.1280713 0.00000000 0.01074588
#> v3 0.1388172 0.01074588 0.00000000

# compute distance matrix without testing for NA values in the input matrix
distance(ProbMatrix, method = "euclidean", test.na = FALSE)
#> Metric: 'euclidean'; comparing: 3 vectors.
#>           v1         v2         v3
#> v1 0.0000000 0.12807130 0.13881717
#> v2 0.1280713 0.00000000 0.01074588
#> v3 0.1388172 0.01074588 0.00000000

# alternatively use the rownames of the input data for the rownames and colnames
# of the output distance matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
#> Metric: 'euclidean'; comparing: 3 vectors.
#>           Example1   Example2   Example3
#> Example1 0.0000000 0.12807130 0.13881717
#> Example2 0.1280713 0.00000000 0.01074588
#> Example3 0.1388172 0.01074588 0.00000000

# Specialized Examples

CountMatrix <- rbind(1:10, 20:29, 30:39)

## estimate probabilities from a count matrix
distance(CountMatrix, method = "euclidean", est.prob = "empirical")
#> Metric: 'euclidean'; comparing: 3 vectors.
#>           v1         v2         v3
#> v1 0.0000000 0.12807130 0.13881717
#> v2 0.1280713 0.00000000 0.01074588
#> v3 0.1388172 0.01074588 0.00000000

## compute the euclidean distance for count data
## NOTE: some distance measures are only defined for probability values.
distance(CountMatrix, method = "euclidean")
#> Metric: 'euclidean'; comparing: 3 vectors.
#>          v1       v2       v3
#> v1  0.00000 60.08328 91.70605
#> v2 60.08328  0.00000 31.62278
#> v3 91.70605 31.62278  0.00000

## compute the Kullback-Leibler Divergence with different logarithm bases:
### case: unit = log (Default)
distance(ProbMatrix, method = "kullback-leibler", unit = "log")
#> Metric: 'kullback-leibler' using unit: 'log'; comparing: 3 vectors.
#>            v1           v2           v3
#> v1 0.00000000 0.0965296706 0.1111323599
#> v2 0.09652967 0.0000000000 0.0005867893
#> v3 0.11113236 0.0005867893 0.0000000000

### case: unit = log2
distance(ProbMatrix, method = "kullback-leibler", unit = "log2")
#> Metric: 'kullback-leibler' using unit: 'log2'; comparing: 3 vectors.
#>           v1           v2           v3
#> v1 0.0000000 0.1392628771 0.1603301045
#> v2 0.1392629 0.0000000000 0.0008465581
#> v3 0.1603301 0.0008465581 0.0000000000

### case: unit = log10
distance(ProbMatrix, method = "kullback-leibler", unit = "log10")
#> Metric: 'kullback-leibler' using unit: 'log10'; comparing: 3 vectors.
#>            v1           v2           v3
#> v1 0.00000000 0.0419223033 0.0482641707
#> v2 0.04192230 0.0000000000 0.0002548394
#> v3 0.04826417 0.0002548394 0.0000000000
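
The three Kullback-Leibler results above differ only by a constant factor, because changing the logarithm base rescales every term. The base R sketch below checks this for the first two rows of ProbMatrix without calling distance(); the variable names are assumptions made for this example.

```r
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Kullback-Leibler divergence of P from Q in three logarithm units
kl_log   <- sum(P * log(P / Q))
kl_log2  <- sum(P * log2(P / Q))
kl_log10 <- sum(P * log10(P / Q))

# Changing the unit divides the natural-log result by log(base)
all.equal(kl_log2, kl_log / log(2))
all.equal(kl_log10, kl_log / log(10))
```

These values should correspond to the v1/v2 entries of the three matrices printed above.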