`distance()`

The `distance()`

function implemented in `philentropy`

is able to compute 46 different distances/similarities between probability density functions (see `?philentropy::distance`

for details).

The `distance()`

function is implemented using the same *logic* as R’s base functions `stats::dist()`

and takes a `matrix`

or `data.frame`

as input. The corresponding `matrix`

or `data.frame`

should store probability density functions (as rows) for which distance computations should be performed.

# define a probability density function P P <- 1:10/sum(1:10) # define a probability density function Q Q <- 20:29/sum(20:29) # combine P and Q as matrix object x <- rbind(P,Q)

Please note that when defining a `matrix`

from vectors, probability vectors should be combined as rows (`rbind()`

).

library(philentropy) # compute the Euclidean Distance with default parameters distance(x, method = "euclidean")

```
euclidean
0.1280713
```

For this simple case you can compare the results with R’s base function to compute the euclidean distance `stats::dist()`

.

# compute the Euclidean Distance using R's base function stats::dist(x, method = "euclidean")

```
P
Q 0.1280713
```

However, the R base function `stats::dist()`

only computes the following distance measures: `"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"`

, whereas `distance()`

allows you to choose from 46 distance/similarity measures.

To find out which `method`

s are implemented in `distance()`

you can consult the `getDistMethods()`

function.

# names of implemented distance/similarity functions getDistMethods()

```
[1] "euclidean" "manhattan" "minkowski" "chebyshev"
[5] "sorensen" "gower" "soergel" "kulczynski_d"
[9] "canberra" "lorentzian" "intersection" "non-intersection"
[13] "wavehedges" "czekanowski" "motyka" "kulczynski_s"
[17] "tanimoto" "ruzicka" "inner_product" "harmonic_mean"
[21] "cosine" "hassebrook" "jaccard" "dice"
[25] "fidelity" "bhattacharyya" "hellinger" "matusita"
[29] "squared_chord" "squared_euclidean" "pearson" "neyman"
[33] "squared_chi" "prob_symm" "divergence" "clark"
[37] "additive_symm" "kullback-leibler" "jeffreys" "k_divergence"
[41] "topsoe" "jensen-shannon" "jensen_difference" "taneja"
[45] "kumar-johnson" "avg"
```

Now you can choose any distance/similarity `method`

that serves you.

# compute the Jaccard Distance with default parameters distance(x, method = "jaccard")

```
jaccard
0.133869
```

Analogously, in case a probability matrix is specified the following output is generated.

# combine three probabilty vectors to a probabilty matrix ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39)) rownames(ProbMatrix) <- paste0("Example", 1:3) # compute the euclidean distance between all # pairwise comparisons of probability vectors distance(ProbMatrix, method = "euclidean")

```
#> Metric: 'euclidean'; comparing: 3 vectors.
v1 v2 v3
v1 0.0000000 0.12807130 0.13881717
v2 0.1280713 0.00000000 0.01074588
v3 0.1388172 0.01074588 0.00000000
```

Alternatively, users can specify the argument `use.row.names = TRUE`

to maintain the rownames of the input matrix and pass them as rownames and colnames to the output distance matrix.

# compute the euclidean distance between all # pairwise comparisons of probability vectors distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)

```
#> Metric: 'euclidean'; comparing: 3 vectors.
Example1 Example2 Example3
Example1 0.0000000 0.12807130 0.13881717
Example2 0.1280713 0.00000000 0.01074588
Example3 0.1388172 0.01074588 0.00000000
```

This output differs from the output of `stats::dist()`

.

# compute the euclidean distance between all # pairwise comparisons of probability vectors # using stats::dist() stats::dist(ProbMatrix, method = "euclidean")

```
1 2
2 0.12807130
3 0.13881717 0.01074588
```

Whereas `distance()`

returns a symmetric distance matrix, `stats::dist()`

returns only one part of the symmetric matrix.

However, users can also specify the argument `as.dist.obj = TRUE`

in `philentropy::distance()`

to retrieve a `philentropy::distance()`

output which is an object of type `stats::dist()`

.

ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39)) rownames(ProbMatrix) <- paste0("test", 1:3) distance(ProbMatrix, method = "euclidean", use.row.names = TRUE, as.dist.obj = TRUE)

```
Metric: 'euclidean'; comparing: 3 vectors.
test1 test2
test2 0.12807130
test3 0.13881717 0.01074588
```

Now let’s compare the run times of base R and `philentropy`

. For this purpose you need to install the `microbenchmark`

package.

# install.packages("microbenchmark") library(microbenchmark) microbenchmark( distance(x,method = "euclidean", test.na = FALSE), dist(x,method = "euclidean"), euclidean(x[1 , ], x[2 , ], FALSE) )

```
Unit: microseconds
expr min lq mean median uq max neval
distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096 100
dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130 100
euclidean(x[1, ], x[2, ], FALSE) 4.329 4.9605 5.72378 5.4815 6.1240 22.510 100
```

As you can see, although the `distance()`

function is quite fast, the internal checks cause it to be 2x slower than the base `dist()`

function (for the `euclidean`

example). Nevertheless, in case you need to implement a faster version of the corresponding distance measure you can type `philentropy::`

and then `TAB`

allowing you to select the base distance computation functions (written in C++), e.g. `philentropy::euclidean()`

which is almost 3x faster than the base `dist()`

function.

The advantage of `distance()`

is that it implements 46 distance measures based on base C++ functions that can be accessed individually by typing `philentropy::`

and then `TAB`

. In future versions of `philentropy`

I will optimize the `distance()`

function so that internal checks for data type correctness and correct input data will take less termination time than the base `dist()`

function.

The vast amount of available similarity metrics raises the immediate question which metric should be used for which application. Here, I will review the origin of each individual metric and will discuss the most recent literature that aims to compare these measures. I hope that users will find valuable insights and might be stimulated to conduct their own comparative research since this is a field of ongoing research.

The euclidean distance is (named after Euclid) a straight line distance between two points. Euclid argued that that the **shortest** distance between two points is always a line.

\(d = \sqrt{\sum_{i = 1}^N | P_i - Q_i |^2)}\)