vignettes/Many_Distances.Rmd
The philentropy package has several mechanisms to calculate distances between probability density functions. The main one is the distance() function, which computes 46 different distances/similarities between probability density functions (see ?philentropy::distance and a companion vignette for details).
Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the euclidean() function computes the Euclidean distance, while jaccard() computes the Jaccard distance. The complete list of available distance measures is returned by the philentropy::getDistMethods() function.
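As a minimal sketch of the direct-call approach, the snippet below uses two small, made-up probability vectors (they are illustrative assumptions, not data from this vignette) and passes them to euclidean() and jaccard(). The third argument, testNA, controls whether the inputs are checked for missing values.

```r
# Two hypothetical probability vectors; each sums to 1.
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.4, 0.3, 0.2, 0.1)

# Call individual distance functions directly.
d_euclid  <- philentropy::euclidean(P, Q, testNA = FALSE)
d_jaccard <- philentropy::jaccard(P, Q, testNA = FALSE)

# Names of all distance measures accepted by distance().
methods <- philentropy::getDistMethods()
head(methods)
```

The names returned by getDistMethods() are the strings accepted by the method argument of distance().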
Both of the above approaches have their pros and cons. The distance() function is more flexible, as it allows users to use any distance measure and can return either a matrix or a dist object. It also implements several defensive programming checks, and thus it is more appropriate for regular users. Single distance functions, such as euclidean() or jaccard(), can, on the other hand, be slightly faster, as they directly call the underlying C++ code.
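The two return types of distance() can be illustrated roughly as follows. The input matrix is made up for this sketch, and the as.dist.obj argument is used on the assumption that the installed philentropy version provides it; check ?philentropy::distance if the call fails.

```r
# Three hypothetical probability vectors stored as rows of a matrix.
x <- rbind(
  c(0.10, 0.20, 0.30, 0.40),
  c(0.40, 0.30, 0.20, 0.10),
  c(0.25, 0.25, 0.25, 0.25)
)

# Default: a square matrix of pairwise distances.
m <- philentropy::distance(x, method = "euclidean", mute.message = TRUE)

# Assumed option: return a dist object instead of a matrix.
d <- philentropy::distance(x, method = "euclidean",
                           as.dist.obj = TRUE, mute.message = TRUE)

class(m)
class(d)
```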
Now, we introduce three new low-level functions that are intermediaries between distance() and single distance functions. They are fairly flexible, allowing the use of any implemented distance measure, but are also usually faster than calling the distance() function (especially when it is called many times). These functions are:
dist_one_one() - expects two vectors (probability density functions), returns a single value
dist_one_many() - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
dist_many_many() - expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the philentropy package.
library(philentropy)
dist_one_one()
dist_one_one() is a lower-level equivalent to distance(). However, instead of accepting a numeric data.frame or matrix, it expects two vectors representing probability density functions. In this example, we create two vectors, P and Q.
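The definitions of P and Q are not shown in this text. A minimal sketch, assuming two integer sequences normalized to sum to 1 (the specific values are an assumption made for illustration), could look like this:

```r
# Hypothetical example vectors (assumed values).
# Dividing by the total makes each vector sum to 1.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

sum(P)  # both sums equal 1 up to floating-point error
sum(Q)

# Euclidean distance computed by hand, as a reference value.
ref <- sqrt(sum((P - Q)^2))
ref
```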
To calculate the Euclidean distance between them we can use several approaches: (a) the built-in R dist() function, (b) philentropy::distance(), (c) philentropy::euclidean(), or (d) the new dist_one_one().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
dist(rbind(P, Q), method = "euclidean"),
distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
euclidean(P, Q, FALSE),
dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
## Warning in microbenchmark::microbenchmark(dist(rbind(P, Q), method =
## "euclidean"), : less accurate nanosecond times to avoid potential integer
## overflows
## Unit: nanoseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 5986 6396 8233.62 6724.0 7113.5 108691 100
## 9266 9922 18894.85 10475.5 11931.0 783756 100
## 820 902 1150.87 943.0 1086.5 3854 100
## 1189 1271 1756.03 1353.0 1496.5 19680 100
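All of them return the same single value. However, as the benchmark above shows, there is a trade-off between flexibility and speed: by median time, euclidean() is the fastest (about 0.9 microseconds) and dist_one_one() is close behind (about 1.4 microseconds), while dist() and distance() take about 6.7 and 10.5 microseconds, respectively.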
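One way to confirm that the four approaches agree is to collect their results and compare them. This is a sketch only; P and Q are assumed values, since their original definitions are not shown here.

```r
library(philentropy)

# Assumed example vectors.
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Collect the result of each approach as a plain number.
results <- c(
  dist         = as.numeric(dist(rbind(P, Q), method = "euclidean")),
  distance     = as.numeric(distance(rbind(P, Q), method = "euclidean",
                                     test.na = FALSE, mute.message = TRUE)),
  euclidean    = euclidean(P, Q, FALSE),
  dist_one_one = dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
results

# TRUE if every value matches the first one within a small tolerance.
all_equal <- all(abs(results - results[1]) < 1e-10)
all_equal
```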
dist_one_many()
The role of dist_one_many() is to calculate distances between one probability density function (in the form of a vector) and a set of probability density functions (as rows in a matrix).
First, let’s create our example data. P is our input vector and M is our input matrix.
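The code that creates P and M is not shown in this text. The sketch below is modeled on the M1 and M2 construction used later in this document; the number of rows (100) is an assumption made for illustration.

```r
# The seed expression is arithmetic: 2020 - 8 - 20 evaluates to 1992.
set.seed(2020-08-20)

# Input vector: a probability density function that sums to 1.
P <- 1:10 / sum(1:10)

# Input matrix: each row is a random permutation of 1:10 divided by 55
# (the sum of 1:10), so every row also sums to 1.
M <- t(replicate(100, sample(1:10, size = 10) / 55))

dim(M)
rowSums(M)[1:3]
```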
Distances between the P vector and the probability density functions in M can be calculated using several approaches. For example, we could write a for loop (which requires writing new code) or just use the existing distance() function and extract only one row (or column) from the results. dist_one_many() allows for this calculation directly, as it goes through each row in M and calculates the given distance measure between P and the values in that row.
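The for loop alternative mentioned above might be sketched as follows, with one dist_one_one() call per row of M, and then compared with a single dist_one_many() call. The data here are assumed values created only for this illustration.

```r
library(philentropy)

# Assumed example data.
set.seed(1)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

# Hand-written loop: one dist_one_one() call per row of M.
loop_result <- numeric(nrow(M))
for (i in seq_len(nrow(M))) {
  loop_result[i] <- dist_one_one(P, M[i, ], method = "euclidean", testNA = FALSE)
}

# Same calculation in a single call.
direct_result <- dist_one_many(P, M, method = "euclidean", testNA = FALSE)

# TRUE if both approaches give the same distances.
same <- isTRUE(all.equal(loop_result, as.numeric(direct_result)))
same
```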
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 130.831 148.1125 160.26367 157.9935 167.690 263.548 100
## 9016.515 9735.1630 11155.15536 10312.8735 12742.677 14336.511 100
## 9.512 10.3730 12.67269 11.8285 13.858 44.526 100
dist_one_many() returns a vector of values. In this case, it is much faster than distance() (roughly 870 times by median time) and visibly faster than dist() (about 13 times), while allowing many more distance measures to be used.
dist_many_many()
dist_many_many() calculates distances between two sets of probability density functions (as rows in two matrix objects).
Let’s create two new example matrices.
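# The seed expression is arithmetic: 2020 - 8 - 20 evaluates to 1992
set.seed(2020-08-20)
# Each row is a random permutation of 1:10 divided by 55 (the sum of 1:10)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))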
M1 is our first input matrix and M2 is our second input matrix. I am not aware of any function built into R that calculates distances between the rows of two matrices, and thus, to solve this problem, we can create our own, many_dists()…
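many_dists <- function(m1, m2) {
  # Result matrix: one row per row of m1, one column per row of m2
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      # Stack the two rows and compute their pairwise distance
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}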
… and compare it to dist_many_many().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
many_dists(M1, M2),
dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr min lq
## many_dists(M1, M2) 958.990 994.5575
## dist_many_many(M1, M2, method = "euclidean", testNA = FALSE) 14.104 14.8830
## mean median uq max neval
## 1181.08126 1013.458 1081.375 4497.126 100
## 16.36269 15.375 16.851 37.843 100
Both many_dists() and dist_many_many() return a matrix. The above benchmark shows that dist_many_many() is roughly 65 times faster (by median time) than our custom many_dists() approach.
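Beyond timing, it is worth checking that the two approaches produce the same numbers. The sketch below repeats the data and the loop-based helper so that it runs on its own, then compares the two result matrices element by element.

```r
library(philentropy)

set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

# Loop-based helper with the same logic as many_dists() above.
many_dists <- function(m1, m2) {
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}

res_loop   <- many_dists(M1, M2)
res_direct <- dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)

dim(res_direct)

# Largest absolute difference between the two result matrices.
max_diff <- max(abs(res_loop - res_direct))
max_diff
```

Element [i, j] of either matrix is the distance between row i of M1 and row j of M2.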