Distances and Similarities between Many Probability Density Functions

This function computes the distance/dissimilarity between two sets of probability density functions.

dist_many_many(
  dists1,
  dists2,
  method,
  p = NA_real_,
  testNA = TRUE,
  unit = "log",
  epsilon = 1e-05
)

Arguments

dists1

a numeric matrix storing distributions in its rows.

dists2

a numeric matrix storing distributions in its rows.

method

a character string indicating the distance measure that should be computed.

p

power of the Minkowski distance.
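
For orientation, the Minkowski distance of order p between two vectors x and y is commonly defined as (sum(|x_i - y_i|^p))^(1/p). The base R sketch below implements that textbook formula only; the helper name minkowski_sketch is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper implementing the textbook Minkowski distance.
# It is an illustration, not the package's implementation.
minkowski_sketch <- function(x, y, p) {
  stopifnot(length(x) == length(y), p >= 1)
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(0.1, 0.2, 0.3, 0.4)
y <- c(0.4, 0.3, 0.2, 0.1)

d1 <- minkowski_sketch(x, y, p = 1)  # p = 1 reduces to the Manhattan distance
d2 <- minkowski_sketch(x, y, p = 2)  # p = 2 reduces to the Euclidean distance
```

Larger values of p give more weight to the coordinates with the largest absolute differences.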

testNA

a logical value indicating whether or not distributions shall be checked for NA values.

unit

type of log function. Options are

unit = "log"
unit = "log2"
unit = "log10"
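
The unit only selects the base of the logarithm, so for log-based measures the results under different units differ by a constant factor. The base R sketch below illustrates this with a hand-written Kullback-Leibler-style sum; log_by_unit is a hypothetical helper introduced for this illustration, not a package function.

```r
# Hypothetical helper mapping the `unit` string to a base R logarithm.
log_by_unit <- function(unit) {
  switch(unit,
    log   = log,    # natural logarithm
    log2  = log2,   # base 2
    log10 = log10,  # base 10
    stop("unit must be \"log\", \"log2\" or \"log10\"")
  )
}

# Two small distributions without zeros, so no special handling is needed.
p <- c(0.5, 0.25, 0.25)
q <- c(0.25, 0.25, 0.5)

# Kullback-Leibler-style sum, sum(p * log(p / q)), under each unit.
kl_nat  <- sum(p * log_by_unit("log")(p / q))
kl_bits <- sum(p * log_by_unit("log2")(p / q))
kl_10   <- sum(p * log_by_unit("log10")(p / q))

# The values differ only by the change-of-base factor.
all.equal(kl_nat, kl_bits * log(2))
```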

epsilon

a small value used to handle cases in the distance computation where division by zero occurs. In these cases, x / 0 or 0 / 0 is replaced by epsilon. The default is epsilon = 0.00001. However, we recommend choosing a custom epsilon value depending on the size of the input vectors, the expected similarity between the compared probability density functions, and whether many 0 values are present in the compared vectors. As a rough rule of thumb, when the input vectors are very large, very similar, and contain many 0 values, epsilon should be set smaller (e.g. epsilon = 0.000000001), whereas when the vectors are small or the distributions are very divergent, larger epsilon values may be appropriate (e.g. epsilon = 0.01). Choosing epsilon carefully is important to avoid cases where distance metrics return negative values, which are not defined and arise only from the technical handling of x / 0 or 0 / 0 cases.
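
The replacement rule described above can be pictured with a small base R sketch. It is a simplified reading of the text, not the package's internal code: safe_ratio is a hypothetical helper that substitutes epsilon whenever a quotient is infinite (x / 0) or undefined (0 / 0).

```r
# Hypothetical helper: element-wise x / y, with non-finite results
# replaced by epsilon. Written for illustration only.
safe_ratio <- function(x, y, epsilon = 1e-05) {
  r <- x / y                    # in R, x / 0 gives Inf and 0 / 0 gives NaN
  r[!is.finite(r)] <- epsilon   # replace Inf, -Inf and NaN
  r
}

p <- c(0.5, 0.5, 0.0)
q <- c(0.5, 0.0, 0.0)

r_default <- safe_ratio(p, q)                   # entries 2 and 3 become 1e-05
r_small   <- safe_ratio(p, q, epsilon = 1e-09)  # entries 2 and 3 become 1e-09
```

Because every affected term takes the value of epsilon, inputs with many zeros are more sensitive to the chosen value than dense inputs.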

Value

A numeric matrix of distance values, one for each pair formed by a row of dists1 and a row of dists2.
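
As a rough picture of such a result, the base R sketch below builds a pairwise Euclidean distance matrix by hand. It assumes that entry [i, j] holds the distance between row i of the first matrix and row j of the second; that layout and the helper name pairwise_euclidean are assumptions made for illustration, not statements about the package's implementation.

```r
# Hypothetical helper: Euclidean distance between every row of `dists1`
# and every row of `dists2`. Illustration only.
pairwise_euclidean <- function(dists1, dists2) {
  stopifnot(ncol(dists1) == ncol(dists2))
  out <- matrix(NA_real_, nrow = nrow(dists1), ncol = nrow(dists2))
  for (i in seq_len(nrow(dists1))) {
    for (j in seq_len(nrow(dists2))) {
      out[i, j] <- sqrt(sum((dists1[i, ] - dists2[j, ])^2))
    }
  }
  out
}

set.seed(1)
A <- t(replicate(3, sample(1:10, size = 10) / 55))  # 3 distributions in rows
B <- t(replicate(4, sample(1:10, size = 10) / 55))  # 4 distributions in rows

D <- pairwise_euclidean(A, B)
dim(D)  # 3 4
```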

Examples

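  # note: 2020-08-20 is evaluated arithmetically, so the seed is 1992
  set.seed(2020-08-20)
  # two 10 x 10 matrices; each row is a permutation of 1:10 scaled to sum to 1
  M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
  M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
  # Euclidean distance between the rows of M1 and the rows of M2
  result <- dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)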