Introduction to the philentropy Package

Comparison is a fundamental method of scientific research, leading to more general insights about the processes that generate similarity or dissimilarity. In statistical terms, comparisons between probability functions are performed to infer connections, correlations, or relationships between samples. The philentropy package implements optimized distance and similarity measures for comparing probability functions. These comparisons between probability functions have their foundations in a broad range of scientific disciplines, from mathematics to ecology. The aim of this package is to provide a base framework for clustering, classification, statistical inference, goodness-of-fit, non-parametric statistics, information theory, and machine learning tasks that are based on comparing univariate or multivariate probability functions.

Applying the method of comparison in statistics often means computing distances between probability functions. In this context, Sung-Hyuk Cha (2007) provides a clear definition of distance:

From the scientific and mathematical point of view, distance is defined as a quantitative degree of how far apart two objects are.

Hence, quantifying the distance between two objects requires an assumption about the particular space in which these objects live. For the Euclidean distance, for example, this means comparing objects (coordinates) in Euclidean space (e.g. a coordinate system), while other distance measures may require different spaces to allow a sensible and appropriate quantification of distances between objects (e.g. probability space). This aspect of quantifying how far apart two objects are within a defined space motivates the existence of diverse distance measures. As a result, it is the domain expert's responsibility to decide in which space their model or experimental data is best represented and which distance metric then maximizes the usefulness of object comparison within this space.

Cha’s comprehensive review of distance/similarity measures motivated me to implement all these measures to better understand their comparative nature. As Cha states:

The choice of distance/similarity measures depends on the measurement type or representation of objects.

As a result, the philentropy package implements functions that are part of the following topics:

  • Distance Measures
  • Information Theory
  • Correlation Analyses

Personally, I hope that some of these functions are helpful to the scientific community.

Distance Measures

Here, the Distance Measures vignette introduces how to work with the main function distance(), which implements the 46 distance measures presented in Cha’s review.

Furthermore, for each distance/similarity measure, a short description of usage and performance is presented.
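
For orientation, here is a minimal usage sketch (assuming the package is installed): distance() expects the probability vectors as rows of a matrix or data frame, and the method argument selects one of the implemented measures (method names as listed by getDistMethods()).

```r
library(philentropy)

# two discrete probability vectors (each sums to 1)
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

# distributions are passed as rows; method selects the measure
distance(rbind(P, Q), method = "euclidean")
distance(rbind(P, Q), method = "jensen-shannon")
```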

The following probability distance/similarity measures will be described in detail:

Distance and Similarity Measures

$L_p$ Minkowski Family

  • Euclidean : $d = \sqrt{\sum_{i = 1}^N | P_i - Q_i |^2}$
  • Manhattan : $d = \sum_{i = 1}^N | P_i - Q_i |$
  • Minkowski : $d = ( \sum_{i = 1}^N | P_i - Q_i |^p)^{1/p}$
  • Chebyshev : $d = \max_i | P_i - Q_i |$
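
As a quick check of these definitions, the following base R sketch computes each member of the family by hand for two example vectors; the commented call at the end shows how the same values would be obtained with distance() (the method name "minkowski" and the p argument are given as I recall them and should be checked against getDistMethods()).

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)
p <- 3  # order of the Minkowski distance

sqrt(sum(abs(P - Q)^2))      # Euclidean
sum(abs(P - Q))              # Manhattan
sum(abs(P - Q)^p)^(1 / p)    # Minkowski with p = 3
max(abs(P - Q))              # Chebyshev

# should agree with, e.g.:
# philentropy::distance(rbind(P, Q), method = "minkowski", p = 3)
```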

$L_1$ Family

  • Sorensen : $d = \frac{\sum_{i = 1}^N | P_i - Q_i |}{\sum_{i = 1}^N (P_i + Q_i)}$
  • Gower : $d = \frac{1}{N} \cdot \sum_{i = 1}^N | P_i - Q_i |$, where $N$ is the total number of elements $i$ in $P$ and $Q$
  • Soergel : $d = \frac{\sum_{i = 1}^N | P_i - Q_i |}{\sum_{i = 1}^N \max(P_i , Q_i)}$
  • Kulczynski d : $d = \frac{\sum_{i = 1}^N | P_i - Q_i |}{\sum_{i = 1}^N \min(P_i , Q_i)}$
  • Canberra : $d = \sum_{i = 1}^N \frac{| P_i - Q_i |}{P_i + Q_i}$
  • Lorentzian : $d = \sum_{i = 1}^N \ln(1 + | P_i - Q_i |)$
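
These formulas translate directly into base R; the following sketch is only an illustration of the definitions above, evaluated for two example vectors.

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

sum(abs(P - Q)) / sum(P + Q)         # Sorensen
mean(abs(P - Q))                     # Gower: 1/N * sum |P_i - Q_i|
sum(abs(P - Q)) / sum(pmax(P, Q))    # Soergel
sum(abs(P - Q)) / sum(pmin(P, Q))    # Kulczynski d
sum(abs(P - Q) / (P + Q))            # Canberra
sum(log(1 + abs(P - Q)))             # Lorentzian
```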

Intersection Family

  • Intersection : $s = \sum_{i = 1}^N \min(P_i , Q_i)$
  • Non-Intersection : $d = 1 - \sum_{i = 1}^N \min(P_i , Q_i)$
  • Wave Hedges : $d = \sum_{i = 1}^N \frac{| P_i - Q_i |}{\max(P_i , Q_i)}$
  • Czekanowski : $d = \frac{\sum_{i = 1}^N | P_i - Q_i |}{\sum_{i = 1}^N | P_i + Q_i |}$
  • Motyka : $d = \frac{\sum_{i = 1}^N \min(P_i , Q_i)}{\sum_{i = 1}^N (P_i + Q_i)}$
  • Kulczynski s : $d = \frac{\sum_{i = 1}^N \min(P_i , Q_i)}{\sum_{i = 1}^N | P_i - Q_i |}$
  • Tanimoto : $d = \frac{\sum_{i = 1}^N (\max(P_i , Q_i) - \min(P_i , Q_i))}{\sum_{i = 1}^N \max(P_i , Q_i)}$ ; equivalent to Soergel
  • Ruzicka : $s = \frac{\sum_{i = 1}^N \min(P_i , Q_i)}{\sum_{i = 1}^N \max(P_i , Q_i)}$ ; equivalent to 1 - Tanimoto = 1 - Soergel
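
The equivalences noted above (Tanimoto = Soergel, Ruzicka = 1 - Tanimoto) follow from $| P_i - Q_i | = \max(P_i, Q_i) - \min(P_i, Q_i)$ and can be verified numerically with a few lines of base R:

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

intersection <- sum(pmin(P, Q))                              # similarity
soergel      <- sum(abs(P - Q)) / sum(pmax(P, Q))
tanimoto     <- sum(pmax(P, Q) - pmin(P, Q)) / sum(pmax(P, Q))
ruzicka      <- sum(pmin(P, Q)) / sum(pmax(P, Q))

all.equal(tanimoto, soergel)        # TRUE: Tanimoto equals Soergel
all.equal(ruzicka, 1 - tanimoto)    # TRUE: Ruzicka = 1 - Tanimoto
```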

Inner Product Family

  • Inner Product : $s = \sum_{i = 1}^N P_i \cdot Q_i$
  • Harmonic mean : $s = 2 \cdot \sum_{i = 1}^N \frac{P_i \cdot Q_i}{P_i + Q_i}$
  • Cosine : $s = \frac{\sum_{i = 1}^N P_i \cdot Q_i}{\sqrt{\sum_{i = 1}^N P_i^2} \cdot \sqrt{\sum_{i = 1}^N Q_i^2}}$
  • Kumar-Hassebrook (PCE) : $s = \frac{\sum_{i = 1}^N P_i \cdot Q_i}{\sum_{i = 1}^N P_i^2 + \sum_{i = 1}^N Q_i^2 - \sum_{i = 1}^N P_i \cdot Q_i}$
  • Jaccard : $d = 1 - \frac{\sum_{i = 1}^N P_i \cdot Q_i}{\sum_{i = 1}^N P_i^2 + \sum_{i = 1}^N Q_i^2 - \sum_{i = 1}^N P_i \cdot Q_i}$ ; equivalent to 1 - Kumar-Hassebrook
  • Dice : $d = \frac{\sum_{i = 1}^N (P_i - Q_i)^2}{\sum_{i = 1}^N P_i^2 + \sum_{i = 1}^N Q_i^2}$
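
The stated relation between Jaccard and Kumar-Hassebrook (Jaccard = 1 - PCE) is immediate from the definitions; a short base R illustration:

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

pce     <- sum(P * Q) / (sum(P^2) + sum(Q^2) - sum(P * Q))  # Kumar-Hassebrook
jaccard <- 1 - pce                                          # Jaccard distance
cosine  <- sum(P * Q) / (sqrt(sum(P^2)) * sqrt(sum(Q^2)))   # Cosine similarity
dice    <- sum((P - Q)^2) / (sum(P^2) + sum(Q^2))           # Dice distance

c(pce = pce, jaccard = jaccard, cosine = cosine, dice = dice)
```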

Squared-chord Family

  • Fidelity : $s = \sum_{i = 1}^N \sqrt{P_i \cdot Q_i}$
  • Bhattacharyya : $d = - \ln \sum_{i = 1}^N \sqrt{P_i \cdot Q_i}$
  • Hellinger : $d = 2 \cdot \sqrt{1 - \sum_{i = 1}^N \sqrt{P_i \cdot Q_i}}$
  • Matusita : $d = \sqrt{2 - 2 \cdot \sum_{i = 1}^N \sqrt{P_i \cdot Q_i}}$
  • Squared-chord : $d = \sum_{i = 1}^N ( \sqrt{P_i} - \sqrt{Q_i} )^2$
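
For probability vectors (where $\sum_i P_i = \sum_i Q_i = 1$) every member of this family is a simple transformation of the fidelity: for instance, the squared-chord distance reduces to $2 - 2 \cdot s$ and Matusita is its square root. A base R check:

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

fidelity      <- sum(sqrt(P * Q))
bhattacharyya <- -log(fidelity)
hellinger     <- 2 * sqrt(1 - fidelity)
matusita      <- sqrt(2 - 2 * fidelity)
squared_chord <- sum((sqrt(P) - sqrt(Q))^2)

all.equal(squared_chord, 2 - 2 * fidelity)   # TRUE for probability vectors
all.equal(matusita, sqrt(squared_chord))     # TRUE
```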

Squared $L_2$ Family ($X^2$ Squared Family)

  • Squared Euclidean : $d = \sum_{i = 1}^N ( P_i - Q_i )^2$
  • Pearson $X^2$ : $d = \sum_{i = 1}^N \frac{(P_i - Q_i )^2}{Q_i}$
  • Neyman $X^2$ : $d = \sum_{i = 1}^N \frac{(P_i - Q_i )^2}{P_i}$
  • Squared $X^2$ : $d = \sum_{i = 1}^N \frac{(P_i - Q_i )^2}{P_i + Q_i}$
  • Probabilistic Symmetric $X^2$ : $d = 2 \cdot \sum_{i = 1}^N \frac{(P_i - Q_i )^2}{P_i + Q_i}$
  • Divergence : $d = 2 \cdot \sum_{i = 1}^N \frac{(P_i - Q_i )^2}{(P_i + Q_i)^2}$
  • Clark : $d = \sqrt{\sum_{i = 1}^N \left( \frac{| P_i - Q_i |}{P_i + Q_i} \right)^2}$
  • Additive Symmetric $X^2$ : $d = \sum_{i = 1}^N \frac{(P_i - Q_i)^2 \cdot (P_i + Q_i)}{P_i \cdot Q_i}$
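
A base R sketch of this family for two example vectors. Note that Pearson and Neyman $X^2$ differ only in whether $Q_i$ or $P_i$ appears in the denominator, which makes both of them asymmetric in $P$ and $Q$:

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

sum((P - Q)^2)                       # Squared Euclidean
pearson <- sum((P - Q)^2 / Q)        # Pearson X^2 (divides by Q_i)
neyman  <- sum((P - Q)^2 / P)        # Neyman X^2 (divides by P_i)
sum((P - Q)^2 / (P + Q))             # Squared X^2
sqrt(sum((abs(P - Q) / (P + Q))^2))  # Clark

c(pearson = pearson, neyman = neyman)  # not equal: both are asymmetric
```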

Shannon’s Entropy Family

  • Kullback-Leibler : $d = \sum_{i = 1}^N P_i \cdot \log(\frac{P_i}{Q_i})$
  • Jeffreys : $d = \sum_{i = 1}^N (P_i - Q_i) \cdot \log(\frac{P_i}{Q_i})$
  • K divergence : $d = \sum_{i = 1}^N P_i \cdot \log(\frac{2 \cdot P_i}{P_i + Q_i})$
  • Topsoe : $d = \sum_{i = 1}^N \left( P_i \cdot \log(\frac{2 \cdot P_i}{P_i + Q_i}) + Q_i \cdot \log(\frac{2 \cdot Q_i}{P_i + Q_i}) \right)$
  • Jensen-Shannon : $d = 0.5 \cdot \left( \sum_{i = 1}^N P_i \cdot \log(\frac{2 \cdot P_i}{P_i + Q_i}) + \sum_{i = 1}^N Q_i \cdot \log(\frac{2 \cdot Q_i}{P_i + Q_i}) \right)$
  • Jensen difference : $d = \sum_{i = 1}^N \left( \frac{P_i \cdot \log(P_i) + Q_i \cdot \log(Q_i)}{2} - \frac{P_i + Q_i}{2} \cdot \log(\frac{P_i + Q_i}{2}) \right)$
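
Several members of this family are closely related: Jeffreys is the symmetrised Kullback-Leibler divergence, i.e. $KL(P||Q) + KL(Q||P)$, and the Jensen-Shannon divergence is half the Topsoe distance. A base R check using the natural logarithm (in philentropy the unit argument selects the log base):

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

kl_pq    <- sum(P * log(P / Q))         # Kullback-Leibler KL(P || Q)
kl_qp    <- sum(Q * log(Q / P))         # KL(Q || P)
jeffreys <- sum((P - Q) * log(P / Q))
topsoe   <- sum(P * log(2 * P / (P + Q)) + Q * log(2 * Q / (P + Q)))
jsd      <- 0.5 * topsoe                # Jensen-Shannon divergence

all.equal(jeffreys, kl_pq + kl_qp)      # TRUE: Jeffreys = KL(P||Q) + KL(Q||P)
```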

Combinations

  • Taneja : $d = \sum_{i = 1}^N \frac{P_i + Q_i}{2} \cdot \log\left( \frac{P_i + Q_i}{2 \cdot \sqrt{P_i \cdot Q_i}} \right)$
  • Kumar-Johnson : $d = \sum_{i = 1}^N \frac{(P_i^2 - Q_i^2)^2}{2 \cdot (P_i \cdot Q_i)^{\frac{3}{2}}}$
  • Avg($L_1$, $L_n$) : $d = \frac{\sum_{i = 1}^N | P_i - Q_i | + \max_i | P_i - Q_i |}{2}$
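
And a base R translation of the combined measures, again purely as an illustration of the formulas above:

```r
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)

taneja        <- sum((P + Q) / 2 * log((P + Q) / (2 * sqrt(P * Q))))
kumar_johnson <- sum((P^2 - Q^2)^2 / (2 * (P * Q)^(3 / 2)))
avg_l1_linf   <- (sum(abs(P - Q)) + max(abs(P - Q))) / 2

c(taneja = taneja, kumar_johnson = kumar_johnson, avg = avg_l1_linf)
```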

Note: $d$ refers to distance measures, whereas $s$ denotes similarity measures.
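
To list the names of all implemented methods, and to compute every measure between two probability vectors in one call, philentropy provides getDistMethods() and dist.diversity(); the argument defaults below are given from memory and should be checked against the package documentation.

```r
library(philentropy)

# names of all implemented distance/similarity methods
getDistMethods()

# compute all available measures between two probability vectors at once
P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)
dist.diversity(rbind(P, Q), p = 2, unit = "log2")
```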

Information Theory

Modern methods for comparing distributions have a strong information-theoretic background. This fact motivated me to name this package philentropy, and as a result several well-established information theory measures are (and will continue to be) implemented in this package.

  • Shannon’s Entropy H(X) : $H(X) = -\sum\limits_{i=1}^n P(x_i) \cdot \log_b(P(x_i))$
  • Shannon’s Joint-Entropy H(X,Y) : $H(X,Y) = -\sum\limits_{i=1}^n \sum\limits_{j=1}^m P(x_i, y_j) \cdot \log_b(P(x_i, y_j))$
  • Shannon’s Conditional-Entropy H(Y|X) : $H(Y|X) = \sum\limits_{i=1}^n \sum\limits_{j=1}^m P(x_i, y_j) \cdot \log_b\left( \frac{P(x_i)}{P(x_i, y_j)} \right)$
  • Mutual Information I(X,Y) : $MI(X,Y) = \sum\limits_{i=1}^n \sum\limits_{j=1}^m P(x_i, y_j) \cdot \log_b\left( \frac{P(x_i, y_j)}{P(x_i) \cdot P(y_j)} \right)$
  • Kullback-Leibler Divergence : $KL(P || Q) = \sum\limits_{i=1}^n P(p_i) \cdot \log_2\left( \frac{P(p_i)}{P(q_i)} \right) = H(P, Q) - H(P)$
  • Jensen-Shannon Divergence : $JSD(P || Q) = 0.5 \cdot (KL(P || R) + KL(Q || R))$, where $R = \frac{1}{2} (P + Q)$
  • Generalized Jensen-Shannon Divergence : $gJSD_{\pi_1,...,\pi_n}(P_1, ..., P_n) = H(\sum_{i = 1}^n \pi_i \cdot P_i) - \sum_{i = 1}^n \pi_i \cdot H(P_i)$
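
A minimal usage sketch of some of the corresponding philentropy functions: H() takes a single probability vector, while KL() and JSD() expect the two distributions as rows of a matrix (gJSD() generalizes this to more than two distributions); the unit argument sets the logarithm base, and its exact default may differ from what is shown here.

```r
library(philentropy)

P <- c(0.1, 0.2, 0.3, 0.4)
Q <- c(0.25, 0.25, 0.25, 0.25)
x <- rbind(P, Q)

H(P, unit = "log2")    # Shannon entropy of P (in bits)
KL(x, unit = "log2")   # Kullback-Leibler divergence KL(P || Q)
JSD(x, unit = "log2")  # Jensen-Shannon divergence
```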