Latest Tweets

 

A prefix metric on Strings

I’ve been investigating how clustering can be applied to sets of character strings in which common prefixes indicate similarity, and I came up with the following metric (probably known already, but no amount of online searching gave any useful results).

A metric over infinite sequences

We start buy considering the set \(S\) of sequences over some alphabet \(\Sigma\). We write \(s=\{s_i\}_{i\geq0}\) with \(s_i\in\Sigma\) for \(s\in\Sigma\).

For \(s,t\in\Sigma\) with \(s\neq t\) we write \(\widehat{st}\) for the length of the longest common initial subsequence, ie. \(\widehat{st}\) is the greatest \(n\) such that \(s_k=t_k\forall{k < n}\). Then \(d:S^2\rightarrow\mathbb{R}\) defines a metric where

\[ d(s,t) = \left\{ \begin{array}{ll} \frac 1 {|\Sigma|^{\widehat{st}}} & \mbox{if } s \neq t \\ 0 & \mbox{if } s = t \end{array} \right. \]

By its definition, necessarily \(d(s,t)\geq0\) with \(d(s,t)=0\) iff \(s=t\). Furthermore, \(d(s,t)\) is symmetric as a consequence of the symmetry of \(\widehat{st}\). It only remains to demonstrate that \(d\) satisfies the triangle inequality so that, for \(s,t,u \in S\)

\[ d(s,u) \leq d(s,t) + d(t,u) \]

If any of the sequences are equal, the inequality is trivial so we only consider mutually distinct sequences. Hence \(|\Sigma| > 1\) since otherwise there would be no distinct sequences. Thus

\[ \begin{aligned} d(s,u) & \leq d(s,t) + d(t,u) \\ \frac 1 {|\Sigma|^{\widehat{su}}} & \leq \frac 1 {|\Sigma|^{\widehat{st}}} + \frac 1 {|\Sigma|^{\widehat{tu}}} \\ & \leq \frac 1 {|\Sigma|^{ \min(\widehat{st}, \widehat{tu}) }} \end{aligned} \]

so it suffices to show \(\widehat{su} \geq \min(\widehat{st},\widehat{tu})\). But this is clearly true since \(\widehat{su} < \widehat{uv} \Rightarrow \widehat{tu} < \widehat{su}\).

Extending it to finite sequences

The metric is easily extended to finite sequences by extending \(\Sigma\) with a new null symbol \(\diamond \notin \Sigma\). Let \(\Sigma’ = \Sigma \cup \{\diamond\}\) then we can map any finite sequence \(w\) over \(\Sigma\) to a sequence \(w \in S’\) by

\[ w’_i = \left\{ \begin{array}{ll} w_i & \mbox{if } i \leq \operatorname{len}(w) \\ \diamond & \mbox{if } i > \operatorname{len}(w) \end{array} \right. \]

This map is clearly injective so we can extend \(d\) to a metric over finite sequences of \(\Sigma\).

Practical properties of the metric

So what properties does this metric have in the context of its practical application to clustering strings of characters?

  • The distance between two strings is always less than or equal to one.
  • The distance between from the empty string to any non-empty string is always one.
  • The distance between strings is inversely proportional to the length of their common prefix.
  • This distance is normalized by size of the character set, so for example, the distance between two string of 8 bit ASCII characters approximates (from above) their distance when represented as a two binary sequences.

Further work

Lots of further investigation is needed to see how useful this metric will be in practice. There are lots of avenues too explore too:

  • Can we create structures from this metric that can increase its applicability to existing algorithms?
  • Can we incorporate any additional knowledge we may have of ‘character similarity’ to increase its precision?
  • What is the best way of numerically representing the computed metric values?