FreeBSD Manual Pages

home | help
DISTANCE(3)		    Library Functions Manual		   DISTANCE(3)

NAME
       distance	-- compute the distance	between	two pieces of data

SYNOPSIS
       #include	<distance.h>

       int
       levenshtein_d(const   void   *d1,   size_t   len1,   const   void  *d2,
	   size_t len2);

       int
       damerau_d(const void *d1, size_t	len1, const void *d2, size_t len2);

       double
       needleman_wunsch_d(const	 void  *d1,  size_t  len1,  const  void	  *d2,
	   size_t len2,	matrix *m);

       int
       hamming_d(const void *d1, size_t	len1, const void *d2, size_t len2);

       void
       bloom_create(const   void   *data,  size_t  len,	 const	void  *digest,
	   size_t digest_len);

       double
       bloom_d(const void *digest1, const void *digest2, size_t	digest_len);

       float
       jaccard_d(const void *d1, size_t	len1, const void *d2, size_t len2);

       float
       minkowski_d(const void *d1, size_t len1,	const void *d2,	 size_t	 len2,
	   int power);

       MANHATTAN_D(const void *d1, size_t len1,	const void *d2,	size_t len2);

       EUDCLID_D(const void *d1, size_t	len1, const void *d2, size_t len2);

DESCRIPTION
       The  distance library is	used to	compare	pieces of data for similarity.
       Specifically, it	contains a number of methods to	find  the  "edit  dis-
       tance" between inputs, or the number of differences between them. These
       differences are calculated using	various	mechanisms.

       Inputs  are  operated  on  pairwise  as *s and *t and the edit distance
       value is	returned.

LEVENSHTEIN DISTANCES
       Levenshtein distance (LD) is a measure of the  similarity  between  two
       inputs,	which we will refer to as the source s and the target input t.
       The distance is the number of deletions,	insertions,  or	 substitutions
       required	to transform s into t.	For example,

       If  s is	"test" and t is	"test",	then LD(s,t) = 0, because no transfor-
       mations are needed. The strings are already identical.

       If s is "test" and t is "tent", then LD(s,t) = 1, because one substitu-
       tion (change "s"	to "n")	is sufficient to transform s into t.

       The greater the Levenshtein distance, the  more	different  the	inputs
       are.

       Levenshtein distance is named after the Russian scientist Vladimir Lev-
       enshtein, who devised the algorithm in 1965. If you can't spell or pro-
       nounce Levenshtein, the metric is also sometimes	called edit distance.

DAMERAU	DISTANCE
       The  Damerau  distance  is almost identical to the Levenshtein distance
       but can tolerate	adjascent characters that have been swapped, a	common
       spelling	 mistake or typographic	error. For example, if s is "abcd" and
       t is "acbd", then DD(s,t) is 0 because of the transposition of "b"  and
       "c".  Other costs found in the Levenshtein distance are identical.

NEEDLEMAN-WUNSCH DISTANCE
       The  Levenshtein	distance algorithm assumes that	the cost of all	inser-
       tions or	conversions is equal. However, in some scenarios this may  not
       be desirable and	may mask the acceptable	distances between inputs.

       A modified Levenshtein distance algorithm, also known as	the Needleman-
       Wunsch  distance	 algorithm,  accepts  a	cost matrix as an extra	input.
       This matrix structure contains two cost	matricies  for	each  pair  of
       characters  to convert from and to. The costs of	inserting this charac-
       ter and converting between characters is	listed	in  this  matrix.  The
       structure of the	matrix is as follows:

       struct matrix {
	       char    input[255];
	       float   conversion[255][255];
	       float   insertion[255][255];
       };

       These values should be assigned before the distance algorithm is	used.

HAMMING	DISTANCES
       The  Hamming  distance H	is defined only	for inputs of the same length.
       For two inputs s	and t, H(s, t) is the number of	places	in  which  the
       two string differ, i.e.,	have different characters.

BLOOM FILTER DISTANCES
       A  Bloom	Filter (BF) is a method	of encoding a variable length piece of
       input data to a fixed length signature.	The basic idea is to  map  at-
       tributes	 of the	data to	specific positions in a	binary vector.	First,
       the result vector is initialized	to all 0's.  Next, a  series  of  hash
       functions are applied to	the data to map	some part of the input data to
       a  position  in the result vector.  Thus, if a is the input data, and m
       is length of the	result vector, then hash(a) produces a	value  in  the
       range 0...m-1.

       A similar method	encoding data into a fixed length result vector	can be
       accomplished  using  a crytographic digest.  A digest function like MD5
       is designed such	that the difference between any	two digests  does  not
       in any way correlate to similar differences in input messages.  This is
       an  important  property	in certain applications	however	it may also be
       desirable that the difference between digests  corresponds  to  differ-
       ences  in  the  input messages.	Because	a Bloom	filter is derived from
       attributes of the input message,	the difference between any  two	 Bloom
       Filter digests corresponds to differences in the	input data.

       This Bloom filter implementation	is meant take any byte sequence	as in-
       put  so	a  generic hash	function is used.  The input stream is divided
       into chunks and hashed using a 32 bit adler modulo m.  In order to  get
       the  most  complete coverage, the chunk N+1 includes all	but one	of the
       bytes in	chunk N	and one	byte not included in chunk N.  (i.e. the chunk
       window slides to	right one byte per hash.)

       The difference between any two Bloom filter digests is computed by tak-
       ing the logical and of the two digests and computing the	number of  bit
       of similarity.  The similar count is then divided by maximum of the to-
       tal  number  of	bit  in	either digest.	This acts to normalize the bit
       count for large and small signatures.  The similarity  is  then	turned
       into a distance by take 1 - normalized bit count.

JACCARD	DISTANCE
       The  Jaccard similarity between two strings is the ratio	of the size of
       their intersection to the size of their union. This distance is 1 minus
       this similarity sscore. Indentical strings have a Jacard	distance of 0,
       completely dissimilar strings have a score of 1.	The two	inputs must be
       of the same size.

MINKOWSKI DISTANCE
       The Minkowski distance between two strings is  the  geometric  distance
       between	two  inputs  and  uses a variable scaling factor, power.  When
       this value is 1 the Minkowski distance is equal to the  Manhattan  dis-
       tance.  When  power  is	2 it yields the	Euclidian distance between two
       strings.	When this value	grows, the value  of  the  Minkowski  distance
       tends towards a Chebyshev result. Therefore by inceasing	power, one can
       place  more  numerical  value on	the largest distance (in terms of ele-
       ments in	the two	vectors	in question).  Two  macros  are	 provided  for
       these   modifications   to   the	 Minkowski  distance,  EUCLID_D()  and
       MANHATTAN_D().

       A disadvantage of the Minkowski method is that if one  element  in  the
       vectors has a wider range than the other	elements then that large range
       may 'dilute' the	distances of the small-range elements.

RETURN VALUES
       Each fuction returns the	calculated distance between the	two inputs.  A
       distance	 of 0 indicates	that the strings are the same. A distance less
       than 0 indicates	that an	error has occurred. For	the Hamming  and  Jac-
       card distances, this is because the two inputs are of different sizes.

BUGS
       The  minkowski_d()  implementation  is  possibly	 broken,  as  are  the
       MANHATTAN_D() and EUCLID_D() macros.

AUTHOR
       Lorenzo Seidenari wrote the Levenshtein distance	implementation.

       Jose Nazario <jose@monkey.org> wrote the	implementations	 of  the  Dam-
       erau, Hamming and Jaccard distance algorithms along with	the Needleman-
       Wunsch Levenshtein distance implementation here.

       Evan  Cooke  adapted the	adler32	routine	from Jean-loup Gailly and Mark
       Adler used in the bloom_d() function, which he also wrote.

SEE ALSO
       V. I. Levenshtein,  "Binary  codes  capable  of	correcting  deletions,
       insertions and reversals", Doklady Akademii Nauk	SSSR, 4, 163, 845-848,
       1965.

       Fred  J.	Damerau, "A technique for computer detection and correction of
       spelling	errors", Communications	of the ACM, 3, 7, 171-176, March 1964.

       S. B. Needleman and C. D. Wunsch, "A general method applicable  to  the
       search  for  similarities  in the amino acid sequence of	two proteins",
       Jrnl Molec Biol,	48, 443-453, 1970.

       Burton Bloom, "Space/time trade-offs  in	 hash  coding  with  allowable
       errors",	Communications of the ACM, 7, 13, 422-426, July	1970.

       R.  W.  Hamming,	 "Error	 Detecting  and	 Error Correcting Codes", Bell
       System Tech Journal, 9, 147-160,	April 1950.

       P. Jaccard, "Etude comparative de  la  distribution  florale  dans  une
       portion	des Alpes et du	Jura", Bull Soc	Vaudoise Sci Nat, 44, 223-270,
       1908.

FreeBSD	ports 15.0	       February	14, 2004		   DISTANCE(3)
Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=distance&sektion=3&manpath=FreeBSD+Ports+15.0>
home | help
Header And Logo

Peripheral Links

Site Navigation

FreeBSD Manual Pages

Header And Logo

Peripheral Links

Search

Site Navigation

FreeBSD Manual Pages