Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
COMPARATOR(1)			  Open Source			 COMPARATOR(1)

NAME
       comparator, filterator -	fast location of common	code in	large source
       trees

SYNOPSIS
       comparator [-c] [-C] [-d	dir] [-h] [-m minsize] [-n] [-N
       normalization-spec] [-o file] [-s shredsize] [-v] [-w] [-x]
       source-tree-path...

       filterator [-d dir] [-m minsize]

       comparator.py

DESCRIPTION
       comparator is a program for quickly finding common sections in two or
       more source-code	trees. It is named after a now-obsolete	astronomical
       instrument called a blink comparator that was used to find moving
       objects in starfield photos. These reports tend to be noisy; filterator
       is a postprocessor for comparator reports that can apply	significance
       filtering. comparator.py	is a Python module for interpreting comparator
       reports;	filterator uses	it.

   How comparator works
       comparator works	by first chopping the specified	trees into overlapping
       shreds (by default 3 lines long)	and computing a	hash of	each shred.
       The resulting list of shreds is examined	and all	unique hashes are
       thrown out. Then	comparator generates a report listing all cliques of
       lines with duplicate hashes, with overlapping ranges merged. The	output
       format can (not by coincidence) be stepped through with the next-error
       function	of Emacs.

       A consequence of	the method is that comparator will find	common code
       segments	wherever they are. It is insensitive to	rearrangements of the
       source trees. With appropriate options, code can	be normalized at input
       to ignore differences in	whitespace and (in C code) brace placement. An
       option to strip comments	also exists.

       The program is capable of generating a hash list	for a source code tree
       which can be used in place of the tree itself. Thus, it is possible to
       do comparisons with source trees	without	having access to the actual
       source code, providing someone who has access is	willing	to ship	you a
       hash list. These	hash lists are called 'SCF files' and made with	the
       extension .scf; the name	stands for Source Comparison Format.

       In its normal mode of operation,	comparator expects two or more
       command-line arguments which are	the root directories of	the
       source-code trees to be compared, and generates a report	to standard
       output. Progress	messages are emitted to	standard error.	Some arguments
       may be the names	of SCF files.

       When given a single path	argument which is a tree, the program
       generates an SCF	hash list to standard output.

       When the	-c option is specified,	the program generates a	SCF file from
       each tree specified on the command line.	The name of the	resulting file
       is the argument name with .scf appended.

       The -o option directs output to a specified file. This may be used with
       -c to override the normal .scf convention, or simply as an alternative
       to output redirection when generating a final report. One advantage of
       -o is that the program will not create the output file until it is
       ready to	write; thus, unlike shell redirects, it	will not leave an
       empty file lying	around if it aborts early.

       The -d option changes current directory to the specified	tree before
       walking down each argument path to generate hashes. This	will be	useful
       if you want to generate a report	for trees in a directory other than
       your current, but want the filenames in the report to be	relative.

       The -s option changes the shred size. Smaller shred sizes are more
       sensitive to small code duplications, but produce correspondingly
       noisier output. Larger ones will	suppress both noise and	small
       similarities.

       The -m option sets the minimum-sized span of lines that will be output.
       By default this is zero;	setting	it to a	value higher than the shred
       size will eliminate a lot of junk from the report. These	are two
       separate	parameters because shred size controls how likely common
       segments	are to not get recognized because of unshared things near
       them, while minimum span	size defines the smallest features we want to
       see in the output report.

       Normally, comparator performs significance filtering before emitting a
       span into the common-segment report. The	-n option suppresses this,
       emitting	all common spans. See the discussion of	significance filtering
       below.

       The -v option enables progress and timing messages to standard error.

       The -x option enables some debugging output. It is for developers; if
       you need	to understand it, read the code.

       comparator uses a heuristic for figuring	out which files	in a tree are
       eligible	to be compared.	Files with .c and .h extensions	are always
       eligible. Files with .o or ~ extensions are always ignored. CVS,	RCS,
       SCCS, and .svn directories are ignored. Other files are checked to see
       if the percentage of printable characters near the front	of the file is
       above 90%.

   Normalization specifications
       The -N option passes in its argument as a normalization specification.
       A normalization specification consists of a comma-separated list	of
       tokens. The first token names a normalizer, e.g., a set of rules	for
       transforming code into a	canonical form.	Subsequent tokens (if present)
       are options to the normalizer.

   The line-oriented normalizer
       The default normalizer is named by the leading token line-oriented. It
       has the following options:

       remove-whitespace
	   All whitespace in the file (including blank lines) is ignored for
	   comparison purposes (line numbers in	the output report will
	   nevertheless	be correct).

       remove-braces
	   Remove { and	}. This	is recommended for comparing C code; it
	   prevents the	comparison from	being fooled by	differences in indent
	   style.

       remove-comments
	   Remove comments in C	files. All // comments are removed, but	only
	   "winged" /* comments	with the start and end of comment on the same
	   line	are ignored; also, block /* comments are retained, but the
	   beginning / and ending / are	ignored. In other files, all #
	   comments are	removed.

   Significance	filtering
       Normally, comparator tries to throw out common segments that are	just
       noise. A	segment	is considered noise if one of the following conditions
       holds:

	1. Its file has	a .c or	.h extension, and the segment consists
	   entirely of whitespace, punctuation,	C keywords, C include lines,
	   and other things that are not C identifiers,	function names,	or
	   user-defined	macros.

	2. The file has	a .sh extension	or a #!	first line directing shell
	   interpretation, and the segment consists entirely of	shell keywords
	   and punctuation.

       Significance filtering can be disabled with the -n option.

   filterator
       filterator is a postprocessor for the SCF-B output of comparator	that
       produces	an actual code listing.

       The -d option acts as it	does for comparator.

       The -m option sets the minimum size of overlap to be reported. Setting
       this to a slightly larger number	than the comparator shred size is a
       useful way to filter out	spurious matches; thus,	while comparator
       defaults	to a shred size	of 3, filterator defaults to a minimum size of
       5.

       The -f option takes a Python regexp as argument and throws out all
       cliques for which no associated filename	matches	the regexp. This might
       be useful if you	want to	see a common-segment report only for files
       with a particular extension such	as .h or .S.

       The -F option takes an argument which is	a file containing a list of
       filenames. Any clique that does not contain one of these	filenames is
       discarded.

       The -n option suppresses	code listing, so that an SCF-B output is
       produced	instead. You can use this to filter common-segment reports
       with -m,	-f, and	-F options of the tool and then	hand-filter them for
       junk before generating a	listing.

       Each filterator line beginning with % is	the filename from which	the
       following common	span of	lines is displayed. filterator displays	each
       span of lines, unnormalized, from the first file	in the clique of
       matches that it can find	and open.

       Occasionally the	filterator output looks	as though the code has failed
       to properly merge two or	more overlapping line ranges. When this
       happens,	the ranges have	different match	sets; they may overlap in one
       tree, but not in	others.	A clue to this is the number of	matches
       reported	for each range.

   comparator.py
       comparator.py is	a Python module	that parses and	interprets the format
       emitted by comparator. It can be	used to	write report generators. The
       filterator program is an	example; it is a trivial wrapper around
       comparator.py.

AUTHOR
       Eric S. Raymond esr@snark.thyrsus.com. Ronald Rivest suggested the
       custom RXOR hash	function used in versions 2.0 and up. See ESR's	home
       page at http://www.catb.org/~esr/ for updates and other resources.

LIMITATIONS
       comparator does not attempt to do semantic analysis and catch
       relatively trivial changes like renaming	of variables, etc. This	is
       because comparator is designed not as a tool to detect plagiarism of
       ideas (the subject of patent law), but as a tool	to detect copying of
       the expression of ideas (the subject of copyright law). Normalizing the
       code in excessively clever ways would trespass into the territory of
       ideas and tend to produce false positives.

       The heuristic for eligible files	can be fooled, though this is
       unlikely.

       The use of hashes means there is	a very small probability of
       false-positive matches between shreds; to be precise, with the default
       RXOR hash you can expect	the first false	positive after accumulating
       6,074,000,999 shreds. For even very large forests, such as collections
       of Linux	and Unix source	trees running into the tens of millions	of
       lines, you are very unlikely to ever see	this. There is no possibility
       of false	negatives.

       The design of this program accepts some size limits to get high
       performance in the typical case.	The RXOR hash function's performance
       will degrade, possibly leading to false positives, on shreds of 512 or
       more characters.	Also, comparison will handle a maximum of 65536	lines
       per file	(you will be warned if this limit is exceeded).	A full integer
       line-number range would increase	working-set size by 33%.

       You will	get spurious results if	you mix	.scf files with	different hash
       methods,	shred sizes or normalizations, or with settings	for these that
       conflict	with the command-line options you have in effect for shredding
       source-code trees. The program does some	sanity checking	but will not
       prevent you from	shooting yourself in the foot.

       The implementation is O(n log n)	in the size of the code	trees and
       O(n^2) in the number of common segments.	The code for finding duplicate
       hashes uses a brute-force approach that relies on being able to
       allocate	core for the entire hash list generated	during processing. The
       results are almost ridiculously fast on machines	with enough core (the
       author has observed speeds of up	to two million lines of	code compared
       per minute on a 1.8GHz Intel box).

       Comparing a directory to	itself will show no similarities. This is
       because the algorithm ignores matching shreds from within a single
       directory, and it only knows which directory a shred came from by
       relative	pathname.

comparator			  2026-02-28			 COMPARATOR(1)

Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=compararitor&sektion=1&manpath=FreeBSD+Ports+15.1.quarterly>

home | help