MAILCROSS(1)							  MAILCROSS(1)

NAME
       mailcross - a cross-validation simulator	for use	with dbacl.

SYNOPSIS

       mailcross command [ command_arguments ]

DESCRIPTION
       mailcross  automates  the  task of cross-validating email filtering and
       classification programs such as dbacl(1).  Given	a set  of  categorized
       documents,  mailcross initiates simulation runs to estimate the classi-
       fication	errors and thereby permits fine	tuning of  the	parameters  of
       the classifier.

       Cross-validation	 is a method which is widely used to compare the qual-
       ity of classification and learning  algorithms,	and  as	 such  permits
       rudimentary  comparisons	 between  those	 classifiers which make	use of
       dbacl(1)	and bayesol(1),	and other competing classifiers.

       The mechanics of	cross-validation are as	follows: A set of  pre-classi-
       fied email messages is first split into a number	of roughly equal-sized
       subsets.	 For each subset, the filter (by default, dbacl(1)) is used to
       classify	each message within this subset, based upon having learned the
       categories from the remaining subsets. The resulting classification er-
       rors are	then averaged over all subsets.
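
       The rotation described above can be sketched in plain shell. This is
       only an illustration of which subsets are learned from and which one
       is classified in each round. It is not the code mailcross runs, and
       the function name cv_rounds is hypothetical.

```shell
#!/bin/sh
# Illustration only: for each of K subsets, print which subsets would be
# used for learning and which one would be classified.
# cv_rounds is a hypothetical helper, not part of mailcross.
cv_rounds() {
    k=$1
    i=1
    while [ "$i" -le "$k" ]; do
        learn=""
        j=1
        while [ "$j" -le "$k" ]; do
            # every subset except the held-out one is used for learning
            if [ "$j" -ne "$i" ]; then
                learn="$learn $j"
            fi
            j=$((j + 1))
        done
        echo "round $i: learn from${learn}; classify subset $i"
        i=$((i + 1))
    done
}

cv_rounds 3
```

       With three subsets, every message is classified exactly once, and
       always by categories that were learned without it.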

       The results obtained by cross-validation essentially do not depend upon
       the ordering of the sample emails. Other methods (see mailtoe(1) and
       mailfoot(1)) attempt to capture the behaviour of classification errors
       over time.

       mailcross uses the environment variables MAILCROSS_LEARNER and
       MAILCROSS_FILTER when executing, which permits the cross-validation of
       arbitrary filters, provided these satisfy the compatibility conditions
       stated in the ENVIRONMENT section below.

       For convenience,	mailcross implements a testsuite framework with	prede-
       fined  wrappers	for  several open source classifiers. This permits the
       direct comparison of dbacl(1) with competing classifiers	 on  the  same
       set of email samples. See the USAGE section below.

       During  preparation,  mailcross builds a	subdirectory named mailcross.d
       in the current working directory.  All  needed  calculations  are  per-
       formed inside this subdirectory.

EXIT STATUS
       mailcross returns 0 on success, 1 if a problem occurred.

COMMANDS
       prepare size
	      Prepares a subdirectory named mailcross.d	in the current working
	      directory,  and  populates  it with empty	subdirectories for ex-
	      actly size subsets.

       add category [FILE]...
	      Takes a set of emails from either	FILE if	specified,  or	STDIN,
	      and  associates  them with category.  All	emails are distributed
	      randomly into the	subdirectories of mailcross.d for  later  use.
	      For  each	 category, this	command	can be repeated	several	times,
	      but should be executed at	least once.

       clean  Deletes the directory mailcross.d	and all	its contents.

       learn  For every previously built subset of email messages, pre-learns
	      all the categories based on the contents of all the subsets
	      except this one. The command_arguments are passed to
	      MAILCROSS_LEARNER.

       run    For  every  previously  built subset of email messages, performs
	      the classification based upon the	pre-learned categories associ-
	      ated with	all but	this subset.  The command_arguments are	passed
	      to MAILCROSS_FILTER.

       summarize
	      Prints statistics	for the	latest cross-validation	run.

       review truecat predcat
	      Scans the	last run statistics  and  extracts  all	 the  messages
	      which  belong  to	category truecat but have been classified into
	      category predcat.	 The extracted messages	are copied to the  di-
	      rectory mailcross.d/review for perusal.

       testsuite list
	      Shows  a	list of	available filters/wrapper scripts which	can be
	      selected.

       testsuite select	[FILTER]...
	      Prepares the filter(s) named FILTER to be	used  for  simulation.
	      The  filter  name	is the name of a wrapper script	located	in the
	      directory	/usr/local/share/dbacl/testsuite.  Each	filter	has  a
	      rigid  interface	documented  below, and the act of selecting it
	      copies it	to the mailcross.d/filters directory. Only filters lo-
	      cated there are used in the simulations.

       testsuite deselect [FILTER]...
	      Removes the named filter(s) from the directory
	      mailcross.d/filters so that they are not used in the simulation.

       testsuite run
	      Invokes  every selected filter on	the datasets added previously,
	      and calculates misclassification rates.

       testsuite status
	      Describes	the scheduled simulations.

       testsuite summarize
	      Shows the cross-validation results for all filters. This is only
	      meaningful after testsuite run has completed.
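
       As a rough illustration of what a misclassification rate is, the
       sketch below counts disagreements between true and predicted
       categories. The input format (one "truecat predcat" pair per line)
       and the function name misclass_rate are assumptions made for this
       example. They do not describe the log files or the output of
       mailcross.

```shell
#!/bin/sh
# Hypothetical input: one "truecat predcat" pair per line on STDIN.
# This is NOT the log format used by mailcross.
misclass_rate() {
    awk '
        # count every pair, and separately those where the categories differ
        { total++; if ($1 != $2) wrong++ }
        END {
            if (total > 0)
                printf "%d of %d misclassified (%.1f%%)\n", wrong, total, 100 * wrong / total
        }'
}

printf '%s\n' "spam spam" "spam work" "work work" "play play" | misclass_rate
```

       Here one of the four sample pairs disagrees (a spam message predicted
       as work), which is the kind of message that review spam work would
       extract.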

USAGE
       The  normal  usage pattern is the following: first, you should separate
       your email collection into several categories (manually or  otherwise).
       Each  category  should be associated with one or	more folders, but each
       folder should not contain more than one category. Next, you should
       decide how many subsets to use, say 10. Note that the calculations
       slow down considerably as the number of subsets grows. Now you can type

       % mailcross prepare 10

       Next, for every category, you must add  every  folder  associated  with
       this  category. Suppose you have	three categories named spam, work, and
       play, which are associated with the mbox	 files	spam.mbox,  work.mbox,
       and play.mbox respectively. You would type

       % mailcross add spam spam.mbox
       % mailcross add work work.mbox
       % mailcross add play play.mbox

       You can now perform as many simulations as desired. Every
       cross-validation consists of a learning, a running and a summarizing
       stage. These operations are performed on the classifier specified in
       the MAILCROSS_FILTER and MAILCROSS_LEARNER variables. By setting these
       variables appropriately, you can compare classification performance as
       you vary the command line options of your classifier(s).

       % mailcross learn
       % mailcross run
       % mailcross summarize

       The testsuite commands are designed to simplify the above steps and al-
       low  comparison of a wide range of email	classifiers, including but not
       limited to dbacl.  Classifiers are supported through  wrapper  scripts,
       which are located in the	/usr/local/share/dbacl/testsuite directory.

       The  first stage	when using the testsuite is deciding which classifiers
       to compare.  You	can view a list	of available wrappers by typing:

       % mailcross testsuite list

       Note that the wrapper scripts are NOT  the  actual  email  classifiers,
       which must be installed separately by your system administrator or oth-
       erwise.	Once this is done, you can select one or more wrappers for the
       simulation by typing, for example:

       % mailcross testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are not selected. Note also that	some wrappers can have hard-coded cat-
       egory  names,  e.g.  if the classifier only supports binary classifica-
       tion. Heed the warning messages.

       It remains only to run the simulation. Beware, this  can	 take  a  long
       time (several hours depending on	the classifier).

       % mailcross testsuite run
       % mailcross testsuite summarize

       Once  you  are  all  done  with simulations, you	can delete the working
       files, log files	etc. by	typing

       % mailcross clean

       The progress of the cross validation is written silently	in various log
       files which are located in the mailcross.d/log directory.  Check	 these
       in case of problems.

SCRIPT INTERFACE
       mailcross testsuite takes care of learning and classifying your
       prepared email corpora for each selected classifier. Since classifiers
       have widely varying interfaces, this is only possible by wrapping those
       interfaces individually into a standard form which can be used by
       mailcross testsuite.

       Each  wrapper script is a command line tool which accepts a single com-
       mand followed by	zero or	more optional arguments, in the	standard form:

       wrapper command [argument]...

       Each wrapper script also	makes use of STDIN and STDOUT in  a  well  de-
       fined way. If no	behaviour is described,	then no	output or input	should
       be used.	 The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
	      category filenames is expected in	$2, $3,	etc. The script	writes
	      the category name	corresponding to the input email on STDOUT. No
	      trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
	      a	 suitable  category  file name is expected in $2. No output is
	      written to STDOUT.

       clean  In this case, a directory	is expected in $2, which  is  examined
	      for  old	database  information. If any old databases are	found,
	      they are purged or reset.	No output is written to	STDOUT.

       describe
	      In this case, a single line of text is written to STDOUT,
	      describing the filter's functionality. The line should be kept
	      short to prevent line wrapping on a terminal.

       bootstrap
	      In this case, a directory	is expected in $2. The wrapper	script
	      first checks for the existence of	its associated classifier, and
	      other  prerequisites. If the check is successful,	then the wrap-
	      per is cloned into the supplied directory.  A courtesy notifica-
	      tion should be given on STDOUT to	express	 success  or  failure.
	      It is also permissible to give longer descriptions or caveats.

       toe    Used by mailtoe(1).

       foot   Used by mailfoot(1).
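
       A wrapper following the interface above might be laid out as in the
       sketch below. The "classifier" here is a stub that always answers with
       the first category it is given, and the function name wrapper, the
       messages it prints, and the idea that a category name is the base name
       of its file are all assumptions made for illustration. A real wrapper
       would call the actual classifier in each branch.

```shell
#!/bin/sh
# Skeleton wrapper: dispatches on the command in $1.
# The classifier is a stub; replace each branch with real calls.
wrapper() {
    case "$1" in
    filter)
        # one email on STDIN, category file names in $2, $3, ...
        cat > /dev/null
        # assumption: the category name is the base name of its file;
        # printf '%s' writes it without a trailing newline
        printf '%s' "$(basename "$2")"
        ;;
    learn)
        # mbox stream on STDIN, category file name in $2; no output
        cat > "$2"
        ;;
    clean)
        # directory in $2; this stub keeps no extra databases, so there
        # is nothing to purge. A real wrapper would reset its files here.
        :
        ;;
    describe)
        echo "toy stub, always picks the first category"
        ;;
    bootstrap)
        # directory in $2; check prerequisites, then copy the wrapper there
        if command -v cat > /dev/null 2>&1 && [ -f "$0" ] && cp "$0" "$2"/; then
            echo "toy stub: selected"
        else
            echo "toy stub: prerequisites missing, not selected"
            return 1
        fi
        ;;
    toe|foot)
        # used by mailtoe(1) and mailfoot(1); not sketched here
        :
        ;;
    *)
        echo "usage: wrapper command [argument]..." >&2
        return 1
        ;;
    esac
}

# Dispatch only when a command was given on the command line.
if [ "$#" -gt 0 ]; then
    wrapper "$@"
fi
```

       Keeping the dispatch in a single case statement makes it easy to see
       which commands read STDIN and which write to STDOUT.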

ENVIRONMENT
       Right  after  loading,  mailcross reads the hidden file .mailcrossrc in
       the $HOME directory, if it exists, so this would	be a good place	to de-
       fine custom values for environment variables.

       MAILCROSS_FILTER
	      This variable contains a shell command to	be executed repeatedly
	      during the running stage.	 The command should  accept  an	 email
	      message on STDIN and output a resulting category name. It	should
	      also  accept  a list of category file names on the command line.
	      If undefined, mailcross uses the default value
	      MAILCROSS_FILTER="dbacl -T email -T xml -v" (and also
	      automatically adds the -c option before each category).

       MAILCROSS_LEARNER
	      This variable contains a shell command to be executed repeatedly
	      during the learning stage. The command should accept an mbox-type
	      stream of emails on STDIN for learning, and the file name of the
	      category on the command line. If undefined, mailcross uses the
	      default value MAILCROSS_LEARNER="dbacl -H 19 -T email -T xml -l".

       TEMPDIR
	      This directory is	exported for the benefit of  wrapper  scripts.
	      Scripts which need to create temporary files should place them
	      at the location given in TEMPDIR.
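
       As an example of the .mailcrossrc mechanism mentioned at the start of
       this section, the fragment below overrides the learner. It repeats the
       documented default and changes only the -H value. The number 20 is an
       arbitrary illustration, not a recommendation, and the export line
       assumes that mailcross picks the variable up from its environment.

```shell
# ~/.mailcrossrc (sketch)
# Same learner as the documented default, but with -H 20 instead of -H 19.
# The value 20 is only an example.
MAILCROSS_LEARNER="dbacl -H 20 -T email -T xml -l"
export MAILCROSS_LEARNER
```

       After changing the learner, repeat the learn, run and summarize
       commands to compare the results with the previous setting.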

NOTES
       The  subdirectory  mailcross.d can grow quite large. It contains	a full
       copy of the training corpora, as	well as	learning files for size	 times
       all the added categories, and various log files.

WARNING
       Cross-validation	 is  a	widely used, but ad-hoc	statistical procedure,
       completely unrelated to Bayesian	theory,	and  subject  to  controversy.
       Use this	at your	own risk.

SOURCE
       The  source code	for the	latest version of this program is available at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A.	Breyer <laird@lbreyer.com>

SEE ALSO
       bayesol(1), dbacl(1), mailinspect(1), mailtoe(1), mailfoot(1), regex(7)

Version	1.14.1	      Bayesian Text Classification Tools	  MAILCROSS(1)
