MAILFOOT(1)							   MAILFOOT(1)

NAME
       mailfoot	- a full-online-ordered-training simulator for use with	dbacl.

SYNOPSIS

       mailfoot	command	[ command_arguments ]

DESCRIPTION
       mailfoot	 automates the task of testing email filtering and classifica-
       tion programs such as dbacl(1).	Given a	set of categorized  documents,
       mailfoot	 initiates test	runs to	estimate the classification errors and
       thereby permit fine tuning of the parameters of the classifier.

       Full Online Ordered Training is a learning method for email classifiers
       where each incoming email is learned as soon as it arrives, thereby al-
       ways keeping category descriptions up to	date for the next  classifica-
       tion.   This  directly  models  the way that some email classifiers are
       used in practice.

       FOOT's error rates depend directly on the order	in  which  emails  are
       seen.   A  small	 change	in ordering, as	might happen due to networking
       delays, can have an impact on the number of misclassifications.
       Consequently, mailfoot does not give meaningful results unless the
       sample emails are chosen carefully.  However, as this method is
       commonly used by spam filters, it is still worth computing to foster
       comparisons.  Other methods (see mailcross(1), mailtoe(1)) attempt to
       capture the behaviour of classification errors in other ways.

       To  improve and stabilize the error rate	calculation, mailfoot performs
       the FOOT	simulations several times on slightly reordered	email streams,
       and averages the	results. The reorderings  occur	 by  multiplexing  the
       emails  from  each  category mailbox in random order. Thus if there are
       three categories, the first email classified is	chosen	randomly  from
       the  front  of the sample email streams of each type.  The second email
       is also chosen randomly among the three types, from the front of the
       streams after the first email was removed.  Simulation stops when all
       sample streams are exhausted.
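
       The multiplexing described above can be sketched as follows.  This is
       an illustration only, not the code mailfoot itself uses: the function
       name multiplex and the SEED variable are hypothetical, each input file
       stands in for one category stream with one "email" per line, and the
       uniform choice among non-empty streams is an assumption.

```shell
#!/bin/sh
# Sketch only: randomly multiplex several ordered streams, one item at a time.
# Each input file stands in for one category stream, one "email" per line
# (a simplification; real mbox messages span many lines).
# multiplex and SEED are hypothetical names, not part of mailfoot.
multiplex() {
    awk -v seed="${SEED:-1}" '
        BEGIN { srand(seed) }
        # Queue every line per input file, preserving its original order.
        FNR == 1 { nfiles++; name[nfiles] = FILENAME }
        { count[nfiles]++; item[nfiles, count[nfiles]] = $0 }
        END {
            for (i = 1; i <= nfiles; i++) head[i] = 1
            for (;;) {
                # Collect the streams that still have items left.
                nactive = 0
                for (i = 1; i <= nfiles; i++)
                    if (head[i] <= count[i]) active[++nactive] = i
                if (nactive == 0) break
                # Pick one stream at random and emit the item at its front.
                pick = active[int(rand() * nactive) + 1]
                print name[pick] ": " item[pick, head[pick]]
                head[pick]++
            }
        }
    ' "$@"
}

# Usage: multiplex spam.txt work.txt play.txt
```

       Changing SEED gives a different interleaving, while the relative order
       of the items within each stream stays the same.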

       mailfoot	 uses the environment variable MAILFOOT_FILTER when executing,
       which permits the simulation of arbitrary filters, provided these  sat-
       isfy the	compatibility conditions stated	in the ENVIRONMENT section be-
       low.

       For  convenience, mailfoot implements a testsuite framework with	prede-
       fined wrappers for several open source classifiers.  This  permits  the
       direct  comparison  of  dbacl(1)	with competing classifiers on the same
       set of email samples. See the USAGE section below.

       During preparation, mailfoot builds a subdirectory named	mailfoot.d  in
       the  current  working directory.	 All needed calculations are performed
       inside this subdirectory.

EXIT STATUS
       mailfoot	returns	0 on success, 1	if a problem occurred.

COMMANDS
       prepare size
	      Prepares a subdirectory named mailfoot.d in the current  working
	      directory,  and  populates  it with empty	subdirectories for ex-
	      actly size subsets.

       add category [ FILE ]...
	      Takes a set of emails from either	FILE if	specified,  or	STDIN,
	      and  associates  them  with  category.   The  ordering of	emails
	      within FILE is preserved,	and subsequent FILEs are  appended  to
	      the  first  in each category.  This command can be repeated sev-
	      eral times, but should be	executed at least once.

       clean  Deletes the directory mailfoot.d and all its contents.

       run    Multiplexes randomly from	the email streams added	 earlier,  and
	      relearns	categories  only  when a misclassification occurs. The
	      simulation is repeated size times.

       summarize
	      Prints average error rates for the simulations.

       plot [ ps | logscale ]...
              Plots the number of errors over simulation time.  The "ps"
              option, if present, writes the plot to a PostScript file in the
              directory mailfoot.d/plots instead of showing it on-screen.  The
              "logscale" option, if present, draws both axes on a logarithmic
              scale.

       review truecat predcat
	      Scans the	last run statistics  and  extracts  all	 the  messages
	      which  belong  to	category truecat but have been classified into
	      category predcat.	 The extracted messages	are copied to the  di-
	      rectory mailfoot.d/review	for perusal.

       testsuite list
	      Shows  a	list of	available filters/wrapper scripts which	can be
	      selected.

       testsuite select	[ FILTER ]...
	      Prepares the filter(s) named FILTER to be	used  for  simulation.
	      The  filter  name	is the name of a wrapper script	located	in the
	      directory	/usr/local/share/dbacl/testsuite.  Each	filter	has  a
	      rigid  interface	documented  below, and the act of selecting it
	      copies it	to the mailfoot.d/filters directory. Only filters  lo-
	      cated there are used in the simulations.

       testsuite deselect [ FILTER ]...
	      Removes  the  named filter(s) from the directory mailfoot.d/fil-
	      ters so that they	are not	used in	the simulation.

       testsuite run [ plots ]
	      Invokes every selected filter on the datasets added  previously,
	      and calculates misclassification rates. If the "plots" option is
	      present,	each filter simulation is plotted as a postscript file
	      in the directory mailfoot.d/plots.

       testsuite status
	      Describes	the scheduled simulations.

       testsuite summarize
              Shows the simulation results for all filters.  Only makes sense
              after the testsuite run command.

USAGE
       The  normal  usage pattern is the following: first, you should separate
       your email collection into several categories (manually or  otherwise).
       Each  category  should be associated with one or	more folders, but each
       folder should not contain more than one category. Next, you should  de-
       cide  how  many runs to use, say	10.  The more runs you use, the	better
       the predicted error rates. However, more	runs take more time.  Now  you
       can type

       % mailfoot prepare 10

       Next,  for  every  category,  you must add every	folder associated with
       this category. Suppose you have three categories	named spam, work,  and
       play,  which  are  associated with the mbox files spam.mbox, work.mbox,
       and play.mbox respectively. You would type

       % mailfoot add spam spam.mbox
       % mailfoot add work work.mbox
       % mailfoot add play play.mbox

       You should aim for a similar number of emails in	each category, as  the
       random  multiplexing  will be unbalanced	otherwise. The ordering	of the
       email messages in each *.mbox file is important,	and is preserved  dur-
       ing  each  simulation.  If you repeatedly add to	the same category, the
       later mailboxes will be appended	to the first, preserving  the  implied
       ordering.
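
       Before adding the folders, it can help to check how balanced the
       categories are.  The sketch below counts the messages in each mbox
       file.  The function name count_messages is hypothetical, and the count
       assumes that every message starts with a "From " separator line, as in
       the traditional mbox format.

```shell
#!/bin/sh
# Sketch: report how many messages each mbox file holds, so that categories
# can be checked for balance before running "mailfoot add".
# Assumption: each message starts with a "From " separator line at column 0;
# other mailbox variants may need a different test.
count_messages() {
    for mbox in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches.
        n=$(grep -c '^From ' "$mbox" || true)
        printf '%s %s\n' "$n" "$mbox"
    done
}

# Usage: count_messages spam.mbox work.mbox play.mbox
```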

       You  can	 now  perform  as many FOOT simulations	as desired. The	multi-
       plexed emails are classified and	learned	one at a  time,	 by  executing
       the  command  given in the environment variable MAILFOOT_FILTER.	If not
       set, a default value is used.

       % mailfoot run
       % mailfoot summarize

       The testsuite commands are designed to simplify the above steps and al-
       low comparison of a wide	range of email classifiers, including but  not
       limited	to  dbacl.  Classifiers	are supported through wrapper scripts,
       which are located in the	/usr/local/share/dbacl/testsuite directory.

       The first stage when using the testsuite	is deciding which  classifiers
       to compare.  You	can view a list	of available wrappers by typing:

       % mailfoot testsuite list

       Note  that  the	wrapper	 scripts are NOT the actual email classifiers,
       which must be installed separately by your system administrator or oth-
       erwise.	Once this is done, you can select one or more wrappers for the
       simulation by typing, for example:

       % mailfoot testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are not selected. Note also that	some wrappers can have hard-coded cat-
       egory names, e.g. if the	classifier only	 supports  binary  classifica-
       tion. Heed the warning messages.

       It  remains  only  to  run the simulation. Beware, this can take	a long
       time (several hours depending on	the classifier).

       % mailfoot testsuite run
       % mailfoot testsuite summarize

       Once you	are all	done, you can delete the working files,	log files etc.
       by typing

       % mailfoot clean

SCRIPT INTERFACE
       mailfoot	testsuite takes	care of	learning and classifying your prepared
       email corpora for each  selected	 classifier.  Since  classifiers  have
       widely  varying interfaces, this	is only	possible by wrapping those in-
       terfaces	individually into a standard form which	can be used  by	 mail-
       foot testsuite.

       Each  wrapper script is a command line tool which accepts a single com-
       mand followed by	zero or	more optional arguments, in the	standard form:

       wrapper command [argument]...

       Each wrapper script also	makes use of STDIN and STDOUT in  a  well  de-
       fined way. If no	behaviour is described,	then no	output or input	should
       be used.	 The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
	      category filenames is expected in	$2, $3,	etc. The script	writes
	      the category name	corresponding to the input email on STDOUT. No
	      trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
	      a	 suitable  category  file name is expected in $2. No output is
	      written to STDOUT.

       clean  In this case, a directory	is expected in $2, which  is  examined
	      for  old	database  information. If any old databases are	found,
	      they are purged or reset.	No output is written to	STDOUT.

       describe
              In this case, a single line of text is written to STDOUT,
              describing the filter's functionality.  The line should be kept
              short to prevent line wrapping on a terminal.

       bootstrap
	      In this case, a directory	is expected in $2. The wrapper	script
	      first checks for the existence of	its associated classifier, and
	      other  prerequisites. If the check is successful,	then the wrap-
	      per is cloned into the supplied directory.  A courtesy notifica-
	      tion should be given on STDOUT to	express	 success  or  failure.
              It is also permissible to give longer descriptions or caveats.

       toe    Used by mailtoe(1).

       foot   In  this	case, a	list of	categories is expected in $3, $4, etc.
	      Every possible category must be listed. Preceding	this list, the
	      true category is given in	$2.
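
       The commands above could be dispatched as in the following sketch.  It
       is not one of the shipped wrappers.  The classifier is a toy invented
       for illustration (a category file is plain text, and a message goes to
       the category sharing the most words with it), and the helper names
       words and classify are hypothetical.  Two points are assumptions rather
       than documented behaviour: that a category name is the basename of its
       category file, and that foot reads the email on STDIN, prints its guess
       and relearns on a mismatch, following the MAILFOOT_FILTER description.

```shell
#!/bin/sh
# Hypothetical wrapper skeleton (illustration only, not a shipped wrapper).

# Print the distinct lowercase words read from STDIN, one per line.
words() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# classify CATFILE... < email: print the best category name, no newline.
# Assumption: the category name is the basename of its category file.
classify() {
    msg=$(mktemp "${TEMPDIR:-/tmp}/toy.XXXXXX") || return 1
    words > "$msg"
    best=$1 bestscore=-1
    for c in "$@"; do
        score=0
        if [ -f "$c" ]; then
            # Number of distinct words shared by the message and the category.
            score=$(words < "$c" | comm -12 - "$msg" | wc -l | tr -d ' ')
        fi
        if [ "$score" -gt "$bestscore" ]; then best=$c bestscore=$score; fi
    done
    rm -f "$msg"
    printf '%s' "$(basename "$best")"
}

wrapper() {
    case "$1" in
    filter)
        shift
        classify "$@"
        ;;
    learn)
        cat >> "$2"     # toy learning: append the mbox stream to the file
        ;;
    clean)
        # The toy keeps no database apart from the category files; a real
        # wrapper would purge or reset stale databases found under "$2".
        :
        ;;
    describe)
        echo "toy word-overlap classifier (illustration only)"
        ;;
    bootstrap)
        if command -v comm >/dev/null 2>&1 && cp "$0" "$2"/; then
            echo "toy wrapper: installed in $2"
        else
            echo "toy wrapper: prerequisites missing, not installed"
            return 1
        fi
        ;;
    toe)
        :               # used by mailtoe(1); outside the scope of this sketch
        ;;
    foot)
        truecat=$2
        shift 2
        mail=$(mktemp "${TEMPDIR:-/tmp}/toy.XXXXXX") || return 1
        cat > "$mail"
        guess=$(classify "$@" < "$mail")
        if [ "$guess" != "$truecat" ]; then
            # Misclassified: learn the message into the true category.
            for c in "$@"; do
                if [ "$(basename "$c")" = "$truecat" ]; then
                    cat "$mail" >> "$c"
                fi
            done
        fi
        rm -f "$mail"
        printf '%s' "$guess"
        ;;
    *)
        echo "usage: wrapper command [argument]..." >&2
        return 1
        ;;
    esac
}

# In a real wrapper file, the last line would be: wrapper "$@"
```

       Keeping the dispatch in a single case statement makes it easy to see
       which commands use STDIN or STDOUT and which stay silent.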

ENVIRONMENT
       Right after loading, mailfoot reads the hidden file .mailfootrc in  the
       $HOME  directory, if it exists, so this would be	a good place to	define
       custom values for environment variables.

       MAILFOOT_FILTER
	      This variable contains a shell command to	be executed repeatedly
	      during the running stage.	 The command should  accept  an	 email
	      message  on  STDIN  and output a resulting category name.	On the
	      command line, it should also  accept  first  the	true  category
	      name,  then  a list of all possible category file	names.	If the
	      output category does not match the true category,	then the rele-
	      vant categories are assumed to have  been	 silently  updated/re-
	      learned.	 If  MAILFOOT_FILTER is	undefined, mailfoot uses a de-
	      fault value.

       TEMPDIR
	      This directory is	exported for the benefit of  wrapper  scripts.
              Scripts which need to create temporary files should place them
              at the location given in TEMPDIR.
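
       As an illustration of the calling convention described for
       MAILFOOT_FILTER, the sketch below is a trivial baseline command that
       ignores the message and always answers with the first category listed.
       The name baseline_filter and the path in the final comment are
       hypothetical, and the category name is assumed to be the basename of
       the category file.  Such a baseline never learns anything, so its
       error rate can only serve as a point of comparison.

```shell
#!/bin/sh
# Hypothetical baseline for MAILFOOT_FILTER (illustration only).
# Called as: baseline_filter TRUECAT CATFILE... with the email on STDIN.
baseline_filter() {
    cat > /dev/null                 # consume the email on STDIN
    shift                           # drop the true category name
    [ "$#" -gt 0 ] || return 1      # at least one category file is needed
    # Assumption: the category name is the basename of the category file.
    printf '%s' "$(basename "$1")"  # no trailing newline
}

# Saved as an executable script ending in: baseline_filter "$@"
# it could then be selected with, for example:
#   export MAILFOOT_FILTER=/path/to/baseline_filter.sh
```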

NOTES
       The  subdirectory  mailfoot.d  can grow quite large. It contains	a full
       copy of the training corpora, as	well as	learning files for size	 times
       all the added categories, and various log files.

       FOOT simulations	for dbacl(1) are very, very slow (order	n squared) and
       will take all night to perform. This is not easy	to improve.

WARNING
       Because	the ordering of	emails within the added	mailboxes matters, the
       estimated error rates are not well defined or even meaningful in	an ob-
       jective sense.  However,	if the sample emails represent an actual snap-
       shot of a user's	incoming email,	then  the  error  rates	 are  somewhat
       meaningful. The simulations can then be interpreted as alternate	reali-
       ties where a given classifier would have	intercepted the	incoming mail.

SOURCE
       The  source code	for the	latest version of this program is available at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A.	Breyer <laird@lbreyer.com>

SEE ALSO
       bayesol(1), dbacl(1), mailcross(1), mailinspect(1), mailtoe(1), regex(7)

Version	1.14.1	      Bayesian Text Classification Tools	   MAILFOOT(1)
