
FreeBSD Manual Pages


DBACL(1)							      DBACL(1)

       dbacl - a digramic Bayesian classifier for text recognition.

       dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size] [-H gsize] [-x
              decim] [-q quality] [-w max_order] [-e deftok] [-o online] [-L
              measure] [-z fthreshold] [-O ronline]...  [-g regex]...  [FILE]...

       dbacl  [-vnimNRXYP]  [-h	 size]	[-T type] -c category [-c category]...
	      [-f keep]...  [FILE]...

       dbacl -V

       dbacl is a Bayesian text and email classifier. When using the -l
       switch, it learns a body of text and produces a file named category
       which summarizes the text. When using the -c switch, it compares an
       input text stream with any number of category files, and outputs the
       name of the closest match, or optionally various numerical scores
       explained below.

       Whereas this manual page is intended as a reference, there are several
       tutorials and documents you can read to get specialized information.
       Specific documentation about the design of dbacl and the statistical
       models that it uses can be found in the accompanying documentation.
       For a basic overview of text classification using dbacl, see
       tutorial.html. A companion tutorial geared towards email filtering is
       email.html. If you have trouble getting dbacl to classify reliably,
       read is_it_working.html. The USAGE section of this manual page also
       has some examples.





       dbacl uses a maximum entropy (minimum divergence) language  model  con-
       structed	 with  respect to a digramic reference measure (unknown	tokens
       are predicted from digrams, i.e.	pairs of letters).  Practically,  this
       means  that  a category is constructed from tokens in the training set,
       while previously	unseen tokens  can  be	predicted  automatically  from
       their  letters.	A token	here is	either a word (fragment) or a combina-
       tion of words (fragments),  selected  according	to  various  switches.
       Learning	roughly	works by tweaking token	probabilities until the	train-
       ing data	is least surprising.

       The normal shell	exit conventions aren't	followed (sorry!). When	 using
       the -l command form, dbacl returns zero on success, nonzero if an error
       occurs. When using the -c form, dbacl returns a positive	integer	corre-
       sponding	 to  the  category  with the highest posterior probability. In
       case of a tie, the first	most probable category is chosen. If an	 error
       occurs, dbacl returns zero.

       When  using the -l command form,	dbacl learns a category	when given one
       or more FILE names, which should	contain	readable  ASCII	 text.	If  no
       FILE  is	 given,	dbacl learns from STDIN. If FILE is a directory, it is
       opened and all its files	are read, but not its subdirectories. The  re-
       sult  is	 saved	in  the	binary file named category, and	completely re-
       places any previous contents. As	 a  convenience,  if  the  environment
       variable	DBACL_PATH contains a directory, then that is prepended	to the
       file path, unless category starts with a	'/' or a '.'.

       The input text for learning is assumed to be unstructured plain text by
       default.	 This  is  not suitable	for learning email, because email con-
       tains various transport encodings and formatting	instructions which can
       reduce classification effectiveness. You	must use the -T	switch in that
       case so that dbacl knows	it should perform decoding  and	 filtering  of
       MIME and HTML as appropriate.  Appropriate switch values are "-T email"
       for RFC2822 email input, "-T html" for HTML input, "-T xml" for generic
       XML style input, and "-T text" for the default plain text format. There
       are other values	of the -T switch that also allow fine  tuning  of  the
       decoding	capabilities.
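
       For example, to learn an email category from a folder in mbox format
       (the file names here are illustrative):

       % dbacl -T email -l spam spam.mbox
       % dbacl -T email -l notspam notspam.mbox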

       When  using  the	 -c  command form, dbacl attempts to classify the text
       found in	FILE, or STDIN if no FILE is  given.  Each  possible  category
       must  be	 given separately, and should be the file name of a previously
       learned text corpus. As a convenience, if the variable DBACL_PATH  con-
       tains  a	 directory,  it	 is  prepended to each file path which doesn't
       start with a '/'	or a '.'. The visible output of	the classification de-
       pends  on the combination of extra switches used. If no switch is used,
       then no output is shown on STDOUT. However, dbacl  always  produces  an
       exit code which can be tested.

       To see any output for a classification, you must use at least one of
       the -v, -U, -n, -N, -D, -d switches. They can sometimes be used in
       combination to produce a natural variation of their individual
       outputs. Where applicable, dbacl also produces warnings on STDERR.

       The -v switch outputs the name of  the  best  category  among  all  the
       choices given.

       The  -U switch outputs the name of the best category followed by	a con-
       fidence percentage. Normally, this is the switch	that you want to  use.
       A  percentage  of  100% means that dbacl	is sure	of its choice, while a
       percentage of 0%	means that some	other category is equally likely. This
       is  not the model probability, but measures how unambiguous the classi-
       fication	is, and	can be used to tag unsure classifications (e.g.	if the
       confidence is 25% or less).
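
       As an illustrative sketch, the -U confidence can be tested in a shell
       pipeline to tag unsure classifications, assuming the "category # NN%"
       output format shown in the USAGE section (the category names and the
       25% cutoff are hypothetical):

       % dbacl -U -c spam -c notspam message.txt | \
         awk '{ if ($3+0 <= 25) print "unsure"; else print $1 }'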

       The  -N	switch	prints	each category name followed by its (posterior)
       probability, expressed as a percentage. The percentages always  sum  to
       100%. This is intuitive, but only valuable if the document being
       classified contains a handful of tokens (ten or fewer). In the common case
       with  many more tokens, the probabilities are always extremely close to
       100% and	0%.

       The -n switch prints each category name followed	by the negative	 loga-
       rithm  of  its  probability. This is equivalent to using	the -N switch,
       but much	more useful. The smallest number gives the  best  category.  A
       more  convenient	 form is to use	both -n	and -v which prints each cate-
       gory name followed by the cross entropy and the number of  tokens  ana-
       lyzed.  The  cross  entropy  measures (in bits) the average compression
       rate which is achievable, under the given category model, per token  of
       input  text.  If	 you  use all three of -n,-v,-X	then an	extra value is
       output for each category, representing a	kind of	p-value	for each cate-
       gory  score.  This  indicates  how typical the score is compared	to the
       training	documents, but only works if the -X  switch  was  used	during
       learning, and only for some types of models (e.g. email).  These	p-val-
       ues are uniformly distributed and independent (if  the  categories  are
       independent), so	can be combined	using Fisher's chi squared test	to ob-
       tain composite p-values for groupings of	categories.

       The -v and -X switches together print each category name	followed by  a
       detailed	 decomposition	of  the	category score,	factored into (	diver-
       gence rate + shannon entropy rate )* token count	@ p-value. Again, this
       only works in some types	of models.

       The  -v and -U switches print each category name	followed by a decompo-
       sition of the category score into ( divergence rate +  shannon  entropy
       rate # score variance )*	token count.

       The -D switch prints out	the input text as modified internally by dbacl
       prior to	tokenization. For example, if a	MIME encoded email document is
       classified, then this prints the decoded text that will actually be
       tokenized and classified. This switch is mainly useful for debugging.

       The -d switch dumps tokens and scores while they	are being read.	It  is
       useful  for  debugging,	or if you want to create graphical representa-
       tions of	the classification. A detailed explanation of  the  output  is
       beyond the scope of this manual page, but is straightforward if you
       have read the accompanying documentation. Possible variations include
       -d together with -n or -N.

       In principle, classification can be done with one or several
       categories.  When two or more categories are used, the Bayesian
       posterior probability is used, given the input text, with a uniform
       prior distribution on categories.  For other choices of prior, see the
       companion utility bayesol(1).  When a single category is used,
       classification can be done by comparing the score with a threshold.
       In practice however, much better results are obtained with several
       categories.
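
       For example, a single-category threshold test might be sketched as
       follows, using the -n score (the category name and the cutoff value 40
       are illustrative, not recommendations):

       % dbacl -n -c shake message.txt | awk '{ if ($2 < 40) print "match" }'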

       Learning and classifying cannot be mixed on the same command
       invocation; however, there are no locking issues, and separate dbacl
       processes can operate simultaneously with obvious results, because
       file operations are designed to be atomic.

       Finally,	 note that dbacl does not manage your document corpora or your
       computed	categories.  In	particular, dbacl cannot  add  or  subtract  a
       document	 from  a category file directly.  If you want to learn a cate-
       gory incrementally, the standard	way is to keep adding to your document
       corpus,	and  learn  the	 whole corpus each time. By keeping control of
       your archives, you can never lose the information in  your  categories,
       and  you	can easily experiment with different switches or tokenizations
       or sets of training documents if	you like.
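
       For example, incremental learning can be scripted by keeping a corpus
       directory and relearning it after every addition; recall that if FILE
       is a directory, all the files inside it are read (the paths here are
       illustrative):

       % cp latest_message.txt ~/corpus/spam/
       % dbacl -T email -l spam ~/corpus/spam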

       If the standard incremental learning method is too slow,	the -o	switch
       can  help. This creates a data file named online	which contains all the
       document	statistics that	have been learned. When	you use	the -l and  -o
       switches	 together,  dbacl  merges  the online data file	(if it exists)
       with the	new document(s)	to be learned, and recreates an	 updated  ver-
       sion  of	online.	 This is equivalent to adding the new documents	to the
       corpus and relearning the whole corpus, but faster. However,  documents
       cannot  be  removed  if	you  change your mind. This is a limitation of
       dbacl which cannot be changed for mathematical reasons.	You  can  work
       around  this by making backups of the online data file. It is also pos-
       sible to	merge one or more extra	online data  files  simultaneously  by
       using the -O switch one or more times.
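
       For example, a faster incremental update with an online data file
       might look as follows (file names are illustrative):

       % dbacl -T email -l spam -o spam.onl batch1.mbox
       % dbacl -T email -l spam -o spam.onl batch2.mbox

       The second invocation merges the statistics saved in spam.onl with
       batch2.mbox, instead of relearning batch1.mbox from scratch.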

       By default, dbacl classifies the input text as a whole, i.e. it only
       outputs a single result even if you specify several input files. If you
       want to classify	multiple input files you can either call dbacl repeat-
       edly (which is fast when	you use	the -m switch),	or use the -F  switch,
       which prints each input FILE followed by	the result for that FILE.  Al-
       ternatively, you	can classify each line of the input  individually,  by
       using  the  -f option, which prints only	those lines which match	one or
       more models identified by keep (use the category	name or	number to  re-
       fer  to	a  category). This last	switch is useful if you	want to	filter
       out some	lines, but note	that if	the lines are short,  then  the	 error
       rate can	be high.
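
       For example, to classify several files in one invocation, printing one
       result per FILE (file names are illustrative):

       % dbacl -F -v -c twain -c shake letter1.txt letter2.txt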

       The  -e,-w,-g,-j	 switches  are	used for selecting an appropriate tok-
       enization scheme. A token is a word or word fragment or combination  of
       words  or  fragments. The shape of tokens is important because it forms
       the basis of the	language models	used by	dbacl.	The -e switch  selects
       a  predefined tokenization scheme, which	is speedy but limited.	The -w
       switch specifies	composite tokens derived from the -e switch. For exam-
       ple,  "-e  alnum	 -w  2"	 means that tokens should be alphanumeric word
       fragments combined into overlapping pairs (bigrams). By default, all
       tokens are converted to lowercase, which reduces the number of
       possible tokens and therefore memory consumption; the -j switch
       preserves the original capitalization instead.
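
       For example, to learn a category from overlapping alphanumeric bigrams
       (the file name is taken from the USAGE section below):

       % dbacl -e alnum -w 2 -l twain Mark_Twain.txt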

       If the -g switch	is used, you can completely specify  what  the	tokens
       should look like	using a	regular	expression. Several -g switches	can be
       used to construct complex tokenization schemes, and parentheses	within
       each  expression	 can be	used to	select fragments and combine them into
       n-grams.	The cost of such flexibility  is  reduced  classification  and
       learning	speed. When experimenting with tokenization schemes, try using
       the -d or -D switches while learning or classifying, as they will print
       the  tokens explicitly so you can see what text fragments are picked up
       or missed out. For regular expression syntax, see regex(7).
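
       As an illustration, the following learns single words together with
       word pairs using two -g expressions (the regular expressions are a
       sketch only):

       % dbacl -l twain -g '[[:alpha:]]+' \
               -g '([[:alpha:]]+) ([[:alpha:]]+)||12' Mark_Twain.txt

       The first expression contains no parentheses, so each whole match
       becomes a token; the second tags both subexpressions via the ||12
       suffix, so each pair of adjacent words becomes a bigram token.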

       The -h and -H switches regulate how  much  memory  dbacl	 may  use  for
       learning.  Text	classification can use a lot of	memory,	and by default
       dbacl limits itself even	at the expense of learning accuracy.  In  many
       cases  if  a  limit  is	reached,  a warning message will be printed on
       STDERR with some	advice.

       When relearning the same	category several times,	a significant  speedup
       can  be	obtained by using the -1 switch, as this allows	the previously
       learned probabilities to	be read	from the category and reused.

       Note that classification accuracy depends foremost on the amount and
       quality of the training samples, and only then on the amount of
       tweaking.

       When using the -l command form, dbacl returns zero on success. When us-
       ing the -c form,	dbacl returns a	 positive  integer  (1,2,3...)	corre-
       sponding	 to  the  category  with the highest posterior probability. In
       case of a tie, the first	most probable category is chosen. If an	 error
       occurs, dbacl returns zero.
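
       A minimal shell fragment that branches on this exit code might look
       like the following (category names are illustrative):

       dbacl -c spam -c notspam message.txt
       case $? in
           1) echo "spam" ;;
           2) echo "notspam" ;;
           0) echo "an error occurred" ;;
       esac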

       -0     When  learning,  prevents	 weight	 preloading.  Normally,	 dbacl
	      checks if	the category file already exists, and if so, tries  to
	      use  the existing	weights	as a starting point. This can dramati-
	      cally speed up learning.	If the -0 (zero) switch	is  set,  then
	      dbacl  behaves  as  if  no category file already exists. This is
	      mainly useful for	testing.  This switch is now  enabled  by  de-
	      fault, to	protect	against	weight drift which can reduce accuracy
	      over many	learning iterations. Use -1 to force preloading.

       -1     Force weight preloading if the category file already exists. See
	      discussion of the	-0 switch.

       -a     Append  scores.  Every  input  line is written to	STDOUT and the
	      dbacl scores are appended. This  is  useful  for	postprocessing
	      with  bayesol(1).	  For ease of processing, every	original input
	      line is indented by a single space (to distinguish them from the
	      appended	scores),  and the line with the	scores (if -n is used)
	      is prefixed with the string "scores ". If	a second copy of dbacl
	      needs  to	 read this output later, it should be invoked with the
	      -A switch.
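
       A sketch of such a pipeline, where risk.spec stands for a hypothetical
       bayesol risk specification file (see bayesol(1) for the actual
       interface):

       % dbacl -T email -a -n -c spam -c notspam message.txt | \
         bayesol -c risk.spec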

       -d     Dump the model parameters	to STDOUT. In conjunction with the  -l
	      option,  this  produces  a human-readable	summary	of the maximum
	      entropy model. In	conjunction with the -c	option,	 displays  the
	      contribution  of	each  token to the final score.	Suppresses all
	      other normal output.

       -e     Select character class for default (not  regex-based)  tokeniza-
	      tion.  By	default, tokens	are alphabetic strings only. This cor-
	      responds to the case when	deftok is "alpha". Possible values for
	      deftok  are  "alpha", "alnum", "graph", "char", "cef" and	"adp".
	      The last two are custom tokenizers intended for email  messages.
	      See  also	 isalpha(3).   The  "char"  tokenizer  picks up	single
	      printable	characters rather than bigger tokens, and is  intended
	      for testing only.

       -f     Filter  each  line  of  input separately,	passing	to STDOUT only
	      lines which match	the category identified	as keep.  This	option
	      should  be used repeatedly for each category which must be kept.
	      keep can be either the category file name, or a positive integer
	      representing  the	required category in the same order it appears
	      on the command line.

	      Output lines are flushed as soon as they are written. If the in-
	      put  file	is a pipe or character device, then an attempt is made
	      to use line buffering mode, otherwise the	more  efficient	 block
	      buffering	is used.

       -g     Learn only features described by the extended regular expression
	      regex.  This overrides the default feature selection method (see
	      -w  option) and learns, for each line of input, only tokens con-
	      structed from the	 concatenation	of  strings  which  match  the
	      tagged subexpressions within the supplied	regex.	All substrings
	      which match regex	within a suffix	of each	input line are treated
	      as features, even	if they	overlap	on the input line.

	      As  an  optional convenience, regex can include the suffix ||xyz
	      which indicates which  parenthesized  subexpressions  should  be
	      tagged. In this case, xyz	should consist exclusively of digits 1
	      to 9, numbering exactly those  subexpressions  which  should  be
	      tagged.  Alternatively,  if  no  parentheses exist within	regex,
	      then it is assumed that the whole	expression must	be captured.

       -h     Set the size of the hash table to	2^size	elements.  When	 using
	      the  -l  option, this refers to the total	number of features al-
	      lowed in the maximum entropy model being learned.	When using the
              -c option together with the -M switch and multinomial type cat-
	      egories, this refers to the maximum  number  of  features	 taken
	      into account during classification.  Without the -M switch, this
	      option has no effect.

       -i     Fully internationalized mode. Forces the use of wide  characters
	      internally,  which  is  necessary	in some	locales. This incurs a
	      noticeable performance penalty.

       -j     Make features case sensitive. Normally, all  features  are  con-
	      verted  to  lower	 case during processing, which reduces storage
	      requirements  and	 improves  statistical	estimates  for	 small
	      datasets.	 With this option, the original	capitalization is used
	      for each feature.	This can improve classification	accuracy.

       -m     Aggressively maps	categories into	memory and locks them into RAM
	      to  prevent  swapping, if	possible. This is useful when speed is
	      paramount	and memory is plentiful, for example when testing  the
	      classifier on large datasets.

	      Locking  may  require  relaxing user limits with ulimit(1).  Ask
	      your system administrator. Beware	when using the -m  switch  to-
	      gether  with the -o switch, as only one dbacl process must learn
	      or classify at a time to prevent file corruption.	If no learning
	      takes  place,  then the -m switch	for classifying	is always safe
	      to use. See also the discussion for the -o switch.

       -n     Print scores for each category.  Each score is  the  product  of
	      two  numbers,  the cross entropy and the complexity of the input
	      text under each model. Multiplied	together, they	represent  the
	      log probability that the input resembles the model. To see these
	      numbers separately, use also the -v option. In conjunction  with
	      the  -f  option,	stops  filtering  but  prints  each input line
	      prepended	with a list of scores for that line.

       -q     Select quality of	learning, where	quality	can be 1,2,3,4.	Higher
	      values  take  longer to learn, and should	be slightly more accu-
	      rate. The	default	quality	is 1 if	the category file doesn't  ex-
	      ist or weights cannot be preloaded, and 2	otherwise.

       -o     When  learning, reads/writes partial token counts	so they	can be
	      reused. Normally,	category files are learned  from  exactly  the
	      input data given,	and don't contain extraneous information. When
	      this option is in	effect,	some extra information is saved	in the
	      file  online,  after all input was read. This information	can be
	      reread the next time that	learning occurs, to continue where the
	      previous	dataset	 left off. If online doesn't exist, it is cre-
	      ated. If online exists, it is read before	learning, and  updated
	      afterwards.  The file is approximately 3 times bigger (at	least)
	      than the learned category.

	      In dbacl,	file updates are atomic, but if	using the  -o  switch,
	      two  or  more processes should not learn simultaneously, as only
	      one process will write a lasting category	and memory  dump.  The
	      -m  switch can also speed	up online learning, but	beware of pos-
	      sible corruption.	 Only one process should read or write a file.
	      This  option is intended primarily for controlled	test runs. See
	      also the -O (big-oh) switch.

       -r     Learn the	digramic reference model only. Skips the  learning  of
	      extra features in	the text corpus.

       -v     Verbose  mode.  When learning, print out details of the computa-
	      tion, when classifying, print out	the name of the	most  probable
	      category.	  In conjunction with the -n option, prints the	scores
	      as an explicit product of	the cross entropy and the complexity.

       -w     Select default features to be n-grams up to max_order.  This  is
	      incompatible  with the -g	option,	which always takes precedence.
	      If no -w or -g options are given,	dbacl assumes -w 1. Note  that
	      n-grams  for n greater than 1 do not straddle line breaks	by de-
	      fault.  The -S switch enables line straddling.

       -x     Set decimation probability to 1 -	2^(-decim).  To	reduce	memory
	      requirements  when  learning,  some inputs are randomly skipped,
	      and only a few are added to the model.  Exact behaviour  depends
	      on  the  applicable  -T option (default is -T "text").  When the
	      type is not "email" (eg "text"), then individual input  features
	      are added	with probability 2^(-decim). When the type is "email",
	      then full	input messages are added with probability  2^(-decim).
	      Within each such message,	all features are used.
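
       For example, with the "email" type the following learns each input
       message with probability 2^(-1) = 1/2, i.e. roughly half of the
       messages in the mbox file (the file name is illustrative):

       % dbacl -T email -x 1 -l spam big.mbox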

       -z     When learning, only take into account features whose occurrence
              count is strictly greater than fthreshold. By default,
              fthreshold is zero, so all features in the training corpus are
              used. A negative value of fthreshold causes dbacl to subtract
              from the maximum observed feature count, and to use that if it
              is positive.  For example, -z 1 means dbacl only learns
              features which occur at least twice in the corpus, and -z -5
              means dbacl only learns the feature(s) whose occurrence count
              is within 4 of the global maximum.
       -A     Expect  indented	input  and scores. With	this switch, dbacl ex-
	      pects input lines	to be indented by  a  single  space  character
	      (which  is then skipped).	 Lines starting	with any other charac-
	      ter are ignored. This is the counterpart to the -a switch	above.
	      When used	together with the -a switch, dbacl outputs the skipped
	      lines as they are, and reinserts the space at the	front of  each
	      processed	input line.

       -D     Print debug output. Do not use normally, but can be very useful
              for displaying the list of features picked up while learning.

       -F     For each FILE of input, print the	 FILE  name  followed  by  the
	      classification  result  (normally	dbacl only prints a single re-
	      sult even	if multiple files are listed as	input).

       -H     Allow hash table to grow up to a	maximum	 of  2^gsize  elements
	      during learning. Initial size is given by	-h option.

       -L     Select the digramic reference measure for	character transitions.
	      The measure can be one of	"uniform",  "dirichlet"	 or  "maxent".
	      Default is "uniform".

       -M     Force  multinomial calculations. When learning, forces the model
	      features to be treated multinomially. When classifying, corrects
	      entropy scores to	reflect	multinomial probabilities (only	appli-
	      cable to multinomial type	models,	if present).  Scores will  al-
	      ways be lower, because the ordering of features is lost.

       -N     Print  posterior	probabilities for each category.  This assumes
	      the supplied categories form an exhaustive  list	of  possibili-
	      ties.   In  conjunction  with the	-f option, stops filtering but
	      prints each input	line prepended with a summary of the posterior
	      distribution for that line.

       -O     This  switch  causes  the	 online	 data file named ronline to be
	      merged during learning. The ronline file must be	created	 using
	      the  -o (little-oh) switch.  Several -O data files can be	merged
	      simultaneously. This is intended to be a read  only  version  of
	      -o, to allow piecing together of several sets of preparsed data.
	      See the description of the -o switch.

       -R     Include an extra category	for purely random text.	 The  category
	      is called	"random".  Only	makes sense when using the -c option.

       -P     Correct the category scores to include estimated prior probabil-
	      ities. The prior probability estimate for	each category is  pro-
	      portional	 to  the  number of documents or, if that doesn't make
	      sense, the number	of unique features. This can help  with	 "bal-
	      ancing"  when  one  category is learned from much	more data than
	      another. If all categories are learned  from  approximately  the
	      same  amount  of data (or	maybe within a factor of 2), then this
	      option should have little	qualitative effect.

       -S     Enable line straddling. This is useful together with the -w  op-
	      tion to allow n-grams for	n > 1 to ignore	line breaks, so	a com-
	      plex token can continue past the end of the line.	 This  is  not
	      recommended for email.

       -T     Specify  nonstandard text	format.	By default, dbacl assumes that
	      the input	text is	a purely ASCII text file. This corresponds  to
	      the case when type is "text".

	      There  are several types and subtypes which can be used to clean
	      the input	text of	extraneous tokens before  actual  learning  or
	      classifying  takes place.	Each (sub)type you wish	to use must be
	      indicated	with a separate	-T option on the command line, and au-
	      tomatically implies the corresponding type.

	      The  "text"  type	 is for	unstructured plain text. No cleanup is
	      performed. This is the default if	no types are given on the com-
	      mand line.

	      The "email" type is for mbox format input	files or single	RFC822
	      emails.  Headers are recognized and most are skipped. To include
	      extra  RFC822  standard  headers (except for trace headers), use
	      the "email:headers" subtype.  To include trace headers, use  the
	      "email:theaders"	subtype.  To include all headers in the	email,
	      use the "email:xheaders" subtype.	To skip	 all  headers,	except
	      the  subject,  use "email:noheaders". To scan binary attachments
	      for strings, use the "email:atts"	subtype.

	      When the "email" type is in effect, HTML markup is automatically
	      removed  from text attachments except text/plain attachments. To
	      also  remove  HTML  markup  from	plain  text  attachments,  use
	      "email:noplain".	To prevent HTML	markup removal in all text at-
	      tachments, use "email:plain".
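
       For example, to learn an email category while also reading extra
       RFC822 headers and scanning binary attachments for strings (the file
       name is illustrative):

       % dbacl -T email -T email:headers -T email:atts -l spam spam.mbox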

	      The "html" type is for removing HTML markup (between <html>  and
	      </html>  tags)  and  surrounding	text. Note that	if the "email"
	      type is enabled, then "html" is automatically enabled  for  com-
	      patible message attachments only.

	      The  "xml"  type	is  like "html", but doesn't honour <html> and
	      </html>, and doesn't interpret tags  (so	this  should  be  more
	      properly	called	"angle	markup"	removal, and has nothing to do
	      with actual XML semantics).

	      When "html" is enabled, most markup  attributes  are  lost  (for
	      values  of  'most'  close	 to  'all').  The "html:links" subtype
	      forces link urls to be parsed and	learned, which would otherwise
	      be ignored. The "html:alt" subtype forces	parsing	of alternative
	      text  in	ALT   attributes   and	 various   other   tags.   The
	      "html:scripts"  subtype forces parsing of	scripts, "html:styles"
	      forces parsing of	styles,	"html:forms" forces  parsing  of  form
	      values, while "html:comments" forces parsing of HTML comments.

       -U     Print  (U)nambiguity.   When  used  in  conjunction  with	the -v
	      switch, prints scores followed by	their empirical	standard devi-
	      ations.  When  used alone, prints	the best category, followed by
	      an estimated probability that this category choice is  unambigu-
	      ous. More	precisely, the probability measures lack of overlap of
	      CLT confidence intervals for each	category score	(If  there  is
	      overlap, then there is ambiguity).

	      This estimated probability can be	used as	an "unsure" flag, e.g.
	      if the estimated probability is  lower  than  50%.  Formally,  a
	      score of 0% means	another	category is equally likely to apply to
	      the input, and a score of	100% means no other category is	likely
	      to  apply	to the input. Note that	this type of confidence	is un-
	      related to the -X	switch.	Also, the probability estimate is usu-
	      ally  low	 if  the document is short, or if the message contains
	      many tokens that have never been seen before  (only  applies  to
	      uniform digramic measure).

       -V     Print the	program	version	number and exit.

       -W     Like -w, but prevents features from straddling newlines. See the
	      description of -w.

       -X     Print the	confidence in the score	calculated for each  category,
	      when  used together with the -n or -N switch. Prepares the model
	      for confidence scores, when used with the	-l switch.  The	confi-
	      dence  is	 an  estimate of the typicality	of the score, assuming
	      the null hypothesis that the given  category  is	correct.  When
	      used  with  the -v switch	alone, factorizes the score as the em-
	      pirical divergence plus the shannon entropy, multiplied by  com-
	      plexity,	in  that  order. The -X	switch is not supported	in all
	      possible models, and displays a percentage of "0.0" if it	 can't
	      be calculated. Note that for unknown documents, it is quite com-
	      mon to have confidences close to zero.

       -Y     Print the	cumulative media counts.  Some	tokenizers  include  a
	      medium variable with each	token: for example, in email classifi-
	      cation the word "the" can	appear in the subject or the body of a
	      message,	but  the  subject is counted as	a separate medium from
	      the body.	This allows the	token frequencies to be	kept separate,
	      even  though the word is the same. Currently, up to 16 different
	      media are	supported (0-15), with	the  following	interpretation
	      for email:

	       0   unused.
	       1   default medium.
	       2   mail	body or	attachment in HTML format.
	       3   mail	body or	attachment in plain text format.
	       4   mail	header unknown.
	       5   User-Agent, Comments, Keywords, Note
	       6   X-MS*, Categor*, Priority, Importance, Thread-*
	       7   X-*
	       8   List-*
	       9   MIME-Version, Content-*
	       10  Subject
	       11  To
	       12  Sender, Sent, BCC, CC, From
	       13  Resent-*, Original-*
	       14  Message-ID, References, In-Reply-To
	       15  Received, Return-Path, Return-Receipt-To, Reply-To

	      The -Y switch prints the number of tokens	observed in each sepa-
	      rate medium, in order from 0 to 15.

       To create two category files in the current directory  from  two	 ASCII
       text  files  named  Mark_Twain.txt  and William_Shakespeare.txt respec-
       tively, type:

       % dbacl -l twain	Mark_Twain.txt
       % dbacl -l shake	William_Shakespeare.txt

       Now you can classify input text,	for example:

       % echo "howdy" |	dbacl -v -c twain -c shake
       % echo "to be or	not to be" | dbacl -v -c twain -c shake

       Note that the -v	option at least	is necessary, otherwise	dbacl does not
       print anything. The return value	is 1 in	the first case,	2 in the sec-
       ond.

       % echo "to be or	not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or	not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake	6.74 * 6.0

       These invocations are equivalent. The numbers 6.74 and  7.04  represent
       how  close the average token is to each category, and 6.0 is the	number
       of tokens observed. If you want to print	a simple confidence value  to-
       gether with the best category, replace -v with -U.

       % echo "to be or	not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note  that the true probability of category shake versus	category twain
       is 77.37%, but the calculation is somewhat ambiguous, and  34%  is  the
       confidence out of 100% that the calculation is qualitatively correct.

       Suppose	a  file	 document.txt contains English text lines interspersed
       with noise lines. To filter out the noise lines from the	English	lines,
       assuming	you have an existing category, say shake, type:

       % dbacl -c shake	-f shake -R document.txt > document.txt_eng
       % dbacl -c shake	-f random -R document.txt > document.txt_rnd

       Note  that  the	quality	of the results will vary depending on how well
       the categories shake and	random represent each input line.  It is some-
       times useful to see the posterior probabilities for each	line without
       filtering them:

       % dbacl -c shake	-f shake -RN document.txt > document.txt_probs

       You can now postprocess the posterior probabilities for	each  line  of
       text  with  another script, to replicate	an arbitrary Bayesian decision
       rule of your choice.
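
       As a sketch only, here is one possible post-processing step. It assumes
       (check your dbacl version's actual -RN output) that each line of
       document.txt_probs begins with a numeric score followed by the original
       text; the threshold of 50 is arbitrary:

       ```shell
       # Hypothetical post-processing: keep lines whose leading score
       # exceeds 50, stripping the score column itself. The "score first,
       # then original text" layout is an assumption; inspect
       # document.txt_probs to confirm the real column order.
       awk '$1 + 0 > 50 { sub(/^[^ ]+ +/, ""); print }' document.txt_probs
       ```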

       In the special case of exactly two categories, the optimal Bayesian de-
       cision procedure	can be implemented for documents as follows: let p1 be
       the prior probability that the input text is classified	as  category1.
       Consequently,  the prior	probability of classifying as category2	is 1 -
       p1.  Let	u12 be the cost	of misclassifying a category1  input  text  as
       belonging  to  category2	and vice versa for u21.	 We assume there is no
       cost for	classifying correctly.	Then the following command  implements
       the optimal Bayesian decision:

       % dbacl -n -c category1 -c category2 | awk '{ if($2 * p1	* u12 >	$4 *
	      (1 - p1) * u21) {	print $1; } else { print $3; } }'
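
       For instance, with p1 = 0.5, u12	= 1 and	u21 = 2	(illustrative values
       only; the dbacl scores below are	simulated with echo, since real	scores
       depend entirely on your categories), the	rule reads:

       ```shell
       # Simulated "dbacl -n" output: name1 score1 name2 score2.
       echo "category1 7.04 category2 6.74" | \
           awk '{ if ($2 * 0.5 * 1 > $4 * 0.5 * 2) { print $1; } else { print $3; } }'
       ```

       Here 7.04 * 0.5 * 1 = 3.52 is not greater than 6.74 * 0.5 * 2 = 6.74,
       so the rule prints category2.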

       dbacl can also be used in conjunction with procmail(1) to  implement  a
       simple  Bayesian	email classification system. Assume that incoming mail
       should be automatically delivered to one	of three mail folders  located
       in  $MAILDIR and	named work, personal, and spam.	 Initially, these must
       be created and filled with appropriate  sample  emails.	 A  crontab(1)
       file can	be used	to learn the three categories once a day, e.g.

       5  0 * *	* dbacl	-T email -l $CATS/work $MAILDIR/work
       10 0 * *	* dbacl	-T email -l $CATS/personal $MAILDIR/personal
       15 0 * *	* dbacl	-T email -l $CATS/spam $MAILDIR/spam

       To  automatically  deliver  each	 incoming  email  into the appropriate
       folder, the following procmailrc(5) recipe fragment could be used:


       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c	$CATS/work -c $CATS/personal -c	$CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test	-n "$YAY"
       $MAILDIR/$YAY


       Sometimes, dbacl	will send the email to	the  wrong  mailbox.  In  that
       case, the misclassified message should be removed from its wrong	desti-
       nation and placed in the	correct	mailbox.  The error will be  corrected
       the  next  time	your messages are learned.  If it is left in the wrong
       category, dbacl will learn the wrong corpus statistics.

       The default text	features (tokens) read by dbacl	are purely  alphabetic
       strings,	 which minimizes memory	requirements but can be	unrealistic in
       some cases. To construct	models based on	alphanumeric tokens,  use  the
       -e  switch.  The	 example below also uses the optional -D switch, which
       prints a	list of	actual tokens found in the document:

       % dbacl -e alnum	-D -l twain Mark_Twain.txt | less

       It is also possible to override the default  feature  selection	method
       used  to	 learn the category model by means of regular expressions. For
       example,	the following duplicates the default feature selection	method
       in the C	locale,	while being much slower:

       % dbacl -l twain	-g '^([[:alpha:]]+)' \
	      -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt

       The category twain which	is obtained depends only on single  alphabetic
       words  in  the text file	Mark_Twain.txt (and computed digram statistics
       for prediction).	 For a second example, the following command builds  a
       smoothed	Markovian (word	bigram)	model which depends on pairs of	con-
       secutive	words within each line (but pairs cannot straddle a line
       break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2'	\
	      -g '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' \
	      Mark_Twain.txt

       More  general, line based, n-gram models	of all orders (up to 7)	can be
       built in	a similar way.	 To  construct	paragraph  based  models,  you
       should  reformat	 the input corpora with	awk(1) or sed(1) to obtain one
       paragraph per line. Line	size is	limited	by available memory, but  note
       that regex performance will degrade quickly for long lines.
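
       One way to do the reformatting, assuming	paragraphs are separated by
       blank lines (corpus.txt and corpus_lines.txt are	hypothetical file
       names):

       ```shell
       # awk paragraph mode: RS = "" makes each blank-line-separated
       # paragraph one record; internal newlines are then replaced by
       # spaces, yielding one paragraph per output line.
       awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }' corpus.txt > corpus_lines.txt
       ```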

       The  underlying assumption of statistical learning is that a relatively
       small number of training	documents can represent	a much larger  set  of
       input  documents.  Thus	in  the	long run, learning can grind to	a halt
       without serious impact on classification	accuracy. While	never exactly
       true in reality,	this assumption	is surprisingly	accurate for problems such as
       email filtering.	 In practice, this means that a	well chosen corpus  on
       the  order  of ten thousand documents is	sufficient for highly accurate
       results for years.  Continual learning after such a critical  mass  re-
       sults  in  diminishing returns.	Of course, when	real world input docu-
       ment patterns change dramatically, the predictive power of  the	models
       can be lost. At the other end, a	few hundred documents already give ac-
       ceptable	results	in most	cases.

       dbacl is	heavily	optimized for the case of frequent classifications but
       infrequent  batch  learning.  This  is  the  long run optimum described
       above. Under ideal conditions, dbacl can	classify a hundred emails  per
       second on low end hardware (500 MHz Pentium III). Learning speed	is not
       very much slower, but takes effectively much longer for large  document
       collections for various reasons.	 When using the	-m switch, data	struc-
       tures are aggressively mapped into memory if possible,  reducing	 over-
       heads for both I/O and memory allocations.

       dbacl  throws  away its input as	soon as	possible, and has no limits on
       the input document size.	Both classification and	learning speed are di-
       rectly  proportional to the number of tokens in the input, but learning
       also needs a nonlinear optimization step	which takes time  proportional
       to  the	number of unique tokens	discovered.  At	time of	writing, dbacl
       is one of the fastest open source mail filters given its	optimal	 usage
       scenario, but uses more memory for learning than	other filters.

       When  saving category files, dbacl first	writes out a temporary file in
       the same	location, and renames it afterwards. If	a problem or crash oc-
       curs  during  learning,	the  old  category  file is therefore left un-
       touched.	This ensures that categories can never be corrupted, no	matter
       how  many  processes try	to simultaneously learn	or classify, and means
       that valid categories are available for classification at any time.

       When using the -m switch, file contents are memory  mapped  for	speedy
       reading	and  writing.  This,  together with the	-o switch, is intended
       mainly for testing purposes, when tens of thousands of messages must be
       learned and scored in a laboratory to measure dbacl's accuracy. Because
       no file locking is attempted for	performance reasons,  corruptions  are
       possible,  unless  you  make  sure that only one	dbacl process reads or
       writes any file at any given time. This is the only case	(-m and	-o to-
       gether) when corruption is possible.

       When  classifying a document, dbacl loads all indicated categories into
       RAM, so the total memory	needed is approximately	the sum	of  the	 cate-
       gory  file  sizes  plus	a fixed	small overhead.	 The input document is
       consumed	while being read, so its size doesn't matter,  but  very  long
       lines  can take up space.  When using the -m switch, the	categories are
       read using mmap(2) as available.

       When learning, dbacl keeps a large structure in memory  which  contains
       many objects which won't	be saved into the output category. The size of
       this structure is proportional to the number of unique tokens read, but
       not the size of the input documents, since they are discarded while be-
       ing read. As a rough guide, this	structure is 4x-5x the size of the fi-
       nal category file that is produced.

       To  prevent unchecked memory growth, dbacl allocates by default a fixed
       smallish	amount of memory for tokens. When this space is	used up,  fur-
       ther  tokens  are discarded which has the effect	of skewing the learned
       category	making it less usable as more tokens are dropped. A warning is
       printed on STDERR in such a case.

       The  -h switch lets you fix the initial size of the token space in pow-
       ers of 2, ie "-h	17" means 2^17 = 131072	possible tokens. If  you  type
       "dbacl  -V", you	can see	the number of bytes needed for each token when
       either learning or classifying. Multiply	this  number  by  the  maximum
       number  of  possible tokens to estimate the memory needed for learning.
       The -H switch lets dbacl	grow its  tables  automatically	 if  and  when
       needed,	up  to	a  maximum specified. So if you	type "-H 21", then the
       initial size will be doubled repeatedly if necessary,  up  to  approxi-
       mately two million unique tokens.
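
       As a back-of-envelope calculation (the 42 bytes per token figure	below
       is made up; substitute the value	your own "dbacl	-V" reports):

       ```shell
       # "-h 17" reserves 2^17 = 131072 token slots. At a hypothetical
       # 42 bytes per token, the token table alone needs roughly:
       echo "$((42 * 131072)) bytes"
       ```

       which prints 5505024 bytes, i.e.	roughly	5 MB.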

       When learning with the -X switch, a handful of input documents are also
       kept in RAM throughout.

       DBACL_PATH
	      When this	variable is set, its value is prepended	to every cate-
	      gory filename which doesn't start	with a '/' or a	'.'.

       INT    If  this	signal is caught, dbacl	simply exits without doing any
	      cleanup or other operations. This	signal can often  be  sent  by
	      pressing Ctrl-C on the keyboard. See stty(1).

       HUP, QUIT, TERM
	      If one of	these signals is caught, dbacl stops reading input and
	      continues	its operation as if no more input was available.  This
	      is a way of quitting gracefully, but note	that in	learning mode,
	      a	category file will be written based on the  incomplete	input.
	      The  QUIT	signal can often be sent by pressing Ctrl-\ on the key-
	      board. See stty(1).

       USR1   If this signal is	caught,	dbacl reloads the  current  categories
	      at  the earliest feasible	opportunity. This is not normally use-
	      ful at all, but might be in special cases, such  as  if  the  -f
	      switch is	invoked	together with input from a long	running	pipe.

       dbacl generated category	files are in binary format, and	may or may not
       be portable to systems using a different	byte order architecture	 (this
       depends	on  how	 dbacl was compiled). The -V switch prints out whether
       categories are portable,	or else	you can	just experiment.

       dbacl does not recognize	functionally equivalent	 regular  expressions,
       and in this case	duplicate features will	be counted several times.

       With  every  learned  category, the command line	options	that were used
       are saved.  When	classifying, make sure that  every  relevant  category
       was  learned  with the same set of options (regexes are allowed to dif-
       fer), otherwise behaviour is undefined. There is	no need	to repeat  all
       the switches when classifying.

       If you get many digitization warnings, then you are trying to learn too
       much data at once, or your model	is too complex.	 dbacl is compiled  to
       save  memory by digitizing final	weights, but you can disable digitiza-
       tion by editing dbacl.h and recompiling.

       dbacl offers several built-in tokenizers	(see -e	switch)	with  more  to
       come in future versions,	as the author invents them.  While the default
       tokenizer may evolve, no	tokenizer should ever be removed, so that  you
       can  always  simulate previous dbacl behaviour subject to bug fixes and
       architectural changes.

       The confidence estimates	obtained through the -X	switch are  underesti-
       mates, ie are more conservative than they should	be.

       "Ya know, some day scientists are gonna invent something	that will out-
       smart a rabbit."	(Robot Rabbit, 1953)

       The source code for the latest version of this program is available  at
       the following locations:

       Laird A.	Breyer <>

       awk(1),	bayesol(1),  crontab(1),  hmine(1),  hypex(1),	less(1), mail-
       cross(1),  mailfoot(1),	mailinspect(1),	  mailtoe(1),	procmailex(5),
       regex(7), stty(1), sed(1)

Version	1.14.1	      Bayesian Text Classification Tools	      DBACL(1)

