DBACL(1)							      DBACL(1)

NAME
       dbacl - a digramic Bayesian classifier for text recognition.

SYNOPSIS

       dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size] [-H gsize]
	      [-x decim] [-q quality] [-w max_order] [-e deftok] [-o online]
	      [-L measure] [-z fthresh] [-O ronline]...  [-g regex]...
	      [FILE]...

       dbacl [-vnimNRXYP] [-h size] [-T	type]  -c  category  [-c  category]...
	      [-f keep]...  [FILE]...

       dbacl -V

OVERVIEW
       dbacl is a Bayesian text and email classifier. When using the -l
       switch, it learns a body of text and produces a file named category
       which summarizes the text. When using the -c switch, it compares an in-
       put text stream with any number of category files, and outputs the name
       of the closest match, or optionally various numerical scores explained
       below.

       Whereas this manual page	is intended as a reference, there are  several
       tutorials  and  documents  you can read to get specialized information.
       Specific	documentation about the	design of dbacl	 and  the  statistical
       models  that it uses can	be found in dbacl.ps.  For a basic overview of
       text classification using dbacl,	see tutorial.html. A companion	tutor-
       ial  geared  towards email filtering is email.html. If you have trouble
       getting dbacl to	classify reliably, read	is_it_working.html.  The USAGE
       section of this manual page also	has some examples.

       /usr/local/share/dbacl/doc/dbacl.ps

       /usr/local/share/dbacl/doc/tutorial.html

       /usr/local/share/dbacl/doc/email.html

       /usr/local/share/dbacl/doc/is_it_working.html

       dbacl uses a maximum entropy (minimum divergence) language  model  con-
       structed	 with  respect to a digramic reference measure (unknown	tokens
       are predicted from digrams, i.e.	pairs of letters).  Practically,  this
       means  that  a category is constructed from tokens in the training set,
       while previously	unseen tokens  can  be	predicted  automatically  from
       their  letters.	A token	here is	either a word (fragment) or a combina-
       tion of words (fragments),  selected  according	to  various  switches.
       Learning	roughly	works by tweaking token	probabilities until the	train-
       ing data	is least surprising.

DESCRIPTION
       When using the -l command form, dbacl learns a category when given  one
       or  more	 FILE  names,  which should contain readable ASCII text. If no
       FILE is given, dbacl learns from	STDIN. If FILE is a directory,	it  is
       opened  and all its files are read, but not its subdirectories. The re-
       sult is saved in	the binary file	named  category,  and  completely  re-
       places  any  previous  contents.	 As  a convenience, if the environment
       variable	DBACL_PATH contains a directory, then that is prepended	to the
       file path, unless category starts with a	'/' or a '.'.
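
       For example, the following sketch (directory and file names are illus-
       trative) stores the category under $HOME/.dbacl instead of the current
       directory:

       % env DBACL_PATH=$HOME/.dbacl dbacl -l spam spam_corpus.txt

       This writes the category file $HOME/.dbacl/spam, assuming that direc-
       tory already exists.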

       The input text for learning is assumed to be unstructured plain text by
       default.	This is	not suitable for learning email,  because  email  con-
       tains various transport encodings and formatting	instructions which can
       reduce classification effectiveness. You	must use the -T	switch in that
       case  so	 that  dbacl knows it should perform decoding and filtering of
       MIME and HTML as appropriate.  Appropriate switch values are "-T email"
       for RFC2822 email input, "-T html" for HTML input, "-T xml" for generic
       XML style input, and "-T text" for plain text (the default). There are
       other values of the -T switch that also allow fine tuning of the de-
       coding capabilities.

       When  using  the	 -c  command form, dbacl attempts to classify the text
       found in	FILE, or STDIN if no FILE is  given.  Each  possible  category
       must  be	 given separately, and should be the file name of a previously
       learned text corpus. As a convenience, if the variable DBACL_PATH  con-
       tains  a	 directory,  it	 is  prepended to each file path which doesn't
       start with a '/'	or a '.'. The visible output of	the classification de-
       pends on	the combination	of extra switches used.	If no switch is	 used,
       then  no	 output	 is shown on STDOUT. However, dbacl always produces an
       exit code which can be tested.
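
       For example, the exit code can be tested directly in Bourne shell syn-
       tax (category and file names are illustrative):

       % dbacl -c good -c spam message.txt; ret=$?
       % case $ret in 1) echo good;; 2) echo spam;; *) echo error;; esac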

       To see an output for a classification, you must use at least one of the
       -v, -U, -n, -N, -D, -d switches. They can sometimes be used in combina-
       tion to produce a natural variation of their individual outputs. Where
       applicable, dbacl also produces warnings on STDERR.

       The -v switch outputs the name of  the  best  category  among  all  the
       choices given.

       The  -U switch outputs the name of the best category followed by	a con-
       fidence percentage. Normally, this is the switch	that you want to  use.
       A  percentage  of  100% means that dbacl	is sure	of its choice, while a
       percentage of 0%	means that some	other category is equally likely. This
       is not the model	probability, but measures how unambiguous the  classi-
       fication	is, and	can be used to tag unsure classifications (e.g.	if the
       confidence is 25% or less).

       The  -N	switch	prints	each category name followed by its (posterior)
       probability, expressed as a percentage. The percentages always  sum  to
       100%.  This is intuitive, but only valuable if the document being clas-
       sified contains a handful of tokens (ten	or less). In the  common  case
       with  many more tokens, the probabilities are always extremely close to
       100% and	0%.

       The -n switch prints each category name followed	by the negative	 loga-
       rithm  of  its  probability. This is equivalent to using	the -N switch,
       but much	more useful. The smallest number gives the  best  category.  A
       more  convenient	 form is to use	both -n	and -v which prints each cate-
       gory name followed by the cross entropy and the number of  tokens  ana-
       lyzed.  The  cross  entropy  measures (in bits) the average compression
       rate which is achievable, under the given category model, per token  of
       input text.  If you use all three of -n, -v, -X then an extra value is
       output for each category, representing a	kind of	p-value	for each cate-
       gory score. This	indicates how typical the score	 is  compared  to  the
       training	 documents,  but  only	works if the -X	switch was used	during
       learning, and only for some types of models (e.g. email).  These	p-val-
       ues are uniformly distributed and independent (if  the  categories  are
       independent), so	can be combined	using Fisher's chi squared test	to ob-
       tain composite p-values for groupings of	categories.
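
       For example, a hypothetical invocation requesting these p-values, as-
       suming both categories were learned with the -X switch:

       % dbacl -n -v -X -c good -c spam message.txt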

       The -v and -X switches together print each category name followed by a
       detailed decomposition of the category score, factored as ( divergence
       rate + Shannon entropy rate ) * token count @ p-value. Again, this
       only works in some types of models.

       The -v and -U switches print each category name followed by a decompo-
       sition of the category score into ( divergence rate + Shannon entropy
       rate # score variance ) * token count.

       The -D switch prints out	the input text as modified internally by dbacl
       prior to	tokenization. For example, if a	MIME encoded email document is
       classified, then this prints the decoded text that will actually be to-
       kenized and classified. This switch is mainly useful for debugging.

       The  -d switch dumps tokens and scores while they are being read. It is
       useful for debugging, or	if you want to	create	graphical  representa-
       tions  of  the  classification. A detailed explanation of the output is
       beyond the scope	of this	manual page, but is straightforward if	you've
       read dbacl.ps.  Possible	variations include -d together with -n or -N.

       Classification can be done with one or several categories in principle.
       When  two or more categories are	used, the Bayesian posterior probabil-
       ity is used, given the input text, with a uniform prior distribution on
       categories. For other choices  of  prior,  see  the  companion  utility
       bayesol(1).  When a single category is used, classification can be done
       by comparing the score with a threshold. In practice however, much bet-
       ter results are obtained with several categories.

       Learning and classifying cannot be mixed on the same command invoca-
       tion; however, there are no locking issues, and separate dbacl pro-
       cesses can operate simultaneously without interfering, because file
       operations are designed to be atomic.

       Finally,	 note that dbacl does not manage your document corpora or your
       computed	categories.  In	particular, dbacl cannot  add  or  subtract  a
       document	 from  a category file directly.  If you want to learn a cate-
       gory incrementally, the standard	way is to keep adding to your document
       corpus, and learn the whole corpus each time.  By  keeping  control  of
       your  archives,	you can	never lose the information in your categories,
       and you can easily experiment with different switches or	 tokenizations
       or sets of training documents if	you like.

       If  the standard	incremental learning method is too slow, the -o	switch
       can help. This creates a	data file named	online which contains all  the
       document	 statistics that have been learned. When you use the -l	and -o
       switches	together, dbacl	merges the online data	file  (if  it  exists)
       with  the  new document(s) to be	learned, and recreates an updated ver-
       sion of online.	This is	equivalent to adding the new documents to  the
       corpus  and relearning the whole	corpus,	but faster. However, documents
       cannot be removed if you	change your mind.  This	 is  a	limitation  of
       dbacl  which  cannot  be	changed	for mathematical reasons. You can work
       around this by making backups of	the online data	file. It is also  pos-
       sible  to  merge	 one or	more extra online data files simultaneously by
       using the -O switch one or more times.
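
       As an illustrative sketch of this workflow (file names are hypotheti-
       cal), learn an initial mailbox and later add a second one:

       % dbacl -T email -l spam -o spam.onl week1.mbox
       % dbacl -T email -l spam -o spam.onl week2.mbox

       The second command reads spam.onl, merges in the statistics from
       week2.mbox, and rewrites both spam and spam.onl.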

SECONDARY SWITCHES
       By default, dbacl classifies the input text as a whole, i.e. it only
       outputs a single result even if you specify several input files. If you
       want to classify	multiple input files you can either call dbacl repeat-
       edly  (which is fast when you use the -m	switch), or use	the -F switch,
       which prints each input FILE followed by	the result for that FILE.  Al-
       ternatively, you	can classify each line of the input  individually,  by
       using  the  -f option, which prints only	those lines which match	one or
       more models identified by keep (use the category	name or	number to  re-
       fer  to	a  category). This last	switch is useful if you	want to	filter
       out some	lines, but note	that if	the lines are short,  then  the	 error
       rate can	be high.
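
       For example, to print one result per input file (category and file
       names are illustrative):

       % dbacl -v -F -c good -c spam msg1.txt msg2.txt msg3.txt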

       The -e, -w, -g, -j switches are used for selecting an appropriate tok-
       enization scheme. A token is a word or word fragment or combination  of
       words  or  fragments. The shape of tokens is important because it forms
       the basis of the	language models	used by	dbacl.	The -e switch  selects
       a  predefined tokenization scheme, which	is speedy but limited.	The -w
       switch specifies	composite tokens derived from the -e switch. For exam-
       ple, "-e	alnum -w 2" means that	tokens	should	be  alphanumeric  word
       fragments combined into overlapping pairs (bigrams). By default, all
       tokens are converted to lowercase, which reduces the number of possible
       tokens and therefore memory consumption; the -j switch preserves the
       original capitalization instead (see its description below).
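
       For instance, the following sketch learns overlapping alphanumeric bi-
       grams from the Mark_Twain.txt corpus used in the USAGE section below
       (the category name is illustrative):

       % dbacl -e alnum -w 2 -l twainbi Mark_Twain.txt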

       If the -g switch	is used, you can completely specify  what  the	tokens
       should look like	using a	regular	expression. Several -g switches	can be
       used  to	construct complex tokenization schemes,	and parentheses	within
       each expression can be used to select fragments and combine  them  into
       n-grams.	 The  cost  of	such flexibility is reduced classification and
       learning	speed. When experimenting with tokenization schemes, try using
       the -d or -D switches while learning or classifying, as they will print
       the tokens explicitly so	you can	see what text fragments	are picked  up
       or missed out. For regular expression syntax, see regex(7).

       The  -h	and  -H	 switches  regulate  how much memory dbacl may use for
       learning. Text classification can use a lot of memory, and  by  default
       dbacl  limits  itself even at the expense of learning accuracy. In many
       cases if	a limit	is reached, a  warning	message	 will  be  printed  on
       STDERR with some	advice.
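
       For example, to start learning with 2^17 token slots and allow growth
       up to 2^21 if needed (category and corpus names are illustrative):

       % dbacl -h 17 -H 21 -l big large_corpus.txt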

       When  relearning	the same category several times, a significant speedup
       can be obtained by using	the -1 switch, as this allows  the  previously
       learned probabilities to	be read	from the category and reused.

       Note that classification accuracy depends foremost on the amount and
       quality of the training samples, and only secondarily on the amount of
       tweaking.

EXIT STATUS
       The normal shell exit conventions aren't followed (sorry!). When using
       the -l command form, dbacl returns zero on success, nonzero if an er-
       ror occurs. When using the -c form, dbacl returns a positive integer
       (1,2,3...) corresponding to the category with the highest posterior
       probability. In case of a tie, the first most probable category is
       chosen. If an error occurs, dbacl returns zero.

OPTIONS
       -0     When  learning,  prevents	 weight	 preloading.  Normally,	 dbacl
	      checks if	the category file already exists, and if so, tries  to
	      use  the existing	weights	as a starting point. This can dramati-
	      cally speed up learning.	If the -0 (zero) switch	is  set,  then
	      dbacl  behaves  as  if  no category file already exists. This is
	      mainly useful for	testing.  This switch is now  enabled  by  de-
	      fault, to	protect	against	weight drift which can reduce accuracy
	      over many	learning iterations. Use -1 to force preloading.

       -1     Force weight preloading if the category file already exists. See
	      discussion of the	-0 switch.

       -a     Append  scores.  Every  input  line is written to	STDOUT and the
	      dbacl scores are appended. This  is  useful  for	postprocessing
	      with  bayesol(1).	  For ease of processing, every	original input
	      line is indented by a single space (to distinguish them from the
	      appended scores),	and the	line with the scores (if -n  is	 used)
	      is prefixed with the string "scores ". If	a second copy of dbacl
	      needs  to	 read this output later, it should be invoked with the
	      -A switch.
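
	      As an illustrative sketch (category and file names are hypo-
	      thetical), the annotated stream can be saved for later post-
	      processing:

	      % dbacl -n -a -c twain -c shake doc.txt > doc.scored

	      The file doc.scored can then be postprocessed with bayesol(1),
	      or reread by a second dbacl invoked with the -A switch.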

       -d     Dump the model parameters	to STDOUT. In conjunction with the  -l
	      option,  this  produces  a human-readable	summary	of the maximum
	      entropy model. In	conjunction with the -c	option,	 displays  the
	      contribution  of	each  token to the final score.	Suppresses all
	      other normal output.

       -e     Select character class for default (not  regex-based)  tokeniza-
	      tion.  By	default, tokens	are alphabetic strings only. This cor-
	      responds to the case when	deftok is "alpha". Possible values for
	      deftok are "alpha", "alnum", "graph", "char", "cef"  and	"adp".
	      The  last	two are	custom tokenizers intended for email messages.
	      See also isalpha(3).   The  "char"  tokenizer  picks  up	single
	      printable	 characters rather than	bigger tokens, and is intended
	      for testing only.

       -f     Filter each line of input	separately,  passing  to  STDOUT  only
	      lines  which match the category identified as keep.  This	option
	      should be	used repeatedly	for each category which	must be	 kept.
	      keep can be either the category file name, or a positive integer
	      representing  the	required category in the same order it appears
	      on the command line.

	      Output lines are flushed as soon as they are written. If the in-
	      put file is a pipe or character device, then an attempt is  made
	      to  use  line buffering mode, otherwise the more efficient block
	      buffering	is used.

       -g     Learn only features described by the extended regular expression
	      regex.  This overrides the default feature selection method (see
	      -w option) and learns, for each line of input, only tokens  con-
	      structed	from  the  concatenation  of  strings  which match the
	      tagged subexpressions within the supplied	regex.	All substrings
	      which match regex	within a suffix	of each	input line are treated
	      as features, even	if they	overlap	on the input line.

	      As an optional convenience, regex	can include the	 suffix	 ||xyz
	      which  indicates	which  parenthesized  subexpressions should be
	      tagged. In this case, xyz	should consist exclusively of digits 1
	      to 9, numbering exactly those  subexpressions  which  should  be
	      tagged.  Alternatively,  if  no  parentheses exist within	regex,
	      then it is assumed that the whole	expression must	be captured.

       -h     Set the size of the hash table to	2^size	elements.  When	 using
	      the  -l  option, this refers to the total	number of features al-
	      lowed in the maximum entropy model being learned.	When using the
	      -c option together with the -M switch and multinomial type cat-
	      egories, this refers to the maximum  number  of  features	 taken
	      into account during classification.  Without the -M switch, this
	      option has no effect.

       -i     Fully  internationalized mode. Forces the	use of wide characters
	      internally, which	is necessary in	some locales.  This  incurs  a
	      noticeable performance penalty.

       -j     Make  features  case  sensitive. Normally, all features are con-
	      verted to	lower case during processing,  which  reduces  storage
	      requirements   and  improves  statistical	 estimates  for	 small
	      datasets.	With this option, the original capitalization is  used
	      for each feature.	This can improve classification	accuracy.

       -m     Aggressively maps	categories into	memory and locks them into RAM
	      to  prevent  swapping, if	possible. This is useful when speed is
	      paramount	and memory is plentiful, for example when testing  the
	      classifier on large datasets.

	      Locking  may  require  relaxing user limits with ulimit(1).  Ask
	      your system administrator. Beware	when using the -m  switch  to-
	      gether  with the -o switch, as only one dbacl process must learn
	      or classify at a time to prevent file corruption.	If no learning
	      takes place, then	the -m switch for classifying is  always  safe
	      to use. See also the discussion for the -o switch.

       -n     Print  scores  for  each category.  Each score is	the product of
	      two numbers, the cross entropy and the complexity	of  the	 input
	      text  under  each	model. Multiplied together, they represent the
	      log probability that the input resembles the model. To see these
	      numbers separately, use also the -v option. In conjunction  with
	      the  -f  option,	stops  filtering  but  prints  each input line
	      prepended	with a list of scores for that line.

       -q     Select quality of	learning, where	quality	can be 1,2,3,4.	Higher
	      values take longer to learn, and should be slightly  more	 accu-
	      rate.  The default quality is 1 if the category file doesn't ex-
	      ist or weights cannot be preloaded, and 2	otherwise.

       -o     When learning, reads/writes partial token	counts so they can  be
	      reused.  Normally,  category  files are learned from exactly the
	      input data given,	and don't contain extraneous information. When
	      this option is in	effect,	some extra information is saved	in the
	      file online, after all input was read. This information  can  be
	      reread the next time that	learning occurs, to continue where the
	      previous	dataset	 left off. If online doesn't exist, it is cre-
	      ated. If online exists, it is read before	learning, and  updated
	      afterwards.  The file is approximately 3 times bigger (at	least)
	      than the learned category.

	      In dbacl,	file updates are atomic, but if	using the  -o  switch,
	      two  or  more processes should not learn simultaneously, as only
	      one process will write a lasting category	and memory  dump.  The
	      -m  switch can also speed	up online learning, but	beware of pos-
	      sible corruption.	 Only one process should read or write a file.
	      This option is intended primarily	for controlled test runs.  See
	      also the -O (big-oh) switch.

       -r     Learn  the  digramic reference model only. Skips the learning of
	      extra features in	the text corpus.

       -v     Verbose mode. When learning, print out details of	 the  computa-
	      tion,  when classifying, print out the name of the most probable
	      category.	 In conjunction	with the -n option, prints the	scores
	      as an explicit product of	the cross entropy and the complexity.

       -w     Select  default features to be n-grams up	to max_order.  This is
	      incompatible with	the -g option, which always takes  precedence.
	      If  no -w	or -g options are given, dbacl assumes -w 1. Note that
	      n-grams for n greater than 1 do not straddle line	breaks by  de-
	      fault.  The -S switch enables line straddling.

       -x     Set  decimation probability to 1 - 2^(-decim).  To reduce	memory
	      requirements when	learning, some inputs  are  randomly  skipped,
	      and  only	a few are added	to the model.  Exact behaviour depends
	      on the applicable	-T option (default is -T  "text").   When  the
	      type is not "email" (e.g. "text"), then individual input features
	      are added	with probability 2^(-decim). When the type is "email",
	      then full	input messages are added with probability  2^(-decim).
	      Within each such message,	all features are used.
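
	      For example, with "-x 2" each message of an illustrative mail-
	      box is added with probability 2^(-2) = 1/4, so roughly a quar-
	      ter of the messages are learned:

	      % dbacl -T email -x 2 -l spam huge.mbox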

       -z     When learning, only take into account features whose occurrence
	      count is strictly greater than fthresh. By default, fthresh is
	      zero, so all features in the training corpus are used. A nega-
	      tive value of fthresh causes dbacl to subtract from the maxi-
	      mum observed feature count, and to use that if it is positive.
	      For example, -z 1	means dbacl only learns	features  which	 occur
	      at  least	twice in the corpus, and -z -5 means dbacl only	learns
	      the feature(s) whose occurrence count is within 4	of the	global
	      maximum.

       -A     Expect  indented	input  and scores. With	this switch, dbacl ex-
	      pects input lines	to be indented by  a  single  space  character
	      (which  is then skipped).	 Lines starting	with any other charac-
	      ter are ignored. This is the counterpart to the -a switch	above.
	      When used	together with the -a switch, dbacl outputs the skipped
	      lines as they are, and reinserts the space at the	front of  each
	      processed	input line.

       -D     Print debug output. Do not use normally, but can be very useful
	      for displaying the list of features picked up while learning.

       -F     For each FILE of input, print the	 FILE  name  followed  by  the
	      classification  result  (normally	dbacl only prints a single re-
	      sult even	if multiple files are listed as	input).

       -H     Allow hash table to grow up to a	maximum	 of  2^gsize  elements
	      during learning. Initial size is given by	-h option.

       -L     Select the digramic reference measure for	character transitions.
	      The  measure  can	 be one	of "uniform", "dirichlet" or "maxent".
	      Default is "uniform".

       -M     Force multinomial	calculations. When learning, forces the	 model
	      features to be treated multinomially. When classifying, corrects
	      entropy  scores  to  reflect multinomial probabilities (only ap-
	      plicable to multinomial type models, if present).	  Scores  will
	      always be	lower, because the ordering of features	is lost.

       -N     Print  posterior	probabilities for each category.  This assumes
	      the supplied categories form an exhaustive  list	of  possibili-
	      ties.   In  conjunction  with the	-f option, stops filtering but
	      prints each input	line prepended with a summary of the posterior
	      distribution for that line.

       -O     This switch causes the online data  file	named  ronline	to  be
	      merged  during  learning.	The ronline file must be created using
	      the -o (little-oh) switch.  Several -O data files	can be	merged
	      simultaneously.  This  is	 intended to be	a read only version of
	      -o, to allow piecing together of several sets of preparsed data.
	      See the description of the -o switch.
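
	      A hypothetical sketch, merging two previously saved online data
	      files while learning a new mailbox:

	      % dbacl -T email -l spam -o all.onl -O week1.onl -O week2.onl
		     new.mbox

	      Here week1.onl and week2.onl must have been created earlier
	      with the -o (little-oh) switch.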

       -R     Include an extra category	for purely random text.	 The  category
	      is called	"random".  Only	makes sense when using the -c option.

       -P     Correct the category scores to include estimated prior probabil-
	      ities.  The prior	probability estimate for each category is pro-
	      portional	to the number of documents or, if  that	 doesn't  make
	      sense,  the  number of unique features. This can help with "bal-
	      ancing" when one category	is learned from	much  more  data  than
	      another.	If  all	 categories are	learned	from approximately the
	      same amount of data (or maybe within a factor of 2),  then  this
	      option should have little	qualitative effect.

       -S     Enable  line straddling. This is useful together with the	-w op-
	      tion to allow n-grams for	n > 1 to ignore	line breaks, so	a com-
	      plex token can continue past the end of the line.	 This  is  not
	      recommended for email.

       -T     Specify  nonstandard text	format.	By default, dbacl assumes that
	      the input	text is	a purely ASCII text file. This corresponds  to
	      the case when type is "text".

	      There  are several types and subtypes which can be used to clean
	      the input	text of	extraneous tokens before  actual  learning  or
	      classifying  takes place.	Each (sub)type you wish	to use must be
	      indicated	with a separate	-T option on the command line, and au-
	      tomatically implies the corresponding type.

	      The "text" type is for unstructured plain	text.  No  cleanup  is
	      performed. This is the default if	no types are given on the com-
	      mand line.

	      The "email" type is for mbox format input	files or single	RFC822
	      emails.  Headers are recognized and most are skipped. To include
	      extra  RFC822  standard  headers (except for trace headers), use
	      the "email:headers" subtype.  To include trace headers, use  the
	      "email:theaders"	subtype.  To include all headers in the	email,
	      use the "email:xheaders" subtype.	To skip	 all  headers,	except
	      the  subject,  use "email:noheaders". To scan binary attachments
	      for strings, use the "email:atts"	subtype.

	      When the "email" type is in effect, HTML markup is automatically
	      removed from text	attachments except text/plain attachments.  To
	      also  remove  HTML  markup  from	plain  text  attachments,  use
	      "email:noplain". To prevent HTML markup removal in all text  at-
	      tachments, use "email:plain".

	      The  "html" type is for removing HTML markup (between <html> and
	      </html> tags) and	surrounding text. Note	that  if  the  "email"
	      type  is	enabled, then "html" is	automatically enabled for com-
	      patible message attachments only.

	      The "xml"	type is	like "html", but  doesn't  honour  <html>  and
	      </html>,	and  doesn't  interpret	 tags  (so this	should be more
	      properly called "angle markup" removal, and has  nothing	to  do
	      with actual XML semantics).

	      When  "html"  is	enabled,  most markup attributes are lost (for
	      values of	'most' close  to  'all').   The	 "html:links"  subtype
	      forces link urls to be parsed and	learned, which would otherwise
	      be ignored. The "html:alt" subtype forces	parsing	of alternative
	      text   in	  ALT	attributes   and   various   other  tags.  The
	      "html:scripts" subtype forces parsing of scripts,	 "html:styles"
	      forces  parsing  of  styles, "html:forms"	forces parsing of form
	      values, while "html:comments" forces parsing of HTML comments.
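
	      For example, to learn an email corpus while also parsing link
	      URLs and ALT text in HTML parts (the mailbox name is illustra-
	      tive):

	      % dbacl -T email -T html:links -T html:alt -l spam spam.mbox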

       -U     Print (U)nambiguity.  When  used	in  conjunction	 with  the  -v
	      switch, prints scores followed by	their empirical	standard devi-
	      ations.  When  used alone, prints	the best category, followed by
	      an estimated probability that this category choice is  unambigu-
	      ous. More	precisely, the probability measures lack of overlap of
	      CLT  confidence  intervals  for each category score (If there is
	      overlap, then there is ambiguity).

	      This estimated probability can be	used as	an "unsure" flag, e.g.
	      if the estimated probability is  lower  than  50%.  Formally,  a
	      score of 0% means	another	category is equally likely to apply to
	      the input, and a score of	100% means no other category is	likely
	      to  apply	to the input. Note that	this type of confidence	is un-
	      related to the -X	switch.	Also, the probability estimate is usu-
	      ally low if the document is short, or if	the  message  contains
	      many  tokens  that  have never been seen before (only applies to
	      uniform digramic measure).

       -V     Print the	program	version	number and exit.

       -W     Like -w, but prevents features from straddling newlines. See the
	      description of -w.

       -X     Print the	confidence in the score	calculated for each  category,
	      when  used together with the -n or -N switch. Prepares the model
	      for confidence scores, when used with the	-l switch.  The	confi-
	      dence is an estimate of the typicality of	 the  score,  assuming
	      the  null	 hypothesis  that  the given category is correct. When
	      used with	the -v switch alone, factorizes	the score as  the  em-
	      pirical  divergence plus the shannon entropy, multiplied by com-
	      plexity, in that order. The -X switch is not  supported  in  all
	      possible	models,	and displays a percentage of "0.0" if it can't
	      be calculated. Note that for unknown documents, it is quite com-
	      mon to have confidences close to zero.

       -Y     Print the	cumulative media counts.  Some	tokenizers  include  a
	      medium variable with each	token: for example, in email classifi-
	      cation the word "the" can	appear in the subject or the body of a
	      message,	but  the  subject is counted as	a separate medium from
	      the body.	This allows the	token frequencies to be	kept separate,
	      even though the word is the same.	Currently, up to 16  different
	      media  are  supported  (0-15), with the following	interpretation
	      for email:

	       0    unused.
	       1    default medium.
	       2    mail body or attachment in HTML format.
	       3    mail body or attachment in plain text format.
	       4    mail header	unknown.
	       5    User-Agent,	Comments, Keywords, Note
	       6    X-MS*, Categor*, Priority, Importance, Thread-*
	       7    X-*
	       8    List-*
	       9    MIME-Version, Content-*
	       10   Subject
	       11   To
	       12   Sender, Sent, BCC, CC, From
	       13   Resent-*, Original-*
	       14   Message-ID,	References, In-Reply-To
	       15   Received, Return-Path, Return-Receipt-To, Reply-To

	      The -Y switch prints the number of tokens	observed in each sepa-
	      rate medium, in order from 0 to 15.
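
	      An illustrative invocation, assuming both categories were
	      learned with "-T email":

	      % dbacl -Y -T email -c work -c spam message.txt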

USAGE
       To create two category files in the current directory  from  two	 ASCII
       text  files  named  Mark_Twain.txt  and William_Shakespeare.txt respec-
       tively, type:

       % dbacl -l twain	Mark_Twain.txt
       % dbacl -l shake	William_Shakespeare.txt

       Now you can classify input text,	for example:

       % echo "howdy" |	dbacl -v -c twain -c shake
       twain
       % echo "to be or	not to be" | dbacl -v -c twain -c shake
       shake

       Note that at least the -v option is necessary, otherwise dbacl does
       not print anything. The return value is 1 in the first case, 2 in the
       second.

       % echo "to be or	not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or	not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake	6.74 * 6.0

       These  invocations  are equivalent. The numbers 6.74 and	7.04 represent
       how close the average token is to each category,	and 6.0	is the	number
       of  tokens observed. If you want	to print a simple confidence value to-
       gether with the best category, replace -v with -U.

       % echo "to be or	not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note that the true probability of category shake	versus category	 twain
       is  77.37%,  but	 the calculation is somewhat ambiguous,	and 34%	is the
       confidence out of 100% that the calculation is qualitatively correct.

       Suppose a file document.txt contains English text lines interspersed
       with noise lines. To filter out the noise lines from the English
       lines, assuming you have an existing category, say shake, type:

       % dbacl -c shake	-f shake -R document.txt > document.txt_eng
       % dbacl -c shake	-f random -R document.txt > document.txt_rnd

       Note  that  the	quality	of the results will vary depending on how well
       the categories shake and	random represent each input line.  It is some-
       times useful to see the posterior probabilities for each	 line  without
       filtering:

       % dbacl -c shake	-f shake -RN document.txt > document.txt_probs

       You  can	 now  postprocess the posterior	probabilities for each line of
       text with another script, to replicate an arbitrary  Bayesian  decision
       rule of your choice.

       In the special case of exactly two categories, the optimal Bayesian de-
       cision procedure	can be implemented for documents as follows: let p1 be
       the  prior  probability that the	input text is classified as category1.
       Consequently, the prior probability of classifying as category2 is 1  -
       p1.   Let  u12  be the cost of misclassifying a category1 input text as
       belonging to category2 and vice versa for u21.  We assume there	is  no
       cost  for classifying correctly.	 Then the following command implements
       the optimal Bayesian decision:

       % dbacl -n -c category1 -c category2 | awk '{ if($2 * p1	* u12 >	$4 *
	      (1 - p1) * u21) {	print $1; } else { print $3; } }'

       dbacl can also be used in conjunction with procmail(1) to  implement  a
       simple  Bayesian	email classification system. Assume that incoming mail
       should be automatically delivered to one	of three mail folders  located
       in  $MAILDIR and	named work, personal, and spam.	 Initially, these must
       be created and filled with appropriate  sample  emails.	 A  crontab(1)
       file can	be used	to learn the three categories once a day, e.g.

       CATS=$HOME/.dbacl
       5  0 * *	* dbacl	-T email -l $CATS/work $MAILDIR/work
       10 0 * *	* dbacl	-T email -l $CATS/personal $MAILDIR/personal
       15 0 * *	* dbacl	-T email -l $CATS/spam $MAILDIR/spam

       To  automatically  deliver  each	 incoming  email  into the appropriate
       folder, the following procmailrc(5) recipe fragment could be used:

       CATS=$HOME/.dbacl

       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c	$CATS/work -c $CATS/personal -c	$CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test	-n "$YAY"
       $MAILDIR/$YAY

       :0:
       $DEFAULT

       Sometimes, dbacl	will send the email to	the  wrong  mailbox.  In  that
       case, the misclassified message should be removed from its wrong	desti-
       nation  and placed in the correct mailbox.  The error will be corrected
       the next	time your messages are learned.	 If it is left	in  the	 wrong
       category, dbacl will learn the wrong corpus statistics.

       The  default text features (tokens) read	by dbacl are purely alphabetic
       strings,	which minimizes	memory requirements but	can be unrealistic  in
       some  cases.  To	construct models based on alphanumeric tokens, use the
       -e switch. The example below also uses the optional  -D	switch,	 which
       prints a	list of	actual tokens found in the document:

       % dbacl -e alnum	-D -l twain Mark_Twain.txt | less

       It  is  also  possible to override the default feature selection	method
       used to learn the category model	by means of regular  expressions.  For
       example,	 the following duplicates the default feature selection	method
       in the C	locale,	while being much slower:

       % dbacl -l twain	-g '^([[:alpha:]]+)' -g	'[^[:alpha:]]([[:alpha:]]+)'
	      Mark_Twain.txt

       The category twain which	is obtained depends only on single  alphabetic
       words  in  the text file	Mark_Twain.txt (and computed digram statistics
       for prediction).	 For a second example, the following command builds  a
       smoothed	 Markovian  (word bigram) model	which depends on pairs of con-
       secutive	words within each line	(but  pairs  cannot  straddle  a  line
       break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2'	-g '(^|[^[:al-
	      pha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23'
	      Mark_Twain.txt

       More generally, line-based n-gram models of all orders (up to 7) can be
       built in a similar way.  To construct paragraph based models, you
       should  reformat	 the input corpora with	awk(1) or sed(1) to obtain one
       paragraph per line. Line	size is	limited	by available memory, but  note
       that regex performance will degrade quickly for long lines.
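
       For example, the following awk(1) sketch joins each paragraph of the
       corpus onto one line before learning word bigrams (the category name
       is illustrative):

       % awk 'BEGIN { RS=""; ORS="\n" } { gsub(/\n/, " "); print }'
	      Mark_Twain.txt | dbacl -l twain_para -w 2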

PERFORMANCE
       The  underlying assumption of statistical learning is that a relatively
       small number of training	documents can represent	a much larger  set  of
       input  documents.  Thus	in  the	long run, learning can grind to	a halt
       without serious impact on classification	accuracy. While	 not  true  in
       reality,	 this assumption is surprisingly accurate for problems such as
       email filtering.	 In practice, this means that a	well chosen corpus  on
       the  order  of ten thousand documents is	sufficient for highly accurate
       results for years.  Continual learning after such a critical  mass  re-
       sults  in  diminishing returns.	Of course, when	real world input docu-
       ment patterns change dramatically, the predictive power of  the	models
       can be lost. At the other end, a	few hundred documents already give ac-
       ceptable	results	in most	cases.

       dbacl is	heavily	optimized for the case of frequent classifications but
       infrequent  batch  learning.  This  is  the  long run optimum described
       above. Under ideal conditions, dbacl can classify a hundred emails per
       second on low end hardware (500 MHz Pentium III). Learning speed is not
       very much slower, but takes effectively much longer for large  document
       collections for various reasons.	 When using the	-m switch, data	struc-
       tures  are  aggressively	mapped into memory if possible,	reducing over-
       heads for both I/O and memory allocations.

       dbacl throws away its input as soon as possible,	and has	no  limits  on
       the input document size.	Both classification and	learning speed are di-
       rectly  proportional to the number of tokens in the input, but learning
       also needs a nonlinear optimization step	which takes time  proportional
       to  the	number of unique tokens	discovered.  At	time of	writing, dbacl
       is one of the fastest open source mail filters given its	optimal	 usage
       scenario, but uses more memory for learning than	other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION
       When  saving category files, dbacl first	writes out a temporary file in
       the same	location, and renames it afterwards. If	a problem or crash oc-
       curs during learning, the old  category	file  is  therefore  left  un-
       touched.	This ensures that categories can never be corrupted, no	matter
       how  many  processes try	to simultaneously learn	or classify, and means
       that valid categories are available for classification at any time.

       When using the -m switch, file contents are memory  mapped  for	speedy
       reading	and  writing.  This,  together with the	-o switch, is intended
       mainly for testing purposes, when tens of thousands of messages must be
       learned and scored in a laboratory to measure dbacl's accuracy. Because
       no file locking is attempted for	performance reasons,  corruptions  are
       possible,  unless  you  make  sure that only one	dbacl process reads or
       writes any file at any given time. This is the only case	(-m and	-o to-
       gether) when corruption is possible.

MEMORY USE
       When classifying	a document, dbacl loads	all indicated categories  into
       RAM,  so	 the total memory needed is approximately the sum of the cate-
       gory file sizes plus a fixed small overhead.   The  input  document  is
       consumed	 while	being  read, so	its size doesn't matter, but very long
       lines can take up space.	 When using the	-m switch, the categories  are
       read using mmap(2) as available.

       When  learning,	dbacl keeps a large structure in memory	which contains
       many objects which won't	be saved into the output category. The size of
       this structure is proportional to the number of unique tokens read, but
       not the size of the input documents, since they are discarded while be-
       ing read. As a rough guide, this	structure is 4x-5x the size of the fi-
       nal category file that is produced.

       To prevent unchecked memory growth, dbacl allocates by default a	 fixed
       smallish	 amount	of memory for tokens. When this	space is used up, fur-
       ther tokens are discarded, which has the effect of skewing the learned
       category, making it less usable as more tokens are dropped. A warning is
       printed on STDERR in such a case.

       The -h switch lets you fix the initial size of the token space in pow-
       ers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you type
       "dbacl  -V", you	can see	the number of bytes needed for each token when
       either learning or classifying. Multiply	this  number  by  the  maximum
       number  of  possible tokens to estimate the memory needed for learning.
       The -H switch lets dbacl	grow its  tables  automatically	 if  and  when
       needed,	up  to	a  maximum specified. So if you	type "-H 21", then the
       initial size will be doubled repeatedly if necessary,  up  to  approxi-
       mately two million unique tokens.

       When learning with the -X switch, a handful of input documents are also
       kept in RAM throughout.

ENVIRONMENT
       DBACL_PATH
	      When this	variable is set, its value is prepended	to every cate-
	      gory filename which doesn't start	with a '/' or a	'.'.

SIGNALS
       INT    If  this	signal is caught, dbacl	simply exits without doing any
	      cleanup or other operations. This	signal can often  be  sent  by
	      pressing Ctrl-C on the keyboard. See stty(1).

       HUP, QUIT, TERM
	      If one of	these signals is caught, dbacl stops reading input and
	      continues	 its operation as if no	more input was available. This
	      is a way of quitting gracefully, but note	that in	learning mode,
	      a	category file will be written based on the  incomplete	input.
	      The QUIT signal can often be sent by pressing Ctrl-\ on the
	      keyboard. See stty(1).

       USR1   If this signal is	caught,	dbacl reloads the  current  categories
	      at  the earliest feasible	opportunity. This is not normally use-
	      ful at all, but might be in special cases, such  as  if  the  -f
	      switch is	invoked	together with input from a long	running	pipe.

NOTES
       dbacl generated category	files are in binary format, and	may or may not
       be  portable to systems using a different byte order architecture (this
       depends on how dbacl was	compiled). The -V switch  prints  out  whether
       categories are portable,	or else	you can	just experiment.

       dbacl  does  not	recognize functionally equivalent regular expressions,
       and in this case	duplicate features will	be counted several times.

       With every learned category, the	command	line options  that  were  used
       are  saved.   When  classifying,	make sure that every relevant category
       was learned with	the same set of	options	(regexes are allowed  to  dif-
       fer),  otherwise	behaviour is undefined.	There is no need to repeat all
       the switches when classifying.

       If you get many digitization warnings, then you are trying to learn too
       much data at once, or your model	is too complex.	 dbacl is compiled  to
       save  memory by digitizing final	weights, but you can disable digitiza-
       tion by editing dbacl.h and recompiling.

       dbacl offers several built-in tokenizers	(see -e	switch)	with  more  to
       come in future versions,	as the author invents them.  While the default
       tokenizer  may evolve, no tokenizer should ever be removed, so that you
       can always simulate previous dbacl behaviour subject to bug  fixes  and
       architectural changes.

       The confidence estimates obtained through the -X switch are underesti-
       mates, i.e. they are more conservative than they should be.

BUGS
       "Ya know, some day scientists are gonna invent something	that will out-
       smart a rabbit."	(Robot Rabbit, 1953)

SOURCE
       The source code for the latest version of this program is available  at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A.	Breyer <laird@lbreyer.com>

SEE ALSO
       awk(1),	bayesol(1),  crontab(1),  hmine(1),  hypex(1),	less(1), mail-
       cross(1),  mailfoot(1),	mailinspect(1),	  mailtoe(1),	procmailex(5),
       regex(7), stty(1), sed(1)

Version	1.14.1	      Bayesian Text Classification Tools	      DBACL(1)
