Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
QSF(1)				 User Manuals				QSF(1)

NAME
       qsf - quick spam	filter

SYNOPSIS
       Filtering:	qsf [-snrAtav] [-d DB] [-g DB]
			    [-L	LVL] [-S SUBJ] [-H MARK] [-Q NUM]
			    [-X	NUM]
       Training:	qsf -T SPAM NONSPAM [MAXROUNDS]	[-d DB]
       Retraining:	qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:	qsf -[p|D|R|O] [-d DB]
       Database	merge:	qsf -E OTHERDB [-d DB]
       Allowlist query:	qsf -e EMAIL [-m|-M|-t]	[-d DB]	[-g DB]
       Denylist	query:	qsf -y -e EMAIL	[-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:		qsf -[h|V]

DESCRIPTION
       qsf  reads  a single email on standard input, and by default outputs it
       on standard output.  If the email is determined to be  spam,  an	 addi-
       tional header ("X-Spam: YES") will be added, and	optionally the subject
       line can	have "[SPAM]" prepended	to it.

       qsf  is	intended to be used in a procmail(1) recipe, in	a ruleset such
       as this:

	       :0 wf
	       | qsf -ra

	       :0 H:
	       * X-Spam: YES
	       $HOME/mail/spam

       For more	examples, including sample procmail(1) recipes,	see the	 EXAM-
       PLES section below.

TRAINING
       Before qsf can be used properly,	it needs to be trained.	 A good	way to
       train qsf is to collect a copy of all your email	into two folders - one
       for  spam,  and one for non-spam.  Once you have	done this, you can use
       the training function, like this:

	       qsf -aT spam-folder non-spam-folder

       This will generate a database that can be used by qsf to	guess  whether
       email  received	in  the	future is spam or not.	Note that this initial
       training	run may	take a long time, but you should only need  to	do  it
       once.

       To  mark	 a single message as spam, pipe	it to qsf with the --mark-spam
       or -m ("mark as spam") option.  This will update	the  database  accord-
       ingly and discard the email.

       To  mark	 a single message as non-spam, pipe it to qsf with the --mark-
       nonspam or -M ("mark as non-spam") option.  Again,  this	 will  discard
       the email.

       If a message has	been mis-tagged, simply	send it	to qsf as the opposite
       type,  i.e.  if it has been mistakenly tagged as	spam, pipe it into qsf
       --mark-nonspam --weight=2 to add	it to the non-spam side	of  the	 data-
       base with double	the usual weighting.

OPTIONS
       The qsf options are listed below.

       -d, --database [TYPE:]FILE
	      Use  FILE	 as the	spam/non-spam database.	 The default is	to use
	      /var/lib/qsfdb and, if that is not available  or	is  read-only,
	      $HOME/.qsfdb.  This option can also be useful if there is	a sys-
	      tem-wide	database  but  you  do not want	to use it - specifying
	      your own here will override the default.

	      If  you  prefix  the  filename  with  a  TYPE,   of   the	  form
	      btree:$HOME/.qsfdb, then this will specify what kind of database
	      FILE is, such as list, btree, gdbm, sqlite and so	on.  Check the
	      output  of  qsf -V to see	which database backends	are available.
	      The default is to	auto-detect the	type, or, if the file does not
	      already exist, use list.	Note that TYPE is not case-sensitive.

       -g, --global [TYPE:]FILE
	      Use  FILE	 as  the   default   global   database,	  instead   of
	      /var/lib/qsfdb.	If  you	 also specify a	database with -d, then
	      this "global" database will be used in read-only	mode  in  con-
	      junction with the	read-write database specified with -d.	The -g
	      option  can  be  used a second time to specify a third database,
	      which will also be used in read-only mode.  Again, the  filename
	      can optionally be	prefixed with a	TYPE which specifies the data-
	      base type.

       -P, --plain-map FILE
	      Maintain	a  mapping  of all database tokens to their non-hashed
	      counterparts in FILE, one	token per line.	 This can be useful if
	      you want to be able to list the contents of your database	 at  a
	      later  date,  for	 instance  to get a list of email addresses in
	      your allow-list.	Note that using	this option may	slow qsf down,
	      and only entries written to the database while  this  option  is
	      active will be stored in FILE.

       -s, --subject
	      Rewrite the Subject line of any email that turns out to be spam,
	      adding "[SPAM]" to the start of the line.

       -S, --subject-marker SUBJECT
	      Instead  of  adding "[SPAM]", add	SUBJECT	to the Subject line of
	      any email	that turns out to be spam.  Implies -s.

       -H, --header-marker MARK
	      Instead of setting the X-Spam header to "YES", set it to MARK if
	      email turns out to be spam.  This	can be useful  if  your	 email
	      client can only search all headers for a string, rather than one
	      particular  header (so searching for "YES" might match more than
	      just the output of qsf).

       -n, --no-header
	      Do not add an X-Spam header to messages.

       -r, --add-rating
	      Insert an	additional header X-Spam-Rating	which is a  rating  of
	      the  "spamminess"	 of  a message from 0 to 100; 90 and above are
	      counted as spam, anything	under 90 is not	considered  spam.   If
	      combined with -t,	then the rating	(0-100)	will be	output,	on its
	      own, on standard output.

       -A, --asterisk
	      Insert  an additional header X-Spam-Level	which will contain be-
	      tween 0 and 20 asterisks (*), depending on the spam rating.

       -t, --test
	      Instead of passing the message out on  standard  output,	output
	      nothing, and exit	0 if the message is not	spam, or exit 1	if the
	      message is spam.	If combined with -r, then the spam rating will
	      be output	on standard output.

       -a, --allowlist
	      Enable the allow-list.  This causes the email addresses given in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a	list; if either	one matches, then the message  is  al-
	      ways  treated as non-spam, regardless of what the	token database
	      says. When specified with	a retraining  flag,  -a	 -m  (mark  as
	      spam)  will  remove  that	address	from the allow-list as well as
	      marking the message as spam, and -a -M (mark as  non-spam)  will
	      add  that	 address to the	allow-list as well as marking the mes-
	      sage as non-spam.	 The idea is that you add all of your  friends
	      to  the  allow-list,  and	 then  none of their messages ever get
	      marked as	spam.

       -y, --denylist
	      Enable the deny-list.  This causes the email addresses given  in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a	second list; if	either one matches, then theh  message
	      is  always  treated  as spam.  Training works in the same	way as
	      with -a, except that you must specify -m or -M twice  to	modify
	      the  deny-list  instead  of the allow-list, and with the reverse
	      syntax: -y -m -m (mark as	spam) will add	that  address  to  the
	      deny-list,  whereas -y -M	-M (mark as non-spam) will remove that
	      address from the deny-list.  This	 double	 specification	is  so
	      that  the	 usual retraining process never	touches	the deny-list;
	      the deny-list should be carefully	maintained rather  than	 auto-
	      matically	generated.

	      Normally you would not need to use the deny-list.

       -L, --level, --threshold	LEVEL
	      Change  the  spam	 scoring threshold level which must be reached
	      before an	email is classified as spam.  The default is 90.

       -Q, --min-tokens	NUM
	      Only give	a score	if more	than NUM tokens	are found in the  mes-
	      sage  -  otherwise the message is	assumed	to be non-spam,	and it
	      is not modified in any way.  The	default	 is  0.	  This	option
	      might  be	 useful	if you find that very short messages are being
	      frequently miscategorised.

       -e, --email, --email-only EMAIL
	      Query or update the  allow-list  entry  for  the	email  address
	      EMAIL.   With no other options, this will	simply output "YES" if
	      EMAIL is in the allow-list, or "NO" if it	is not.	 With  -t,  it
	      will  not	output anything, but will exit 0 (success) if EMAIL is
	      in the allow-list, or 1 (failure)	if it  is  not.	 With  the  -m
	      (mark-spam) option, any previous allow-list entry	for EMAIL will
	      be  removed.  Finally,  with the -M (mark-nonspam) option, EMAIL
	      will be added to the allow-list if it is not already on it.

	      If EMAIL is just the word	MSG on its own,	then an	email will  be
	      read  from  standard input, and the email	addresses given	in the
	      "From:" and "Return-Path:" headers will be used.

	      Using -e automatically switches on -a.

	      If you also specify -y, then the deny-list will be operated  on.
	      Remember that -m and -M are reversed with	the deny-list.

	      If you specify an	email address of the form @domain (nothing be-
	      fore the @), then	the whole domain will be allow or deny listed.

       -v, --verbose
	      Add  extra  X-QSF-Info headers to	any filtered email, containing
	      error messages and so on if applicable.  Specify	-v  more  than
	      once to increase verbosity.

       -T, --train SPAM	NONSPAM	[MAXROUNDS]
	      Train  the database using	the two	mbox folders SPAM and NONSPAM,
	      by testing each message in each folder and updating the database
	      each time	a message is miscategorised.   This  is	 done  several
	      times, and may take a while to run.  Specify the -a (allow-list)
	      flag  to	add  every sender in the NONSPAM folder	to your	allow-
	      list as a	side-effect of the training process.  If MAXROUNDS  is
	      specified,  training will	end after this number of rounds	if the
	      results are still	not good enough. The default is	a  maximum  of
	      200 rounds.

       -m, --mark-spam
	      Instead  of passing the message out on standard output, mark its
	      contents as spam and update the database	accordingly.   If  the
	      allow-list  (-a)	is enabled, the	message's "From:" and "Return-
	      Path:" addresses are removed from	the allow-list.	 If the	 deny-
	      list (-y)	is enabled and you specify -m twice, the message's ad-
	      dresses are added	to the deny-list instead.

       -M, --mark-nonspam
	      Instead  of passing the message out on standard output, mark its
	      contents as non-spam and update the  database  accordingly.   If
	      the  allow-list  (-a) is enabled,	the message's "From:" and "Re-
	      turn-Path:" addresses are	added to the allow-list	 (see  the  -a
	      option above).  If the deny-list (-y) is enabled and you specify
	      -M twice,	the message's addresses	are removed from the deny-list
	      instead.

       -w, --weight WEIGHT
	      When  marking  as	 spam  or non-spam, update the database	with a
	      weighting	of WEIGHT per token instead of the default of 1.  Use-
	      ful when correcting mistakes, eg a message that has been mistak-
	      enly detected as spam should  be	marked	as  non-spam  using  a
	      weighting	 of  2,	i.e. double the	usual weighting, to counteract
	      the error.

       -D, --dump [FILE]
	      Dump the contents	of the database	as a platform-independent text
	      file, suitable for archival, transfer to another machine,	and so
	      on.  The data is output on stdout	or into	the given FILE.

       -R, --restore [FILE]
	      Rebuild the database from	scratch	from the text file  on	stdin.
	      If  a  FILE  is  given,  data is read from there instead of from
	      stdin.

       -O, --tokens
	      Instead of filtering, output a list of the tokens	found  in  the
	      message read from	standard input,	along with the number of times
	      each  token  was	found.	This is	only useful if you want	to use
	      qsf as a general tokeniser for use with another filtering	 pack-
	      age.

       -E, --merge OTHERDB
	      Merge  the OTHERDB database into the current database.  This can
	      be useful	if you want to take one	user's mailbox	and  merge  it
	      into  the	 system-wide one, for instance (this would be done by,
	      as root, doing qsf -d /var/lib/qsfdb  -E	/home/user/.qsfdb  and
	      then removing /home/user/.qsfdb).

       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
	      Benchmark	 the  training process using the two mbox folders SPAM
	      and NONSPAM.  A temporary	database is created and	trained	 using
	      the  first  75% of the messages in each folder, and then the en-
	      tire contents of each folder is tested to	 see  how  many	 false
	      positives	 and false negatives occur. Some timing	information is
	      also displayed.

	      This can be used to decide which backend is best on your system.
	      Use -d to	select a backend, eg qsf -B spam  nonspam  -d  GDBM  -
	      this  will  create  a temporary database which is	removed	after-
	      wards.

	      The exception to this is the MySQL backend, where	a  full	 data-
	      base  specification must be given	(-d MySQL:database=db;host=lo-
	      calhost;...)  and	the database table given will not be wiped be-
	      forehand or dropped afterwards.

	      As with -T, if MAXROUNDS is specified, training  will  never  be
	      done for more than this number of	rounds;	the default is 200.

       -h, --help
	      Print a usage message on standard	output and exit	successfully.

       -V, --version
	      Print  version  information, including a list of available data-
	      base backends, on	standard output	and exit successfully.

DEPRECATED OPTIONS
       The following options are only for use with the old binary  tree	 data-
       base  backend  or  old  databases that haven't been upgraded to the new
       format that came	in with	version	1.1.0.

       -N, --no-autoprune
	      When marking as spam or nonspam, never automatically  prune  the
	      database.	 Usually the database is pruned	after every 500	marks;
	      if  you  would  rather --prune manually, use -N to disable auto-
	      matic pruning.

       -p, --prune
	      Remove redundant entries from the	database and  clean  it	 up  a
	      little.	This  is  automatically	 done  after  several calls to
	      --mark-spam or --mark-nonspam, and during	training with  --train
	      if  the  training	 takes	a large	number of rounds, so it	should
	      rarely be	necessary to use --prune manually unless you are using
	      -N / --no-autoprune.

       -X, --prune-max NUM
	      When the database	is being pruned, no more than NUM entries will
	      be considered for	removal.  This is to prevent  CPU  and	memory
	      resources	 being taken over.  The	default	is 100,000 but in some
	      circumstances (if	you find that pruning takes too	long) this op-
	      tion may be used to reduce it to a more manageable number.

FILES
       /var/lib/qsfdb
	      The default (system-wide)	spam database.	If you wish to install
	      qsf system-wide, this should be  read-only  to  everyone;	 there
	      should  be  one  user  with write	access who can update the spam
	      database with qsf	--mark-spam and	qsf --mark-non-spam when  nec-
	      essary.

       /var/lib/qsfdb2
	      A	 second,  read-only,  system-wide database. This can be	useful
	      when installing qsf system-wide and using	third-party spam data-
	      bases; the first global database can be updated with system-spe-
	      cific changes, and this second database can be periodically  up-
	      dated when the third-party spam database is updated.

       $HOME/.qsfdb
	      The  default  spam  database  for	 per-user data.	 Users without
	      write access to the system-wide database will  have  their  data
	      written  here, and the two databases will	be read	together.  The
	      per-user database	will be	given a	 weighting  equivalent	to  10
	      times the	weighting of the global	database.

NOTES
       Currently,  you	cannot use qsf to check	for spam while the database is
       being updated.  This means that while an	update	is  in	progress,  all
       email is	passed through as non-spam.

       There  is  an  upper  size  limit  of 512Kb on incoming email; anything
       larger than this	is just	passed through as non-spam, to avoid tying  up
       machine resources.

       The  plaintext  token  mapping  maintained  by  --plain-map  will never
       shrink, only grow.  It is intended for use by housekeeping and user in-
       terface scripts that, for instance, the user can	use to list all	 email
       addresses on their allow-list.  These scripts should take care of weed-
       ing  out	entries	for tokens that	are no longer in the database.	If you
       have no such scripts, there is probably no point	in  using  --plain-map
       anyway.

       Avoid  using  the deny-list (-y)	in any automated retraining, as	it can
       be cause	the filter to reject mail unnecessarily.  In general the deny-
       list is probably	best left unused unless	explicitly  required  by  your
       particular setup.

       If  both	 the  allow-list and the deny-list are enabled,	then email ad-
       dresses will first be checked against the deny-list,  then  the	allow-
       list, then the domain of	the email address will be checked for matching
       "@domain" entries in the	deny-list and then in the allow-list.

EXAMPLES
       To filter all of	your mail through qsf, with the	allow-list enabled and
       the  "spam  rating"  header  being  added, add this to your .procmailrc
       file:

	       :0 wf
	       | qsf -ra

       If you want qsf to add "[SPAM]" to the subject line of any messages  it
       thinks are spam,	do this	instead:

	       :0 wf
	       | qsf -sra

       To  automatically mark any email	sent to	spambox@yourdomain.com as spam
       (this is	the "naive" version):

	       :0 H
	       * ^To:.*spambox@yourdomain.com
	       | qsf -am

       To do the same, but cleverly, so	that  only  email  to  spambox@yourdo-
       main.com	 which	qsf  does  NOT already classify	as spam	gets marked as
       spam in the database (this  stops  the  database	 getting  too  heavily
       weighted):

	       # If sent to spambox@yourdomain.com:
	       :0
	       * ^To:.*spambox@yourdomain.com
	       {
		  :0 wf
		  | qsf	-a

		  # The	above two lines	can be skipped if you've
		  # already piped the message through qsf.

		  # If the qsf database	says it's not spam,
		  # mark it as spam!
		  :0 H
		  * ^X-Spam: NO
		  | qsf	-am
	       }

       Remove the -a option in the above examples if you don't want to use the
       allow-list.

       A  more	complicated filtering example -	this will only run qsf on mes-
       sages which don't have a	subject	line saying "your  <something>	is  on
       fire"  and  which  don't	have a sender address ending in	"@foobar.com",
       meaning that messages with that subject line  OR	 that  sender  address
       will NEVER be marked as spam, no	matter what:

	       :0 wf
	       * ! ^Subject: Your .* is	on fire
	       * ! ^From: .*@foobar.com
	       | qsf -ra

       For  more  on  procmail(1)  recipes,  see  the  procmailrc(5) and proc-
       mailex(5) manual	pages.

       A couple	of macros to add to your .muttrc file, if you use mutt(1) as a
       mail user agent:

	       # Press F5 to mark a message as spam and	delete it
	       macro index <f5>	"<pipe-message>qsf -am\n<delete-message>"
	       macro pager <f5>	"<pipe-message>qsf -am\n<delete-message>"

	       # Press F9 to mark a message as non-spam
	       macro index <f9>	"<pipe-message>qsf -aM\n"
	       macro pager <f9>	"<pipe-message>qsf -aM\n"

       Again, remove the -a option in the above	examples if you	don't want  to
       use the allow-list.

       Note,  however, that the	above macros won't work	when operating on mul-
       tiple tagged messages. For that,	you'd need something like this:

	       macro  index  <f5>   ":set   pipe_split\n<tag-prefix><pipe-mes-
	      sage>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"

       If you use qmail(7), then to get	procmail working with it you will need
       to  put	a  line	 containing just DEFAULT=./Maildir/ at the top of your
       ~/.procmailrc file, so that procmail delivers to	 your  Maildir	folder
       instead	of  trying  to	deliver	to /var/spool/mail/$USER, and you will
       need to put this	in your	~/.qmail file:

	       | preline procmail

       This will cause all your	mail to	be delivered via procmail  instead  of
       being delivered directly	into your mail directory.

       See the qmail(7)	documentation for more about mail delivery with	qmail.

       If you use postfix(1), you can set up a system-wide mail	filter by cre-
       ating a user account for	the purpose of filtering mail, populating that
       account's  .qsfdb,  and	then  creating	a shell	script,	to run as that
       user, which runs	qsf on stdin and passes	stdout to sendmail(8).

       Doing this requires some	knowledge of postfix  configuration  and  care
       needs  to  be  taken to avoid mail loops.  One qsf user's full HOWTO is
       included	in the doc/ directory with this	package.

THE ALLOW-LIST
       A feature called	the "allow-list" can be	switched on by specifying  the
       --allowlist  or	-a option.  This causes	messages' "From:" and "Return-
       Path:" addresses	to be checked against a	list of	people you  have  said
       to  allow  all  messages	 from,	and if a message's "From:" or "Return-
       Path:" address is in the	list, it is never marked as spam.  This	 means
       you can add all your friends to an "allow-list" and qsf will then never
       mis-file	 their	messages - a quick way to do this is to	use -a with -T
       (train);	everyone in your non-spam folder who has  sent	you  an	 email
       will be added to	the allow-list automatically during training.

       You  can	 manually  add and remove addresses to and from	the allow-list
       using the -e (email) option. For	instance, to add  foo@bar.com  to  the
       allow-list, do this:

	       qsf -e foo@bar.com -M

       To remove bad@nasty.com from the	allow-list, do this:

	       qsf -e bad@nasty.com -m

       And  to	see whether someone@somewhere.com is in	the allow-list or not,
       just do this:

	       qsf -e someone@somewhere.com

       In general, you probably	always want to enable the allow-list,  so  al-
       ways  specify  the  -a  option when using qsf.  This will automatically
       maintain	the allow-list based on	what you classify as spam or non-spam.

       The only	times you might	want to	turn it	off are	when  people  on  your
       allow-list  are prone to	getting	viruses	or if a	virus is causing email
       to be sent to you that is pretending to be from someone on your	allow-
       list.

BACKUP AND RESTORE
       Because	the database format is platform-specific, it is	a good idea to
       periodically dump the database to a text	file using qsf -D so that,  if
       necessary,  it  can be transferred to another machine and restored with
       qsf -R later on.

       Also note that since the	actual contents	of email  messages  are	 never
       stored  in  the	database (see TECHNICAL	DETAILS), you can safely share
       your qsf	database with friends -	simply dump your database to  a	 file,
       like this:

	       qsf -D >	your-database-dump.txt

       Once  you  have sent your-database-dump.txt to another person, they can
       do this:

	       qsf -R <	your-database-dump.txt

       They will then have an identical	database to yours.

TECHNICAL DETAILS
       When a message is passed	to qsf,	any attachments	are decoded, all  HTML
       elements	 are removed, and the message text is then broken up into "to-
       kens", where a "token" is a single word or URL.	Each token  is	hashed
       using the MD5 algorithm (see below for why), and	that hash is then used
       to look up each token in	the qsf	database.

       For  full  details  of  which parts of an email (headers, body, attach-
       ments, etc) are used to calculate the spam rating, see the TOKENISATION
       section below.

       Within the database, each token has two numbers associated with it: the
       number of times that token has been seen	in spam,  and  the  number  of
       times  it has been seen in non-spam.  These two numbers,	along with the
       total number of spam and	non-spam messages seen,	are then used to  give
       a  "spamminess"	value  for  that  particular token.  This "spamminess"
       value ranges from "definitely not spammy" at  one  end  of  the	scale,
       through "neutral" in the	middle,	up to "definitely spammy" at the other
       end.

       Once  a "spamminess" value has been calculated for all of the tokens in
       the message, a summary calculation is made to give an overall "is  this
       spam?"  probability rating for the message.  If the overall probability
       is 0.9 or above,	the message is flagged as spam.

       In  addition  to	 the probability test is the "allow-list".  If enabled
       (with the -a option), the whole probability check  is  skipped  if  the
       sender  of  the message is listed in the	allow-list, and	the message is
       not marked as spam.

       When training the database, a message is	split up into  tokens  as  de-
       scribed	above, and then	the numbers in the database for	each token are
       simply added to:	if you tell qsf	that a message is spam,	it adds	one to
       the "number of times seen in spam" counter for each token, and  if  you
       tell it a message is not	spam, it adds one to the "number of times seen
       in non-spam" counter for	each token.  If	you specify a weight, with -w,
       then the	number you specify is added instead of one.

       To  stop	 the database growing uncontrollably, the database keeps track
       of when a token was last	used.  Underused tokens	are automatically  re-
       moved  from the database.  (The old method was to "prune" every 500 up-
       dates).

       Finally,	the reason MD5 hashes were used	is privacy.  If	the actual to-
       kens from the messages, and the actual email addresses  in  the	allow-
       list,  were  stored,  you could not share a single qsf database between
       multiple	users because bits of everyone's  messages  would  be  in  the
       database	- things like emailed passwords, keywords relating to personal
       gossip, and so on.  So a	hash is	stored instead.	 A hash	is a "one-way"
       function;  it  is  easy to turn a token into a hash but very hard (some
       might say impossible) to	turn a hash back into the token	 that  created
       it.  This means that you	end up with a database with no personal	infor-
       mation in it.

TOKENISATION
       When  a	message	is broken up into tokens, various parts	of the message
       are treated in different	ways.

       First, all header fields	are discarded, except for the important	 ones:
       From, Return-Path, Sender, To, Reply-To,	and Subject.

       Next,  any MIME-encoded attachments are decoded.	 Any attachments whose
       MIME type starts	with "text/" (i.e. HTML	and text) are tokenised, after
       having any HTML tags stripped.  Any  non-textual	 attachments  are  re-
       placed  with  their  MD5	hash (such that	two identical attachments will
       have the	same hash), and	that hash is then used as a token.

       In addition to single-word tokens from textual message parts, qsf  adds
       doubled-up  tokens  so that word	pairs get added	to the database.  This
       makes the database a bit	bigger (although the automatic	pruning	 tends
       to take care of that) but makes matching	more exact.

SPECIAL	FILTERS
       As  well	as using the textual content of	email to detect	spam, qsf also
       uses special filters which  create  "pseudo-tokens"  based  on  various
       rules.	This  means that specific patterns, not	just individual	words,
       can be used to determine	whether	a message is spam or not.

       For example, if a message contains lots of words	with  multiple	conso-
       nants,  like  "ashjkbnxcsdjh",  then each time a	word like that is seen
       the special token ".GIBBERISH-CONSONANTS." is added to the list of  to-
       kens  found  in	the  message.  If it turns out that most messages with
       words that trigger this filter rule are spam, then other	messages  with
       gibberish consonant strings will	be more	likely to be flagged as	spam.

       Currently the special filters are:

       GTUBE  Flags    any    message	 containing   the   string   XJS*C4JD-
	      BQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X  as
	      spam - useful for	testing	that your qsf installation is working.

       ATTACH-SCR

       ATTACH-PIF

       ATTACH-EXE

       ATTACH-VBS

       ATTACH-VBA

       ATTACH-LNK

       ATTACH-COM

       ATTACH-BAT
	      Adds a token for every attachment	whose filename ends in ".scr",
	      ".pif",  ".exe",	".vbs",	".vba",	".lnk",	".com",	and ".bat" re-
	      spectively (these	are often viruses).

       ATTACH-GIF

       ATTACH-JPG

       ATTACH-PNG
	      Adds a token for every attachment	whose filename ends in ".gif",
	      ".jpg" or	".jpeg", and ".png" respectively.

       ATTACH-DOC

       ATTACH-XLS

       ATTACH-PDF
	      Adds a token for every attachment	whose filename ends in ".doc",
	      ".xls", or ".pdf"	respectively (these tend to  indicate  a  non-
	      spam email).

       SINGLE-IMAGE
	      Adds a token if the message contains exactly one attached	image.

       MULTIPLE-IMAGES
	      Adds  a token if the message contains more than one attached im-
	      age.

       GIBBERISH-CONSONANTS
	      Adds a token for every word found	that has  multiple  consonants
	      in  a  row,  as described	above.	Spam often contains strings of
	      gibberish.

       GIBBERISH-VOWELS
	      Adds a token for every word found	that has multiple vowels in  a
	      row, eg "aeaiaiaeeio".

       GIBBERISH-FROMCONS
	      Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
	      Path:" addresses on their	own.

       GIBBERISH-FROMVOWL
	      Like  GIBBERISH-VOWELS,  but  only  for the "From:" and "Return-
	      Path:" addresses on their	own.

       GIBBERISH-BADSTART
	      Adds a token for every word that starts  with  a	bad  character
	      such as %.

       GIBBERISH-HYPHENS
	      Adds  a token for	every word with	more than three	hyphens	or un-
	      derscores	in it.

       GIBBERISH-LONGWORDS
	      Adds a token for every word with over 30 characters in  it  (but
	      less than	60).

       HTML-COMMENTS-IN-WORDS
	      Adds  a  token  for  every HTML comment found in the middle of a
	      word.   Spam  often  contains  HTML  inside  words,  like	 this:
	      w<!--dsgfhsdgjgh-->ord

       HTML-EXTERNAL-IMG
	      Adds  a  token  for every	HTML <img> (image) tag found that con-
	      tains ://	(i.e.  it refers to an external	image).

       HTML-FONT
	      Adds a token for every HTML <font> tag found.

       HTML-IP-IN-URLS
	      Adds a token for every URL found containing an IP	address.

       HTML-INT-IN-URL
	      Adds a token for every URL found containing an  integer  in  its
	      hostname.

       HTML-URLENCODED-URL
	      Adds  a  token  for  every  URL found containing a % sign	in its
	      hostname.

       Normally, filters will just cause a token to be added, and these	tokens
       are processed by	the normal weighting  algorithm.   However  the	 GTUBE
       filter  will  immediately  flag any matching message as spam, bypassing
       the token matching.

DATABASE BACKENDS
       The inbuilt "list" database backend will	not  necessarily  provide  the
       best performance, but is	provided because using it requires no external
       libraries.

       If,  when  qsf was compiled, the	correct	libraries were available, then
       it will be possible to use qsf with alternative database	backends.   To
       find  out which backends	you have available, run	qsf -V (capital	V) and
       read the	second line of output.	To see how well	 a  backend  performs,
       collect	some  spam and non-spam	and use	qsf -d BACKEND -B SPAM NONSPAM
       (see the	entry for -B above).

       Some people find	that they get the best performance  out	 of  the  gdbm
       backend;	this is	a library that is widely available on many systems.

       To  efficiently	share a	qsf database across multiple machines, you may
       find the	MySQL backend useful.  However,	using it is a little more com-
       plicated.

       To use the MySQL	backend	you will need  to  create  a  table  with  the
       fields  key1,  key2,  token,  value1,  value2  and  value3.  The	token,
       value1, value2, and value3 fields must be VARCHAR(64), BIGINT  or  INT,
       and  BIGINT  or	INT respectively, and indexing on the token field is a
       good idea. The key1 and key2 fields can be anything, but	they  must  be
       present.

       For example:

		USE mydatabase;
		CREATE TABLE qsfdb (
		  key1	    BIGINT UNSIGNED NOT	NULL,
		  key2	    BIGINT UNSIGNED NOT	NULL,
		  token	    VARCHAR(64)	DEFAULT	'' NOT NULL,
		  value1    INT	UNSIGNED NOT NULL,
		  value2    INT	UNSIGNED NOT NULL,
		  value3    INT	UNSIGNED NOT NULL,
		  PRIMARY KEY (key1,key2,token),
		  KEY (key1),
		  KEY (key2),
		  KEY (token)
		);

       The  key1  and  key2 fields allow you to	have multiple qsf databases in
       one table, by specifying	different key1 and key2	values on invocation.

       Instead of specifying a database	file with the --database / -d  option,
       you  must  specify either a specification string	as described below, or
       the name	of a file containing such a string on its first	line.

       The specification string	is as follows:

		database=DATABASE;host=HOST;port=PORT;
		user=USER;pass=PASS;table=TABLE;
		key1=KEY1;key2=KEY2

       This string must	be all on one line, with no spaces.

       DATABASE
	      is the name of the MySQL database.

       HOST   is the hostname of the database server (eg "localhost").

       PORT   is the TCP port to connect on (eg	3306).

       USER   is the username to connect with.

       PASS   is the password to connect with.

       TABLE  is the database table to use.  If	a table	with  this  name  does
	      not exist	when qsf is called in update or	training mode, then it
	      will be created if permissions allow this	to be done.

       KEY1   is the value to use for the key1 field.

       KEY2   is the value to use for the key2 field.

       Since  command  lines  can  be seen in the process list,	it is probably
       best to specify a filename (eg qsf -d  mysql:qsfdb.spec)	 and  put  the
       specification string inside that	file.

TROUBLESHOOTING
       If  you	have  problems	with qsf, please check the list	below; if this
       does not	help, go to the	qsf home  page	and  investigate  the  mailing
       lists, or email the author.

       Nothing is being	marked as spam.
	      First,  use the -r option	to switch on the X-Spam-Rating header,
	      and check	that this header appears in email passed through  qsf.
	      If  it  does not,	then it	is likely that qsf is not being	run at
	      all - check your configuration of	procmail(1) or its equivalent.

	      If you are seeing	X-Spam-Rating headers,	and  different	emails
	      have  different scores, then you may simply need to retrain your
	      database a little	more.  Take more spam email and	pass it	to qsf
	      -m.

	      If you are seeing	X-Spam-Rating headers but they	all  give  the
	      same spam	rating,	then the most likely reason is that qsf	is not
	      reading any database.  Make sure that whatever is	processing the
	      email  has  read permissions on /var/lib/qsfdb and/or ~/.qsfdb -
	      and make sure that, if you are using ~/.qsfdb, what  your	 data-
	      base  creator  thought  was  ~  ($HOME) is the same as it	is for
	      whatever is processing the email.

       Retraining sometimes takes a very long time.
	      With the obtree backend or  2-column  MySQL  or  SQLite  tables,
	      every 500th retrain (-m or -M), the database is pruned.  On some
	      systems  this may	take some time,	and during this	time the data-
	      base is locked (except when using	the MySQL or SQLite backends).
	      If you constantly	do a lot of retraining and want	to avoid this,
	      then use the -N option to	suppress auto-pruning, and then	have a
	      cron(8) job or something run a manual prune (qsf -p)  every  now
	      and again.

       Running qsf from	procmail fails with an error.
	      If  you  can run qsf from	the command line, but in your procmail
	      log file you get errors about "qsf: cannot execute binary	file",
	      then contact your	system administrator for help. It may be  that
	      incoming	email  is handled by a different server	to the one you
	      normally shell into, and either they are of a  different	archi-
	      tecture or operating system, or the mail server is not permitted
	      to execute user-owned binaries.

AUTHOR
       Written by Andrew Wood, with patches submitted by various other people.
       Please see the package README for a complete list of contributors.

BUGS
       Report  bugs  in	 QSF  using  the contact form linked from the QSF home
       page: <http://www.ivarch.com/programs/qsf/>

SEE ALSO
       procmail(1), procmailrc(5), procmailex(5)

       Someone has written a guide to using qsf	with KMail that	can  be	 found
       at:
       http://www.softwaredesign.co.uk/Information.SpamFilters.html

LICENSE
       This is free software, distributed under	the ARTISTIC 2.0 license.

1.2.15				  March	2021				QSF(1)

Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=qsf&sektion=1&manpath=FreeBSD+Ports+15.0>

home | help