ANNOGEN(1)		    General Commands Manual		    ANNOGEN(1)

Annotator  Generator  is an examples-driven generator of fast text annotators.
"Annotate" in this context means to add	pronunciation or other information  to
each  word,  and/or  to	 split text into words in a language that does not use
spaces.

       O   You supply a	corpus of pre-annotated	texts for Annotator  Generator
	   to work out the rules and exceptions

       O   Annotator   Generator   creates   table-driven  code	 in  C,	 Java,
	   Javascript, Dart or Python with 2 and 3 compatibility

       O   The resulting program should	be able	to annotate any	text that con-
	   tains words or phrases similar to those found in the	examples

       O   It can output the annotations alone or it can combine them with the
	   original text using HTML Ruby markup	or simple braces

       O   If anything is unclear (didn't happen in the	examples,  or  there's
	   not	enough	context	to figure out which example should be applied)
	   then	the program will leave it unannotated so you can pass it to  a
	   backup annotation program if	you have one.

       O   If  you  have  no  backup annotator then try	setting	the -y option,
	   which makes Annotator Generator try harder to find context-indepen-
	   dent	rules with context-dependent exceptions, so as to annotate  as
	   much	text as	possible.

       O   Generated  annotators  can act as filters for Web Adjuster; options
	   are also provided for generating Android apps, browser  extensions,
	   and	clipboard  annotators  for  Windows and	Windows	Mobile,	or you
	   could format	the annotations	on a Unix terminal

       -h, --help
       Show this help message and exit

       --infile=
       Filename	of a text file (or a compressed	.gz, .bz2 or .xz file or  URL)
       to read the input examples from.	If this	is not specified, standard in-
       put is used.

       --incode=
       Character encoding of the input file (default utf-8)

       --mstart=
       The  string  that  starts a piece of text with annotation markup	in the
       input examples; default <ruby><rb>

       --mmid=
       The string that occurs in the middle of a piece of markup in the	 input
       examples,  with	the word on its	left and the added markup on its right
       (or the other way around	if mreverse is set); default </rb><rt>

       --mend=
       The string that ends a piece of annotation markup in  the  input	 exam-
       ples; default </rt></ruby>

       -r, --mreverse
       Specifies  that	the  annotation	markup is reversed, so the text	before
       mmid is the annotation and the text after it is the base	text
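
       As a rough illustration of how the three markup strings delimit each
       example, the following Python sketch pulls (base text, annotation)
       pairs out of a string. The function name extract_pairs and the sample
       text are hypothetical, and this is not Annotator Generator's own
       parser; it only mirrors the defaults described above.

```python
import re

def extract_pairs(text, mstart="<ruby><rb>", mmid="</rb><rt>",
                  mend="</rt></ruby>", mreverse=False):
    """Return (base text, annotation) pairs found in one example string."""
    pattern = re.compile(
        re.escape(mstart) + "(.*?)" + re.escape(mmid) + "(.*?)" + re.escape(mend),
        re.DOTALL)
    pairs = []
    for left, right in pattern.findall(text):
        # With mreverse, the annotation is on the left and the base text on the right.
        pairs.append((right, left) if mreverse else (left, right))
    return pairs

# Hypothetical sample in the default input format.
example = ("<ruby><rb>你好</rb><rt>nǐ hǎo</rt></ruby>"
           "<ruby><rb>吗</rb><rt>ma</rt></ruby>")
print(extract_pairs(example))
```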

       --no-mreverse
       Cancels any earlier --mreverse option in	Makefile variables etc

       --end-pri=
       Treat words that occur in the examples before this delimiter as having
       "high  priority"	 for  Yarowsky-like seed collocations (if these	are in
       use). Normally the Yarowsky-like	logic tries to	identify  a  "default"
       annotation  based  on what is most common in the	examples, with the ex-
       ceptions	indicated by collocations. If however a	word  is  found	 in  a
       high-priority  section  at  the	start, then the	first annotation found
       there will be taken as the ideal	"default" even if it's in  a  minority
       in the examples;	everything else	will be	taken as an exception.
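
       The choice of "default" annotation described above can be pictured
       with a small Python sketch. The function choose_default and its
       arguments are hypothetical and simplified; the real analyser works on
       the whole corpus and its collocations.

```python
from collections import Counter

def choose_default(priority_annots, all_annots):
    """Pick a word's default annotation.

    priority_annots: annotations seen for the word before the --end-pri
                     delimiter, in order of appearance (may be empty).
    all_annots:      every annotation seen for the word in the examples.
    """
    if priority_annots:
        # The first annotation in the high-priority section wins,
        # even if it is in a minority overall.
        return priority_annots[0]
    # Otherwise fall back to whatever is most common in the examples.
    return Counter(all_annots).most_common(1)[0][0]

print(choose_default([], ["háng", "xíng", "xíng"]))
print(choose_default(["háng"], ["háng", "xíng", "xíng"]))
```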

       -s, --spaces
       Set this	if you are working with	a language that	uses whitespace	in its
       non-markedup  version (not fully	tested). The default is	to assume that
       there will not be any whitespace	in the language, which is correct  for
       Chinese and Japanese.

       --no-spaces
       Cancels any earlier --spaces option in Makefile variables etc

       -c, --capitalisation
       Don't  try  to normalise	capitalisation in the input. Normally, to sim-
       plify the rules,	the analyser will try to remove	start-of-sentence cap-
       itals in	annotations, so	that the only  remaining  words	 with  capital
       letters are the ones that are always capitalised	such as	names. (That's
       not  perfect:  some words might always be capitalised just because they
       never occur mid-sentence	in the examples.) If this option is used,  the
       analyser	 will instead try to "learn" how to predict the	capitalisation
       of all words (including start of	sentence words)	from their contexts.

       --no-capitalisation
       Cancels any earlier --capitalisation option in Makefile variables etc

       -w, --annot-whitespace
       Don't try to normalise the use of whitespace and	hyphenation in the ex-
       ample annotations. Normally the analyser	will try to do this, to	reduce
       the risk	of missing possible rules due to  minor	 typographical	varia-
       tions.

       --no-annot-whitespace
       Cancels any earlier --annot-whitespace option in	Makefile variables etc

       --keep-whitespace=
       Comma-separated	list  of  words	 (without annotation markup) for which
       whitespace and hyphenation should always	be kept	even without the --an-
       not-whitespace option. Use when you know	the variation  is  legitimate.
       This  option expects words to be	encoded	using the system locale	(UTF-8
       if it cannot be detected).

       --suffix=
       Comma-separated list of annotations that	 can  be  considered  optional
       suffixes	for normalisation

       --suffix-minlen=
       Minimum length of word (in Unicode characters) to apply suffix normali-
       sation

       --post-normalise=
       Filename	 or  URL  of  an  optional Python module defining a dictionary
       called 'table' mapping integers to integers for arbitrary  single-char-
       acter normalisation on the Unicode BMP. This can	reduce the size	of the
       annotator. It is	applied	in post-processing (does not affect rules gen-
       eration	itself). For example this can be used to merge the recognition
       of Full,	Simplified and Variant forms of	the same Chinese character  in
       cases where this	can be done without ambiguity, if it is	acceptable for
       the generated annotator to recognise mixed-script words should they oc-
       cur.  If	 any word in the examples has a	different annotation when nor-
       malised than not, the normalised	version	takes precedence.
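
       A module for this option might look like the sketch below. The
       filename and the particular character pairs are illustrative
       assumptions, as is the direction of the mapping (variant form to the
       form you want recognised); only the requirement of a dictionary
       called 'table' mapping integers to integers comes from the
       description above.

```python
# normalise_table.py (hypothetical filename)
# Maps Unicode code points to code points, all within the BMP.
table = {
    ord("說"): ord("说"),
    ord("説"): ord("说"),
    ord("話"): ord("话"),
}

# Sanity check: every key and value must be a BMP code point.
assert all(k <= 0xFFFF and v <= 0xFFFF for k, v in table.items())

# str.translate accepts the same kind of mapping, which gives a quick
# way to preview the effect of the table on a sample string.
print("說話".translate(table))
```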

       --glossfile=
       Filename	of an optional text file (or compressed	.gz, .bz2 or .xz  file
       or URL) to read auxiliary "gloss" information. Each line	of this	should
       be  of  the  form: word (tab) annotation	(tab) gloss. Extra tabs	in the
       gloss will be converted to newlines (useful if you want to quote	multi-
       ple dictionaries). When the compiled annotator generates	 ruby  markup,
       it  will	 add  the  gloss string	as a popup title whenever that word is
       used with that annotation (before any reannotator option	 is  applied).
       The  annotation field may be left blank to indicate that	the gloss will
       appear for all other annotations	of that	word. The entries in glossfile
       do not affect the annotation process itself, so it's not	 necessary  to
       completely debug	glossfile's word segmentation etc.
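
       The line format above can be read with a few lines of Python. This is
       a sketch of the format only; parse_gloss_line is a hypothetical
       helper, not part of annogen.

```python
def parse_gloss_line(line):
    """Split one glossfile line into (word, annotation, gloss)."""
    # Only the first two tabs separate fields; the rest belong to the gloss.
    word, annotation, gloss = line.rstrip("\n").split("\t", 2)
    # Extra tabs inside the gloss become newlines.
    return word, annotation, gloss.replace("\t", "\n")

print(parse_gloss_line("好\thǎo\tgood\twell\n"))
# A blank annotation field means the gloss covers all other annotations.
print(parse_gloss_line("好\t\tgood\n"))
```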

       -C, --gloss-closure=
       If  any	Chinese,  Japanese  or	Korean word is missing from glossfile,
       search its closure of variant characters	also, using the	 Unihan	 vari-
       ants file (or URL) specified by this option

       --no-gloss-closure
       Cancels any earlier --gloss-closure option in Makefile variables	etc

       -M, --glossmiss-omit
       Omit  rules  containing	any  word not mentioned	in glossfile. Might be
       useful if you want to train on a	text that uses proprietary  terms  and
       don't want to accidentally 'leak' those terms (assuming they're not ac-
       cidentally  included  in	 glossfile  also). Words may also be listed in
       glossfile with an empty gloss field to indicate that no gloss is	avail-
       able but	rules using this word needn't be omitted.

       --no-glossmiss-omit
       Cancels any earlier --glossmiss-omit option in Makefile variables etc

       --words-omit=
       File (or	compressed .gz,	.bz2 or	.xz file or URL) containing words (one
       per line, without markup) to omit from the annotator. Use this to make
       an annotator smaller if, for example, you're working from a rules file
       that contains long lists of place names you don't need this particular
       annotator to recognise but you still want to keep them as rules for
       other annotators. Be careful, because any word on such a list gets
       omitted even if it also has other meanings (some place names are also
       normal words).

       --manualrules=
       Filename	 of an optional	text file (or compressed .gz, .bz2 or .xz file
       or URL) to read extra, manually-written rules. Each line	of this	should
       be a marked-up phrase (in the input format) which is to be uncondition-
       ally added as a rule. Use this sparingly, because these rules  are  not
       taken  into account when	generating the others and they will be applied
       regardless of context (although a manual	rule might fail	to activate if
       the annotator is	part-way through processing  a	different  rule);  try
       checking	messages from --diagnose-manual.

       --c-filename=
       Where  to  write	the C, C#, Python, Javascript, Go or Dart program. De-
       faults to standard output, or annotator.c in the	system	temporary  di-
       rectory	if standard output seems to be the terminal (the program might
       be large, especially if Yarowsky-like indicators	are not	used, so  it's
       best  not  to  use a server home	directory where	you might have limited
       quota).

       --c-compiler=
       The C compiler to run if	generating C and standard output is  not  con-
       nected  to a pipe. The default is to use	the "cc" command which usually
       redirects to your "normal" compiler. You	can add	 options  (remembering
       to  enclose  this whole parameter in quotes if it contains spaces), but
       if the C	program	is large then adding optimisation options may make the
       compile take a long time. If standard output is connected  to  a	 pipe,
       then  this  option is ignored because the C code	will simply be written
       to the pipe. You	can also set this option to an empty  string  to  skip
       compilation. Default: cc	-o annotator

       --outcode=
       Character  encoding to use in the generated parser (default utf-8, must
       be ASCII-compatible i.e.	not utf-16)

       --rulesFile=
       Filename	of a JSON file to hold the accumulated rules. Adding .gz, .bz2
       or .xz for compression is  acceptable.  If  this	 is  set  then	either
       --write-rules or	--read-rules must be specified.

       --write-rules
       Write  rulesFile	 instead of generating a parser. You will then need to
       rerun with --read-rules later.

       --no-write-rules
       Cancels any earlier --write-rules option	in Makefile variables etc

       --read-rules
       Read rulesFile from a previous run, and apply the output	options	to it.
       You should still	specify	the input formatting options (which should not
       change),	and any	glossfile or manualrules options (which	 may  change),
       but no input is required.
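
       The two-step workflow implied by --write-rules and --read-rules can
       be sketched as a pair of command lines, built here in Python. The
       command name "annogen" and the filenames are assumptions; only the
       option names come from this page. The commands are printed rather
       than run.

```python
import shlex

# Options shared by both runs (command name and filenames are hypothetical).
common = ["annogen", "--rulesFile=rules.json.gz"]

# Step 1: analyse the examples and store the rules, without generating code.
write_cmd = common + ["--infile=corpus.txt", "--write-rules"]

# Step 2: reuse the stored rules and apply output options; no input needed.
read_cmd = common + ["--read-rules", "--javascript",
                     "--c-filename=annotator.js"]

for cmd in (write_cmd, read_cmd):
    print(shlex.join(cmd))
```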

       --no-read-rules
       Cancels any earlier --read-rules	option in Makefile variables etc

       -E, --newlines-reset
       Have  the  annotator  reset its state on	every newline byte. By default
       newlines	do not affect state such as whether a space is required	before
       the next	word, so that if the annotator is  used	 with  Web  Adjuster's
       htmlText	 option	(which defaults	to using newline separators) the spac-
       ing should be handled sensibly when there is HTML  markup  in  mid-sen-
       tence.

       --no-newlines-reset
       Cancels any earlier --newlines-reset option in Makefile variables etc

       -z, --compress
       Compress	annotation strings in the C code. This compression is designed
       for  fast  on-the-fly  decoding,	 so  it	saves only a limited amount of
       space (typically	10-20%)	but might help if RAM is short.

       --no-compress
       Cancels any earlier --compress option in	Makefile variables etc

       -Z, --zlib
       Compress	the embedded data table	using zlib (or pyzopfli	if available),
       and include code	to call	zlib to	decompress it on load. Useful  if  the
       runtime	machine	 has  the zlib library and you need to save disk space
       but not RAM (the	decompressed table is stored separately	in RAM,	unlike
       --compress which, although giving less compression, at least works  'in
       place').	 Once  --zlib  is in use, specifying --compress	too will typi-
       cally give an additional	disk space saving of less than 1% (and a  run-
       time  RAM  saving that's	greater	but more than offset by	zlib's extrac-
       tion RAM). If generating	a Javascript annotator with zlib,  the	decom-
       pression	 code  is  inlined  so there's no runtime zlib dependency, but
       startup can be ~50% slower so this option is not	recommended in	situa-
       tions  where  the  annotator is frequently reloaded from	source (unless
       you're running on Node.js in which case loading is faster  due  to  the
       use of Node's "Buffer" class).

       --no-zlib
       Cancels any earlier --zlib option in Makefile variables etc

       -l, --library
       Instead	of generating C	code that reads	and writes standard input/out-
       put, generate a C library suitable for loading into Python via  ctypes.
       This  can  be used for example to preload a filter into Web Adjuster to
       cut process-startup delays.

       --no-library
       Cancels any earlier --library option in Makefile	variables etc

       -W, --windows-clipboard
       Include C code to read the clipboard on Windows or Windows  Mobile  and
       to  write an annotated HTML file	and launch a browser, instead of using
       the default cross-platform command-line C wrapper. See the start	of the
       generated C file	for instructions on how	to compile for Windows or Win-
       dows Mobile.

       --no-windows-clipboard
       Cancels any earlier --windows-clipboard option  in  Makefile  variables
       etc

       --java=
       Instead of generating C code, generate Java, and	place the *.java files
       in  the directory specified by this option. The last part of the	direc-
       tory should be made up of the package name; a double slash (//)	should
       separate	  the	rest   of   the	 path  from  the  package  name,  e.g.
       --java=/path/to/wherever//org/example/annotator and the main class will
       be called Annotator.
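
       The double-slash convention can be illustrated with a short Python
       sketch that separates the two parts of the option value.
       split_java_option is a hypothetical helper written for this
       illustration, not annogen's own code.

```python
def split_java_option(value):
    """Split a --java value into (base directory, Java package name)."""
    # Everything before the first '//' is the base path; the rest is the
    # package, written as a directory path.
    directory, _, package_path = value.partition("//")
    return directory, package_path.strip("/").replace("/", ".")

print(split_java_option("/path/to/wherever//org/example/annotator"))
```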

       --android=
       URL for an Android app to browse	(--java	must be	set). If this is  set,
       code  is	 generated for an Android app which starts a browser with that
       URL as the start	page, and annotates the	text on	every page  it	loads.
       Use file:///android_asset/index.html for	local HTML files in the	assets
       directory;  a clipboard viewer is placed	in clipboard.html, and the app
       will also be able to handle shared text.	If certain  environment	 vari-
       ables  are set, this option can also compile and	sign the app using An-
       droid SDK command-line tools (otherwise it puts a message on stderr ex-
       plaining	what needs to be set)

       --android-template=
       File (or	URL) to	use as a template for Android start HTML. This	option
       implies	--android=file:///android_asset/index.html  and	generates that
       index.html from the file	specified (or from a built-in default  if  the
       special	filename  'blank'  is  used).  The  template  file may include
       URL_BOX_GOES_HERE to show a  URL	 entry	box  and  related  items  (of-
       fline-clipboard link etc) in the	page, in which case you	can optionally
       define  a  Javascript function 'annotUrlTrans' to pre-convert some URLs
       from shortcuts etc; also	enables	better zoom controls on	Android	4+,  a
       mode  selector if you use --annotation-names, a selection scope control
       on recent-enough	WebKit,	and a visible version stamp (which, if the de-
       vice is in 'developer mode', you	may  double-tap	 on  to	 show  missing
       glosses).  VERSION_GOES_HERE may	also be	included if you	want to	put it
       somewhere other than at the bottom of  the  page.  If  you  do  include
       URL_BOX_GOES_HERE you'll	have an	annotating Web browser app that	allows
       the  user to navigate to	arbitrary URLs:	as of 2020, this is acceptable
       on Google Play and Huawei AppGallery (non-China only  from  2022),  but
       not  Amazon  AppStore  as  they	don't want 'competition' to their Silk
       browser.

       --gloss-simplify=
       A regular expression matching parts of glosses to remove	when  generat-
       ing  a '3-line' format in apps, but not for hover titles	or popups. De-
       fault removes parenthesised expressions if not solitary,	anything after
       the first slash or semicolon, and the leading word 'to'.	Can be set  to
       empty string to omit simplification.

       -L, --pleco-hanping
       In  the Android app, make popup definitions link	to Pleco or Hanping if
       installed

       --no-pleco-hanping
       Cancels any earlier --pleco-hanping option in Makefile variables	etc

       --bookmarks=
       Android bookmarks: comma-separated list of package names	that share our
       bookmarks. If this is not specified, the	browser	will not  be  given  a
       bookmarks function. If it is set	to the same value as the package spec-
       ified  in --java, bookmarks are kept in just this Android app. If it is
       set to a	comma-separated	list of	packages that have also	been generated
       by annogen (presumably with different annotation	types),	 and  if  each
       one   has  the  same  android:sharedUserId  attribute  in  AndroidMani-
       fest.xml's 'manifest' tag (you'll need to add this  manually),  and  if
       the same	certificate is used to sign all	of them, then bookmarks	can be
       shared across the set of	browser	apps. But beware the following two is-
       sues:  (1)  adding an android:sharedUserId attribute to an app that has
       already been released without one causes	some devices to	refuse the up-
       date with a 'cannot install' message (details via adb logcat;  affected
       users would need	to uninstall and reinstall instead of update, and some
       of them may not notice the instruction to do so); (2) this has not been
       tested with Google's new	"App Bundle" arrangement, and may be broken if
       the  Bundle  results  in	 APKs being signed by a	different key. In June
       2019 Play Console started issuing warnings if you release  an  APK  in-
       stead  of a Bundle, even	though the "size savings" they mention are un-
       der 1% for annogen-generated apps.

       -e, --epub
       When generating an Android browser, make	it also	respond	to requests to
       open EPUB files.	This results in	an app that requests the 'read	exter-
       nal storage' permission on Android versions below 6, so if you have al-
       ready  released a version without EPUB support then devices running An-
       droid 5.x or below will not auto-update past this change	until the user
       notices the update notification and approves the	extra permission.

       --no-epub
       Cancels any earlier --epub option in Makefile variables etc

       --android-print
       When generating an Android browser, include code	to provide a Print op-
       tion (usually print to PDF) and a  simple  highlight-selection  option.
       The Print option	will require Android 4.4, but the app should still run
       without it on earlier versions of Android.

       --no-android-print
       Cancels any earlier --android-print option in Makefile variables	etc

       --known-characters=
       When generating an Android browser, include an option to	leave the most
       frequent	 characters  unannotated as 'known'. This option should	be set
       to the filename or URL of a UTF-8 file of characters separated by  new-
       lines,  assumed	to be most frequent first, with	characters on the same
       line being variants of each other (see --freq-count for one way to gen-
       erate it). Words	consisting entirely of characters found	in the first N
       lines of	this file (where N is settable by the user)  will  be  unanno-
       tated until tapped on.
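
       The following Python sketch shows one way to read the file format
       described above and apply the "first N lines" test. The helper names
       and the sample lines are hypothetical, and the generated app's actual
       logic may differ.

```python
def load_known(lines, n):
    """Collect the characters on the first n lines; variants share a line."""
    known = set()
    for line in lines[:n]:
        known.update(line.strip())
    return known

def is_known_word(word, known):
    """A word counts as 'known' only if every one of its characters is known."""
    return bool(word) and all(ch in known for ch in word)

# Hypothetical file contents, most frequent first; the last line lists variants.
lines = ["的", "一", "是", "国國"]
print(is_known_word("一国", load_known(lines, 4)))
print(is_known_word("一国", load_known(lines, 2)))
```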

       --freq-count=
       Name  of	 a file	to write that is suitable for the known-characters op-
       tion, taken from	the input examples (which should be representative  of
       typical	use). Any post-normalise table provided	will be	used to	deter-
       mine which characters are equivalent.

       --android-audio=
       When generating an Android browser, include an option  to  convert  the
       selection  to  audio  using  this  URL  as a prefix, e.g. https://exam-
       ple.org/speak.cgi?text= (use for	languages not likely to	 be  supported
       by  the	device	itself). Optionally follow the URL with	a space	(quote
       carefully) and a	maximum	number of words	to read	in each	user  request.
       Setting	a  limit is recommended, or somebody somewhere will likely try
       'Select All' on a whole book or something and create load problems. You
       should set a limit server-side too of course.
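
       As an illustration of the option's format (a URL prefix, optionally
       followed by a space and a word limit), the Python sketch below builds
       a request URL from a list of selected words. The function audio_url
       is hypothetical, and joining the words without separators before
       percent-encoding is an assumption about how the app forms the
       request.

```python
from urllib.parse import quote

def audio_url(option_value, words):
    """Build a request URL from an --android-audio value and selected words."""
    # The part before the first space is the prefix; the rest is the limit.
    prefix, _, limit = option_value.partition(" ")
    if limit:
        words = words[:int(limit)]   # cap the request at the word limit
    # Assumption: words are joined directly and percent-encoded as UTF-8.
    return prefix + quote("".join(words))

print(audio_url("https://example.org/speak.cgi?text= 2", ["你好", "世界", "再见"]))
```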

       --extra-js=
       Extra Javascript	to inject into sites to	 fix  things  in  the  Android
       browser	app.  The snippet will be run before each scan for new text to
       annotate. You may also specify a	file to	read:  --extra-js=@file.js  or
       --extra-js=@file1.js,file2.js (or URLs; do not use // comments in these
       files,  only  /*	... */ because newlines	will be	replaced), and you can
       create variants of the files by adding  search-replace  strings:	 --ex-
       tra-js=@file1.js:search:replace,file2.js

       --tts-js
       Make  Android 5+	multilingual Text-To-Speech functions available	to ex-
       tra-js scripts (see TTSInfo code	for details)

       --no-tts-js
       Cancels any earlier --tts-js option in Makefile variables etc

       --existing-ruby-js-fixes=
       Extra Javascript	to run in the Android browser app or browser extension
       whenever	existing RUBY elements are encountered;	 the  DOM  node	 above
       these  elements	will be	in the variable	n, which your code can manipu-
       late or replace to fix known problems with sites' existing  ruby	 (such
       as  common  two-syllable	words being split when they shouldn't be). Use
       with caution. You may also specify a file  or  URL  to  read:  --exist-
       ing-ruby-js-fixes=@file.js

       --existing-ruby-lang-regex=
       Set  the	 Android app or	browser	extension to remove existing ruby ele-
       ments unless the	document language matches this regular expression.  If
       --sharp-multi is in use, you can separate multiple regexes with commas,
       and any left unset will always delete existing ruby. If this option is not
       set at all then existing	ruby is	always kept.

       --existing-ruby-shortcut-yarowsky
       Set the Android browser app to 'shortcut' Yarowsky-like collocation de-
       cisions when adding glosses to existing ruby over 2 or more characters,
       so that words normally requiring	context	to be found are	more likely to
       be  found without context (this may be needed because adding glosses to
       existing	ruby is	done without regard to context)

       --extra-css=
       Extra CSS to inject into	sites to fix things  in	 the  Android  browser
       app. You may also specify a file or URL to read: --extra-css=@file.css

       --app-name=
       User-visible name of the	Android	app

       --compile-only
       Assume  the code	has already been generated by a	previous run, and just
       run the compiler

       --no-compile-only
       Cancels any earlier --compile-only option in Makefile variables etc

       -j, --javascript
       Instead of generating C code, generate JavaScript. This might be	useful
       if you want to run an annotator on a device that	has a  JS  interpreter
       but  doesn't let	you run	your own binaries. The JS will be table-driven
       to make it load faster. See comments at the start for usage.

       --no-javascript
       Cancels any earlier --javascript	option in Makefile variables etc

       -6, --js-6bit
       When generating a Javascript annotator, use a 6-bit format for many ad-
       dresses to reduce escape	codes in the data string by making more	of  it
       ASCII

       --no-js-6bit
       Cancels any earlier --js-6bit option in Makefile	variables etc

       -8, --js-octal
       When  generating	a Javascript annotator,	use octal instead of hexadeci-
       mal codes in the	data string when doing so would	save space. This  does
       not comply with ECMAScript 5 and	may give errors	in its strict mode.

       --no-js-octal
       Cancels any earlier --js-octal option in	Makefile variables etc

       -9, --ignore-ie8
       When generating a Javascript annotator, do not make it backward-compat-
       ible  with Microsoft Internet Explorer 8	and below. This	may save a few
       bytes.

       --no-ignore-ie8
       Cancels any earlier --ignore-ie8	option in Makefile variables etc

       -u, --js-utf8
       When generating a Javascript annotator, assume the script can use UTF-8
       encoding	directly and not via escape sequences. In some	browsers  this
       might work only on UTF-8	websites, and/or if your annotation can	be ex-
       pressed without the use of Unicode combining characters.

       --no-js-utf8
       Cancels any earlier --js-utf8 option in Makefile	variables etc

       --browser-extension=
       Name  of	 a Chrome or Firefox browser extension to generate. The	exten-
       sion will be placed in a	directory of the same name  (without  spaces),
       which  may  optionally  already exist and contain icons like 32.png and
       48.png to be used.

       --browser-extension-description=
       Description field to use	when generating	browser	extensions

       --manifest-v3
       Use Manifest v3 instead of Manifest v2 when generating  browser	exten-
       sions  (tested  on Chrome only, and requires Chrome 88 or higher). This
       is now required for all Chrome Web Store	uploads.

       --gecko-id=
       A Gecko (Firefox) ID to embed in the browser extension

       --dart
       Instead of generating C code, generate Dart. This might	be  useful  if
       you want	to run an annotator in a Flutter application.

       --no-dart
       Cancels any earlier --dart option in Makefile variables etc

       --dart-datafile=
       When  generating	Dart code, put annotator data into a separate file and
       open it using this pathname. Not	compatible with	Dart's "Web  app"  op-
       tion,  but  might  save	space  in a Flutter app	(especially along with
       --zlib)

       -Y, --python
       Instead of generating C code, generate a	Python module. Similar to  the
       Javascript  option,  this  is for when you can't	run your own binaries,
       and it is table-driven for fast loading.

       --no-python
       Cancels any earlier --python option in Makefile variables etc

       --reannotator=
       Shell command through which to pipe each	word of	the original  text  to
       obtain  new  annotation	for that word. This might be useful as a quick
       way of generating a new annotator (e.g. for a different topolect) while
       keeping the information about word separation and/or glosses  from  the
       previous	 annotator,  but  it is	limited	to commands that don't need to
       look beyond the boundaries of each word.	If the command is prefixed  by
       a  # character, it will be given	the word's existing annotation instead
       of its original text, and if prefixed by	## it will be given text#anno-
       tation. The command should treat	each line of its input	independently,
       and  both  its input and	its output should be in	the encoding specified
       by --outcode.
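
       A reannotator command only needs to transform each input line
       independently. The Python sketch below shows the shape of such a
       filter for the ## form, where each line arrives as text#annotation.
       The transformation (dropping tone digits) and the function names are
       purely illustrative assumptions.

```python
import io

def reannotate(line):
    """Turn one 'text#annotation' line into a new annotation."""
    text, _, annotation = line.partition("#")
    # Illustrative change only: strip tone digits from the old annotation.
    return "".join(ch for ch in annotation if not ch.isdigit())

def filter_stream(infile, outfile):
    """Process each line independently, as a reannotator must."""
    for line in infile:
        outfile.write(reannotate(line.rstrip("\n")) + "\n")

# Demonstration with in-memory streams; a real script would pass
# sys.stdin and sys.stdout, opened in the --outcode encoding.
demo_out = io.StringIO()
filter_stream(io.StringIO("你好#ni3 hao3\n"), demo_out)
print(demo_out.getvalue(), end="")
```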

       -A, --reannotate-caps
       When using --reannotator, make sure to capitalise any word  it  returns
       that began with a capital on input

       --no-reannotate-caps
       Cancels any earlier --reannotate-caps option in Makefile	variables etc

       --sharp-multi
       Assume  annotation  (or	reannotator output) contains multiple alterna-
       tives separated by # (e.g. pinyin#Yale) and include code	to select  one
       by  number at runtime (starting from 0).	This is	to save	on total space
       when shipping multiple annotators that share the	same word grouping and
       gloss data, differing only in the transcription of each word.
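
       Selecting one alternative by number amounts to splitting on # and
       indexing from 0, as in this minimal Python sketch (the function name
       is hypothetical; the generated annotators implement this in their
       own target language).

```python
def select_annotation(annotation, annot_no):
    """Return alternative number annot_no (counting from 0)."""
    return annotation.split("#")[annot_no]

print(select_annotation("pinyin#Yale", 0))
print(select_annotation("pinyin#Yale", 1))
```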

       --no-sharp-multi
       Cancels any earlier --sharp-multi option	in Makefile variables etc

       --annotation-names=
       Comma-separated list of annotation types	supplied to sharp-multi	 (e.g.
       Pinyin,Yale),  if you want the Android app etc to be able to name them.
       You can also set just one annotation name here if you are not using
       sharp-multi.

       --annotation-map=
       Comma-separated	list  of  annotation-number overrides for sharp-multi,
       e.g. 7=3	to take	the 3rd	item if	a 7th is selected

       --annotation-postprocess=
       Extra code for post-processing specific annotNo	selections  after  re-
       trieving	from a sharp-multi list	(@file or @url allowed)

       -o, --allow-overlaps
       Normally,  the analyser avoids generating rules that could overlap with
       each other in a way that	would leave the	program	not knowing which  one
       to  apply. If a short rule would	cause overlaps,	the analyser will pre-
       fer to generate a longer	rule that uses more context, and if  even  the
       entire  phrase cannot be	made into a rule without causing overlaps then
       the analyser will give up on trying to cover that phrase.  This	option
       allows  the  analyser  to generate rules	that could overlap, as long as
       none of the  overlaps  would  cause  actual  problems  in  the  example
       phrases.	 Thus more of the examples can be covered, at the expense of a
       higher risk of ambiguity	problems when  applying	 the  rules  to	 other
       texts. See also the -y option.

       --no-allow-overlaps
       Cancels any earlier --allow-overlaps option in Makefile variables etc

       -y, --ybytes=
       Look  for  candidate  Yarowsky  seed-collocations within	this number of
       bytes of	the end	of a word. If this is set then overlaps	and rule  con-
       flicts  will  be	 allowed when seed collocations	can be used to distin-
       guish between them, and the analysis is likely to be faster. Markup ex-
       amples that are completely  separate  (e.g.  sentences  from  different
       sources)	 must  have at least this number of (non-whitespace) bytes be-
       tween them.
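
       The idea of a byte window around a word can be pictured with the
       Python sketch below. It is a simplified assumption about what
       "within this number of bytes" means (the window here extends ybytes
       either side of the word and includes the word itself) and is not the
       analyser's actual search; has_collocation and the sample sentence
       are hypothetical.

```python
def has_collocation(corpus, word_start, word_end, seed, ybytes):
    """Check whether seed occurs within ybytes bytes of the word's edges."""
    window = corpus[max(0, word_start - ybytes): word_end + ybytes]
    return seed in window

corpus = "我去银行取钱".encode("utf-8")   # each character is 3 bytes in UTF-8
word = "行".encode("utf-8")
start = corpus.find(word)
end = start + len(word)
seed = "钱".encode("utf-8")

print(has_collocation(corpus, start, end, seed, 6))
print(has_collocation(corpus, start, end, seed, 3))
```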

       --ybytes-max=
       Extend the Yarowsky seed-collocation search to check over larger	ranges
       up to this maximum. If this is set then several ranges will be  checked
       in  an  attempt	to  determine the best one for each word, but see also
       ymax-threshold and ymax-limitwords.

       --ymax-threshold=
       Limits the length of word that  receives	 the  narrower-range  Yarowsky
       search  when  ybytes-max	 is  in	 use.  For words longer	than this, the
       search will go directly to ybytes-max. This is for languages where  the
       likelihood  of  a  word's  annotation being influenced by its immediate
       neighbours more than its	distant	 collocations  increases  for  shorter
       words, and less is to be	gained by comparing different ranges when pro-
       cessing	longer	words. Setting this to 0 means no limit, i.e. the full
       range will be explored on all Yarowsky checks.

       --ymax-limitwords=
       Comma-separated list of words (without annotation markup) for which the
       ybytes expansion	loop should run	at most	two iterations.	 This  may  be
       useful to reduce	compile	times for very common ambiguous	words that de-
       pend  only on their immediate neighbours. Annogen may suggest words for
       this option if it finds they take inordinate time to process.

       --ybytes-step=
       The increment value for the loop	between	ybytes and ybytes-max
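
       The interaction of --ybytes, --ybytes-max, --ybytes-step,
       --ymax-threshold and --ymax-limitwords can be pictured with the
       following sketch. It is only a reading of the option descriptions
       above, not Annogen's code: the function name is hypothetical, word
       length is assumed to be measured in UTF-8 bytes, the step is assumed
       to be additive, and "at most two iterations" is assumed to mean the
       first two ranges.

```python
def yarowsky_ranges(word, ybytes, ybytes_max=0, ybytes_step=1,
                    ymax_threshold=0, ymax_limitwords=()):
    """Hypothetical helper: list the search ranges tried for one word."""
    # Without a larger maximum there is only the single ybytes range.
    if not ybytes_max or ybytes_max <= ybytes:
        return [ybytes]
    # Words longer than the threshold go directly to ybytes-max
    # (a threshold of 0 means no limit).
    if ymax_threshold and len(word.encode("utf-8")) > ymax_threshold:
        return [ybytes_max]
    # Otherwise step from ybytes up to ybytes-max.
    ranges = list(range(ybytes, ybytes_max + 1, ybytes_step))
    # Words listed in ymax-limitwords get at most two iterations.
    if word in ymax_limitwords:
        ranges = ranges[:2]
    return ranges
```

       For instance, yarowsky_ranges("ab", 20, 60, 20) returns [20, 40, 60]
       under these assumptions, while listing "ab" in ymax_limitwords trims
       that to [20, 40].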

       -k, --warn-yarowsky
       Warn when absolutely no distinguishing Yarowsky seed  collocations  can
       be found	for a word in the examples

       --no-warn-yarowsky
       Cancels any earlier --warn-yarowsky option in Makefile variables	etc

       -K, --yarowsky-all
       Accept Yarowsky seed collocations even from input characters that never
       occur  in  annotated  words  (this  might include punctuation and exam-
       ple-separation markup)

       --no-yarowsky-all
       Cancels any earlier --yarowsky-all option in Makefile variables etc

       --yarowsky-multiword
       Check potential multiword rules for Yarowsky  seed  collocations	 also.
       Without this option (default), only single-word rules are checked.

       --no-yarowsky-multiword
       Cancels	any  earlier --yarowsky-multiword option in Makefile variables
       etc

       --yarowsky-thorough
       Recheck Yarowsky	seed collocations when checking	if any multiword  rule
       would  be  needed  to reproduce the examples. This could	risk 'overfit-
       ting' the example set.

       --no-yarowsky-thorough
       Cancels any earlier --yarowsky-thorough option  in  Makefile  variables
       etc

       --yarowsky-half-thorough
       Like  --yarowsky-thorough but check only	what collocations occur	within
       the proposed new	rule (not around it), less likely to overfit

       --no-yarowsky-half-thorough
       Cancels any earlier --yarowsky-half-thorough option in  Makefile	 vari-
       ables etc

       --yarowsky-debug=
       Report  the  details of seed-collocation	false positives	if there are a
       large number of matches and at most this	number of false	positives (de-
       fault 1). Occasionally these might be due to typos in the corpus, so it
       might be	worth a	check.

       --allow-exceptions=
       Filename (or URL) of any known exceptions for --yarowsky-debug checks
       (default allow-exceptions.txt)

       --normalise-debug=
       When --capitalisation is not in effect, report words that are usually
       capitalised but that have at most this number of lower-case exceptions
       (default 1) for investigation of possible typos in the corpus

       --allow-caps-exceptions=
       Filename (or URL) of any known exceptions for --normalise-debug checks
       (default allow-caps-exceptions.txt)

       --debug-dir=
       Directory in which to write reports of possible typos etc (defaults  to
       current directory)

       --normalise-cache=
       Optional	 file to use to	cache the result of normalisation. Adding .gz,
       .bz2 or .xz for compression is acceptable.

       -1, --single-words
       Do not generate any rule	longer than 1 word, although it	can still have
       Yarowsky	seed collocations if -y	is set.	This speeds up the search, but
       at the expense of thoroughness. You might want to use this in
       conjunction with -y to make a parser quickly.

       --no-single-words
       Cancels any earlier --single-words option in Makefile variables etc

       --max-words=
       Limits  the number of words in a	rule. 0	means no limit.	--single-words
       is equivalent to	--max-words=1. If you need to limit the	 search	 time,
       and  are	 using -y, it should suffice to	use --single-words for a quick
       annotator or --max-words=5 for  a  more	thorough  one  (or  try	 3  if
       --yarowsky-half-thorough	is in use).
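
       The advice above can be turned into a small helper for batch scripts.
       This is a sketch only: the helper name is hypothetical, the executable
       is assumed to be installed as "annogen", and only options documented
       on this page are used.

```python
def annogen_command(infile, ybytes, quick=False, half_thorough=False):
    """Hypothetical helper: build an argument list for a time-limited run."""
    args = ["annogen", "--infile=" + infile, "--ybytes=" + str(ybytes)]
    if quick:
        # Quick annotator: same effect as --max-words=1.
        args.append("--single-words")
    elif half_thorough:
        # A lower word limit is suggested with --yarowsky-half-thorough.
        args += ["--yarowsky-half-thorough", "--max-words=3"]
    else:
        # More thorough annotator.
        args.append("--max-words=5")
    return args


if __name__ == "__main__":
    # "corpus.txt" and 80 are placeholder values for illustration.
    print(" ".join(annogen_command("corpus.txt", 80, quick=True)))
```

       The returned list can be passed to subprocess.run() once any output
       options needed for your target language have been added.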

       --multiword-end-avoid=
       Comma-separated	list  of words (without	annotation markup) that	should
       be avoided at the end of	a multiword rule (e.g. sandhi likely to	depend
       on the following	word)

       -d, --diagnose=
       Output some diagnostics for the specified word. Use this	option to help
       answer "why doesn't it have a rule for...?" issues. This	option expects
       the word	without	markup and uses	the system locale (UTF-8 if it	cannot
       be detected).

       --diagnose-limit=
       Maximum number of phrases to print diagnostics for (0 means unlimited).
       Default:	10

       -m, --diagnose-manual
       Check and diagnose potential failures of	--manualrules

       --no-diagnose-manual
       Cancels any earlier --diagnose-manual option in Makefile	variables etc

       -q, --diagnose-quick
       Ignore  all phrases that	do not contain the word	specified by the --di-
       agnose option, for getting a faster (but	possibly less accurate)	 diag-
       nostic.	The  generated	annotator is not likely	to be useful when this
       option is present.

       --no-diagnose-quick
       Cancels any earlier --diagnose-quick option in Makefile variables etc

       --priority-list=
       Instead of generating an	annotator, use the input examples to  generate
       a  list of (non-annotated) words	with priority numbers, a higher	number
       meaning the word	should have greater preferential treatment in ambigui-
       ties, and write it to this file (or compressed .gz, .bz2	or .xz	file).
       If  the	file provided already exists, it will be updated, thus you can
       amend an	existing usage-frequency list or similar (although  the	 final
       numbers	are  priorities	 and might no longer match usage-frequency ex-
       actly). The purpose of this option is to	help if	you have  an  existing
       word-priority-based text	segmenter and wish to update its data from the
       examples;  this	approach might not be as good as the Yarowsky-like one
       (especially when	the same word has multiple readings to	choose	from),
       but  when  there	are integration	issues with existing code you might at
       least be	able to	improve	its word-priority data.

       -t, --time-estimate
       Estimate	time to	completion. The	code to	do this	is unreliable  and  is
       prone to	underestimate. If you turn it on, its estimate is displayed at
       the end of the status line as days, hours or minutes.

       --no-time-estimate
       Cancels any earlier --time-estimate option in Makefile variables	etc

       -0, --single-core
       Use only	one CPU	core even when others are available on Unix

       --no-single-core
       Cancels any earlier --single-core option	in Makefile variables etc

       --cores-command=
       Command	to  run	when changing the number of CPU	cores in use (with new
       number as a parameter); this can	 run  a	 script	 to  pause/resume  any
       lower-priority load
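
       A script for --cores-command might look like the sketch below. It
       assumes that the new core count arrives as the first command-line
       argument and that the lower-priority job's process ID is supplied in a
       hypothetical BACKGROUND_PID environment variable; the pause_at
       threshold is likewise an illustrative choice, not something Annogen
       defines. It is Unix-only because it relies on SIGSTOP and SIGCONT.

```python
import os
import signal
import sys


def choose_signal(cores_in_use, pause_at=2):
    """Pause the background job once this many cores are in use."""
    return signal.SIGSTOP if cores_in_use >= pause_at else signal.SIGCONT


def main(argv):
    # Expect exactly one parameter: the new number of cores in use.
    if len(argv) != 2 or not argv[1].isdigit():
        print("usage: cores_hook.py NEW_CORE_COUNT", file=sys.stderr)
        return 2
    pid = os.environ.get("BACKGROUND_PID")  # hypothetical variable
    if pid is None:
        return 0  # nothing to pause or resume
    os.kill(int(pid), choose_signal(int(argv[1])))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

       Saved as an executable file (here given the hypothetical name
       cores_hook.py), it could be supplied as the value of --cores-command.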

       -p, --status-prefix=
       Label  to add at	the start of the status	line, for use if you batch-run
       annogen in multiple configurations and want to know which one  is  cur-
       rently running

Legal considerations
       Annotator  code will contain individual words and some phrases from the
       original	corpus (and these can be read even by people who do  not  have
       the unannotated version); with regard to copyright law, I expect the
       annotator code will count as an "index" to the  collection,  the	 copy-
       right  of  which	 exists	separately to that of the original collection,
       but laws	do vary	by country and I am not	a solicitor so please act  ju-
       diciously.

       Legally	obtaining  that	original annotated corpus is up	to you.	If you
       are in the UK the government says non-commercial	text mining is allowed
       (terms of use prohibiting  non-commercial  mining  are  unenforceable),
       provided	you:

       1.  respect network stability (i.e. wait	a long time between each down-
	   load),

       2.  connect  directly  to  the  publisher  (this	 law bypasses the pub-
	   lisher's terms of use, not those of third-party search engines like
	   Google),

       3.  use the result only for mining, not for republishing	 the  original
	   text	(so you	can't publish your unprocessed crawl dumps either),

       4.  and	still respect any prohibitions against sharing whatever	mining
	   tools you made for the site (as this	law is only about text mining,
	   not about the sharing of tools).

       Laws outside the UK are different (and I'm not a lawyer) so check
       carefully. Gao et al. 2020's paper on "The Pile"
       https://arxiv.org/abs/2101.00027 claims published crawl dumps with
       limited processing might be permissible under American copyright law as
       transformative fair use,	but I'm	not sure how legally watertight	 their
       argument	 is:  it might be safer	to keep	unlicensed parts of the	corpus
       private and publish only	the resulting index.

       If the website's	terms don't actually prohibit writing  an  unpublished
       scraper	for  non-commercial  mining purposes, perhaps you won't	need a
       legal exception for the crawling	part--but  you	should	still  respect
       their  bandwidth	 and  do  it  slowly, both for moral reasons (it's the
       right thing to do) and pragmatic	ones (you won't	want  their  sysadmins
       and service providers taking action against you).

Citation
       If you need to cite a peer-reviewed paper:

       Silas  S.  Brown. Web Annotation	with Modified-Yarowsky and Other Algo-
       rithms. Overload	112 (December 2012) pp.4-7.

Silas S. Brown			  August 2025			    ANNOGEN(1)