ANNOGEN(1) General Commands Manual ANNOGEN(1)
Annotator Generator is an examples-driven generator of fast text annotators.
"Annotate" in this context means to add pronunciation or other information to
each word, and/or to split text into words in a language that does not use
spaces.
o You supply a corpus of pre-annotated texts for Annotator Generator
to work out the rules and exceptions.
o Annotator Generator creates table-driven code in C, Java,
Javascript, Dart or Python (compatible with both Python 2 and 3).
o The resulting program should be able to annotate any text that
contains words or phrases similar to those found in the examples.
o It can output the annotations alone, or it can combine them with the
original text using HTML Ruby markup or simple braces.
o If anything is unclear (it didn't happen in the examples, or there's
not enough context to figure out which example should be applied)
then the program will leave it unannotated so you can pass it to a
backup annotation program if you have one.
o If you have no backup annotator then try setting the -y option,
which makes Annotator Generator try harder to find context-independent
rules with context-dependent exceptions, so as to annotate as
much text as possible.
o Generated annotators can act as filters for Web Adjuster; options
are also provided for generating Android apps, browser extensions,
and clipboard annotators for Windows and Windows Mobile, or you
could format the annotations on a Unix terminal.
-h, --help
show this help message and exit
--infile=
Filename of a text file (or a compressed .gz, .bz2 or .xz file or URL)
to read the input examples from. If this is not specified, standard in-
put is used.
--incode=
Character encoding of the input file (default utf-8)
--mstart=
The string that starts a piece of text with annotation markup in the
input examples; default <ruby><rb>
--mmid=
The string that occurs in the middle of a piece of markup in the input
examples, with the word on its left and the added markup on its right
(or the other way around if mreverse is set); default </rb><rt>
--mend=
The string that ends a piece of annotation markup in the input exam-
ples; default </rt></ruby>
-r, --mreverse
Specifies that the annotation markup is reversed, so the text before
mmid is the annotation and the text after it is the base text
--no-mreverse
Cancels any earlier --mreverse option in Makefile variables etc
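The following is a minimal sketch of how the three markers and the reversal flag delimit an example, assuming the defaults listed above. The function name parse_examples and the sample text are illustrative only and are not part of annogen itself.

```python
import re

def parse_examples(text, mstart="<ruby><rb>", mmid="</rb><rt>",
                   mend="</rt></ruby>", mreverse=False):
    """Return (base_text, annotation) pairs found in marked-up text."""
    pattern = re.compile(
        re.escape(mstart) + "(.*?)" + re.escape(mmid) + "(.*?)" + re.escape(mend),
        re.DOTALL)
    pairs = []
    for left, right in pattern.findall(text):
        # With mreverse set, the annotation comes first and the base text second
        pairs.append((right, left) if mreverse else (left, right))
    return pairs

# Hypothetical two-word example in the default input format
sample = ("<ruby><rb>你好</rb><rt>nǐ hǎo</rt></ruby>"
          "<ruby><rb>世界</rb><rt>shìjiè</rt></ruby>")
pairs = parse_examples(sample)
```

A corpus that uses, say, {annotation|word} would pass mstart="{", mmid="|", mend="}" and mreverse=True to the same function.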
--end-pri=
Treat words that occur in the examples before this delimiter as having
"high priority" for Yarowsky-like seed collocations (if these are in
use). Normally the Yarowsky-like logic tries to identify a "default"
annotation based on what is most common in the examples, with the ex-
ceptions indicated by collocations. If however a word is found in a
high-priority section at the start, then the first annotation found
there will be taken as the ideal "default" even if it's in a minority
in the examples; everything else will be taken as an exception.
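The choice of "default" described above can be sketched roughly as follows. This is a simplified illustration, not annogen's actual code; the function name and the sample readings are hypothetical.

```python
from collections import Counter

def choose_default(annotations, priority_annotations=()):
    """Pick the default annotation for one word.

    annotations: every annotation seen for the word in the examples.
    priority_annotations: annotations seen for it in the high-priority
    section (before the --end-pri delimiter), in order of appearance.
    """
    if priority_annotations:
        # The first annotation found in the high-priority section wins,
        # even if it is in a minority overall
        return priority_annotations[0]
    # Otherwise the most common annotation is taken as the default
    return Counter(annotations).most_common(1)[0][0]

seen = ["háng", "xíng", "xíng"]          # hypothetical readings of one word
print(choose_default(seen) == "xíng")
print(choose_default(seen, ["háng"]) == "háng")
```

Every annotation other than the chosen default would then be treated as an exception to be indicated by collocations.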
-s, --spaces
Set this if you are working with a language that uses whitespace in its
non-markedup version (not fully tested). The default is to assume that
there will not be any whitespace in the language, which is correct for
Chinese and Japanese.
--no-spaces
Cancels any earlier --spaces option in Makefile variables etc
-c, --capitalisation
Don't try to normalise capitalisation in the input. Normally, to sim-
plify the rules, the analyser will try to remove start-of-sentence cap-
itals in annotations, so that the only remaining words with capital
letters are the ones that are always capitalised such as names. (That's
not perfect: some words might always be capitalised just because they
never occur mid-sentence in the examples.) If this option is used, the
analyser will instead try to "learn" how to predict the capitalisation
of all words (including start of sentence words) from their contexts.
--no-capitalisation
Cancels any earlier --capitalisation option in Makefile variables etc
-w, --annot-whitespace
Don't try to normalise the use of whitespace and hyphenation in the ex-
ample annotations. Normally the analyser will try to do this, to reduce
the risk of missing possible rules due to minor typographical varia-
tions.
--no-annot-whitespace
Cancels any earlier --annot-whitespace option in Makefile variables etc
--keep-whitespace=
Comma-separated list of words (without annotation markup) for which
whitespace and hyphenation should always be kept even without the --an-
not-whitespace option. Use when you know the variation is legitimate.
This option expects words to be encoded using the system locale (UTF-8
if it cannot be detected).
--suffix=
Comma-separated list of annotations that can be considered optional
suffixes for normalisation
--suffix-minlen=
Minimum length of word (in Unicode characters) to apply suffix normali-
sation
--post-normalise=
Filename or URL of an optional Python module defining a dictionary
called 'table' mapping integers to integers for arbitrary single-char-
acter normalisation on the Unicode BMP. This can reduce the size of the
annotator. It is applied in post-processing (does not affect rules gen-
eration itself). For example this can be used to merge the recognition
of Full, Simplified and Variant forms of the same Chinese character in
cases where this can be done without ambiguity, if it is acceptable for
the generated annotator to recognise mixed-script words should they oc-
cur. If any word in the examples has a different annotation when nor-
malised than not, the normalised version takes precedence.
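A module of the kind described might look like the sketch below. The filename and the two mappings are only an illustration; whether a given pair can really be merged without ambiguity depends on your corpus.

```python
# normalise_table.py (hypothetical filename)
# 'table' maps Unicode code points to code points, all within the BMP.
table = {
    0x8AAA: 0x8BF4,  # Full-form character U+8AAA -> Simplified U+8BF4
    0x8A9E: 0x8BED,  # Full-form character U+8A9E -> Simplified U+8BED
}

# The same dictionary shape works with str.translate, which is a quick
# way to check the mapping by hand before passing it to --post-normalise
normalised = "\u8aaa\u8a9e".translate(table)
```

The file would then be named in the option, for example --post-normalise=normalise_table.py.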
--glossfile=
Filename of an optional text file (or compressed .gz, .bz2 or .xz file
or URL) to read auxiliary "gloss" information. Each line of this should
be of the form: word (tab) annotation (tab) gloss. Extra tabs in the
gloss will be converted to newlines (useful if you want to quote multi-
ple dictionaries). When the compiled annotator generates ruby markup,
it will add the gloss string as a popup title whenever that word is
used with that annotation (before any reannotator option is applied).
The annotation field may be left blank to indicate that the gloss will
appear for all other annotations of that word. The entries in glossfile
do not affect the annotation process itself, so it's not necessary to
completely debug glossfile's word segmentation etc.
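As a rough illustration of the line format above (word, tab, annotation, tab, gloss, with extra tabs turned into newlines), a reader for one line could be sketched like this. The function name is hypothetical, and skipping short lines is an assumption of this sketch rather than documented behaviour.

```python
def parse_gloss_line(line):
    """Split one glossfile line into (word, annotation, gloss)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None  # incomplete entry; this sketch simply skips it
    word, annotation = fields[0], fields[1]
    gloss = "\n".join(fields[2:])  # extra tabs in the gloss become newlines
    return word, annotation, gloss

# Two dictionaries quoted in one entry, separated by an extra tab
entry = parse_gloss_line("你好\tnǐ hǎo\thello\thi (informal)\n")
# A blank annotation field: the gloss applies to all other annotations
fallback = parse_gloss_line("好\t\tgood")
```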
-C, --gloss-closure=
If any Chinese, Japanese or Korean word is missing from glossfile,
search its closure of variant characters also, using the Unihan vari-
ants file (or URL) specified by this option
--no-gloss-closure
Cancels any earlier --gloss-closure option in Makefile variables etc
-M, --glossmiss-omit
Omit rules containing any word not mentioned in glossfile. Might be
useful if you want to train on a text that uses proprietary terms and
don't want to accidentally 'leak' those terms (assuming they're not ac-
cidentally included in glossfile also). Words may also be listed in
glossfile with an empty gloss field to indicate that no gloss is avail-
able but rules using this word needn't be omitted.
--no-glossmiss-omit
Cancels any earlier --glossmiss-omit option in Makefile variables etc
--words-omit=
File (or compressed .gz, .bz2 or .xz file or URL) containing words (one
per line, without markup) to omit from the annotator. Use this to make
an annotator smaller if, for example, you're working from a rules file
that contains long lists of place names you don't need this particular
annotator to recognise but you still want to keep them as rules for
other annotators, but be careful because any word on such a list gets
omitted even if it also has other meanings (some place names are also
normal words).
--manualrules=
Filename of an optional text file (or compressed .gz, .bz2 or .xz file
or URL) to read extra, manually-written rules. Each line of this should
be a marked-up phrase (in the input format) which is to be uncondition-
ally added as a rule. Use this sparingly, because these rules are not
taken into account when generating the others and they will be applied
regardless of context (although a manual rule might fail to activate if
the annotator is part-way through processing a different rule); try
checking messages from --diagnose-manual.
--c-filename=
Where to write the C, C#, Python, Javascript, Go or Dart program. De-
faults to standard output, or annotator.c in the system temporary di-
rectory if standard output seems to be the terminal (the program might
be large, especially if Yarowsky-like indicators are not used, so it's
best not to use a server home directory where you might have limited
quota).
--c-compiler=
The C compiler to run if generating C and standard output is not con-
nected to a pipe. The default is to use the "cc" command which usually
redirects to your "normal" compiler. You can add options (remembering
to enclose this whole parameter in quotes if it contains spaces), but
if the C program is large then adding optimisation options may make the
compile take a long time. If standard output is connected to a pipe,
then this option is ignored because the C code will simply be written
to the pipe. You can also set this option to an empty string to skip
compilation. Default: cc -o annotator
--outcode=
Character encoding to use in the generated parser (default utf-8, must
be ASCII-compatible i.e. not utf-16)
--rulesFile=
Filename of a JSON file to hold the accumulated rules. Adding .gz, .bz2
or .xz for compression is acceptable. If this is set then either
--write-rules or --read-rules must be specified.
--write-rules
Write rulesFile instead of generating a parser. You will then need to
rerun with --read-rules later.
--no-write-rules
Cancels any earlier --write-rules option in Makefile variables etc
--read-rules
Read rulesFile from a previous run, and apply the output options to it.
You should still specify the input formatting options (which should not
change), and any glossfile or manualrules options (which may change),
but no input is required.
--no-read-rules
Cancels any earlier --read-rules option in Makefile variables etc
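The two-pass workflow with --write-rules and --read-rules might be scripted as in the sketch below. The way annogen is invoked (python annogen.py) and all filenames are assumptions; only the option names come from this page.

```python
import shlex

ANNOGEN = ["python", "annogen.py"]  # assumed invocation; adjust for your install

def write_rules_cmd(infile, rules_file):
    """First pass: analyse the examples and save the accumulated rules."""
    return ANNOGEN + ["--infile=" + infile,
                      "--rulesFile=" + rules_file,
                      "--write-rules"]

def read_rules_cmd(rules_file, out_file):
    """Second pass: no input needed, just apply output options to the rules."""
    return ANNOGEN + ["--rulesFile=" + rules_file,
                      "--read-rules",
                      "--c-filename=" + out_file]

for cmd in (write_rules_cmd("examples.txt.gz", "rules.json.gz"),
            read_rules_cmd("rules.json.gz", "annotator.c")):
    # Print the command lines; pass cmd to subprocess.run to execute them
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Any input formatting options (such as --mstart) would be repeated unchanged on the second command, while glossfile and manualrules options may differ.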
-E, --newlines-reset
Have the annotator reset its state on every newline byte. By default
newlines do not affect state such as whether a space is required before
the next word, so that if the annotator is used with Web Adjuster's
htmlText option (which defaults to using newline separators) the spac-
ing should be handled sensibly when there is HTML markup in mid-sen-
tence.
--no-newlines-reset
Cancels any earlier --newlines-reset option in Makefile variables etc
-z, --compress
Compress annotation strings in the C code. This compression is designed
for fast on-the-fly decoding, so it saves only a limited amount of
space (typically 10-20%) but might help if RAM is short.
--no-compress
Cancels any earlier --compress option in Makefile variables etc
-Z, --zlib
Compress the embedded data table using zlib (or pyzopfli if available),
and include code to call zlib to decompress it on load. Useful if the
runtime machine has the zlib library and you need to save disk space
but not RAM (the decompressed table is stored separately in RAM, unlike
--compress which, although giving less compression, at least works 'in
place'). Once --zlib is in use, specifying --compress too will typi-
cally give an additional disk space saving of less than 1% (and a run-
time RAM saving that's greater but more than offset by zlib's extrac-
tion RAM). If generating a Javascript annotator with zlib, the decom-
pression code is inlined so there's no runtime zlib dependency, but
startup can be ~50% slower so this option is not recommended in situa-
tions where the annotator is frequently reloaded from source (unless
you're running on Node.js in which case loading is faster due to the
use of Node's "Buffer" class).
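The disk-versus-RAM trade-off of --zlib can be illustrated with Python's own zlib module. This is only a sketch of the general idea (compress the table when generating, decompress once on load); the function names and the stand-in data are hypothetical and do not reflect annogen's table format.

```python
import zlib

def pack_table(data: bytes) -> bytes:
    """Done at generation time: the packed form is what gets stored on disk."""
    return zlib.compress(data, 9)

def load_table(packed: bytes) -> bytes:
    """Done at startup: the unpacked copy lives in RAM alongside the packed one."""
    return zlib.decompress(packed)

table = b"ni3 hao3\x00shi4 jie4\x00" * 200   # stand-in for an annotator data table
packed = pack_table(table)
print(len(table), "bytes unpacked,", len(packed), "bytes packed")
```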
--no-zlib
Cancels any earlier --zlib option in Makefile variables etc
-l, --library
Instead of generating C code that reads and writes standard input/out-
put, generate a C library suitable for loading into Python via ctypes.
This can be used for example to preload a filter into Web Adjuster to
cut process-startup delays.
--no-library
Cancels any earlier --library option in Makefile variables etc
-W, --windows-clipboard
Include C code to read the clipboard on Windows or Windows Mobile and
to write an annotated HTML file and launch a browser, instead of using
the default cross-platform command-line C wrapper. See the start of the
generated C file for instructions on how to compile for Windows or Win-
dows Mobile.
--no-windows-clipboard
Cancels any earlier --windows-clipboard option in Makefile variables
etc
--java=
Instead of generating C code, generate Java, and place the *.java files
in the directory specified by this option. The last part of the direc-
tory should be made up of the package name; a double slash (//) should
separate the rest of the path from the package name, e.g.
--java=/path/to/wherever//org/example/annotator and the main class will
be called Annotator.
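To show how the double slash separates the output path from the package name, here is a small sketch. The function name is hypothetical, and it assumes the package directories map to a dotted Java package name in the usual way.

```python
def split_java_option(value):
    """Split a --java value into (base_dir, package_name, source_dir)."""
    base, sep, package_path = value.partition("//")
    if not sep or not package_path:
        raise ValueError("expected /path/to/wherever//package/name")
    package_path = package_path.strip("/")
    package_name = package_path.replace("/", ".")
    return base, package_name, base + "/" + package_path

base, package, source_dir = split_java_option(
    "/path/to/wherever//org/example/annotator")
print(package)      # org.example.annotator
print(source_dir)   # /path/to/wherever/org/example/annotator
```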
--android=
URL for an Android app to browse (--java must be set). If this is set,
code is generated for an Android app which starts a browser with that
URL as the start page, and annotates the text on every page it loads.
Use file:///android_asset/index.html for local HTML files in the assets
directory; a clipboard viewer is placed in clipboard.html, and the app
will also be able to handle shared text. If certain environment vari-
ables are set, this option can also compile and sign the app using An-
droid SDK command-line tools (otherwise it puts a message on stderr ex-
plaining what needs to be set)
--android-template=
File (or URL) to use as a template for Android start HTML. This option
implies --android=file:///android_asset/index.html and generates that
index.html from the file specified (or from a built-in default if the
special filename 'blank' is used). The template file may include
URL_BOX_GOES_HERE to show a URL entry box and related items (of-
fline-clipboard link etc) in the page, in which case you can optionally
define a Javascript function 'annotUrlTrans' to pre-convert some URLs
from shortcuts etc; also enables better zoom controls on Android 4+, a
mode selector if you use --annotation-names, a selection scope control
on recent-enough WebKit, and a visible version stamp (which, if the de-
vice is in 'developer mode', you may double-tap on to show missing
glosses). VERSION_GOES_HERE may also be included if you want to put it
somewhere other than at the bottom of the page. If you do include
URL_BOX_GOES_HERE you'll have an annotating Web browser app that allows
the user to navigate to arbitrary URLs: as of 2020, this is acceptable
on Google Play and Huawei AppGallery (non-China only from 2022), but
not Amazon AppStore as they don't want 'competition' to their Silk
browser.
--gloss-simplify=
A regular expression matching parts of glosses to remove when generat-
ing a '3-line' format in apps, but not for hover titles or popups. De-
fault removes parenthesised expressions if not solitary, anything after
the first slash or semicolon, and the leading word 'to'. Can be set to
empty string to omit simplification.
-L, --pleco-hanping
In the Android app, make popup definitions link to Pleco or Hanping if
installed
--no-pleco-hanping
Cancels any earlier --pleco-hanping option in Makefile variables etc
--bookmarks=
Android bookmarks: comma-separated list of package names that share our
bookmarks. If this is not specified, the browser will not be given a
bookmarks function. If it is set to the same value as the package spec-
ified in --java, bookmarks are kept in just this Android app. If it is
set to a comma-separated list of packages that have also been generated
by annogen (presumably with different annotation types), and if each
one has the same android:sharedUserId attribute in AndroidMani-
fest.xml's 'manifest' tag (you'll need to add this manually), and if
the same certificate is used to sign all of them, then bookmarks can be
shared across the set of browser apps. But beware the following two is-
sues: (1) adding an android:sharedUserId attribute to an app that has
already been released without one causes some devices to refuse the up-
date with a 'cannot install' message (details via adb logcat; affected
users would need to uninstall and reinstall instead of update, and some
of them may not notice the instruction to do so); (2) this has not been
tested with Google's new "App Bundle" arrangement, and may be broken if
the Bundle results in APKs being signed by a different key. In June
2019 Play Console started issuing warnings if you release an APK in-
stead of a Bundle, even though the "size savings" they mention are un-
der 1% for annogen-generated apps.
-e, --epub
When generating an Android browser, make it also respond to requests to
open EPUB files. This results in an app that requests the 'read exter-
nal storage' permission on Android versions below 6, so if you have al-
ready released a version without EPUB support then devices running An-
droid 5.x or below will not auto-update past this change until the user
notices the update notification and approves the extra permission.
--no-epub
Cancels any earlier --epub option in Makefile variables etc
--android-print
When generating an Android browser, include code to provide a Print op-
tion (usually print to PDF) and a simple highlight-selection option.
The Print option will require Android 4.4, but the app should still run
without it on earlier versions of Android.
--no-android-print
Cancels any earlier --android-print option in Makefile variables etc
--known-characters=
When generating an Android browser, include an option to leave the most
frequent characters unannotated as 'known'. This option should be set
to the filename or URL of a UTF-8 file of characters separated by new-
lines, assumed to be most frequent first, with characters on the same
line being variants of each other (see --freq-count for one way to gen-
erate it). Words consisting entirely of characters found in the first N
lines of this file (where N is settable by the user) will be unanno-
tated until tapped on.
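The "known characters" test described above can be sketched as follows, assuming the file has already been read into a list of lines. The function names and sample characters are illustrative only.

```python
def load_known(lines, n):
    """Collect every character on the first n lines (variants included)."""
    known = set()
    for line in lines[:n]:
        known.update(line.strip())
    return known

def is_known_word(word, known):
    """A word is left unannotated only if all of its characters are known."""
    return bool(word) and all(ch in known for ch in word)

# Hypothetical file contents, most frequent first; line 3 lists two variants
lines = ["的", "是", "个個", "我"]
known = load_known(lines, 3)   # the user has set N to 3
```

With N set to 3, a word written with either variant on the third line counts as known, while any word containing the fourth-line character would still be annotated.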
--freq-count=
Name of a file to write that is suitable for the known-characters op-
tion, taken from the input examples (which should be representative of
typical use). Any post-normalise table provided will be used to deter-
mine which characters are equivalent.
--android-audio=
When generating an Android browser, include an option to convert the
selection to audio using this URL as a prefix, e.g. https://exam-
ple.org/speak.cgi?text= (use for languages not likely to be supported
by the device itself). Optionally follow the URL with a space (quote
carefully) and a maximum number of words to read in each user request.
Setting a limit is recommended, or somebody somewhere will likely try
'Select All' on a whole book or something and create load problems. You
should set a limit server-side too of course.
--extra-js=
Extra Javascript to inject into sites to fix things in the Android
browser app. The snippet will be run before each scan for new text to
annotate. You may also specify a file to read: --extra-js=@file.js or
--extra-js=@file1.js,file2.js (or URLs; do not use // comments in these
files, only /* ... */ because newlines will be replaced), and you can
create variants of the files by adding search-replace strings: --ex-
tra-js=@file1.js:search:replace,file2.js
--tts-js
Make Android 5+ multilingual Text-To-Speech functions available to ex-
tra-js scripts (see TTSInfo code for details)
--no-tts-js
Cancels any earlier --tts-js option in Makefile variables etc
--existing-ruby-js-fixes=
Extra Javascript to run in the Android browser app or browser extension
whenever existing RUBY elements are encountered; the DOM node above
these elements will be in the variable n, which your code can manipu-
late or replace to fix known problems with sites' existing ruby (such
as common two-syllable words being split when they shouldn't be). Use
with caution. You may also specify a file or URL to read: --exist-
ing-ruby-js-fixes=@file.js
--existing-ruby-lang-regex=
Set the Android app or browser extension to remove existing ruby ele-
ments unless the document language matches this regular expression. If
--sharp-multi is in use, you can separate multiple regexes with commas,
and any left unset will always delete existing ruby. If this option is not
set at all then existing ruby is always kept.
--existing-ruby-shortcut-yarowsky
Set the Android browser app to 'shortcut' Yarowsky-like collocation de-
cisions when adding glosses to existing ruby over 2 or more characters,
so that words normally requiring context to be found are more likely to
be found without context (this may be needed because adding glosses to
existing ruby is done without regard to context)
--extra-css=
Extra CSS to inject into sites to fix things in the Android browser
app. You may also specify a file or URL to read: --extra-css=@file.css
--app-name=
User-visible name of the Android app
--compile-only
Assume the code has already been generated by a previous run, and just
run the compiler
--no-compile-only
Cancels any earlier --compile-only option in Makefile variables etc
-j, --javascript
Instead of generating C code, generate JavaScript. This might be useful
if you want to run an annotator on a device that has a JS interpreter
but doesn't let you run your own binaries. The JS will be table-driven
to make it load faster. See comments at the start for usage.
--no-javascript
Cancels any earlier --javascript option in Makefile variables etc
-6, --js-6bit
When generating a Javascript annotator, use a 6-bit format for many ad-
dresses to reduce escape codes in the data string by making more of it
ASCII
--no-js-6bit
Cancels any earlier --js-6bit option in Makefile variables etc
-8, --js-octal
When generating a Javascript annotator, use octal instead of hexadeci-
mal codes in the data string when doing so would save space. This does
not comply with ECMAScript 5 and may give errors in its strict mode.
--no-js-octal
Cancels any earlier --js-octal option in Makefile variables etc
-9, --ignore-ie8
When generating a Javascript annotator, do not make it backward-compat-
ible with Microsoft Internet Explorer 8 and below. This may save a few
bytes.
--no-ignore-ie8
Cancels any earlier --ignore-ie8 option in Makefile variables etc
-u, --js-utf8
When generating a Javascript annotator, assume the script can use UTF-8
encoding directly and not via escape sequences. In some browsers this
might work only on UTF-8 websites, and/or if your annotation can be ex-
pressed without the use of Unicode combining characters.
--no-js-utf8
Cancels any earlier --js-utf8 option in Makefile variables etc
--browser-extension=
Name of a Chrome or Firefox browser extension to generate. The exten-
sion will be placed in a directory of the same name (without spaces),
which may optionally already exist and contain icons like 32.png and
48.png to be used.
--browser-extension-description=
Description field to use when generating browser extensions
--manifest-v3
Use Manifest v3 instead of Manifest v2 when generating browser exten-
sions (tested on Chrome only, and requires Chrome 88 or higher). This
is now required for all Chrome Web Store uploads.
--gecko-id=
A Gecko (Firefox) ID to embed in the browser extension
--dart
Instead of generating C code, generate Dart. This might be useful if
you want to run an annotator in a Flutter application.
--no-dart
Cancels any earlier --dart option in Makefile variables etc
--dart-datafile=
When generating Dart code, put annotator data into a separate file and
open it using this pathname. Not compatible with Dart's "Web app" op-
tion, but might save space in a Flutter app (especially along with
--zlib)
-Y, --python
Instead of generating C code, generate a Python module. Similar to the
Javascript option, this is for when you can't run your own binaries,
and it is table-driven for fast loading.
--no-python
Cancels any earlier --python option in Makefile variables etc
--reannotator=
Shell command through which to pipe each word of the original text to
obtain new annotation for that word. This might be useful as a quick
way of generating a new annotator (e.g. for a different topolect) while
keeping the information about word separation and/or glosses from the
previous annotator, but it is limited to commands that don't need to
look beyond the boundaries of each word. If the command is prefixed by
a # character, it will be given the word's existing annotation instead
of its original text, and if prefixed by ## it will be given text#anno-
tation. The command should treat each line of its input independently,
and both its input and its output should be in the encoding specified
by --outcode.
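A reannotator command has to map each input line to one output line without looking at its neighbours. The sketch below shows that shape with a toy transformation (dropping tone digits); the function names and sample data are hypothetical, and flushing after every line is an assumption about what a pipe-based caller would want.

```python
import io

def strip_tone_digits(annotation):
    """Toy transformation: 'ni3 hao3' becomes 'ni hao'."""
    return "".join(ch for ch in annotation if not ch.isdigit())

def reannotate_line(line):
    """Handle one line on its own; accepts 'annotation' or 'text#annotation'."""
    _, sep, annotation = line.partition("#")
    return strip_tone_digits(annotation if sep else line)

def run_filter(stdin, stdout):
    for line in stdin:
        stdout.write(reannotate_line(line.rstrip("\n")) + "\n")
        stdout.flush()

# Demonstration with in-memory streams; a real script would instead call
# run_filter(sys.stdin, sys.stdout) and be named in --reannotator
demo_out = io.StringIO()
run_filter(io.StringIO("ni3 hao3\n你好#ni3 hao3\n"), demo_out)
```

The first demo line corresponds to the # prefix (existing annotation only) and the second to the ## prefix (text#annotation).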
-A, --reannotate-caps
When using --reannotator, make sure to capitalise any word it returns
that began with a capital on input
--no-reannotate-caps
Cancels any earlier --reannotate-caps option in Makefile variables etc
--sharp-multi
Assume annotation (or reannotator output) contains multiple alterna-
tives separated by # (e.g. pinyin#Yale) and include code to select one
by number at runtime (starting from 0). This is to save on total space
when shipping multiple annotators that share the same word grouping and
gloss data, differing only in the transcription of each word.
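Selecting one alternative at runtime amounts to splitting on # and indexing from 0, as in this minimal sketch. The function name is hypothetical, and the generated annotators may handle out-of-range numbers differently from the plain IndexError raised here.

```python
def select_annotation(annotation, annot_no=0):
    """Pick one '#'-separated alternative by number, starting from 0."""
    return annotation.split("#")[annot_no]

print(select_annotation("pinyin#Yale", 0))  # pinyin
print(select_annotation("pinyin#Yale", 1))  # Yale
```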
--no-sharp-multi
Cancels any earlier --sharp-multi option in Makefile variables etc
--annotation-names=
Comma-separated list of annotation types supplied to sharp-multi (e.g.
Pinyin,Yale), if you want the Android app etc to be able to name them.
You can also set just one annotation name here if you are not using
sharp-multi.
--annotation-map=
Comma-separated list of annotation-number overrides for sharp-multi,
e.g. 7=3 to take the 3rd item if a 7th is selected
--annotation-postprocess=
Extra code for post-processing specific annotNo selections after re-
trieving from a sharp-multi list (@file or @url allowed)
-o, --allow-overlaps
Normally, the analyser avoids generating rules that could overlap with
each other in a way that would leave the program not knowing which one
to apply. If a short rule would cause overlaps, the analyser will pre-
fer to generate a longer rule that uses more context, and if even the
entire phrase cannot be made into a rule without causing overlaps then
the analyser will give up on trying to cover that phrase. This option
allows the analyser to generate rules that could overlap, as long as
none of the overlaps would cause actual problems in the example
phrases. Thus more of the examples can be covered, at the expense of a
higher risk of ambiguity problems when applying the rules to other
texts. See also the -y option.
--no-allow-overlaps
Cancels any earlier --allow-overlaps option in Makefile variables etc
-y, --ybytes=
Look for candidate Yarowsky seed-collocations within this number of
bytes of the end of a word. If this is set then overlaps and rule con-
flicts will be allowed when seed collocations can be used to distin-
guish between them, and the analysis is likely to be faster. Markup ex-
amples that are completely separate (e.g. sentences from different
sources) must have at least this number of (non-whitespace) bytes be-
tween them.
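The idea of a seed collocation within a byte window can be illustrated with the simplified sketch below. It is not annogen's algorithm: the candidate list is supplied by hand, the window is assumed to extend the given number of bytes on both sides of the word, and the English example is used only because it is easy to read.

```python
def window(text: bytes, start: int, end: int, ybytes: int) -> bytes:
    """Bytes within ybytes before the word's start and after its end."""
    # The NUL separator stops a candidate matching across the word itself
    return text[max(0, start - ybytes):start] + b"\x00" + text[end:end + ybytes]

def seed_collocations(text, occurrences, candidates, ybytes):
    """occurrences: (start, end, annotation) triples for one ambiguous word.
    Returns {candidate: annotation} for candidates seen near one annotation only."""
    seen = {}
    for start, end, annotation in occurrences:
        context = window(text, start, end, ybytes)
        for cand in candidates:
            if cand in context:
                seen.setdefault(cand, set()).add(annotation)
    return {c: next(iter(a)) for c, a in seen.items() if len(a) == 1}

text = b"the lead pipe burst. she is the lead singer."
first = text.find(b"lead")
second = text.find(b"lead", first + 1)
occurrences = [(first, first + 4, "led"), (second, second + 4, "leed")]
seeds = seed_collocations(text, occurrences, [b"pipe", b"singer", b"the"], 8)
```

Here "pipe" and "singer" each indicate one reading, while "the" appears near both and so distinguishes nothing.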
--ybytes-max=
Extend the Yarowsky seed-collocation search to check over larger ranges
up to this maximum. If this is set then several ranges will be checked
in an attempt to determine the best one for each word, but see also
ymax-threshold and ymax-limitwords.
--ymax-threshold=
Limits the length of word that receives the narrower-range Yarowsky
search when ybytes-max is in use. For words longer than this, the
search will go directly to ybytes-max. This is for languages where the
likelihood of a word's annotation being influenced by its immediate
neighbours more than its distant collocations increases for shorter
words, and less is to be gained by comparing different ranges when pro-
cessing longer words. Setting this to 0 means no limit, i.e. the full
range will be explored on all Yarowsky checks.
--ymax-limitwords=
Comma-separated list of words (without annotation markup) for which the
ybytes expansion loop should run at most two iterations. This may be
useful to reduce compile times for very common ambiguous words that de-
pend only on their immediate neighbours. Annogen may suggest words for
this option if it finds they take inordinate time to process.
--ybytes-step=
The increment value for the loop between ybytes and ybytes-max
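How ybytes, ybytes-max, ybytes-step, ymax-threshold and ymax-limitwords interact might be pictured as below. This is a sketch under stated assumptions: the upper bound is treated as inclusive and word length is taken as a plain count, neither of which this page specifies.

```python
def ybytes_ranges(word_len, ybytes, ybytes_max, ybytes_step,
                  ymax_threshold=0, limit_word=False):
    """List the window sizes that would be tried for one word."""
    if ymax_threshold and word_len > ymax_threshold:
        return [ybytes_max]       # long word: go directly to ybytes-max
    ranges = list(range(ybytes, ybytes_max + 1, ybytes_step))
    if limit_word:
        ranges = ranges[:2]       # --ymax-limitwords: at most two iterations
    return ranges

print(ybytes_ranges(1, 20, 60, 20))                    # [20, 40, 60]
print(ybytes_ranges(5, 20, 60, 20, ymax_threshold=2))  # [60]
print(ybytes_ranges(1, 20, 60, 20, limit_word=True))   # [20, 40]
```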
-k, --warn-yarowsky
Warn when absolutely no distinguishing Yarowsky seed collocations can
be found for a word in the examples
--no-warn-yarowsky
Cancels any earlier --warn-yarowsky option in Makefile variables etc
-K, --yarowsky-all
Accept Yarowsky seed collocations even from input characters that never
occur in annotated words (this might include punctuation and exam-
ple-separation markup)
--no-yarowsky-all
Cancels any earlier --yarowsky-all option in Makefile variables etc
--yarowsky-multiword
Check potential multiword rules for Yarowsky seed collocations also.
Without this option (default), only single-word rules are checked.
--no-yarowsky-multiword
Cancels any earlier --yarowsky-multiword option in Makefile variables
etc
--yarowsky-thorough
Recheck Yarowsky seed collocations when checking if any multiword rule
would be needed to reproduce the examples. This could risk 'overfit-
ting' the example set.
--no-yarowsky-thorough
Cancels any earlier --yarowsky-thorough option in Makefile variables
etc
--yarowsky-half-thorough
Like --yarowsky-thorough but check only what collocations occur within
the proposed new rule (not around it), less likely to overfit
--no-yarowsky-half-thorough
Cancels any earlier --yarowsky-half-thorough option in Makefile vari-
ables etc
--yarowsky-debug=
Report the details of seed-collocation false positives if there are a
large number of matches and at most this number of false positives (de-
fault 1). Occasionally these might be due to typos in the corpus, so it
might be worth a check.
--allow-exceptions=
Filename (or URL) of any known exceptions for --yarowsky-debug checks
(default allow-exceptions.txt)
--normalise-debug=
When --capitalisation is not in effect, report words that are usually
capitalised but that have at most this number of lower-case exceptions
(default 1) for investigation of possible typos in the corpus
--allow-caps-exceptions=
Filename (or URL) of any known exceptions for --normalise-debug checks
(default allow-caps-exceptions.txt)
--debug-dir=
Directory in which to write reports of possible typos etc (defaults to
current directory)
--normalise-cache=
Optional file to use to cache the result of normalisation. Adding .gz,
.bz2 or .xz for compression is acceptable.
-1, --single-words
Do not generate any rule longer than 1 word, although it can still have
Yarowsky seed collocations if -y is set. This speeds up the search, but
at the expense of thoroughness. You might want to use this in conjunc-
tion with -y to make a parser quickly.
--no-single-words
Cancels any earlier --single-words option in Makefile variables etc
--max-words=
Limits the number of words in a rule. 0 means no limit. --single-words
is equivalent to --max-words=1. If you need to limit the search time,
and are using -y, it should suffice to use --single-words for a quick
annotator or --max-words=5 for a more thorough one (or try 3 if
--yarowsky-half-thorough is in use).
--multiword-end-avoid=
Comma-separated list of words (without annotation markup) that should
be avoided at the end of a multiword rule (e.g. sandhi likely to depend
on the following word)
-d, --diagnose=
Output some diagnostics for the specified word. Use this option to help
answer "why doesn't it have a rule for...?" issues. This option expects
the word without markup and uses the system locale (UTF-8 if it cannot
be detected).
--diagnose-limit=
Maximum number of phrases to print diagnostics for (0 means unlimited).
Default: 10
-m, --diagnose-manual
Check and diagnose potential failures of --manualrules
--no-diagnose-manual
Cancels any earlier --diagnose-manual option in Makefile variables etc
-q, --diagnose-quick
Ignore all phrases that do not contain the word specified by the --di-
agnose option, for getting a faster (but possibly less accurate) diag-
nostic. The generated annotator is not likely to be useful when this
option is present.
--no-diagnose-quick
Cancels any earlier --diagnose-quick option in Makefile variables etc
--priority-list=
Instead of generating an annotator, use the input examples to generate
a list of (non-annotated) words with priority numbers, a higher number
meaning the word should have greater preferential treatment in ambigui-
ties, and write it to this file (or compressed .gz, .bz2 or .xz file).
If the file provided already exists, it will be updated, thus you can
amend an existing usage-frequency list or similar (although the final
numbers are priorities and might no longer match usage-frequency ex-
actly). The purpose of this option is to help if you have an existing
word-priority-based text segmenter and wish to update its data from the
examples; this approach might not be as good as the Yarowsky-like one
(especially when the same word has multiple readings to choose from),
but when there are integration issues with existing code you might at
least be able to improve its word-priority data.
-t, --time-estimate
Estimate time to completion. The code to do this is unreliable and is
prone to underestimate. If you turn it on, its estimate is displayed at
the end of the status line as days, hours or minutes.
--no-time-estimate
Cancels any earlier --time-estimate option in Makefile variables etc
-0, --single-core
Use only one CPU core even when others are available on Unix
--no-single-core
Cancels any earlier --single-core option in Makefile variables etc
--cores-command=
Command to run when changing the number of CPU cores in use (with new
number as a parameter); this can run a script to pause/resume any
lower-priority load
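A script for --cores-command might look like the following Python sketch.
It rests on assumptions not stated in this manual: that the lower-priority
job's process ID is kept in a file (PID_FILE is a hypothetical path), that
the job should be paused whenever more than CORE_LIMIT cores are in use,
and that the system is Unix-like (SIGSTOP and SIGCONT are used).

```python
#!/usr/bin/env python3
import os
import signal
import sys

# Assumptions for illustration only (neither value comes from annogen):
PID_FILE = "/tmp/background-job.pid"  # where the other job wrote its PID
CORE_LIMIT = 1                        # pause the job above this many cores

def signal_for(cores_in_use, limit=CORE_LIMIT):
    """Return SIGSTOP if more than limit cores are in use, else SIGCONT."""
    return signal.SIGSTOP if cores_in_use > limit else signal.SIGCONT

def main(argv):
    cores = int(argv[1])  # the new number of cores, passed as the parameter
    with open(PID_FILE) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal_for(cores))  # pause or resume the other job

if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    main(sys.argv)
```

Keeping the decision in signal_for, separate from the os.kill call, lets
the threshold be checked without touching any real process.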
-p, --status-prefix=
Label to add at the start of the status line, for use if you batch-run
annogen in multiple configurations and want to know which one is cur-
rently running
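A batch run of this kind could be prepared with a Python sketch like the
one below. The configuration labels and the corpus.txt filename are
hypothetical, the program is assumed to be installed as annogen, and a real
run would also need whichever output options you require; only the option
names themselves come from this manual.

```python
import shlex

# Hypothetical batch: labels and option sets are illustrative only.
CONFIGS = {
    "quick": ["-y", "--single-words"],
    "thorough": ["-y", "--max-words=5"],
}

def build_command(label, extra_options, infile="corpus.txt"):
    """Return an annogen argument list whose status line starts with label."""
    return ["annogen", "--infile=" + infile,
            "--status-prefix=" + label] + list(extra_options)

# Print each command line; pass the list to subprocess.run to execute it.
for label, options in CONFIGS.items():
    print(shlex.join(build_command(label, options)))
```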
Legal considerations
Annotator code will contain individual words and some phrases from the
original corpus (and these can be read even by people who do not have
the unannotated version); with regard to copyright law, I expect the
annotator code will count as an "index" to the collection, the copy-
right of which exists separately to that of the original collection,
but laws do vary by country and I am not a solicitor so please act ju-
diciously.
Legally obtaining that original annotated corpus is up to you. If you
are in the UK the government says non-commercial text mining is allowed
(terms of use prohibiting non-commercial mining are unenforceable),
provided you:
1. respect network stability (i.e. wait a long time between each down-
load),
2. connect directly to the publisher (this law bypasses the pub-
lisher's terms of use, not those of third-party search engines like
Google),
3. use the result only for mining, not for republishing the original
text (so you can't publish your unprocessed crawl dumps either),
4. and still respect any prohibitions against sharing whatever mining
tools you made for the site (as this law is only about text mining,
not about the sharing of tools).
Laws outside the UK are different (and I'm not a lawyer) so check care-
fully. Gao et al 2020's paper on "The Pile"
https://arxiv.org/abs/2101.00027 claims published crawl dumps with lim-
ited processing might be permissible under American copyright law as
transformative fair use, but I'm not sure how legally watertight their
argument is: it might be safer to keep unlicensed parts of the corpus
private and publish only the resulting index.
If the website's terms don't actually prohibit writing an unpublished
scraper for non-commercial mining purposes, perhaps you won't need a
legal exception for the crawling part--but you should still respect
their bandwidth and do it slowly, both for moral reasons (it's the
right thing to do) and pragmatic ones (you won't want their sysadmins
and service providers taking action against you).
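Downloading slowly can be as simple as the following Python sketch, which
pauses between requests. The function names and the 60-second default are
assumptions for illustration, not a recommendation from any publisher;
choose a delay appropriate to the site.

```python
import time
import urllib.request

def fetch(url):
    """Download one URL and return its content as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def crawl_slowly(urls, delay_seconds=60, fetch=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_seconds between downloads."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every download but the first
        pages[url] = fetch(url)
    return pages
```

The fetch and sleep parameters exist only so the loop can be exercised
without touching the network.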
Citation
If you need to cite a peer-reviewed paper:
Silas S. Brown. Web Annotation with Modified-Yarowsky and Other Algo-
rithms. Overload 112 (December 2012) pp.4-7.
Silas S. Brown August 2025 ANNOGEN(1)
