ucto(1)			    General Commands Manual		       ucto(1)

NAME
       ucto - Unicode Tokenizer

SYNOPSIS
       ucto [options] [input-file] [output-file]

DESCRIPTION
       ucto  tokenizes text files: it separates	words from punctuation,	splits
       sentences (and optionally paragraphs), and finds	paired	quotes.	  Ucto
       is preconfigured	with tokenisation rules	for several languages.

       Those rules are provided by uctodata.

OPTIONS
       -c configfile
	      read settings from a 'configfile'

       -B
	      run in batch mode. Process all input files to an output
	      directory specified with -O.

       -d value
	      set debug	mode to	'value'

       -e value
	      set input	encoding. (default UTF8)

       -I value
	      set the input directory to 'value'. (batch mode only)

       -O value
	      set the output directory to 'value'. (Required for batch mode)
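
	      As an illustration of how -B, -I and -O fit together, the
	      following Python sketch assembles a batch-mode command line. The
	      helper name batch_command and the directory names are
	      hypothetical; only the flags come from this page. The sketch
	      prints the command rather than running it.

```python
import shlex

def batch_command(input_dir, output_dir, language=None):
    """Build an argument list for a ucto batch run (hypothetical helper)."""
    # -B selects batch mode; -I and -O name the input and output directories.
    argv = ["ucto", "-B", "-I", input_dir, "-O", output_dir]
    if language is not None:
        # -L picks a configuration file by language code, e.g. 'fra'.
        argv += ["-L", language]
    return argv

# Directory names are placeholders chosen for this example.
cmd = batch_command("corpus/raw", "corpus/tokenized", language="fra")
print(shlex.join(cmd))
```

	      Passing the list to subprocess.run() would execute it, assuming
	      ucto is installed and on the PATH.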

       -N value
	      set UTF8 output normalization. (default NFC)
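
	      To show what NFC normalization means in practice, here is a
	      short Python sketch using the standard unicodedata module. It
	      illustrates the Unicode normalization form itself and makes no
	      claim about how ucto implements it.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (decomposed form).
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
```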

       --filter=[YES|NO]
	      enable or disable filtering of special characters (default
	      YES). These special characters can be specified in the [FILTER]
	      block of the configuration file.

       -L language
	      Automatically selects a configuration file by language code.
	      The language code is generally a three-letter ISO 639-3 code.
	      For example, 'fra' will select the file tokconfig-fra from the
	      installation directory.

       --detectlanguages=<lang1,lang2,..langn>
	      try  to detect all the specified languages. The default language
	      will be 'lang1'.	(only useful for FoLiA output).

	      All values must be ISO 639-3 codes.

	      You can also use the special language code 'und'. This ensures
	      there is NO default language, and any language that is NOT in
	      the list will remain unanalyzed.

	      Warning: To be able to handle utterances of mixed	language, Ucto
	      uses a simple sentence splitter based on the markers '.' '?' and
	      '!'.  This may occasionally lead to surprising results.
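
	      The kind of surprise the warning refers to can be seen in a
	      minimal Python sketch of a marker-based splitter. This is an
	      assumption about the general technique, not ucto's actual code;
	      the function name naive_split is hypothetical.

```python
import re

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

sentences = naive_split("Dr. Smith arrived. Was he late? No!")
print(sentences)
# ['Dr.', 'Smith arrived.', 'Was he late?', 'No!']
```

	      The abbreviation "Dr." is treated as a complete sentence,
	      because the splitter only looks at the markers.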

       -l
	      Convert output text to all lowercase

       -u
	      Convert output text to all uppercase

       -n
	      Emit one sentence	per line on output

       -m
	      Assume one sentence per line on input

       --normalize=class1,class2,..,classn
	      map all occurrences of tokens with class1,..,classn to their
	      generic names. For example, --normalize=DATE will map all dates
	      to the word {{DATE}}. Very useful to normalize tokens like URLs,
	      dates, e-mail addresses and so on.
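
	      The mapping described for --normalize can be sketched in Python
	      as follows. The (text, class) pair representation, the function
	      name normalize_tokens, and the class names other than DATE are
	      assumptions made for this example, not ucto's internal data
	      model.

```python
def normalize_tokens(tokens, classes):
    """Replace the text of every token whose class is listed by {{CLASS}}.

    tokens  -- list of (text, token_class) pairs (assumed representation)
    classes -- set of class names to normalize
    """
    return [
        "{{" + cls + "}}" if cls in classes else text
        for text, cls in tokens
    ]

tokens = [
    ("Meeting", "WORD"),
    ("on", "WORD"),
    ("2024-04-11", "DATE"),
    (".", "PUNCTUATION"),
]
normalized = normalize_tokens(tokens, {"DATE"})
print(" ".join(normalized))  # Meeting on {{DATE}} .
```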

       -T value	or --textredundancy=value
	      set text redundancy level	for text nodes in FoLiA	output:
	       'full'    - add text to all levels: <p> <s> <w> etc.
	       'minimal' - don't introduce text on higher levels, but retain
	                   what is already there.
	       'none'    - only introduce text on <w>, AND remove all text
	                   from higher levels

       --allow-word-correction
	      Allow ucto to tokenize inside FoLiA Word elements, creating
	      FoLiA Corrections

       --ignore-tag-hints
	      Skip all tag=token hints from the	FoLiA input. These  hints  can
	      be used to signal	text markup like subscript and superscript

       --add-tokens="file"
	      Add extra tokens to the [TOKENS] block of the default language.
	      The file should contain one token per line.

       --passthru
	      Don't tokenize, but perform input	decoding and simple token role
	      detection

       --filterpunct
	      remove most of the punctuation from the output. (not from
	      abbreviations and embedded punctuation like John's)

       -P
	      Disable Paragraph	Detection

       -Q
	      Enable  Quote  Detection.	 (this is experimental and may lead to
	      unexpected results)

       -s <string>
	      Set End-of-sentence marker. (Default <utt>)

       -V or --version
	      Show version information

       -v
	      set Verbose mode

       -F
	      The input	file(s)	are assumed to be FoLiA	XML. Text in the  cor-
	      rect  'inputclass'  will be tokenized.  For files	with an	'.xml'
	      extension, -F is the default.

	      In batch mode, this restricts processing to files with the
	      '.xml' extension in the input directory.

       --inputclass="cls"
	      When  tokenizing	a FoLiA	XML document, search for text nodes of
	      class 'cls'.  The	default	is "current".

       --outputclass="cls"
	      When tokenizing a FoLiA XML document, output the tokenized text
	      in text nodes of class 'cls'. The default is "current". It is
	      recommended to use different classes for input and output.
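
	      A Python sketch of a FoLiA-to-FoLiA run with distinct input and
	      output classes is given here as an illustration. The helper
	      name folia_command, the file names, and the class name
	      'original' are hypothetical; the flags are those documented on
	      this page. The command is printed, not executed.

```python
import shlex

def folia_command(infile, outfile, input_class, output_class):
    """Build a ucto argument list for FoLiA input and output (hypothetical helper)."""
    return [
        "ucto",
        "-F",                             # treat the input as FoLiA XML
        "-X",                             # write FoLiA XML output
        "--inputclass=" + input_class,    # class of the text nodes to read
        "--outputclass=" + output_class,  # class of the text nodes to write
        infile,
        outfile,
    ]

# 'original' is a placeholder class name; 'current' is the documented default.
folia_cmd = folia_command("in.folia.xml", "out.folia.xml", "original", "current")
print(shlex.join(folia_cmd))
```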

       --textclass="cls" (obsolete)
	      use 'cls' for input and output of text from FoLiA. Equivalent to
	      specifying both --inputclass='cls' and --outputclass='cls'.

	      This option is obsolete and NOT recommended. Please use the
	      separate --inputclass and --outputclass options.

       --copyclass
	      When ucto is used on FoLiA with fully tokenized text in the
	      input class, no text in the output class is produced (a warning
	      is given). To circumvent this, add the --copyclass option, which
	      ensures that text is emitted in that class.

       -X
	      All output will be FoLiA XML. Document IDs are autogenerated.

	      Works in batch mode too.

       --id <DocId>
	      Use the specified Document ID for the FoLiA XML (not allowed in
	      batch mode). When not provided, a document ID is generated based
	      on the name of the input file.

BUGS
       likely

AUTHORS
       Maarten van Gompel

       Ko van der Sloot

       e-mail: lamasoftware@science.ru.nl

				  2024 apr 11			       ucto(1)
