uni2ascii(1)		    General Commands Manual		  uni2ascii(1)

NAME
       uni2ascii  -  convert  UTF-8 Unicode to various 7-bit ASCII representa-
       tions

SYNOPSIS
       uni2ascii [options] (<input file	name>)

DESCRIPTION
       uni2ascii converts UTF-8 Unicode to various 7-bit ASCII
       representations. If no format is specified, standard hexadecimal
       format (e.g. 0x00E9) is used. Unless an input file is named, it reads
       from the standard input. It writes to the standard output.

       Command line options are:

       -A     List  the	 single	character approximations carried out by	the -y
	      flag.

       -a <format>
	      Convert to the specified format. Formats	may  be	 specified  by
	      means  of	 the  following	 arbitrary  single character codes, by
	      means of names such as "SGML_decimal", and by  examples  of  the
	      desired format.

	      A	 Generate  hexadecimal numbers with prefix U in	angle-brackets
	      (<U00E9>).

	      B	Generate \x-escaped hex	(e.g. \x00E9)

	      C	Generate  \x  escaped  hexadecimal  numbers  in	 braces	 (e.g.
	      \x{00E9}).

	      D	 Generate  decimal  HTML  numeric  character  references (e.g.
	      &#0233;)

	      E	Generate hexadecimal with prefix U (U00E9).

	      F	Generate hexadecimal with prefix u (u00E9).

              G Generate hexadecimal in single quotes with prefix X (e.g.
              X'00E9').

	      H	 Generate  hexadecimal HTML numeric character references (e.g.
	      &#x00E9;)

              I Generate hexadecimal UTF-8 with each byte's hex preceded by an
              =-sign (e.g. =C3=A9). This is the Quoted-Printable format
              defined by RFC 2045.

	      J	 Generate hexadecimal UTF-8 with each byte's hex preceded by a
	      %-sign (e.g.  %C3%A9). This is the URI escape format defined  by
	      RFC 2396.

	      K	 Generate  octal  UTF-8	 with each byte	escaped	by a backslash
	      (e.g.  \303\251)

	      L	Generate \U-escaped hex	outside	the BMP, \u-escaped hex	within
	      the BMP (U+0000-U+FFFF).

	      M	Generate hexadecimal SGML numeric character  references	 (e.g.
	      \#xE9;)

	      N	 Generate  decimal  SGML  numeric  character  references (e.g.
	      \#233;)

              O Generate octal escapes for the three low bytes in big-endian
              order (e.g. \000\000\351)

	      P	Generate hexadecimal numbers with prefix U+ (e.g. U+00E9)

	      Q	 Generate  character  entities (e.g. &eacute;) where possible,
	      otherwise	hexadecimal numeric character references.

	      R	Generate raw hexadecimal numbers (e.g. 00E9)

	      S	Generate hexadecimal escapes for the three low bytes  in  big-
	      endian order (e.g. \x00\x00\xE9)

	      T	Generate decimal escapes for the three low bytes in big-endian
	      order (e.g. \d000\d000\d233)

	      U	Generate \u-escaped hexadecimal	numbers	(e.g. \u00E9).

	      V	Generate \u-escaped decimal numbers (e.g. \u00233).

	      X	Generate standard hexadecimal numbers (e.g. 0x00E9).

	      0	 Generate  hexadecimal	UTF-8  with  each  byte's hex enclosed
	      within angle brackets (e.g. <C3><A9>).

	      1	Generate Common	Lisp format hexadecimal	numbers	(e.g. #x00E9).

	      2	Generate Perl format  decimal  numbers	with  prefix  v	 (e.g.
	      v233).

	      3	Generate hexadecimal numbers with prefix $ (e.g. $00E9).

              4 Generate PostScript format hexadecimal numbers with prefix 16#
              (e.g. 16#00E9).

	      5	 Generate  Common  Lisp	format hexadecimal numbers with	prefix
	      #16r (e.g. #16r00E9).

              6 Generate Ada format hexadecimal numbers with prefix 16# and
              suffix # (e.g. 16#00E9#).

	      7	 Generate Apache log format hexadecimal	UTF-8 with each	byte's
	      hex preceded by a	backslash-x (e.g.  \xC3\xA9).

	      8	Generate Microsoft OOXML format	hexadecimal numbers with  pre-
	      fix _x and suffix	_ (e.g.	_x00E9_).

	      9	Generate %\u-escaped hexadecimal numbers (e.g. %\u00E9).
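
              As a rough illustration of how several of the formats above
              relate a character to its output, the following Python sketch
              reproduces the documented examples for U+00E9. The names
              utf8_hex, FORMATS, and convert are hypothetical and are not part
              of uni2ascii. The zero-padding widths are inferred from the
              examples above and are an assumption.

```python
def utf8_hex(ch, prefix, suffix=""):
    """Escape each UTF-8 byte of ch as prefix + two hex digits + suffix."""
    return "".join(f"{prefix}{b:02X}{suffix}" for b in ch.encode("utf-8"))


# Hypothetical table: format code -> function from one character to text.
# Only a handful of the codes listed above are sketched here.
FORMATS = {
    "X": lambda ch: f"0x{ord(ch):04X}",    # standard hexadecimal
    "P": lambda ch: f"U+{ord(ch):04X}",    # prefix U+
    "U": lambda ch: f"\\u{ord(ch):04X}",   # \u-escaped hexadecimal
    "D": lambda ch: f"&#{ord(ch):04d};",   # decimal HTML reference
    "H": lambda ch: f"&#x{ord(ch):04X};",  # hexadecimal HTML reference
    "I": lambda ch: utf8_hex(ch, "="),     # Quoted-Printable style
    "J": lambda ch: utf8_hex(ch, "%"),     # URI escape style
    "K": lambda ch: "".join(f"\\{b:03o}" for b in ch.encode("utf-8")),
    "0": lambda ch: utf8_hex(ch, "<", ">"),
}


def convert(text, code):
    """Escape non-ASCII characters and leave ASCII characters alone."""
    return "".join(ch if ord(ch) < 128 else FORMATS[code](ch) for ch in text)


print(convert("caf\u00e9", "X"))  # caf0x00E9
print(convert("caf\u00e9", "J"))  # caf%C3%A9
```

              Formats X, P, U, D, and H work on the code point itself, while
              I, J, K, and 0 work on the bytes of the UTF-8 encoding, which is
              why U+00E9 appears as the two bytes C3 and A9 in the latter.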

       -B     Transform to ASCII if possible. This option is equivalent to the
              combination of the -c, -d, -e, -f, and -x options.

       -c     Convert circled and parenthesized	characters to their unenclosed
	      counterparts.

       -d     Strip  diacritics.  This converts	single codepoints representing
	      characters with diacritics to the	corresponding ASCII  character
	      and deletes separately encoded diacritics.
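
              The two behaviours described for -d can be approximated in
              Python with Unicode canonical decomposition. This is only a
              sketch of the idea: uni2ascii uses its own tables, so results
              may differ for characters that have no canonical decomposition.
              The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("r\u00e9sum\u00e9"))  # resume (precomposed e-acute)
print(strip_diacritics("e\u0301"))           # e (separately encoded acute)
```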

       -e     Convert  characters  to  their approximate ASCII equivalents, as
	      follows:
              U+0085  next line                                   0x0A  newline
              U+00A0  no break space                              0x20  space
              U+00AB  left-pointing double angle quotation mark   0x22  double quote
              U+00AD  soft hyphen                                 0x2D  minus
              U+00AF  macron                                      0x2D  minus
              U+00B7  middle dot                                  0x2E  period
              U+00BB  right-pointing double angle quotation mark  0x22  double quote
              U+1361  ethiopic word space                         0x20  space
              U+1680  ogham space                                 0x20  space
              U+2000  en quad                                     0x20  space
              U+2001  em quad                                     0x20  space
              U+2002  en space                                    0x20  space
              U+2003  em space                                    0x20  space
              U+2004  three-per-em space                          0x20  space
              U+2005  four-per-em space                           0x20  space
              U+2006  six-per-em space                            0x20  space
              U+2007  figure space                                0x20  space
              U+2008  punctuation space                           0x20  space
              U+2009  thin space                                  0x20  space
              U+200A  hair space                                  0x20  space
              U+200B  zero-width space                            0x20  space
              U+2010  hyphen                                      0x2D  minus
              U+2011  non-breaking hyphen                         0x2D  minus
              U+2012  figure dash                                 0x2D  minus
              U+2013  en dash                                     0x2D  minus
              U+2014  em dash                                     0x2D  minus
              U+2018  left single quotation mark                  0x60  left single quote
              U+2019  right single quotation mark                 0x27  right or neutral single quote
              U+201A  single low-9 quotation mark                 0x60  left single quote
              U+201B  single high-reversed-9 quotation mark       0x60  left single quote
              U+201C  left double quotation mark                  0x22  double quote
              U+201D  right double quotation mark                 0x22  double quote
              U+201E  double low-9 quotation mark                 0x22  double quote
              U+201F  double high-reversed-9 quotation mark       0x22  double quote
              U+2022  bullet                                      0x6F  small letter o
              U+2028  line separator                              0x0A  newline
              U+2032  prime                                       0x27  right or neutral single quote
              U+2033  double prime                                0x22  double quote
              U+2039  single left-pointing angle quotation mark   0x60  left single quote
              U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
              U+204E  low asterisk                                0x2A  asterisk
              U+2212  minus sign                                  0x2D  minus
              U+2216  set minus                                   0x5C  backslash
              U+2217  asterisk operator                           0x2A  asterisk
              U+2223  divides                                     0x7C  vertical line
              U+2500  box drawing light horizontal                0x2D  minus
              U+2501  box drawing heavy horizontal                0x2D  minus
              U+2502  box drawing light vertical                  0x7C  vertical line
              U+2503  box drawing heavy vertical                  0x7C  vertical line
              U+2731  heavy asterisk                              0x2A  asterisk
              U+275D  heavy double turned comma quotation mark    0x22  double quote
              U+275E  heavy double comma quotation mark           0x22  double quote
              U+3000  ideographic space                           0x20  space
              U+FE60  small ampersand                             0x26  ampersand
              U+FE61  small asterisk                              0x2A  asterisk
              U+FE62  small plus sign                             0x2B  plus sign
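
              A one-to-one table like the one above maps naturally onto a
              character translation table. The Python sketch below uses a
              small subset of the listed pairs. APPROXIMATIONS and approximate
              are hypothetical names, and the sketch does not claim to mirror
              how uni2ascii is implemented.

```python
# Subset of the -e table: code point -> ASCII replacement.
APPROXIMATIONS = {
    0x00A0: " ",   # no break space
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x2018: "`",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2022: "o",   # bullet
}


def approximate(text):
    """Replace each listed character with its ASCII approximation."""
    return text.translate(APPROXIMATIONS)


print(approximate("\u201cok\u201d \u2013 \u2018x\u2019"))  # "ok" - `x'
```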

       -E     List the expansions performed by the -x flag.

       -f     Convert stylistic	variants to plain  ASCII.   Stylistic  equiva-
	      lents  include:  superscript and subscript forms,	small capitals
	      (e.g. U+1D04), script forms (e.g.	U+212C),  black	 letter	 forms
	      (e.g.  U+212D),  fullwidth  forms	(e.g. U+FF01), halfwidth forms
	      (e.g. U+FF7B), and the mathematical alphanumeric	symbols	 (e.g.
	      U+1D400).

       -h     Help. Print the usage message and	exit.

       -l     Use lowercase a-f	when generating	hexadecimal numbers.

       -n     Convert newlines too. By default,	they are left alone.

       -P     Pass  through Unicode rather than	converting to ASCII escapes if
	      the character is not converted to	an ASCII character by a	trans-
	      formation	such as	diacritic stripping. Note that if this	option
	      is used the output may not be pure ASCII.

       -p     Pure. In addition to characters above the ASCII range, convert
              characters within the ASCII range, except for space and newline.

       -q     Quiet. Do	not chat unnecessarily while working.

       -s     Convert space characters too. By default,	they are left alone.

       -S <Unicode:ASCII>
              Define a custom substitution. The argument should consist of the
              Unicode codepoint to be replaced followed by the ASCII code of
              the character to be used as replacement, separated by a colon.
              If no ASCII code follows the colon, the specified Unicode
              character will be deleted. The code values may be in hexadecimal,
              octal, or decimal following the usual conventions (to be
              precise, those of strtoul(3)). This option may be repeated as
              many times as desired to define multiple substitutions.
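
              To make the argument syntax concrete, here is a Python sketch
              that parses a Unicode:ASCII pair using the base-detection
              conventions of strtoul(3) with base 0 (0x prefix for
              hexadecimal, leading 0 for octal, otherwise decimal). The
              function names are hypothetical, and error handling is omitted.

```python
def parse_number(text):
    """Parse an integer using strtoul(3)-style base detection."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)
    return int(text, 10)


def parse_substitution(arg):
    """Split 'Unicode:ASCII' into (code point, replacement string)."""
    source, _, target = arg.partition(":")
    codepoint = parse_number(source)
    # An empty right-hand side means the character is deleted.
    replacement = chr(parse_number(target)) if target.strip() else ""
    return codepoint, replacement


# Replace U+2022 (bullet) with '*' (octal 052), then delete it instead.
print("a\u2022b".translate(dict([parse_substitution("0x2022:052")])))  # a*b
print("a\u2022b".translate(dict([parse_substitution("0x2022:")])))     # ab
```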

       -v     Print program version information	and exit.

       -w     Add a space after	each converted item.

       -x     Expand  certain  characters  to  multicharacter  sequences.  The
	      characters affected are the same as those	affected by the	-y op-
	      tion.
	      U+00A2 CENT SIGN			      -> cent
	      U+00A3 POUND SIGN			      -> pound
	      U+00A5 YEN SIGN			      -> yen
	      U+00A9 COPYRIGHT SYMBOL		      -> (c)
	      U+00AE REGISTERED	SYMBOL		      -> (R)
	      U+00BC ONE QUARTER		      -> 1/4
	      U+00BD ONE HALF			      -> 1/2
	      U+00BE THREE QUARTERS		      -> 3/4
	      U+00C6 CAPITAL LETTER ASH		      -> AE
	      U+00DF SMALL LETTER SHARP	S	      -> ss
	      U+00E6 SMALL LETTER ASH		      -> ae
	      U+0132 LIGATURE IJ		      -> IJ
	      U+0133 LIGATURE ij		      -> ij
	      U+0152 LIGATURE OE		      -> OE
	      U+0153 LIGATURE oe		      -> oe
	      U+01F1 CAPITAL LETTER DZ		      -> DZ
	      U+01F2 MIXED LETTER Dz		      -> Dz
	      U+01F3 SMALL LETTER DZ		      -> dz
	      U+02A6 SMALL LETTER TS DIGRAPH	      -> ts
	      U+2026 HORIZONTAL	ELLIPSIS	      -> ...
	      U+20AC EURO SIGN			      -> euro
              U+2122 TRADEMARK SIGN                   -> (tm)
              U+22EF MIDLINE HORIZONTAL ELLIPSIS      -> ...
	      U+2190 LEFTWARDS ARROW		      -> <-
	      U+2192 RIGHTWARDS	ARROW		      -> ->
	      U+21D0 LEFTWARDS DOUBLE ARROW	      -> <=
	      U+21D2 RIGHTWARDS	DOUBLE ARROW	      -> =>
	      U+FB00 LATIN SMALL LIGATURE FF	      -> ff
	      U+FB01 LATIN SMALL LIGATURE FI	      -> fi
	      U+FB02 LATIN SMALL LIGATURE FL	      -> fl
	      U+FB03 LATIN SMALL LIGATURE FFI	      -> ffi
	      U+FB04 LATIN SMALL LIGATURE FFL	      -> ffl
	      U+FB06 LATIN SMALL LIGATURE ST	      -> st

       -y     Convert certain characters having multi-character expansions to
              single-character ASCII approximations instead (e.g. to maintain
              character-positioning). The characters affected are the same as
              those affected by the -x option.
	      U+00A2 CENT SIGN			      -> c
	      U+00A3 POUND SIGN			      -> #
	      U+00A5 YEN SIGN			      -> Y
	      U+00A9 COPYRIGHT SYMBOL		      -> C
	      U+00AE REGISTERED	SYMBOL		      -> R
	      U+00BC ONE QUARTER		      -> -
	      U+00BD ONE HALF			      -> -
	      U+00BE THREE QUARTERS		      -> -
	      U+00C6 CAPITAL LETTER ASH		      -> A
	      U+00DF SMALL LETTER SHARP	S	      -> s
	      U+00E6 SMALL LETTER ASH		      -> a
	      U+0132 LIGATURE IJ		      -> I
	      U+0133 LIGATURE ij		      -> i
	      U+0152 LIGATURE OE		      -> O
	      U+0153 LIGATURE oe		      -> o
	      U+01F1 CAPITAL LETTER DZ		      -> D
	      U+01F2 MIXED LETTER Dz		      -> D
	      U+01F3 SMALL LETTER DZ		      -> d
	      U+02A6 SMALL LETTER TS DIGRAPH	      -> t
	      U+2026 HORIZONTAL	ELLIPSIS	      -> .
	      U+20AC EURO SIGN			      -> E
	      U+22EF MIDLINE HORIZONTAL	ELLIPSIS      -> .
	      U+2190 LEFTWARDS ARROW		      -> <
	      U+2192 RIGHTWARDS	ARROW		      -> >
	      U+21D0 LEFTWARDS DOUBLE ARROW	      -> <
	      U+21D2 RIGHTWARDS	DOUBLE ARROW	      -> >
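
              The difference between -x and -y is easiest to see side by side.
              The Python sketch below takes a few rows from the two tables and
              shows that the single-character approximations keep the length
              of the text unchanged while the expansions do not. SAMPLE and
              replace are hypothetical names used only for this illustration.

```python
# character -> (expansion as with -x, approximation as with -y)
SAMPLE = {
    "\u00A9": ("(c)", "C"),   # copyright symbol
    "\u00BD": ("1/2", "-"),   # one half
    "\u00E6": ("ae", "a"),    # small letter ash
    "\u2026": ("...", "."),   # horizontal ellipsis
    "\u2192": ("->", ">"),    # rightwards arrow
}


def replace(text, single=False):
    """Expand listed characters, or approximate them if single is True."""
    index = 1 if single else 0
    return "".join(SAMPLE[ch][index] if ch in SAMPLE else ch for ch in text)


text = "a \u2192 b\u2026"
print(replace(text))               # a -> b...
print(replace(text, single=True))  # a > b.
```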

       -Z <format>
              Generate output using the supplied format. The format specified
              will be used as the format string in a call to printf(3) with a
              single argument consisting of an unsigned long integer. For
              example, to obtain the same output as with the U format (-a U),
              the format would be: \u%04X.
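
              The effect of a user-supplied format can be previewed with
              Python's printf-style % operator, which accepts the same %04X
              conversion. This is a sketch only: Python's formatting is
              similar to but not identical with printf(3), and apply_format is
              a hypothetical name.

```python
def apply_format(fmt, text):
    """Format the code point of each non-ASCII character with fmt."""
    return "".join(ch if ord(ch) < 128 else fmt % ord(ch) for ch in text)


print(apply_format("\\u%04X", "caf\u00e9"))  # caf\u00E9
print(apply_format("<%d>", "caf\u00e9"))     # caf<233>
```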

       If conversion of spaces is disabled (as it is by default) and space
       characters outside the ASCII range are encountered (U+3000 ideographic
       space, U+1361 Ethiopic word space, and U+1680 ogham space mark), they
       are replaced with the ASCII space character (0x20) so as to keep the
       output pure 7-bit ASCII.

       Note  that  XML	and XHTML numeric character entities are like those of
       HTML with two restrictions. First, in  X(HT)ML  the  terminating	 semi-
       colon  may  not	be omitted.  Second, in	X(HT)ML	the "x"	must be	lower-
       case, while in HTML it may be either upper- or  lower-case.  We	always
       generate	 the  terminating  semi-colon and use a	lower-case "x",	so the
       option dubbed "HTML" produces valid XML and XHTML as well.

EXIT STATUS
       The following values are	returned on exit:

       0 SUCCESS
	      The input	was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO
              The user requested information such as the version number or
              usage synopsis and this has been provided.

       5 BAD OPTION
	      An incorrect option flag was given on the	command	line.

       8 BAD RECORD
	      Ill-formed UTF-8 was detected in the input.
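
       A calling script can use these values to tell a failed conversion
       from an informational exit. The Python sketch below assumes that
       uni2ascii is installed and on the PATH. The helper names are
       hypothetical, and only the status values listed above are taken from
       this page.

```python
import subprocess

# Exit status values as documented above.
EXIT_STATUS = {
    0: "SUCCESS",
    2: "I/O ERROR",
    3: "INFO",
    5: "BAD OPTION",
    8: "BAD RECORD",
}


def describe_exit(code):
    """Return the documented name for an exit status, if there is one."""
    return EXIT_STATUS.get(code, "UNKNOWN")


def run_uni2ascii(args, data):
    """Run uni2ascii on data (bytes) and return (status name, stdout)."""
    # Assumes a uni2ascii executable is available on the PATH.
    proc = subprocess.run(["uni2ascii", *args], input=data,
                          capture_output=True)
    return describe_exit(proc.returncode), proc.stdout
```

       For example, a result whose status name is "BAD RECORD" indicates
       ill-formed UTF-8 in the input rather than a problem with the options.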

SEE ALSO
       ascii2uni(1), Text::Unidecode

AUTHOR
       Bill Poser <billposer@alum.mit.edu>

LICENSE
       GNU General Public License

				 August, 2013			  uni2ascii(1)