uni2ascii(1)                General Commands Manual               uni2ascii(1)

NAME
       uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII
       representations

SYNOPSIS
       uni2ascii [options] (<input file name>)

DESCRIPTION
       uni2ascii converts UTF-8 Unicode to various 7-bit ASCII
       representations. If no format is specified, standard hexadecimal
       format (e.g. 0x00e9) is used. It reads from the standard input and
       writes to the standard output.

       Command line options are:

       -A     List the single character approximations carried out by the
              -y flag.

       -a <format>
              Convert to the specified format. Formats may be specified by
              means of the following arbitrary single character codes, by
              means of names such as "SGML_decimal", and by examples of the
              desired format.

              A      Generate hexadecimal numbers with prefix U in
                     angle-brackets (<U00E9>).

              B      Generate \x-escaped hex (e.g. \x00E9).

              C      Generate \x escaped hexadecimal numbers in braces
                     (e.g. \x{00E9}).

              D      Generate decimal HTML numeric character references
                     (e.g. &#0233;).

              E      Generate hexadecimal with prefix U (U00E9).

              F      Generate hexadecimal with prefix u (u00E9).

              G      Generate hexadecimal numbers in single quotes with
                     prefix X (e.g. X'00E9').

              H      Generate hexadecimal HTML numeric character references
                     (e.g. &#x00E9;).

              I      Generate hexadecimal UTF-8 with each byte's hex
                     preceded by an =-sign (e.g. =C3=A9). This is the Quoted
                     Printable format defined by RFC 2045.

              J      Generate hexadecimal UTF-8 with each byte's hex
                     preceded by a %-sign (e.g. %C3%A9). This is the URI
                     escape format defined by RFC 2396.

              K      Generate octal UTF-8 with each byte escaped by a
                     backslash (e.g. \303\251).

              L      Generate \U-escaped hex outside the BMP, \u-escaped hex
                     within the BMP (U+0000-U+FFFF).

              M      Generate hexadecimal SGML numeric character references
                     (e.g. \#xE9;).

              N      Generate decimal SGML numeric character references
                     (e.g. \#233;).

              O      Generate octal escapes for the three low bytes in
                     big-endian order (e.g. \000\000\351).

              P      Generate hexadecimal numbers with prefix U+
                     (e.g. U+00E9).

              Q      Generate character entities (e.g. &eacute;) where
                     possible, otherwise hexadecimal numeric character
                     references.

              R      Generate raw hexadecimal numbers (e.g. 00E9).

              S      Generate hexadecimal escapes for the three low bytes in
                     big-endian order (e.g. \x00\x00\xE9).

              T      Generate decimal escapes for the three low bytes in
                     big-endian order (e.g. \d000\d000\d233).

              U      Generate \u-escaped hexadecimal numbers (e.g. \u00E9).

              V      Generate \u-escaped decimal numbers (e.g. \u00233).

              X      Generate standard hexadecimal numbers (e.g. 0x00E9).

              0      Generate hexadecimal UTF-8 with each byte's hex
                     enclosed within angle brackets (e.g. <C3><A9>).

              1      Generate Common Lisp format hexadecimal numbers
                     (e.g. #x00E9).

              2      Generate Perl format decimal numbers with prefix v
                     (e.g. v233).

              3      Generate hexadecimal numbers with prefix $
                     (e.g. $00E9).

              4      Generate Postscript format hexadecimal numbers with
                     prefix 16# (e.g. 16#00E9).

              5      Generate Common Lisp format hexadecimal numbers with
                     prefix #16r (e.g. #16r00E9).

              6      Generate ADA format hexadecimal numbers with prefix 16#
                     and suffix # (e.g. 16#00E9#).

              7      Generate Apache log format hexadecimal UTF-8 with each
                     byte's hex preceded by a backslash-x (e.g. \xC3\xA9).

              8      Generate Microsoft OOXML format hexadecimal numbers
                     with prefix _x and suffix _ (e.g. _x00E9_).

              9      Generate %\u-escaped hexadecimal numbers
                     (e.g. %\u00E9).

       -B     Transform to ASCII if possible. This option is equivalent to
              the combination of the options -c, -d, -e, -f, and -x.

       -c     Convert circled and parenthesized characters to their
              unenclosed counterparts.

       -d     Strip diacritics. This converts single codepoints representing
              characters with diacritics to the corresponding ASCII
              character and deletes separately encoded diacritics.
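The effect described for -d can be approximated outside the program. The following is a minimal Python sketch, not uni2ascii's own implementation (which may well use its own lookup table and cover more characters): it assumes that Unicode canonical decomposition is a reasonable stand-in, so letters that have no decomposition (for example U+00F8, o with stroke) pass through unchanged. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Rough stand-in for -d: decompose, drop combining marks, recompose.

    Assumption: canonical decomposition (NFD) approximates the mapping
    uni2ascii applies; the real program may differ.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301.
    kept = [ch for ch in decomposed if unicodedata.combining(ch) == 0]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    print(strip_diacritics("caf\u00e9"))    # precomposed e-acute
    print(strip_diacritics("cafe\u0301"))   # e followed by a combining acute
```

Both calls print "cafe", covering the two cases the option description names: a single codepoint carrying a diacritic, and a separately encoded diacritic.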
       -e     Convert characters to their approximate ASCII equivalents, as
              follows:

              U+0085  next line                                   0x0A  newline
              U+00A0  no break space                              0x20  space
              U+00AB  left-pointing double angle quotation mark   0x22  double quote
              U+00AD  soft hyphen                                 0x2D  minus
              U+00AF  macron                                      0x2D  minus
              U+00B7  middle dot                                  0x2E  period
              U+00BB  right-pointing double angle quotation mark  0x22  double quote
              U+1361  ethiopic word space                         0x20  space
              U+1680  ogham space                                 0x20  space
              U+2000  en quad                                     0x20  space
              U+2001  em quad                                     0x20  space
              U+2002  en space                                    0x20  space
              U+2003  em space                                    0x20  space
              U+2004  three-per-em space                          0x20  space
              U+2005  four-per-em space                           0x20  space
              U+2006  six-per-em space                            0x20  space
              U+2007  figure space                                0x20  space
              U+2008  punctuation space                           0x20  space
              U+2009  thin space                                  0x20  space
              U+200A  hair space                                  0x20  space
              U+200B  zero-width space                            0x20  space
              U+2010  hyphen                                      0x2D  minus
              U+2011  non-breaking hyphen                         0x2D  minus
              U+2012  figure dash                                 0x2D  minus
              U+2013  en dash                                     0x2D  minus
              U+2014  em dash                                     0x2D  minus
              U+2018  left single quotation mark                  0x60  left single quote
              U+2019  right single quotation mark                 0x27  right or neutral single quote
              U+201A  single low-9 quotation mark                 0x60  left single quote
              U+201B  single high-reversed-9 quotation mark       0x60  left single quote
              U+201C  left double quotation mark                  0x22  double quote
              U+201D  right double quotation mark                 0x22  double quote
              U+201E  double low-9 quotation mark                 0x22  double quote
              U+201F  double high-reversed-9 quotation mark       0x22  double quote
              U+2022  bullet                                      0x6F  small letter o
              U+2028  line separator                              0x0A  newline
              U+2032  prime                                       0x27  right or neutral single quote
              U+2033  double prime                                0x22  double quote
              U+2039  single left-pointing angle quotation mark   0x60  left single quote
              U+203A  single right-pointing angle quotation mark  0x27  right or neutral single quote
              U+204E  low asterisk                                0x2A  asterisk
              U+2212  minus sign                                  0x2D  minus
              U+2216  set minus                                   0x5C  backslash
              U+2217  asterisk operator                           0x2A  asterisk
              U+2223  divides                                     0x7C  vertical line
              U+2500  box drawing light horizontal                0x2D  minus
              U+2501  box drawing heavy horizontal                0x2D  minus
              U+2502  box drawing light vertical                  0x7C  vertical line
              U+2503  box drawing heavy vertical                  0x7C  vertical line
              U+2731  heavy asterisk                              0x2A  asterisk
              U+275D  heavy double turned comma quotation mark    0x22  double quote
              U+275E  heavy double comma quotation mark           0x22  double quote
              U+3000  ideographic space                           0x20  space
              U+FE60  small ampersand                             0x26  ampersand
              U+FE61  small asterisk                              0x2A  asterisk
              U+FE62  small plus sign                             0x2B  plus sign

       -E     List the expansions performed by the -x flag.

       -f     Convert stylistic variants to plain ASCII. Stylistic
              equivalents include: superscript and subscript forms, small
              capitals (e.g. U+1D04), script forms (e.g. U+212C), black
              letter forms (e.g. U+212D), fullwidth forms (e.g. U+FF01),
              halfwidth forms (e.g. U+FF7B), and the mathematical
              alphanumeric symbols (e.g. U+1D400).

       -h     Help. Print the usage message and exit.

       -l     Use lowercase a-f when generating hexadecimal numbers.

       -n     Convert newlines too. By default, they are left alone.

       -P     Pass through Unicode rather than converting to ASCII escapes
              if the character is not converted to an ASCII character by a
              transformation such as diacritic stripping. Note that if this
              option is used the output may not be pure ASCII.

       -p     Pure. In addition to the characters above the ASCII range,
              convert characters within the ASCII range, except for space
              and newline.

       -q     Quiet. Do not chat unnecessarily while working.

       -s     Convert space characters too. By default, they are left alone.

       -S <Unicode:ASCII>
              Define a custom substitution. The argument should consist of
              the Unicode codepoint to be replaced followed by the ASCII
              code of the character to be used as replacement, separated by
              a colon. If no ASCII code follows the colon, the specified
              Unicode character will be deleted. The code values may be in
              hexadecimal, octal, or decimal following the usual conventions
              (to be precise, those of strtoul(3)). This option may be
              repeated as many times as desired to define multiple
              substitutions.

       -v     Print program version information and exit.
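The argument syntax of the -S option (codepoint, colon, optional ASCII code) can be illustrated with a small parser. This is a Python sketch under stated assumptions, not code from uni2ascii: it assumes the "usual conventions" mean strtoul(3) called with base 0 (0x prefix for hexadecimal, leading 0 for octal, otherwise decimal), and it is stricter than strtoul(3), which also accepts a sign and stops at the first invalid character. The names parse_number and parse_substitution are hypothetical.

```python
from typing import Optional, Tuple


def parse_number(token: str) -> int:
    """Read an integer using the base-0 conventions of strtoul(3).

    Assumption: 0x/0X prefix means hexadecimal, a leading 0 means octal,
    anything else is decimal. Invalid input raises ValueError.
    """
    t = token.strip()
    if t[:2].lower() == "0x":
        return int(t[2:], 16)
    if len(t) > 1 and t[0] == "0":
        return int(t[1:], 8)
    return int(t, 10)


def parse_substitution(arg: str) -> Tuple[int, Optional[int]]:
    """Split 'Unicode:ASCII' into (codepoint, replacement).

    A replacement of None stands for "delete this character", which is
    what an empty right-hand side means according to the option text.
    """
    left, sep, right = arg.partition(":")
    if not sep:
        raise ValueError("expected <Unicode>:<ASCII>")
    codepoint = parse_number(left)
    replacement = parse_number(right) if right.strip() else None
    return codepoint, replacement


if __name__ == "__main__":
    print(parse_substitution("0x2022:0x2A"))  # bullet -> asterisk, hex
    print(parse_substitution("8226:052"))     # same pair, decimal and octal
    print(parse_substitution("0x2022:"))      # delete the bullet
```

The three calls print (8226, 42), (8226, 42) and (8226, None): the same substitution written in different bases, followed by a deletion.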
       -w     Add a space after each converted item.

       -x     Expand certain characters to multicharacter sequences. The
              characters affected are the same as those affected by the -y
              option.

              U+00A2  CENT SIGN                    -> cent
              U+00A3  POUND SIGN                   -> pound
              U+00A5  YEN SIGN                     -> yen
              U+00A9  COPYRIGHT SYMBOL             -> (c)
              U+00AE  REGISTERED SYMBOL            -> (R)
              U+00BC  ONE QUARTER                  -> 1/4
              U+00BD  ONE HALF                     -> 1/2
              U+00BE  THREE QUARTERS               -> 3/4
              U+00C6  CAPITAL LETTER ASH           -> AE
              U+00DF  SMALL LETTER SHARP S         -> ss
              U+00E6  SMALL LETTER ASH             -> ae
              U+0132  LIGATURE IJ                  -> IJ
              U+0133  LIGATURE ij                  -> ij
              U+0152  LIGATURE OE                  -> OE
              U+0153  LIGATURE oe                  -> oe
              U+01F1  CAPITAL LETTER DZ            -> DZ
              U+01F2  MIXED LETTER Dz              -> Dz
              U+01F3  SMALL LETTER DZ              -> dz
              U+02A6  SMALL LETTER TS DIGRAPH      -> ts
              U+2026  HORIZONTAL ELLIPSIS          -> ...
              U+20AC  EURO SIGN                    -> euro
              U+2122  TRADEMARK SIGN               -> (tm)
              U+22EF  MIDLINE HORIZONTAL ELLIPSIS  -> ...
              U+2190  LEFTWARDS ARROW              -> <-
              U+2192  RIGHTWARDS ARROW             -> ->
              U+21D0  LEFTWARDS DOUBLE ARROW       -> <=
              U+21D2  RIGHTWARDS DOUBLE ARROW      -> =>
              U+FB00  LATIN SMALL LIGATURE FF      -> ff
              U+FB01  LATIN SMALL LIGATURE FI      -> fi
              U+FB02  LATIN SMALL LIGATURE FL      -> fl
              U+FB03  LATIN SMALL LIGATURE FFI     -> ffi
              U+FB04  LATIN SMALL LIGATURE FFL     -> ffl
              U+FB06  LATIN SMALL LIGATURE ST      -> st

       -y     Convert certain characters having multi-character expansions
              to single-character ASCII approximations instead (e.g. to
              maintain character-positioning). The characters affected are
              the same as those affected by the -x option.

              U+00A2  CENT SIGN                    -> c
              U+00A3  POUND SIGN                   -> #
              U+00A5  YEN SIGN                     -> Y
              U+00A9  COPYRIGHT SYMBOL             -> C
              U+00AE  REGISTERED SYMBOL            -> R
              U+00BC  ONE QUARTER                  -> -
              U+00BD  ONE HALF                     -> -
              U+00BE  THREE QUARTERS               -> -
              U+00C6  CAPITAL LETTER ASH           -> A
              U+00DF  SMALL LETTER SHARP S         -> s
              U+00E6  SMALL LETTER ASH             -> a
              U+0132  LIGATURE IJ                  -> I
              U+0133  LIGATURE ij                  -> i
              U+0152  LIGATURE OE                  -> O
              U+0153  LIGATURE oe                  -> o
              U+01F1  CAPITAL LETTER DZ            -> D
              U+01F2  MIXED LETTER Dz              -> D
              U+01F3  SMALL LETTER DZ              -> d
              U+02A6  SMALL LETTER TS DIGRAPH      -> t
              U+2026  HORIZONTAL ELLIPSIS          -> .
              U+20AC  EURO SIGN                    -> E
              U+22EF  MIDLINE HORIZONTAL ELLIPSIS  -> .
              U+2190  LEFTWARDS ARROW              -> <
              U+2192  RIGHTWARDS ARROW             -> >
              U+21D0  LEFTWARDS DOUBLE ARROW       -> <
              U+21D2  RIGHTWARDS DOUBLE ARROW      -> >

       -Z <format>
              Generate output using the supplied format. The format
              specified will be used as the format string in a call to
              printf(3) with a single argument consisting of an unsigned
              long integer. For example, to obtain the same output as with
              format U (-a U), the format would be: \u%04X.

       If conversion of spaces is disabled (as it is by default) and space
       characters outside the ASCII range are encountered (U+3000
       ideographic space, U+1361 Ethiopic word space, and U+1680 ogham space
       mark), they are replaced with the ASCII space character (0x20) so as
       to keep the output pure 7-bit ASCII.

       Note that XML and XHTML numeric character entities are like those of
       HTML with two restrictions. First, in X(HT)ML the terminating
       semi-colon may not be omitted. Second, in X(HT)ML the "x" must be
       lower-case, while in HTML it may be either upper- or lower-case. We
       always generate the terminating semi-colon and use a lower-case "x",
       so the option dubbed "HTML" produces valid XML and XHTML as well.

EXIT STATUS
       The following values are returned on exit:

       0 SUCCESS
              The input was successfully converted.

       2 I/O ERROR
              A system error occurred during input or output.

       3 INFO
              The user requested information such as the version number or
              usage synopsis and this has been provided.

       5 BAD OPTION
              An incorrect option flag was given on the command line.

       8 BAD RECORD
              Ill-formed UTF-8 was detected in the input.

SEE ALSO
       ascii2uni(1), Text::Unidecode

AUTHOR
       Bill Poser <billposer@alum.mit.edu>

LICENSE
       GNU General Public License

                                 August, 2013                     uni2ascii(1)
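The -Z option passes each codepoint to printf(3) through a user-supplied format string. Python's % operator follows the same conversion syntax for the cases used here, so it can serve to preview what a given format string would yield. The sketch below is only an illustration and does not invoke uni2ascii; the helper name render is hypothetical, and the four format strings are chosen to reproduce the examples this page gives for formats U, X, P and 2.

```python
def render(fmt: str, codepoint: int) -> str:
    """Apply a printf-style format string to one codepoint.

    Assumption: Python's % formatting matches printf(3) for the simple
    %X and %d conversions shown below.
    """
    return fmt % codepoint


if __name__ == "__main__":
    # U+00E9 is the example character used throughout the format list.
    for fmt in (r"\u%04X", "0x%04X", "U+%04X", "v%d"):
        print(render(fmt, 0x00E9))
```

This prints \u00E9, 0x00E9, U+00E9 and v233, one per line, which are the example outputs listed for the corresponding -a formats.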
