UNICONV(1)			LINUX COMMANDS			    UNICONV(1)

NAME
       uniconv - convert text to native	formats	through	Unicode

SYNOPSIS
       uniconv -out output-file [ -decode input-encoding ] [ -encode
       output-encoding ] [ input-file ] [ -todos ] [ -fromdos ] [ -tomac ]
       [ -frommac ]

DESCRIPTION
       The uniconv program decodes text in one encoding and encodes it in
       another.  The text is a 16-, 8-, or 7-bit byte stream.  The converted
       text is sent to the standard output, even for 16-bit encoding methods,
       unless an output file is specified with the -out option.

       The -decode and -encode options are optional; the default converter is
       utf-8.  The program reads the Unicode map helper files (*.my) from the
       default directory /usr/local/share/data.  Simple 1-to-1 encoding
       methods can be added on the fly by adding a my-file, or by setting the
       yudit.datapath property in ~/.yudit/yudit.properties or
       /usr/local/share/yudit/config/yudit.properties.  By default
       /usr/local/share/yudit/data and ~/.yudit/data are searched.

       My-files can be created by a program called makeumap.  The files can be
       converted between dos/unix/mac line-ending variants with the -fromdos,
       -frommac, -todos, and -tomac options.  The default (when none is
       specified) is Unix.
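
       The line-ending conversion described above can be sketched in Python.
       This is an illustration only, not uniconv's implementation:
       convert_line_endings and LINE_ENDINGS are hypothetical names, and the
       sketch assumes any incoming variant is normalized to Unix first and
       then rewritten in the target variant.

```python
# Hypothetical helper for illustration; not part of uniconv.
LINE_ENDINGS = {"unix": "\n", "dos": "\r\n", "mac": "\r"}

def convert_line_endings(text, target="unix"):
    """Normalize dos/unix/mac line endings, then emit the target variant."""
    # Replace "\r\n" first so a DOS ending is not counted as two line breaks.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", LINE_ENDINGS[target])

if __name__ == "__main__":
    mixed = "first\r\nsecond\rthird\n"
    for name in ("unix", "dos", "mac"):
        print(name, repr(convert_line_endings(mixed, name)))
```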

ENCODING
       If you received this program through the	Yudit distribution, then as of
       today you can convert between the encoding methods below.

       utf-8  Yudit recommends this format for international information
              exchange.  ASCII text passes through intact, while other
              Unicode characters get their 8th bit set, and the length of
              the code depends on how far away they are in the Unicode
              space.  This is the only transformation format that can encode
              both 16-bit (ucs-2) and 31-bit (ucs-4) Unicode.
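
              How the encoded length grows with the code point can be seen
              with Python's built-in utf-8 codec.  This is a sketch for
              illustration, not uniconv itself; utf8_length is a
              hypothetical name, and Python's codec stops at U+10FFFF, so
              it does not show the longer forms of the 31-bit range.

```python
# Illustration with Python's built-in codec; not part of uniconv.
def utf8_length(code_point):
    """Return how many bytes utf-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))

if __name__ == "__main__":
    # ASCII, Latin-1 supplement, rest of the BMP, beyond the BMP.
    for cp in (0x41, 0xE9, 0x20AC, 0x10348):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), {encoded.hex(' ')}")
```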

       utf-8-s
              Hacker's utf-8 format.  It does not give an error message when
              a surrogate pair is decoded, and it can encode a surrogate pair
              'as is'.  This is not a recommended encoding format, although
              it is used to encode and decode clipboard data in order to
              preserve input.
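
              Python's "surrogatepass" error handler behaves in a comparable
              way and may help picture the difference from strict utf-8.
              This is an analogy only: that utf-8-s writes exactly these
              bytes is an assumption, and encode_strict and encode_lenient
              are hypothetical names.

```python
# Illustration with Python's codecs; not part of uniconv.
def encode_strict(text):
    """Standard utf-8: surrogate code points are rejected."""
    return text.encode("utf-8")

def encode_lenient(text):
    """Surrogates are written 'as is', like any other 16-bit value."""
    return text.encode("utf-8", errors="surrogatepass")

if __name__ == "__main__":
    lone_surrogate = "\ud83d"  # first half of a surrogate pair, on its own
    try:
        encode_strict(lone_surrogate)
    except UnicodeEncodeError as err:
        print("strict utf-8 refused:", err.reason)
    data = encode_lenient(lone_surrogate)
    print("lenient bytes:", data.hex(" "))
    # Decoding with the same handler restores the surrogate unchanged.
    print(data.decode("utf-8", errors="surrogatepass") == lone_surrogate)
```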

       utf-16 Although 16 is bigger than 8, this is still a compromise
              required by OSes like Windows that cannot handle ucs-4.  This
              encoding produces 16-bit Unicode streams.  In addition to the
              BMP it can convert 16 planes using the Unicode Surrogate Area.
              This encoding cannot convert anything above U+10FFFF (Plane
              16).  The input byte order is recognized by the BOM (byte-order
              mark) U+FEFF in the first two bytes.  This format is used in
              Windows NT for documents like notepad .txt files.
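
              The two mechanisms named above, byte-order detection from
              U+FEFF and the surrogate split for the 16 supplementary
              planes, can be sketched as follows.  The function names are
              hypothetical, and the code shows the standard utf-16
              arithmetic rather than uniconv's own source.

```python
import codecs

# Illustration only; function names are hypothetical, not uniconv APIs.
def detect_utf16_byte_order(data):
    """Return 'be', 'le', or 'unknown' from the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "le"
    return "unknown"

def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range utf-16 surrogates can represent")
    offset = code_point - 0x10000
    # Top 10 bits go to the high surrogate, bottom 10 bits to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

if __name__ == "__main__":
    print(detect_utf16_byte_order(b"\xff\xfeA\x00"))
    high, low = to_surrogates(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
```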

       utf-16-be
	      Big endian utf-16	converter.

       utf-16-le
              Little endian utf-16 converter.

       utf-7  This is the recommended format for international information
              exchange when only 7-bit can be used.  It can only handle
              16-bit (utf-16) Unicode; for ucs-4 (above U+10FFFF) you should
              use utf-8 encoding.

       iso-8859-1
	      This  is	the  ISO 8859-1	character  encoding format. It is also
	      known as "Latin-1" encoding.

       iso-8859-2
	      This  is	the ISO	8859-2 character encoding format. It  is  also
	      known as "Central	European" encoding.

       iso-8859-5
	      This  is	the  ISO  8859-5 character encoding format. It is also
	      known as "Cyrillic" encoding.

       iso-8859-7
	      This is the ISO 8859-7 character encoding	 format.  It  is  also
	      known as "Greek" encoding.

       iso-8859-9
	      This  is	the  ISO  8859-9 character encoding format. It is also
	      known as "Turkish" encoding.

       koi8-r This is the KOI8-R character encoding format. It is mainly  used
	      in Russia.

       cp-1251
              This is the CP1251 Cyrillic character encoding format.  It is
              mainly used in Microsoft Windows and some web sites.

       iso-2022-jp
	      This is a	Japanese character encoding format. It is a 7-bit  en-
	      coding format.

       iso-2022-jp-3
              This is a Japanese character encoding format.  It is a 7-bit
              encoding format.  It is based upon the JIS X 0213 standard.

       euc-jp This is a	Japanese character encoding format. It is an 8-bit en-
	      coding format.  Mainly used in UNIX systems.

       euc-jp-3
              The official name is EUC-JISX0213 - I just could not read this.
              This is a Japanese character encoding format.  It is an 8-bit
              encoding format.  It is based upon the JIS X 0213 standard.

       shift-jis
	      This is a	Japanese character encoding format.  It	 is  an	 8-bit
	      encoding format. Mainly used in MSDOS/Windows.

       shift-jis-3
	      The  official  name  is  Shift_JISX0213  - I just	could not read
	      this.  This is a Japanese	character encoding format.  It	is  an
	      8-bit encoding format. Mainly used in MSDOS/Windows.

       iso-2022-jp
              This is a Japanese 7-bit character encoding format.  Email
              messages in iso-2022-jp can be decoded and encoded in this
              format.

       iso-2022-x11
	      This  is a Japanese character encoding format.  It is also known
	      as  "COMPOUND_TEXT" encoding for the X  Window System. This is a
	      7-bit encoding format.  It can be	derived	from the  ISO  2022-JP
	      format with some differences.

       ksc-5601-x11
              This is a Korean character encoding format used by the X Window
              System (COMPOUND_TEXT encoding) to encode Korean (KS X 1001)
              and US-ASCII.  This is a 7-bit encoding format compliant with
              the ISO-2022 specification for encoding of multiple character
              sets.  Please note that this is DIFFERENT from ISO-2022-KR
              (defined in IETF RFC 1557).

       euc-kr This is an 8-bit multibyte encoding for Korean.  It encodes
              US-ASCII (7-bit) in the single-byte range and characters in
              KS X 1001 (formerly KS C 5601) in the double-byte range with
              the MSB on (8-bit).  It is used in Unix and on the Internet.
              Korean versions of MS-DOS, MacOS and MS-Windows use a
              compatible (in most cases, identical) variant of this encoding.
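
              The single-byte and double-byte ranges can be observed with
              Python's "euc_kr" codec.  This is an illustration only, not
              uniconv itself, and euc_kr_layout is a hypothetical name.

```python
# Illustration with Python's "euc_kr" codec; not part of uniconv.
def euc_kr_layout(text):
    """Return (character, byte count, all-bytes-have-MSB-set) per character."""
    rows = []
    for ch in text:
        encoded = ch.encode("euc_kr")
        msb_on = all(byte & 0x80 for byte in encoded)
        rows.append((ch, len(encoded), msb_on))
    return rows

if __name__ == "__main__":
    # "A" is US-ASCII; "\ud55c" is the Hangul syllable U+D55C from KS X 1001.
    for ch, size, msb_on in euc_kr_layout("A\ud55c"):
        print(f"U+{ord(ch):04X}: {size} byte(s), MSB set on all bytes: {msb_on}")
```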

       johab  This is a Korean encoding specified in KS X 1001 (KS C
              5601-1992), Annex 3 as a supplementary encoding.  It was widely
              used in Korean MS-DOS until the mid-1990s.  It can encode all
              Hangul syllables (11,172) of modern Korean as well as all the
              special symbols and Hanja (Chinese ideograms used in Korea)
              defined in KS X 1001.

       uhc    A variant of EUC-KR used in Korean MS-Windows 95/98 (a
              proprietary encoding of Microsoft, CP949).  Its character
              repertoire includes all modern syllables of Hangul, the Korean
              script, as well as all the special symbols and Hanja (Chinese
              ideograms used in Korea) defined in KS X 1001.

       gb-18030
	      This is a	Chinese	character encoding format based	upon GB	18030.
	      It encodes the whole U+0000..U+10FFFF range, while being compat-
	      ible with	gb-2312.

       gb-2312-x11
	      This  is a Chinese character encoding format based upon GB 2312.
	      It is a 7-bit encoding format.

       gb-2312
	      This is a	Chinese	character encoding format based	upon GB	 2312.
	      It is an 8-bit encoding format.

       big-5  This  is a Chinese character encoding format based upon BIG5 en-
	      coding.  It is an	8-bit encoding format.

       hz     This is a	Chinese	character encoding format based	 upon  "Hanzi"
	      encoding.	 It is a 7-bit encoding	format.

       viscii This is a	Vietnamese character encoding format.

       ucs-2-be
              This converts 16-bit Unicode (ucs-2) streams.  This format
              handles the big-endian variant.  Yudit does not recommend this
              format.

       ucs-2-le
              This converts 16-bit Unicode (ucs-2) streams.  This format
              handles the little-endian variant.  Yudit does not recommend
              this format.

       ucs-2  This converts 16-bit Unicode (ucs-2) streams.  The input byte
              order is recognized by the BOM (byte-order mark) U+FEFF in the
              first two bytes.  Yudit does not recommend this format.

       java   This converts \uxxxx character escapes.  When encoding, all
              characters above U+0080 are escaped with a string like
              '\u0080'.  When decoding, the same format is decoded and, in
              addition, the utf-8 format is recognized, so it can also be
              used to recover data accidentally saved with the wrong
              encoding.  The U+10000..U+10FFFF area is converted to
              surrogates and vice versa.
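
              A minimal sketch of this escaping scheme, under stated
              assumptions: java_encode and java_decode are hypothetical
              names, U+0080 itself is taken to be escaped (as the '\u0080'
              example suggests), hex digits are written in lower case, and
              the additional utf-8 recognition on decode is left out.

```python
import re

# Illustration only; names and details are assumptions, not uniconv's code.
def java_encode(text):
    """Escape every character from U+0080 upward as \\uxxxx."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")
        else:
            # U+10000..U+10FFFF becomes a surrogate pair, one escape each.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)

def java_decode(text):
    """Turn \\uxxxx escapes back into characters, joining surrogate pairs."""
    raw = re.sub(r"\\u([0-9a-fA-F]{4})",
                 lambda m: chr(int(m.group(1), 16)), text)
    # A round trip through utf-16 merges adjacent high/low surrogates.
    data = raw.encode("utf-16-le", errors="surrogatepass")
    return data.decode("utf-16-le", errors="surrogatepass")

if __name__ == "__main__":
    sample = "caf\u00e9 \U0001F600"
    escaped = java_encode(sample)
    print(escaped)
    print(java_decode(escaped) == sample)
```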

       java-s This converts \uxxxx character escapes.  When encoding, all
              characters above U+0080 are escaped with a string like
              '\u0080'.  When decoding, the same format is decoded and, in
              addition, the utf-8 format is recognized, so it can also be
              used to recover data accidentally saved with the wrong
              encoding.  Surrogates are not treated specially during
              conversion, which is why this is not a recommended conversion.

FILES
       ~/.yudit/yudit.properties or
       /usr/local/share/yudit/config/yudit.properties
              can have the yudit.datapath property.  This is where the map
              files are kept.  By default /usr/local/share/yudit/data is
              searched.

SEE ALSO
       makeumap

AUTHOR
       This program was written by Gaspar Sinai (gaspar@yudit.org).  Last
       updated: 5 February 2023, Tokyo.

LINUX COMMANDS			  Nov 5	1997			    UNICONV(1)
