unidesc(1)		    General Commands Manual		    unidesc(1)

NAME
       unidesc - Describe the contents of a Unicode text file

SYNOPSIS
       unidesc [option flags] [<file name>]

       If  no input file name is supplied, unidesc reads from the standard in-
       put.

DESCRIPTION
       unidesc describes the content of a Unicode text file by reporting the
       character ranges to which different portions of the text belong. The
       ranges reported include both official Unicode ranges and the
       constructed language ranges within the Private Use Areas registered
       with the Conscript Unicode Registry
       (http://www.evertype.com/standards/csur/). For each range of
       characters, unidesc prints the character or byte offset of the
       beginning of the range, the character or byte offset of the end of the
       range, and the name of the range. Offsets start from 0.
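
       The run-based report described above can be illustrated with a short
       Python sketch. It is not unidesc's implementation: the three-entry
       BLOCKS table, the helper names block_name and describe, the printed
       layout, and the choice of an inclusive end offset are all assumptions
       made for illustration. It also ignores any special treatment of
       digits, punctuation, and whitespace.

```python
# Hypothetical subset of Unicode blocks: (first code point, last, name).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def block_name(ch):
    """Return the name of the block containing ch, or 'Unknown'."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "Unknown"


def describe(text):
    """Return (start, end, name) runs using 0-based character offsets.

    The end offset is inclusive here; that is an assumption of this sketch.
    """
    runs = []
    for offset, ch in enumerate(text):
        name = block_name(ch)
        if runs and runs[-1][2] == name:
            start, _, _ = runs[-1]
            runs[-1] = (start, offset, name)      # extend the current run
        else:
            runs.append((offset, offset, name))   # start a new run
    return runs


for start, end, name in describe("abcαβγдеж"):
    print(start, end, name)
```

       For the nine-character sample string, the sketch yields three runs:
       offsets 0 to 2 in Basic Latin, 3 to 5 in Greek and Coptic, and 6 to 8
       in Cyrillic.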

       Since the ASCII digits, punctuation, and whitespace characters are
       frequently used by other writing systems, by default these characters
       are treated as neutral, that is, as not belonging exclusively to any
       particular character range. These characters are treated as belonging
       to the range of whatever characters precede them.

       If the input begins with neutral characters, they are treated as
       belonging to the range of whatever characters follow them. If the file
       consists entirely of neutral characters, the range is identified as
       Neutral followed by Basic Latin in square brackets.
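
       The neutral-character rules can be sketched in Python as follows.
       This is an illustration under stated assumptions, not the program's
       actual code: the two-block classifier, the function names, and the
       exact label string "Neutral [Basic Latin]" are hypothetical.

```python
import string

# ASCII digits, punctuation, and whitespace are neutral by default.
NEUTRAL = set(string.digits + string.punctuation + string.whitespace)


def block_name(ch):
    """Tiny hypothetical classifier that knows two blocks only."""
    cp = ord(ch)
    if cp <= 0x007F:
        return "Basic Latin"
    if 0x0370 <= cp <= 0x03FF:
        return "Greek and Coptic"
    return "Unknown"


def assign_ranges(text):
    """Return one range name per character, resolving neutral characters."""
    names = [None if ch in NEUTRAL else block_name(ch) for ch in text]

    # A neutral character takes the range of whatever precedes it.
    for i in range(1, len(names)):
        if names[i] is None:
            names[i] = names[i - 1]

    # Anything still unresolved is a leading neutral character, which
    # takes the range of whatever follows it.
    first = next((n for n in names if n is not None), None)
    if first is None:
        # The whole input is neutral (label spelling is assumed).
        return ["Neutral [Basic Latin]"] * len(names)
    return [first if n is None else n for n in names]


print(assign_ranges("αβ, ab"))
print(assign_ranges("12 αβ"))
print(assign_ranges("1 2"))
```

       In the first call the comma and space join the preceding Greek run;
       in the second the leading digits and space join the following Greek
       run; in the third every character is neutral.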

       A magic number identifying the Unicode encoding is not part of the
       Unicode standard, so pure Unicode files do not contain a magic number.
       However, informal conventions have arisen for this purpose. If the
       command line flag -m is given, unidesc will attempt to identify the
       Unicode subtype by examining the first few bytes of the input. If the
       input is identified as one of the two acceptable types, UTF-8 or
       native order UTF-32, unidesc will then proceed to describe the
       contents of the input. Otherwise, it will report what it has
       determined about the input and exit. Note that if the file does
       contain a magic number, you must use the -m flag. Without this flag
       unidesc assumes that the input consists of pure Unicode with the
       character data beginning immediately, so it will misinterpret the
       magic number as character data.
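
       A check of this kind might look like the Python sketch below. It
       assumes that the magic number in question is the byte order mark
       (BOM); which byte sequences unidesc itself recognizes is not stated
       here, and the names identify and is_acceptable are hypothetical.

```python
import codecs
import sys

# UTF-32 marks are tested first because the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
]


def identify(head):
    """Return (encoding name, mark length), or (None, 0) if no mark matches."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def is_acceptable(name):
    """Acceptable types per the text: UTF-8 or native order UTF-32."""
    if sys.byteorder == "little":
        native = "UTF-32 little-endian"
    else:
        native = "UTF-32 big-endian"
    return name in ("UTF-8", native)


name, skip = identify(b"\xef\xbb\xbfabc")
print(name, skip, is_acceptable(name))
```

       The returned length tells a caller how many leading bytes to skip
       before the character data begins, which is the step that is missed
       when a file with a magic number is read without the -m flag.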

       By default, input is expected to be UTF-8. Native order UTF-32 is also
       acceptable. UTF-32 may be specified via the command line flag -u or,
       if the command line flag -m is given, via the magic number.

COMMAND LINE FLAGS
       -b     Give file offsets in bytes rather than characters.

       -d     Treat the ASCII digits as belonging exclusively to the Basic
              Latin range.

       -h     Print usage information.

       -L     List the Unicode ranges alphabetically.

       -l     List the Unicode ranges by codepoint.

       -m     Check the file's magic number to determine the Unicode subtype.

       -p     Treat ASCII punctuation as belonging exclusively to the Basic
              Latin range.

       -r     Instead of listing ranges as they are encountered, just list the
              ranges detected after all input has been read.

       -u     Input is native order UTF-32.

       -v     Print version information.

       -w     Treat ASCII whitespace as belonging exclusively to the Basic
              Latin range.
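
       The difference between character offsets and the byte offsets
       selected by -b can be illustrated with a short Python sketch. The
       function name byte_offsets is hypothetical, and the sketch only shows
       how the two kinds of offset relate; it does not reproduce unidesc's
       output.

```python
def byte_offsets(text, encoding="utf-8"):
    """Map each 0-based character offset to the byte offset where it starts."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        position += len(ch.encode(encoding))
    return offsets


# In UTF-8, 'a' occupies 1 byte while 'α' and 'д' occupy 2 bytes each.
print(byte_offsets("aαд"))

# In UTF-32 every character occupies 4 bytes. The explicit "utf-32-le"
# codec is used so that Python does not prepend a byte order mark.
print(byte_offsets("aαд", "utf-32-le"))
```

       For UTF-8 input the two kinds of offset diverge as soon as a
       character outside ASCII appears; for UTF-32 the byte offset is always
       four times the character offset.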

SEE ALSO
       uniname(1)

REFERENCES
       Unicode Standard, version 5.0

AUTHOR
       Bill Poser
       billposer@alum.mit.edu

LICENSE
       GNU General Public License

				  June,	2007			    unidesc(1)
