FreeBSD Manual Pages

UNICODE_CANONICAL(3)	    Courier Unicode Library	  UNICODE_CANONICAL(3)

NAME
       unicode_canonical, unicode_ccc, unicode_decomposition_init,
       unicode_decomposition_deinit, unicode_decompose,
       unicode_decompose_reallocate_size, unicode_compose,
       unicode_composition_init, unicode_composition_deinit,
       unicode_composition_apply - unicode canonical normalization and
       denormalization

SYNOPSIS
       #include	<courier-unicode.h>

       unicode_canonical_t unicode_canonical(char32_t c);

       uint8_t unicode_ccc(char32_t c);

       void unicode_decomposition_init(unicode_decomposition_t *info,
                                       char32_t *string, size_t string_size,
                                       void *arg);

       int unicode_decompose(unicode_decomposition_t *info);

       void unicode_decomposition_deinit(unicode_decomposition_t *info);

       size_t unicode_decompose_reallocate_size(unicode_decomposition_t	*info,
						const size_t *sizes,
						size_t n);

       int unicode_compose(char32_t *string, size_t string_size, int flags,
			   size_t *new_size);

       int unicode_composition_init(const char32_t *string,
				    size_t string_size,	int flags,
				    unicode_composition_t *compositions);

       void unicode_composition_deinit(unicode_composition_t *compositions);

       size_t unicode_composition_apply(char32_t *string, size_t string_size,
					unicode_composition_t *compositions);

DESCRIPTION
       These functions compose or decompose a Unicode string into a canonical
       or a compatible normalized form.

       unicode_canonical() looks up the	character's canonical and
       compatibility mapping[1].  unicode_canonical() returns a	structure with
       the following fields:

       canonical_chars
	   A pointer to	the canonical or equivalent representation of the
	   character.

       n_canonical_chars
	   Number of characters	in the canonical_chars.

       format
           A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical
           mapping; other values indicate a compatibility equivalent mapping.

       A NULL canonical_chars (with a 0	n_canonical_chars) indicates that the
       character has no	canonical or compatibility equivalence.

       unicode_ccc() returns the character's canonical combining class value.

       unicode_decomposition_init(), unicode_decompose() and
       unicode_decomposition_deinit() implement	a complete interface for
       decomposing a Unicode string:

	   unicode_decomposition_t info;

	   unicode_decomposition_init(&info, before, (size_t)-1, NULL);
	   info.decompose_flags=UNICODE_DECOMPOSE_FLAG_QC;
	   unicode_decompose(&info);
	   unicode_decomposition_deinit(&info);

       unicode_decomposition_init() initializes a new unicode_decomposition_t
       structure that gets passed in as its first parameter. The second
       parameter is a pointer to a Unicode string, with the number of
       characters in the string in the third parameter. A string size of -1
       indicates a \0-terminated string; unicode_decomposition_init() then
       calculates its string_size (which does not include the trailing \0).
       The last parameter is a void *, an opaque pointer that gets stored in
       the initialized unicode_decomposition_t object:
       typedef struct unicode_decomposition {
	   char32_t   *string;
	   size_t     string_size;
	   int	      decompose_flags;
	   int	      (*reallocate)(
			   struct unicode_decomposition	  *info,
			   const size_t			  *offsets,
			   const size_t			  *sizes,
			   size_t			  n
		      );
	   void	      *arg;
       } unicode_decomposition_t;
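
       The size convention described above (a string size of -1 standing for
       a \0-terminated string) can be pictured with the following minimal
       sketch. sketch_resolve_size() is a hypothetical helper written for
       this illustration; it is not part of the library and only mirrors the
       documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper, not part of the library: resolve a caller-supplied
 * size the way the text describes, treating (size_t)-1 as "count up to,
 * but not including, the terminating \0". */
size_t sketch_resolve_size(const char32_t *string, size_t string_size)
{
    assert(string != NULL);

    if (string_size == (size_t)-1)
    {
        string_size = 0;
        while (string[string_size] != 0)
            ++string_size;
    }
    return string_size;
}
```

       An explicit size is passed through unchanged, so a caller that already
       knows the length does not need a terminated string.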

       unicode_decompose() proceeds and	decomposes the string and replaces it
       with its	decomposed string version.

       unicode_decomposition_t's string, string_size and arg are copies	of
       unicode_decomposition_init's parameters.	 unicode_decomposition_init
       initializes all other fields to their default values.

       decompose_flags gets initialized to 0. It is a bitmask of:

       UNICODE_DECOMPOSE_FLAG_QC
	   Check each character's appropriate "quick check" property and skip
	   decomposing Unicode characters that would get re-composed by
	   unicode_composition_apply().

       UNICODE_DECOMPOSE_FLAG_COMPAT
	   Perform a compatibility decomposition instead of a canonical
	   decomposition.

       reallocate is a pointer to a function that gets called to reallocate a
       larger string.  unicode_decompose() determines which characters in the
       string need decomposing and calls the reallocate	function pointer zero
       or more times. Each call	to reallocate passes information about where
       new characters will get inserted	into the string.

       reallocate only needs to grow the size of the buffer where string
       points so that it's big enough to hold a larger, decomposed string,
       and then update string accordingly. reallocate should not update
       string_size or make any changes to the existing string; that's
       unicode_decompose()'s job (after reallocate returns).

       The reallocate callback function	receives the following parameters.

       •   A pointer to	the unicode_decomposition_t and, notably, its arg.

       •   A pointer to	the array of offset indexes in the string where	new
	   characters will get inserted	in order to hold the decomposed
	   string.

       •   A pointer to the array that holds the number of characters that get
           inserted at each corresponding offset.

       •   The size of the two arrays.

       reallocate must update the string, if necessary, so that it can hold at
       least the initial string_size plus the sum total of all sizes.

       unicode_decomposition_init() initializes	the reallocate pointer to a
       default implementation that uses	realloc(3) and updates string with its
       return value. The application can use its own reallocate	to handle this
       task on its own,	and use	unicode_decompose_reallocate_size to compute
       the minimum string size:

	   size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
						    const size_t *sizes,
						    size_t n)
	   {
	       size_t i;
	       size_t new_size=info->string_size;

	       for (i=0; i<n; ++i)
		   new_size += sizes[i];

	       return new_size;
	   }

       The reallocate function returns 0 on success and a non-0 error code to
       report a failure, and unicode_decompose() does the same. The only error
       condition from unicode_decompose() is a non-0 error code from the
       reallocate function. Otherwise, a successful decomposition results in
       unicode_decompose() returning 0, with the unicode_decomposition_t's
       string pointing to the decomposed string and string_size giving the
       number of characters in the decomposed string.

	   Note

           string_size does not include the trailing \0 character. The input
           string also has its string_size specified without counting its \0
           character. The default implementation of reallocate allocates an
           extra char32_t and sets it to a \0. Therefore:

           •   If the Unicode string before decomposition has a trailing \0,
               no decomposition occurs, and no calls to reallocate take
               place: the string in the unicode_decomposition_t is unchanged
               and it's still \0-terminated.

           •   The default reallocate allocates an extra char32_t and sets it
               to a \0, so the decomposed string is \0-terminated as well.

           •   An application that provides its own replacement reallocate is
               responsible for doing the same, if it wants the decomposed
               string to be \0-terminated.
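
       A replacement reallocate that follows the rules above might look like
       the sketch below. It is an illustration under stated assumptions, not
       the library's own code: struct sketch_decomposition is a local
       stand-in that copies the documented field layout so that the example
       compiles on its own (real code would use unicode_decomposition_t from
       <courier-unicode.h>), and sketch_reallocate() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <uchar.h>

/* Local stand-in that mirrors the fields documented above; real code
 * would use unicode_decomposition_t from <courier-unicode.h> instead. */
struct sketch_decomposition {
    char32_t *string;
    size_t    string_size;
    int       decompose_flags;
    int     (*reallocate)(struct sketch_decomposition *info,
                          const size_t *offsets,
                          const size_t *sizes,
                          size_t n);
    void     *arg;
};

/* Hypothetical replacement callback: grow the buffer, leave the existing
 * characters untouched, and reserve one extra char32_t for a \0. */
int sketch_reallocate(struct sketch_decomposition *info,
                      const size_t *offsets,
                      const size_t *sizes,
                      size_t n)
{
    size_t i;
    size_t new_size;
    char32_t *p;

    assert(info != NULL);
    (void)offsets; /* only the total growth matters for sizing */

    /* Same arithmetic as unicode_decompose_reallocate_size(). */
    new_size = info->string_size;
    for (i = 0; i < n; ++i)
        new_size += sizes[i];

    p = realloc(info->string, (new_size + 1) * sizeof(char32_t));
    if (p == NULL)
        return -1; /* non-0 reports the failure; the old buffer stays valid */

    p[new_size] = 0;   /* terminator slot for the decomposed string */
    info->string = p;  /* string_size is deliberately left alone */
    return 0;
}
```

       The sketch assumes that string came from malloc(3), as the default
       implementation does. An application whose buffer lives elsewhere (for
       example in a pool reachable through arg) would replace the realloc(3)
       call with its own allocator.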

	   Note

           Multiple calls to the reallocate callback are possible. Each call
           to reallocate reflects the prior calls' decompositions. Example: the
           original string has five characters and the first call to
           reallocate had two offsets, at positions 1 and 3, with a value of 1
           for both of their sizes. This has the effect of transforming an
           original Unicode string "AAAAA" into "AXAAXAA" (with "A"
           representing unspecified characters in the original string, and
           "X" showing the two characters added in the first call to
           reallocate).

           A second call to reallocate with an offset at position 4, and a
           size of 1, results in the updated string of "AXAAYXAA" (with "Y"
           marking an unspecified character inserted by the second call).
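
           The "AAAAA" example above can be replayed with a small model. The
           sketch below is an illustration only: it works on a plain char
           buffer rather than char32_t, sketch_insert_marks() is a
           hypothetical name, and it simply opens the reported number of
           slots at each reported position, which is how the example
           pictures the effect. It does not claim to reproduce how
           unicode_decompose() moves characters internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustration only: open sizes[i] slots in front of offsets[i] in a
 * plain char buffer and fill them with `mark`. buf must have room for
 * the grown string plus its \0. Offsets are ascending and refer to the
 * string as it is when the call starts. Returns the new length. */
size_t sketch_insert_marks(char *buf, size_t len,
                           const size_t *offsets, const size_t *sizes,
                           size_t n, char mark)
{
    size_t i;
    size_t total = 0;
    size_t src = len;   /* end of the not-yet-moved part of the old text */
    size_t dst;         /* end of the not-yet-filled part of the new text */

    assert(buf != NULL);

    for (i = 0; i < n; ++i)
        total += sizes[i];

    dst = len + total;
    buf[dst] = '\0';

    /* Work backwards so nothing gets overwritten before it is moved. */
    for (i = n; i-- > 0; )
    {
        size_t tail = src - offsets[i];   /* old characters from offsets[i] on */

        memmove(buf + dst - tail, buf + offsets[i], tail);
        dst -= tail;
        memset(buf + dst - sizes[i], mark, sizes[i]);
        dst -= sizes[i];
        src = offsets[i];
    }
    return len + total;
}
```

           Starting from "AAAAA", a call with offsets {1, 3}, sizes {1, 1}
           and the mark 'X' yields "AXAAXAA". A following call on that
           result with offset {4}, size {1} and the mark 'Y' yields
           "AXAAYXAA", matching the two steps described above.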

	   Note

	   Unicode string decomposition	involves replacing a given Unicode
	   character with one or more other characters.	The sizes given	to
	   reallocate reflect the net addition to the Unicode string. For
	   example: decomposing	one Unicode character into three decomposed
	   characters results in a call	to reallocate reporting	an insert of
	   two more characters.

	   Note

           offsets actually report the indices of each Unicode character
           that's getting decomposed. A 1:1 decomposition of a Unicode
           character gets reported as an additional sizes entry of 0.
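
           The two notes above (sizes hold the net addition, and a 1:1
           decomposition still gets an entry) can be summarized in a short
           sketch. Everything here is hypothetical scaffolding for the
           illustration: decomposed_len is an invented input that says how
           many characters each character decomposes into, with 0 meaning
           that the character is not decomposed at all.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: decomposed_len[i] is a hypothetical per-character
 * input giving how many characters character i decomposes into, with 0
 * meaning "not decomposed". Fills offsets[] and sizes[] (each with room
 * for `len` entries) the way the notes describe and returns the number
 * of entries. */
size_t sketch_build_sizes(const size_t *decomposed_len, size_t len,
                          size_t *offsets, size_t *sizes)
{
    size_t i;
    size_t n = 0;

    assert(decomposed_len != NULL && offsets != NULL && sizes != NULL);

    for (i = 0; i < len; ++i)
    {
        if (decomposed_len[i] == 0)
            continue;                      /* character stays as it is */

        offsets[n] = i;                    /* index of the decomposed character */
        sizes[n] = decomposed_len[i] - 1;  /* net growth; 0 for a 1:1 mapping */
        ++n;
    }
    return n;
}
```

           With decomposed_len set to {0, 3, 0, 1, 0}, the sketch reports two
           entries: offset 1 with a size of 2 (one character becoming
           three), and offset 3 with a size of 0 (a 1:1 replacement).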

       unicode_decomposition_deinit() releases all resources and destroys the
       unicode_decomposition_t;	it is no longer	valid.

	   Note

	   unicode_decomposition_deinit() does not free(3) the string. The
	   original string gets	passed in to unicode_decomposition_init() and
	   the decomposed string is left in the	string.

       The default implementation of the reallocate function assumes the
       string is a malloc(3)-ed	string,	and reallocs it.

	   Note

	   At this time	unicode_decomposition_deinit() does nothing. All code
	   should explicitly call it in	order to remain	forward-compatible (at
	   the source level).

       unicode_compose() performs a canonical composition of a decomposed
       string. Its parameters are:

       •   A pointer to	the decomposed Unicode string.

       •   The number of characters in the Unicode string. The Unicode string
	   does	not need to be \0-terminated; if it is this number does	not
	   include it.

       •   A flags bitmask, which can have the following values:

	   UNICODE_COMPOSE_FLAG_REMOVEUNUSED
	       Remove all combining marks after	doing all canonical
	       compositions. Normally any unused combining marks are left in
	       place, in the combined text. This option	removes	them.

	   UNICODE_COMPOSE_FLAG_ONESHOT
	       Perform canonical composition once per character, and do	not
	       attempt to combine any resulting	combined characters again.

       •   A non-NULL pointer to a size_t.

           A successful composition sets this size_t to the number of
           characters in the combined string, and returns 0. The string gets
           combined in place: the combined string replaces the contents of
           the string parameter, and this size_t gives its size.

	   unicode_compose() returns a non-zero	value to indicate an error.

       unicode_composition_init(), unicode_composition_apply() and
       unicode_composition_deinit() implement a	detailed interface for
       canonical composition of	a decomposed Unicode string:

	   unicode_composition_t compositions;

	   if (unicode_composition_init(str, strsize, flags, &compositions) == 0)
	   {
	       size_t new_size=unicode_composition_apply(str, strsize, &compositions);

	       unicode_composition_deinit(&compositions);
	   }

       The first two parameters	to both	unicode_composition_init() and
       unicode_composition_apply() are the same: the Unicode string and	the
       number of characters (not including any trailing	\0 character) in the
       Unicode string.

       unicode_composition_init()'s additional parameters are: any optional
       flags (see unicode_compose() for a list of available flags), and the
       address of a unicode_composition_t object. A non-0 return from
       unicode_composition_init() indicates an error.
       unicode_composition_init() indicates success by returning 0 and
       initializing the unicode_composition_t object, which contains a
       pointer to an array of pointers to unicode_compose_info objects, and
       the number of pointers. unicode_composition_init() does not change the
       string; the only thing it does is initialize the unicode_composition_t
       object.

       unicode_composition_apply() applies the compositions to the string, in
       place, and returns the new size of the string. The new size also does
       not include the \0 character; however, unicode_composition_apply()
       does append one if the composed string is smaller, so the composed
       string is \0-terminated if the decomposed string was.

       It is necessary to call unicode_composition_deinit() to free all	memory
       that was	allocated for the unicode_composition_t	object:
       struct unicode_compose_info {
	   size_t			 index;
	   size_t			 n_composed;
	   char32_t			 *composition;
	   size_t			 n_composition;
       };

       typedef struct {
	   struct unicode_compose_info	 **compositions;
	   size_t			 n_compositions;
       } unicode_composition_t;

       index gives the character index in the string where each	composition
       occurs.	n_composed gives the number of characters in the original
       string that get composed. The composed characters are the composition;
       and n_composition gives the number of composed characters.

       Effectively: at the index position in the original string, #n_composed
       characters get removed and there	are #n_composition characters that
       replace them (always n_composed or less).

	   Note

	   The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of
	   including the combining marks that did not get combined in the
	   n_composed count. It's possible that, in this case, n_composition
	   is 0. This indicates	complete removal of the	combining marks,
	   without anything getting combined in	their place.

       unicode_composition_init() sets unicode_composition_t's compositions
       pointer to an array of pointers to unicode_compose_infos that are
       sorted according to their index. n_compositions gives the number of
       pointers in the array, and is 0 if there are no compositions (the
       array is empty). An empty array gets interpreted accordingly when it
       gets passed to unicode_composition_apply() and
       unicode_composition_deinit(): nothing happens.
       unicode_composition_apply() simply returns the size of the unchanged
       string, and unicode_composition_deinit() does a pro-forma cleanup.
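
       The meaning of index, n_composed, composition and n_composition can
       be made concrete with a sketch of how a sorted array of such records
       could be applied in place. This is an illustration under stated
       assumptions, not the code behind unicode_composition_apply():
       struct sketch_compose_info is a local copy of the documented layout,
       sketch_apply() is a hypothetical name, and the records are assumed
       to be sorted by index and not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Local stand-in mirroring the documented struct unicode_compose_info. */
struct sketch_compose_info {
    size_t    index;
    size_t    n_composed;
    char32_t *composition;
    size_t    n_composition;
};

/* Illustration only: apply compositions (sorted by index, non-overlapping)
 * in place and return the new size. Relies on n_composition <= n_composed,
 * so the write position never overtakes the read position. */
size_t sketch_apply(char32_t *string, size_t string_size,
                    struct sketch_compose_info **compositions,
                    size_t n_compositions)
{
    size_t rd = 0;  /* next character to read from the original text */
    size_t wr = 0;  /* next slot to write in the composed text */
    size_t i, j;

    assert(string != NULL || string_size == 0);

    for (i = 0; i < n_compositions; ++i)
    {
        const struct sketch_compose_info *c = compositions[i];

        while (rd < c->index)              /* copy the untouched run */
            string[wr++] = string[rd++];

        for (j = 0; j < c->n_composition; ++j)
            string[wr++] = c->composition[j];

        rd += c->n_composed;               /* skip what got composed */
    }

    while (rd < string_size)               /* copy the remaining tail */
        string[wr++] = string[rd++];

    if (wr < string_size)
        string[wr] = 0;                    /* shrunk: re-terminate */

    return wr;
}
```

       For the three characters 'a', U+0301, 'b' and a single record with
       index 0, n_composed 2 and a one-character composition of U+00E1, the
       sketch returns 2 and leaves U+00E1, 'b' in the buffer. A record with
       an n_composition of 0 only removes characters, and an empty array
       returns string_size unchanged.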

SEE ALSO
       TR-15[2], courier-unicode(7), unicode::canonical(3).

AUTHOR
       Sam Varshavchik
	   Author

NOTES
	1. canonical and compatibility mapping
	   https://www.unicode.org/reports/tr15/tr15-54.html

	2. TR-15
	   https://www.unicode.org/reports/tr15/tr15-54.html

Courier	Unicode	Library		  05/18/2024		  UNICODE_CANONICAL(3)