UNICODE_BIDI(3)		    Courier Unicode Library	       UNICODE_BIDI(3)

NAME
       unicode_bidi, unicode_bidi_calc_levels, unicode_bidi_calc_types,
       unicode_bidi_calc, unicode_bidi_reorder,	unicode_bidi_cleanup,
       unicode_bidi_cleaned_size, unicode_bidi_logical_order,
       unicode_bidi_combinings,	unicode_bidi_needs_embed, unicode_bidi_embed,
       unicode_bidi_embed_paragraph_level, unicode_bidi_direction,
       unicode_bidi_type, unicode_bidi_setbnl, unicode_bidi_mirror,
       unicode_bidi_bracket_type - unicode bi-directional algorithm

SYNOPSIS
       #include	<courier-unicode.h>

       unicode_bidi_level_t lr=UNICODE_BIDI_LR;

       void unicode_bidi_calc_types(const char32_t *p, size_t n,
				    unicode_bidi_type_t	*types);

       struct unicode_bidi_direction unicode_bidi_calc_levels(const char32_t *p,
                                   const unicode_bidi_type_t *types,
                                   size_t n,
                                   unicode_bidi_level_t *levels,
                                   const unicode_bidi_level_t *initial_embedding_level);

       struct unicode_bidi_direction unicode_bidi_calc(const char32_t *p,
						       size_t n,
						       unicode_bidi_level_t *levels,
						       const unicode_bidi_level_t *initial_embedding_level);

       void unicode_bidi_reorder(char32_t *string,
				 unicode_bidi_level_t *levels, size_t n,
				 void (*reorder_callback)(size_t, size_t, void *),
				 void *arg);

       size_t unicode_bidi_cleanup(char32_t *string,
				   unicode_bidi_level_t	*levels, size_t	n,
				   int options,
				   void	(*removed_callback)(size_t, size_t, void *),
				   void	*arg);

       size_t unicode_bidi_cleaned_size(const char32_t *string,	size_t n,
					int options);

       void unicode_bidi_logical_order(char32_t	*string,
				       unicode_bidi_level_t *levels, size_t n,
				       unicode_bidi_level_t paragraph_embedding,
				       void (*reorder_callback)(size_t index, size_t n,	void *arg),
				       void *arg);

       void unicode_bidi_combinings(const char32_t *string,
				    const unicode_bidi_level_t *levels,
				    size_t n,
				    void (*combinings)(unicode_bidi_level_t level, size_t level_start, size_t n_chars, size_t comb_start, size_t n_comb_chars, void *arg),
				    void *arg);

       int unicode_bidi_needs_embed(const char32_t *string,
				    const unicode_bidi_level_t *levels,
				    size_t n,
				    const unicode_bidi_level_t *paragraph_embedding);

       size_t unicode_bidi_embed(const char32_t	*string,
				 const unicode_bidi_level_t *levels, size_t n,
				 unicode_bidi_level_t paragraph_embedding,
				 void (*emit)(const char32_t *string, size_t n,	int is_part_of_string, void *arg),
				 void *arg);

       char32_t	unicode_bidi_embed_paragraph_level(const char32_t *string,
						   size_t n,
						   unicode_bidi_level_t	paragraph_embedding);

       char32_t unicode_bidi_mirror(char32_t c);

       char32_t unicode_bidi_bracket_type(char32_t c, unicode_bracket_type_t *ret);

       struct unicode_bidi_direction unicode_bidi_get_direction(char32_t *c,
								size_t n);

       enum_bidi_type_t	unicode_bidi_type(char32_t c);

       void unicode_bidi_setbnl(char32_t *p, const unicode_bidi_type_t *types,
				size_t n);

DESCRIPTION
       These functions are related to the Unicode Bi-Directional algorithm[1].
       They implement the algorithm up to and including	step L2, and provide
       additional functionality	of returning miscellaneous
       bi-directional-related metadata of Unicode characters. There's also a
       basic algorithm that "reverses" the bi-directional algorithm and
       produces	a Unicode string with bi-directional markers that results in
       the same	bi-directional string after reapplying the algorithm.

   Calculating bi-directional rendering	order
       The following process computes the rendering order of characters
       according to the	Unicode	Bi-Directional algorithm:

	1. Allocate an array of	unicode_bidi_type_t that's the same size as
	   the Unicode string.

	2. Allocate an array of	unicode_bidi_level_t that's the	same size as
	   the Unicode string.

	3. Use unicode_bidi_calc_types() to compute the	Unicode	string's
	   characters' bi-directional types, and populate the
	   unicode_bidi_type_t buffer.

	4. Use unicode_bidi_calc_levels() to compute the Unicode string's
	   characters' bi-directional embedding	level (executes	the
	   Bi-Directional algorithm up to and including	step L1). This
	   populates the unicode_bidi_level_t buffer.

	5. Alternatively: allocate only	the unicode_bidi_level_t array and use
	   unicode_bidi_calc(),	which malloc()s	the unicode_bidi_type_t
	   buffer, calls unicode_bidi_calc_levels(), and then free()s the
	   buffer.

	6. Use unicode_bidi_reorder() to reverse any characters	in the string,
	   according to	the algorithm (step L2), with an optional callback
	   that	reports	which ranges of	characters get reversed.

	7. Use unicode_bidi_cleanup() to remove	the characters from the	string
	   which are used by the bi-directional	algorithm, and are not needed
	   for rendering the text.  unicode_bidi_cleaned_size()	is available
	   to determine, in advance, how many characters will remain.

       The parameters to unicode_bidi_calc_types() are:

          A pointer to	the Unicode string.

          Number of characters	in the Unicode string.

          A pointer to	an array of unicode_bidi_type_t	values.	The caller is
	   responsible for allocating and deallocating this array, which has
	   the same size as the	Unicode	string.

       The parameters to unicode_bidi_calc_levels() are:

          A pointer to	the Unicode string.

          A pointer to	the buffer that	was passed to
	   unicode_bidi_calc_types().

          Number of characters	in the Unicode string and the
	   unicode_bidi_type_t buffer.

          A pointer to	an array of unicode_bidi_level_t values. The caller is
	   responsible for allocating and deallocating this array, which has
	   the same size as the	Unicode	string.

          An optional pointer to a UNICODE_BIDI_LR or UNICODE_BIDI_RL value.
	   This	sets the default paragraph direction level. A null pointer
	   computes the	default	paragraph direction level based	on the string,
	   as specified	by the "P" rules of the	bi-directional algorithm.

       The parameters to unicode_bidi_calc() are the same except for the
       unicode_bidi_type_t pointer.  unicode_bidi_calc() allocates this buffer
       by itself, calls unicode_bidi_calc_types() and
       unicode_bidi_calc_levels(), and destroys the buffer before returning.

       unicode_bidi_calc() and unicode_bidi_calc_levels() fill in the
       unicode_bidi_level_t array with the embedding level of each
       corresponding character, according to the Unicode Bi-Directional
       algorithm (even values for left-to-right ordering, and odd values for
       right-to-left ordering). A value of UNICODE_BIDI_SKIP designates
       directional markers (from step X9).
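
       As an illustration of how the level values can be read, the following
       is a minimal sketch rather than library code. The level_t typedef and
       the skip parameter are stand-ins assumed for this example; real code
       would include <courier-unicode.h> and use unicode_bidi_level_t and
       UNICODE_BIDI_SKIP instead.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the library's level type; an assumption for this sketch. */
typedef unsigned char level_t;

/* Odd embedding levels denote right-to-left ordering. */
int level_is_rtl(level_t level)
{
    return (level & 1) != 0;
}

/*
** Count the right-to-left characters in a levels array. Entries equal
** to `skip` (UNICODE_BIDI_SKIP in real code) mark directional markers
** and are ignored.
*/
size_t count_rtl(const level_t *levels, size_t n, level_t skip)
{
    size_t count = 0;

    assert(levels != NULL || n == 0);

    for (size_t i = 0; i < n; ++i)
        if (levels[i] != skip && level_is_rtl(levels[i]))
            ++count;

    return count;
}
```

       The skip value is compared before the parity test, so a marker entry
       is never mistaken for a right-to-left character.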

       unicode_bidi_calc() and unicode_bidi_calc_levels() return the resolved
       paragraph direction level, which	always matches the passed in level, if
       specified, else it reports the derived one. These functions return a
       unicode_bidi_direction structure:
       struct unicode_bidi_direction {
	   unicode_bidi_level_t	  direction;
	   int			  is_explicit;
       };

       direction gives the paragraph embedding level, UNICODE_BIDI_LR or
       UNICODE_BIDI_RL.  is_explicit is non-zero when the direction was
       determined explicitly: either the optional pointer to a
       UNICODE_BIDI_LR or UNICODE_BIDI_RL value was specified (and is
       returned in direction), or the direction comes from a character with
       an explicit direction indication.

       unicode_bidi_reorder() takes the actual Unicode string together with
       the embedding values from unicode_bidi_calc() or
       unicode_bidi_calc_levels(), then reverses the bi-directional string, as
       specified by step L2 of the bi-directional algorithm. The parameters to
       unicode_bidi_reorder() are:

          A pointer to	the Unicode string.

          A pointer to	an array of unicode_bidi_level_t values.

          Number of characters	in the Unicode string and the
	   unicode_bidi_level_t	array.

          An optional reorder_callback	function pointer.

       A non-NULL reorder_callback gets	invoked	to report each reversed
       character range.	The callback's first parameter is the index of the
       first reversed character, the second parameter is the number of
       reversed	characters, starting at	the given index	of the Unicode string.
       The third parameter is the arg passthrough parameter.

       unicode_bidi_reorder modifies its string	and levels.  reorder_callback
       gets invoked after reversing each consecutive range of values in	the
       string and levels buffers. For example: "reorder_callback(5, 7, arg)"
       reports that character indexes #5 through #11 got reversed.
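
       A callback matching the reorder_callback signature might look like the
       following sketch. The reversed_ranges structure, its fixed capacity of
       16 entries, and the function name are hypothetical, introduced only
       for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the ranges reported through reorder_callback. */
struct reversed_ranges {
    size_t first[16];  /* index of the first reversed character */
    size_t last[16];   /* index of the last reversed character, inclusive */
    size_t count;      /* number of ranges recorded so far */
};

/* Same shape as reorder_callback: starting index, length, passthrough arg. */
void record_reversed_range(size_t index, size_t n, void *arg)
{
    struct reversed_ranges *r = (struct reversed_ranges *)arg;

    assert(r != NULL);

    if (n == 0 || r->count >= 16)
        return; /* nothing to record, or the table is full */

    r->first[r->count] = index;
    r->last[r->count] = index + n - 1;
    ++r->count;
}
```

       The function and a pointer to a zero-initialized reversed_ranges
       object would be passed as the last two arguments to
       unicode_bidi_reorder(). A call with an index of 5 and a length of 7
       records the range 5 through 11.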

       A NULL string pointer leaves the	levels buffer unchanged, but still
       invokes the reorder_callback as if the character	string,	and their
       embedding values, were reversed.

       The resulting string and	embedding levels are in	"rendering order", but
       still contain bi-directional embedding, override, boundary-neutral,
       isolate,	and marker characters.	unicode_bidi_cleanup removes these
       characters and directional markers.

       The parameters to unicode_bidi_cleanup()	are:

          The pointer to the unicode string.

          A pointer to the directional embedding level buffer, of the same
           size as the string, or NULL. With a non-NULL pointer the
           corresponding values also get removed from the buffer, and the
           remaining values in the embedding level buffer get reset to
           UNICODE_BIDI_LR or UNICODE_BIDI_RL only.

          The size of the unicode string and the directional embedding	buffer
	   (if not NULL).

          A bitmask that selects the following options (or 0 if no
           options):

	   UNICODE_BIDI_CLEANUP_EXTRA
               In addition to removing all embedding, override, and
               boundary-neutral characters as specified by step X9 of the
               bi-directional algorithm (the default behavior without this
               flag), also remove all isolation markers and implicit markers.

	   UNICODE_BIDI_CLEANUP_BNL
	       Replace all characters classified as paragraph separators with
	       a newline character.

	   UNICODE_BIDI_CLEANUP_CANONICAL
               A combined set of UNICODE_BIDI_CLEANUP_EXTRA and
               UNICODE_BIDI_CLEANUP_BNL.

          A pointer to	a function that	gets repeatedly	invoked	with the index
	   of the character that gets removed from the Unicode string.

          An opaque pointer that gets forwarded to the	callback.

       The function pointer (if	not NULL) gets invoked to report the index of
       each removed character. The reported index is the index from the
       original	string,	and the	callback gets invoked in strict	order, from
       the first to the	last removed character (if any).

       The character string and the embedding level values resulting from
       unicode_bidi_cleanup() with the UNICODE_BIDI_CLEANUP_CANONICAL option
       are in "canonical rendering order".  unicode_bidi_logical_order(),
       unicode_bidi_needs_embed() and unicode_bidi_embed() require the
       canonical rendering order for their string and embedding level values.

       The parameters to unicode_bidi_cleaned_size() are a pointer to the
       unicode string, its size, and the bitmask option	to
       unicode_bidi_cleanup().

   Embedding bi-directional markers in Unicode text strings
       unicode_bidi_logical_order() rearranges the string from rendering to
       its logical order.  unicode_bidi_embed()	adds various bi-directional
       markers to a Unicode string in canonical	rendering order. The resulting
       string is not guaranteed	to be identical	to the original	Unicode
       bi-directional string. The algorithm is fairly basic, but the resulting
       bi-directional string produces the same canonical rendering order after
       applying unicode_bidi_calc() or unicode_bidi_calc_levels(),
       unicode_bidi_reorder() and unicode_bidi_cleanup() (with the canonical
       option), with the same paragraph_embedding level.
       unicode_bidi_needs_embed() attempts to heuristically determine whether
       unicode_bidi_embed() is required.

       unicode_bidi_logical_order() gets called	first, followed	by
       unicode_bidi_embed() (or	unicode_bidi_needs_embed() in order to
       determine whether bi-directional	markers	are required). Finally,
       unicode_bidi_embed_paragraph_level() optionally determines whether the
       resulting string's default paragraph embedding level matches the	one
       used for	the actual embedding direction,	and if not returns a
       directional marker to be	prepended to the Unicode character string, as
       a hint.

       unicode_bidi_logical_order() factors in the characters' embedding
       values, and the provided	paragraph embedding value (UNICODE_BIDI_LR or
       UNICODE_BIDI_RL), and rearranges	the characters and the embedding
       levels in left-to-right order, while simultaneously invoking the
       supplied	reorder_callback indicating each range of characters whose
       relative	order gets reversed. The reorder_callback() receives, as
       parameters:

          The starting	index of the first reversed character, in the string.

          Number of reversed characters.

          Forwarded arg pointer value.

       This specifies a	consecutive range of characters	(and directional
       embedding values) that get reversed (first character in the range
       becomes the last	character, and the last	character becomes the first
       character).

       After unicode_bidi_logical_order(), unicode_bidi_embed()	progressively
       invokes the passed-in callback with the contents	of a bi-directional
       unicode string. The parameters to unicode_bidi_embed() are:

          The Unicode string.

          The directional embedding buffer, in	canonical rendering order.

          The size of the string and the embedding level buffer.

          The paragraph embedding level, either UNICODE_BIDI_LR or
	   UNICODE_BIDI_RL.

          The pointer to the callback function.

          An opaque pointer argument that gets	forwarded to the callback
	   function.

       The callback receives pointers to various parts of the original string
       that gets passed	to unicode_bidi_embed(), intermixed with
       bi-directional markers, overrides, and isolates.	The callback's
       parameters are:

          The pointer to a Unicode string.

               Note
               The callback is not guaranteed to receive progressively
               increasing pointers into the original string that gets passed
               to unicode_bidi_embed(). Some calls will be for individual
               bi-directional markers, and unicode_bidi_embed() also performs
               some additional internal reordering, on the fly, after
               unicode_bidi_logical_order()'s big hammer.

          Number of characters	in the Unicode string.

          Indication whether the Unicode string pointer is pointing to	a part
	   of the original Unicode string that's getting embedded. Otherwise
	   this	must be	some marker character that's not present in the
	   original Unicode string.

          Forwarded arg pointer value.
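
       One way to assemble the output is to copy every reported chunk into a
       growable buffer, as in the sketch below. The embed_output structure
       and the collect_embedded name are hypothetical; only the callback's
       parameter list comes from the synopsis above. Copying, rather than
       keeping the pointers, avoids relying on the order or lifetime of the
       chunks passed to the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical accumulator for the chunks reported by the emit callback. */
struct embed_output {
    char32_t *data;     /* assembled string, owned by this structure */
    size_t size;        /* characters stored so far */
    size_t capacity;    /* allocated capacity, in characters */
    size_t n_markers;   /* characters that were not part of the original */
    int failed;         /* set when an allocation fails */
};

/* Same shape as the emit callback taken by unicode_bidi_embed(). */
void collect_embedded(const char32_t *string, size_t n,
                      int is_part_of_string, void *arg)
{
    struct embed_output *out = (struct embed_output *)arg;

    assert(out != NULL);

    if (out->failed || n == 0)
        return;

    if (out->size + n > out->capacity) {
        size_t new_capacity = out->capacity ? out->capacity * 2 : 16;

        while (new_capacity < out->size + n)
            new_capacity *= 2;

        char32_t *p = realloc(out->data, new_capacity * sizeof(char32_t));

        if (!p) {
            out->failed = 1;
            return;
        }
        out->data = p;
        out->capacity = new_capacity;
    }

    /* Append the chunk, whether it is original text or a marker. */
    memcpy(out->data + out->size, string, n * sizeof(char32_t));
    out->size += n;

    if (!is_part_of_string)
        out->n_markers += n;
}
```

       The caller zero-initializes an embed_output object, passes its address
       as the arg parameter, checks failed afterwards, and releases data with
       free() when done.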

       The assembled Unicode string should produce the same canonical
       rendering order, for the same paragraph embedding level.
       unicode_bidi_embed_paragraph_level() checks if the specified Unicode
       string computes the given default paragraph embedding level and returns
       0 if it matches. Otherwise it returns a directional marker that should
       be prepended to the Unicode string, so that unicode_bidi_calc() (or
       unicode_bidi_calc_levels()) derives the same default embedding level
       even when its optional paragraph embedding level pointer is NULL.
       The parameters to unicode_bidi_embed_paragraph_level() are:

          The Unicode string.

          The size of the string.

          The paragraph embedding level, either UNICODE_BIDI_LR or
	   UNICODE_BIDI_RL.

       unicode_bidi_needs_embed() attempts to heuristically determine whether
       the Unicode string, in logical order, requires bi-directional markers.
       The parameters to unicode_bidi_needs_embed() are:

          The Unicode string.

          The directional embedding buffer, in	logical	order.

          The size of the string and the embedding level buffer.

          A pointer to an explicit paragraph embedding level, either
           UNICODE_BIDI_LR or UNICODE_BIDI_RL; or a NULL pointer (see
           unicode_bidi_calc_levels()'s explanation for this parameter).

       unicode_bidi_needs_embed() returns 0 if the Unicode string does not
       need explicit directional markers, or 1 if it does. This	is done	by
       using unicode_bidi_calc(), unicode_bidi_reorder(),
       unicode_bidi_logical_order and then checking if the end result is
       different from what was passed in.

   Combining character ranges
       unicode_bidi_combinings() reports consecutive sequences of one or more
       combining marks in bidirectional	text (which can	be either in rendering
       or logical order) that have the same embedding level. It	takes the
       following parameters:

          The Unicode string.

          The directional embedding buffer, in	logical	or rendering order. A
	   NULL	value for this pointer is equivalent to	a directional
	   embedding buffer with a level of 0 for every	character in the
	   Unicode string.

          Number of characters	in the Unicode string.

          The pointer to the callback function.

          An opaque pointer argument that gets	forwarded to the callback
	   function.

       The callback function gets invoked for every consecutive	sequence of
       one or more characters that have	a canonical combining class other than
       0, and with the same embedding level. The parameters to the callback
       function	are:

          The embedding level of the combining	characters.

          The starting	index of a consecutive sequence	of all characters with
	   the same embedding level.

          The number of characters with the same embedding level.

          The starting	index of a consecutive sequence	of all characters with
	   the same embedding level and	a canonical combining class other than
	   0. This will	always be equal	to or greater than the value of	the
	   second parameter.

          The number of consecutive characters with the same embedding level
           and a canonical combining class other than 0. The index of the last
           character in this sequence is always less than or equal to the
           index of the last character in the sequence defined by the second
           and the third parameters.

          The opaque pointer argument that was	passed to
	   unicode_bidi_combinings.

       A consecutive sequence of Unicode characters with non-0 combining
       classes but different embedding levels gets reported individually, for
       each consecutive	sequence with the same embedding level.

       This function helps with	reordering the combining characters in
       right-to-left-rendered text. Right-to-left text reversed	by
       unicode_bidi_reorder() results in combining characters preceding	their
       starter character. They get reversed no differently than	any other
       character. The same thing also occurs after
       unicode_bidi_logical_order() reverses everything	back. Use
       unicode_bidi_combinings to identify consecutive sequences of combining
       characters followed by their original starter.

       The callback may reorder the characters identified by its fourth and
       fifth parameters in the manner described below.
       unicode_bidi_combinings()'s parameter is a pointer to a constant
       Unicode string, but the callback can modify the string (via an
       out-of-band mutable pointer) subject to the following conditions:

          The characters identified by the fourth and the fifth parameters
           may be modified.

          If the last character in this sequence is not the last character
           included in the range specified by the second and the third
           parameters, then one more character after the last character may
           also be modified.

	   This	is, presumably,	the original starter that preceded the
	   combining characters	before the entire sequence was reversed.

       Here's an example of a callback that reverses combining characters and
       their immediately-following starter character:

	   void	reorder_right_to_left_combining(unicode_bidi_level_t level,
						size_t level_start,
						size_t n_chars,
						size_t comb_start,
						size_t n_comb_chars,
						void *arg)
	   {
	       /* Let's	say that this is the Unicode string */
	       char32_t	*buf=(char32_t *)arg;

	       if ((level & 1) == 0)
		   return; /* Left-to-right text not reversed */

	       char32_t	*b=buf+comb_start;
	       char32_t	*e=b+n_comb_chars;

	       /*
	       ** Include the starter characters in the	reversed range.
	       ** The semantics	of the combining characters with different
	       ** embedding levels -- so they get reported here	separately -- is
	       ** not specified. This will reverse just	the combining marks, and
	       ** they're on their own.
	       */

	       if (comb_start +	n_comb_chars < level_start + n_chars)
		   ++e;

	       while (b	< e)
	       {
		   char32_t t;

		   --e;
		   t=*b;
		   *b=*e;
		   *e=t;
		   ++b;
	       }
	   }
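
       To see the effect on concrete data, the sketch below restates the
       callback in compact form so that it compiles without the library. The
       level_t typedef is a stand-in for unicode_bidi_level_t, and the
       function name is hypothetical; the arguments in the usage note are
       hand-written to resemble what unicode_bidi_combinings() would report.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Stand-in for the library's level type; an assumption for this sketch. */
typedef unsigned char level_t;

/*
** Reverse a run of combining marks together with the starter that
** follows it, if that starter lies inside the same level run.
*/
void fix_rtl_combining(level_t level, size_t level_start, size_t n_chars,
                       size_t comb_start, size_t n_comb_chars, void *arg)
{
    char32_t *buf = (char32_t *)arg;

    assert(buf != NULL);

    if ((level & 1) == 0)
        return; /* left-to-right text was never reversed */

    size_t lo = comb_start;
    size_t hi = comb_start + n_comb_chars;

    /* One more character is available: presumably the original starter. */
    if (hi < level_start + n_chars)
        ++hi;

    while (lo + 1 < hi) {
        char32_t t;

        --hi;
        t = buf[lo];
        buf[lo] = buf[hi];
        buf[hi] = t;
        ++lo;
    }
}
```

       Given the three-character buffer U+05B0, U+05BC, U+05D1 (two Hebrew
       points ahead of the letter BET, as left by a reversal), calling the
       function with a level of 1, a level run of 0 and 3, and a combining
       run of 0 and 2 rearranges the buffer to U+05D1, U+05BC, U+05B0, with
       the starter first again.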

   Miscellaneous utility functions
       unicode_bidi_get_direction() takes a pointer to a Unicode string and
       the number of characters in the string, and determines the default
       paragraph embedding level.  unicode_bidi_get_direction() returns a
       struct with the following fields:

       direction
	   This	value is either	UNICODE_BIDI_LR	or UNICODE_BIDI_RL (left to
	   right or right to left).

       is_explicit
           This value is a flag. A non-0 value indicates that the embedding
           level was derived from an explicit character type (L, R or AL) from
           the string. A 0 value indicates the default paragraph direction,
           because no explicit character was found in the string.

       unicode_bidi_type looks up each character's bi-directional character
       type.

       unicode_bidi_setbnl() takes a pointer to a Unicode string, a pointer to
       an array of enum_bidi_type_t values and the number of characters in the
       string and the array.  unicode_bidi_setbnl() replaces all paragraph
       separators in the Unicode string with a newline character (same as the
       UNICODE_BIDI_CLEANUP_BNL option to unicode_bidi_cleanup()).

       unicode_bidi_mirror returns the glyph that's a mirror image of the
       parameter (i.e. an open parenthesis for a close parenthesis, and	vice
       versa); or the same value if there is no	mirror image (this is the
       Bidi_Mirrored=Yes property).

       unicode_bidi_bracket_type() looks up a bracket character and returns
       its opposite, or the same value if the character is not a bracket that
       has an opposing bracket character (this is the Bidi_Paired_Bracket_Type
       property). A non-NULL ret gets initialized to either UNICODE_BIDI_o,
       UNICODE_BIDI_c or UNICODE_BIDI_n.

SEE ALSO
       TR-9[1], unicode::bidi(3), courier-unicode(7)

AUTHOR
       Sam Varshavchik
	   Author

NOTES
	1. Unicode Bi-Directional algorithm
	   https://www.unicode.org/reports/tr9/tr9-48.html

Courier	Unicode	Library		  05/18/2024		       UNICODE_BIDI(3)
