2016-06-04

Hello again. I'm back to talk about recreating some Pip-Boy-like terminal widgets in ncurses.

Last time, I started talking about the remedial problems of how to *render* a string for a character cell console, i.e., how to convert a logical string to a list of visual strings that can be displayed on a terminal.  So far, I said that you need to

Put the string in Unicode NFC normalization, because NFC normalized strings are more likely to be prepared glyphs on a terminal and not overstruck glyphs.  Unicode normalization is discussed in http://www.unicode.org/reports/tr15/

Convert tabs to spaces

Replace almost all of the control characters or Private Use Area characters with replacement glyphs.  This hygiene is necessary since some unhandled control character could garble the terminal display.  Some control characters that don't get removed are the 5 line separators (CR, LF, NEL, PS, and LS) and the 12 bidirectional formatting characters (LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI, LRM, RLM, and ALM) because we'll use them in a second.
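The three steps so far can be sketched in a few lines.  This is in Python rather than Guile, only because its standard library exposes the Unicode database directly.  The name sanitize, the tab size of 8, and the choice of U+FFFD as the replacement glyph are my own assumptions for illustration.  Note that expandtabs counts code points, not screen cells, so it is only approximately right for wide characters.

```python
import unicodedata

# The five line separators: CR, LF, NEL, LS, PS.
LINE_SEPARATORS = {"\r", "\n", "\x85", "\u2028", "\u2029"}

# The twelve bidirectional formatting characters.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
}

REPLACEMENT = "\ufffd"  # assumption: use U+FFFD as the replacement glyph


def sanitize(text, tabsize=8):
    """Normalize to NFC, expand tabs, and replace risky characters."""
    text = unicodedata.normalize("NFC", text)
    text = text.expandtabs(tabsize)
    out = []
    for ch in text:
        if ch in LINE_SEPARATORS or ch in BIDI_CONTROLS:
            out.append(ch)  # kept for the later steps
        elif unicodedata.category(ch) in ("Cc", "Co"):
            out.append(REPLACEMENT)  # control or Private Use Area
        else:
            out.append(ch)
    return "".join(out)
```

For example, sanitize("x\x07y") swaps the BEL character for U+FFFD, while a CR+LF pair passes through untouched.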

So that just leaves

Prepare to handle any right-to-left text in the string.  This is described in the Unicode Bidirectional Algorithm in http://www.unicode.org/reports/tr9/

Do any Arabic shaping of the glyphs

Wrap the string to a given number of character cells.  The Unicode line-breaking algorithm is described here: http://www.unicode.org/reports/tr14/tr14-35.html

Pad the list of wrapped strings so that they are left, center, or right aligned

Bidirectional text

First off, I'm not qualified to talk about this, but here are the highlights.

Most European languages are written left to right.  The big right-to-left alphabets are Arabic and Hebrew.  Here in Los Angeles, I'd guess the most common right-to-left languages are probably Persian, Hebrew, Arabic, and Yiddish.

But Scheme strings are normally encoded in logical order: i.e. the beginning of a string is the part of the string that would be read first by a human.  If a string contained a line of French, the first character would be the left-most character to be displayed on a screen.  If it contained Arabic, the first character would be the right-most character to be displayed on a screen.

Terminal emulators generally take one of two strategies when given strings to display. They either

Display the text from left-to-right regardless of the contents of the text

Or, try to be context sensitive when they display the text, switching from left-to-right and right-to-left depending on the apparent language of the text

Weirdly, in an ncurses application, neither strategy is particularly helpful.  If a terminal does the former, it becomes the programmer's responsibility to convert the string from logical order to visual order.  If it does the latter, and you ask ncurses to write a string at a given (y,x) position, it is hard to know if that x is columns counted from the left or columns counted from the right.

Consider the following program, and its output on the Cygwin terminal.  It starts ncurses, prints a line of Latin text starting at column 30, row 1, and prints a word in the Hebrew alphabet starting at column 30, row 2.  Both lines are supposed to begin at column 30, but the terminal tries to be helpful and makes the second line print in the 30th column from the right, which is unlikely to be the intention of the programmer.



So, what to do?

To complicate matters, Unicode has some explicit control characters that can be embedded in strings when one wants to explicitly state the directionality of all or part of the text.  They can be used to explicitly indicate the direction of a run of text, or override the current general direction of the text. The 12 bidirectional formatting characters (LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI, LRM, RLM, and ALM) need to be interpreted.  See Unicode TR#9 for details.
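To make those abbreviations a bit more concrete, here is a small Python sketch that lists the code points of the formatting characters and reports where they occur in a string.  The function names are hypothetical.  It only finds the characters and looks up bidi classes in the Unicode database; it does not interpret anything the way TR#9 requires.

```python
import unicodedata

# Code points of the bidirectional formatting characters.
BIDI_FORMATTING = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
}


def find_bidi_controls(text):
    """Return (index, abbreviation) for each formatting character found."""
    return [(i, BIDI_FORMATTING[ch])
            for i, ch in enumerate(text)
            if ch in BIDI_FORMATTING]


def bidi_classes(text):
    """Return the Unicode bidi class of each character, e.g. L, R, or AL."""
    return [unicodedata.bidirectional(ch) for ch in text]
```

A Latin letter reports class L, a Hebrew letter R, and an Arabic letter AL.  Those classes are the raw material that a real implementation such as FriBiDi works from.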

One strategy is to

Use a library like GNU FriBiDi to convert the string from logical to visual order

Set the terminal program to *not* try to help with bidirectionalization.

So, the FriBiDi function fribidi_log2vis is the important function for this.  There needs to be a Guile function that wraps up the FriBiDi functionality.  There isn't one yet.

Arabic Shaping

Arabic shaping is another one of those topics about which I know almost nothing, but here are the highlights.

The same Arabic letter can have a different glyph depending on where it appears in a word.  If it appears in the middle of a word, the glyph should join smoothly to its neighboring letters, like English cursive letters.  If it appears at the beginning or end of a word it has a different form.  And it might also look different when it is a single letter not in a word.
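Unicode happens to encode these positional glyphs as separate compatibility characters, which makes the idea easy to see.  The Python sketch below is not a shaping algorithm.  It only shows that the single logical letter BEH (U+0628) has four presentation forms, and that compatibility normalization (NFKC) folds each of them back to the logical letter.  The names BEH_FORMS, describe_forms, and to_logical are made up for this example.

```python
import unicodedata

# The four presentation forms of the Arabic letter BEH (U+0628).
BEH_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}


def describe_forms(forms):
    """Map each position to the Unicode name of its presentation form."""
    return {pos: unicodedata.name(ch) for pos, ch in forms.items()}


def to_logical(ch):
    """Fold a presentation form back to its logical letter via NFKC."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for pos, name in describe_forms(BEH_FORMS).items():
        print(pos, name)
```

Shaping proper goes in the other direction: it looks at the neighbors of each logical letter and picks the right form.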



Some new terminal emulators are smart enough to do this shaping for you. If one detects an Arabic letter in the middle of a word, it uses the correct glyph.  But if you've asked your terminal to *not* help with bidirectionalization so that the behavior of the ncurses screen locations is still predictable, is it still going to do Arabic letter shaping for you?  I don't know.  It is a mess.

There are other complications.  There are Unicode controls that exist to encourage characters to be joined (ZWJ, for example) or to discourage it (ZWNJ).

In any case, if you need to do shaping manually instead of leaving it to your terminal emulator, again GNU FriBiDi is your go-to library for this.  And someone needs to package that for Guile, too.

Line Length and Line Breaking

OK.  You have your string.  It is NFC, untabified, has no nasty control characters in it, and you've decided upon some strategy for bidirectional text.  Next up, we need to figure out how much screen real estate each string takes up, and whether the lines need to be wrapped.

For console programs, each character takes up zero, one, or two cells. Latin letters usually take one cell. Chinese, Japanese, and Korean characters usually take two, as in the following pic.



So how do you tell how many cells a glyph is going to take?  Basically the C library function wcwidth is the basis for this.  It will tell you if a codepoint has a glyph that takes up zero, one, or two cells.  Unicode has their own explanation over here: http://unicode.org/reports/tr11/

Guile needs a function to compute the screen width of a line of text.  I like u32_strwidth from GNU Libunistring, and that's a function that needs to be made available to Guile, too.

But this also has some problems.  Some characters have a width that is ambiguous, and should be 2 cells on a screen that mostly consists of CJK text, or should be 1 on a screen that mostly consists of Latin text.  Some example ambiguous width characters are some punctuation or math symbols.  Not all terminals agree on what to do about ambiguous width characters. But I'm not going to solve this problem.
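Here is a rough Python approximation of the idea, built on the East Asian Width property from TR#11.  It is not wcwidth or u32_strwidth, and it will disagree with them on some characters.  The names cell_width and string_width, and the ambiguous_wide switch, are assumptions for this sketch.  It also assumes control characters have already been replaced.

```python
import unicodedata


def cell_width(ch, ambiguous_wide=False):
    """Estimate how many terminal cells one character occupies."""
    # Combining marks and format characters take no cell of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):  # Wide or Fullwidth
        return 2
    if eaw == "A" and ambiguous_wide:  # Ambiguous: depends on the context
        return 2
    return 1


def string_width(text, ambiguous_wide=False):
    """Estimate the number of cells a string occupies on one line."""
    return sum(cell_width(ch, ambiguous_wide) for ch in text)
```

The ambiguous_wide flag pushes the ambiguity decision onto the caller, which is about the best a library can do, since the answer depends on the terminal and its font.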

But once one knows how many screen cells a line of text is going to take, it would seem fairly easy to then construct a line-breaking algorithm.  Line breaks need to be put in at all the explicit line breaks of the five line breaking characters CR, LF, NEL, PS, and LS. Remember that CR+LF counts as just a single line break.  And then lines that exceed a desired number of columns need to be broken at plausible locations at the end of words.

There are some other complications.  There are some Unicode characters that are there to prevent line breaking or encourage line breaking: various hyphens, soft hyphens, double hyphens, word joiners.

Hopefully this convinces you that line breaking is not really something you should treat casually.  See Unicode TR #14 for its generic line breaking algorithm.  It is actually quite complex.  In any case, I'm not going to engineer a console line-wrapping algorithm.  I like the one in GNU Libunistring called u32_width_linebreaks(), described in the Libunistring manual. And, again, this isn't available to Guile yet, so I need to package that, too.

But if there were such a function, it would take a string and break it into a list of strings, where each element of the list had one screen line of text.
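To show the shape of such a function, here is a deliberately naive Python sketch.  It is not the TR #14 algorithm and not u32_width_linebreaks.  It splits on the five explicit line breaks (treating CR+LF as one), then greedily wraps at spaces by cell width.  It never splits a word that is wider than the line, it knows nothing about hyphens or word joiners, and it can drop leading spaces.  The names wrap and string_width are hypothetical.

```python
import re
import unicodedata

# CR+LF first, so that the pair counts as a single break.
LINE_BREAK = re.compile(r"\r\n|[\r\n\x85\u2028\u2029]")


def string_width(text):
    """Rough cell count: 0 for combining marks, 2 for wide, else 1."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def wrap(text, columns):
    """Break text into a list of lines no wider than columns, if possible."""
    lines = []
    for paragraph in LINE_BREAK.split(text):
        current = ""
        for word in paragraph.split(" "):
            candidate = word if not current else current + " " + word
            if current and string_width(candidate) > columns:
                lines.append(current)  # line is full, start a new one
                current = word
            else:
                current = candidate
        lines.append(current)
    return lines
```

Each element of the returned list is one screen line, which is the contract the rest of the rendering steps need.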

Alignment and Padding

Alright, now we're at the end of this process.  The last step is padding the string to get the desired alignment.

The default alignment for Latin console text should be to have an aligned left margin and a ragged right margin.  For RTL text it should be to have an aligned right margin and a ragged left margin.  So you need a function that tries to determine the general directionality of a paragraph.
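One common heuristic is to look at the first character with a strong direction, which is roughly what rules P2 and P3 of TR#9 do.  The Python sketch below is a simplified version of that idea: it ignores the special handling of isolates, and the function name and the default of "ltr" are my own assumptions.

```python
import unicodedata


def paragraph_direction(text, default="ltr"):
    """Guess paragraph direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":  # strong left-to-right
            return "ltr"
        if bidi in ("R", "AL"):  # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return default  # only digits, spaces, punctuation, and so on
```

Digits and spaces are not strong characters, so a paragraph that starts with a number and then continues in Arabic still comes out as right-to-left.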

If you want a paragraph of text to be right-aligned or center-aligned, you need to pad the strings on the left with spaces.  To know how many spaces to pad a line, you need to know how many cells a line occupies on the screen.  Again it all goes back to wcwidth and  u32_strwidth as described above.
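A padding function might look like the following Python sketch.  The width calculation is the same rough East Asian Width approximation rather than the real u32_strwidth, and the names pad and string_width are hypothetical.  For centered text I've chosen to pad on both sides so that every line fills the box, which is a design choice and not something ncurses requires.

```python
import unicodedata


def string_width(text):
    """Rough cell count: 0 for combining marks, 2 for wide, else 1."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def pad(line, columns, align="left"):
    """Pad one wrapped line with spaces to the given width in cells."""
    gap = max(columns - string_width(line), 0)
    if align == "right":
        return " " * gap + line
    if align == "center":
        left = gap // 2
        return " " * left + line + " " * (gap - left)
    return line + " " * gap
```

Because the gap is computed in cells and not in characters, a line of two CJK characters padded to six columns gets two spaces, not four.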

Next

So I'll come back when I have all the above functionality in some library somewhere, and we'll finally be ready to put some green text in a green box.
