A multilingual environment for LaTeX with Aleph
© 2004-2005 Javier Bezos
All Rights Reserved
See also the Forum, with updated information and the possibility to post a comment. (I intend to create a weblog, but for the time being this is a sort of poor man's blog.)
2005-07-10. Selecting components and properties
I'm rewriting significant portions of the main package to implement the inheritance scheme explained below, so there is no code to release until this task is more or less finished (this feature requires changes in many interrelated parts of the code, and therefore it does not work yet). It's taking me a lot more time than I expected.
The main reason is that I'm not sure whether components and properties should be split into two categories, as they currently are, or should rather be a single selector. For example, quotes have their own group, so you can say
text.quotes = include (or exclude, or inherit, etc.)
but it so happens quotes can have different configurations:
text.quotes.style = angle (or high, etc.)
So, shouldn't both settings be merged, so that we can say:
text.quotes = angle (or high, OR include, or exclude, or inherit, etc.)?
(that is, either a particular style, which implies the quotes are included, or include to use a default value; of course there should be a way to set the default). Note that CSS follows more or less this combined syntax, but note as well that CSS has a closed set of properties (and values), while Mem allows you to define your own properties and their values.
In other words, selecting a component should very likely be treated as just another property instead of a separate feature (as far as a user is concerned, of course; the code could be quite different).
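The combined syntax can be illustrated with a small sketch. It is written in Python only to show the logic, it is not Mem code, and the function name parse_setting and the defaults table are assumptions made up for the example. A value is either one of the selection keywords mentioned above or the name of a style, and a style implies include.

```python
# Keywords that control selection rather than naming a style
# (the three mentioned in this entry; the real list may be longer).
SELECTION_KEYWORDS = {"include", "exclude", "inherit"}

def parse_setting(line, defaults):
    """Parse 'group.component = value' into (key, action, style).

    Hypothetical helper: a keyword value sets the action, and any other
    value is taken as a style name, which implies 'include'.
    """
    key, _, value = line.partition("=")
    key = key.strip()
    value = value.strip()
    if value in SELECTION_KEYWORDS:
        # 'include' falls back to the default style, if one has been set.
        style = defaults.get(key) if value == "include" else None
        return key, value, style
    return key, "include", value

# Hypothetical default style for the quotes component.
defaults = {"text.quotes": "angle"}

for line in ("text.quotes = high", "text.quotes = include", "text.quotes = inherit"):
    print(parse_setting(line, defaults))
```

With a single selector the user never has to know whether quotes is a component or a property; the price is that style names and selection keywords share one namespace, so a style could not be called include.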
2005-02-07. EuroTeX 2005
Currently I'm preparing my talk at EuroTeX (March 7, 2005), so very little has been done.
2005-02-04. LGR encoding
Adapting LGR is not easy, because several important signs are assigned to ASCII points whose catcode is reserved in LaTeX; e.g., GREEK PSILI AND OXIA sits at "5E (^) and thus has the catcode of superscript. Another symbol is mapped to the backslash. That would not matter except for a long-standing bug in how OCPs treat catcodes, which the Aleph Task Force is trying to fix because it's a critical one (very often \string is a no-op and \char does not ensure a catcode of 12). Actually, these signs are intended to be accessed only by means of a TeX ligature like >', and this is what I've done in the mtp files.
Since there are very few LGR fonts, and very likely their number will not increase, I'm thinking about removing the support for that encoding and writing a virtual file instead.
Several control characters have been added, in particular spacing controls like EM QUAD, FIGURE SPACE, etc.
2004-12-23. Further tests
I continue exploring the current code. The black boxes in greek.pdf are non-breaking spaces. Currently, mtp's handle printable characters only, but control characters, which are so important in Unicode, should be taken into account, too. A new greek.pdf shows a few polytonic characters written in UTF-8 as well as a short polytonic text.
2004-12-18. Encodings and components
The basic code for font encodings is more or less working, but it's still far from finished. However, you can currently make tests for Cyrillic and Greek, even polytonic. There is a short example in greek.tex. Oddly enough, omlgc does NOT have the extended Greek Unicode page; instead, these characters are placed in the private area with their own encoding. LGR, however, more or less works, but you should note that the input convention is that of Omega: like that of babel, but with = replacing ~ (after all, redefining ~ does not seem a good idea). Now, this basic structure must be fine-tuned and improved.
One of the next tasks is to see how language features are selected. Currently, these features are grouped in a few so-called components, so that you can select, say, the "text" component. However, this scheme is very limited and a new one must be devised to allow selecting subcomponents or even a specific macro. Ideally, one should be able to:
- select a feature,
- ignore a feature, "inheriting" it from the surrounding language,
- ignore a feature, "inheriting" it from the main language,
- ignore a feature, killing it and restoring the LaTeX default, if any.
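The four cases above amount to choosing where the value of a feature is looked up. The following Python sketch shows only that logic; the mode names, the function resolve_feature and the sample tables are assumptions for the example, not part of the package.

```python
# Hypothetical mode names for the four cases listed above.
SELECT = "select"                 # use the language's own definition
FROM_SURROUNDING = "surrounding"  # inherit from the surrounding language
FROM_MAIN = "main"                # inherit from the main language
KILL = "kill"                     # restore the LaTeX default, if any

def resolve_feature(feature, mode, own, surrounding, main, latex_defaults):
    """Return the value of `feature` according to `mode`.

    Each of the last four arguments is a dict mapping feature names to
    values. None is returned when the chosen source does not define it.
    """
    sources = {
        SELECT: own,
        FROM_SURROUNDING: surrounding,
        FROM_MAIN: main,
        KILL: latex_defaults,
    }
    if mode not in sources:
        raise ValueError("unknown mode: %r" % (mode,))
    return sources[mode].get(feature)

# Made-up sample data: Greek text inside English inside a Spanish document.
greek = {"quotes": "greek-angle"}
english = {"quotes": "high"}
spanish = {"quotes": "angle"}
latex_defaults = {}  # LaTeX itself defines no quotes feature here

for mode in (SELECT, FROM_SURROUNDING, FROM_MAIN, KILL):
    print(mode, resolve_feature("quotes", mode, greek, english, spanish, latex_defaults))
```

The lookup itself is simple; as the next paragraph says, the hard part is a syntax that lets the user state the mode cleanly.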
The main problem is finding a proper syntax to make these selections in a clean and simple way. Currently, to select a component you have to write, say, text, and to kill it, you have to write notext. This syntax is inherited from my package polyglot, which I used as a starting point to have something to work with, but it's clear it isn't flexible enough.
2004-11-28. Font encodings
It seems the LGC scripts will finally require three OCPs, although they are not implemented in full (currently there are only two, with states). The first one decomposes and is script dependent (NOT font-encoding dependent), the second one will compose the glyphs available in the font, and the third one eventually maps from Unicode to the composed glyphs or composes them with the \accent primitive.
I'm rewriting font encodings from scratch, so there are many new bugs. A few I've discovered in a quick look (as of 2004/11/28):
- TeX sometimes appears in Latin and sometimes in Greek
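The three stages can be mimicked outside TeX with a rough Python sketch. This is not how the OCPs are written: canonical decomposition (NFD) stands in for the script-dependent first stage, the set font_glyphs and the table accent_slots are made-up stand-ins for what a font encoding provides, and only a single mark per base is handled.

```python
import unicodedata

def decompose(text):
    """Stage 1 (stand-in): split precomposed characters into base + marks."""
    return unicodedata.normalize("NFD", text)

def compose_available(text, font_glyphs):
    """Stage 2: recompose a base + mark pair only if the font has the glyph."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            candidate = unicodedata.normalize("NFC", out[-1] + ch)
            if len(candidate) == 1 and candidate in font_glyphs:
                out[-1] = candidate
                continue
        out.append(ch)
    return "".join(out)

def to_tex(text, accent_slots):
    """Stage 3: emit \\accent<slot> for marks that are still uncomposed."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) and ch in accent_slots:
            base = out.pop()
            out.append("\\accent%d %s" % (accent_slots[ch], base))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical font: it has a precomposed e-acute but no n-tilde, and its
# tilde accent is assumed to live in slot 126.
font_glyphs = {"\u00e9"}
accent_slots = {"\u0303": 126}

staged = compose_available(decompose("\u00e9\u00f1"), font_glyphs)
print(to_tex(staged, accent_slots))
```

Keeping the first stage independent of the font encoding means that only the last two stages have to know what a given font actually contains.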
- Lowercase iota vanishes as well as several composite characters
- A few symbols are replaced by black boxes
- em-dashes and en-dashes vanish
2004-10-24. Diacritical marks
The Unicode 4.0 book states (p. 184) when discussing spacing modifier letters:
A number of the spacing forms are covered in the Basic Latin and Latin-1 Supplement blocks. The six common European diacritics that do not have encodings there are added as spacing characters.
In other words, except for these six diacritics (U+02D8-U+02DD), the spacing forms of combining characters are those in the range U+0000-U+00FF. Unfortunately, this is not true, since the spacing caron accent (U+02C7) is not encoded in these blocks. Further, one of these six diacritics encoded separately, namely the tilde U+02DC, does exist in these blocks (U+007E).
What to do, then? One is forced to look for some kind of hint, and one is readily found: all characters in the block Spacing Modifier Letters are prefixed with MODIFIER LETTER, except the six spacing clones and CARON (U+02C7). From this, we can infer that the right spacing form for the circumflex accent is not the MODIFIER LETTER variant, but the one in the Basic Latin block, exactly like the ACUTE ACCENT. No doubt the "small" tilde has been encoded separately because the ASCII tilde already has a special meaning in several OS's.
Still, I think there is a better solution, or rather a better encoding which does not pose this problem. Since the glyphs for diacritics are mainly intended for use with the \accent primitive, one can conclude they are, after all, combining characters. The fact that we need further processing with TeX does not prevent considering these glyphs conceptually as non-spacing characters, because this is the way TeX works. Since composing diacritical marks are encoded anew in Unicode, we don't need to be concerned with legacy encodings.
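The naming hint is easy to check against the character names themselves. The following Python sketch uses the standard unicodedata module, so the result depends on the Unicode version shipped with the interpreter, not on Unicode 4.0; the function name unprefixed_in_block is made up for the example.

```python
import unicodedata

def unprefixed_in_block(start=0x02B0, end=0x02FF, prefix="MODIFIER LETTER"):
    """List (code point, name) pairs in the block whose name lacks the prefix.

    The default range is the Spacing Modifier Letters block.
    """
    result = []
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unassigned code points
        if name and not name.startswith(prefix):
            result.append((cp, name))
    return result

for cp, name in unprefixed_in_block():
    print("U+%04X %s" % (cp, name))
```

The list should include CARON (U+02C7) and the six clones U+02D8-U+02DD, while U+02C6 is named MODIFIER LETTER CIRCUMFLEX ACCENT and so does not appear.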
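To make this view concrete, each spacing accent can be labelled with the combining mark it stands for. The Python sketch below is only an illustration: it reads the compatibility decompositions from the unicodedata module, and the MANUAL table is an assumption added by hand for the spacing forms that carry no decomposition in the Unicode data.

```python
import unicodedata

# Hand-made additions (assumption): these spacing forms have no
# decomposition in the Unicode data, so the link is stated explicitly.
MANUAL = {
    "\u0060": "\u0300",  # GRAVE ACCENT      -> COMBINING GRAVE ACCENT
    "\u005E": "\u0302",  # CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
    "\u02C7": "\u030C",  # CARON             -> COMBINING CARON
}

def as_combining(spacing):
    """Return the combining mark a spacing accent represents, or None."""
    parts = unicodedata.decomposition(spacing).split()
    # Spacing clones decompose as "<compat> 0020 XXXX": SPACE plus the mark.
    if len(parts) == 3 and parts[0] == "<compat>" and parts[1] == "0020":
        return chr(int(parts[2], 16))
    return MANUAL.get(spacing)

for ch in "\u00B4\u02D8\u02DC\u005E\u02C7":
    mark = as_combining(ch)
    print("U+%04X -> U+%04X %s" % (ord(ch), ord(mark), unicodedata.name(mark)))
```

If the glyph slots are keyed by combining mark in this way, the question of which spacing character is the right one for a given accent no longer arises.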