
Transcription

00:00:03,911 --> 00:00:37,520

So my presentation is about how much some people struggle to use typography in their languages, especially with digital type, because there is quite a complex set of elements that makes up this universe of digital type.

00:00:37,520 --> 00:01:59,759

So one of the basic things people do when they want to use their languages, they end up with these types of problems down here, where some characters are shown and some aren't, and sometimes they don't match within the font, because one font has one of the characters they need and another doesn't. Like here for example, this font has the capital letter but not the lower-case letter. So users don't really know how to deal with that: they just try different fonts, and when they're more courageous, they go online and find out how to complain about those. And so very often they end up just complaining to developers, I mean font designers or engineers. And those people try to solve those problems as well as they can. But it's pretty hard to find out how to solve them sometimes. Adding missing characters is pretty easy but…

00:01:59,759 --> 00:03:55,467

Then sometimes you also have language requirements that are very complex. Like here for example, in Polish, you have the ogonek, which is like a little tail, that shows that a vowel is nasalized. Most fonts actually have that character, but for some languages people are used to having that little tail centred, and it's quite rare to see a font that has that. So then, when font designers face that issue, they pretty much have to make a choice whether they want to go with one tradition or another, and whichever way they go, they pretty much only cater to those people. You also have problems of how you want to space things differently, because this is a stacking of different accents, better called diacritics or diacritical marks. Stacking this high up often ends up on the line above, so often you have to find a solution to make it less heavy on a line, and in some languages, instead of stacking them, they end up putting them side by side, which is yet another point where you have to make a choice.

00:03:55,467 --> 00:04:55,127

But basically, all these things are based on how type is represented on computers. You used to have simple encodings like ASCII, the basic Western Latin alphabet, where each character was represented by a byte. You had those bytes that would represent the character, and the character could be displayed with different fonts, with different styles; they could meet the requirements of different people. And then they made a bunch of different encodings, because there were a lot of requirements and it's kind of hard to fit all that in ASCII.
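
A minimal Python sketch of that "one byte per character" model; the sample strings are only illustrative and are not from the talk:

```python
# ASCII: every character fits in a single byte (values 0-127).
text = "Latin"
data = text.encode("ascii")
print(list(data))            # [76, 97, 116, 105, 110], one byte per character
print(data.decode("ascii"))  # 'Latin'

# Anything outside that basic Western Latin set simply cannot be encoded this way:
try:
    "ą".encode("ascii")
except UnicodeEncodeError as err:
    print(err)               # 'ascii' codec can't encode character '\u0105' ...
```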

00:04:55,127 --> 00:06:12,541

So often they would start with ASCII and then add specific requirements, but then they ended up having a lot of different standards because of all the different needs. And so one single byte of representation would have different meanings, and each of these meanings could be displayed differently in fonts. Old webpages very often use old encodings, and if your browser is not using the right encoding, you just get gibberish displayed, because of this chaos of encodings. And so in the late 80s they started thinking about those problems, and then in the 90s they started working on it with Unicode: a bunch of companies got together and worked on one single unifying standard that would pretty much be compatible with all the previously used standards and the newly coming ones.
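
As a rough illustration of that chaos (my example, with arbitrarily chosen legacy encodings): the same byte means a different character in each encoding, and decoding text with the wrong one produces exactly the kind of gibberish mentioned above.

```python
# One byte, several meanings, depending on the assumed legacy encoding:
raw = bytes([0xE6])
for enc in ("latin-1", "iso8859-2", "cp1253"):
    print(enc, raw.decode(enc))   # æ in Latin-1, ć in ISO 8859-2, ζ in Windows-1253

# Classic "wrong encoding in the browser" gibberish: UTF-8 bytes read as Latin-1.
print("żółć".encode("utf-8").decode("latin-1"))
```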

00:06:12,541 --> 00:07:33,909

In Unicode, it's pretty well defined: you have a universal code point to represent, to identify a character, and then that character can be displayed with different glyphs depending on the font or the style selected. And so with that framework, when you have to have the proper character displayed, you pretty much just go to that code point in a font editor and change the shape of the glyph, and it can be displayed properly. And then sometimes there's just no code point for the character you need, because it hasn't been added: it wasn't in any existing standard, or nobody has ever needed it before, or the people who needed it just used the old printers and metal type.
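
A small Python sketch of that code point / glyph separation, reusing the ogonek example from earlier (illustrative only):

```python
import unicodedata

ch = "ą"                                  # the nasalized Polish vowel from before
print(f"U+{ord(ch):04X}")                 # U+0105, the universal code point
print(unicodedata.name(ch))               # LATIN SMALL LETTER A WITH OGONEK
print(chr(0x0105) == ch)                  # True: the code point identifies the character;
                                          # its shape (the glyph) depends on the font
```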

00:07:33,909 --> 00:07:59,175

So for that, you sort of have to start to deal with Unicode, the organization itself. They have a few ways to communicate, like the public mailing-list, and recently they also opened a forum where you can ask questions about the characters you need, because you might just not find them.

00:07:59,175 --> 00:10:56,931

I guess you're familiar with the character map. Very often you have an application where you can access either all the characters that exist in Unicode or the ones available in the font you're using. So it's kind of hard to find what you need, actually, most of the time, because it's organized with a very restrictive set of rules. Most of the time the characters are just ordered the way they're ordered within Unicode, meaning their code point order: the capital A is 41 (in hexadecimal), then the B is 42, etc. The further you go, the further you go into the Unicode blocks and tables. And there, there are a lot of different writing systems. But also, because Unicode is sort of expanding organically, work is done on one script, and then another, and then they come back to the previous script and add things, so things are not really in a logical or practical order. Like here for example, Basic Latin is all the way up there, and here you have Latin Extended-A, Latin Extended Additional, Latin Extended-B, C and D. But those are actually quite far apart within Unicode, and each of them can have a different set-up: for example, here you have a capital letter that is just alone, and here you have a capital letter and a lower-case letter. So when you know what character you want to use, sometimes you might just find the upper-case and then you have to keep looking for the lower-case. Depending on the application you're using, you might be able to find the information right there: looking here, the lower-case is there, so you can find it in the International Phonetic Alphabet block.
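
The capital-in-one-block, lower-case-in-another situation can be checked programmatically; a short sketch, where the hooked B is my guess at the kind of pair meant here:

```python
import unicodedata

upper, lower = "\u0181", "\u0253"
for ch in (upper, lower):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0181 LATIN CAPITAL LETTER B WITH HOOK   -> sits in the Latin Extended-B block
# U+0253 LATIN SMALL LETTER B WITH HOOK     -> sits in the IPA Extensions block
print(upper.lower() == lower)   # True: one letter, two far-apart blocks
```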

00:10:56,931 --> 00:13:08,048

Basically, when you have a character that you just can't find, you ask on the mailing-list or the forums, and people can tell you whether there would be a point in including it in Unicode or not. And then, if you're very motivated, you can try to meet the criteria for inclusion. For the proper inclusion there has to be a formal proposal; they have a template with questions you can answer, and you also have to provide proof that the characters you want to add are actually used or would be used. The criteria are quite complicated because you have to make sure that this is not a glyphic variant, meaning basically the same thing but represented differently, or just a different shape for the same idea that is still used the same way. And then you also have to show that it is not actually a character that already exists, because sometimes you just don't know it's a variant of another one, or sometimes they just want to make it easier and say it's a variant of another one even though you don't think it is. You have to be sure that it's not just a ligature, because sometimes ligatures are used as a single character but they're not one, and sometimes they are. You have to provide an actual font with the character so that they can use that in their documentation.

00:13:08,048 --> 00:16:56,327

FS: How long does it take usually?

DJ: It depends, I guess, because sometimes they just accept it right away if you provide enough proof and if you explain it properly, but very often they ask for revisions to the proposal, and sometimes it's just rejected because it doesn't meet the criteria. Actually those criteria have changed a bit over time, because initially in Unicode you had a… They started with Basic Latin and then they started adding the special characters that were used: here for example it's the International Phonetic Alphabet, but also all the accented ones… Because initially they figured these were used in other encodings and they just wanted to be compatible with everything that already existed, so they were going to add them. And then they also figured that since they already had all those accented characters from other encodings, they were also going to add all the ones they knew were used even though they were not encoded yet. And that's why you have different names, because they had different policies at the beginning instead of having the same policy as now. So they added here a bunch of Latin letters with marks that were used, for example, in transcription. So if you're transcribing Sanskrit, for example, you would use some of the characters here. Then at some point they realized that this list of accented characters would just get huge, and there must be a smarter way to do this. So they figured you could actually just use the parts of those characters, because they can be broken apart: there's a base letter and there are the marks you add. So they actually have an equivalence between the two: here you have a single character that can be decomposed canonically into the small letter b and a combining dot above. And here you have the character for the dot above; it's in the block of the diacritical marks. So here you pretty much have all the diacritical marks they thought were useful at some point. At that point, when they realized they would end up having thousands of accented characters, they figured that this way they could have just any possibility, so from now on they were just going to say: if you want an accented character that hasn't been encoded already, just use the parts that represent it. And some people in '96, for Yoruba, a language spoken in Nigeria, made a proposal to add the characters with diacritics they needed, and Unicode just rejected the proposal because they had this way of composing them.
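
That canonical decomposition is exactly what Unicode normalization exposes; a minimal Python check of the "b with dot above" example (my example, not a slide from the talk):

```python
import unicodedata

precomposed = "\u1E03"                       # LATIN SMALL LETTER B WITH DOT ABOVE
for c in unicodedata.normalize("NFD", precomposed):
    print(f"U+{ord(c):04X}", unicodedata.name(c))
# U+0062 LATIN SMALL LETTER B
# U+0307 COMBINING DOT ABOVE
print(unicodedata.normalize("NFC", "b\u0307") == precomposed)   # True: the parts compose back
```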

00:16:56,327 --> 00:18:33,250

FS: So the elements they needed were not already in the toolbox?

DJ: Yes, the elements were, but not the… The encoding parts are there, meaning it can be represented with Unicode, but then software didn't handle it properly, so it made more sense to the Yoruba speakers to have it encoded in Unicode.

FS: So you could type, but you need to type two characters of course?

DJ: Yes, that's a big problem. The way you type things… Because very often most keyboards are based on the old encodings, where you have accented characters as single characters, so when you want to do a sequence of several characters, you actually have to type more, or you'd have to have a special keyboard layout that allows you to have one key mapped to several characters. So that's technically feasible, but it's just a slow process to have all the possibilities. You might have one that's very common, so developers end up adding it to the keyboard layouts or whatever applications they're using. But then other people have different needs, and so it takes the same effort again to have those sequences available.
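
A toy sketch of what "one key mapped to several characters" can mean in practice; the key names and the Yoruba-style sequences here are invented for illustration, not taken from any real layout:

```python
# Hypothetical layout: each logical key inserts a whole sequence of code points.
LAYOUT = {
    "E_DOT_ACUTE": "e\u0323\u0301",   # e + combining dot below + combining acute
    "O_DOT_GRAVE": "o\u0323\u0300",   # o + combining dot below + combining grave
}

def key_to_text(key: str) -> str:
    """Return the character sequence one keystroke should insert."""
    return LAYOUT[key]

print(key_to_text("E_DOT_ACUTE"))     # three code points from a single keystroke
```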

00:18:33,250 --> 00:22:40,479

Within Unicode, there is a lot of documentation, but it's quite hard to find when you're just starting, and it's quite technical. Most of it is actually in a book they publish with every new version. So that book has a few chapters that describe how Unicode works, how characters should work together, what properties they have, and all the differences between scripts that are relevant. They also have special cases trying to cater to those needs that weren't met or the proposals that were rejected. So for example, they have a few examples in the Unicode book: in some transcription systems they have this sequence of characters, or ligature: this is a T and an S with a ligature tie, and then a dot above. So the ligature tie means that T and S are pronounced together, and the dot above, err… has a different meaning (laugh). But it has a meaning! But because of the way characters work in Unicode, applications actually reorder whatever you type in, so that the ligature tie ends up being moved after the dot. So you always have this representation, because you have the T, there should be the dot, and then there should be the ligature tie and then the S. So the T goes first, the dot goes above the T, the ligature tie goes above everything, and then the S just goes next to the T. And the way they explain how to do this is: you're supposed to type the T, the ligature tie, and then a special diacritical mark that prevents any kind of reordering, and then you can add the dot and then the S. So this kind of use is great because you have a solution, it's just super hard because you have to type five characters instead of… well… four (laugh). But still, most of the libraries that render fonts don't handle it properly, and then even most fonts don't plan for it, so even if they did, the libraries wouldn't handle it properly anyway. Then there are other things that Unicode does: because of that separation between accents and characters, and then the composition, you can actually normalize the way things are ordered. And so you have… This sequence of characters can be recomposed into the pre-composed one with a circumflex, or whatever combining marks you have can be put in the normalized order. And so all these things have to be handled somehow, in the libraries, in the application or in the fonts. Then you…
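
The reordering and the blocking mark described here can be reproduced with Unicode normalization; a Python sketch, assuming the "special diacritical mark" is U+034F COMBINING GRAPHEME JOINER:

```python
import unicodedata

# t + ligature tie (U+0361) + dot above (U+0307) + s:
typed = "t\u0361\u0307s"
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", typed)])
# ['U+0074', 'U+0307', 'U+0361', 'U+0073'] -> the dot has jumped in front of the tie,
# because canonical ordering sorts marks by combining class (230 before 234).

# Inserting the blocking mark keeps the typed order, at the cost of a fifth character:
kept = "t\u0361\u034F\u0307s"
print(unicodedata.normalize("NFD", kept) == kept)   # True: no reordering
```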

00:22:40,479 --> 00:24:29,398

The documentation of Unicode itself is not prescriptive, meaning that the shapes of the glyphs are not set in stone. So you still have room for the style you want, the style your target users want. So for example we have different (…) and different (…). Unicode just shows one shape, and it's the font designer's choice to have different ones. Unicode is also not about glyphs; it's really about how information is represented, not how it's displayed. So like here, you have two characters that are displayed as a ligature: it is actually encoded as one character because of previous encodings, but if it were a new case, Unicode most probably wouldn't take the ligature as a single character. So that's the thing with compatibility. Here you have two different ways of representing the same letter with a circumflex, and the font rendering library should be able to position things identically, either on its own or with the information provided in the font. And it should also be the same relative positioning as for characters that can only be composed.
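
Both kinds of equivalence touched on here, canonical (precomposed versus combining marks) and compatibility (a ligature inherited from older encodings), show up directly in normalization; a short sketch using ê and the fi ligature as stand-ins for the examples on the slide:

```python
import unicodedata

# Canonical: two encodings of ê, equal only after normalization.
print("\u00EA" == "e\u0302")                                  # False: different code points
print(unicodedata.normalize("NFC", "e\u0302") == "\u00EA")    # True

# Compatibility: the fi ligature exists only because older encodings had it;
# compatibility normalization folds it back to plain f + i.
print(unicodedata.normalize("NFKC", "\uFB01"))                # 'fi'
```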

00:24:29,398 --> 00:27:38,586

And so all this information is really in a corner there. It's quite rare to find fonts that actually use this information to cater to the needs of the people who need those specific features. One of the ways to implement all those features is with TrueType/OpenType, and there are also some alternatives like Graphite, which is a subset of a TrueType/OpenType font. But then you need your applications to be able to handle Graphite. So the real unique standard is TrueType/OpenType. It's pretty well documented; it's very technical because it allows you to do many things for many different writing systems. But the thing is that it's slow to update, so if there's a mistake in the actual specification of OpenType, it takes a while before they correct it and before that correction shows up in your applications. It's quite flexible, and one of the big issues is that it has its own language code system, meaning that some identified languages just can't be identified in OpenType, so you can't… One of the features in OpenType is being able to say: if I'm using Polish, I want this shape, and if I'm using Navajo, I want this shape. That's very cool, because then you can make one font that's used by Polish speakers and Navajo speakers, and they don't have to worry about changing fonts as long as they label what they're doing according to what languages they're using. But then for languages that don't have a tag… in OpenType you just don't have that option. This other standard has codes for all the languages, that's pretty standard, but yet OpenType decided to have its own that's different, that doesn't have a one-to-one relation. And so it's really frustrating because, you have Unicode, and you can see here, you can find all the characters in Unicode, but it's not organized in a practical way: you have to look all around the tables to find the characters that may be used by one language, and then you have to look around for how to actually use them.
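
One way to see which of those OpenType language systems a font actually declares is to read its GSUB table; a sketch with fontTools (the tool choice and the file name are my assumptions, not the speaker's):

```python
from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")        # placeholder path
if "GSUB" in font:
    for rec in font["GSUB"].table.ScriptList.ScriptRecord:
        langs = [ls.LangSysTag for ls in rec.Script.LangSysRecord]
        print(rec.ScriptTag, langs)
# e.g. "latn ['PLK ', 'NAV ']" would mean the font carries Polish- and
# Navajo-specific behaviour that applications can select.
```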

00:27:38,586 --> 00:29:01,425

There is a real lack of awareness within the font designer community. Even when they might add all the characters you need, they might just not add the positioning, so like here for example, when you combine with a circumflex, it just doesn't position well. That's because most of the font designers still work with the old encoding mindset, where you have one character for one accented letter. And it's just hard. Sometimes they just think that following the Unicode blocks, the Unicode character charts, is good enough. But then you have problems where, like here at the beginning, the capital is in one block and the lower-case is in a different block. And then they just work on one block and don't do the other, because they don't think it's necessary; but yet, the two cases of the same letter are there, so it would make sense to have both.
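
Whether a font ships any mark positioning at all can be checked the same way, by looking for mark-to-base lookups in its GPOS table; again a rough fontTools sketch with a placeholder file name:

```python
from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")        # placeholder path
has_mark_to_base = False
if "GPOS" in font:
    for lookup in font["GPOS"].table.LookupList.Lookup:
        # Lookup type 4 is mark-to-base attachment (extension lookups, type 9, can wrap it).
        if lookup.LookupType == 4:
            has_mark_to_base = True
print(has_mark_to_base)              # False often means combining marks won't sit where they should
```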

00:29:01,425 --> 00:29:38,888

It's hard because there are very few connections between the Unicode world, the people who work on OpenType libraries and on how OpenType is handled, the font designers, and the actual needs of the users.

00:29:38,888 --> 00:32:34,179

PM: There is something at the beginning of the presentation that is really surprising: you went for the code points of the characters, all your characters are subtitled by their code points. But why didn't you use the names? Because it's kind of the beauty of Unicode to name everything, every character.

DJ: Well, actually… the names are quite long, and those are shorter? (laugh) And also, one of the strange things about Unicode is that… For stability, they have this policy where you can't change the names of the characters, so they actually have an errata where they realized: "oh shit, we shouldn't have named this that, so here's the actual name that makes sense, and the real name is wrong."

FS: Can you give us the link to that document?

DJ: Yeah!
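
That errata mechanism is visible from code too: Unicode publishes formal name aliases for the mistakes, and Python can look characters up by those aliases. The GHA/OI letter used below is one well-known case, chosen by me as an example:

```python
import unicodedata

ch = unicodedata.lookup("LATIN CAPITAL LETTER GHA")   # resolves the corrected alias
print(f"U+{ord(ch):04X}")                             # U+01A2
print(unicodedata.name(ch))                           # LATIN CAPITAL LETTER OI
# The original (wrong) name can never change; the alias is the fix.
```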

FS: Pierre refers to the fact that in the character mappings, each of the glyphs also has a description. And those are sometimes so abstract and poetic that this was the start of a work by OSP to try to re-imagine what shapes would actually belong to the descriptions. So "combining dot above", that's the textual description of the code point. But of course there are thousands of them, so they come up with the most fantastic gymnastics to try…

DJ: They just added a pile of poo in the last version…

(laughs)

DJ: So that's the proposal… They have like a whole paragraph explaining the thing and they have the characters… And then you have the description in the Unicode data, the way you can represent information about characters. And then you have samples usually…

00:32:34,179

NM: So when people come into a project like DéjàVu, they have to understand all that to start contributing. How does this training or teaching or learning process take place?

DJ: Well, usually most people are interested in what they know. So they have this specific need, and they realize they can add it to DéjàVu, so they learn how to play with FontForge and… After a while, what they've done is good, so we can use it. Some people find the font and end up adding stuff they're not familiar with. So for example we had Ben doing Arabic: it was mostly just drawing it and then asking for feedback on the mailing list, and then we got some feedback, we changed some things, and then we released it, and we got more feedback (laughs) because more people complained… So it's a lot of just drawing what you can from the resources you can find. Very often it's based on other typefaces. But sometimes you're just copying mistakes from other typefaces… So eventually it's just the feedback from the users that's really helpful, because then you know that, well, people are using it, trying it, and then you know how to make it better.

NM: How much do they know, or need to know, about Unicode for instance to be able to get started?

DJ: Just know where the character has to go, which code point. And then for the other features, we try to help if we can, if we realize there's something missing. FontForge is quite complex: a lot of the OpenType features FontForge handles very well… it's just really complicated to add them. So they actually don't have to struggle with it, kind of like… some people ask questions about how to actually use these OpenType features, and then we just do it for them, or we do it for them and then explain, or we just don't know how to do it (laughs)… which is probably most of the cases.
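
A minimal sketch of that "just know which code point" starting point, using FontForge's own Python module; the file names are placeholders and the outline import is only indicated, not provided:

```python
import fontforge

font = fontforge.open("DejaVuSans.sfd")      # placeholder source file
glyph = font.createChar(0x0253)              # slot for LATIN SMALL LETTER B WITH HOOK
# glyph.importOutlines("b_hook.svg")         # the drawing itself still has to come from somewhere
font.generate("DejaVuSans-test.ttf")         # build a font to try out and get feedback on
```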

PM: Are you using feature files or the interface of FontForge?

DJ: Ah, just the interface. I guess feature files would actually be easier, but sometimes… actually some of us use scripts, so instead of a feature file I use scripts, because feature files… they get unwieldy after a while, so it's better to do scripts. I guess you could just use a script to generate a feature file; that would work too.
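
One reading of "a script to generate a feature file": write the .fea text from a script and compile it into the font with fontTools' feaLib. The glyph names, the Polish-specific substitution and the file names below are invented for illustration, and the font would need to contain both glyphs:

```python
from fontTools.ttLib import TTFont
from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

fea = """
languagesystem latn dflt;
languagesystem latn PLK;

feature locl {
    script latn;
    language PLK;
    # hypothetical alternate with a centred ogonek
    sub aogonek by aogonek.polish;
} locl;
"""

font = TTFont("SomeFont.ttf")                # placeholder font containing both glyphs
addOpenTypeFeaturesFromString(font, fea)
font.save("SomeFont-locl.ttf")
```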