Release of 3.1.1 for testing

Started by Gert, 09. May 2007, 20:28:47

Gert

Version 3.1.1 is now online for testing: http://dictionarymid.sourceforge.net/development.html#testversion

If no serious problems show up, this version will shortly be released as the 'first official 3.1 version'.

Please post your testing feedback in this forum.

Thanks !
Gert

dreamingsky

Here is my testing with version 3.1.1.

I. JARCreator bug

I tested the 3.1.1 Bitmap Font Generator and the 3.1.0 DictionaryGeneration file.  I saw there is a new JARCreator file too.  That's really great.  It saves me a lot of time.

But, I ran into an error.  I can't get JARCreator to build the JAR with the font files.  I get this error from JARCreator:

Exception in thread "main" java.io.FileNotFoundException: C:\Temp\Dict\Thai_NIU\dictionary\fonts (Access is denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(Unknown Source)
        at de.kugihan.jarCreator.JarCreator.writeJAR(JarCreator.java:178)
        at de.kugihan.jarCreator.JarCreator.main(JarCreator.java:86)

I got the same problem in the Thai, Khmer (Cambodia), and Hindi (India) dictionaries.

If I don't run the Bitmap Font Generator, then the JARCreator will finish with no problems.  I am using version 3.1.1 of DictionaryForMIDs.jar and DictionaryForMIDs.jad.  I am using version 3.1.0 of JARCreator.  Do I need version 3.1.1 of JARCreator?  I didn't see it on the website.

I really like that you can choose multiple font sizes with the Bitmap Font Generator.  That's a big plus.  I thought the fix would only allow you to choose 1 size.  But, you can choose multiple sizes and put them in the JAR file.  That is very nice.  And the .png files are a big plus too.


II. Khmer font bug
The Bitmap Font Generator works well.  I ran into a problem with some Khmer fonts, though (KhmerOS and KhmerOT).  The "KhmerOS System" font worked OK.  The other 2 fonts got cut off at the bottom.  Only the top 15% of KhmerOS showed up.  Nothing showed up for KhmerOT (OT = opentype).  It's just a white line.

I'm not sure if someone wants to bug fix it.  It's probably not worth it since the "KhmerOS System" font works fine.  The Khmer fonts can be downloaded from here:
http://www.khmeros.info/drupal/


III. complex scripts
Also, there is another bug (technically a feature).  But, there is no way to fix it.  Hindi, Thai, and Khmer are "complex scripts".  Fonts for these languages are a little tricky to make.

These languages (actually scripts for most languages in South Asia and South-East Asia) have separate consonant and vowel marks.  So to type "ke" in Hindi, you type a "k", then the "k consonant" shows up.  Then you type "e" and the "e" shows up.  But, the "e" is not a separate letter.  It is put on top of the "k".

So inside the font are directions for how to move the vowels over the consonants.  But, when you convert a font to a bitmap font, then you lose the instructions for the movement.

So when DfM displays the bitmap fonts, the consonants and vowels are divided up into 2 parts.  There is no way to avoid this.  Thai phones have TTF fonts or something that can handle the vowel marks correctly.

Also, Windows uses the Uniscribe DLL http://en.wikipedia.org/wiki/Uniscribe  to make some even more complicated font adjustments.  For example, in Hindi, a "ii" (long "i") is put to the right of the consonant.  But a "i" (short "i") is to the left of the consonant.  The Uniscribe DLL moves the "i" to the left.
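
To make the short "i" case concrete, here is a minimal Java sketch of the one reordering step described above.  It is not how Uniscribe is implemented and it is not DfM code.  The class and method names are made up, and it assumes plain consonant + vowel pairs only (no conjuncts or other shaping rules).

```java
/** Simplified sketch: moves the Devanagari short "i" in front of the consonant it follows. */
final class DevanagariReorderSketch {
    // U+093F DEVANAGARI VOWEL SIGN I is stored after its consonant but drawn to its left.
    private static final char VOWEL_SIGN_I = '\u093F';

    /** Returns the text in left-to-right drawing order for a naive glyph-by-glyph renderer. */
    static String toVisualOrder(String logical) {
        StringBuilder visual = new StringBuilder(logical);
        for (int i = 1; i < visual.length(); i++) {
            if (visual.charAt(i) == VOWEL_SIGN_I) {
                // Swap the vowel sign with the single character stored before it.
                visual.setCharAt(i, visual.charAt(i - 1));
                visual.setCharAt(i - 1, VOWEL_SIGN_I);
            }
        }
        return visual.toString();
    }
}
```

A real shaping engine has to do far more than this (consonant clusters, half forms, mark positioning), which is why the swap alone would not make the bitmap fonts display Hindi correctly.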

Anyway, that is a lot of techno talk.  Basically I'm saying that the Hindi, Thai, and Khmer dictionaries will look funny with the bitmap fonts.  I doubt anyone would fault us for it, though.  If someone does complain, then we can only tell them that we can't fix it.  The problem is due to limits in the architecture.

Jeff

Gert

My comments:

Your point I:
I am not sure what is going wrong there. I hope Sebastian can tell quickly and will know what to do about it. Sebastian was proposing that JarCreator ignore all font files (and I was pressing him to postpone this to 3.2 :( ). Given your problem, maybe we should still implement this in 3.1.

Your point III:
A simple question to the knowledgeable person: nowadays, when we have that almighty Unicode charset, I'd expect that a good Unicode font also covers Hindi, Thai and Khmer, with all the features that you mentioned, right ? Well, maybe not in all cases.

What I mean is, other languages also have the situation where special 'character parts' are added to a character, such as the accents in many Romance languages, the umlauts in German etc. While many of these characters take two key strokes to type, in the end there is one Unicode character that covers the result. Is this different for Thai etc ?

Gert


dreamingsky

Yes, many fonts can support a large number of languages.  "Arial Unicode MS" can probably support 30 languages or more (not Khmer though, so far I've only found 2 Unicode Khmer font makers on the internet).  Currently the Thai dictionaries and the Hindi dictionary are using Arial Unicode MS for the bitmap fonts.

The "complex scripts" work differently than the Roman language examples.  There are actually 2 Unicode code points that get entered when typing.  For the Roman example - if you type a German "u umlaut", then press the backspace button once, then the whole letter is deleted.

For the "complex scripts", if you press the backspace button, then only the vowel is deleted.  The consonant will still be there.  For example - in Hindi, if you type "k" then the "k" will show up.  Then you press the "e" and the "e" will show up over the previous "k".  Now if you press the backspace button then only the "e" is deleted.  The "k" and "e" are 2 different Unicode code points.
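
The difference can be checked with a few lines of Java.  This is only an illustration on a desktop JVM (as far as I know, java.text.Normalizer is not available on the phones), and the class name is made up.  It composes a string as far as Unicode allows and counts what is left.

```java
import java.text.Normalizer;

final class CodePointSketch {
    /** Number of Unicode code points left after composing everything that can be composed (NFC). */
    static int composedLength(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // "u" + U+0308 COMBINING DIAERESIS composes to the single code point U+00FC.
        System.out.println(composedLength("u\u0308"));
        // Devanagari KA (U+0915) + VOWEL SIGN E (U+0947) has no precomposed form.
        System.out.println(composedLength("\u0915\u0947"));
    }
}
```

The German pair collapses to one code point, so one bitmap glyph is enough.  The Hindi pair stays as two, so a bitmap font gets two separate glyphs to draw.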

Gert

Ok, I now understand the different handling between e.g. accented Roman characters and languages such as Hindi.

So, just for slowly understanding people like me, when we use "Arial Unicode MS" for the bitmap font of the Thai and the Hindi dictionary, then these characters are not getting displayed correctly, for the reasons that you explained. Did I understand this correctly ?

So we would need to have something like uniscribe.dll for DfM to solve this problem, right ?

Gert

Tomcollins

Hi!

Yeah! There's a problem with the jarCreator. It expects just two different types of files in dictionary (font.bmf or .csv) but no folders like the new "/fonts/". So we have to make an update for the jarCreator. Sorry!
1) Copy it into a separate folder like before or
2) Ignore it and leave it in the dictionary directory (which would mean we have to update the DFM too)?

What do you prefer, Gert?

Sebastian


Gert

I prefer 2) - which is what I was asking you to postpone to 3.2 only a few days ago ...

Can you update DfM and JarCreator ?

I can then build version 3.1.2 of both - or do you want to do the build ?

Gert

dreamingsky

Quote from: Gert on 12. May 2007, 04:11:33
when we use "Arial Unicode MS" for the bitmap font of the Thai and the Hindi dictionary, then these characters are not getting displayed correctly

Yes, using Arial Unicode MS (or any other font) won't work correctly for Hindi, Thai, Khmer...  In order to correctly display "complex scripts" we would need:
1. "outline" fonts (TTF, etc)
2. something like Uniscribe.dll

It would be possible to reproduce Uniscribe.dll.  Linux uses something like Uniscribe.dll to manage complex scripts.  I'm sure we could find the source for it somewhere.  But, I think it'd be way beyond the scope of DfM to do this.  We'd have to write the code for each language (Thai would need different code from Hindi, etc).

More importantly, we'd need to use regular fonts instead of bitmap fonts.  It would be impossible (as far as I know) to display "complex scripts" using bitmap fonts.  Regular outline fonts have code for each vowel to tell it to move the vowel back over the last letter.  Bitmap fonts don't have this feature.

So, we'll just have to live with what is possible.  It won't be a huge problem if you can get used to reading it with bitmap fonts.  An example would be:
Imagine the letter "ã".  The "~" is above the "a".  Imagine the "~" is the vowel and the "a" is the consonant.  With the bitmap fonts it will look like "a~" instead of "ã".

I think any effort towards these languages should be spent building an IME (input method editor) for Thai, Hindi, and Khmer.  An IME is the way to type in a language.  So far we can't type in Thai, Hindi, Khmer, or Japanese in DfM.  We can only search English -> language2.  We cannot search language2 -> English.  I made a posting about this in the "Feature planning for version 3.2" thread.

Jeff

Tomcollins

Hi!

1) Well. For a lot of languages there's some kind of transcription to basic Latin, e.g. in Chinese that would be pinyin. And if this transcription is added to the search index, it's no problem to search for e.g. Chinese with devices which do not support Chinese characters.
Although an IME would be very nice. (I once talked to someone building a Chinese IME by himself, but he said that a big problem is getting good, complete character lists.)
2) I'm not so familiar with the languages you mentioned, but I know that there are also many character combinations added as one character in Unicode! We have the same problem with Chinese pinyin, since there are tones on the vowels. Like xue2sheng1 should be xuéshēng. We do the conversion with a languageUpdateClass so it is just one character in the dictionary and in the bitmapfontImage.
Could that work for some of the languages you mentioned too? Of course, this is just for passive use (displaying) and doesn't work for active input.
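
For readers who have not seen such a conversion, here is a rough Java sketch of turning numbered pinyin into tone marks.  It is not the actual languageUpdateClass from DfM.  The class and method names are invented, it only handles lower-case input, and it uses the usual placement rule (mark "a" or "e" if present, the "o" in "ou", otherwise the last vowel).

```java
final class PinyinToneSketch {
    private static final String VOWELS = "aeiou\u00FC"; // last entry is u with diaeresis
    // One row per vowel (same order as VOWELS), one column per tone 1..4.
    private static final String[] MARKED = {
        "\u0101\u00E1\u01CE\u00E0", // a
        "\u0113\u00E9\u011B\u00E8", // e
        "\u012B\u00ED\u01D0\u00EC", // i
        "\u014D\u00F3\u01D2\u00F2", // o
        "\u016B\u00FA\u01D4\u00F9", // u
        "\u01D6\u01D8\u01DA\u01DC", // u with diaeresis
    };

    /** Converts e.g. "xue2sheng1" by replacing each tone digit with a mark on the right vowel. */
    static String toToneMarks(String numbered) {
        StringBuilder out = new StringBuilder();
        int syllableStart = 0; // index in 'out' where the current syllable begins
        for (int i = 0; i < numbered.length(); i++) {
            char c = numbered.charAt(i);
            if (c >= '1' && c <= '5') {
                if (c != '5') { // 5 is treated as the neutral tone: no mark
                    int pos = markPosition(out, syllableStart);
                    if (pos >= 0) {
                        int vowel = VOWELS.indexOf(out.charAt(pos));
                        out.setCharAt(pos, MARKED[vowel].charAt(c - '1'));
                    }
                }
                syllableStart = out.length();
            } else {
                out.append(c);
                if (!Character.isLetter(c)) {
                    syllableStart = out.length(); // spaces and punctuation end a syllable
                }
            }
        }
        return out.toString();
    }

    /** Index of the vowel that carries the tone mark, or -1 if the syllable has no vowel. */
    private static int markPosition(CharSequence text, int from) {
        String syllable = text.subSequence(from, text.length()).toString();
        int a = syllable.indexOf('a');
        if (a >= 0) return from + a;
        int e = syllable.indexOf('e');
        if (e >= 0) return from + e;
        int ou = syllable.indexOf("ou");
        if (ou >= 0) return from + ou;
        for (int i = syllable.length() - 1; i >= 0; i--) {
            if (VOWELS.indexOf(syllable.charAt(i)) >= 0) return from + i;
        }
        return -1;
    }
}
```

Every marked vowel in the table is a single precomposed character, which is why the result still fits a one-glyph-per-character bitmap font.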

Sebastian

Tomcollins

Quote from: Gert on 12. May 2007, 04:48:18
I prefer 2) - which is what I was asking you to postpone to 3.2 only a few days ago ...

Can you update DfM and JarCreator ?

I can then build version 3.1.2 of both - or do you want to do the build ?

Gert

I can update DFM, I cannot update the JarCreator, since I don't have the time to get familiar with it.
I updated the JarCreator, so it ignores the fonts directory, but it does not yet copy it to the jar file!
So one has to copy it manually into the jar file again, until someone finds the time to update the JarCreator!
DFM expects the fonts now located at /dictionary/fonts/, like it is generated.

Sebastian

Gert

I didn't write JarCreator either ... but I will try to get the update done.

Gert

dreamingsky

Quote from: Tomcollins on 12. May 2007, 13:39:06
2) I'm not so familiar with the languages you mentioned, but I know that there are also many character combinations added as one character in Unicode! We have the same problem with Chinese pinyin, since there are tones on the vowels. Like xue2sheng1 should be xuéshēng. We do the conversion with a languageUpdateClass so it is just one character in the dictionary and in the bitmapfontImage.


Yes, pinyin with tone marks works fine in DfM with bitmap fonts.  "e2" is the same as "é".  "é" is only one Unicode code point.  So it will display fine with a bitmap font.  "Complex scripts" are not like this.  The consonant and vowel are 2 separate Unicode code points.  There are no merged "consonant + vowel" code points.  Otherwise you would need literally thousands of code points to handle all the possible combinations of consonants and vowels.  It is simpler in fonts to just make the 50 basic parts and squish them together.

Quote from: Tomcollins on 12. May 2007, 13:39:06
1) Well. For a lot of languages there's some kind of transcription to basic Latin, e.g. in Chinese that would be pinyin. And if this transcription is added to the search index, it's no problem to search for e.g. Chinese with devices which do not support Chinese characters.

Yes, pinyin is a very useful transcription for indexing the Chinese dictionary.  Luckily pinyin is included in the CEDICT (Chinese-English) and HanDeDict (Chinese-German) dictionaries.  So we can just add an index to the pinyin.  Unfortunately the Thai, Hindi, Khmer, Japanese... dictionaries do not have this roman transliteration in the dictionary (Japanese uses the hiragana script for the transcription of Chinese characters).

We could possibly add this transliteration (sorry, "transliteration" means to type a language in the roman script).  Hindi has a standard transliteration called IAST http://en.wikipedia.org/wiki/IAST.  But, adding this transliteration to the dictionary would considerably increase the file size of the DfM dictionary.

Another option would be to input the Hindi in IAST transliteration.  The user would input IAST and IAST would show up in the search box in DfM.  Then the IAST could be converted "on the fly" to the real Hindi script (called Devanagari) in a languageUpdateClass.

There are a couple problems with this.  First, there is no standard transliteration system for Thai.  We'd have to pick one and then explain it in documentation.  Then the users would have to learn the transliteration method.

Another problem is that IAST for Hindi uses several roman letters that are not common: "ṅñś", for example.  Our phones have no way of typing these letters.  We could change the letters to ".n", or "~n" or something and then write documentation telling the users how to use it.  But, this is inconvenient for the users too.
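
As a rough idea of what such a substitution could look like in Java: the ".n" and "~n" escapes are the ones suggested above, while "'s" for "ś" is just something I made up for this sketch.  None of this exists in DfM, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IastEscapeSketch {
    // ASCII sequences that can be typed on a phone, mapped to the IAST letters they stand for.
    private static final Map<String, String> ESCAPES = new LinkedHashMap<>();
    static {
        ESCAPES.put(".n", "\u1E45"); // n with dot above
        ESCAPES.put("~n", "\u00F1"); // n with tilde
        ESCAPES.put("'s", "\u015B"); // s with acute (escape invented for this sketch)
    }

    /** Replaces every known escape sequence in the typed search term. */
    static String expand(String typed) {
        String result = typed;
        for (Map.Entry<String, String> escape : ESCAPES.entrySet()) {
            result = result.replace(escape.getKey(), escape.getValue());
        }
        return result;
    }
}
```

The users would still have to learn the escapes from the documentation, so this does not remove the inconvenience, it only shows that the conversion itself is small.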

The best solution would just be to write an IME.  Then you would type directly in Hindi or Thai.


So far I think we need IMEs for the following languages:
Thai
Hindi (India)
Khmer (Cambodia)
Japanese
Russian
Arabic

Does anyone know of any others?

So far we cannot search in these languages in DfM.  We can only search English->language2 or German->language2, etc.  We cannot search language2->English.

If you had a Thai phone, then it would be no problem.  You could use the default IME from the phone to type in Thai.  But what if you had an English phone?  You'd have no way of typing in Thai.

Only now could we start to worry about IMEs.  We had to have bitmap font support before we could do IMEs.

We could also build an IME for Chinese pinyin too (not Chinese characters).  Personally I think "xuéshēng" is easier to read than "xue2sheng1".  Also, typing "xue2sheng1" must be difficult.  You probably type "e" then press a button to change to "number mode".  Then you press "2".  Then you press the button again to go to "letter mode".  Or using the old style, you press "33" to get an "e" then press "2222" to get the "2".

Wouldn't it be nice to have a separate key for tone marks, say "#" or "1", for example?  So then you press "33" to get an "e".  Then you press "#" and the screen will change from "e#" to "ē" (1st tone).  Press "#" again to get "é" (2nd tone).
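
Here is a small Java sketch of that tone key, just to show how little logic it needs.  The names are invented and it is not part of DfM.  It assumes the key acts on the last character in the search box and that a fifth press goes back to the plain vowel.

```java
final class ToneKeySketch {
    // Each row cycles: plain vowel, then tones 1 to 4. After tone 4 it wraps to the plain vowel.
    private static final String[] CYCLES = {
        "a\u0101\u00E1\u01CE\u00E0",
        "e\u0113\u00E9\u011B\u00E8",
        "i\u012B\u00ED\u01D0\u00EC",
        "o\u014D\u00F3\u01D2\u00F2",
        "u\u016B\u00FA\u01D4\u00F9",
    };

    /** Returns the text as it looks after one press of the tone key. */
    static String pressToneKey(String text) {
        if (text.isEmpty()) {
            return text;
        }
        char last = text.charAt(text.length() - 1);
        for (String cycle : CYCLES) {
            int index = cycle.indexOf(last);
            if (index >= 0) {
                char next = cycle.charAt((index + 1) % cycle.length());
                return text.substring(0, text.length() - 1) + next;
            }
        }
        return text; // last character is not a vowel: ignore the key press
    }
}
```

Starting from "xue", one press gives the 1st tone on the "e" and a second press gives the 2nd tone, as described above.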

If you'd like a Chinese pinyin IME for your PC, then you can download Keyman.  It is the default program for building your own IMEs on a PC.  First go here http://www.tavultesoft.com/70/download.php and download "Keyman Desktop Professional 7.0".  Then go here http://www.tavultesoft.com/keyman/downloads/keyboards/details.php?KeyboardID=346&FromKeyman=0 and download the pinyin IME.

Quote from: Tomcollins on 12. May 2007, 13:39:06
Although an IME would be very nice. (I once talked to someone building a Chinese IME by himself, but he said that a big problem is getting good, complete character lists.)

You're right.  Building an IME for Chinese characters would be difficult.  Finding the character lists wouldn't be too difficult.  You could just use the "Unihan" database from the Unicode website http://www.unicode.org/charts/unihan.html.  It has a database of all the Chinese characters in Unicode.  But, then you'd have to find one of the "frequency" lists that sorts the characters based on how common they are.  Big5: http://technology.chtsai.org/charfreq/, GB: http://lingua.mtsu.edu/chinese-computing/statistics/.

For this you'd need to have a database of at least 5,000 characters (50,000 if you want to be more complete).  That database would probably be at least 300kB for DfM.  That's too much.

A Chinese pinyin IME would be much simpler than an IME for Chinese characters.  We would just rebuild an IME for English and add the button for the tone marks.

Personally I think the best way to start with the IMEs is to build a "European" IME.  This is an IME that has all the letters for European languages: English, German, French, etc.  We'd start with the basic Latin (roman) alphabet.  Then we'd add "äöüß" for German, "éè", etc for French.  We could do it like really old phones like:
3   d
33   e
333   f
3333   3
33333   é
333333   è
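
The table above translates almost directly into code.  This is a minimal Java sketch with made-up names, not an existing DfM class.  Only keys 3, 4 and 8 are filled in, and a space in the input stands for the pause you need between two letters on the same key.

```java
import java.util.HashMap;
import java.util.Map;

final class MultiTapSketch {
    private static final Map<Character, String> KEYS = new HashMap<>();
    static {
        KEYS.put('3', "def3\u00E9\u00E8"); // d, e, f, 3, e acute, e grave (the table above)
        KEYS.put('4', "ghi4");
        KEYS.put('8', "tuv8");
    }

    /** Decodes a sequence of key presses; repeated presses of one key step through its letters. */
    static String decode(String presses) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < presses.length()) {
            char key = presses.charAt(i);
            int run = 1; // how many times in a row this key was pressed
            while (i + run < presses.length() && presses.charAt(i + run) == key) {
                run++;
            }
            String letters = KEYS.get(key);
            if (letters != null) { // spaces (pauses) and unknown keys produce nothing
                out.append(letters.charAt((run - 1) % letters.length()));
            }
            i += run;
        }
        return out.toString();
    }
}
```

Adapting it to another language would then mostly mean swapping the strings in the key table.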

I think it'd be better to do it differently and add a separate "diacritic" button ("#", for example).  Then you'd press "33" to get an "e".  Then you'd press "##" to get the "é", for example.

Of course we don't actually need this IME for DfM.  Normation classes can change all "ö" to "oe", for example.  Also, nobody will want to switch from T9 input http://en.wikipedia.org/wiki/Predictive_text (just press "843" for "the") back to the old "multi-tap" method (press "84433").  If you have a German phone, then using the default T9 German IME would be much simpler.
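
For comparison, the predictive idea can be sketched in Java like this.  It is only an illustration of the principle and not how any phone's T9 is really implemented.  The class name, method names, and the tiny word list in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PredictiveTextSketch {
    // Index = digit on the keypad, value = letters printed on that key.
    private static final String[] KEY_LETTERS = {
        "", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"
    };

    /** One digit per letter, e.g. "the" becomes "843". Characters found on no key are skipped. */
    static String toKeys(String word) {
        StringBuilder keys = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            for (int digit = 2; digit <= 9; digit++) {
                if (KEY_LETTERS[digit].indexOf(c) >= 0) {
                    keys.append(digit);
                    break;
                }
            }
        }
        return keys.toString();
    }

    /** All words from the list that are typed with exactly this digit sequence. */
    static List<String> candidates(String keys, List<String> wordList) {
        List<String> matches = new ArrayList<>();
        for (String word : wordList) {
            if (toKeys(word).equals(keys)) {
                matches.add(word);
            }
        }
        return matches;
    }
}
```

With a word list of "the", "tie" and "dog", the keys "843" match both "the" and "tie", so a predictive IME also needs a word list and some way to pick between candidates.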

But, it'd be much easier to work on the Java code using the roman alphabet.  It'd be difficult to start with the Thai IME for example and try to explain what each of the letters mean.

Once the European IME is finished then it would be very easy to adapt it to other languages.  For example, for Hindi we would just paste these letters into the European code:
3   ka   *
33   ki
333   ku
3333   ke
33333   ko
333333   kau

(* imagine these letters are actually written in Hindi.  If I actually wrote them in Hindi then they'd probably just show up as white boxes since you probably don't have a Hindi font on your computer)

First we should just build a simple "multi-tap" http://en.wikipedia.org/wiki/Multi-tap IME (press "84433" for "the").  We already have the keystroke layouts for Thai and Japanese ready for this.

Then once that is finished we can start on working on a "true IME" (I'm not sure what to call it) based on the multi-tap IME.  We'll worry about this later.  If anyone wants to learn how to build a "true IME", then take a look at the Keyman program.  Go here http://www.tavultesoft.com/70/download.php and download "Keyman Developer 7.0".  Then look at the documentation.  The PDF file is easier to read than the program help file.  Download it here http://www.tavultesoft.com/keymandev/downloads/.  Then look in the manual for the "Quick French keyboard".  This will give some help in building the "European IME" for DfM.

I think we should base our IME code on Keyman.  It is the default scripting language for building your own IME (I'm actually using Keyman now to build an IME for another project).  There is an open source program for Linux called "Keyboard Mapping for Linux" http://kmfl.sourceforge.net/ that uses the same scripting language as Keyman.

Wow, that turned out long.  I'm done for now.
Jeff

Tomcollins

Well, there's a lot to tell about Chinese, but as for pinyin input: I personally mostly (over 90%) just use 'xuesheng' as input (without tones). Often you don't even know the tones, and there are not so many combinations. For the HanDeDict-DFM I actually just added the pinyin without tones to the search index, since in my experience it's the most useful.
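
Building such a toneless search key from numbered pinyin is a one-liner.  Here is a hedged Java sketch with an invented class name.  It assumes the pinyin in the source dictionary carries tone digits 1 to 5, and it is not the actual HanDeDict-DFM code.

```java
import java.util.Locale;

final class PinyinIndexSketch {
    /** Drops tone digits and whitespace and lower-cases the rest, e.g. "Xue2 sheng1" becomes "xuesheng". */
    static String searchKey(String numberedPinyin) {
        return numberedPinyin.replaceAll("[1-5\\s]", "").toLowerCase(Locale.ROOT);
    }
}
```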
A real advantage would be an IME for Chinese characters, but this seems to be a lot of work.
200kB: HanDeDict is 7MB and works well on (several) phones. So that would be OK!

Thanks also for all the other information on IME and the languages!
Seems, there's still enough to do!

Sebastian

Gert

I did update JarCreator so that now all files in the dictionary directory are included recursively. That solves the problem that Jeff reported above.
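
For anyone curious what "recursively" means here, the following is a minimal Java sketch of the idea.  It is not the actual JarCreator source: it uses java.nio.file from current Java SE, writes no manifest, and the class and method names are made up.  The important point is that directories such as "fonts" are only walked, never opened as a file, since opening a directory with FileInputStream is what throws the FileNotFoundException reported above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RecursiveJarSketch {
    /** Writes every regular file below sourceDir into jarFile, keeping the relative paths. */
    static void writeJar(Path sourceDir, Path jarFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            // Directories are traversed by the walk but filtered out of the list.
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream raw = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(raw)) {
            for (Path file : files) {
                jar.putNextEntry(new JarEntry(entryName(sourceDir, file)));
                Files.copy(file, jar);
                jar.closeEntry();
            }
        }
    }

    /** JAR entry names use '/' as the separator on every platform, including Windows. */
    private static String entryName(Path sourceDir, Path file) {
        StringBuilder name = new StringBuilder();
        for (Path part : sourceDir.relativize(file)) {
            if (name.length() > 0) {
                name.append('/');
            }
            name.append(part);
        }
        return name.toString();
    }
}
```

Calling writeJar on a folder that contains dictionary/fonts/ would put the font files into the JAR under dictionary/fonts/, which is where DfM now expects them.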

I also did build the new 3.1.2 versions. You can download and test them here:
http://www.kugihan.de/dict/download/test_versions/3.1.2/DictionaryForMIDs_3.1.2_empty.zip and
http://www.kugihan.de/dict/download/test_versions/3.1.2/DictionaryForMIDs_JarCreator_3.1.2.zip

I have not yet had the time to upload these files to sourceforge. Maybe you could just download the files from this location and do a test ?

Also, could you provide me with a 3.1.2 dictionary that I also could upload for the testers ?

Thanks,
Gert

Gert

I just uploaded the 3.1.2 files to sourceforge in the test section.

If someone can provide me a dictionary that was built with 3.1.2, then I will upload that one also and put it in the test section.

I would like to declare 3.1.2 as the new official version very soon.

Gert