DictionaryForMids Forum

Admins & Developer => General discussions => Topic started by: Gert on 09. May 2007, 20:28:47

Title: Release of 3.1.1 for testing
Post by: Gert on 09. May 2007, 20:28:47
Version 3.1.1 is now online for testing: http://dictionarymid.sourceforge.net/development.html#testversion

If no serious problems show up, this version will shortly be released as the 'first official 3.1 version'.

Please post your testing feedback in this forum.

Thanks !
Gert
Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 11. May 2007, 07:53:43
Here is my testing with version 3.1.1.

I. JARCreator bug

I tested the 3.1.1 Bitmap Font Generator and the 3.1.0 DictionaryGeneration file.  I saw there is a new JARCreator file too.  That's really great.  It saves me a lot of time.

But, I ran into an error.  I can't get JARCreator to build the JAR with the font files.  I get this error from JARCreator:

Exception in thread "main" java.io.FileNotFoundException: C:\Temp\Dict\Thai_NIU\
dictionary\fonts (Access is denied)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(Unknown Source)
        at de.kugihan.jarCreator.JarCreator.writeJAR(JarCreator.java:178)
        at de.kugihan.jarCreator.JarCreator.main(JarCreator.java:86)

I got the same problem in the Thai, Khmer (Cambodia), and Hindi (India) dictionaries.

If I don't run the Bitmap Font Generator, then the JARCreator will finish with no problems.  I am using version 3.1.1 of DictionaryForMIDs.jar and DictionaryForMIDs.jad.  I am using version 3.1.0 of JARCreator.  Do I need version 3.1.1 of JARCreator?  I didn't see it on the website.

I really like that you can choose multiple font sizes with the Bitmap Font Generator.  That's a big plus.  I thought the fix would only allow you to choose 1 size.  But, you can choose multiple sizes and put them in the JAR file.  That is very nice.  And the .png files are a big plus too.


II. Khmer font bug
The Bitmap Font Generator works well.  I ran into a problem with some Khmer fonts, though (KhmerOS and KhmerOT).  The "KhmerOS System" font worked OK.  The other 2 fonts got cut off at the bottom.  Only the top 15% of KhmerOS showed up.  Nothing showed up for KhmerOT (OT = OpenType).  It's just a white line.

I'm not sure if someone wants to bug fix it.  It's probably not worth it since the "KhmerOS System" font works fine.  The Khmer fonts can be downloaded from here:
http://www.khmeros.info/drupal/


III. complex scripts
Also, there is another bug (technically a feature).  But, there is no way to fix it.  Hindi, Thai, and Khmer are "complex scripts".  Fonts for these languages are a little tricky to make.

These languages (actually scripts for most languages in South Asia and South-East Asia) have separate consonant and vowel marks.  So to type "ke" in Hindi, you type a "k", then the "k consonant" shows up.  Then you type "e" and the "e" shows up.  But, the "e" is not a separate letter.  It is put on top of the "k".

So inside the font are directions for how to move the vowels over the consonants.  But, when you convert a font to a bitmap font, then you lose the instructions for the movement.

So when DfM displays the bitmap fonts, the consonants and vowels are divided up into 2 parts.  There is no way to avoid this.  Thai phones have TTF fonts or something that can handle the vowel marks correctly.

Also, Windows uses the Uniscribe DLL (http://en.wikipedia.org/wiki/Uniscribe) to make some even more complicated font adjustments.  For example, in Hindi, an "ii" (long "i") is put to the right of the consonant.  But an "i" (short "i") goes to the left of the consonant.  The Uniscribe DLL moves the "i" to the left.

Anyway, that is a lot of techno talk.  Basically I'm saying that the Hindi, Thai, and Khmer dictionaries will look funny with the bitmap fonts.  I doubt anyone would fault us for it, though.  If someone does complain, then we can only tell them that we can't fix it.  The problem is due to limits in the architecture.

Jeff
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 11. May 2007, 18:31:13
My comments:

Your point I:
I am not sure what is going wrong there. I hope Sebastian can tell quickly and will know what to do about this. Sebastian was proposing that JarCreator ignore all font files (and I was pressing him to postpone this to 3.2 :( ). Given your problem, maybe we should still implement this in 3.1.

Your point III:
Simple question to the knowledgeable person: nowadays, where we have that almighty Unicode charset, I'd expect that a good Unicode font also covers Hindi, Thai and Khmer, with all the features that you mentioned, right ? Well, maybe not in all cases.

What I mean is, other languages also have the situation where special 'character parts' are appended to a character, such as the accents in many Romance languages, the umlauts in German etc. While many of these characters take two key strokes to type, in the end there is one Unicode character that covers the result. Is this different for Thai etc ?

Gert

Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 12. May 2007, 03:26:24
Yes, many fonts can support a large number of languages.  "Arial Unicode MS" can probably support 30 languages or more (not Khmer though, so far I've only found 2 Unicode Khmer font makers on the internet).  Currently the Thai dictionaries and the Hindi dictionary are using Arial Unicode MS for the bitmap fonts.

The "complex scripts" work differently than the Roman language examples.  There are actually 2 Unicode code points that get entered when typing.  For the roman example - if you type a German "u umlaut", then press the backspace button once, then the whole letter is deleted.

For the "complex scripts", if you press the backspace button, then only the vowel is deleted.  The consonant will still be there.  For example - in Hindi, if you type "k" then the "k" will show up.  Then you press the "e" and the "e" will show up over the previous "k".  Now if you press the backspace button then only the "e" is deleted.  The "k" and "e" are 2 different Unicode code points.
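
The difference can be checked with a few lines of desktop Java. This is only a sketch: java.text.Normalizer is part of Java SE and is not assumed to exist on the phones DfM targets, and the class name is made up for illustration. It composes each string (Unicode NFC) and counts what is left: "u" plus a combining umlaut merges into one character, while Hindi "k" plus the vowel sign "e" stays as two.

```java
import java.text.Normalizer;

final class CombiningMarksSketch {
    /** Number of Unicode code points left after composing the text (NFC). */
    static int composedLength(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER U + COMBINING DIAERESIS: composes into U+00FC.
        System.out.println(composedLength("u\u0308"));
        // DEVANAGARI LETTER KA + DEVANAGARI VOWEL SIGN E: no merged form exists.
        System.out.println(composedLength("\u0915\u0947"));
    }
}
```

A bitmap font can therefore hold one image for the composed umlaut, but for the Hindi pair it only ever sees two separate characters to draw.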
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 12. May 2007, 04:11:33
Ok, I now understand the different handling between e.g. accented Roman characters and languages such as Hindi.

So, just for slow learners like me: when we use "Arial Unicode MS" for the bitmap font of the Thai and the Hindi dictionary, then these characters are not getting displayed correctly, for the reasons that you explained. Did I understand this correctly ?

So we would need to have something like uniscribe.dll for DfM to solve this problem, right ?

Gert
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 12. May 2007, 04:35:59
Hi!

Yeah! There's a problem with the jarCreator. It expects only two kinds of files in the dictionary directory (font.bmf or .csv), but no folders like the new "/fonts/". So we have to update the jarCreator. Sorry! There are two options for the fonts folder:
1) Copy it into a separate folder like before, or
2) Ignore it and leave it in the dictionary directory (which would mean we have to update the DFM too)?

What do you prefer, Gert?

Sebastian

Title: Re: Release of 3.1.1 for testing
Post by: Gert on 12. May 2007, 04:48:18
I prefer 2) - which is what I was asking you to postpone to 3.2 only a few days ago ...

Can you update DfM and JarCreator ?

I can then build version 3.1.2 of both - or do you want to do the build ?

Gert
Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 12. May 2007, 13:06:29
Quote from: Gert on 12. May 2007, 04:11:33
when we use "Arial Unicode MS" for the bitmap font of the Thai and the Hindi dictionary, then these characters are not getting displayed correctly

Yes, using Arial Unicode MS (or any other font) won't work correctly for Hindi, Thai, Khmer...  In order to correctly display "complex scripts" we would need:
1. "outline" fonts (TTF, etc)
2. something like Uniscribe.dll

It would be possible to reproduce Uniscribe.dll.  Linux uses something like Uniscribe.dll to manage complex scripts.  I'm sure we could find the source for it somewhere.  But, I think it'd be way beyond the scope of DfM to do this.  We'd have to write the code for each language (Thai would need different code from Hindi, etc).

More importantly, we'd need to use regular fonts instead of bitmap fonts.  It would be impossible (as far as I know) to display "complex scripts" using bitmap fonts.  Regular outline fonts have code for each vowel to tell it to move the vowel back over the last letter.  Bitmap fonts don't have this feature.

So, we'll just have to live with what is possible.  It won't be a huge problem if you can get used to reading it with bitmap fonts.  An example would be:
Imagine the letter "ã".  The "~" is above the "a".  Imagine the "~" is the vowel and the "a" is the consonant.  With the bitmap fonts it will look like "a~" instead of "ã".

I think any effort towards these languages should be spent building an IME (input method editor) for Thai, Hindi, and Khmer.  An IME is the way to type in a language.  So far we can't type in Thai, Hindi, Khmer, or Japanese in DfM.  We can only search English -> language2.  We cannot search language2 -> English.  I made a posting about this in the "Feature planning for version 3.2" thread.

Jeff
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 12. May 2007, 13:39:06
Hi!

1) Well. For a lot of languages there's some kind of transcription to basic Latin, e.g. for Chinese that would be pinyin. If this transcription is added to the search index, it's no problem to search for e.g. Chinese on devices which do not support Chinese characters.
Although an IME would be very nice. (I once talked to someone building a Chinese IME by himself, but he said that a big problem is getting good, complete character lists.)
2) I'm not so familiar with the languages you mentioned, but I know that many character combinations are also encoded as one character in Unicode! We have the same problem with the Chinese pinyin, since there are tones on the vowels. For example, xue2sheng1 should be xuéshēng. We do the conversion with a languageUpdateClass, so it is just one character in the dictionary and in the bitmapfontImage.
Could that work for some of the languages you mentioned too? Of course, this only covers passive use (display) and doesn't work for active input.
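
A rough sketch of such a conversion from numbered pinyin to tone marks is below. It is not the actual languageUpdateClass from DfM; the class and method names are invented for illustration, only lowercase input is handled, and it is written in plain Java without regard to what the phone runtime supports. The placement rule assumed here is the usual one: the mark goes on "a" or "e" if present, on the "o" of "ou", and otherwise on the last vowel.

```java
final class PinyinToneSketch {
    // Plain vowels; index 5 is u-umlaut (U+00FC).
    private static final String VOWELS = "aeiou\u00fc";
    // One row per vowel above, one column per tone 1-4.
    private static final String[] MARKED = {
        "\u0101\u00e1\u01ce\u00e0", // a
        "\u0113\u00e9\u011b\u00e8", // e
        "\u012b\u00ed\u01d0\u00ec", // i
        "\u014d\u00f3\u01d2\u00f2", // o
        "\u016b\u00fa\u01d4\u00f9", // u
        "\u01d6\u01d8\u01da\u01dc"  // u-umlaut
    };

    /** Converts numbered pinyin such as "xue2sheng1" to pinyin with tone marks. */
    static String convert(String numbered) {
        StringBuilder out = new StringBuilder();
        StringBuilder syllable = new StringBuilder();
        for (int i = 0; i < numbered.length(); i++) {
            char c = numbered.charAt(i);
            if (c >= '1' && c <= '5' && syllable.length() > 0) {
                out.append(mark(syllable.toString(), c - '0'));
                syllable.setLength(0);
            } else if (Character.isLetter(c)) {
                syllable.append(c);
            } else {
                out.append(syllable).append(c); // spaces, punctuation, stray digits
                syllable.setLength(0);
            }
        }
        return out.append(syllable).toString();
    }

    private static String mark(String s, int tone) {
        if (tone == 5) return s; // neutral tone carries no mark
        int pos = s.indexOf('a');
        if (pos < 0) pos = s.indexOf('e');
        if (pos < 0) pos = s.indexOf("ou"); // index of the "o"
        for (int i = s.length() - 1; pos < 0 && i >= 0; i--) {
            if (VOWELS.indexOf(s.charAt(i)) >= 0) pos = i; // last vowel
        }
        if (pos < 0) return s; // no vowel found, leave unchanged
        char marked = MARKED[VOWELS.indexOf(s.charAt(pos))].charAt(tone - 1);
        return s.substring(0, pos) + marked + s.substring(pos + 1);
    }
}
```

Each marked vowel is a single Unicode character, which is why the result can be drawn from a bitmap font without any repositioning.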

Sebastian
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 12. May 2007, 14:38:32
Quote from: Gert on 12. May 2007, 04:48:18
I prefer 2) - which is what I was asking you to postpone to 3.2 only a few days ago ...

Can you update DfM and JarCreator ?

I can then build version 3.1.2 of both - or do you want to do the build ?

Gert

I can update DFM, I cannot update the JarCreator, since I don't have the time to get familiar with it.
I updated the JarCreator, so it ignores the fonts directory, but it does not yet copy it to the jar file!
So one has to copy it manually into the jar file again, until someone finds the time to update the JarCreator!
DFM expects the fonts now located at /dictionary/fonts/, like it is generated.

Sebastian
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 12. May 2007, 15:49:17
I didn't write JarCreator either ... but I will try to get the update done.

Gert
Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 13. May 2007, 04:53:03
Quote from: Tomcollins on 12. May 2007, 13:39:06
2) I'm not so familiar with the languages you mentioned, but I know that many character combinations are also encoded as one character in Unicode! We have the same problem with the Chinese pinyin, since there are tones on the vowels. For example, xue2sheng1 should be xuéshēng. We do the conversion with a languageUpdateClass, so it is just one character in the dictionary and in the bitmapfontImage.


Yes, pinyin with tone marks works fine in DfM with bitmap fonts.  "e2" is the same as "é".  "é" is only one Unicode code point.  So it will display fine with a bitmap font.  "Complex scripts" are not like this.  The consonant and vowel are 2 separate Unicode code points.  There are no merged "consonant + vowel" code points.  Otherwise you would need literally thousands of Unicode code points to handle all the possible combinations of consonants and vowels.  It is simpler in fonts to just make the 50 basic parts and squish them together.

Quote from: Tomcollins on 12. May 2007, 13:39:06
1) Well. For a lot of languages there's some kind of transcription to basic Latin, e.g. for Chinese that would be pinyin. If this transcription is added to the search index, it's no problem to search for e.g. Chinese on devices which do not support Chinese characters.

Yes, pinyin is a very useful transcription for indexing the Chinese dictionary.  Luckily pinyin is included in the CEDICT (Chinese-English) and HanDeDict (Chinese-German) dictionaries.  So we can just add an index to the pinyin.  Unfortunately the Thai, Hindi, Khmer, Japanese... dictionaries do not have this roman transliteration in the dictionary (Japanese uses the hiragana script for the transcription of Chinese characters).

We could possibly add this transliteration (sorry, "transliteration" means to type a language in the roman script).  Hindi has a standard transliteration called IAST http://en.wikipedia.org/wiki/IAST.  But, adding this transliteration to the dictionary would considerably increase the file size of the DfM dictionary.

Another option would be to input the Hindi in IAST transliteration.  The user would input IAST and IAST would show up in the search box in DfM.  Then the IAST could be converted "on the fly" to the real Hindi script (called Devanagari) in a languageUpdateClass.

There are a couple problems with this.  First, there is no standard transliteration system for Thai.  We'd have to pick one and then explain it in documentation.  Then the users would have to learn the transliteration method.

Another problem is that IAST for Hindi uses several roman letters that are not common: "ṅñś", for example.  Our phones have no way of typing these letters.  We could change the letters to ".n", or "~n" or something and then write documentation telling the users how to use it.  But, this is inconvenient for the users too.

The best solution would just be to write an IME.  Then you would type directly in Hindi or Thai.


So far I think we need IMEs for the following languages:
Thai
Hindi (India)
Khmer (Cambodia)
Japanese
Russian
Arabic

Does anyone know of any others?

So far we cannot search in these languages in DfM.  We can only search English->language2 or German->language2, etc.  We cannot search language2->English.

If you had a Thai phone, then it would be no problem.  You could use the default IME from the phone to type in Thai.  But what if you had an English phone?  You'd have no way of typing in Thai.

Only now could we start to worry about IMEs.  We had to have bitmap font support before we could do IMEs.

We could also build an IME for Chinese pinyin too (not Chinese characters).  Personally I think "xuéshēng" is easier to read than "xue2sheng1".  Also, typing "xue2sheng1" must be difficult.  You probably type "e" then press a button to change to "number mode".  Then you press "2".  Then you press the button again to go to "letter mode".  Or using the old style, you press "33" to get an "e" then press "2222" to get the "2".

Wouldn't it be nice to have a separate key for tone marks, say "#" or "1", for example.  So then you press "33" to get an "e".  Then you press "#" and the screen will change from "e#" to "ē" (1st tone).  Press "#" again to get "é" (2nd tone).
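
Here is a minimal sketch of how that tone key could behave, assuming the key simply steps the last typed vowel through its tones and wraps back to the plain letter after the 4th tone (the wrap-around and all names are my own assumptions, nothing like this exists in DfM yet):

```java
final class ToneKeySketch {
    // Each row: plain vowel followed by tones 1-4.
    private static final String[] CYCLES = {
        "a\u0101\u00e1\u01ce\u00e0",
        "e\u0113\u00e9\u011b\u00e8",
        "i\u012b\u00ed\u01d0\u00ec",
        "o\u014d\u00f3\u01d2\u00f2",
        "u\u016b\u00fa\u01d4\u00f9"
    };

    /** Returns the text as it should look after one press of the tone key. */
    static String pressToneKey(String text) {
        if (text.length() == 0) return text;
        int last = text.length() - 1;
        char c = text.charAt(last);
        for (String cycle : CYCLES) {
            int i = cycle.indexOf(c);
            if (i >= 0) {
                // Step to the next tone, wrapping back to the plain vowel.
                char next = cycle.charAt((i + 1) % cycle.length());
                return text.substring(0, last) + next;
            }
        }
        return text; // the last character cannot carry a tone mark
    }
}
```

So after typing "xue", one press gives "xuē" and a second press gives "xué", without ever switching to number mode.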

If you'd like a Chinese pinyin IME for your PC, then you can download Keyman.  It is the default program for building your own IMEs on a PC.  First go here http://www.tavultesoft.com/70/download.php and download "Keyman Desktop Professional 7.0".  Then go here http://www.tavultesoft.com/keyman/downloads/keyboards/details.php?KeyboardID=346&FromKeyman=0 and download the pinyin IME.

Quote from: Tomcollins on 12. May 2007, 13:39:06
Although an IME would be very nice. (I once talked to someone building a Chinese IME by himself, but he said that a big problem is getting good, complete character lists.)

You're right.  Building an IME for Chinese characters would be difficult.  Finding the Character lists wouldn't be too difficult.  You could just use the "Unihan" database from the Unicode website http://www.unicode.org/charts/unihan.html.  It has a database of all the characters in Unicode.  But, then you'd have to find one of the "frequency" lists that sorts the characters based on how common they are.  Big5: http://technology.chtsai.org/charfreq/, GB: http://lingua.mtsu.edu/chinese-computing/statistics/.

For this you'd need to have a database of at least 5,000 characters (50,000 if you want to be more complete).  That database would probably be at least 300kB for DfM.  That's too much.

A Chinese pinyin IME would be much simpler than an IME for Chinese characters.  We would just rebuild an IME for English and add the button for the tone marks.

Personally I think the best way to start with the IMEs is to build a "European" IME.  This is an IME that has all the letters for European languages: English, German, French, etc.  We'd start with the basic Latin (roman) alphabet.  Then we'd add "äöüß" for German, "éè", etc for French.  We could do it like really old phones like:
3   d
33   e
333   f
3333   3
33333   é
333333   è
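
As a sketch of how this table could be decoded: the code below counts how many times in a row a key was pressed and picks that letter from the key's list. Key 3 follows the table above; the other keys use the standard phone keypad letters, which is my assumption. A space stands in for the pause between two letters on the same key, and all names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class MultiTapSketch {
    private static final Map<Character, String> KEYS = new HashMap<Character, String>();
    static {
        KEYS.put('2', "abc2");
        KEYS.put('3', "def3\u00e9\u00e8"); // d e f 3, then e-acute and e-grave
        KEYS.put('4', "ghi4");
        KEYS.put('5', "jkl5");
        KEYS.put('6', "mno6");
        KEYS.put('7', "pqrs7");
        KEYS.put('8', "tuv8");
        KEYS.put('9', "wxyz9");
    }

    /** Decodes key presses; a space separates two letters typed on the same key. */
    static String decode(String presses) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < presses.length()) {
            char key = presses.charAt(i);
            int run = 1; // how many times this key was pressed in a row
            while (i + run < presses.length() && presses.charAt(i + run) == key) {
                run++;
            }
            String letters = KEYS.get(key);
            if (letters != null) { // unknown keys, including the space, emit nothing
                out.append(letters.charAt((run - 1) % letters.length()));
            }
            i += run;
        }
        return out.toString();
    }
}
```

With this layout decode("84433") gives "the", and pressing 3 five times gives "é".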

I think it'd be better to do it differently and add a separate "diacritic" button ("#", for example).  Then you'd press "33" to get an "e".  Then you'd press "##" to get the "é", for example.

Of course we don't actually need this IME for DfM.  Normation classes can change all "ö" to "oe", for example.  Also, nobody will want to switch from using T9 input http://en.wikipedia.org/wiki/Predictive_text to the old "multi-tap" method (with T9 you just press "843" for "the", instead of "84433").  If you have a German phone, then using the default T9 German IME would be much simpler.

But, it'd be much easier to work on the Java code using the roman alphabet.  It'd be difficult to start with the Thai IME for example and try to explain what each of the letters means.

Once the European IME is finished then it would be very easy to adapt it to other languages.  For example, for Hindi we would just paste these letters into the European code:
3   ka   *
33   ki
333   ku
3333   ke
33333   ko
333333   kau

(* imagine these letters are actually written in Hindi.  If I actually wrote them in Hindi then they'd probably just show up as white boxes since you probably don't have a Hindi font on your computer)

First we should just build a simple "multi-tap" http://en.wikipedia.org/wiki/Multi-tap IME (press "84433" for "the").  We already have the keystroke layouts for Thai and Japanese ready for this.

Then once that is finished we can start working on a "true IME" (I'm not sure what to call it) based on the multi-tap IME.  We'll worry about this later.  If anyone wants to learn how to build a "true IME", then take a look at the Keyman program.  Go here http://www.tavultesoft.com/70/download.php and download "Keyman Developer 7.0".  Then look at the documentation.  The PDF file is easier to read than the program help file.  Download it here http://www.tavultesoft.com/keymandev/downloads/.  Then look in the manual for the "Quick French keyboard".  This will give some help in building the "European IME" for DfM.

I think we should base our IME code on Keyman.  It is the default scripting language for building your own IME (I'm actually using Keyman now to build an IME for another project).  There is an open source program for Linux called "Keyboard Mapping for Linux" http://kmfl.sourceforge.net/ that uses the same scripting language as Keyman.

Wow, that turned out long.  I'm done for now.
Jeff
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 13. May 2007, 12:13:59
Well, there's a lot to tell about Chinese, but as for pinyin input: I personally mostly (over 90%) just use 'xuesheng' as input (without tones). Often you don't even know the tones, and there are not so many combinations. For the HanDeDict-DFM I actually just added the pinyin without tones to the search index, since in my experience it's the most useful.
A real advantage would be an IME for Chinese characters, but this seems to be a lot of work.
200kB: HanDeDict is 7MB and works well on (several) phones. So that would be ok!

Thanks also for all the other information on IME and the languages!
Seems, there's still enough to do!

Sebastian
Title: 3.1.2: Release for testing
Post by: Gert on 13. May 2007, 18:30:50
I did update JarCreator so that now all files in the dictionary directory are included recursively. That solves the problem that Jeff reported above.
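
For anyone reading the stack trace from Jeff's post: the FileNotFoundException came from opening the "fonts" directory with a FileInputStream as if it were a file. A recursive copy avoids that by descending into directories instead of opening them. The sketch below only illustrates the idea with java.util.jar; it is not the actual JarCreator source, and the class and method names are invented.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

final class RecursiveJarSketch {
    /** Adds a file, or every file below a directory, under the given entry name. */
    static void add(File source, String entryName, JarOutputStream jar) throws IOException {
        if (source.isDirectory()) {
            File[] children = source.listFiles();
            if (children == null) {
                throw new IOException("Cannot list " + source);
            }
            for (File child : children) {
                // JAR entry names always use "/" as the separator.
                add(child, entryName + "/" + child.getName(), jar);
            }
            return; // a directory is never opened as a stream
        }
        jar.putNextEntry(new JarEntry(entryName));
        InputStream in = new FileInputStream(source);
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                jar.write(buffer, 0, n);
            }
        } finally {
            in.close();
        }
        jar.closeEntry();
    }
}
```

Called as add(dictionaryDir, "dictionary", jar), this would put the generated fonts at dictionary/fonts/ inside the JAR, which is where DfM now expects them.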

I also did build the new 3.1.2 versions. You can download and test them here:
http://www.kugihan.de/dict/download/test_versions/3.1.2/DictionaryForMIDs_3.1.2_empty.zip and
http://www.kugihan.de/dict/download/test_versions/3.1.2/DictionaryForMIDs_JarCreator_3.1.2.zip

I have not yet had the time to upload these files to sourceforge. Maybe you could just download the files from this location and do a test ?

Also, could you provide me with a 3.1.2 dictionary that I also could upload for the testers ?

Thanks,
Gert
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 17. May 2007, 17:55:59
I just uploaded the 3.1.2 files to sourceforge in the test section.

If someone can provide me with a dictionary that was built with 3.1.2, then I will upload that one also and put it in the test section.

I would like to declare 3.1.2 as the new official version very soon.

Gert
Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 18. May 2007, 10:31:00
I uploaded the "Thai NIU" in version 3.1.2beta1 to Sourceforge.  It is in the "dictionary ThaEng (NIU), 3.1.2" directory.  The file name is "DictionaryForMIDs_3.1.2beta_ThaEng_NIU_Thai.zip".

I found a few more problems with the bitmap fonts.

I.
I ran the bitmap font generator with font size 12 (only 1 size).  Then I started the program.  I went to "Settings" and turned on the bitmap font setting.  Then I went to the "font size" and selected "12" (it was already selected).  Then I got this error (while on the setting screen):

Thrown de.kugihan.dictionaryformids.general.g:
Incorrect bitmap font size setting: 14
Incorrect font size setting: 14

I think the problem is from an earlier setting I had.  Before I had the bitmap fonts set to size 14.  Then I turned off the bitmap fonts.  Then I made size 12 bitmap fonts and re-ran the program.  This is when I got the above error.

I also got that error in a 2nd way.  I ran the program with the bitmap fonts turned on with size 14.  Then I ran the bitmap font generator with only size 16.  When I started the program again (the bitmap fonts option was still turned on from the previous time), I got the same error about size 14 (I didn't have size 14, only size 16).  I couldn't even get to the search page until I re-ran the font generator with size 14.

This 2nd problem shouldn't actually happen in the real world (because we can only use 1 dictionary at a time now.  I found the problem with the Wireless Toolkit.)  But, later when we get the dictionary loader working, then the problem will arise.

So, maybe we need code that runs when the program is started to check if the "font size" setting currently saved actually exists in the dictionary that is loaded.
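
Something along these lines, as a rough sketch only: I don't know how DfM stores its settings internally, so the method just takes the saved size and the list of sizes found in the loaded dictionary as parameters, and it is written in plain Java that may need adapting for the phone runtime. Falling back to the closest available size is my own suggestion.

```java
final class FontSizeSettingSketch {
    /**
     * Returns the stored size if the loaded dictionary provides it;
     * otherwise falls back to the closest size that is available.
     */
    static int validSize(int storedSize, int[] availableSizes) {
        if (availableSizes.length == 0) {
            throw new IllegalArgumentException("dictionary has no bitmap fonts");
        }
        int best = availableSizes[0];
        for (int size : availableSizes) {
            if (size == storedSize) {
                return storedSize; // the saved setting is still valid
            }
            if (Math.abs(size - storedSize) < Math.abs(best - storedSize)) {
                best = size;
            }
        }
        return best;
    }
}
```

In my 2nd case above (size 14 saved, only size 16 generated) this would quietly switch to 16 instead of throwing the "Incorrect bitmap font size setting" error.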

II.
The bitmap fonts don't display correctly with the "Arial Unicode MS" font.  Every entry is shifted right half-way in the screen.

Also, most of the text seems to have disappeared.  The entire 1st search result for "table" disappeared when using the bitmap fonts.  It had a long example sentence in it.  With the bitmap fonts, the search result started at the next search result.

III.
I can't scroll down the screen with the bitmap fonts.  A search may have 10 hits.  A page will show 5.  But I can't scroll down to see the other 5.  It just scrolls down into small empty white boxes.

IV.
I tried to verify error II with another font.  So I ran the bitmap font generator with the "Cordia New" font (this is the default font for Thai.  It is installed by Windows).  This time the font was displayed on the left side of the screen correctly.  So only the bitmaps from "Arial Unicode MS" were making errors.  However, I can't use the "Cordia New" font because of the "complex scripts" limitation I mentioned before.  I must use "Arial Unicode MS".

Error III was also fixed by using the "Cordia New" font.  Scrolling worked fine.

V.
The "Cordia New" fonts looked OK.  But, I couldn't actually change the font size.  I'd start with size 12.  Then I'd go in the settings and select size 16.  However, the screen still showed size 12.

VI.
Then I ran the font generator for "Arial Unicode MS" for the Hindi dictionary.  I searched for "temple".  None of the Hindi or anything of the example sentences or grammar tags (in coloured text) showed up.  Only the English in black showed up.

Then I searched for "tree".  Everything looked OK, except the example sentence was on the same line as the search result.  It should be on the next line (there is an "\n" in the source dictionary file).

Then I searched for "house".  Then I got Error II.  The search results were shifted to the right.  And none of the Hindi or example sentences showed up.  Only the English search results were shown.


I uploaded the Hindi source files for the developers.  It is in the "dictionary EngHin (IIIT), 3.1.2" directory.  The file is titled "DfM_Hindi_source_312beta.zip".

You will need Hindi set up on your computer to use it.  For WinXP:
Control Panel -> Regional and Language Options -> Languages -> select "Install files for complex script and right-to-left languages"

Then open the zip file and extract it to "C:\".  The directory structure is already set as:
C:\Temp\Dict\Hindi\

Then run C:\Temp\Dict\Hindi\setup.bat
Then run the bitmap font generator
Then run C:\Temp\Dict\Hindi\jar.bat

The directories are hard-coded in the BAT files.  Feel free to change the environment how you'd like.

The fonts to use for the font generator are "Arial Unicode MS" (an optional install with Microsoft Office 2000 and higher) and "Mangal".  "Mangal" is the default font for Hindi on Windows, but "Arial Unicode MS" looks better.

Jeff
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 18. May 2007, 18:38:30
1) seems to be a problem with the settings store. Do you have an idea, Gert?

2), 3) & 4) are all the same problem. Unfortunately this one really is a bug in the bitmapFontFeature. I have already located it, but it'll take some more investigation to work out how to fix it. Strange that it never happened with the Chinese dictionary, since there are many more characters in it...
I always use ArialUnicodeMS...

5) I don't really know. Maybe it's a general bug in DFM, which I also have with the settings. Please try: first change to 10 and then change to 14.
On the W700 I have this problem with the interface language too. I always have to switch back to English before I can choose another language. Do you have this problem too?

6) I think the same as 2,3 & 4.

Thanks for your detailed testing.

Sebastian
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 18. May 2007, 20:00:43
Fantastic to see your work for testing / improvements !!! :)

1)
Jeff, I assume that you did do all your tests with Sun's WTK, right ?

That problem with storage of settings in Sun's WTK is well-known (I also included it in the FAQ).
However, this is not only related to the font size settings. It also shows up in other situations. For example:
- you run a dictionary with 3 languages
- select language 3
- re-run with a 2 languages dictionary
-> you have an exception because of an illegally selected language (the application will not even start)

These problems do not show up in any real device - only in the WTK development environment.

To avoid this, when you re-build a dictionary with different bitmap fonts sizes, etc., just delete the WTK storage files (somewhere in the FAQ there is a description how to do this).

Ok, I see. Because people keep running into these problems on WTK, I think we should try to make something like an 'erroneous storage settings detection', where we try to detect, for example, invalid bitmap font sizes. Hmmm, we need to think about all possible error situations.

And yes, oops, we need to consider this also for loadable dictionaries !! This is not yet done - thank you for this hint Jeff !!


2 & 3) & 4)
Sebastian, I am glad to read that you already looked at this !
Just one request: if you have a fix, can you also include that fix in the All_3_1_1_branch ? And then ;) re-build and upload everything ... ?

Thanks to you !!
Gert
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 19. May 2007, 12:49:14
Jeff:
I think the problem is with the "byte order mark"/header-character, which is not removed in your dictionary. (if you use e.g. hexplo, then you can see an EFBBBF). The Font Generator cannot handle this "character" properly yet.
I'll do an update of the Font Generator these days, which takes care of this "character".

Gert: Maybe we should remove the header automatically during dictionary generation, if it exists, since many people forget to remove it.

Sebastian
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 19. May 2007, 14:05:04
Hmmm, never heard about that character - where does it come from ? Is it a legal Unicode / UTF-character ?

Well, if the right solution is to remove this character, then we could do it in DictionaryGeneration, so BitmapFontGenerator will not have to bother about this. Just that I will not have time to work on source code during the next weeks.

Gert
Title: Re: Release of 3.1.1 for testing
Post by: dreamingsky on 19. May 2007, 16:03:03
The BOM is causing the problem?  Interesting.  The BOM (byte order mark) http://en.wikipedia.org/wiki/Byte_order_mark isn't an illegal character.  Basically it's a code to tell programs the file is encoded as UTF-8.  UTF-16 uses the same character (U+FEFF), but it is stored as different bytes (FE FF or FF FE instead of EF BB BF).

It is a normal character: "zero-width no-break space".  If you don't save a UTF-8 file with a BOM, then the next time the program opens the file it must guess what encoding the file has.  If you save with a BOM, then when a program opens the file then it knows the file is UTF-8.

I think it's a good idea to remove the BOM with the font generator not with DictionaryGeneration.  I think having the .csv files with the BOM would be a good idea.  Then you can open the files easier with a program.  I wouldn't recommend asking the users to manually remove it, since it is a good idea to save UTF8 files with a BOM.
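
Removing it in code is simple enough. Here is a minimal sketch that works on the raw bytes of a file and drops the three bytes EF BB BF that Sebastian saw in the hex editor, if they are at the very start. The class and method names are made up; this is not code from the font generator.

```java
final class BomSketch {
    /** Returns the bytes without a leading UTF-8 byte order mark (EF BB BF), if present. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        if (!hasBom) {
            return data; // nothing to remove
        }
        byte[] stripped = new byte[data.length - 3];
        System.arraycopy(data, 3, stripped, 0, stripped.length);
        return stripped;
    }
}
```

The .csv files on disk can then keep their BOM, and only the copy that the tool processes has it removed.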

I can manually remove the BOM and do some more testing for the time being.  I'll do some more testing tomorrow.
Jeff
Title: Re: Release of 3.1.1 for testing
Post by: Gert on 19. May 2007, 19:11:37
Jeff,

thank you for your link ! Yes, indeed it seems that this is a legal character. Hmmm, so why could this cause a problem in the font generation ?

Gert
Title: Re: Release of 3.1.1 for testing
Post by: Tomcollins on 20. May 2007, 00:28:34
I think DFM, as it is now, doesn't 'know' this BOM, so the first entry may not show up properly (at least I think I had problems once).

But you are right Jeff, I also think that we should work with BOMs! Dictionary input files with BOMs should also be possible, since there may be users who don't know how to easily remove it.

Jeff: I think there is more than one problem, so you can wait until I have located them and updated the bitmapfontGeneration before you test it again. Strange that it works with other fonts!?!

Sebastian