Bug: (EDICT only) Incorrect transliteration for "ju"

Started by oxxide, 02. December 2007, 05:42:07


oxxide

Hi,

I noticed that one of the entries in DictionaryForMIDs.jar/char_lists/romaji_hiragana_UTF8.txt is incorrect:

sho=しょ
ja=じゃ
ju=ふ
jo=じょ
cha=ちゃ

The entry for "ju" is wrong; it should read:
ju=じゅ

I've confirmed that this prevents a user from searching for any word that contains the syllable "ju" when doing a romaji-based Japanese > English lookup. For example, searching for the romaji string "juku" yields results that match "ふく" (fuku) instead.

Also, I noticed that this file covers only the Hepburn standard for Japanese romanization. I would recommend extending it to support the Nihon-shiki and Kunrei-shiki romanization standards, which are widely used in Japan (see the examples below). Here is a reference on these standards: http://en.wikipedia.org/wiki/Romanization_of_Japanese
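For example, Kunrei-shiki and Nihon-shiki spell some syllables differently; those spellings could be added alongside the existing Hepburn entries in the same key=kana format (a few illustrative lines, not a complete set):

si=し
ti=ち
tu=つ
hu=ふ
sya=しゃ
tya=ちゃ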

Cheers,
oxxide

oxxide

I'm working on an updated version of this file.

Meanwhile, I thought I'd apply the same fixes to the katakana file as well. I found that in this file:
- there are about 10 errors that need to be fixed
- it has roughly twice as many entries, because long vowels are mapped to the vowel-lengthening dash 「ー」. For example:

aa=アー
ii=イー
uu=ウー
ee=エー
oo=オー

This sounds like a nice convenience at first, but I'm not sure I agree with it. My main concern is that it makes it impossible to input アア、イイ、カア、etc., which actually appear in some words (for example: クリアアウト, "clear out"): with the mappings above, "kuriaauto" converts to クリアーウト instead.

I would recommend using the same mappings as in the hiragana file (i.e. dropping the long-vowel mappings) and instead mapping the ASCII dash to the vowel-lengthening dash:

-=ー

The only downside I can think of is that users need to be aware of this and type the dash when they search: searching for "paatona" will not return "partner", but "pa-tona" will. I think this is fair game as long as it's documented somewhere.

Thoughts on this?

dreamingsky

I wouldn't spend too much time fixing romaji input for the Japanese EDICT dictionary. I think moving to hiragana input is a much better idea.

Instead of adding support for numerous romaji systems, it's better to just input the Japanese as Japanese (i.e. hiragana). Then you don't have to worry about Nihon-shiki and Kunrei-shiki.

Hiragana input would also solve the problem with the long vowels.  On a standard Japanese electronic dictionary you enter long vowels as a combination of hiragana and the Japanese "ー" symbol.

For example, to search for "cart" (カート), you type this in the electronic dictionary:
かーと [hiragana-katakana-hiragana]

If you wanted to search for "クリアアウト", you would enter "くりああうと" (you don't enter the long vowel sign).

Romaji is nice for beginning students of Japanese, but students should abandon it pretty early and use only hiragana, katakana, and kanji.

oxxide

Attached are the hiragana and katakana mapping files with my fixes. Here's what was changed:

In the hiragana file:

  • Fixed mistaken mappings such as ju
  • Added Kunrei-shiki mappings (tya, tyu, tyo, etc.)
  • Added small-character mappings (la=ぁ, lya=ゃ, etc.)
    The small vowels are required to input some words such as ディスク (disc). The other small-character mappings are just commonly known shortcuts.
  • Added the vowel-extension mapping (-=ー). A short excerpt of the new entries follows this list.
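As mentioned above, here is a short excerpt of the new entries, in the file's key=kana format (hand-picked examples, not the complete list):

tya=ちゃ
tyu=ちゅ
tyo=ちょ
la=ぁ
li=ぃ
lya=ゃ
-=ー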

To obtain the katakana file, I copied over the hiragana file and converted all hiragana glyphs to their katakana counterparts.
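For the record, the conversion itself boils down to a fixed Unicode offset between the two kana blocks. A rough Java sketch of such a one-off helper (untested, not part of DictionaryForMIDs):

// The standard hiragana block (U+3041..U+3096) sits exactly 0x60 code
// points below the matching katakana block, so a fixed offset converts
// each kana while leaving romaji keys, '=' and 'ー' untouched.
static String hiraganaToKatakana(String line) {
    StringBuffer sb = new StringBuffer(line.length());
    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c >= '\u3041' && c <= '\u3096') {
            sb.append((char) (c + 0x60));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}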

It would be awesome if someone could look them over for completeness and correctness before they go into a release.

Also note that I never tested this on my device, I just followed the existing mappings when adding / correcting things. Please run a sanity check on a device/simulator before releasing a version with these files.

dreamingsky: I completely agree that romaji alone is an insufficient input method for a dictionary, and that we should move to a kana->kanji converting IME like the ones on Japanese phones. However, I still think it's worth maintaining an accurate roman alphabet -> kana mapping because:
- Romaji search is all that's working right now (at least on my phone) so it needs to be supported. Right now I can't look up any words containing "ju" (and they seem to be annoyingly common :D)
- There's little confusion between the three romanization systems since they're mostly similar, and where the mappings differ they can coexist as equivalents (cha = tya, etc.), so there's no conflict.
- It's really not that much work, plus it's done :)

dreamingsky

Oxxide

I looked over your conversion tables.  Everything looks good.

I found a couple of duplicate listings in the hiragana and katakana files. This is a minor error and wouldn't affect the program at all.

Hiragana doubles
ji   じ
jji   っじ
zu   ず
zzu   っず

Katakana doubles
jji   ッジ
zu   ズ
zzu   ッズ

I found a couple of romaji listings that were missing from the conversion tables. They are rare Nihon-shiki transcriptions, but they should probably be added to the tables:

hiragana   katakana   romaji
ちゃ   チャ   tya
ちゅ   チュ   tyu
ちょ   チョ   tyo
っづ   ッヅ   ddu
っちゃ   ッチャ   ttya
っちゅ   ッチュ   ttyu
っちょ   ッチョ   ttyo

Also, there is one more uncommon romaji listing we could maybe add: "oh". It is used for "おう" ("oo" is still used for "おお", I think).
From http://en.wikipedia.org/wiki/Hepburn_romanization:
"Ministry of Foreign Affairs standard (外務省旅券規定), in which the rendering of syllabic n as m before b, m, p is used and the spelling oh for word-final long o is allowed (e.g. Satoh for 佐藤). This is used to romanize Japanese names in passports and is thus also known as 'Passport Hepburn'."

I used to live near a city called "Akou".  On all the signs the romaji was listed as "Akoh".  I always thought the "oh" was from Nihon-shiki.  But, I guess it is a variation of Hepburn.  I have no idea why the passport agency uses "oh".  I think "ou" is a better idea.

Were all the changes in the TXT files?  Did you need to change any Java code?  If there are any Java code changes then they'll have to wait until version 3.20 is released.  If the only changes are in the TXT files then I might be able to sneak them into version 3.12 and release it.

Jeff

Gert

The files are in the char_lists directory of the JAR file. You can use any zip tool (WinZip, 7zip, filezip, ...) to update the JAR file.

After updating the JAR file you either need to
1. remove the JAD file, OR
2. adjust the value of the MIDlet-Jar-Size property in the JAD file (see the example below).

Option 1 is the easiest.
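For reference, a JAD file is plain text; the relevant part looks something like this (names and numbers here are made up):

MIDlet-Name: DictionaryForMIDs
MIDlet-Jar-URL: DictionaryForMIDs.jar
MIDlet-Jar-Size: 412345

MIDlet-Jar-Size must match the exact size in bytes of the updated JAR file; most devices check this value during installation.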

Gert

Gert

Does the JAR update work as described above?

Or should I create a specific build which includes the updated files?

Regards,
Gert

dreamingsky

I was waiting to get a response from oxxide before posting a new build. I'll wait another week, then I'll go ahead and post it.

I don't need a special build for it.  I'll just use WinRAR to update the JAR file.

Thanks

oxxide

Jeff,
Thanks for fixing the doubles and the missed transliterations. I managed to leave out the "tya" series even though I used it for examples in all my posts...
I wouldn't worry about Passport Hepburn either; it doesn't seem to be used much for Japanese input.
All my changes are in these two files, but I haven't even looked at the code that interprets them, so I can't tell for sure that they will work. My assumptions were (see the sketch after this list):
- these files are actually used,
- the mappings are scanned in the order they appear in the file,
- the first mapping whose key appears at the beginning of the unconverted string is used.
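If those assumptions hold, the behavior would look roughly like this (a sketch I wrote to check my reasoning; untested and not the actual DictionaryForMIDs code):

// mappings[i][0] is the romaji key, mappings[i][1] the kana value,
// kept in the same order as the lines of the char_lists file.
static String toKana(String romaji, String[][] mappings) {
    StringBuffer out = new StringBuffer();
    int pos = 0;
    while (pos < romaji.length()) {
        boolean matched = false;
        for (int i = 0; i < mappings.length && !matched; i++) {
            // first mapping whose key starts at the current position wins
            if (romaji.startsWith(mappings[i][0], pos)) {
                out.append(mappings[i][1]);
                pos += mappings[i][0].length();
                matched = true;
            }
        }
        if (!matched) {
            out.append(romaji.charAt(pos++)); // pass unknown characters through
        }
    }
    return out.toString();
}

Note that this would require longer keys such as "jja" to come before their shorter prefixes such as "ja" in the file, otherwise the longer ones would never match.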

I'm looking forward to the updated version!

dreamingsky

I posted the new version yesterday.  I didn't do a thorough test on the new romaji inputs, however.