Feature request and update of existing feature

Started by waldermort, 04. May 2011, 07:18:57


waldermort

Great little tool. It has helped me out many times when I'm out and about and there is a word I need. One major problem I have though is deciding which word to use. For any given word (I'm mainly talking about English to Simplified Chinese here), there are multiple translations, all of which have differing meanings and usages.

The feature I would like to see: For a translated word let us know, among the results, what type of word it is, i.e. Verb, Noun, Adjective. I realize this would require re-writing the dictionaries, but no dictionary is complete without this feature.

As to the existing feature update. When entering pinyin to begin a search, all words must be written in lowercase and separated by a space. Most phones today explicitly set the first letter of a word to be upper case, requiring us to switch. Converting the search parameter to lower case before performing the search will fix this problem. Pinyin without spaces can be fixed by first parsing the string. Since pinyin is simple compared to English, a simple lookup table can speed up the parsing.

Regards

Gert

Good feedback, thanks for this!

Quote: The feature I would like to see: For a translated word let us know, among the results, what type of word it is, i.e. Verb, Noun, Adjective. I realize this would require re-writing the dictionaries, but no dictionary is complete without this feature.

I guess that is related to the CEDICT dictionary?
Jeff or Lars would have to check whether CEDICT provides that information.


Quote: As to the existing feature update. When entering pinyin to begin a search, all words must be written in lowercase and separated by a space. Most phones today explicitly set the first letter of a word to be upper case, requiring us to switch. Converting the search parameter to lower case before performing the search will fix this problem.

That seems to me more like a bug; the Normation class has to make sure that upper case letters are treated the same as lower case letters, exactly as you write.
Since I wrote the Normation class for Chinese/Pinyin, I guess it is up to me to check that. Just remind me in about 6 weeks; I should have time to look at it then.

Quote: Pinyin without spaces can be fixed by first parsing the string. Since pinyin is simple compared to English, a simple lookup table can speed up the parsing.

Hmm, what exactly would the lookup table and the algorithm be? (I admit that I do not really know Chinese/Pinyin.)

Best regards,
Gert




waldermort

Quote: Jeff or Lars would have to check whether CEDICT provides that information.
Unfortunately, it doesn't.

Quote: Hmmm, what exactly would be the lookup table and the algorithm (I admit that I do not really know Chinese/Pinyin)

Welcome.

I would write up a patch myself if I had the time (C/C++ background), but unfortunately my duties call me elsewhere.

Take a look at http://www.studypond.com/pinyin.aspx

Basically pinyin is composed of an Initial followed by a Final or, in some cases, only a final. I would have an array for each and iterate the input string while trying to match an Initial/Final pair (followed by an optional tone number) and add a space accordingly. If it can't be matched then abort and use the input as-is. I believe this could be incorporated into the existing code quite easily.

NOTE: strings may contain combinations such as "Tiananmen", which expanded would be "tian an men". A greedy search would be advised.
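To make the idea above more concrete, here is a rough sketch in Python (not the project's own code, just an illustration). The names INITIALS, FINALS, STANDALONE_FINALS and segment_pinyin are made up for this example. It assumes 'y' and 'w' are treated as initials, that 'v' stands in for 'ü', and that any Initial/Final pair is acceptable, so it will also accept some combinations that are not real syllables. It prefers the longest match at each step and backs up to a shorter one if the rest of the string cannot be matched, which goes slightly beyond a purely greedy search.

```python
# Hypothetical lookup tables; 'y' and 'w' are treated as initials for simplicity.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
FINALS = ["iang", "iong", "uang", "ang", "eng", "ing", "ong", "iao", "ian",
          "uai", "uan", "ai", "ei", "ao", "ou", "an", "en", "er", "ia", "ie",
          "iu", "in", "ua", "uo", "ui", "un", "ue", "ve",
          "a", "o", "e", "i", "u", "v"]
# Finals that can form a syllable without any initial.
STANDALONE_FINALS = ["ang", "eng", "ai", "ei", "ao", "ou", "an", "en", "er",
                     "a", "o", "e"]
TONE_DIGITS = "12345"


def _candidate_ends(text, pos):
    """End offsets of every syllable that could start at pos, longest first."""
    ends = set()
    for initial in INITIALS + [""]:
        if not text.startswith(initial, pos):
            continue
        start = pos + len(initial)
        finals = FINALS if initial else STANDALONE_FINALS
        for final in finals:
            if text.startswith(final, start):
                end = start + len(final)
                # Consume an optional tone number directly after the syllable.
                if end < len(text) and text[end] in TONE_DIGITS:
                    end += 1
                ends.add(end)
    return sorted(ends, reverse=True)


def _split(text, pos):
    """Return a list of syllables covering text[pos:], or None if impossible."""
    if pos == len(text):
        return []
    for end in _candidate_ends(text, pos):
        rest = _split(text, end)
        if rest is not None:
            return [text[pos:end]] + rest
    return None


def segment_pinyin(query):
    """Insert spaces between syllables; return the input as-is on failure."""
    text = query.strip().lower()
    if " " in text:
        return text  # already separated by the user
    syllables = _split(text, 0)
    return " ".join(syllables) if syllables is not None else query


print(segment_pinyin("Tiananmen"))  # tian an men
print(segment_pinyin("hello"))      # hello (no match, left unchanged)
```

Some inputs are genuinely ambiguous. For example, "xian" stays a single syllable here because the longest match wins, even though "xi an" would also be valid. Search strings are short, so the backtracking should not be a problem in practice.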

NOTE: the letter 'u' with two dots above ('ü'), in the Final table, is often represented by the letter 'v' in plain ASCII. This is actually also a bug in the existing code. For example, the Chinese character '女' (U+5973) in pinyin is 'nü', which can be typed into any existing IME as 'nv'. In the mid dictionary, the search string 'nv' returns no results, and the string 'nu' doesn't return '女' as expected (though it would be nice if it did, as I often mistype it).
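A minimal sketch of the kind of normalization I have in mind, again in Python purely for illustration. The function name normalize_pinyin and the loose flag are hypothetical. It also assumes the dictionary data may spell 'ü' as 'u:' (I believe CEDICT does this, but that should be checked against the actual files). The same function would have to be applied to both the index keys and the search string so that they end up in the same form.

```python
import unicodedata


def normalize_pinyin(text, loose=False):
    """Bring a pinyin string into one canonical form for comparison.

    Assumption: 'v' is used as the canonical spelling of 'ü'.
    With loose=True, 'ü' and 'u' are treated as the same letter.
    """
    # Compose 'u' + combining diaeresis into the single character 'ü'.
    text = unicodedata.normalize("NFC", text)
    # Phones often capitalize the first letter, so fold case first.
    text = text.strip().lower()
    # Map the common spellings of 'ü' to 'v'.
    text = text.replace("u:", "v").replace("ü", "v")
    if loose:
        # Forgiving mode: a search for 'nu' can then also find 'nü'.
        text = text.replace("v", "u")
    return text


print(normalize_pinyin("Nü3"))             # nv3
print(normalize_pinyin("NU:3"))            # nv3
print(normalize_pinyin("nv", loose=True))  # nu
```

With the strict form, 'nv', 'nü' and 'nu:' all compare equal. The loose form trades precision for convenience, so it is probably best used only as a fallback when the strict search finds nothing.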

Gert

Thanks for your precise description!

Concerning the upper case/lower case problem: I had a very quick look, and it seems that no Normation class is configured in DictionaryForMIDs.properties for CEDICT for the Chinese language. I would have to write a trivial Normation class that performs only the default normation steps, which include upper/lower case handling.

I will look at the upper case/lower case problem more closely as soon as I find time (hopefully in June). I will also try to look at the Pinyin u/v problem.

Best regards,
Gert

waldermort

Welcome.

Looking forward to an updated version. If you require a beta tester, please just drop me an email.

Regards

Owen