New Normation Class - tone numbers

Cryme · 01. January 2008, 05:52:40

Hello,

Forgive me, but I'm trying to create a new dictionary and I'm not that experienced at programming. I have a question about creating a new Normation class or, if I'm lucky, finding a suitable one that already exists. As I understand it, Normation classes help during searches, so that typing an "e" also matches other forms of the "e" character, including accented ones (stop me if I'm wrong). I have a similar need in the dictionary I'm making, but it's even simpler. My language2 (non-English) entries have a number (indicating tone) after the word, and I want to be able to search without the tone number and get results for all tones. For example, if I search for "gai" I want the results to include "gai1", "gai2", "gai3", etc. I have the dictionary working so far, but I must enter the precise number or else it returns nothing.

Can anyone help me? Either by whipping one of these classes up and giving me the link, or by telling me how to do it? I'd appreciate any help I can get. My dictionary is amazing and with this new addition it would be really powerful.

Thanks!!
Ryan

Gert · 01. January 2008, 06:58:29

Yep, Normation classes are exactly for that kind of purpose!

Often people can just use an existing Normation class, because by now we have Normation classes that are suitable for many languages. If there is a need for a specific Normation class, then I can implement this quickly for you and send you the Normationxxx.class file.

In your case, if you want to find 'gai1/2/3' when searching for 'gai', we have a Normation class for Chinese Pinyin which does this. In fact, this Normation class for Pinyin also handles searching by accents (both accents and tone numbers are supported). Is your language2 Pinyin?

If it is not Chinese Pinyin, no problem: implementing a Normation class which also finds 'gai1/2/3' for 'gai' is basically one line of Java code. I can do this very quickly.
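To illustrate the "one line of Java" idea, here is a minimal sketch that strips tone digits so that "gai1", "gai2" and "gai" all reduce to the same search key. The class and method names are hypothetical, and the method is deliberately standalone: the actual signature a Normation class has to implement is not shown in this thread, so the wiring into DictionaryForMIDs is left open.

```java
// Hypothetical standalone helper, not an actual DictionaryForMIDs Normation class.
class ToneNormalizationSketch {

    // Lower-cases the word and removes the tone digits 1-6,
    // so "gai1", "gai2", ... and "gai" all normalize to "gai".
    static String normalize(String word) {
        return word.toLowerCase().replaceAll("[1-6]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("gai1"));      // prints "gai"
        System.out.println(normalize("Gai3 bei2")); // prints "gai bei"
    }
}
```

One consequence to keep in mind: if the same normalization is applied to the search term as well as to the indexed words, then searching for "gai1" would presumably also return every tone of "gai", not only the first tone.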

Best regards!
Gert

Cryme · 01. January 2008, 14:56:35

Hello,

My language2 is Cantonese pinyin (specifically Yale romanization), but it probably works exactly the same as the Mandarin ("Chinese") Pinyin, so I can probably use that. What is the associated file? I tried NormationChi but it didn't find that file...

Ryan

Cryme · 01. January 2008, 15:20:17

I'll be more specific...
Right now in my dictionary I am using the following Normation classes:

language1NormationClassName: de.kugihan.dictionaryformids.translation.NormationEng
language2NormationClassName: de.kugihan.dictionaryformids.translation.Normation

(in my properties file before I use the DictionaryGeneration.jar)

I tried using:
language2NormationClassName: de.kugihan.dictionaryformids.translation.NormationChi
but during DictionaryGeneration it throws an error about "Normation Class not found"

I also initially tried putting another ".normation" in between the ".translation" and ".NormationXXX" but it also can't find those.

Hope that helps,
Ryan

Gert · 01. January 2008, 18:23:17

I just checked the CEDICT dictionary, which uses Pinyin, and realized that it uses a DictionaryUpdate class for this and not a Normation class. Oops, it seems that I misinformed you when I said that the Pinyin handling is based on a Normation class! There is actually no Normation class for Pinyin, but a DictionaryUpdate class named de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateCEDICTChi

The DictionaryUpdate class is used by DictionaryGeneration when it generates the index files etc.

You can find a description of the DictionaryUpdate classes at http://dictionarymid.sourceforge.net/newdictDictionaryUpdate.html ; CEDICT also uses the 'advanced' features.

I'll write more on this later, I have to give you some more information ...

Gert

Tomcollins · 01. January 2008, 19:31:08

Seems we don't have a Normation class for Pinyin yet (who has time?). We always did this with "update" classes:
There are two for Chinese entries with Pinyin, but they may not fit your problem, since they require a very specific input format.
The existing ones are:
DictionaryUpdateHanDeDictChi
DictionaryUpdateCEDICTChi

The fastest solution for your problem might be to enter something like "gai?" or "gai*", using wildcards. In the options you can also try switching on "* at the end".
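For readers unfamiliar with wildcards, the following sketch shows the usual convention: "?" stands for exactly one character and "*" for any run of characters, including none. It only illustrates that convention with java.util.regex; it is not how DictionaryForMIDs implements its search, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustration of common wildcard semantics; not DictionaryForMIDs code.
class WildcardSketch {

    // Translates a wildcard query into a regular expression:
    // '?' becomes '.', '*' becomes '.*', everything else is matched literally.
    static Pattern toPattern(String query) {
        StringBuilder regex = new StringBuilder();
        for (char c : query.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole entry matches the wildcard query.
    static boolean matches(String query, String entry) {
        return toPattern(query).matcher(entry).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("gai?", "gai1")); // true
        System.out.println(matches("gai?", "gai"));  // false: '?' needs one character
        System.out.println(matches("gai*", "gai"));  // true: '*' may match nothing
    }
}
```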

Sebastian

infoText: Chinese-English dictionary from CEDICT: http://www.mandarintools.com/cedict.html 
dictionaryAbbreviation: CEDICT
numberOfAvailableLanguages: 2
language1DisplayText: Chinese
language2DisplayText: English
language1FilePostfix: chi
language2FilePostfix: eng
dictionaryGenerationSeparatorCharacter: '\t'
indexFileSeparationCharacter: '\t'
searchListFileSeparationCharacter: '\t'
dictionaryFileSeparationCharacter: '\t'
dictionaryGenerationLanguage2ExpressionSplitString=/
dictionaryGenerationInputCharEncoding: UTF-8
language2NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationEng
language1DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateHanDeDictChi
language2DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateHanDeDictGer
indexCharEncoding: UTF-8
searchListCharEncoding: UTF-8
dictionaryCharEncoding: UTF-8
language1NumberOfContentDeclarations=3
language1Content01DisplayText: Simplified
language1Content01DisplaySelectable: true
language1Content02DisplayText: Traditional
language1Content02DisplaySelectable: true
language1Content03DisplayText: Pinyin_with_tones
language1Content03DisplaySelectable: true

Gert · 01. January 2008, 21:05:36

Thank you for your help Sebastian !!

Yes, the point is that these DictionaryUpdate classes are made for the specific formats of the HanDeDict and CEDICT dictionaries.

I should point out that these two dictionaries make very good use of the 'content' feature (see http://dictionarymid.sourceforge.net/newdictContent.html). This makes the dictionaries look nice and brings other advantages.

Just an idea of mine: Ryan, if you could make the format of the Cantonese column of your inputdictionaryfile identical or similar to that of HanDeDict, it may be possible to use the DictionaryUpdateHanDeDictChi class (maybe after some adaptation). Well, it is only an idea; I am not sure how much effort it would take to format the file according to HanDeDict. Sebastian would know better than me!

Greetings,
Gert

Cryme · 02. January 2008, 14:38:26

Thanks guys!

I didn't think about the wild card characters... worst case scenario I'll just get used to typing "gai?" to represent all forms of gai... but if I didn't have to type that question mark in it would be slightly easier.

I included the following update classes into my dictionary that CEDICT seems to use:

language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.DictionaryUpdateEngDef
language2DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.DictionaryUpdateCEDICTChi

I noticed one small change - Now I get search results with the search string anywhere in the translation and not just at the beginning.
But otherwise I do not see the change I was looking for (getting all tone results without entering tone number). I don't know why... perhaps my formatting is slightly different or something... My dictionary database has the form...

to nod one's head   岌頭 ngap6 tau2
high; majestic; fork in road   岐 kei4
Qishan (place in Shaanxi)   岐山 kei4 saan1
steep, precipitous; peak   岑 sam4

...where tabs separate the English from the characters. Perhaps I need square brackets around the pinyin like CEDICT does...

Actually, now that I think of it, maybe it is because CEDICT has its entries in the "accented" form and only uses the numbers as optional input. I would imagine it converts numbers to accents and then looks it up? Mine would be the other way around. I hate the accented form so I always use numbers.

Ryan

Tomcollins · 02. January 2008, 19:12:17

The CEDICT class takes input with the Pinyin in brackets; the input looks like:

安康 安康 [01an1 kang1],good health

To solve all the problems you would probably need your own update class (or maybe a Normation class which removes the numbers). Maybe you can find someone to do that...

I have added the source code of the CEDICT update class:

/*
DictionaryForMIDs - a free multi-language dictionary for mobile devices.
Copyright (C) 2005, 2006  Gert Nuber (dict@kugihan.de), Erik Peterson (http://www.mandarintools.com/)

GPL applies - see file COPYING for copyright statement.
*/

package de.kugihan.dictionaryformids.dfmbuilder.dictgen.dictionaryupdate;

import java.util.Vector;

import de.kugihan.dictionaryformids.general.DictionaryException;
import de.kugihan.dictionaryformids.general.Util;

public class DictionaryUpdateCEDICTChi extends DictionaryUpdate {

	/*
	 * The input looks like
	 * 安康 安康 [01an1 kang1],good health
	 * [01 is the start content delimiter 
	 */
	
	// replaces the pronounciation part which is pinyin with tone numbers in the source
	// with accented pinyin, by using Erik Peterson's conversion routines
	public String updateDictionaryExpression(String dictionaryExpression) throws DictionaryException {
		String updatedExpression;
		int startBracket = dictionaryExpression.indexOf('[');
		int endBracket = dictionaryExpression.indexOf(']');
		if ((startBracket != -1) && (endBracket > startBracket)) {
			String pronounciationToneNumbers = dictionaryExpression.substring(startBracket + 3, endBracket);  // + 3 because of [01
			String pronounciationAccented = addTones(pronounciationToneNumbers);
			updatedExpression = dictionaryExpression.substring(0, startBracket) +
							    "[" +
							    pronounciationAccented +
							    "]";
			updatedExpression = DictionaryUpdateLib.setContentPronounciation(updatedExpression, 1);
		}
		else {
			updatedExpression = dictionaryExpression;
		}
		return updatedExpression;
	}
	
	// Creates the keyWordVector for
	// a) the pronounciation part which is in square brackets: 
	//    - one time with tone numbers 
	//    - one time without tone numbers
	//    - one time in the accented version using Erik's conversion routines  
	// b) for the Chinese expression
	public Vector createKeyWordVector(String expression, String expressionSplitString) {
		
		Vector keyWordVector = new Vector();
		
		int startBracket = expression.indexOf('[');
		int endBracket = expression.indexOf(']');

		String chineseExpression;
		if ((startBracket != -1) && (endBracket > startBracket)) {
			String pronounciationExpression = expression.substring(startBracket + 3, endBracket);
			chineseExpression = expression.substring(0, startBracket);
			DictionaryUpdateLib.addKeyWordExpressions(pronounciationExpression, keyWordVector);
			String pronounciationWithoutNumbers = removeNumbers(pronounciationExpression);
			DictionaryUpdateLib.addKeyWordExpressions(pronounciationWithoutNumbers, keyWordVector);
			String pronounciationAccented = addTones(pronounciationExpression);
			DictionaryUpdateLib.addKeyWordExpressions(pronounciationAccented, keyWordVector);
		}
		else {
			chineseExpression = expression;
		}
		DictionaryUpdateLib.addKeyWordExpressions(chineseExpression, keyWordVector);

		return keyWordVector;
	}
	
	protected String removeNumbers(String expression) {
		StringBuffer output = new StringBuffer();
		for (int pos = 0; pos < expression.length(); ++pos) {
			char character = expression.charAt(pos);
			if (! Character.isDigit(character)) {
				output.append(character);
			}
		}
		return output.toString();
	}

	
	/*
	 * The code below comes from Erik Peterson (http://www.mandarintools.com/)
	 */
    public static String addTones(String withnumbers) {
    	StringBuffer scratch = new StringBuffer(withnumbers);
    	int index, oldindex;
    	String source, target;
    	String oldtail[];
    	String newtail[];
    	String vowelnums[];
    	String voweltones[];
    	oldtail = new String[] {"ng1", "ng2", "ng3", "ng4", "ng5",
    				"n1", "n2", "n3", "n4", "n5",
    				"r1", "r2", "r3", "r4", "r5",
    				"ao1", "ao2", "ao3", "ao4", "ao5",
    				"ai1", "ai2", "ai3", "ai4", "ai5",
    				"ei1", "ei2", "ei3", "ei4", "ei5",
    				"ou1", "ou2", "ou3", "ou4", "ou5"};

    	newtail = new String[] {"1ng", "2ng", "3ng", "4ng", "5ng",
    				"1n", "2n", "3n", "4n", "5n",
    				"1r", "2r", "3r", "4r", "5r",
    				"a1o", "a2o", "a3o", "a4o", "a5o",
    				"a1i", "a2i", "a3i", "a4i", "a5i",
    				"e1i", "e2i", "e3i", "e4i", "e5i",
    				"o1u", "o2u", "o3u", "o4u", "o5u"};

    	vowelnums = new String[] {"a1", "a2", "a3", "a4", "a5", "e1", "e2", "e3", "e4", "e5",
    				  "i1", "i2", "i3", "i4", "i5", "o1", "o2", "o3", "o4", "o5",
    				  "u1", "u2", "u3", "u4", "u5",
    				  "u:1", "u:2", "u:3", "u:4", "u:5", "u:",
    				  "v1", "v2", "v3", "v4", "v5", "v"};
    	voweltones = new String[]  {"\u0101", "\u00e1", "\u01ce", "\u00e0", "a",
    				    "\u0113", "\u00e9", "\u011b", "\u00e8", "e",
    				    "\u012b", "\u00ed", "\u01d0", "\u00ec", "i",
    				    "\u014d", "\u00f3", "\u01d2", "\u00f2", "o",
    				    "\u016b", "\u00fa", "\u01d4", "\u00f9", "u",
    				    "\u01d6", "\u01d8", "\u01da", "\u01dc", "\u00fc", "\u00fc",
    				    "\u01d6", "\u01d8", "\u01da", "\u01dc", "\u00fc", "\u00fc"};

    	// Move to lower case
    	withnumbers = withnumbers.toLowerCase();

    	// Switch tone number from end of syllable to next to appropriate vowel
    	source = withnumbers;
    	target = withnumbers;  // Have to set it here to satisfy compiler
    	for (int i=0; i < oldtail.length; i++) {
    	    oldindex = index = 0;
    	    target = "";
    	    index = source.indexOf(oldtail[i], oldindex);
    	    while (index >= 0) {
    		target = target + source.substring(oldindex, index);
    		target = target + newtail[i];
    		oldindex = index + oldtail[i].length();
    		index = source.indexOf(oldtail[i], oldindex);
    	    }
    	    target = target + source.substring(oldindex, source.length());
    	    source = target;
    	}

    	// Replace vowel+tone number with vowel with a tone diacritic
    	boolean foundvowel = false;
    	for (int i=0; i < vowelnums.length; i++) {
    	    oldindex = index = 0;
    	    target = "";
    	    index = source.indexOf(vowelnums[i], oldindex);
    	    while (index >= 0) {
    		target = target + source.substring(oldindex, index);
    		target = target + voweltones[i];
    		oldindex = index + vowelnums[i].length();
    		index = source.indexOf(vowelnums[i], oldindex);
    		foundvowel = true;
    	    }
    	    target = target + source.substring(oldindex, source.length());
    	    source = target;
    	}

    	if (!foundvowel) {
    	    target = withnumbers;
    	}
    	return target;
	}
}
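The keyword idea in createKeyWordVector above (index the pronunciation once with and once without tone numbers) could be carried over to the bracket-less Cantonese format quoted earlier in the thread. The following is only a sketch under assumptions: the class and method names are hypothetical, it does not extend DictionaryUpdate or call DictionaryUpdateLib, and it assumes that every romanized syllable consists of ASCII letters followed by a single tone digit from 1 to 6.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; not part of DictionaryForMIDs.
class CantoneseKeywordSketch {

    // Splits a column such as "岐山 kei4 saan1" into the Chinese characters and
    // the romanization, then returns the keywords to index: the characters,
    // the romanization with tone numbers, and the romanization without them.
    static List<String> createKeywords(String expression) {
        StringBuilder characters = new StringBuilder();
        StringBuilder romanization = new StringBuilder();
        for (String token : expression.trim().split("\\s+")) {
            // Assumption: a romanized syllable is ASCII letters plus one tone digit.
            boolean isSyllable = token.matches("[A-Za-z]+[1-6]");
            StringBuilder target = isSyllable ? romanization : characters;
            if (target.length() > 0) {
                target.append(' ');
            }
            target.append(token);
        }
        List<String> keywords = new ArrayList<>();
        if (characters.length() > 0) {
            keywords.add(characters.toString());
        }
        if (romanization.length() > 0) {
            String withNumbers = romanization.toString().toLowerCase();
            keywords.add(withNumbers);
            keywords.add(withNumbers.replaceAll("[1-6]", ""));
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Yields three keywords: the characters, "kei4 saan1" and "kei saan".
        System.out.println(createKeywords("岐山 kei4 saan1"));
    }
}
```

Because the tone-less form is added as an extra keyword rather than replacing the numbered one, a search for "kei4 saan1" would still be exact, while "kei saan" would find the entry regardless of tone.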

Sebastian

Cryme · 02. January 2008, 22:21:36

Okay well thanks for all the info. It was very helpful.

If anyone wants to help me make a Normation Class or Update Class, I'll tell you what I need. It's very simple. My input database is in the format I specified in my previous message (but I can change it if need be). The rules I'd like to implement are:

1) If the user enters a string without a number, the search returns all words with a number from 1 to 6 at the end.
e.g. searching "gai" returns all of "gai1", "gai2", "gai3", "gai4", "gai5", and "gai6"
2) Rule (1) applies to strings containing multiple words... essentially, split the search string into individual words first.
e.g. searching "gai bei" returns all of "gai1 bei2", "gai3 bei3", "gai6 bei4", etc.
3) [optional] If one of the words the user enters begins with "l" (lower case L), the search returns words that begin with "l" as well as those beginning with "n". (This rule is specific to Cantonese.) It is NOT necessary to go the other way around (n->l).
e.g. searching "leui" returns "leui4", "leui1", "neui5", "neui2", etc.

These rules are for the Cantonese 6-tone numeral Yale romanization system. If anyone can help please let me know.
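To make the three rules concrete, here is a minimal sketch of them as a matching function. It is not a DictionaryForMIDs Normation or Update class, and all names are hypothetical. It assumes that a search term and an entry must have the same number of words, and that a search word which does include a tone number should match only that tone.

```java
// Hypothetical sketch of rules 1-3; not DictionaryForMIDs code.
class YaleQueryMatchSketch {

    // Compares one search word with one entry word such as "gai3".
    static boolean wordMatches(String queryWord, String entryWord) {
        String q = queryWord.toLowerCase();
        String e = entryWord.toLowerCase();
        // Rule 1: without a tone number in the search word, ignore the entry's tone.
        boolean queryHasTone = q.matches(".*[1-6]");
        String candidate = queryHasTone ? e : e.replaceAll("[1-6]$", "");
        if (q.equals(candidate)) {
            return true;
        }
        // Rule 3: an initial "l" in the search word also matches an initial "n"
        // in the entry, but not the other way around.
        return q.startsWith("l") && ("n" + q.substring(1)).equals(candidate);
    }

    // Rule 2: split both strings into words and compare them pairwise.
    static boolean matches(String query, String entry) {
        String[] queryWords = query.trim().split("\\s+");
        String[] entryWords = entry.trim().split("\\s+");
        if (queryWords.length != entryWords.length) {
            return false;
        }
        for (int i = 0; i < queryWords.length; i++) {
            if (!wordMatches(queryWords[i], entryWords[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("gai", "gai3"));          // true  (rule 1)
        System.out.println(matches("gai bei", "gai1 bei2")); // true  (rule 2)
        System.out.println(matches("leui", "neui5"));        // true  (rule 3)
        System.out.println(matches("neui", "leui4"));        // false (n->l not applied)
    }
}
```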

If not, it's no big deal. I'd appreciate any help I can get though!

Ryan