Large size of files generated by DictionaryGeneration / how to reduce size

Started by Gert, 03. May 2010, 20:21:38


jn0101

Just checked again. Indeed, the duplication goes away if I remove all nested {{ and }}s.

Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Jacob

Gert

Quote
Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Hmmm, yes, probably indeed for most [0x...], these need to be excluded from indexing. But really not all of them..

Hmmmmm, I could add an option such as 'omitAllContentFromIndex'. Hmmmmmmmm, yet another option; I am trying to make things easier for people, not to make them learn so many options. Also, right now I have little time for adding features like that.

Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?
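
For readers who prefer Java over sed, the replacement described above can be sketched as follows. This is only an illustration, not part of DictionaryGeneration: the class name `BracketWrapper` and the method `wrapBrackets` are made up for this example, and it assumes bracketed blocks are not nested inside each other. It wraps every block, so it only fits dictionaries where all [0x ...] content should stay out of the index.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of DictionaryForMIDs.
final class BracketWrapper {

    // One bracketed block such as "[01jue2 yuan2]" that is not already
    // enclosed in {{ and }}. Nested brackets are not handled.
    private static final Pattern BRACKETED =
            Pattern.compile("(?<!\\{\\{)\\[[^\\[\\]]*\\](?!\\}\\})");

    // Replaces "[...]" with "{{[...]}}" everywhere in the line.
    static String wrapBrackets(String line) {
        // $0 in the replacement refers to the whole matched block.
        return BRACKETED.matcher(line).replaceAll("{{$0}}");
    }

    public static void main(String[] args) {
        System.out.println(wrapBrackets("izoli [01jue2 yuan2]"));
    }
}
```

Because blocks that are already wrapped are skipped, running the helper twice over the same file should not add a second pair of braces.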

Best regards,
Gert


jn0101

Quote from: Gert on 05. June 2010, 14:49:55
Quote
Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Hmmm, yes, probably indeed for most [0x...], these need to be excluded from indexing. But really not all of them..

Hey, here's an idea: why not add an option in the properties to omit specific tags? For example:
language2Content01OmitFromIndex: true
language2Content02OmitFromIndex: true
language2Content04OmitFromIndex: true

Then the [03 ...] content would be indexed and the rest would not. That would probably cover almost all needs regarding which marked text is indexed and which is not.
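
A rough sketch of what such a per-content filter could do is shown below. The `languageNContentNNOmitFromIndex` properties are only a proposal in this thread, so nothing here reflects an existing DictionaryGeneration option; the class `ContentFilter` and the method `removeOmitted` are hypothetical names, and the sketch assumes every content block starts with "[" followed by a two-digit content number.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the proposed option, not an existing feature.
final class ContentFilter {

    // "[" + two-digit content number + anything up to the next "]".
    private static final Pattern CONTENT = Pattern.compile("\\[(\\d\\d)[^\\]]*\\]");

    // Returns the text with every [NN ...] block removed whose number NN
    // is in 'omitted'; all other text is kept unchanged.
    static String removeOmitted(String text, Set<Integer> omitted) {
        Matcher m = CONTENT.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean omit = omitted.contains(Integer.valueOf(m.group(1)));
            // quoteReplacement keeps '$' and '\' in kept blocks literal.
            m.appendReplacement(out, omit ? "" : Matcher.quoteReplacement(m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<Integer> omitted = new HashSet<>(Arrays.asList(1, 2, 4));
        System.out.println(removeOmitted("uzo [01noun] [03use] [04note]", omitted));
    }
}
```

The filtered string would only be used for building index keys; the dictionary entry itself would keep all of its content.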


Quote from: Gert on 05. June 2010, 14:49:55
Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?

Yes, that'd work just fine for me, also :-)

Jacob

Gert

Quote
Hey, here's an idea: why not add an option in the properties to omit specific tags? For example:
language2Content01OmitFromIndex: true
language2Content02OmitFromIndex: true
language2Content04OmitFromIndex: true

Then the [03 ...] content would be indexed and the rest would not. That would probably cover almost all needs regarding which marked text is indexed and which is not.

I had that idea too !! But right now I'd prefer the second option below, because (1) it does not introduce another option and (2 - more important) it is easier to implement.

Quote
Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?

Yes, that'd work just fine for me, also :-)

That should be done rather quickly. Let me see if I can do that right now (either I will do it now or in one month ...).

Best regards,
Gert

Gert

I just committed an update to DictionaryUpdate to the SVN repository; that update implements nested {{ and }}. Honestly speaking I did this update a little in a hurry ...

Due to time constraints, unfortunately I will not be able to implement any features during the next three weeks (or do other development tasks). Well, I should be able to read the forum posts at least occasionally.

Fortunately we have other active members, most of all Jeff, Achim and you [names are not ordered ;) ] :)

Best regards,
Gert

dreamingsky

Something in de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateCEDICTChi breaks the {{}}.

Here is the entry:
1) tute sin izoli de 2)  {{[電/电]}}  izoli | ~體/体 izolilo; dielektriko [Tab] {{絕緣}} [01jue2 yuan2]

If I use DictionaryUpdateCEDICTChi, then DfM incorrectly shows the {{}} on the display screen [see "broken.png"].

If I don't use DictionaryUpdateCEDICTChi, then the {{}} is correctly not shown [see "fixed.png"].

Also, the letters inside {{}} are getting indexed when using DictionaryUpdateCEDICTChi.
Here is the entry:
1) uniformo 2) subigi; submeti; subjugigi [Tab] {{制服}} [01zhi4 fu2]
1) uniformo 2) subigi; submeti; subjugigi [Tab] 制服 {{[01zhifu]}}

Here is the indexChi1.csv:
制服 {{   1-779-B
制服}}   1-708-B

Only one of the words should be indexed.  Also, there is still the {{ or }} in the entry.  These should not be there.

If I don't use DictionaryUpdateCEDICTChi, then the words are correctly not indexed.
Here is the indexChi1.csv:
制服   1-717-B

Jeff

Gert

Jeff,

yes, there is a problem that I spotted in DictionaryUpdateCEDICTChi.

I have to run now, so I was only able to make a build, which I could not test (I am very busy starting this week). If you'd like to give it a try: http://www.kugihan.de/dict/download/test_versions/3.5.3/DictionaryForMIDs_DictionaryGeneration_3.5.3_development.zip

Thanks for your bug report !!
Gert

dreamingsky


jn0101

I am still suffering from very large indexes on the Esperanto side of the Danish-Esperanto dictionary. I was able to halve the size of my dictionary, from 3184999 bytes to 1575605 bytes, with one simple step, namely excluding multi-word expressions from being indexed:

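import java.util.Vector;
// Imports for the DictionaryForMIDs classes (DictionaryUpdate, IndexKeyWordEntry,
// DictionaryException) are omitted here; use the package names from your source tree.

public class DictionaryUpdateEpo extends DictionaryUpdate {

   // Removes every keyword that contains a space, so that multi-word
   // expressions do not get index entries of their own.
   public void updateKeyWordVector(Vector keyWordVector)
            throws DictionaryException {

      int elementCount = 0;
      // A vector with a single keyword is left untouched.
      if (keyWordVector.size() > 1) {
         do {
            String keyWord = ((IndexKeyWordEntry) keyWordVector.elementAt(elementCount)).keyWord;
            //System.err.println("keyWord = " + keyWord);
            if (keyWord.contains(" ")) {
               // Do not advance: the next element has moved into this slot.
               keyWordVector.removeElementAt(elementCount);
            }
            else {
               ++elementCount;
            }
         }
         while (elementCount < keyWordVector.size());
      }
   }
}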



BTW the printouts look strange. No wonder the total size is doubled...

keyWord = uzo } uzateco {{}} por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzateco {{}} por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = itr}} esti uzata {{}} bieno


I didn't try with the 3.5.3 stuff.

Jacob

Gert

Quote
I was able to halve the size of my dictionary, from 3184999 bytes to 1575605 bytes, with one simple step, namely excluding multi-word expressions from being indexed:

Hmmm, there is dictionaryGenerationLanguageXExpressionSplitString, which I guess does not yet support " " as separator character.

Your printout shows well what happens if long phrases are indexed: each part of the phrase needs to be put in the index so that it can be retrieved quickly when a user searches for it.

Example:
Phrase is "this is an explanatory text"
Will generate the following 5 index entries:
this is an explanatory text
is an explanatory text
an explanatory text
explanatory text
text


This is most likely not desirable (it makes the index explode in size and is likely to produce undesired search results). So the phrase needs to be put in {{ and }}:
{{this is an explanatory text}}
No index entry will be generated for that.
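
The example above can be reproduced with a few lines of Java. This is a simplified sketch, not the code DictionaryGeneration actually runs: it assumes words are separated by whitespace only, and `PhraseSuffixes` and `suffixEntries` are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of why indexed phrases inflate the index.
final class PhraseSuffixes {

    // Returns one index key per word: the phrase starting at that word
    // and running to the end of the phrase.
    static List<String> suffixEntries(String phrase) {
        List<String> entries = new ArrayList<>();
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return entries;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            entries.add(String.join(" ", Arrays.copyOfRange(words, start, words.length)));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (String entry : suffixEntries("this is an explanatory text")) {
            System.out.println(entry);
        }
    }
}
```

A phrase of n words yields n keys, and since each key repeats the tail of the phrase, the total key text grows roughly with the square of the phrase length. That matches the long keyWord printouts earlier in the thread.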

About the {{ }} in your printout, I think those should not show up. It should be checked what went wrong there.

Regards,
Gert

jn0101

I've shipped the dictionary and won't look at this any more right now.

Gert, at some time in the future when you have time, I'd be happy to go into this and find out why the indexes were so large.

Jacob

Gert

Quote
Gert, at some time in the future when you have time, I'd be happy to go into this and find out why the indexes were so large.

Oh, maybe I misunderstood: I thought you solved the problem with the large indexes ?

When long phrases (i.e. several words) are indexed, the result is a big index; this is normal. The solution is to put phrases into {{ and }}. Well, I probably did not really understand what should go into your index and what should not.

Regards,
Gert

dreamingsky

I guess I'm still having problems with the {{}} too.

Here is the input dictionary file:
{{懶洋洋}} [01 lan3 yang2 yang2]     malvigla; langvora
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
{{馬虎}} [01 ma3 hu]     malzorga
馬虎 / 马虎 {{[01 mahu]}}     malzorga
{{哺乳}} [01 bu3 ru3]     mamnutri
哺乳 {{[01 buru]}}     mamnutri
乳房 {{[01 rufang]}}     mamo
{{拜金主義}} [01 bai4 jin1 zhu3 yi4]     mamonismo; monoadorado
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado

Here is the index:
bai jin zhu yi     1-282-B
bai4 jin1 zhu3 yi4     1-282-B
baijinzhuyi     1-346-B,1-346-B,1-346-B
bu ru     1-191-B
bu3 ru3     1-191-B
buru     1-224-B,1-224-B,1-224-B
bài jīn zhǔ yì     1-282-B
bǔ rǔ     1-191-B
hu     1-120-S,1-120-S,1-120-S
jin zhu yi     1-282-S
jin1 zhu3 yi4     1-282-S
jīn zhǔ yì     1-282-S
lan yang yang     1-0-B
lan3 yang2 yang2     1-0-B
lanyangyang     1-58-B,1-58-B,1-58-B
lǎn yáng yáng     1-0-B
ma hu     1-120-B
ma3 hu     1-120-B
mahu     1-152-B,1-152-B,1-152-B
mǎ hu     1-120-B
ru     1-191-S
ru3     1-191-S
rufang     1-254-B,1-254-B,1-254-B
rǔ     1-191-S
yang     1-0-S
yang yang     1-0-S
yang2     1-0-S
yang2 yang2     1-0-S
yi     1-282-S
yi4     1-282-S
yáng     1-0-S
yáng yáng     1-0-S
yì     1-282-S
zhu yi     1-282-S
zhu3 yi4     1-282-S
zhǔ yì     1-282-S
乳房 {{     1-254-B
哺乳 {{     1-224-B
哺乳}}     1-191-B
懒洋洋 {{     1-58-S
懶洋洋 / 懒洋洋 {{     1-58-B
懶洋洋}}     1-0-S
拜金主义 {{     1-346-S
拜金主義 / 拜金主义 {{     1-346-B
拜金主義}}     1-282-B
馬虎 / 马虎 {{     1-152-B
馬虎}}     1-120-B
马虎 {{     1-152-S
{{懶洋洋}}     1-0-B


All of the pinyin transcriptions are OK: none have extra {{ or }}.  But all of the characters show extra {{ or }}.  I'm not sure if this is still a problem with DictionaryUpdateCEDICTChi.java, or if it is a different problem.

Also, it is strange that {{懶洋洋}} is in the index.  None of the other characters have both {{ and }}.  Maybe the problem is because it is the first word in the file?  Maybe the UTF-8 byte order mark (BOM) is causing a problem?
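
One way to test the BOM guess is to strip a leading U+FEFF from the first line before the file is processed and see whether {{懶洋洋}} still ends up in the index. Java's UTF-8 decoder does not drop the BOM on its own, so it arrives as the first character of the first line. The sketch below is only a diagnostic aid under that assumption; `BomStripper` and `stripBom` are hypothetical names, and it is not confirmed that the BOM is the actual cause.

```java
// Hypothetical diagnostic helper, not part of DictionaryForMIDs.
final class BomStripper {

    // The UTF-8 byte order mark (bytes EF BB BF) decodes to this character.
    private static final char BOM = '\uFEFF';

    // Returns the line without a leading BOM; other lines are returned as is.
    static String stripBom(String firstLine) {
        if (!firstLine.isEmpty() && firstLine.charAt(0) == BOM) {
            return firstLine.substring(1);
        }
        return firstLine;
    }

    public static void main(String[] args) {
        String line = BOM + "{{word}} [01 pinyin]";
        System.out.println(stripBom(line).startsWith("{{"));
    }
}
```

If the first entry is indexed correctly once the BOM is gone, the leading "{{" was probably not recognised because it was not at the very start of the line.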

Also, "lanyangyang" and "baijinzhuyi" should not be indexed.  They are inside {{}}:
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado
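
For comparison, the expected behaviour (nothing between {{ and }} reaches the index, nested pairs included) can be written down as a small depth-counting routine. This is a sketch of the intended result only, not the code in DictionaryUpdate or DictionaryUpdateCEDICTChi; `IndexExclusion` and `removeExcluded` are names made up for this example.

```java
// Sketch of the expected {{ }} handling, not the actual implementation.
final class IndexExclusion {

    // Returns the text with everything between {{ and }} removed.
    // Nested pairs are tracked with a depth counter; a stray }} without
    // a matching {{ is dropped as well.
    static String removeExcluded(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("}}", i)) {
                if (depth > 0) {
                    depth--;
                }
                i += 2;
            } else {
                if (depth == 0) {
                    out.append(text.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeExcluded("mahu {{[01 mahu]}} malzorga"));
    }
}
```

Applied to the two lines quoted above, a routine like this would leave only the characters as index material, so neither "lanyangyang" nor "baijinzhuyi" would be indexed and no key would end in "{{".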

Jeff

Gert

Jeff,

thanks to your attached files it will be easy to check that problem ! I put it as number one on my to-do list, but I am not sure when I will find time to look at it.

Best regards,
Gert

dreamingsky