DictionaryForMids Forum

DfM-Creator => DfM-Creator - DictionaryGeneration => Topic started by: Gert on 03. May 2010, 20:21:38

Title: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 03. May 2010, 20:21:38
Colleagues,

some people who ran DictionaryGeneration found that the generated files are very big. For example, if the inputdictionaryfile is 2 MB, the generated files can be > 10 MB. In that case the index files very probably contain unnecessary information.

To illustrate the problem, here is an example with a line from the inputdictionaryfile:

sleep  The state of reduced consciousness of a human or animal[tab]Schlaf  Zustand der Ruhe eines Tieres oder Menschen

Note: [tab] is for the tab-separator character.

Here, without additional information, DictionaryGeneration will index all expressions that are included in the explanatory texts (e.g. "The state of reduced consciousness of a human or animal"). This is undesirable.

The solution is to use a DictionaryUpdate-class that avoids including the unnecessary indexes for the explanatory texts. In simple cases you can use the class DictionaryUpdatePartialIndex. If you set DictionaryUpdatePartialIndex as DictionaryUpdateClass, then the text between {{ and }} will not be included in the index.

In the example:
sleep  {{The state of reduced consciousness of a human or animal}}[tab]Schlaf  {{Zustand der Ruhe eines Tieres oder Menschen}}

And put in DictionaryForMIDs.properties these two lines:
language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex
language2DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex

Then the size of the generated files will collapse. For an inputdictionaryfile with lines as in the above example, the compressed result will likely be below 2 MB.
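
The effect of the {{ and }} markers can be illustrated with a minimal Java sketch. This is not the code of DictionaryUpdatePartialIndex; the class and method names are made up for illustration, and nesting of {{ and }} is deliberately not handled.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of DictionaryGeneration.
final class PartialIndexSketch {
    // Non-greedy match of one {{ ... }} span (no nesting support).
    private static final Pattern OMITTED = Pattern.compile("\\{\\{.*?\\}\\}");

    /** Returns the part of one dictionary column that would still be indexed. */
    static String indexableText(String column) {
        String withoutMarked = OMITTED.matcher(column).replaceAll(" ");
        return withoutMarked.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String english = "sleep  {{The state of reduced consciousness of a human or animal}}";
        String german = "Schlaf  {{Zustand der Ruhe eines Tieres oder Menschen}}";
        System.out.println(indexableText(english)); // sleep
        System.out.println(indexableText(german));  // Schlaf
    }
}
```

Only "sleep" and "Schlaf" remain as index candidates; the explanatory texts are still part of the dictionary entry, they are just not indexed.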

For advanced information on the DictionaryUpdate-class read here: http://dictionarymid.sourceforge.net/newdictDictionaryUpdate.html (you do not need to read this if you use DictionaryUpdatePartialIndex)

Regards,
Gert


Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 05. May 2010, 12:43:42
The DictionaryUpdatePartialIndex DictionaryUpdateClass is very useful.  I've used it on several dictionaries.  I was wondering if the code could be added to the default DfM code instead of a separate UpdateClass?

If users want to use a custom DictionaryUpdateClass such as DictionaryUpdateThaiNIUEng or DictionaryUpdateEDICTJpn, then they can't use DictionaryUpdatePartialIndex without referencing it in the new UpdateClass.

We use brackets [] for ContentDeclarations.  So maybe the curly brackets {{}} could be handled similarly in DfM.

Then users could write a dictionary using {{}} without having to reference anything in the DictionaryForMIDs.properties.

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 05. May 2010, 18:58:40
Understand, yes, that would be very useful.

I will implement this as you suggest !

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 06. May 2010, 02:13:52
Thank you very much

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. May 2010, 12:33:56
I implemented an updated version of DictionaryGeneration, you can download it here: http://prdownloads.sourceforge.net/dictionarymid/DictionaryForMIDs_DictionaryGeneration_3.5.0.zip?download

DictionaryGeneration now behaves by default as if the DictionaryUpdate-class DictionaryUpdatePartialIndex was used. The class DictionaryUpdatePartialIndex still exists for compatibility reasons, but does not do any processing any more.

If anyone wants to deactivate the omission of indexing between {{ and }}, they can put the following line in DictionaryForMIDs.properties:
dictionaryGenerationOmitParFromIndex: false
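
A small Java sketch of how such a flag could be read, with the omission switched on when the line is absent. It assumes the file can be parsed with java.util.Properties (which accepts "key: value" lines); whether DictionaryGeneration reads the file this way is an assumption, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only, not part of DictionaryGeneration.
final class OmitParFromIndexSketch {
    /** Reads the flag; a missing key is treated as true, matching the default described above. */
    static boolean omitParFromIndex(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("dictionaryGenerationOmitParFromIndex", "true");
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        System.out.println(omitParFromIndex("dictionaryGenerationOmitParFromIndex: false")); // false
        System.out.println(omitParFromIndex("someOtherKey: x"));                             // true
    }
}
```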

@Jeff: could you please let me know, when you get a chance, whether that version works fine for you (no need to hurry). Then I will change the link on the web site from the old 3.1.0 to that version.

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 13. May 2010, 18:29:53
That's great.  Thanks for adding that.  I'm working on updating the Japanese dictionary now.  I'll see if the new DictionaryGeneration works.  I should have time to do it this weekend.

I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. May 2010, 20:01:19
Great !

Quote: I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Actually ... I already have an update on that page in the queue; I should have it online within the next few hours. Of course, any improvement will be welcome - your examples and guides are most valuable for all people who set up dictionaries !!

Best greetings,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. May 2010, 20:44:27
Quote
Quote: I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Actually ... I already have an update on that page in the queue; I should have it online within the next few hours. Of course, any improvement will be welcome - your examples and guides are most valuable for all people who set up dictionaries !!

I just uploaded the updated web page on "Setting up a New Dictionary". The following 2 sections were amended: dictionaryGenerationOmitParFromIndex and "Reducing the size of the generated index files".

Please introduce any improvements at will. Of course, also on any other web pages at dictionarymid.sourceforge.net !

Thanks a lot !
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 15. May 2010, 02:24:09
Gert

I tested the new DictionaryGeneration with the Hindi and Thai dictionaries.  Everything works OK.  Thanks for adding the code.

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 04. June 2010, 17:47:14
I think there is either a problem (bug) or I just don't understand how to use it:

I have for example the line:

hjælpe  {{[051]}}{{[01tr]}} helpi; {{[02understøtte]}} subteni; {{[02assistere]}} asisti; {{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian, interhelpi sin, {{[02også fx: ~s ad med at trække]}} kune tiri, kunlabori tirante; {{[02~ {{[06af]}} med, fx: ~ nogen af med jakken]}} helpi iun demeti la jakon, {{[02~ nogen af med affaldet]}} helpi iun seniĝi pri la rubo; {{[02~ {{[06med/]}} til]}} kunhelpi, asisti; {{[02fx: ~ nogen jakken {{[06på]}}]}} helpi iun surmeti la jakon; {{[02fx: ~ {{[06til]}} en stilling]}} helpi dungiĝi; {{[02{{[06stå]}} til at ~]}} esti helpebla/ savebla; {{[052]}}{{[01tr]}} {{[02~ hen, fx: ~ frem]}} helpi antaŭeniri/ progresi, {{[02~ ind i huset]}} helpi veni en la domon, {{[02~ op]}} helpi stariĝi/ supreniri, {{[02~ op på hesten]}} helpi surĉevaliĝi; {{[02se også: komme, gå]}}; {{[053]}}{{[02gavne]}} {{[01itr]}} utili, {{[02det kan ikke ~ noget]}} tio ne utilas, estas senutile, {{[02hvad ~r det?]}} por kio utilas?; {{[02gøre nemmere]}} {{[01tr]}} plifaciligi, {{[02fx: det skal nok ~ på hans forståelse]}} tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon

The string "nogen af med affaldet" is enclosed here as {{[02~ nogen af med affaldet]}}, so it SHOULDN'T be indexed at all when reversing and going Esperanto->Danish.


Anyway, I have this gigantic redundancy. It seems the phrase is included TWENTY-TWO times in the Esperanto index.

$ grep -l "nogen af med affaldet" *
directory58.csv
indexEpo104.csv
indexEpo115.csv
indexEpo118.csv
indexEpo12.csv
indexEpo144.csv
indexEpo152.csv
indexEpo159.csv
indexEpo166.csv
indexEpo170.csv
indexEpo172.csv
indexEpo1.csv
indexEpo26.csv
indexEpo2.csv
indexEpo45.csv
indexEpo60.csv
indexEpo65.csv
indexEpo68.csv
indexEpo7.csv
indexEpo83.csv
indexEpo84.csv
indexEpo86.csv
indexEpo87.csv

$ grep "nogen af med affaldet" *
directory58.csv:hjælpe   [051][01tr] helpi; [02understøtte] subteni; [02assistere] asisti; [02~s [06ad], ~ hinanden] helpi unu la alian, interhelpi sin, [02også fx: ~s ad med at trække] kune tiri, kunlabori tirante; [02~ [06af] med, fx: ~ nogen af med jakken] helpi iun demeti la jakon, [02~ nogen af med affaldet] helpi iun seniĝi pri la rubo; [02~ [06med/] til] kunhelpi, asisti; [02fx: ~ nogen jakken [06på]] helpi iun surmeti la jakon; [02fx: ~ [06til] en stilling] helpi dungiĝi; [02[06stå] til at ~] esti helpebla/ savebla; [052][01tr] [02~ hen, fx: ~ frem] helpi antaŭeniri/ progresi, [02~ ind i huset] helpi veni en la domon, [02~ op] helpi stariĝi/ supreniri, [02~ op på hesten] helpi surĉevaliĝi; [02se også: komme, gå]; [053][02gavne] [01itr] utili, [02det kan ikke ~ noget] tio ne utilas, estas senutile, [02hvad ~r det?] por kio utilas?; [02gøre nemmere] [01tr] plifaciligi, [02fx: det skal nok ~ på hans forståelse] tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon
indexEpo104.csv:med at trække kune tiri kunlabori tirante af med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S


indexEpo104.csv:med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S

indexEpo104.csv:med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 04. June 2010, 19:53:41
@jn0101: You are using version 3.5.0 of DictionaryGeneration ?

Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 04. June 2010, 21:46:01
I'm using the SVN version:

/*
* Note: this class is obsolete starting with DictionaryGeneration 3.5.0 because
* with version 3.5.0 of DictionayGeneration the behaviour from DictionaryUpdatePartialIndex
* is already included in class DictionaryUpdate. The class DictionaryUpdatePartialIndex
* is only retained for compatibility reasons.
*/
package de.kugihan.dictionaryformids.dictgen.dictionaryupdate;


public class DictionaryUpdatePartialIndex extends DictionaryUpdate {
   
}


so.... yes :-)
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 05. June 2010, 07:08:25
I'll look at that !

Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 05. June 2010, 08:37:18
@jn0101
I just noted:

{{[02~s {{[06ad]}}, ~ hinanden]}}

I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem; well, maybe there is an additional issue.

Would it be hard for you to avoid nesting of {{ and }} ?

Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 05. June 2010, 13:26:40
Quote from: Gert on 05. June 2010, 08:37:18
@jn0101
I just noted:

{{[02~s {{[06ad]}}, ~ hinanden]}}

I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem; well, maybe there is an additional issue.

Would it be hard for you to avoid nesting of {{ and }} ?

What I need is to exclude all text in []'s. Therefore I have these replacements as a part of the preprocessing:

# - ni enmetu {{ kaj }} por eviti indeksigon de io en [ kaj ].
   sed 's/\[/{{[/g' |
   sed 's/\]/]}}/g' |

I think it would take quite some time to exclude nesting, as the [ and ]'s are nested.
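
For readers who preprocess in Java rather than with sed, the two replacements above can be sketched as follows. The class and method names are made up; the sketch also shows why nested [ and ] automatically produce nested {{ and }}.

```java
// Hypothetical illustration only: Java equivalent of the two sed replacements.
final class BracketWrapSketch {
    /** Wraps every [ ... ] content tag in {{ ... }} so that it is left out of the index. */
    static String wrapContentTags(String line) {
        return line.replace("[", "{{[").replace("]", "]}}");
    }

    public static void main(String[] args) {
        String input = "[02~s [06ad], ~ hinanden] helpi unu la alian";
        // prints: {{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian
        System.out.println(wrapContentTags(input));
    }
}
```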

Anyway the phrase {{[02~ nogen af med affaldet]}} does not have nestings, so I cant see how it should happen.
I might try with only this line and with no nesting... but I expect it to make no difference....

Jacob
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 05. June 2010, 13:43:35
Just checked again. Indeed the duplication goes away if I remove all nested {{ and }}s.

Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Jacob
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 05. June 2010, 14:49:55
Quote: Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Hmmm, yes, probably indeed for most [0x...], these need to be excluded from indexing. But really not all of them..

Hmmmmm, I could add an option such as 'omitAllContentFromIndex'. Hmmmmmmmm, even another option, I am trying to make things easier for people and not to make them understand so many options. Also, right now I have little time for adding features like that.

Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?

Best regards,
Gert

Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 06. June 2010, 00:32:19
Quote from: Gert on 05. June 2010, 14:49:55
Quote: Hmm. What I really need is just to remove everything in [0x ..  ]'s. Is there no easy way to do this?
I'd expect most dictionaries need exactly this, to filter out explanations etc...

Hmmm, yes, probably indeed for most [0x...], these need to be excluded from indexing. But really not all of them..

Hey, here's an idea: Why not make an option in the properties to omit specific tags? For example,
language2Content01OmitFromIndex: true
language2Content02OmitFromIndex: true
language2Content04OmitFromIndex: true

then [03 ...] stuff would be indexed and the rest not. That would probably solve almost all needs wrt marked text going/not going to be indexed.
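
A rough Java sketch of what such per-tag omission could look like. The languageXContentNNOmitFromIndex properties are only a proposal in this thread and are not implemented in DictionaryGeneration; the class name, the method, and the rule that everything nested inside an omitted tag is omitted too are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical illustration only: the proposed option does not exist in DictionaryGeneration.
final class OmitByTagSketch {
    /** Returns the text outside all [NN ...] blocks whose two-digit tag NN is in omittedTags. */
    static String indexableText(String text, Set<String> omittedTags) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> openTags = new ArrayDeque<>(); // one entry per open '[': is this tag omitted?
        int omitDepth = 0;                            // number of enclosing omitted tags
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '[' && i + 2 < text.length()
                    && Character.isDigit(text.charAt(i + 1))
                    && Character.isDigit(text.charAt(i + 2))) {
                boolean omit = omittedTags.contains(text.substring(i + 1, i + 3));
                openTags.push(omit);
                if (omit) {
                    omitDepth++;
                }
                i += 3; // skip "[NN"
            } else if (c == ']' && !openTags.isEmpty()) {
                if (openTags.pop()) {
                    omitDepth--;
                }
                i++;
            } else {
                if (omitDepth == 0) {
                    out.append(c);
                }
                i++;
            }
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String entry = "[051][01tr] helpi; [02~s [06ad], ~ hinanden] helpi unu la alian";
        // prints: helpi; helpi unu la alian
        System.out.println(indexableText(entry, Set.of("01", "02", "05")));
    }
}
```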


Quote from: Gert on 05. June 2010, 14:49:55
Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?

Yes, that'd work just fine for me, also :-)

Jacob
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 06. June 2010, 06:08:03
Quote: Hey, here's an idea: Why not make an option in the properties to omit specific tags? For example,
language2Content01OmitFromIndex: true
language2Content02OmitFromIndex: true
language2Content04OmitFromIndex: true

then [03 ...] stuff would be indexed and the rest not. That would probably solve almost all needs wrt marked text going/not going to be indexed.

I had that idea too !! But right now I'd prefer the second option below, because (1) it does not introduce another option and (2 - more important) it is easier to implement.

Quote: Hey, what if nested {{ and }} were supported; that would solve your problem ? People easily could replace [ with {{[ and ] with ]}} as you do with sed. That would work for you, right ?

Yes, that'd work just fine for me, also :-)

That should be done rather quickly. Let me see if I can do that right now (either I will do it now or in one month ...).

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 06. June 2010, 12:27:39
I just committed an update to DictionaryUpdate to the SVN repository; that update implements nested {{ and }}. Honestly speaking I did this update a little in a hurry ...
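
Nested {{ and }} can be handled with a simple depth counter. The following Java sketch is not the code committed to the SVN repository; it only illustrates the idea with made-up names. A matcher that does not know about nesting would stop at the first }} and leave ", ~ hinanden]}}" behind as text to index, which fits the duplication reported above.

```java
// Hypothetical illustration only, not the committed DictionaryUpdate code.
final class NestedBracesSketch {
    /** Returns the text that lies outside every (possibly nested) {{ ... }} span. */
    static String indexableText(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many {{ are currently open
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                if (depth == 0) {
                    out.append(text.charAt(i));
                }
                i++;
            }
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String entry = "{{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian";
        System.out.println(indexableText(entry)); // helpi unu la alian
    }
}
```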

Due to time constraints, unfortunately I will not be able to implement any features during the next three weeks (or do other development tasks). Well, I should be able to read the forum posts at least occasionally.

Fortunately we have other active members, most of all Jeff, Achim and you [names are not ordered ;) ] :)

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 08. June 2010, 04:24:55
Something in de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateCEDICTChi breaks the {{}}.

Here is the entry:
1) tute sin izoli de 2)  {{[電/电]}}  izoli | ~體/体 izolilo; dielektriko [Tab] {{絕緣}} [01jue2 yuan2]

If I use DictionaryUpdateCEDICTChi, then DfM incorrectly shows the {{}} on the display screen [see "broken.png"].

If I don't use DictionaryUpdateCEDICTChi, then the {{}} is correctly not shown [see "fixed.png"].

Also, the letters inside {{}} are getting indexed when using DictionaryUpdateCEDICTChi.
Here is the entry:
1) uniformo 2) subigi; submeti; subjugigi [Tab] {{制服}} [01zhi4 fu2]
1) uniformo 2) subigi; submeti; subjugigi [Tab] 制服 {{[01zhifu]}}

Here is the indexChi1.csv:
制服 {{   1-779-B
制服}}   1-708-B

Only one of the words should be indexed.  Also, there is still the {{ or }} in the entry.  These should not be there.

If I don't use DictionaryUpdateCEDICTChi, then the words are correctly not indexed.
Here is the indexChi1.csv:
制服   1-717-B

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 08. June 2010, 05:45:51
Jeff,

yes, there is a problem that I spotted in DictionaryUpdateCEDICTChi.

I have to run now, so I was only able to make a build which I could not test (I am very busy starting this week). If you would like to give it a try: http://www.kugihan.de/dict/download/test_versions/3.5.3/DictionaryForMIDs_DictionaryGeneration_3.5.3_development.zip

Thanks for your bug report !!
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 08. June 2010, 07:44:12
That solved the problem.  Thank you very much.

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 09. June 2010, 20:35:11
I am still suffering from very large indexes on the Esperanto side in the Danish-Esperanto dictionary. I was able to halve the size of my dictionary, from 3184999 bytes to 1575605 bytes, by one simple step, namely excluding multiwords from being indexed:

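import java.util.Vector;

// IndexKeyWordEntry and DictionaryException come from the DictionaryGeneration sources.
public class DictionaryUpdateEpo extends DictionaryUpdate {

   // Drops every keyword that contains a space, so that only single words are indexed.
   // A vector with just one keyword is left untouched.
   public void updateKeyWordVector(Vector keyWordVector)
            throws DictionaryException {

      int elementCount = 0;
      if (keyWordVector.size() > 1) {
         do {
            String keyWord = ((IndexKeyWordEntry) keyWordVector.elementAt(elementCount)).keyWord;
            //System.err.println("keyWord = " + keyWord);
            if (keyWord.contains(" ")) {
               // removal shifts the following elements down, so the position is not advanced
               keyWordVector.removeElementAt(elementCount);
            }
            else {
               ++elementCount;
            }
         }
         while (elementCount < keyWordVector.size());
      }
   }
}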



BTW the printouts look strange. No wonder the total size is doubled....

keyWord = uzo } uzateco {{}} por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzateco {{}} por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = por privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = privata uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzo {{}} {{tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzi, fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = fari uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = uzon el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = el {{}} {{itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = itr}} eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = eluziĝi, elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = elmodiĝi {{}} {{tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = bezoni {{}} {{tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = tr}} ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = ekuzi {{}} {{itr}} esti uzata {{}} bieno
keyWord = itr}} esti uzata {{}} bieno


I didn't try with the 3.5.3 stuff.

Jacob
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 10. June 2010, 17:53:22
Quote: I was able to halve the size of my dictionary, from 3184999 bytes to 1575605 bytes, by one simple step, namely excluding multiwords from being indexed:
Hmmm, there is dictionaryGenerationLanguageXExpressionSplitString, which I guess does not yet support " " as separator character.

Your printout shows well what happens if long phrases are indexed: each part of the phrase needs to be put in the index so that it can be retrieved quickly when a user searches for it.

Example:
Phrase is "this is an explanatory text"
Will generate the following 5 index entries:
this is an explanatory text
is an explanatory text
an explanatory text
explanatory text
text


Which is most likely not desirable (it will make the index explode in size and likely produce undesired search results). So it needs to be put in {{ and }}
{{this is an explanatory text}}
No index entry will be generated for that.
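
The growth described above can be reproduced with a few lines of Java. This is only a sketch of the principle with hypothetical names, not the indexing code of DictionaryGeneration: a phrase of n words yields n index entries, one per word position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only, not part of DictionaryGeneration.
final class PhraseSuffixSketch {
    /** Returns one index entry per word: the phrase starting at that word. */
    static List<String> indexEntries(String phrase) {
        List<String> entries = new ArrayList<>();
        if (phrase.trim().isEmpty()) {
            return entries;
        }
        String[] words = phrase.trim().split("\\s+");
        for (int start = 0; start < words.length; start++) {
            entries.add(String.join(" ", Arrays.copyOfRange(words, start, words.length)));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Prints the 5 entries listed above, from "this is an explanatory text" down to "text".
        for (String entry : indexEntries("this is an explanatory text")) {
            System.out.println(entry);
        }
    }
}
```

Because every entry repeats the rest of the phrase, the total index text grows roughly with the square of the phrase length, which is why long explanatory texts inflate the index so much.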

About the {{ }} in your printout, I think those should not show up. Should be checked what went wrong there.

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 11. June 2010, 17:05:07
I've shipped the dictionary and won't look at this any more right now.

Gert, at some time in the future when you have time, I'd be happy to go into this and find out why the indexes were so large.

Jacob
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 11. June 2010, 18:37:24
Quote: Gert, at some time in the future when you have time, I'd be happy to go into this and find out why the indexes were so large.

Oh, maybe I misunderstood: I thought you solved the problem with the large indexes ?

When long phrases (i.e. several words) are indexed, then the result is a big index, this is normal. Solution is to put phrases into {{ and }}. Well, probably I did not really understand what should go into your index and what not.

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 13. June 2010, 02:10:42
I guess I'm still having problems with the {{}} too.

Here is the input dictionary file:
{{懶洋洋}} [01 lan3 yang2 yang2]     malvigla; langvora
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
{{馬虎}} [01 ma3 hu]     malzorga
馬虎 / 马虎 {{[01 mahu]}}     malzorga
{{哺乳}} [01 bu3 ru3]     mamnutri
哺乳 {{[01 buru]}}     mamnutri
乳房 {{[01 rufang]}}     mamo
{{拜金主義}} [01 bai4 jin1 zhu3 yi4]     mamonismo; monoadorado
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado

Here is the index:
bai jin zhu yi     1-282-B
bai4 jin1 zhu3 yi4     1-282-B
baijinzhuyi     1-346-B,1-346-B,1-346-B
bu ru     1-191-B
bu3 ru3     1-191-B
buru     1-224-B,1-224-B,1-224-B
bài jīn zhǔ yì     1-282-B
bǔ rǔ     1-191-B
hu     1-120-S,1-120-S,1-120-S
jin zhu yi     1-282-S
jin1 zhu3 yi4     1-282-S
jīn zhǔ yì     1-282-S
lan yang yang     1-0-B
lan3 yang2 yang2     1-0-B
lanyangyang     1-58-B,1-58-B,1-58-B
lǎn yáng yáng     1-0-B
ma hu     1-120-B
ma3 hu     1-120-B
mahu     1-152-B,1-152-B,1-152-B
mǎ hu     1-120-B
ru     1-191-S
ru3     1-191-S
rufang     1-254-B,1-254-B,1-254-B
rǔ     1-191-S
yang     1-0-S
yang yang     1-0-S
yang2     1-0-S
yang2 yang2     1-0-S
yi     1-282-S
yi4     1-282-S
yáng     1-0-S
yáng yáng     1-0-S
yì     1-282-S
zhu yi     1-282-S
zhu3 yi4     1-282-S
zhǔ yì     1-282-S
乳房 {{     1-254-B
哺乳 {{     1-224-B
哺乳}}     1-191-B
懒洋洋 {{     1-58-S
懶洋洋 / 懒洋洋 {{     1-58-B
懶洋洋}}     1-0-S
拜金主义 {{     1-346-S
拜金主義 / 拜金主义 {{     1-346-B
拜金主義}}     1-282-B
馬虎 / 马虎 {{     1-152-B
馬虎}}     1-120-B
马虎 {{     1-152-S
{{懶洋洋}}     1-0-B


All of the pinyin transcriptions are OK: none have extra {{ or }}.  But all of the characters show extra {{ or }}.  I'm not sure if this is still a problem with DictionaryUpdateCEDICTChi.java, or if it is a different problem.

Also, it is strange that {{懶洋洋}} is in the index.  None of the other characters have both {{ and }}.  Maybe the problem is because it is the first word in the file?  Maybe the UTF-8 byte order mark (BOM) is causing a problem?

Also, "lanyangyang" and "baijinzhuyi" should not be indexed.  They are inside {{}}:
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado

Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. June 2010, 08:46:22
Jeff,

thanks to your attached files it will be easy to check that problem ! I did put it as number one on my todo list, but I am not sure when I find time to look at that.

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 13. June 2010, 12:19:35
No problem.  Take your time.
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. June 2010, 19:14:59
I just had a quick look at the code. DictionaryUpdateCEDICTChi is really built to handle exactly entries like this one:
安康 安康 [01an1 kang1],good health


Other entries will cause problems.

Ok, I will make DictionaryUpdateCEDICTChi more flexible when I get a chance.

Keep you informed !
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. June 2010, 21:42:39
Hope that version fixes the problem:
http://prdownloads.sourceforge.net/dictionarymid/DictionaryForMIDs_DictionaryGeneration_3.5.4.zip?download

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 14. June 2010, 10:15:40
Thanks, that looks better.  I still see some problems though.

1. There is still an error with:
{{懶洋洋}} [01 lan3 yang2 yang2]   malvigla; langvora

Before it gave this error:
{{懶洋洋}}     1-0-B  [extra {{}} ]

Now it gives this error:
   1-0-B  [the 懶洋洋 characters are missing]

2. If there are 2 words in the left column, then only the 2nd words are getting indexed:
馬虎 / 马虎 {{[01 mahu]}}   malzorga

Both "馬虎" and "马虎" should be indexed.  But, only "马虎" is indexed.  Here is the index:
馬虎 / 马虎   1-152-B
马虎   1-152-S

I thought if there is a space between words, then they will both be indexed.  This is probably a different problem than the {{}} problem.

I added this line to DictionaryForMIDs.properties, but it didn't fix the problem:
dictionaryGenerationLanguage1ExpressionSplitString: /
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 14. June 2010, 17:24:48
Jeff,

Quote: 1. There is still an error with:
{{懶洋洋}} [01 lan3 yang2 yang2]   malvigla; langvora

Before it gave this error:
{{懶洋洋}}     1-0-B  [extra {{}} ]

Now it gives this error:
   1-0-B  [the 懶洋洋 characters are missing]

The 懶洋洋 is in {{ and }}, so I think it should not get indexed ... or am I wrong ?
The empty index before 1-0-B might be your BOM-character, though I did not fully investigate that yet. (the BOM-character would not harm in the index).


Quote: 2. If there are 2 words in the left column, then only the 2nd words are getting indexed:
馬虎 / 马虎 {{[01 mahu]}}   malzorga

Both "馬虎" and "马虎" should be indexed.  But, only "马虎" is indexed.  Here is the index:
馬虎 / 马虎   1-152-B
马虎   1-152-S

I thought if there is a space between words, then they will both be indexed.  This is probably a different problem than the {{}} problem.
I added this line to DictionaryForMIDs.properties, but it didn't fix the problem:
dictionaryGenerationLanguage1ExpressionSplitString: /

Yes, both need to be indexed, exactly as you describe.

However
馬虎 / 马虎   1-152-B
马虎   1-152-S

looks OK to me at first sight: the "馬虎" is part of the index, although still combined into "馬虎 / 马虎" (DictionaryGeneration thinks this is a 'phrase' unless you set the dictionaryGenerationLanguage1ExpressionSplitString).

And you are right, you really need to put a dictionaryGenerationLanguage1ExpressionSplitString here !

With the
dictionaryGenerationLanguage1ExpressionSplitString: /
the index should look like
馬虎   1-152-B
马虎   1-152-S

But it does not ?? Ok, then I have to check that !
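
The intended effect of the split string can be sketched in Java as follows. This is not the DictionaryGeneration implementation; the class and method names are hypothetical, and the sketch only shows how one column would be cut into separately indexed expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only, not part of DictionaryGeneration.
final class ExpressionSplitSketch {
    /** Splits a column at every occurrence of splitString and trims each expression. */
    static List<String> splitExpressions(String column, String splitString) {
        List<String> expressions = new ArrayList<>();
        // Pattern.quote makes sure the split string is taken literally, not as a regex.
        for (String part : column.split(Pattern.quote(splitString))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                expressions.add(trimmed);
            }
        }
        return expressions;
    }

    public static void main(String[] args) {
        // Yields two separate expressions: 馬虎 and 马虎
        for (String expression : splitExpressions("馬虎 / 马虎", "/")) {
            System.out.println(expression);
        }
    }
}
```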

Best greetings !
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 15. June 2010, 00:27:33
Quote: The 懶洋洋 is in {{ and }}, so I think it should not get indexed ... or am I wrong ?
The empty index before 1-0-B might be your BOM-character, though I did not fully investigate that yet. (the BOM-character would not harm in the index).

You're right, since the word is in {{}} it won't be indexed.  So the extra "     1-0-B" won't cause any problems in the index.

I changed the first line to this (I removed the {{}} ):
懶洋洋 [01 lan3 yang2 yang2]   malvigla; langvora

Now 懶洋洋 is not in the index.  So I think the BOM must be causing a problem.

Quote: With the
dictionaryGenerationLanguage1ExpressionSplitString: /
the index should look like
馬虎   1-152-B
马虎   1-152-S

Yes, the index should look like that.  I added the SplitString, but the index still looks like this:
馬虎 / 马虎   1-90-B
马虎   1-90-S

I included the newest build files.
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 15. June 2010, 06:34:07
Jeff,

I will exclude the BOM character from the index when I get the chance (I will do it for all dictionaries, not only for the Chinese one). But there is no reason to wait for this, because the indexing algorithm works fine with the BOM character in it.

Quote: Yes, the index should look like that.  I added the SplitString, but the index still looks like this:
馬虎 / 马虎   1-90-B
马虎   1-90-S

I will also look at that when I get the chance. It will likely take a few weeks though, sorry for that.

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 15. June 2010, 09:06:10
Quote: But there is no reason to wait for this, because the indexing algorithm works fine with the BOM character in it.

Actually the BOM causes the 1st word to not be indexed: 懶洋洋.  This is probably true for all dictionaries.  It's a small problem, though.
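
As an illustration, a minimal Java sketch of skipping a leading BOM (U+FEFF) on the first line before it is indexed. This is only an assumption about how it could be handled, not the actual DictionaryGeneration code; BomSketch and stripLeadingBom are made-up names.

```java
// Hypothetical illustration, not part of DictionaryGeneration: removes a
// byte order mark from the start of a line read from a UTF-8 input file.
class BomSketch {

    // The byte order mark as it appears after decoding the file to characters.
    private static final char BOM = '\uFEFF';

    // Returns the line without a leading BOM; other lines are returned unchanged.
    static String stripLeadingBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String firstLine = BOM + "懶洋洋 [01 lan3 yang2 yang2]";
        // After stripping, the line starts with the headword again.
        System.out.println(stripLeadingBom(firstLine).startsWith("懶洋洋"));
    }
}
```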

Quote: Will likely take a few weeks though, sorry for that.

No problem, there's no rush.
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 27. June 2010, 21:20:52
Jeff,

I just had a look at the source code: dictionaryGenerationLanguage1ExpressionSplitString is really not yet supported by DictionaryUpdateCEDICTChi.

I need to find a way to get that incorporated into DictionaryUpdateCEDICTChi.

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 28. June 2010, 04:16:40
No problem.  There is no rush.
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 10. July 2010, 22:42:39
Jeff,

I made an update on DictionaryUpdateCEDICTChi concerning the ExpressionSplitString; but I need to do some testing first.

Fortunately ... no rush ...

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 10. July 2010, 23:44:22
Jeff,

this version should work: http://www.kugihan.de/dict/download/test_versions/for_Jeff/DictionaryGeneration.jar

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 12. July 2010, 08:34:15
Yes, everything works now.  The BOM was removed from the first entry.  The {{}} work correctly.  And the ExpressionSplitString works correctly.

Thank you very much
Jeff
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 12. July 2010, 23:12:15
Sorry, yesterday I looked at indexChi1.csv and everything looked good.  But today I used the dictionary in an emulator.  I found a new problem.  Now all the pinyin transcription is in [ ].

Inside directory1.csv is:
懶洋洋 [01\[ lǎn yáng yáng\]]     malvigla; langvora

It is displayed in the emulator as:
懶洋洋 [ lǎn yáng yang]
malvigla; langvora

It should look like:
懶洋洋  lǎn yáng yang
malvigla; langvora

Can you please remove the [ ] from the transcription display?  There is no rush.

Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 13. July 2010, 05:56:51
Jeff,

I can remove the [ ]. Hmmm, I thought those [ ] had always been there, hadn't they ?

Anyway, I will remove them when I get the chance, no problem with that.

Regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: dreamingsky on 13. July 2010, 07:42:20
The [ ] were only for the content declarations:
[01 ... ]
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 16. July 2010, 20:07:51
Jeff,

I updated DictionaryGeneration: http://www.kugihan.de/dict/download/test_versions/for_Jeff/DictionaryGeneration.jar. This one removes the extra [ ].

Best regards,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 03. August 2010, 20:26:20
@jn0101:

Jacob,

public class DictionaryUpdateEpo extends DictionaryUpdate {

   public void updateKeyWordVector(Vector keyWordVector)
            throws DictionaryException {
[...]
   }
}


In your class DictionaryUpdateEpo, which methods did you implement in addition to updateKeyWordVector ? Maybe you could send me your DictionaryUpdateEpo ?

Best greetings,
Gert
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: jn0101 on 12. August 2010, 14:19:57
Oops, I forgot to add it to SVN. I've fixed that now (revision 313).
Look in DictionaryGeneration/src/de/kugihan/dictionaryformids/dictgen/dictionaryupdate
Title: Re: Large size of files generated by DictionaryGeneration / how to reduce size
Post by: Gert on 14. August 2010, 04:28:48
@jn0101:

Thanks - I will look at your class when I get the chance.

Greetings,
Gert