Topic: Large size of files generated by DictionaryGeneration / how to reduce size

Gert

  • DFM J2ME/Mobile Developer and Project Leader
  • Administrator
  • *****
  • Posts: 862
    • View Profile
    • DictionaryForMIDs
Colleagues,

Some people who ran DictionaryGeneration found that the generated files were very large. For example, a 2 MB inputdictionaryfile produced more than 10 MB of generated files. In such cases the index files very probably contain unnecessary information.

To illustrate the problem, here is an example with a line from the inputdictionaryfile:

sleep  The state of reduced consciousness of a human or animal[tab]Schlaf  Zustand der Ruhe eines Tieres oder Menschen

Note: [tab] is for the tab-separator character.

Here, without additional information, DictionaryGeneration will index all expressions that are included in the explanatory texts (e.g. "The state of reduced consciousness of a human or animal"). This is undesirable.

The solution is to use a DictionaryUpdate-class that avoids including the unnecessary indexes for the explanatory texts. In simple cases you can use the class DictionaryUpdatePartialIndex. If you set DictionaryUpdatePartialIndex as DictionaryUpdateClass, then the text between {{ and }} will not be included in the index.

In the example:
sleep  {{The state of reduced consciousness of a human or animal}}[tab]Schlaf  {{Zustand der Ruhe eines Tieres oder Menschen}}

And put these two lines in DictionaryForMIDs.properties:
language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex
language2DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex

Then the size of the generated files will shrink considerably. For an inputdictionaryfile with lines like the one in the above example, the compressed result will likely be below 2 MB.
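
As a rough illustration of the effect (a minimal sketch, not the actual DictionaryGeneration code; the class and method names are made up for this example), the following Java snippet removes every {{ ... }} span before splitting a column into words. It assumes the markers are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the DictionaryGeneration source.
class PartialIndexSketch {
    // Matches the shortest span from "{{" to the next "}}" (no nesting support).
    private static final Pattern OMITTED = Pattern.compile("\\{\\{.*?\\}\\}");

    // Returns the text that would remain visible to an indexer.
    static String indexableText(String column) {
        return OMITTED.matcher(column).replaceAll("").trim();
    }

    // Splits the remaining text into words, as a stand-in for index keys.
    static List<String> indexWords(String column) {
        List<String> words = new ArrayList<>();
        for (String w : indexableText(column).split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String plain = "sleep  The state of reduced consciousness of a human or animal";
        String marked = "sleep  {{The state of reduced consciousness of a human or animal}}";
        System.out.println(indexWords(plain).size()); // prints 11
        System.out.println(indexWords(marked));       // prints [sleep]
    }
}
```

With the markers in place only "sleep" is left to index, instead of eleven words, which is where the size reduction comes from.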

For advanced information on DictionaryUpdate-class read here: http://dictionarymid.sourceforge.net/newdictDictionaryUpdate.html (you do not need to read this if you use DictionaryUpdatePartialIndex)

Regards,
Gert



dreamingsky

  • Developer
  • *****
  • Posts: 86
    • View Profile
The DictionaryUpdatePartialIndex DictionaryUpdateClass is very useful.  I've used it on several dictionaries.  I was wondering if the code could be added to the default DfM code instead of a separate UpdateClass?

If users want to use a custom DictionaryUpdateClass such as DictionaryUpdateThaiNIUEng or DictionaryUpdateEDICTJpn, then they can't use DictionaryUpdatePartialIndex without referencing it in the new UpdateClass.

We use brackets [] for ContentDeclarations.  So maybe the handling of the curly brackets {{}} could be written in a similar way in DfM.

Then users could write a dictionary using {{}} without having to reference anything in the DictionaryForMIDs.properties.

Jeff

Gert

Understood; yes, that would be very useful.

I will implement this as you suggest!

Best regards,
Gert

dreamingsky

Thank you very much

Jeff

Gert

I implemented an updated version of DictionaryGeneration; you can download it here: http://prdownloads.sourceforge.net/dictionarymid/DictionaryForMIDs_DictionaryGeneration_3.5.0.zip?download

DictionaryGeneration now behaves by default as if the DictionaryUpdate-class DictionaryUpdatePartialIndex was used. The class DictionaryUpdatePartialIndex still exists for compatibility reasons, but no longer does any processing.

If anyone wants to deactivate the omission of indexing between {{ and }}, they can put the following line in DictionaryForMIDs.properties:
dictionaryGenerationOmitParFromIndex: false
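
The colon form shown above is valid in Java's .properties syntax, which accepts either = or : as the key/value separator. As a hedged sketch of how such a switch could be read (the helper class below is hypothetical; DictionaryGeneration itself may read the property differently), with a default of true to match the behaviour described in this post:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; not taken from the DictionaryGeneration source.
class OmitParPropertySketch {
    // Parses properties text and returns the value of the switch.
    // Assumption: a missing key means "omit {{...}} from the index" (true).
    static boolean omitParFromIndex(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("dictionaryGenerationOmitParFromIndex", "true");
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        System.out.println(omitParFromIndex("dictionaryGenerationOmitParFromIndex: false")); // prints false
        System.out.println(omitParFromIndex(""));                                            // prints true
    }
}
```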

@Jeff: could you please let me know, when you get a chance, whether that version works fine for you (no need to hurry)? Then I will change the link on the web site from the old 3.1.0 to that version.

Best regards,
Gert

dreamingsky

That's great.  Thanks for adding that.  I'm working on updating the Japanese dictionary now.  I'll see if the new DictionaryGeneration works.  I should have time to do it this weekend.

I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Jeff

Gert

Great !

Quote
I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Actually ... I already have an update on that page in the queue; I should have it online within the next few hours. Of course, any improvement will be welcome - your examples and guides are most valuable for all people who set up dictionaries !!

Best greetings,
Gert

Gert

Quote
Quote
I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Actually ... I already have an update on that page in the queue; I should have it online within the next few hours. Of course, any improvement will be welcome - your examples and guides are most valuable for all people who set up dictionaries !!

I just uploaded the updated web page on "Setting up a New Dictionary". The following 2 sections were amended: dictionaryGenerationOmitParFromIndex and "Reducing the size of the generated index files".

Please introduce any improvements at will. Of course, also on any other web pages at dictionarymid.sourceforge.net !

Thanks a lot !
Gert

dreamingsky

Gert

I tested the new DictionaryGeneration with the Hindi and Thai dictionaries.  Everything works OK.  Thanks for adding the code.

Jeff

jn0101

  • Developer
  • *****
  • Posts: 85
    • View Profile
I think there is either a problem (bug) or I just don't understand how to use it:

I have for example the line:

hjælpe  {{[051]}}{{[01tr]}} helpi; {{[02understøtte]}} subteni; {{[02assistere]}} asisti; {{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian, interhelpi sin, {{[02også fx: ~s ad med at trække]}} kune tiri, kunlabori tirante; {{[02~ {{[06af]}} med, fx: ~ nogen af med jakken]}} helpi iun demeti la jakon, {{[02~ nogen af med affaldet]}} helpi iun seniĝi pri la rubo; {{[02~ {{[06med/]}} til]}} kunhelpi, asisti; {{[02fx: ~ nogen jakken {{[06på]}}]}} helpi iun surmeti la jakon; {{[02fx: ~ {{[06til]}} en stilling]}} helpi dungiĝi; {{[02{{[06stå]}} til at ~]}} esti helpebla/ savebla; {{[052]}}{{[01tr]}} {{[02~ hen, fx: ~ frem]}} helpi antaŭeniri/ progresi, {{[02~ ind i huset]}} helpi veni en la domon, {{[02~ op]}} helpi stariĝi/ supreniri, {{[02~ op på hesten]}} helpi surĉevaliĝi; {{[02se også: komme, gå]}}; {{[053]}}{{[02gavne]}} {{[01itr]}} utili, {{[02det kan ikke ~ noget]}} tio ne utilas, estas senutile, {{[02hvad ~r det?]}} por kio utilas?; {{[02gøre nemmere]}} {{[01tr]}} plifaciligi, {{[02fx: det skal nok ~ på hans forståelse]}} tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon

The string "nogen af med affaldet" is enclosed here as {{[02~ nogen af med affaldet]}}, so it SHOULDN'T be indexed at all when reversing and going Esperanto->Danish.


Anyway, I have this gigantic redundancy. It seems the phrase is included TWENTY-TWO times in the Esperanto index.

$ grep -l "nogen af med affaldet" *
directory58.csv
indexEpo104.csv
indexEpo115.csv
indexEpo118.csv
indexEpo12.csv
indexEpo144.csv
indexEpo152.csv
indexEpo159.csv
indexEpo166.csv
indexEpo170.csv
indexEpo172.csv
indexEpo1.csv
indexEpo26.csv
indexEpo2.csv
indexEpo45.csv
indexEpo60.csv
indexEpo65.csv
indexEpo68.csv
indexEpo7.csv
indexEpo83.csv
indexEpo84.csv
indexEpo86.csv
indexEpo87.csv

$ grep "nogen af med affaldet" *
directory58.csv:hjælpe   [051][01tr] helpi; [02understøtte] subteni; [02assistere] asisti; [02~s [06ad], ~ hinanden] helpi unu la alian, interhelpi sin, [02også fx: ~s ad med at trække] kune tiri, kunlabori tirante; [02~ [06af] med, fx: ~ nogen af med jakken] helpi iun demeti la jakon, [02~ nogen af med affaldet] helpi iun seniĝi pri la rubo; [02~ [06med/] til] kunhelpi, asisti; [02fx: ~ nogen jakken [06på]] helpi iun surmeti la jakon; [02fx: ~ [06til] en stilling] helpi dungiĝi; [02[06stå] til at ~] esti helpebla/ savebla; [052][01tr] [02~ hen, fx: ~ frem] helpi antaŭeniri/ progresi, [02~ ind i huset] helpi veni en la domon, [02~ op] helpi stariĝi/ supreniri, [02~ op på hesten] helpi surĉevaliĝi; [02se også: komme, gå]; [053][02gavne] [01itr] utili, [02det kan ikke ~ noget] tio ne utilas, estas senutile, [02hvad ~r det?] por kio utilas?; [02gøre nemmere] [01tr] plifaciligi, [02fx: det skal nok ~ på hans forståelse] tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon
indexEpo104.csv:med at trække kune tiri kunlabori tirante af med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S


indexEpo104.csv:med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S

indexEpo104.csv:med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S

Gert

@jn0101: Are you using version 3.5.0 of DictionaryGeneration?

Gert

jn0101

I'm using the SVN version:

/*
 * Note: this class is obsolete starting with DictionaryGeneration 3.5.0 because
 * with version 3.5.0 of DictionaryGeneration the behaviour from DictionaryUpdatePartialIndex
 * is already included in class DictionaryUpdate. The class DictionaryUpdatePartialIndex
 * is only retained for compatibility reasons.
 */
package de.kugihan.dictionaryformids.dictgen.dictionaryupdate;


public class DictionaryUpdatePartialIndex extends DictionaryUpdate {
   
}


so.... yes :-)

Gert

I'll look at that !

Gert

Gert

@jn0101
I just noted:

Code:
{{[02~s {{[06ad]}}, ~ hinanden]}}

I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem, though there may be an additional issue.

Would it be hard for you to avoid nesting of {{ and }} ?
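
To show why nesting could matter, here is a small Java sketch under an assumption: that the implementation pairs each {{ with the next }} it finds. That assumption has not been checked against DictionaryUpdate.java, and both methods below are hypothetical. The first method behaves that way and leaves part of the marked text behind on nested input; the second counts nesting depth and removes the whole span.

```java
// Hypothetical comparison; neither method is taken from DictionaryUpdate.java.
class NestingSketch {
    // Flat removal: every "{{" is paired with the NEXT "}}" that follows it.
    static String stripFlat(String s) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int open = s.indexOf("{{", pos);
            if (open < 0) break;
            int close = s.indexOf("}}", open + 2);
            if (close < 0) break;
            out.append(s, pos, open); // keep the text before the marker
            pos = close + 2;          // skip past the closing marker
        }
        out.append(s.substring(pos));
        return out.toString();
    }

    // Depth-counting removal: text is kept only while no "{{" is open.
    static String stripNested(String s) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && s.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                if (depth == 0) out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String nested = "{{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian";
        System.out.println(stripFlat(nested));   // prints ", ~ hinanden]}} helpi unu la alian"
        System.out.println(stripNested(nested)); // prints " helpi unu la alian"
    }
}
```

Under that assumption, the Danish words after the inner }} would still reach the indexer, which would be consistent with part of the reported redundancy but does not prove it.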

Gert

jn0101

Quote
@jn0101
I just noted:

Code:
{{[02~s {{[06ad]}}, ~ hinanden]}}
I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem; well, maybe there is an additional issue.

Would it be hard for you to avoid nesting of {{ and }} ?

What I need is to exclude all text in []'s. Therefore I have these replacements as a part of the preprocessing:

# - insert {{ and }} to avoid indexing anything inside [ and ].
   sed 's/\[/{{[/g' |
   sed 's/\]/]}}/g' |

I think it would take quite some time to exclude nesting, as the [ and ]'s are nested.

Anyway, the phrase {{[02~ nogen af med affaldet]}} has no nesting, so I can't see how that could happen.
I might try with only this line and with no nesting... but I expect it to make no difference....
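
One possible way to avoid the nesting without touching the dictionary source is to wrap only the outermost [ ... ] groups during preprocessing. The two sed substitutions above cannot do this because they wrap every bracket. The Java sketch below is a hypothetical replacement for that step (the class and method names are invented here); whether it removes the duplicated index entries is untested.

```java
// Hypothetical preprocessing step; not part of DictionaryForMIDs.
class OuterBracketWrapSketch {
    // Wraps each outermost [ ... ] group in {{ ... }} and leaves inner brackets as they are.
    static String wrapOutermost(String line) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '[') {
                if (depth == 0) out.append("{{"); // opening an outermost group
                depth++;
                out.append(c);
            } else if (c == ']' && depth > 0) {
                depth--;
                out.append(c);
                if (depth == 0) out.append("}}"); // closing an outermost group
            } else {
                out.append(c); // ordinary text, or an unmatched ']'
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapOutermost("[02~s [06ad], ~ hinanden] helpi"));
        // prints {{[02~s [06ad], ~ hinanden]}} helpi
    }
}
```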

Jacob