Large size of files generated by DictionaryGeneration / how to reduce size

Started by Gert, 03. May 2010, 20:21:38

Gert

Colleagues,

Some people who ran DictionaryGeneration found that the generated files are very big. For example, a 2 MB inputdictionaryfile produced more than 10 MB of generated files. In such a case the index files very probably contain unnecessary information.

To illustrate the problem, here is an example with a line from the inputdictionaryfile:

sleep  The state of reduced consciousness of a human or animal[tab]Schlaf  Zustand der Ruhe eines Tieres oder Menschen

Note: [tab] is for the tab-separator character.

Here, without additional information, DictionaryGeneration will index all expressions that are included in the explanatory texts (e.g. "The state of reduced consciousness of a human or animal"). This is undesirable.

The solution is to use a DictionaryUpdate-class that avoids including the unnecessary indexes for the explanatory texts. In simple cases you can use the class DictionaryUpdatePartialIndex. If you set DictionaryUpdatePartialIndex as DictionaryUpdateClass, then the text between {{ and }} will not be included in the index.

In the example:
sleep  {{The state of reduced consciousness of a human or animal}}[tab]Schlaf  {{Zustand der Ruhe eines Tieres oder Menschen}}

Then put these two lines in DictionaryForMIDs.properties:
language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex
language2DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex

Then the size of the generated files will shrink considerably. For an inputdictionaryfile with lines as in the above example, the compressed result will likely be below 2 MB.
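
To make the effect concrete, here is a minimal Python sketch of the idea. It is not the DictionaryGeneration code (which is Java); the names index_text and display_text are made up for illustration, and the sketch assumes that {{ and }} are never nested.

```python
import re

# Assumption: markers are not nested, so each {{ is closed by the next }}.
OMIT = re.compile(r"\{\{.*?\}\}")


def index_text(entry_side: str) -> str:
    """Return the part of one side of an entry that would be indexed."""
    stripped = OMIT.sub(" ", entry_side)
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(stripped.split())


def display_text(entry_side: str) -> str:
    """Return the text as the user sees it: markers removed, content kept."""
    return entry_side.replace("{{", "").replace("}}", "")


line = (
    "sleep  {{The state of reduced consciousness of a human or animal}}"
    "\t"
    "Schlaf  {{Zustand der Ruhe eines Tieres oder Menschen}}"
)
english, german = line.split("\t")

print(index_text(english))   # only the headword remains as an index key
print(index_text(german))
print(display_text(english))
```

Only the headwords end up as index keys, while the explanatory text is still part of the entry that is displayed.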

For advanced information on DictionaryUpdate-class read here: http://dictionarymid.sourceforge.net/newdictDictionaryUpdate.html (you do not need to read this if you use DictionaryUpdatePartialIndex)

Regards,
Gert



dreamingsky

The DictionaryUpdatePartialIndex DictionaryUpdateClass is very useful.  I've used it on several dictionaries.  I was wondering if the code could be added to the default DfM code instead of a separate UpdateClass?

If users want to use a custom DictionaryUpdateClass such as DictionaryUpdateThaiNIUEng or DictionaryUpdateEDICTJpn, then they can't use DictionaryUpdatePartialIndex without referencing it in the new UpdateClass.

We use brackets [] for ContentDeclarations.  So maybe the code for the curly brackets {{}} could be written in a similar way in DfM.

Then users could write a dictionary using {{}} without having to reference anything in the DictionaryForMIDs.properties.

Jeff

Gert

Understood, yes, that would be very useful.

I will implement this as you suggest!

Best regards,
Gert

Gert

I implemented an updated version of DictionaryGeneration; you can download it here: http://prdownloads.sourceforge.net/dictionarymid/DictionaryForMIDs_DictionaryGeneration_3.5.0.zip?download

DictionaryGeneration now behaves by default as if the DictionaryUpdate-class DictionaryUpdatePartialIndex were used. The class DictionaryUpdatePartialIndex still exists for compatibility reasons, but no longer does any processing.

If anyone wants to deactivate the omission of indexing between {{ and }}, they can put the following line in DictionaryForMIDs.properties:
dictionaryGenerationOmitParFromIndex: false

@Jeff: could you please let me know at some point whether that version works fine for you (no need to hurry)? Then I will change the link on the web site from the old 3.1.0 to that version.

Best regards,
Gert

dreamingsky

That's great.  Thanks for adding that.  I'm working on updating the Japanese dictionary now.  I'll see if the new DictionaryGeneration works.  I should have time to do it this weekend.

I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Jeff

Gert

Great!

Quote: I'll see about adding the {{}} information on the "Setting up a New Dictionary" webpage.

Actually ... I already have an update on that page in the queue; I should have it online within the next few hours. Of course, any improvement will be welcome - your examples and guides are most valuable for all people who set up dictionaries!

Best greetings,
Gert

Gert

I just uploaded the updated web page on "Setting up a New Dictionary". The following 2 sections were amended: dictionaryGenerationOmitParFromIndex and "Reducing the size of the generated index files".

Please feel free to make any improvements, on any other web pages at dictionarymid.sourceforge.net as well!

Thanks a lot!
Gert

dreamingsky

Gert

I tested the new DictionaryGeneration with the Hindi and Thai dictionaries.  Everything works OK.  Thanks for adding the code.

Jeff

jn0101

I think there is either a problem (bug) or I just don't understand how to use it:

I have for example the line:

hjælpe  {{[051]}}{{[01tr]}} helpi; {{[02understøtte]}} subteni; {{[02assistere]}} asisti; {{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian, interhelpi sin, {{[02også fx: ~s ad med at trække]}} kune tiri, kunlabori tirante; {{[02~ {{[06af]}} med, fx: ~ nogen af med jakken]}} helpi iun demeti la jakon, {{[02~ nogen af med affaldet]}} helpi iun seniĝi pri la rubo; {{[02~ {{[06med/]}} til]}} kunhelpi, asisti; {{[02fx: ~ nogen jakken {{[06på]}}]}} helpi iun surmeti la jakon; {{[02fx: ~ {{[06til]}} en stilling]}} helpi dungiĝi; {{[02{{[06stå]}} til at ~]}} esti helpebla/ savebla; {{[052]}}{{[01tr]}} {{[02~ hen, fx: ~ frem]}} helpi antaŭeniri/ progresi, {{[02~ ind i huset]}} helpi veni en la domon, {{[02~ op]}} helpi stariĝi/ supreniri, {{[02~ op på hesten]}} helpi surĉevaliĝi; {{[02se også: komme, gå]}}; {{[053]}}{{[02gavne]}} {{[01itr]}} utili, {{[02det kan ikke ~ noget]}} tio ne utilas, estas senutile, {{[02hvad ~r det?]}} por kio utilas?; {{[02gøre nemmere]}} {{[01tr]}} plifaciligi, {{[02fx: det skal nok ~ på hans forståelse]}} tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon

The string "nogen af med affaldet" is enclosed here as {{[02~ nogen af med affaldet]}}, so it SHOULDN'T be indexed at all when reversing and going Esperanto->Danish.


Anyway, I have this gigantic redundancy. It seems the phrase is included TWENTY-TWO times in the Esperanto index.

$ grep -l "nogen af med affaldet" *
directory58.csv
indexEpo104.csv
indexEpo115.csv
indexEpo118.csv
indexEpo12.csv
indexEpo144.csv
indexEpo152.csv
indexEpo159.csv
indexEpo166.csv
indexEpo170.csv
indexEpo172.csv
indexEpo1.csv
indexEpo26.csv
indexEpo2.csv
indexEpo45.csv
indexEpo60.csv
indexEpo65.csv
indexEpo68.csv
indexEpo7.csv
indexEpo83.csv
indexEpo84.csv
indexEpo86.csv
indexEpo87.csv

$ grep "nogen af med affaldet" *
directory58.csv:hjælpe   [051][01tr] helpi; [02understøtte] subteni; [02assistere] asisti; [02~s [06ad], ~ hinanden] helpi unu la alian, interhelpi sin, [02også fx: ~s ad med at trække] kune tiri, kunlabori tirante; [02~ [06af] med, fx: ~ nogen af med jakken] helpi iun demeti la jakon, [02~ nogen af med affaldet] helpi iun seniĝi pri la rubo; [02~ [06med/] til] kunhelpi, asisti; [02fx: ~ nogen jakken [06på]] helpi iun surmeti la jakon; [02fx: ~ [06til] en stilling] helpi dungiĝi; [02[06stå] til at ~] esti helpebla/ savebla; [052][01tr] [02~ hen, fx: ~ frem] helpi antaŭeniri/ progresi, [02~ ind i huset] helpi veni en la domon, [02~ op] helpi stariĝi/ supreniri, [02~ op på hesten] helpi surĉevaliĝi; [02se også: komme, gå]; [053][02gavne] [01itr] utili, [02det kan ikke ~ noget] tio ne utilas, estas senutile, [02hvad ~r det?] por kio utilas?; [02gøre nemmere] [01tr] plifaciligi, [02fx: det skal nok ~ på hans forståelse] tio certe utilos al li por kompreni, tio certe plifaciligos lian komprenon
indexEpo104.csv:med at trække kune tiri kunlabori tirante af med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S


indexEpo104.csv:med f nogen af med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S

indexEpo104.csv:med jakken helpi iun demeti la jakon nogen af med affaldet helpi iun senigi pri la rubo med til kunhelpi asisti f nogen jakken på helpi iun surmeti la jakon f til en stilling helpi dungigi stå til at esti helpebla savebla 2 tr hen f frem helpi antaueniri progresi ind i huset helpi veni en la domon op helpi starigi supreniri op på hesten helpi surcevaligi se også komme gå 3 gavne itr utili det kan ikke noget tio ne utilas estas senutile hvad r det por kio utilas gøre nemmere tr plifaciligi f det skal nok på hans forståelse tio certe utilos al li por kompreni tio certe plifaciligos lian komprenon   58-7928-S

Gert

@jn0101: Are you using version 3.5.0 of DictionaryGeneration?

Gert

jn0101

I'm using the SVN version:

/*
* Note: this class is obsolete starting with DictionaryGeneration 3.5.0 because
* with version 3.5.0 of DictionayGeneration the behaviour from DictionaryUpdatePartialIndex
* is already included in class DictionaryUpdate. The class DictionaryUpdatePartialIndex
* is only retained for compatibility reasons.
*/
package de.kugihan.dictionaryformids.dictgen.dictionaryupdate;


public class DictionaryUpdatePartialIndex extends DictionaryUpdate {
   
}


so.... yes :-)

Gert

@jn0101
I just noticed:

{{[02~s {{[06ad]}}, ~ hinanden]}}

I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem, though there may be an additional issue.
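
A rough Python illustration of why nesting could go wrong. It assumes that the scanner simply pairs each {{ with the next }}; that is an assumption for the sake of the example, not a reading of DictionaryUpdate.java, and the regular expression below is only a stand-in for such a scanner.

```python
import re

# Hypothetical non-nesting-aware matcher: each {{ is closed by the first }} that follows.
NAIVE = re.compile(r"\{\{.*?\}\}")

nested = "{{[02~s {{[06ad]}}, ~ hinanden]}} helpi unu la alian"

# The match ends at the inner }}, so the tail of the outer span is left over
# and would be treated as ordinary text to index.
leftover = NAIVE.sub("", nested)
print(leftover)
```

With such a scanner the words after the inner }} would still be treated as text to be indexed, even though they were meant to be excluded.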

Would it be hard for you to avoid nesting of {{ and }}?

Gert

jn0101

Quote from: Gert on 05. June 2010, 08:37:18
@jn0101
I just noted:

{{[02~s {{[06ad]}}, ~ hinanden]}}

I believe that nesting of {{ and }} is not supported by the current implementation (see class DictionaryUpdate.java). That may be part of the problem; well, maybe there is an additional issue.

Would it be hard for you to avoid nesting of {{ and }} ?

What I need is to exclude all text in []'s. Therefore I have these replacements as a part of the preprocessing:

# - insert {{ and }} to avoid indexing anything between [ and ].
   sed 's/\[/{{[/g' |
   sed 's/\]/]}}/g' |

I think it would take quite some time to exclude nesting, as the [ and ]'s are nested.
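
For reference, wrapping only the outermost brackets could look roughly like the following Python sketch. It is a hypothetical stand-in for the two sed commands above and has not been run on the full dictionary; the function name wrap_outermost_brackets is made up for illustration.

```python
def wrap_outermost_brackets(text: str) -> str:
    """Put {{ }} around each outermost [...] group and leave inner brackets bare."""
    out = []
    depth = 0  # current bracket nesting level
    for ch in text:
        if ch == "[":
            if depth == 0:
                out.append("{{")  # open a marker only at the outermost level
            depth += 1
            out.append(ch)
        elif ch == "]":
            out.append(ch)
            if depth > 0:
                depth -= 1
                if depth == 0:
                    out.append("}}")  # close the marker when the group ends
        else:
            out.append(ch)
    return "".join(out)


sample = "[02~s [06ad], ~ hinanden] helpi unu la alian"
print(wrap_outermost_brackets(sample))
```

Applied line by line in the preprocessing, this would produce {{ }} pairs that never nest, at the cost of one small extra script step.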

Anyway, the phrase {{[02~ nogen af med affaldet]}} does not contain any nesting, so I can't see how this could happen.
I might try with only this line and with no nesting... but I expect it to make no difference....

Jacob