Is it possible to decrease the dictionary size?

Started by arsen_a, 29. July 2007, 10:32:31


arsen_a

Hello everybody!

Recently I came across this project and I really appreciate it. I have already made a Spanish-Russian dictionary with your tools, but unfortunately the file size is 465kb and it cannot run on my Nokia6230. Is it possible to make the dictionary smaller than 300kb? If I make it unidirectional, will that decrease the file size? How can I do that (make it unidirectional)? Thanks.

Gert

You could first look at the index files that were generated by DictionaryGeneration and check whether they contain unnecessary information. Often dictionaries include content that should not be included in the indexes.
Maybe you could just post some lines from the index files here in the forum (best also post your DictionaryForMIDs.properties file).

Making the dictionary unidirectional will also reduce the index file size (but not the size of the directory files).

Another point: did you use JarCreator ? JarCreator includes only those application icons that are really needed for the dictionary. This makes the resulting JAR file smaller compared to a manually assembled JAR file.

Gert


arsen_a

Hi, Gert!

Of course I used JarCreator and gained almost 150kb, because the output directory size was 600kb and now the .jar file size is 450kb. Here is my DictionaryForMIDs.properties file:

infoText: Spanish-Russian dictionary
dictionaryAbbreviation: IDP
numberOfAvailableLanguages: 2
language1DisplayText: Spanish
language2DisplayText: Russian
language1FilePostfix: Esp
language2FilePostfix: Rus
dictionaryGenerationSeparatorCharacter: ':'
indexFileSeparationCharacter: ':'
searchListFileSeparationCharacter: ':'
dictionaryFileSeparationCharacter: ':'
dictionaryGenerationInputCharEncoding: UTF-8
indexCharEncoding: UTF-8
searchListCharEncoding: UTF-8
dictionaryCharEncoding: UTF-8
language1DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateIDP
language2DictionaryUpdateClassName: de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateIDPSpa
language1NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationEng
language2NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationLat

I have modified it a bit, because the characters in the .txt file are in Unicode format and the separator is ':'.
Regarding the other values that you mentioned, for example for the index files, what can I change there to make the size smaller? Do you mean these variables?
•   searchListFileMaxSize/indexFileMaxSize/dictionaryFileMaxSize

I had a look at the index files but did not find anything, because they also contain the same words as the directory files! Do I need these files? I have checked: all the index files together take 500kb, and the directory files take 100kb.
Thank you for your help :)

Gert

500kB index files vs. 100 kB directory files is highly suspicious ! Let's look at this a little closer.

First a few comments about the file DictionaryForMIDs.properties:
Concerning the entries  language1DictionaryUpdateClassName and language2DictionaryUpdateClassName, I guess your dictionary does not have the rather complicated syntax of the IDP dictionaries, right ? Then you should remove these two lines.

NormationEng is probably not the right class for Spanish; it should rather be NormationLat (NormationLat is best for Spanish). As normation class for language2 you could use the new NormationRus.

Could you update your DictionaryForMIDs.properties for these lines and then re-generate the files. If the size of the index files is still big, could you then double check that only the desired words from the inputdictionaryfile are indexed ? Maybe there are unnecessary index entries ?

About the searchListFileMaxSize/indexFileMaxSize/dictionaryFileMaxSize: with these properties you can reduce the size of the single files, but if the files are smaller, then there will be more files, so the sum will not be less.

There are still a few more possibilities to reduce the file sizes, such as including only a unidirectional index as you suggested. For a unidirectional index, just set languageXGenerateIndex and languageXIsSearchable to false for the language where you don't want to have an index.

Keep us updated about your progress !
Gert


arsen_a

Thanks for your reply!

I have modified the .properties file as you suggested: I removed language1DictionaryUpdateClassName and language2DictionaryUpdateClassName, then changed NormationEng into NormationLat and NormationLat into NormationRus. Regarding the complicated syntax: as the source for the dictionary I used a .txt file containing special characters, for example ñ, ó etc. Can this be the cause of my problem?
I have recompiled now and got the same result: the file size is 485kb, bigger than before, and the sizes of the index and directory files have changed as well (directories 298kb, index 567kb). You advised to "double check that only the desired words from the inputdictionaryfile are indexed ? Maybe there are unnecessary index entries"! Does this program add some unnecessary entries to the original dictionary? I mean not in the .txt file but in the final file? Where does it take those words from? I think all the words in the Spanish dictionary are necessary :) What can I do next?

arsen_a

I have just made a unidirectional dictionary as you advised, but now the file size is 377kb :(( I think the problem is in the index files.
I would like to provide this information for comparison: my dictionary source .txt file is 300kb, because it is in Unicode format. The original file was in ISO-8859-1 format and its size was 200kb, but after compilation I ran the program on a PC mobile phone emulator and the Russian letters were unreadable, so I decided to change the encoding to Unicode and the file became 100kb bigger. I have an English-Russian dictionary installed on my phone that is 230kb; the source dictionary for that program is 560kb and it contains almost 25000 words. So I am wondering why this Spanish dictionary, which contains only 10000 words, ends up at 485kb?

Gert

Please have a look at the description of the index files and directory files in the chapter "Files generated by the DictionaryGeneration tool" of the section "Setting up a new dictionary". I still believe that the index files contain entries that are not needed. If your dictionary contains information such as grammatical categories, then this may be the case. Such information can blow up the index size, because without additional hints DictionaryGeneration will interpret it as phrases and generate index entries for it.

The directory files contain the translations from the inputdictionaryfiles. So the size of these files is roughly the same. Note however that these files are ZIP compressed in the JAR file. So if your inputdictionaryfile is 300 kb, the resulting compressed files could be less than 150 kb (depending on the content).

Unicode makes the files bigger. I assume you use UTF-8 encoding, right ? ISO-8859-1 will not work for Russian characters. With Unicode you will not have a problem with characters such as ñ, ó etc.
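
To see why UTF-8 makes the file bigger, here is a minimal Python sketch (Python is only used for illustration and is not part of the DfM tools; the sample words are arbitrary). It counts how many bytes the same text needs in each encoding:

```python
# Compare the encoded size of a few sample words (arbitrary examples).
def encoded_size(text, encoding):
    """Return the number of bytes needed for text, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

for word in ["gato", "ñ", "кошка"]:
    print(word, encoded_size(word, "utf-8"), encoded_size(word, "iso-8859-1"))
# prints:
# gato 4 4
# ñ 2 1
# кошка 10 None
```

Cyrillic letters take two bytes each in UTF-8 and cannot be represented in ISO-8859-1 at all, so a larger UTF-8 source file is expected for a dictionary with Russian translations.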

Can you check the compressed file sizes in the JAR file ? How much space is used by the 'dictionary' directory of the JAR file and how much size by the application ?
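
If no ZIP tool is at hand, the sizes can also be read with a short script. The following is only a sketch in Python (not part of the DfM tools); it assumes the JAR is an ordinary ZIP archive and groups the entries by their top-level folder. The function name summarize_jar is made up for this example.

```python
import sys
import zipfile
from collections import defaultdict

def summarize_jar(jar_path):
    """Return {top-level folder: (uncompressed bytes, compressed bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.is_dir():
                continue
            # Entries directly in the JAR root are grouped under "(root)".
            top = info.filename.split("/", 1)[0] if "/" in info.filename else "(root)"
            totals[top][0] += info.file_size
            totals[top][1] += info.compress_size
    return {name: tuple(sizes) for name, sizes in totals.items()}

if __name__ == "__main__" and len(sys.argv) > 1:
    for name, (raw, packed) in sorted(summarize_jar(sys.argv[1]).items()):
        print(f"{name}: {raw} bytes uncompressed, {packed} bytes compressed")
```

Run it with the path of the generated JAR as the only argument. The compressed figures do not include the per-entry ZIP headers, so the sum will be slightly below the real JAR size.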

Another tip: I think you can remove the icons-folder from the JAR-file. I believe the application will still run (well, obviously you won't have icons then).

Note that the English-Russian dictionary is using a much older version of DfM and the application size was much smaller then.

Gert



Gert

I just realized that the DictionaryUpdatePartialIndex is not documented on the web pages (I thought that I did; probably I just wanted to do it but then I forgot :( ).

To prevent indexing of unwanted parts (such as grammatical categories), use the DictionaryUpdate class "DictionaryUpdatePartialIndex" for that language. Then put the parts that shall not be indexed in double braces: {{ }}

for example
original line in the inputdictionaryfile:
cat (n)   Katze (f;n)

needs to be changed to
cat {{(n)}}   Katze {{(f;n)}}

Then run DictionaryGeneration again.
Hope that helps.

I really need to add this on the web pages ! Without that documentation you cannot know this, sorry.
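
For a large inputdictionaryfile the braces can be added by a script instead of by hand. This is a minimal sketch in Python, assuming that every parenthesised group such as (n) or (f;n) is a grammatical marker that should not be indexed; if your dictionary also uses parentheses for real content, the pattern needs to be narrowed. The function name mark_unindexed is made up for this example.

```python
import re

# Assumption: every parenthesised group, e.g. "(n)" or "(f;n)", is a grammatical
# marker. Groups that are already wrapped in {{ }} are left alone.
MARKER = re.compile(r"(?<!\{\{)\([^()]*\)(?!\}\})")

def mark_unindexed(line):
    """Wrap each parenthesised marker in double braces."""
    return MARKER.sub(lambda m: "{{" + m.group(0) + "}}", line)

print(mark_unindexed("cat (n)   Katze (f;n)"))
# prints: cat {{(n)}}   Katze {{(f;n)}}
```

Applying the function to every line of the input file and writing the result to a new file keeps the original untouched; running it twice does not add a second pair of braces.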

Gert

arsen_a

Hi, Gert!

Sorry for the delayed answer. I just tried to recompile with the settings that you advised: I added DictionaryUpdatePartialIndex: {{ }} to the properties file and now I got a dictionary with a size of 385kb, so we have gained another 100kb :) Regarding the icons, I forgot to tell you that I am using the light version of the empty dictionary, which is 112kb and does not seem to contain icons. Do you have any other ideas to decrease the file size? Thank you very much!

arsen_a

Hi Gert,

I just had a look into the index files again and found something interesting. As I already said, I have added the DictionaryUpdatePartialIndex: {{ }} line to the properties file. I discovered that there are two big ~40kb index files in the output directory; I looked into them and found that the letters 'm' and 'f' had a lot of indexes or pointers to the directory files. I think you know that these letters are used to describe the gender of a noun, so I think our program did not identify the information in the {{ }} correctly. Dear Gert, can you tell me, did I write that line correctly, is the syntax right?
DictionaryUpdatePartialIndex: {{ }}
Another question is, here is a line from the index file:
millonario m :5-1215-B
and here is the same word in the directory5 file:
millonario_{{(m)}}:(here are the characters in Unicode format)
Is this correct? BTW, after compilation I ran the program on a PC emulator, and when I enter a word in Spanish, after translation I get the word in this format: some word {{(m)}}. Is that normal? Should I see these braces or not?
Thanks a lot!

Gert

Sorry for my hurried description above (I know I need to update the web pages !!).

You need to put in DictionaryForMIDs.properties the line
language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex

(or language2... whichever is the right language)

Then it will work.

Gert

arsen_a

Hello Gert,
Thank you for your help, I have added the line (language1DictionaryUpdateClassName=de.kugihan. etc) and now my properties file looks like this:
infoText: Spanish-Russian dictionary
dictionaryAbbreviation: IDP
numberOfAvailableLanguages: 2
language1DisplayText: Spanish
language2DisplayText: Russian
language1FilePostfix: Esp
language2FilePostfix: Rus
language2IsSearchable: false
language2GenerateIndex: false
language1DictionaryUpdateClassName=de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdatePartialIndex
dictionaryGenerationSeparatorCharacter: ':'
indexFileSeparationCharacter: ':'
searchListFileSeparationCharacter: ':'
dictionaryFileSeparationCharacter: ':'
dictionaryGenerationInputCharEncoding: UTF-8
indexCharEncoding: UTF-8
searchListCharEncoding: UTF-8
dictionaryCharEncoding: UTF-8
language1NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationLat
language2NormationClassName: de.kugihan.dictionaryformids.translation.normation.NormationRus

Now the final .jar file size is 356kb; we have gained another 30kb, but it still cannot run on my Nokia :(
BTW, here is some information from the generation output, maybe it can be helpful:

Creating: ./output/dictionary/DictionaryForMIDs.properties
Property searchListFileMaxSize set to 235
Property indexFileMaxSize set to 11999
Property dictionaryFileMaxSize set to 6783
Property language1IndexNumberOfSourceEntries set to 9779
Done: property file

Gert

Hmmm ...

Do you know what precisely is the jar size limit for your Nokia ?

Did you have a look at the index files again to check that there are no more superfluous entries ? Can you give me the file sizes for the index files again ? Are the index files well compressed in the JAR-file (you could use any ZIP tool to get a rough compression ratio for the files) ?
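
One way to spot superfluous entries quickly is to list the keywords whose index lines are longest, since an entry like 'm' or 'f' with many pointers produces a very long line. The sketch below is in Python and only assumes what the example line earlier in this thread shows: each index line is a keyword, then the separator character (':' in this dictionary), then the pointers. The function name largest_index_entries is made up for this example.

```python
import sys

def largest_index_entries(lines, separator=":", top=5):
    """Return the `top` (keyword, line size in bytes) pairs with the longest lines."""
    sizes = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Everything before the first separator is taken as the keyword.
        keyword = line.split(separator, 1)[0].strip()
        sizes.append((keyword, len(line.encode("utf-8"))))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]

if __name__ == "__main__" and len(sys.argv) > 1:
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as index_file:
            for keyword, size in largest_index_entries(index_file):
                print(f"{path}: {keyword!r} uses {size} bytes")
```

Pass the index files from the output directory as arguments. Keywords that are single letters or abbreviations near the top of the list are candidates for the {{ }} treatment.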

Gert

arsen_a

Hi again,

Regarding the precise file limit for jar applications on my Nokia, I think it may be 300kb, because I once installed a EuroMap program, whose size was 296kb, and to my astonishment it ran normally. After that I thought that my Nokia could run even larger files, so I installed another dictionary (I can't remember which one, but I remember the size was ~365kb) and my Nokia could not identify that file. The generated message was: incorrect file.
As to the index file size, as I have mentioned before, each file is 10kb.
What do you mean by saying: Are the index files well compressed in the JAR-file ? To make the dictionary I use the automatic method that is described in the how-to. I use this command:
java -jar JarCreator.jar dictionarydirectory emptyjar outputdirectory
(of course with the correct names for my directories). Do I have to do this step manually? I just tried compressing the final jar file with zip and the file size decreased by only 5kb.
I forgot to tell you that I have also checked the content of the index files and did not find any other unnecessary information there :(

Gert

Yes, using JarCreator as you do is the right way, no need for manual steps. JarCreator does compress the files in the JAR-file, I was just wondering whether everything worked well there. So a 10kb index file is 5kb after compression, right ?

Could you give me the current total size of the index files (uncompressed) ?

The directory files were about 300 kb uncompressed if I remember well. That may be roughly 150 kb compressed, maybe less. Plus the 112 kb application code and the size of the index files.
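
Put together as a rough budget (a sketch in Python; all figures are the ones mentioned in this thread, the directory figure is only an estimate, the 300 kb limit is a guess, and ZIP overhead is ignored):

```python
# All figures are in kB and come from the posts in this thread.
jar_total = 356             # current size of the generated JAR
application = 112           # light version of the empty application
directory_compressed = 150  # rough estimate for the compressed directory files
assumed_limit = 300         # guessed JAR size limit of the phone

# Whatever is left in the JAR should be the compressed index files.
index_compressed = jar_total - application - directory_compressed
must_save = jar_total - assumed_limit

print(f"implied compressed index size: about {index_compressed} kB")
print(f"needed reduction for a {assumed_limit} kB limit: about {must_save} kB")
```

With these numbers the index files would account for roughly 94 kB of the JAR, and about 56 kB would have to go to get below 300 kB.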

Well, it would be useful to know the file size limit of your Nokia. You could try to determine it by just removing some of the files in the /dictionary folder of the JAR-file (use any zip-tool for this) and then try to install the application.
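
If you prefer a script over a zip-tool, the following Python sketch writes a copy of the JAR that leaves out the last few files of the dictionary folder. It assumes the folder is stored as dictionary/ inside the archive; the function name write_trimmed_jar is made up for this example. The trimmed JAR is only meant for testing whether the phone accepts the size, the dictionary in it will be incomplete.

```python
import zipfile

def write_trimmed_jar(src_path, dst_path, drop_count, folder="dictionary/"):
    """Copy src_path to dst_path, leaving out the last drop_count files in folder."""
    with zipfile.ZipFile(src_path) as src:
        entries = src.infolist()
        candidates = [e.filename for e in entries
                      if e.filename.startswith(folder) and not e.is_dir()]
        dropped = set(candidates[-drop_count:]) if drop_count > 0 else set()
        with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for entry in entries:
                if entry.filename not in dropped:
                    # Reuse the original entry metadata, including its compression type.
                    dst.writestr(entry, src.read(entry.filename))
    return sorted(dropped)
```

Calling it with increasing drop_count values and installing each result narrows down the size at which the phone starts to reject the file.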

Gert