Large size of files generated by DictionaryGeneration / how to reduce size

Started by Gert, 03. May 2010, 20:21:38


Gert

I just had a quick look at the code. DictionaryUpdateCEDICTChi is really built to handle only entries exactly like this one:
安康 安康 [01an1 kang1],good health


Other entries will cause problems.

OK, I will make DictionaryUpdateCEDICTChi more flexible when I get the chance.

I will keep you informed!
Gert


dreamingsky

Thanks, that looks better.  I still see some problems though.

1. There is still an error with:
{{懶洋洋}} [01 lan3 yang2 yang2]   malvigla; langvora

Before it gave this error:
{{懶洋洋}}     1-0-B  [extra {{}} ]

Now it gives this error:
   1-0-B  [the 懶洋洋 characters are missing]

2. If there are 2 words in the left column, then only the 2nd word gets indexed:
馬虎 / 马虎 {{[01 mahu]}}   malzorga

Both "馬虎" and "马虎" should be indexed.  But, only "马虎" is indexed.  Here is the index:
馬虎 / 马虎   1-152-B
马虎   1-152-S

I thought that if there is a space between words, then they will both be indexed.  This is probably a different problem than the {{}} problem.

I added this line to DictionaryForMIDs.properties, but it didn't fix the problem:
dictionaryGenerationLanguage1ExpressionSplitString: /

Gert

Jeff,

Quote: 1. There is still an error with:
{{懶洋洋}} [01 lan3 yang2 yang2]   malvigla; langvora

Before it gave this error:
{{懶洋洋}}     1-0-B  [extra {{}} ]

Now it gives this error:
   1-0-B  [the 懶洋洋 characters are missing]

The 懶洋洋 is inside {{ and }}, so I think it should not get indexed ... or am I wrong?
The empty index entry before 1-0-B might be your BOM character; I have not fully investigated that yet, though. (The BOM character would do no harm in the index.)


Quote: 2. If there are 2 words in the left column, then only the 2nd word gets indexed:
馬虎 / 马虎 {{[01 mahu]}}   malzorga

Both "馬虎" and "马虎" should be indexed.  But, only "马虎" is indexed.  Here is the index:
馬虎 / 马虎   1-152-B
马虎   1-152-S

I thought that if there is a space between words, then they will both be indexed.  This is probably a different problem than the {{}} problem.
I added this line to DictionaryForMIDs.properties, but it didn't fix the problem:
dictionaryGenerationLanguage1ExpressionSplitString: /

Yes, both need to be indexed, exactly as you describe.

However
馬虎 / 马虎   1-152-B
马虎   1-152-S

looks OK to me at first sight: the "馬虎" is part of the index, though still combined into "馬虎 / 马虎" (DictionaryGeneration treats this as a 'phrase' unless you set dictionaryGenerationLanguage1ExpressionSplitString).

And you are right, you really do need to set dictionaryGenerationLanguage1ExpressionSplitString here!

With the
dictionaryGenerationLanguage1ExpressionSplitString: /
the index should look like
馬虎   1-152-B
马虎   1-152-S

But it does not ?? Ok, then I have to check that !
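
For anyone following along, here is a minimal Python sketch of the splitting behaviour described above. It is not the DictionaryGeneration code (which is Java); the function name `index_keys` is hypothetical, and the sketch ignores the -B/-S markers and the position numbers entirely. It only illustrates why a left-column expression stays one 'phrase' without a split string and becomes two keys with one.

```python
from typing import List, Optional


def index_keys(expression: str, split_string: Optional[str] = None) -> List[str]:
    """Return the index keys for one left-column expression (illustration only)."""
    if not split_string:
        # No split string configured: the whole expression is one 'phrase'.
        return [expression.strip()]
    # Split on the configured string and drop empty or whitespace-only parts.
    return [part.strip() for part in expression.split(split_string) if part.strip()]


print(index_keys("馬虎 / 马虎"))       # one combined key
print(index_keys("馬虎 / 马虎", "/"))  # two separate keys
```

With "/" as the split string, both "馬虎" and "马虎" come out as separate keys, which is what the expected index above shows.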

Best greetings !
Gert

dreamingsky

Quote: The 懶洋洋 is inside {{ and }}, so I think it should not get indexed ... or am I wrong?
The empty index entry before 1-0-B might be your BOM character; I have not fully investigated that yet, though. (The BOM character would do no harm in the index.)

You're right, since the word is in {{}} it won't be indexed.  So the extra "     1-0-B" won't cause any problems in the index.
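
To make the {{ }} rule concrete, here is a small Python sketch. It is only an illustration under my own assumptions, not the real DictionaryGeneration logic: the name `indexable_text` is made up, and I assume that {{ }} sections are not nested.

```python
import re

# Matches one {{ ... }} section; non-greedy so that two separate sections
# on the same line are removed individually.
NO_INDEX = re.compile(r"\{\{.*?\}\}")


def indexable_text(column: str) -> str:
    """Return the part of a column that is left over for indexing."""
    return NO_INDEX.sub("", column).strip()


print(repr(indexable_text("{{懶洋洋}}")))                   # nothing left to index
print(repr(indexable_text("馬虎 / 马虎 {{[01 mahu]}}")))    # only the two words remain
```

So an expression that is completely wrapped in {{ }} leaves an empty string, which would explain an empty-looking index key.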

I changed the first line to this (I removed the {{}} ):
懶洋洋 [01 lan3 yang2 yang2]   malvigla; langvora

Now 懶洋洋 is not in the index.  So I think the BOM must be causing a problem.

Quote: With the
dictionaryGenerationLanguage1ExpressionSplitString: /
the index should look like
馬虎   1-152-B
马虎   1-152-S

Yes, the index should look like that.  I added the SplitString, but the index still looks like this:
馬虎 / 马虎   1-90-B
马虎   1-90-S

I included the newest build files.

Gert

Jeff,

When I get the chance, I will exclude the BOM character from being put in the index (I will do it for all dictionaries, not only for the Chinese one). But there is no reason to wait for this, because the indexing algorithm will also work fine with the BOM character in it.

Quote: Yes, the index should look like that.  I added the SplitString, but the index still looks like this:
馬虎 / 马虎   1-90-B
马虎   1-90-S

I will also look at that when I get the chance. It will likely take a few weeks though, sorry about that.

Best regards,
Gert

dreamingsky

Quote: But there is no reason to wait for this, because the indexing algorithm will also work fine with the BOM character in it.

Actually the BOM causes the 1st word to not be indexed: 懶洋洋.  This is probably true for all dictionaries.  It's a small problem, though.
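
Here is a short Python sketch of why the BOM can hide the first word. It assumes the input file is UTF-8 with a leading U+FEFF and that the key is simply the text before the first space; the helper name `strip_bom` is hypothetical and this is not the DictionaryGeneration code.

```python
BOM = "\ufeff"  # byte order mark as it appears after decoding UTF-8


def strip_bom(line: str) -> str:
    """Remove a leading BOM from a line, if there is one."""
    return line[len(BOM):] if line.startswith(BOM) else line


first_line = "\ufeff懶洋洋 [01 lan3 yang2 yang2]\tmalvigla; langvora"

key_with_bom = first_line.split(" ")[0]
key_clean = strip_bom(first_line).split(" ")[0]

print(key_with_bom == "懶洋洋")  # the invisible BOM makes the key differ
print(key_clean == "懶洋洋")     # after stripping, the key matches
```

In Python, opening the file with `encoding="utf-8-sig"` drops the BOM automatically; a generator written in another language would need an equivalent check on the first line.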

Quote: It will likely take a few weeks though, sorry about that.

No problem, there's no rush.

Gert

Jeff,

I just had a look at the source code: dictionaryGenerationLanguage1ExpressionSplitString is indeed not yet supported by DictionaryUpdateCEDICTChi.

I need to find a way to get that incorporated into DictionaryUpdateCEDICTChi.

Regards,
Gert

Gert

Jeff,

I have updated DictionaryUpdateCEDICTChi to handle the ExpressionSplitString, but I need to do some testing first.

Fortunately ... no rush ...

Regards,
Gert

dreamingsky

Yes, everything works now.  The BOM was removed from the first entry.  The {{}} work correctly.  And the ExpressionSplitString works correctly.

Thank you very much
Jeff

dreamingsky

Sorry, yesterday I looked at indexChi1.csv and everything looked good.  But today I used the dictionary in an emulator and found a new problem.  Now all the pinyin transcriptions are displayed inside [ ].

directory1.csv contains:
懶洋洋 [01\[ lǎn yáng yáng\]]     malvigla; langvora

It is displayed in the emulator as:
懶洋洋 [ lǎn yáng yang]
malvigla; langvora

It should look like:
懶洋洋  lǎn yáng yang
malvigla; langvora

Can you please remove the [ ] from the transcription display?  There is no rush.
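
To show what I mean, here is a tiny Python sketch. It assumes the displayed brackets come from the escaped \[ and \] inside the [01 ...] part, as in the directory1.csv line above; the function name `strip_escaped_brackets` is made up and this is not how DictionaryUpdateCEDICTChi is actually written.

```python
def strip_escaped_brackets(text: str) -> str:
    """Drop the escaped \\[ and \\] pairs, keeping the unescaped [01 ...] marker."""
    return text.replace(r"\[", "").replace(r"\]", "")


entry = r"懶洋洋 [01\[ lǎn yáng yáng\]]"
print(strip_escaped_brackets(entry))
```

The unescaped [01 ...] marker stays untouched; only the literal brackets that end up on the screen are removed.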


Gert

Jeff,

I can remove the [ ]. Hmmm, I thought those [ ] had always been there, hadn't they?

Anyway, I will remove them when I get the chance, no problem with that.

Regards,
Gert

dreamingsky