Messages - dreamingsky

#31
General discussions / Re: Build environment
17. June 2010, 01:22:16
Quote: PAUSE   Prompt the user to press any key to continue.
in the end to make the windows stay open.

That's a great idea.  Putting a pause after DictionaryGeneration will show users if there was an error.  JarCreator will still run and give an error, but at least users will know where the error occurred.  I'll add the pauses to the script.
#32
Quote: But there is no reason to wait for this, cause the indexing algorithm will work fine also with the BOM-character in it.

Actually the BOM causes the 1st word to not be indexed: 懶洋洋.  This is probably true for all dictionaries.  It's a small problem, though.

Quote: Will likely take a few weeks though, sorry for that.

No problem, there's no rush.
#33
General discussions / Re: Build environment
15. June 2010, 01:30:44
I uploaded a new version:
http://prdownloads.sourceforge.net/dictionarymid/DictionaryForMIDs_3.5.0(beta3).zip?download

I added the Linux/Mac scripts and made other minor changes.

Jeff
#34
Quote: The 懶洋洋 is in {{ and }}, so I think it should not get indexed ... or am I wrong ?
The empty index before 1-0-B might be your BOM-character, I did not yet fully investigate that yet though. (the BOM-character would not harm in the index).

You're right, since the word is in {{}} it won't be indexed.  So the extra "     1-0-B" won't cause any problems in the index.

I changed the first line to this (I removed the {{}} ):
懶洋洋 [01 lan3 yang2 yang2]   malvigla; langvora

Now 懶洋洋 is not in the index.  So I think the BOM must be causing a problem.
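
The BOM theory can be checked by looking at what the first word really contains after decoding. Below is a minimal Python sketch, not DfM code: `strip_bom` is a hypothetical helper, and the sample line is the entry quoted above. It assumes the input file was saved as "UTF-8 with BOM".

```python
import codecs

def strip_bom(text):
    """Return text without a leading U+FEFF (what the UTF-8 BOM decodes to)."""
    return text[1:] if text.startswith("\ufeff") else text

# Simulate a file saved as "UTF-8 with BOM": the BOM bytes precede the first entry.
line = "懶洋洋 [01 lan3 yang2 yang2]\tmalvigla; langvora\n"
raw = codecs.BOM_UTF8 + line.encode("utf-8")

decoded = raw.decode("utf-8")        # plain "utf-8" keeps the BOM as U+FEFF
first_word = decoded.split(" ")[0]   # this is "\ufeff懶洋洋", not "懶洋洋"
print(first_word == "懶洋洋")

cleaned = strip_bom(decoded)
print(cleaned.split(" ")[0] == "懶洋洋")

# Decoding with "utf-8-sig" removes the BOM automatically.
print(raw.decode("utf-8-sig") == cleaned)
```

If the first comparison is false and the second is true, the invisible BOM character is attached to the first word, which would explain why a lookup for 懶洋洋 finds nothing.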

Quote: With the
dictionaryGenerationLanguage1ExpressionSplitString: /
the index should look like
馬虎   1-152-B
马虎   1-152-S

Yes, the index should look like that.  I added the SplitString, but the index still looks like this:
馬虎 / 马虎   1-90-B
马虎   1-90-S

I included the newest build files.
#35
General discussions / Re: Build environment
14. June 2010, 23:54:11
That's great.  Thanks.  Would you mind editing newdict.html to replace any references to Windows procedures with Linux procedures?

I like deleting the "Dictionary" directory and then remaking it.  That will be useful for people.  I'll add that to the Windows scripts.

I.
I was debating putting DictionaryGeneration and JARCreator in the same batch file.  It makes it easier for users.  But, if they made a mistake in their DictionaryForMIDs.properties or Dictionary_input.txt, then DictionaryGeneration will give an error.  But then JARCreator will be run.  Then users may not see the DictionaryGeneration error.

If DictionaryGeneration still writes a Dictionary\DictionaryForMIDs.properties (but gives an error while building the index files), then JARCreator won't give an error.  Then users won't know there was an error (because JARCreator prints more information to the screen afterwards).

What do you guys think?:
1. put DictionaryGeneration and JARCreator in the same batch file
2. DictionaryGeneration and JARCreator in different batch files
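
A possible middle ground is to keep both steps in one script but stop as soon as a step fails. The Python sketch below only illustrates that control flow. It assumes each tool returns a non-zero exit code on error, which is not confirmed here for DictionaryGeneration, and it uses stand-in commands instead of the real jars. `run_steps` is a hypothetical helper.

```python
import subprocess
import sys

def run_steps(steps):
    """Run each (name, command) in order; stop at the first non-zero exit code.

    Returns the name of the failed step, or None if every step succeeded.
    """
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"{name} failed (exit code {result.returncode}); skipping the remaining steps.")
            return name
    return None

# Stand-in commands so the sketch runs anywhere; a real script would call
# the DictionaryGeneration and JarCreator jars here instead.
steps = [
    ("DictionaryGeneration", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("JarCreator", [sys.executable, "-c", "print('building JAR')"]),
]
failed = run_steps(steps)
```

With this flow the last message on the screen names the step that failed, so the error is not buried under output from the later step.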

Later, when fontgenerator.jar has a command-line version (it is currently GUI-only), we could add it to the build: DictionaryGeneration -> FontGenerator -> JARCreator

II.
What should we do for the newdict.html file?:
1. write Linux & Windows instructions in the same file
2. write separate newdict_Linux.html & newdict_Windows.html

I recommend #2.

The build process is the same for both systems.  But, the details are different.  For example, on a Windows machine you must add Java to the Windows Path:
Start Menu-> Control Panel -> System -> Advanced System Settings -> Environment Variables -> System Variables -> Path -> Edit

And Windows users must run the Command Prompt to use the batch files:
Start Menu -> Accessories -> Command Prompt

If they run the batch files from within Windows, then they won't see any error messages.

III.
Should we release separate Windows & Linux build environments, or should we put setup.bat & linux_mac_script.sh in the same ZIP file?

We have 2 options:
A.
1. ZIP file for Linux (only Linux scripts)
2. ZIP file for Windows (only Windows scripts)
3. self-extracting ZIP file for Windows (only Windows scripts)

B.
1. ZIP file for Linux & Windows (Linux & Windows scripts)
2. self-extracting ZIP file for Windows (Linux & Windows scripts)

I recommend B.  Personally I don't think the script files for the other system would confuse users too much.  What do you guys think?

Jeff
#36
Thanks, that looks better.  I still see some problems though.

1. There is still an error with:
{{懶洋洋}} [01 lan3 yang2 yang2]   malvigla; langvora

Before it gave this error:
{{懶洋洋}}     1-0-B  [extra {{}} ]

Now it gives this error:
   1-0-B  [the 懶洋洋 characters are missing]

2. If there are 2 words in the left column, then only the 2nd word is getting indexed:
馬虎 / 马虎 {{[01 mahu]}}   malzorga

Both "馬虎" and "马虎" should be indexed.  But, only "马虎" is indexed.  Here is the index:
馬虎 / 马虎   1-152-B
马虎   1-152-S

I thought that if there is a space between words, then they will both be indexed.  This is probably a different problem than the {{}} problem.

I added this line to DictionaryForMIDs.properties, but it didn't fix the problem:
dictionaryGenerationLanguage1ExpressionSplitString: /
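
The behaviour expected from the split string can be sketched in Python. This is only an assumption about the intended result, not the actual DictionaryGeneration code, and `index_keys` is a made-up name.

```python
def index_keys(expression, split_string="/"):
    """Split the left-column expression into separate index keys."""
    parts = expression.split(split_string) if split_string else [expression]
    # Trim the spaces around each part and drop empty pieces.
    return [part.strip() for part in parts if part.strip()]

print(index_keys("馬虎 / 马虎"))
print(index_keys("乳房"))
```

Under this reading, "馬虎 / 马虎" yields two keys (馬虎 and 马虎), so both should appear in the index as separate entries instead of the combined "馬虎 / 马虎" line.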
#38
Sounds good.
#39
I guess I'm still having problems with the {{}} too.

Here is the input dictionary file:
{{懶洋洋}} [01 lan3 yang2 yang2]     malvigla; langvora
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
{{馬虎}} [01 ma3 hu]     malzorga
馬虎 / 马虎 {{[01 mahu]}}     malzorga
{{哺乳}} [01 bu3 ru3]     mamnutri
哺乳 {{[01 buru]}}     mamnutri
乳房 {{[01 rufang]}}     mamo
{{拜金主義}} [01 bai4 jin1 zhu3 yi4]     mamonismo; monoadorado
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado

Here is the index:
bai jin zhu yi     1-282-B
bai4 jin1 zhu3 yi4     1-282-B
baijinzhuyi     1-346-B,1-346-B,1-346-B
bu ru     1-191-B
bu3 ru3     1-191-B
buru     1-224-B,1-224-B,1-224-B
bài jīn zhǔ yì     1-282-B
bǔ rǔ     1-191-B
hu     1-120-S,1-120-S,1-120-S
jin zhu yi     1-282-S
jin1 zhu3 yi4     1-282-S
jīn zhǔ yì     1-282-S
lan yang yang     1-0-B
lan3 yang2 yang2     1-0-B
lanyangyang     1-58-B,1-58-B,1-58-B
lǎn yáng yáng     1-0-B
ma hu     1-120-B
ma3 hu     1-120-B
mahu     1-152-B,1-152-B,1-152-B
mǎ hu     1-120-B
ru     1-191-S
ru3     1-191-S
rufang     1-254-B,1-254-B,1-254-B
rǔ     1-191-S
yang     1-0-S
yang yang     1-0-S
yang2     1-0-S
yang2 yang2     1-0-S
yi     1-282-S
yi4     1-282-S
yáng     1-0-S
yáng yáng     1-0-S
yì     1-282-S
zhu yi     1-282-S
zhu3 yi4     1-282-S
zhǔ yì     1-282-S
乳房 {{     1-254-B
哺乳 {{     1-224-B
哺乳}}     1-191-B
懒洋洋 {{     1-58-S
懶洋洋 / 懒洋洋 {{     1-58-B
懶洋洋}}     1-0-S
拜金主义 {{     1-346-S
拜金主義 / 拜金主义 {{     1-346-B
拜金主義}}     1-282-B
馬虎 / 马虎 {{     1-152-B
馬虎}}     1-120-B
马虎 {{     1-152-S
{{懶洋洋}}     1-0-B


All of the pinyin transcriptions are OK: none have extra {{ or }}.  But, all of the characters show extra {{ or }}.  I'm not sure if this is still a problem with DictionaryUpdateCEDICTChi.java, or if it is a different problem.

Also, it is strange that {{懶洋洋}} is in the index.  None of the other characters have both {{ and }}.  Maybe the problem is because it is the first word in the file?  Maybe the UTF-8 byte order mark (BOM) is causing a problem?

Also, "lanyangyang" and "baijinzhuyi" should not be indexed.  They are inside {{}}:
懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}     malvigla; langvora
拜金主義 / 拜金主义 {{[01 baijinzhuyi]}}     mamonismo; monoadorado
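
The expected rule, that anything inside {{}} stays out of the index, can be sketched in a few lines of Python. This only illustrates the rule; it is not the DictionaryGeneration or DictionaryUpdateCEDICTChi code, and `indexable_text` is a hypothetical helper.

```python
import re

# Matches one {{...}} block; non-greedy so two blocks on a line stay separate.
NO_INDEX = re.compile(r"\{\{.*?\}\}")

def indexable_text(column):
    """Remove {{...}} blocks and collapse the leftover whitespace."""
    return " ".join(NO_INDEX.sub(" ", column).split())

print(indexable_text("{{懶洋洋}} [01 lan3 yang2 yang2]"))
print(indexable_text("懶洋洋 / 懒洋洋 {{[01 lanyangyang]}}"))
```

For the first entry only the pinyin block remains, and for the second only "懶洋洋 / 懒洋洋" remains, so neither "lanyangyang" nor a stray "{{" or "}}" would reach the index.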

Jeff
#40
Can someone help me with Russian transcription?  I wanted to write a readme to help users with writing Russian transcription in DfM.  But, I don't know Russian myself.

DfM has 4 Cyrillic transcription Normation classes:
1. NormationRus2.java
2. NormationUkr.java
3. NormationRusC.java
4. NormationUkrC.java

A description of the normation classes is here:
http://dictionarymid.sourceforge.net/newdictNormationLang.html

NormationRus2.java:
Allows you to search words both in Cyrillic and Latin transcription (according to the GOST 1971 - but the yers (ъ, ь) are 'x' and no apostrophes are used).

I found GOST 16876-71 here:
http://en.wikipedia.org/wiki/GOST_16876-71

But, NormationRus2 is a little different from GOST 16876-71.

Cyrillic   GOST 16876-71   Rus2   Ukr    RusC   UkrC
а          a               a      a      a      a
б          b               b      b      b      b
в          v               v      v      v      v
г          g               g      h      g      h
д          d               d      d      d      d
е          e               e      e      e      e
ё          jo              yo     yo     jo     jo
ж          zh              zh     zh     z      z
з          z               z      z      z      z
и          i               i      i      i      i
ї                                 yi            ji
й          jj              y      y      j      j
к          k               k      k      k      k
л          l               l      l      l      l
м          m               m      m      m      m
н          n               n      n      n      n
о          o               o      o      o      o
п          p               p      p      p      p
р          r               r      r      r      r
с          s               s      s      s      s
т          t               t      t      t      t
у          u               u      u      u      u
ф          f               f      f      f      f
х          kh              kh     kh     ch     ch
ц          c               c      c      c      c
ч          ch              ch     ch     c      c
ш          sh              sh     sh     s      s
щ          shh             shh    shh    sc     sc
ъ                          x      x      x      x
ы          y               y      y      y      y
ь          '               x      x      x      x
э          eh              eh     eh     e      e
ю          ju              yu     yu     ju     ju
я          ja              ya     ya     ja     ja
ґ                                 g             g

Here are the 4 changes:
Cyrillic   GOST   NormationRus2
ё   jo   yo
й   jj   y
ю   ju   yu
я   ja   ya
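
The comparison can be reproduced mechanically. The Python sketch below copies a handful of rows from the table above (not the full alphabet) and lists the letters whose GOST 16876-71 and NormationRus2 values differ. ъ and ь are left out on purpose, because the documentation already states that they become 'x'. `differences` is a hypothetical helper.

```python
# Transliterations copied from the table above; only a subset of letters.
# ъ and ь are omitted because the documentation already says they become 'x'.
GOST = {"ё": "jo", "й": "jj", "х": "kh", "щ": "shh", "э": "eh", "ю": "ju", "я": "ja"}
RUS2 = {"ё": "yo", "й": "y", "х": "kh", "щ": "shh", "э": "eh", "ю": "yu", "я": "ya"}

def differences(a, b):
    """Return {letter: (a_value, b_value)} for letters whose values differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}

diff = differences(GOST, RUS2)
for letter, (gost, rus2) in diff.items():
    print(letter, gost, rus2)
```

Extending both dictionaries to the full table would make it easy to rerun the check if the Normation classes change.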

Were these 4 changes intentional?  Or, are they a mistake?


Also, NormationRusC and NormationUkrC state "according to the Czech ISO norm".  Does anyone know the ISO number?
#41
Yes, that is a feature, not a bug.  NormationEpo.java lets users search for ĉ, ĥ, ĵ, ĝ, ŝ, ŭ or c, h, j, g, s, u.  Many users do not have an input method editor (IME) to type Esperanto.  Therefore NormationEpo.java lets users type Esperanto on any cell phone.

It is not possible to force the dictionary to only search for ĉ.  It will always search for ĉ and c.
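
The idea behind this kind of normation can be shown with a short Python sketch. The mapping is an assumption based on the letters listed above. The real NormationEpo.java is Java code and may handle more cases.

```python
# Assumed mapping; the real NormationEpo.java may handle more cases.
EPO_MAP = str.maketrans("ĉĥĵĝŝŭ", "chjgsu")

def normalize(word):
    """Lower-case the word and replace accented Esperanto letters with base letters."""
    return word.lower().translate(EPO_MAP)

# Both spellings normalize to the same key, so either one finds the entry.
print(normalize("ĉevalo") == normalize("cevalo"))
```

Because the index key and the search term go through the same normalization, ĉ and c can no longer be told apart, which is why a search cannot be restricted to ĉ only.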
#42
Problems / Re: Content declaration bug
11. June 2010, 07:39:36
Scanning for "[01" instead of "[" would be useful.  I'd actually like 2 declarations for Chinese:
1. traditional Chinese (black - no declaration)
2. simplified Chinese (dark red)
3. pinyin (dark blue)

I can keep pinyin as "[01" and use "[02" for simplified Chinese.

Also, originally in the CEDICT dictionary there was no space after "[01" (example: [01tuo1 ci2]).  Personally I like a space after the "[01" (example: [01 tuo1 ci2]).  Currently [01 tuo1 ci2] works OK.  If you make changes to DictionaryUpdateCEDICTChi, can you please keep the option to use a space after "[01"?
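
A scan for "[" followed by a two-digit declaration, with an optional space after it, could look like the Python sketch below. It is only an illustration of the requested behaviour, not the DictionaryUpdateCEDICTChi code; `parse_declaration` and the pattern are assumptions.

```python
import re

# "[" + two digits + an optional single space, then the content up to "]".
DECLARATION = re.compile(r"\[(\d{2}) ?([^\]]*)\]")

def parse_declaration(text):
    """Return (declaration number, content) for the first "[NN...]" block, else None."""
    match = DECLARATION.search(text)
    return (match.group(1), match.group(2)) if match else None

print(parse_declaration("[01tuo1 ci2]"))
print(parse_declaration("[01 tuo1 ci2]"))
print(parse_declaration("[電/电]"))
```

Both spellings of the pinyin block give the same declaration and content, while a plain bracket without a two-digit number is not treated as a content declaration.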

There's no rush for the code fix.  Thanks for your help
Jeff
#43
Problems / Re: Content declaration bug
10. June 2010, 23:46:23
I changed the content declaration from 03 to 01.  Everything works OK now.  This should be fine.  The Chinese half only needs 1 content declaration.  The other language half can still use multiple declarations.

Thanks for your help
Jeff
#44
That solved the problem.  Thank you very much.

Jeff
#45
Something in de.kugihan.dictionaryformids.dictgen.dictionaryupdate.DictionaryUpdateCEDICTChi breaks the {{}}.

Here is the entry:
1) tute sin izoli de 2)  {{[電/电]}}  izoli | ~體/体 izolilo; dielektriko [Tab] {{絕緣}} [01jue2 yuan2]

If I use DictionaryUpdateCEDICTChi, then DfM incorrectly shows the {{}} on the display screen [see "broken.png"].

If I don't use DictionaryUpdateCEDICTChi, then the {{}} is correctly not shown [see "fixed.png"].

Also, the letters inside {{}} are getting indexed when using DictionaryUpdateCEDICTChi.
Here is the entry:
1) uniformo 2) subigi; submeti; subjugigi [Tab] {{制服}} [01zhi4 fu2]
1) uniformo 2) subigi; submeti; subjugigi [Tab] 制服 {{[01zhifu]}}

Here is the indexChi1.csv:
制服 {{   1-779-B
制服}}   1-708-B

Only one of the words should be indexed.  Also, there is still the {{ or }} in the entry.  These should not be there.

If I don't use DictionaryUpdateCEDICTChi, then the words are correctly not indexed.
Here is the indexChi1.csv:
制服   1-717-B

Jeff