DictionaryForMids Forum

DfM-Creator => DfM-Creator - DictionaryGeneration => Topic started by: dreamingsky on 03. June 2010, 23:26:33

Title: Byte Order Mark (BOM)
Post by: dreamingsky on 03. June 2010, 23:26:33
There is a small bug in DictionaryGeneration regarding UTF-8 files.  I like to save all files as UTF-8.  But, when DictionaryForMIDs.properties is saved as UTF-8, DictionaryGeneration gives an error.

The error is about the byte order mark (BOM) saved at the beginning of the DictionaryForMIDs.properties file.
http://en.wikipedia.org/wiki/Byte_order_mark

When I run DictionaryGeneration, I get this error:

Thrown de.kugihan.dictionaryformids.general.DictionaryException: Property infoText not found / Property infoText not found

infoText is the first item in DictionaryForMIDs.properties, so the BOM bytes end up attached to its key and the lookup for "infoText" fails.
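The effect can be reproduced with a minimal sketch. It uses the standard java.util.Properties as a stand-in for DfM's own Properties class (an assumption; the DfM class is taken from GNU Classpath and may differ in detail), and the class and helper names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class BomKeyDemo {
    // Loads properties from raw bytes via load(InputStream), which decodes as ISO-8859-1.
    static Properties loadFromBytes(byte[] data) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the UTF-8 encoding of the text with the BOM bytes EF BB BF in front.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        // The three BOM bytes are read as three Latin-1 characters glued to the key.
        Properties broken = loadFromBytes(withBom("infoText: hello\n"));
        System.out.println(broken.getProperty("infoText")); // prints null

        // Work-around: a leading line break puts the BOM on a line of its own.
        Properties worked = loadFromBytes(withBom("\ninfoText: hello\n"));
        System.out.println(worked.getProperty("infoText")); // prints hello
    }
}
```

With the leading line break, the BOM becomes a separate, meaningless key with an empty value, which is why the work-around helps.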

In UTF-8, U+FEFF is stored as EF BB BF.  So, in DictionaryGeneration, would it be possible for it to ignore EF BB BF at the beginning of the file?
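A possible way to ignore the three bytes, as a rough sketch only: wrap the input in a PushbackInputStream, peek at the first three bytes, and push them back if they are not EF BB BF. This assumes Java SE (where DictionaryGeneration runs); BomSkipper, skipUtf8Bom and readWithoutBom are hypothetical names, not part of DfM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class BomSkipper {
    // Returns a stream positioned after a leading EF BB BF if present, otherwise at the start.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break; // file shorter than three bytes
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pb;
    }

    // Demo helper: strips a BOM from an in-memory file and decodes the rest as ISO-8859-1.
    static String readWithoutBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return out.toString("ISO-8859-1");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The stream returned by skipUtf8Bom could then be handed to the existing properties loader unchanged. Skipping the BOM only fixes the first key; non-ASCII characters in the file would still be read as ISO-8859-1.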

An easy work-around is to add a line break at the beginning of DictionaryForMIDs.properties.  But, new people trying to build a new dictionary might not know that.

Jeff
Title: Re: Byte Order Mark (BOM)
Post by: Gert on 04. June 2010, 08:07:51
Jeff,

Good hint!

I just looked in Properties.java (which is taken from GNU Classpath):

  protected String propertyCharEncoding = "ISO-8859-1";


I.e. UTF-8 would likely not work at all for non-ASCII characters. Jeff, did you ever succeed in putting non-ASCII characters in DictionaryForMIDs.properties?

I need to investigate the origins of that "ISO-8859-1" further; possibly it is a Java standard for property files? Does any Java developer out there know? Would it do any harm to change the encoding to UTF-8?

Regards,
Gert
Title: Re: Byte Order Mark (BOM)
Post by: jn0101 on 04. June 2010, 15:35:42
I've experienced the same. Properties files should be ISO-8859-1, and I wanted special characters like ĉ, ĝ, ŭ.

I solved it by using UTF-8 in my scripts and then piping the output through the command "native2ascii -encoding UTF-8", like this:

echo "
infoText: blabla. Kiel serĉi:\nPor tajpi literojn kun akcentoj (ekz ĉ, ĝ, ŭ) vi simple tajpu la bazan varianton de la litero (ekz c, g, c).

... (stuff)

" | native2ascii -encoding UTF-8 > DictionaryForMIDs.properties

The file will then contain the escaped values:

infoText: blabla. Kiel ser\u0109i:\nPor tajpi literojn kun akcentoj (ekz \u0109, \u011d, \u016d) vi simple tajpu la bazan varianton de la litero (ekz c, g, c).
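For build setups where native2ascii is not at hand, the same kind of escaping can be approximated in a few lines of Java. This is only a sketch of the conversion shown above, not the tool's actual implementation; AsciiEscaper and escapeNonAscii are hypothetical names.

```java
class AsciiEscaper {
    // Replaces every char above 0x7F with a backslash-u escape of four lowercase
    // hex digits and leaves plain ASCII untouched.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string literal contains the letter c-circumflex (U+0109), written as a
        // Java source escape; the printed line contains the six-character escape text.
        System.out.println(escapeNonAscii("Kiel ser\u0109i"));
    }
}
```

The escaped output is pure ASCII, so it reads the same whether the properties loader assumes ISO-8859-1 or UTF-8.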
Title: Re: Byte Order Mark (BOM)
Post by: Gert on 04. June 2010, 19:12:58
Yes, here is an extract from the spec for the 'Properties' class on Java SE:

The load(Reader) / store(Writer, String) methods load and store properties from and to a character based stream in a simple line-oriented format specified below. The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
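On Java SE, the load(Reader) variant from this extract allows reading a UTF-8 file directly. A minimal sketch with the standard java.util.Properties (not DfM's own Properties class; Utf8PropertiesLoader and loadUtf8 are hypothetical names):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8PropertiesLoader {
    // Decodes the stream as UTF-8 and parses it with Properties.load(Reader).
    static Properties loadUtf8(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Note that Java's UTF-8 decoder does not drop a leading BOM; it arrives as the character U+FEFF in front of the first key, so a BOM still has to be removed separately.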


Remark: only streams are available on Java ME,  no Reader/Writer.

Well, technically it should be rather easy to support UTF-8 instead of ISO-8859-1 by modifying the source of the Properties class for DfM. But ISO-8859-1 files need to keep working as well.

Hmmm, maybe an auto-detection of the encoding based on the first three bytes is doable (see Jeff's posting). But that would not be so easy to implement.

I need to give this some more thought.
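As a rough illustration of what such an auto-detection could look like: treat the file as UTF-8 only if it starts with EF BB BF, and fall back to ISO-8859-1 otherwise. This is a sketch under that assumption, not DfM code; EncodingSniffer, detectEncoding and decode are hypothetical names, and it works on a byte array so that no Reader/Writer is needed.

```java
import java.io.UnsupportedEncodingException;

class EncodingSniffer {
    // Returns "UTF-8" if the data starts with the BOM bytes EF BB BF, else "ISO-8859-1".
    static String detectEncoding(byte[] data) {
        boolean bom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return bom ? "UTF-8" : "ISO-8859-1";
    }

    // Decodes the file content with the detected encoding, leaving out the BOM itself.
    static String decode(byte[] data) {
        String encoding = detectEncoding(data);
        int offset = "UTF-8".equals(encoding) ? 3 : 0;
        try {
            return new String(data, offset, data.length - offset, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("encoding not supported: " + encoding, e);
        }
    }
}
```

The obvious limitation is that a UTF-8 file saved without a BOM would still be read as ISO-8859-1, so existing files keep their current behaviour but BOM-less UTF-8 files gain nothing.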

Gert
Title: Re: Byte Order Mark (BOM)
Post by: dreamingsky on 04. June 2010, 22:35:17
Thanks for looking into this.  It's no problem if the DictionaryForMIDs.properties must be in ISO-8859-1.  Personally I save everything in UTF-8.  So it's just a personal preference.

And I was thinking of the build environment for new users of DfM.  We have UTF-8 for all the options, so I thought it might be confusing if only the DictionaryForMIDs.properties is saved in ISO-8859-1.

dictionaryGenerationInputCharEncoding: UTF-8
indexCharEncoding: UTF-8
searchListCharEncoding: UTF-8
dictionaryCharEncoding: UTF-8

I like Jacob's idea of saving the user's DictionaryForMIDs.properties in UTF-8 and then changing it to ISO-8859-1 in dictionary\DictionaryForMIDs.properties (using \u0109, for example).

Jeff