Byte Order Mark (BOM)

Started by dreamingsky, 03. June 2010, 23:26:33

dreamingsky

There is a small bug in DictionaryGeneration regarding UTF-8 files.  I like to save all files as UTF-8.  But, when DictionaryForMIDs.properties is saved as UTF-8, DictionaryGeneration gives an error.

The error is about the byte order mark (BOM) saved at the beginning of the DictionaryForMIDs.properties file.
http://en.wikipedia.org/wiki/Byte_order_mark

When I run DictionaryGeneration, I get this error:

Thrown de.kugihan.dictionaryformids.general.DictionaryException: Property infoText not found / Property infoText not found

infoText is the first item in DictionaryForMIDs.properties, so the BOM bytes end up directly in front of its key and the lookup for infoText fails.

In UTF-8, U+FEFF is stored as the bytes EF BB BF.  So, would it be possible for DictionaryGeneration to ignore EF BB BF at the beginning of the file?

An easy work-around is to add a line break at the beginning of DictionaryForMIDs.properties.  But, new people trying to build a new dictionary might not know that.
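The failure and the line-break work-around described above can be reproduced with a small sketch. It uses java.util.Properties from Java SE as a stand-in, on the assumption that the Properties class in DfM parses keys the same way; the class name BomKeyDemo and the value "hello" are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class BomKeyDemo {
    // Encodes the text as UTF-8 and loads it through java.util.Properties,
    // which decodes an InputStream as ISO-8859-1.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // U+FEFF becomes the bytes EF BB BF, which are read as three
        // ISO-8859-1 characters glued to the front of the first key.
        Properties withBom = load("\uFEFFinfoText: hello\n");
        System.out.println(withBom.getProperty("infoText"));                   // null
        System.out.println(withBom.getProperty("\u00EF\u00BB\u00BFinfoText")); // hello

        // Work-around: with a line break after the BOM, the stray bytes
        // form a key of their own and infoText is parsed normally.
        Properties withLineBreak = load("\uFEFF\ninfoText: hello\n");
        System.out.println(withLineBreak.getProperty("infoText"));             // hello
    }
}
```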

Jeff

Gert

Jeff,

Good hint!

I just looked in Properties.java (which is taken from GNU Classpath):

  protected String propertyCharEncoding = "ISO-8859-1";


I.e. UTF-8 would likely not work at all for non-ASCII characters. Jeff, did you ever succeed in putting non-ASCII characters in DictionaryForMIDs.properties?
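To illustrate the point, here is a minimal Java SE sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. It does not use the DfM Properties class; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Plain ASCII is identical in both encodings and survives.
        System.out.println(utf8ReadAsLatin1("infoText").equals("infoText")); // true

        // U+0109 (c with circumflex) is the two bytes C4 89 in UTF-8,
        // so it comes back as two unrelated ISO-8859-1 characters.
        String garbled = utf8ReadAsLatin1("\u0109");
        System.out.println(garbled.length());                 // 2
        System.out.println(garbled.equals("\u00C4\u0089"));   // true
    }
}
```

So a UTF-8 file containing only ASCII characters (and no BOM) loads fine, while anything beyond ASCII gets garbled.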

I need to investigate the origins of that "ISO-8859-1" some more; possibly it is a Java standard for property files? Is there any Java developer out there who knows? Would it do any harm to change the encoding to UTF-8?

Regards,
Gert

jn0101

I've experienced the same. Properties files should be ISO-8859-1, and I wanted special characters like ĉ, ĝ, ŭ.

I solved it by using UTF-8 in my scripts and then just piping the output through the command "native2ascii -encoding UTF-8", like this:

echo "
infoText: blabla. Kiel serĉi:\nPor tajpi literojn kun akcentoj (ekz ĉ, ĝ, ŭ) vi simple tajpu la bazan varianton de la litero (ekz c, g, c).

... (stuff)

" | native2ascii -encoding UTF-8 > DictionaryForMIDs.properties

The resulting file then contains the escaped values:

infoText: blabla. Kiel ser\u0109i:\nPor tajpi literojn kun akcentoj (ekz \u0109, \u011d, \u016d) vi simple tajpu la bazan varianton de la litero (ekz c, g, c).
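For anyone who wants to do the same escaping from Java code instead of the command line, here is a rough sketch of the idea. It is not the native2ascii implementation, only an approximation of what it does to non-ASCII characters; the class and method names are made up for illustration.

```java
public class UnicodeEscapeSketch {
    // Replaces every char above 0x7F with a backslash, the letter u,
    // and four lowercase hex digits. ASCII chars are copied unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains U+0109 (c with circumflex), written here
        // as a Java escape so the source file itself stays ASCII.
        System.out.println(escapeNonAscii("infoText: Kiel ser\u0109i"));
    }
}
```

Running it prints the line with ĉ replaced by the six characters \u0109, matching the escaped form shown above.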

Gert

Yes, here is an extract from the spec for the 'Properties' class on Java SE:

The load(Reader) / store(Writer, String) methods load and store properties from and to a character based stream in a simple line-oriented format specified below. The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.


Remark: only streams are available on Java ME, no Reader/Writer.

Well, technically it should be rather easy to support UTF-8 instead of ISO-8859-1 by modifying the source of the Properties class for DfM. But files encoded in ISO-8859-1 need to keep working as well.

Hmmm, maybe an auto-detection of the encoding based on the first three bytes is doable (see Jeff's posting). But that would not be so easy to implement.

I need to give that some more thought.
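As a starting point for discussion, the detection idea could look roughly like the sketch below. It is written against Java SE and works on a byte array that has already been read from the file, so it says nothing about how this would fit into the stream-based DfM Properties class or Java ME. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffSketch {
    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // With a BOM: skip the three bytes and decode the rest as UTF-8.
    // Without a BOM: decode everything as ISO-8859-1, as before.
    static String decode(byte[] data) {
        if (hasUtf8Bom(data)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                              (byte) 0xC4, (byte) 0x89};
        byte[] latin1 = {(byte) 0xE9};
        System.out.println(decode(utf8WithBom).equals("\u0109")); // true
        System.out.println(decode(latin1).equals("\u00E9"));      // true
    }
}
```

One limitation of this approach: a UTF-8 file saved without a BOM would still be treated as ISO-8859-1, so non-ASCII characters in such a file would remain garbled.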

Gert

dreamingsky

Thanks for looking into this.  It's no problem if the DictionaryForMIDs.properties must be in ISO-8859-1.  Personally I save everything in UTF-8.  So it's just a personal preference.

And I was thinking of the build environment for new users of DfM.  We have UTF-8 for all the options, so I thought it might be confusing if only the DictionaryForMIDs.properties is saved in ISO-8859-1.

dictionaryGenerationInputCharEncoding: UTF-8
indexCharEncoding: UTF-8
searchListCharEncoding: UTF-8
dictionaryCharEncoding: UTF-8

I like Jacob's idea of saving the user's DictionaryForMIDs.properties in UTF-8 and then changing it to ISO-8859-1 in dictionary\DictionaryForMIDs.properties (using \u0109, for example).

Jeff