DictionaryForMIDs XML dictionaries

Started by Gert, 27. May 2010, 20:16:26

Gert

Colleagues,

Several dictionaries out there are maintained in an XML format. Some of them use an XML structure that is specific to one tool; others adhere more or less to a standard such as TEI or XDXF (both TEI and XDXF are XML-based).

In order to ease set-up of such dictionaries in XML format, I am currently establishing an XML representation for the DictionaryForMIDs dictionaries. I call it "DfM_XML_schema".

The main task in setting up a dictionary that is maintained in XML format is to transform its structure into the DfM_XML_schema.

Dictionaries in the format of DfM_XML_schema can be automatically converted into an 'inputdictionaryfile' (see below). The inputdictionaryfile is then run through DictionaryGeneration/JarCreator.

This makes it possible to quickly set up a dictionary that is maintained in XML format (well, some experience with XSLT or another XML transformation tool will be useful to get it done quickly).


Here is some technical information:
1.
There is the XML schema file: DfM_XML_Schema.xsd. Look at http://www.kugihan.de/dict/download/Preprocessing/DfM_XML_Schema.xsd (not yet documented and still in progress)

2.
There is the XSLT transformation script that converts an XML file in the format of DfM_XML_Schema into an inputdictionaryfile: http://www.kugihan.de/dict/download/Preprocessing/DfM_XML_to_inputdictionaryfile.xsl (still in progress)

3.
Here is a sample file in the format of DfM_XML_Schema:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.kugihan.de/DfMSchema DfMSchema.xsd ">
  <translationOfDictionary>
    <translationForLanguage includeInIndex="true">
      <translationForLanguagePart>
        <partNonContent>cat</partNonContent>
      </translationForLanguagePart>
    </translationForLanguage>
    <translationForLanguage includeInIndex="true">
      <translationForLanguagePart>
        <partNonContent>chat</partNonContent>
      </translationForLanguagePart>
    </translationForLanguage>
  </translationOfDictionary>
  <translationOfDictionary>
    <translationForLanguage includeInIndex="true">
      <translationForLanguagePart>
        <partNonContent>bird</partNonContent>
      </translationForLanguagePart>
    </translationForLanguage>
    <translationForLanguage includeInIndex="true">
      <translationForLanguagePart>
        <partNonContent>oiseau</partNonContent>
      </translationForLanguagePart>
    </translationForLanguage>
  </translationOfDictionary>
</dictionary>


Run this file through any XSLT processor together with the XSLT script from point 2, and the output will be the inputdictionaryfile:
cat    chat
bird   oiseau
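
To make the conversion step concrete, here is a minimal XSLT 1.0 sketch that turns the sample file above into one line per entry. It is only an illustration and not the actual DfM_XML_to_inputdictionaryfile.xsl; in particular, the tab as column separator and the direct concatenation of several parts are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: NOT the actual DfM_XML_to_inputdictionaryfile.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain text output, no XML declaration -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Ignore whitespace-only text nodes that come from indentation -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="/dictionary">
    <!-- One output line per translationOfDictionary element -->
    <xsl:for-each select="translationOfDictionary">
      <xsl:for-each select="translationForLanguage">
        <!-- Assumption: the language columns are separated by a tab -->
        <xsl:if test="position() &gt; 1">
          <xsl:text>&#9;</xsl:text>
        </xsl:if>
        <!-- Assumption: several parts of one language are simply concatenated -->
        <xsl:for-each select="translationForLanguagePart/partNonContent">
          <xsl:value-of select="."/>
        </xsl:for-each>
      </xsl:for-each>
      <!-- End of entry: newline -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The sketch does not escape newlines or tabs inside the text, and it ignores the includeInIndex attribute.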


Well, that does not yet look sophisticated, I know. But it will help a lot to set up XML dictionaries!

Important note: all of the above is work in progress and is not yet ready to use. I plan to improve the above files over the coming weeks.

OK, I will keep you updated on the progress!
Gert

jn0101


First, before starting out to define a new XML format, I trust you have checked and are extremely sure that no existing format could possibly be used.

There are already far too many ways of representing a dictionary in XML out there.

Here is, just as an example, Apertium's (http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-en-fr/apertium-en-fr.en-fr.dix?revision=14163&view=markup):

  <e><p><l>bird<s n="n"/></l><r>oiseau<s n="n"/><s n="m"/></r></p></e>
  <e r="RL"><p><l>cat<s n="n"/></l><r>chat<s n="n"/></r></p></e>
(e=entry, p=pair, l=left, r=right, s=symbol)

By using an existing format you get a bunch of tools for free. For example, the Apertium format can be transformed and used in a lot of ways, among others by http://wiki.apertium.org/wiki/Apertium-dixtools .

I am not suggesting adopting Apertium's format. But I strongly encourage you to make sure an existing format (and its toolset and existing data) cannot be adopted.



Second, your format seems extremely verbose. Uncompressed files will be extremely large, about a factor of 30 larger than the real information content in them (the cat=chat entry takes 12 lines of approx. 30 chars each).
Consider a flatter XML structure and some shorter names for the inner elements.

The tags also don't seem intuitive:

What is a <translationOfDictionary>? (You don't translate dictionaries, you translate words!)
Rename it to <entry> or at least <translationOfWord>.

Rename <translationForLanguage> to <translation>, and rename includeInIndex="true" to index. Make the default value be "true" so the attribute only has to be put on those entries which should be marked "false" (to be excluded from index, I suppose).

Rename <translationForLanguagePart> to <part> and consider whether you really need an extra level here.

What is <partNonContent>??



Third, you have no notion of naming the languages. I'd add an attribute lang="fr" to <translationForLanguage> / <translation>.
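
Taken together, the renamings suggested so far might lead to something like the following. This is only a hypothetical layout for illustration, not an existing schema: the element names <entry> and <translation> and the attributes index and lang are the suggestions from above, and the extra part level is left out on the assumption that it is not needed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical flatter layout following the suggestions above; not an existing schema -->
<dictionary>
  <entry>
    <translation lang="en">cat</translation>
    <translation lang="fr">chat</translation>
  </entry>
  <entry>
    <translation lang="en">bird</translation>
    <!-- index is assumed to default to "true"; only exclusions carry the attribute -->
    <translation lang="fr" index="false">oiseau</translation>
  </entry>
</dictionary>
```

Each entry then takes 4 short lines instead of 12.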



Fourth, I am not sure that XML is a win. These days many people are moving away from XML and back to plain files, as XML really hasn't proved to be the universal, easy-to-read and easily transformable format for everyone that we were all promised 10 years ago. Comma-separated files are easier for most people (they can be edited in spreadsheets, for example), and for dictionaries I really can't see where the extreme flexibility (at the cost of easy parsing) is a win.


Yeah, I know I'm negative (and I might be wrong :-), but just consider my thoughts.

Jacob

Gert

First of all, I need some support with XSLT! Here is what I'd like to do:

   <xsl:template match="partNonContent">
      <xsl:call-template name="replaceEscapeCharacters">
          <xsl:with-param name="characterString" select="."/>
      </xsl:call-template>
   </xsl:template>


replaceEscapeCharacters should be a procedure/function (a named template) that replaces newlines with \n, tabs with \t etc.

Can anyone explain to me how to implement replaceEscapeCharacters with XSLT? I need help here!


@jn0101:
Just let me elaborate a little on the background of the DfM_XML_schema: there is an almost endless number of dictionary formats out there. In order to set up a dictionary for DfM, that 'external dictionary format' must be brought into the format of an 'inputdictionaryfile' for DictionaryGeneration. Most likely this will be done by some sort of script or conversion application. One example of this is the DictdToDictionaryForMIDs application, which converts dictionaries in the DICT format to an inputdictionaryfile.

Now, of these 'almost endless dictionary formats', several are XML formats. To me it seems somewhat cumbersome to convert the XML format directly to the format of an inputdictionaryfile. On the other hand, there are several XML transformation tools which rather easily produce XML output. So I set up the DfM_XML_schema, which matches the structure of an inputdictionaryfile (and, in a second step, also DictionaryForMIDs.properties). And I provide an XSLT script that converts the format of DfM_XML_schema to inputdictionaryfiles (well, if only someone could help me with the replaceEscapeCharacters above  :-[ )

Support for {put your favourite XML schema here} will be provided by XSLT scripts that map {put your favourite XML schema here} to DfM_XML_schema. (instead of XSLT scripts you can use any other XML transformation tool)
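
As an illustration of such a mapping script, here is a rough XSLT 1.0 sketch for Apertium-style entries like the ones quoted earlier in this thread. It is an assumption-laden example, not one of the real scripts: it only handles plain <e><p><l>…</l><r>…</r></p></e> entries, drops the <s> symbols, and ignores everything else a real .dix file may contain.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: maps simple Apertium <e><p><l/><r/></p></e> entries to the DfM XML structure -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Wrap all converted entries in the DfM root element -->
  <xsl:template match="/">
    <dictionary>
      <xsl:apply-templates select="//e/p"/>
    </dictionary>
  </xsl:template>

  <!-- One pair becomes one translationOfDictionary -->
  <xsl:template match="p">
    <translationOfDictionary>
      <!-- l and r are processed in document order: left language first -->
      <xsl:apply-templates select="l | r"/>
    </translationOfDictionary>
  </xsl:template>

  <!-- Each side of the pair becomes one language column -->
  <xsl:template match="l | r">
    <translationForLanguage includeInIndex="true">
      <translationForLanguagePart>
        <!-- The <s n="..."/> symbols are empty elements, so only the word text remains -->
        <partNonContent><xsl:value-of select="normalize-space(.)"/></partNonContent>
      </translationForLanguagePart>
    </translationForLanguage>
  </xsl:template>

</xsl:stylesheet>
```

The output of such a script can then be fed into the XSLT script that produces the inputdictionaryfile.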

(your 'second'): Yes, the schema is very verbose. However, I'd like to mention that it is only used as a preprocessing step prior to running DictionaryGeneration; a verbose schema will not affect the size of the resulting dictionary in the Jar file.

You are right, the naming of the elements/tags should be improved; currently the naming is not easily understandable ... I will improve this when I find time ;)

partNonContent is a 'part' that does not have a content inside, i.e. nothing like [01 verb] (see http://dictionarymid.sourceforge.net/newdictContent.html)

(your 'third'): the sequence of the elements defines the languages; same as the column sequence of the inputdictionaryfile

(your 'fourth'): I intend to use the DfM_XML_schema as an aid to set up XML-based dictionaries for DictionaryForMIDs, as an intermediate step towards the inputdictionaryfile. It should help those people who set up an XML-based dictionary ... well, at least for me personally it is helpful   ::)

Please do not forget my problem with replaceEscapeCharacters!

Gert

jn0101

Gert, thanks for the explanations, your intention is much clearer now, and even in my conservative mind, it seems a good idea.

About replaceEscapeCharacters, I cannot help.
(XSLT is the no. 1 language I just love to hate ;)

Jacob

Gert

Fine !

By the way, regarding my problem with replaceEscapeCharacters in XSLT: it seems that I am able to solve it with a recursive call (no more assistance needed there  ;D )
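
For anyone with the same problem, a recursive solution in XSLT 1.0 could look roughly like the sketch below. It is not necessarily the code that ends up in the real script: the helper template replaceString and its parameter names are made up for this example, and only newline and tab are handled.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical helper: replaces every occurrence of $search in $text
       with $replacement by recursing on the rest of the string.
       $search must not be empty. -->
  <xsl:template name="replaceString">
    <xsl:param name="text"/>
    <xsl:param name="search"/>
    <xsl:param name="replacement"/>
    <xsl:choose>
      <xsl:when test="contains($text, $search)">
        <!-- Copy the part before the match, emit the replacement, recurse on the rest -->
        <xsl:value-of select="substring-before($text, $search)"/>
        <xsl:value-of select="$replacement"/>
        <xsl:call-template name="replaceString">
          <xsl:with-param name="text" select="substring-after($text, $search)"/>
          <xsl:with-param name="search" select="$search"/>
          <xsl:with-param name="replacement" select="$replacement"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- No further match: copy the remainder unchanged -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template name="replaceEscapeCharacters">
    <xsl:param name="characterString"/>
    <!-- First pass: newline becomes the two characters \n -->
    <xsl:variable name="withoutNewlines">
      <xsl:call-template name="replaceString">
        <xsl:with-param name="text" select="$characterString"/>
        <xsl:with-param name="search" select="'&#10;'"/>
        <xsl:with-param name="replacement" select="'\n'"/>
      </xsl:call-template>
    </xsl:variable>
    <!-- Second pass: tab becomes the two characters \t -->
    <xsl:call-template name="replaceString">
      <xsl:with-param name="text" select="string($withoutNewlines)"/>
      <xsl:with-param name="search" select="'&#9;'"/>
      <xsl:with-param name="replacement" select="'\t'"/>
    </xsl:call-template>
  </xsl:template>

  <!-- The calling template from the earlier post -->
  <xsl:template match="partNonContent">
    <xsl:call-template name="replaceEscapeCharacters">
      <xsl:with-param name="characterString" select="."/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

Further characters can be handled by chaining more passes in the same way. If the backslash itself has to be escaped as well, that pass would have to come first so that the backslashes inserted by the other passes are not escaped again. With an XSLT 2.0 processor the built-in replace() function can be used instead of the recursion.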

Regards,
Gert

Gert

I spent quite a few night hours learning that XSLT stuff (really not as easy as I thought ...). Once you know how to do it, you really can set up a dictionary quickly.

I uploaded a few XSLT-scripts here: http://dictionarymid.svn.sourceforge.net/viewvc/dictionarymid/trunk/Preprocessing/XML

And I will provide documentation as a web page soon.

Regards,
Gert