March 30, 2007
In one of my last posts I mentioned that one of the problems we encounter in current knowledge bases is the fact that polymer information is quite often present in free text. It is therefore very hard to extract information from these sources (although it can be done, see Peter Corbett’s OSCAR system) and even when it is accomplished, one is quite often faced with the problem of what the extracted information means. Take your favourite search engine and look for the term “cook” for example. The search engine will most likely retrieve information about people called “Cook”, about “cook” the profession, the Cook Islands or Cook County, Illinois.
One way around this is to add more descriptive data to the data contained in web pages and other documents, or, in other words, data about data. If we could mark up the term “cook” as a person, a profession or a place name according to the context in which we use it, a machine would have a much easier time finding the bits of information we are really interested in. Data about data is also called “metadata”, and one way of adding metadata to documents is through the use of markup languages and, in our case, through the use of Extensible Markup Language (XML) and its dialects.
Now the concept of a markup language should not be unfamiliar. Every internet user will have heard of HTML, the Hypertext Markup Language, which can be used to structure text into headings, tables, paragraphs etc. XML, just like HTML, belongs to the class of descriptive markup languages.
If you use wikis at all, then you will have come across and used another type of markup, which is used for purely presentational purposes. And maybe you write your papers in LaTeX and deal with PostScript files a lot, in which case you will have had exposure to procedural markup languages too.
Now according to the Wikipedia entry on XML, the latter “provides a text-based means to describe and apply a tree-based structure to information. At its base level, all information manifests as text, interspersed with markup that indicates the information’s separation into a hierarchy of character data, container-like elements, and attributes of those elements.” In an XML document, metadata is written as tags in angle brackets (“<” and “>”), and the tags, in turn, enclose the data to be described. This is what is meant by a container. Let’s look at a simple XML document, a recipe for baking bread (also taken from the Wikipedia article):
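What follows is a minimal sketch of what such a recipe document can look like, held in a Python string so that it can be parsed and printed as a tree. The element and attribute names are the ones discussed in this post; the amounts, the step texts and the `outline` helper are illustrative assumptions, not the original listing.

```python
import xml.etree.ElementTree as ET

# Illustrative recipe document. Element and attribute names follow the post;
# the concrete values are assumptions made for this sketch.
RECIPE_XML = """\
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
  <title>Basic bread</title>
  <ingredient amount="3" unit="cups">Flour</ingredient>
  <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
  <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
  <ingredient amount="1" unit="teaspoon">Salt</ingredient>
  <instructions>
    <step>Mix all ingredients together.</step>
    <step>Knead thoroughly.</step>
    <step>Cover with a cloth and leave for one hour in a warm room.</step>
    <step>Knead again, place in a tin, and bake in the oven.</step>
  </instructions>
</recipe>
"""


def outline(element, depth=0):
    """Return one line per element: tag, attributes and text, indented by depth."""
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    text = (element.text or "").strip()
    if text:
        line += ": " + text
    lines = [line]
    # Child elements are the nested "containers" of the hierarchy.
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(RECIPE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each printed line corresponds to one container, indented by how deeply it is nested, with its attributes and character data shown beside the tag name.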
We see that there are a number of containers with labels (known as “elements”) such as “recipe”, “title”, “ingredient”, “instructions” and “step”. Some of these carry a number of attributes, such as “name”, “prep_time”, “unit” and “state”, which specify further information concerning that element.
Looking at this example, you will hopefully have realized that XML is eminently human-readable and that you don’t have to be a computer genius to figure out what is going on in the document. You will hopefully also realize that this markup now makes it easy for a computer to, for example, extract all the ingredients from the text, as they are explicitly labelled as such.
In my next post, I’ll discuss how to mark up chemistry and molecules, but maybe you can already begin to see how this structuring of information could be useful for polymers.
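To make the point about machine extraction concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The shortened recipe snippet and the `extract_ingredients` helper are assumptions made for illustration; only the element and attribute names come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative recipe; the values are assumptions for this sketch.
snippet = """\
<recipe name="bread">
  <title>Basic bread</title>
  <ingredient amount="3" unit="cups">Flour</ingredient>
  <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
  <instructions>
    <step>Mix all ingredients together.</step>
  </instructions>
</recipe>
"""


def extract_ingredients(xml_text):
    """Return (name, amount, unit) for every <ingredient> element."""
    root = ET.fromstring(xml_text)
    # iter() visits matching elements at any depth, in document order.
    return [
        (item.text, item.get("amount"), item.get("unit"))
        for item in root.iter("ingredient")
    ]


ingredients = extract_ingredients(snippet)
for name, amount, unit in ingredients:
    print(f"{amount} {unit} of {name}")
```

Because the ingredients are explicitly labelled, the program never has to guess which words in the text are ingredients; it simply asks for the elements with that name.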