David Cheng successfully passes his PhD examination
Wednesday 11 July 2007
Congratulations to Tai (David) Cheng, who successfully defended his thesis on Corpus Analysis and Market Sentiment on 9 July 2007.
Information extraction and retrieval have been of interest to researchers since the early 1960s. There are two main approaches to the design of Information Extraction (IE) systems: the Knowledge Engineering Approach (KEA) and the Automatic Training Approach (ATA).
In the KEA, grammars expressing rules for the system are constructed by hand using knowledge of the application domain. The knowledge engineer's skill and expertise in the application domain are essential to the system's level of performance, and hand-crafted systems often perform best. However, such development can be very lengthy because it requires manual investigation of domain-relevant texts, and it is difficult to accommodate changes in system specifications. Moreover, the required expertise can be difficult to obtain in some cases. The ATA, by contrast, does not require a system expert when customising the system for a new domain; someone with enough knowledge of the domain to annotate a set of training documents is sufficient. Once a training corpus has been annotated and a training algorithm has been run, the system is able to analyse novel texts. Because this approach relies on training data, a well-annotated corpus is needed, and this can be difficult and expensive to obtain.
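The contrast between the two approaches can be illustrated with a minimal sketch. It is not taken from the thesis: the pattern, the function names, and the toy training sentences are all hypothetical, and the "training" step is a deliberately simple word-count vote standing in for a real training algorithm.

```python
import re
from collections import Counter

# KEA-style: a rule written by hand from domain knowledge.
# The pattern is a hypothetical example, not a rule from the thesis.
PRICE_MOVE_RULE = re.compile(
    r"(?P<company>[A-Z][A-Za-z]+) shares "
    r"(?P<direction>rose|fell) (?P<percent>\d+(?:\.\d+)?)%"
)

def extract_with_rule(sentence):
    """Return the fields matched by the hand-written rule, or None."""
    match = PRICE_MOVE_RULE.search(sentence)
    return match.groupdict() if match else None

# ATA-style: behaviour is learned from annotated examples instead.
def train(annotated):
    """Count, for each word, how often it occurs under each label."""
    counts = {}
    for sentence, label in annotated:
        for word in sentence.lower().split():
            counts.setdefault(word, Counter())[label] += 1
    return counts

def classify(sentence, counts):
    """Label a new sentence by summing the label counts of its words."""
    votes = Counter()
    for word in sentence.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None

# Toy annotated corpus (invented for illustration only).
training_data = [
    ("shares rose sharply", "up"),
    ("profits rose again", "up"),
    ("shares fell sharply", "down"),
]
model = train(training_data)
```

Moving the rule-based extractor to a new domain means writing new patterns by hand, whereas the trained classifier only needs a newly annotated corpus, which is the trade-off described above.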
Although IE systems built for different tasks often differ from one another, nearly every extraction system shares the same core elements, whether it follows the Knowledge Engineering or the Automatic Training approach. Some of these core elements, such as the parser and the part-of-speech (POS) tagger, are tuned for optimal performance on a specific domain or on text with a pre-defined structure. The goal of this thesis is to develop an algorithm that can extract information unambiguously from free text, in our case financial news, and from arbitrary domains. We believe that corpus linguistics and statistical techniques are more appropriate and efficient for this task than approaches that rely on machine learning, POS taggers, parsers, and similar components tuned to a pre-defined domain. Based on this belief, a framework that uses corpus linguistics and statistical techniques to extract information unambiguously from arbitrary domains was developed. A contrastive evaluation was carried out not only on financial texts and movie reviews, but also on multilingual texts (Chinese and English). The results are encouraging.

