On distributing the analysis process of a broad-coverage unification-based grammar of spanish


Author

Marimon Felipe, Montserrat

Director

Bel Rafecas, Núria

Codirector

Theofilidis, Axel

Tutor

Martín Rioja, Josep Andreu

Date of defense

2003-03-28

ISBN

8468835226

Legal Deposit

B.44817-2003



Department/Institute

Universitat Politècnica de Catalunya. Institut de Ciències de l'Educació

Abstract

This thesis describes research into the development and deployment of engineered large-scale unification-based grammar to provide more robust and efficient deep grammatical analysis of linguistic expressions in real-world applications, while maintaining the accuracy of the grammar (i.e. percentage of input sentences that receive the correct analysis) and keeping its precision up to a reasonable level (i.e. percentage of input sentences that received no superfluous analysis).<br/><br/>In tacking the efficiency problem, our approach has been to prune the search space of the parser by integrating shallow and deep processing. We propose and implement a NLP system which integrates a Part-of-Speech (PoS) tagger and chunker as a pre-processing module of broad-coverage nification-based grammar of Spanish. This allows us to release the arser from certain tasks that may be efficiently and reliably dealt with by these computationally less expensive processing techniques. On the one hand, by integrating the morpho-syntactic information delivered by the PoS tagger, we reduce the number of morpho-syntactic ambiguities of the linguistic expression to be analyzed. On the other hand, by integrating chunk mark-ups delivered by the partial parser, we do notonly avoid generating irrelevant constituents which are not to contribute to the final parse tree, but we also provide part of the structure that the analysis component has to compute, thus, avoiding a duplication of efforts.<br/><br/>In addition, we want our system to be able to maintain the accuracy of the high-level grammar. In the integrated architecture we propose, we keep the ambiguities which can not be reliably solved by the PoS tagger to be dealt with by the linguistic components of the grammar performing deep analysis.<br/><br/>Besides improving the efficiency of the overall analysis process and maintaining the accuracy of the grammar, our system provides both structural and lexical robustness to the high-level processing. Structural robustness is obtained by integrating into the linguistic components of the high-level grammar the structures which have already been parsed by the chunker such that they do not need to be re-built by phrase structure rules. This allows us to extend the coverage of the grammar to deal with very low frequent constructions whose treatment would increase drastically the parsing search space and would create spurious ambiguity. To provide lexical robustness to the system, we have implemented default lexical entries. Default lexical entries are lexical entry templates that are activated when the system can not find a particular lexical entry to apply. Here, the integration of the tagger, which supplies the PoS information to the linguistic processing modules of our system, allows us to increase robustness while avoiding increase in morphological ambiguity. Better precision is achieved by extending the PoS tags of our external lexicon so that they include syntactic information, for instance subcategorization information.

Keywords

lingüística computacional; processament del llenguatge natural

Subjects

004 - Computer science; 81 - Linguistics and languages

Knowledge Area

5701. Lingüística Aplicada

Documents

TESI.pdf

2.150Mb

 

Rights

ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs.

This item appears in the following Collection(s)