On distributing the analysis process of a broad-coverage unification-based grammar of Spanish

Marimon Felipe, Montserrat

doi:https://dx.doi.org/10.5821/dissertation-2117-93245

Author

Marimon Felipe, Montserrat

Supervisor

Bel Rafecas, Núria

Co-supervisor

Theofilidis, Axel

Tutor

Martín Rioja, Josep Andreu

Defense date

2003-03-28

ISBN

8468835226

Legal deposit

B.44817-2003

Department/Institute

Universitat Politècnica de Catalunya. Institut de Ciències de l'Educació

Doctoral programme

DOCTORAT EN FORMALITZACIÓ DEL LLENGUATGE

Abstract

This thesis describes research into the development and deployment of an engineered large-scale unification-based grammar to provide more robust and efficient deep grammatical analysis of linguistic expressions in real-world applications, while maintaining the accuracy of the grammar (i.e. the percentage of input sentences that receive the correct analysis) and keeping its precision at a reasonable level (i.e. the percentage of input sentences that receive no superfluous analysis). In tackling the efficiency problem, our approach has been to prune the search space of the parser by integrating shallow and deep processing. We propose and implement an NLP system which integrates a Part-of-Speech (PoS) tagger and a chunker as a pre-processing module of a broad-coverage unification-based grammar of Spanish. This allows us to release the parser from certain tasks that may be efficiently and reliably dealt with by these computationally less expensive processing techniques. On the one hand, by integrating the morpho-syntactic information delivered by the PoS tagger, we reduce the number of morpho-syntactic ambiguities of the linguistic expression to be analyzed. On the other hand, by integrating chunk mark-ups delivered by the partial parser, we not only avoid generating irrelevant constituents which do not contribute to the final parse tree, but we also provide part of the structure that the analysis component has to compute, thus avoiding a duplication of effort. In addition, we want our system to be able to maintain the accuracy of the high-level grammar. In the integrated architecture we propose, the ambiguities which cannot be reliably resolved by the PoS tagger are kept, to be dealt with by the linguistic components of the grammar performing deep analysis. Besides improving the efficiency of the overall analysis process and maintaining the accuracy of the grammar, our system provides both structural and lexical robustness to the high-level processing.
Structural robustness is obtained by integrating into the linguistic components of the high-level grammar the structures which have already been parsed by the chunker, so that they do not need to be re-built by phrase structure rules. This allows us to extend the coverage of the grammar to deal with very low-frequency constructions whose treatment would drastically increase the parsing search space and would create spurious ambiguity. To provide lexical robustness to the system, we have implemented default lexical entries. Default lexical entries are lexical entry templates that are activated when the system cannot find a particular lexical entry to apply. Here, the integration of the tagger, which supplies the PoS information to the linguistic processing modules of our system, allows us to increase robustness while avoiding an increase in morphological ambiguity. Better precision is achieved by extending the PoS tags of our external lexicon so that they include syntactic information, for instance subcategorization information.
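
Illustrative sketch (not taken from the thesis): the abstract's idea of tagger-driven lexical filtering with default lexical entries can be pictured with the toy Python code below. Everything in it is an assumption made for illustration only: the `LexicalEntry` class, the `LEXICON` contents, the `POS:subcat` tag format and the `lookup` function are hypothetical and do not reflect the actual system or its grammar formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    form: str
    pos: str
    subcat: str  # subcategorization frame; "none" when not applicable


# Hypothetical toy lexicon: one ambiguous Spanish form.
LEXICON = {
    "casa": [
        LexicalEntry("casa", "NOUN", "none"),
        LexicalEntry("casa", "VERB", "trans"),
    ],
}


def split_tag(tag):
    """Split an extended tag such as 'VERB:trans' into (pos, subcat)."""
    pos, _, subcat = tag.partition(":")
    return pos, subcat or "none"


def lookup(form, tags):
    """Return the lexical entries compatible with the tagger output.

    `tags` is the set of tags the tagger could not rule out; ambiguity
    it cannot resolve reliably is passed on to the deep grammar.
    """
    allowed = {split_tag(t)[0] for t in tags}
    known = [e for e in LEXICON.get(form, []) if e.pos in allowed]
    if known:
        return known
    # No usable entry: activate one default entry (template) per tag,
    # so unknown words do not multiply morphological ambiguity.
    return [LexicalEntry(form, *split_tag(t)) for t in sorted(tags)]


if __name__ == "__main__":
    print(lookup("casa", {"NOUN"}))          # tagger prunes the verb reading
    print(lookup("tuitea", {"VERB:trans"}))  # unknown word, default entry
```

In this sketch the default entry inherits its subcategorization frame from the extended tag, which mirrors, in a very simplified way, the abstract's remark that enriching PoS tags with syntactic information improves precision.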

Keywords

computational linguistics; natural language processing

Subjects

004 - Computer science; 81 - Linguistics and languages

Knowledge area

5701. Applied Linguistics

Recommended citation

This citation was generated automatically.

Documents

TESI.pdf

2.150Mb

Rights

NOTICE. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for personal consultation or study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Consolidated Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In any case, when using its contents, the full name of the author and the title of the doctoral thesis must be clearly indicated. Reproduction or other forms of for-profit exploitation are not authorized, nor is public communication from a site outside the TDX service. Presenting its content in a window or frame outside TDX (framing) is also not authorized. This reservation of rights covers both the contents of the thesis and its abstracts and indexes.

This item appears in the following collection(s)

Programa de Doctorat en Formalització del Llenguatge [1]
