Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees

Màrquez, Lluís

Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees

dc.contributor

Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics

dc.contributor.author

Màrquez, Lluís

dc.date.accessioned

2011-04-12T15:20:49Z

dc.date.available

2009-10-05

dc.date.issued

1999-07-21

dc.date.submitted

2009-07-23

dc.identifier.isbn

9788469270387

dc.identifier.uri

http://www.tdx.cat/TDX-0723109-124136

dc.identifier.uri

http://hdl.handle.net/10803/6663

dc.description.abstract

The study and application of general Machine Learning (ML) algorithms to theclassical ambiguity problems in the area of Natural Language Processing (NLP) isa currently very active area of research. This trend is sometimes called NaturalLanguage Learning. Within this framework, the present work explores the applicationof a concrete machine-learning technique, namely decision-tree induction, toa very basic NLP problem, namely part-of-speech disambiguation (POS tagging).Its main contributions fall in the NLP field, while topics appearing are addressedfrom the artificial intelligence perspective, rather from a linguistic point of view.A relevant property of the system we propose is the clear separation betweenthe acquisition of the language model and its application within a concrete disambiguationalgorithm, with the aim of constructing two components which are asindependent as possible. Such an approach has many advantages. For instance, thelanguage models obtained can be easily adapted into previously existing taggingformalisms; the two modules can be improved and extended separately; etc.As a first step, we have experimentally proven that decision trees (DT) providea flexible (by allowing a rich feature representation), efficient and compact wayfor acquiring, representing and accessing the information about POS ambiguities.In addition to that, DTs provide proper estimations of conditional probabilities fortags and words in their particular contexts. Additional machine learning techniques,based on the combination of classifiers, have been applied to address some particularweaknesses of our tree-based approach, and to further improve the accuracy in themost difficult cases.As a second step, the acquired models have been used to construct simple,accurate and effective taggers, based on diiferent paradigms. In particular, wepresent three different taggers that include the tree-based models: RTT, STT, andRELAX, which have shown different properties regarding speed, flexibility, accuracy,etc. The idea is that the particular user needs and environment will define whichis the most appropriate tagger in each situation. Although we have observed slightdifferences, the accuracy results for the three taggers, tested on the WSJ test benchcorpus, are uniformly very high, and, if not better, they are at least as good asthose of a number of current taggers based on automatic acquisition (a qualitativecomparison with the most relevant current work is also reported.Additionally, our approach has been adapted to annotate a general Spanishcorpus, with the particular limitation of learning from small training sets. A newtechnique, based on tagger combination and bootstrapping, has been proposed toaddress this problem and to improve accuracy. Experimental results showed thatvery high accuracy is possible for Spanish tagging, with a relatively low manualeffort. Additionally, the success in this real application has confirmed the validity of our approach, and the validity of the previously presented portability argumentin favour of automatically acquired taggers.

eng

dc.format.mimetype

application/pdf

dc.language.iso

eng

dc.publisher

Universitat Politècnica de Catalunya

dc.rights.license

ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs.

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

processament del llenguatge natural

dc.subject

intel·ligència artificial

dc.subject

aprenenetatge automàtic

dc.title

Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.subject.udc

004

cat

dc.contributor.director

Rodríguez Hontoria, Horacio

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.identifier.doi

https://dx.doi.org/10.5821/dissertation-2117-93974

dc.identifier.dl

B.45333-2009

dc.description.degree

DOCTORAT EN INTEL·LIGÈNCIA ARTIFICIAL (Pla 1998)

Documentos

TLMV1de2.pdf

9.360Mb PDF

TLMV2de2.pdf

5.032Mb PDF

Este ítem aparece en la(s) siguiente(s) colección(ones)

Programa de Doctorat en Intel·ligència Artificial [75]

Àrea de contingut