In-depth exploration of the syntactic capabilities of autoencoding language models for downstream applications

Pérez-Mayos, Laura

dc.contributor

Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions

dc.contributor.author

Pérez-Mayos, Laura

dc.date.accessioned

2022-06-28T11:47:04Z

dc.date.available

2022-06-28T11:47:04Z

dc.date.issued

2022-06-15

dc.identifier.uri

http://hdl.handle.net/10803/674651

dc.description.abstract

Pretrained Transformer-based language models have quickly replaced traditional approaches to modeling NLP tasks, pushing the state of the art to new levels, and will certainly continue to be very influential in the years to come. In this thesis, we offer an extensive empirical comparison of the morpho-syntactic capabilities of pretrained Transformer-based autoencoding models. We analyse the syntactic generalisation abilities of several widely used pretrained models, comparing them along two dimensions: (1) language: monolingual (English and Spanish) and multilingual models; and (2) pretraining objectives: masked language modeling and next sentence prediction. We complement the analysis with a study of the impact of pretraining data size on the syntactic generalisation abilities of the models and on their performance on different downstream tasks. Finally, we investigate how the syntactic knowledge encoded in the models evolves during fine-tuning on different morpho-syntactic and semantics-related downstream tasks.

en_US

dc.description.abstract

Els models de llenguatge preentrenats basats en Transformer han reemplaçat ràpidament els models tradicionals de Processat del Llenguatge Natural, fent avançar l'estat de l'art a nous nivells, i de ben segur continuaran sent molt influents durant els propers anys. En aquesta tesi presentem una extensa comparativa empírica de les capacitats morfosintàctiques de models de llenguatge preentrenats basats en Transformer de tipus autoencoding. Analitzem les capacitats de generalització sintàctica de diferents models que es fan servir habitualment, comparant-los en base a: (1) llenguatge: models monolingües (anglès i castellà) i multilingües; i (2) objectius d'entrenament: modelat del llenguatge amb màscares i predicció de la següent frase. Per complementar la comparativa, estudiem l'impacte del volum de les dades d'entrenament en les habilitats de generalització sintàctica dels models i el seu rendiment en diverses tasques. Finalment, investiguem com el coneixement sintàctic codificat als models evoluciona durant el seu entrenament en diverses tasques sintàctiques i semàntiques.

en_US

dc.format.extent

160 p.

en_US

dc.format.mimetype

application/pdf

dc.language.iso

eng

en_US

dc.publisher

Universitat Pompeu Fabra

dc.rights.license

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-sa/4.0/

dc.rights.uri

http://creativecommons.org/licenses/by-sa/4.0/

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

Pretrained language models

en_US

dc.subject

Transformer

en_US

dc.subject

BERT

en_US

dc.subject

Syntax

en_US

dc.subject

Syntactic knowledge

en_US

dc.subject

Contextual embeddings

en_US

dc.subject

Models de llenguatge preentrenats

en_US

dc.subject

Sintaxi

en_US

dc.subject

Coneixement sintàctic

en_US

dc.subject

Representacions contextuals

en_US

dc.title

In-depth exploration of the syntactic capabilities of autoencoding language models for downstream applications

en_US

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.subject.udc

en_US

dc.contributor.authoremail

lpmayos@gmail.com

en_US

dc.contributor.director

Wanner, Leo

dc.contributor.director

Ballesteros, Miguel

dc.embargo.terms

none

en_US

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.description.degree

Programa de doctorat en Tecnologies de la Informació i les Comunicacions

Documents

tlpm.pdf

5.017Mb PDF
