Re-thinking large scale hate speech identification: beyond common NLP conventions and supervised machine learning

Teixeira Fortuna, Paula Cristina

dc.contributor

Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions

dc.contributor.author

Teixeira Fortuna, Paula Cristina

dc.date.accessioned

2023-04-26T12:09:35Z

dc.date.available

2023-04-26T12:09:35Z

dc.date.issued

2023-03-06

dc.identifier.uri

http://hdl.handle.net/10803/688156

dc.description.abstract

The detection of hate speech in online spaces is traditionally conceptualized as a classification task that uses Machine Learning (ML)-driven Natural Language Processing (NLP) techniques. In accordance with this conceptualization, the hate speech detection task relies upon common conventions and practices in Artificial Intelligence, ML and NLP, among them the interpretation of inter-annotator agreement as a measure of dataset quality and the use of benchmarks and standard metrics such as precision, recall or accuracy to assess model performance. However, hate speech is a highly subjective and context-dependent notion that eludes such static and disembodied practices. Their application results in definitional challenges and in the failure of models to generalize across different datasets, two problems that I analyse in empirical studies. Furthermore, I critically reflect on the methodologies followed. I argue that many conventions in NLP are poorly suited to the problem and suggest developing methods that are more appropriate for fighting online hate speech.

dc.description.abstract

Addressing hate speech in online spaces has been conceptualized as a classification task that uses artificial intelligence (AI), machine learning (ML) or natural language processing (NLP) techniques. Under this conceptualization, the hate speech detection task has relied on the common conventions and practices of these fields. For example, inter-annotator agreement is treated as a way of measuring dataset quality, and certain metrics and benchmarks are used to infer model performance. However, hate speech is a deeply complex and situated concept that eludes these static and disembodied practices. In this thesis I examine the definitional challenges and the difficulties of model generalization, two problems that I analyse with empirical studies. In addition, I critically reflect on the methodologies followed, argue that many conventions in NLP are poorly suited to the problem, and encourage researchers to develop more appropriate methods for combating online hate speech.

dc.format.extent

127 p.

dc.language.iso

eng

dc.publisher

Universitat Pompeu Fabra

dc.rights.license

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/4.0/

dc.rights.uri

http://creativecommons.org/licenses/by-nc-nd/4.0/

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

Hate speech detection

dc.subject

Machine learning conventions

dc.subject

Algorithmic challenges

dc.subject

Detecció de discurs d’odi

dc.subject

Convencions d’aprenentatge automàtic

dc.subject

Reptes algorítmics

dc.title

Re-thinking large scale hate speech identification: beyond common NLP conventions and supervised machine learning

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.subject.udc

dc.contributor.authoremail

paulatfortuna@gmail.com

dc.contributor.director

Wanner, Leo

dc.contributor.director

Soler Company, Juan

dc.embargo.terms

none

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.description.degree

Programa de doctorat en Tecnologies de la Informació i les Comunicacions

Documents

tpctf.pdf

903.7Kb PDF
