Representation learning for music classification and retrieval: bridging the gap between natural language and music semantics

Won, Minz

dc.contributor

Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions

dc.contributor.author

Won, Minz

dc.date.accessioned

2022-07-06T12:30:06Z

dc.date.available

2022-07-06T12:30:06Z

dc.date.issued

2022-07-01

dc.identifier.uri

http://hdl.handle.net/10803/674722

dc.description.abstract

The explosion of digital music has dramatically changed our music consumption behavior. Massive digital music libraries are now available through streaming platforms. Since the amount of information available to an individual listener has increased greatly, it is nearly impossible for them to go through the entire catalog exhaustively. As a result, we need robust knowledge management systems more than ever. Recent advances in deep learning have enabled data-driven music representation learning for classification and retrieval. However, there is still a gap between machine-learned representations and the human understanding of music. This dissertation aims to reduce this semantic gap in order to assist listeners' behavior around music information with advanced algorithmic support. To this end, we tackle three main challenges in representation learning: model architecture design, scalability, and multimodality. Firstly, we carefully review previous deep representation models and propose new architectures that improve the representation in qualitative and quantitative ways. The newly proposed models are more flexible, interpretable, and powerful than previous ones. Secondly, training schemes beyond supervised learning are explored as a way to achieve scalable research. Transfer learning, semi-supervised learning, and self-supervised learning approaches are addressed in detail; transfer learning and semi-supervised methods are applied to enhance music representation learning. Finally, metric learning is proposed as a way to bridge music audio representation and natural language semantics, forming a multimodal embedding space. This facilitates music retrieval using arbitrary tags beyond a fixed vocabulary, and makes it possible to match music to text stories based on mood. Although our work focuses on bridging music and natural language semantics, we believe the proposed approaches generalize to other modalities. All implementation details of this thesis are available and open-source for reproducibility. The knowledge gained throughout this thesis has been put into practice and grounded in research internships and collaborations with multiple industry partners.

en_US

dc.description.abstract

The explosion of digital music has revolutionized the way we consume music. Online music platforms put so much information and content within reach of their users that it is practically impossible to explore their catalogs exhaustively. Therefore, now more than ever, robust knowledge management systems must continue to be developed. Advances in deep learning in recent years have enabled methods for the automatic learning of music representations and their application to classification and search tasks. However, there is still a gap between these automatically learned representations and the human understanding of music. The aim of this thesis is to reduce this “semantic gap” in order to offer algorithmic help to listeners when they engage with music information. To this end, we address three problems in representation learning: model architecture design, scalability, and multimodality. First, we analyze previous deep representation models in detail and propose new architectures that improve the representations qualitatively and quantitatively, yielding more powerful, flexible, and interpretable models. Next, to achieve better scalability, we investigate training procedures beyond supervised learning. We present transfer, semi-supervised, and self-supervised learning in detail, and apply transfer and semi-supervised learning as a way to strengthen music representation learning. Finally, we propose metric learning as a way to reconcile music audio representations and natural language semantics, yielding a multimodal embedding space. This facilitates music retrieval using arbitrary descriptors instead of fixed vocabularies, and makes it possible to automatically assign music to a story based on its mood context. Although our research focuses on reconciling music and natural language semantics, we believe the proposed method can be generalized to other modalities. All implementation details of this thesis are available as open source to allow its reproduction. The knowledge acquired throughout this thesis has been put into practice through collaborations with industry and research internships.

en_US

dc.format.extent

180 p.

en_US

dc.format.mimetype

application/pdf

dc.language.iso

eng

en_US

dc.publisher

Universitat Pompeu Fabra

dc.rights.license

Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc-sa/4.0/

dc.rights.uri

http://creativecommons.org/licenses/by-nc-sa/4.0/

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

Music representation learning

en_US

dc.subject

Music classification

en_US

dc.subject

Multimodality

en_US

dc.subject

Cross-modal retrieval

en_US

dc.subject

Aprenentatge automàtic de representacions musicals

en_US

dc.subject

Classificació musical

en_US

dc.subject

Multimodalitat

en_US

dc.subject

Recuperació transmodal

en_US

dc.title

Representation learning for music classification and retrieval: bridging the gap between natural language and music semantics

en_US

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.subject.udc

en_US

dc.contributor.authoremail

minz.won@upf.edu

en_US

dc.contributor.director

Serra, Xavier

dc.contributor.director

Saggion, Horacio

dc.embargo.terms

cap

en_US

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.description.degree

Programa de doctorat en Tecnologies de la Informació i les Comunicacions

Documents

tmw.pdf

5.256Mb PDF

This item appears in the following Collection(s)

Programa de Doctorat en Tecnologies de la Informació i les Comunicacions [401]