Resource management techniques aware of interference among high-performance computing applications

dc.contributor
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.author
Jokanović, Ana
dc.date.accessioned
2015-01-13T09:40:52Z
dc.date.available
2015-01-13T09:40:52Z
dc.date.issued
2014-12-19
dc.identifier.uri
http://hdl.handle.net/10803/284934
dc.description.abstract
Network interference from nearby jobs has recently been identified as the dominant cause of the high performance variability of parallel applications running on High Performance Computing (HPC) systems. Typically, HPC systems are dynamic, with multiple jobs arriving and leaving unpredictably while simultaneously sharing the system interconnection network. In such an environment, contention for network resources causes random stalls in application execution, degrading application performance. Eliminating interactions between jobs is the key to guaranteeing both high performance and performance predictability of applications. These interactions are determined by the job's location in the system. When a job arrives in the system, resource managers allocate it computing and network resources. Based on the job's size requirements, the job scheduler finds a set of available computing nodes. In addition, the subnet manager determines the allocation of network resources such as paths between nodes, virtual lanes, and link bandwidth. Typically, resource managers focus mainly on increasing resource utilization while neglecting job interactions. In this thesis, we propose techniques for both the job scheduler and the subnet manager that mitigate job interactions: 1) a job scheduling policy that reduces node fragmentation in the system, and 2) a quality-of-service (QoS) policy based on a characterization of each job's network load; this policy relies on the virtual lanes mechanism provided by modern interconnection networks (e.g. InfiniBand). To evaluate our job scheduling policy, we use a simulator developed for this thesis that takes as input the job scheduler log from a production HPC system. This simulator performs the node allocation for the jobs in the log. The proposed QoS policy is evaluated using a flit-level network simulator that can replay multiple traces from real executions of MPI applications.
Experimental results show that the proposed job scheduling policy leads to fewer jobs sharing network resources, and thus to fewer job interactions, while the QoS policy effectively reduces the degradation caused by the remaining job interactions. These two software techniques are complementary and could be used together without additional hardware.
eng
dc.format.extent
144 p.
dc.format.mimetype
application/pdf
dc.language.iso
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights.license
Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc-sa/3.0/es/
dc.rights.uri
http://creativecommons.org/licenses/by-nc-sa/3.0/es/
*
dc.source
TDX (Tesis Doctorals en Xarxa)
dc.title
Resource management techniques aware of interference among high-performance computing applications
dc.type
info:eu-repo/semantics/doctoralThesis
dc.type
info:eu-repo/semantics/publishedVersion
dc.subject.udc
004
cat
dc.contributor.director
Labarta Mancho, Jesús
dc.contributor.codirector
Sancho Pitarch, José Carlos
dc.contributor.codirector
Rodríguez Herrera, German
dc.embargo.terms
cap
dc.rights.accessLevel
info:eu-repo/semantics/openAccess
dc.identifier.dl
B 5593-2015


Documents

TAJ1de1.pdf

4.809Mb PDF
