Fine-grained bit-flip protection for relaxation methods
This resource is restricted.
https://doi.org/10.1016/j.jocs.2016.11.013
Metadata
Title
Fine-grained bit-flip protection for relaxation methods
Date
2019-09
Publisher
Elsevier
Type
info:eu-repo/semantics/article
Publisher version
https://www.sciencedirect.com/science/article/pii/S1877750316303891
Version
info:eu-repo/semantics/publishedVersion
Abstract
Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips and rejects some component updates, turning the initially synchronized solver into an asynchronous iteration. Our experimental study, using sparse incomplete factorizations from a collection of real-world applications and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
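The abstract describes the idea only at a high level and does not state the detection criterion. The following is a minimal Python sketch, under stated assumptions, of what a component-wise protected Jacobi sweep can look like: each freshly computed component is checked before it is written, and a rejected component simply keeps its previous value, so it lags one sweep behind the others, much as in an asynchronous iteration. The rejection test (a non-finite value, or a jump larger than `threshold` times `1 + |x_i|`), the helper names `flip_bit`, `jacobi_sweep` and `solve`, and the dense list-of-lists matrix are illustrative assumptions. They are not the paper's detection technique or its GPU implementation.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def jacobi_sweep(A, b, x, inject=None, threshold=1e3):
    """One Jacobi sweep with a hypothetical component-wise update filter.

    A dense list-of-lists matrix is used only to keep the sketch short.
    `inject(i, value)` optionally corrupts a freshly computed component to
    simulate a bit-flip. Returns the new iterate and the rejected indices.
    """
    n = len(b)
    new_x = list(x)
    rejected = []
    for i in range(n):
        # Jacobi: every component is computed from the previous iterate x.
        s = b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
        candidate = s / A[i][i]
        if inject is not None:
            candidate = inject(i, candidate)
        # Assumed check: reject non-finite values and implausibly large jumps.
        if not math.isfinite(candidate) or abs(candidate - x[i]) > threshold * (1.0 + abs(x[i])):
            rejected.append(i)  # keep the old value; component i skips this sweep
        else:
            new_x[i] = candidate
    return new_x, rejected


def solve(A, b, sweeps=50, fault_sweep=None, fault_index=0, fault_bit=62):
    """Run `sweeps` protected Jacobi sweeps, optionally flipping one bit once."""
    x = [0.0] * len(b)
    total_rejected = 0
    for k in range(sweeps):
        inject = None
        if k == fault_sweep:
            # Corrupt only component `fault_index`, only during sweep `fault_sweep`.
            inject = lambda i, v: flip_bit(v, fault_bit) if i == fault_index else v
        x, rejected = jacobi_sweep(A, b, x, inject)
        total_rejected += len(rejected)
    return x, total_rejected


if __name__ == "__main__":
    # Strictly diagonally dominant toy system with exact solution [1, 1, 1].
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]
    x, rejected = solve(A, b, fault_sweep=5)
    print(x, rejected)
```

In this toy run, the flipped exponent bit turns one component update into an enormous value, the filter discards it, and the iteration still converges to the solution. A tighter or looser `threshold` trades missed detections against unnecessary rejections, and every rejection delays convergence somewhat.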
Research projects
U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA; MINECO and FEDER (project CICYT TIN2014-53495-R).
Rights
© 2016 Elsevier B.V. All rights reserved.
http://rightsstatements.org/vocab/InC/1.0/
info:eu-repo/semantics/restrictedAccess
This item appears in the following collection(s):
- ICC_Articles [424]