Robust Class Parallelism - Error Resilient Parallel Inference with Low Communication Cost
Article 2020 en
Authors
Yaoqing Yang
Jichan Chung
Guanhua Wang
Abstract
Model parallelism is a standard paradigm to decouple a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single floating-point number per machine per iteration. However, traditional fault-tolerance schemes, when applied to class parallelism, require storing the entire model on the hard disk. Thus, these schemes are not suitable for soft and frequent system noise such as stragglers (temporarily slow worker machines). In this paper, we propose an erasure-coding based redundant computing technique called robust class parallelism to improve the error resilience of model parallelism. We show that by introducing slight overhead in the computation at each machine, we can obtain robustness to soft system noise while maintaining the low communication overhead in class parallelism. More importantly, we show that on standard classification tasks, robust class parallelism maintains the state-of-the-art performance.
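The erasure-coding idea can be illustrated with a small sketch. It is not the authors' construction, which the abstract does not specify. It assumes, hypothetically, that each of k class workers emits one scalar score, that n - k redundant workers emit exact linear combinations of those scores (here taken from a Cauchy matrix), and that a straggler is treated as an erasure. The names `make_generator`, `encode`, and `decode` are illustrative only.

```python
import numpy as np

def make_generator(k, n):
    """Build an n x k systematic generator matrix (assumed construction).

    The first k rows are the identity (the ordinary class workers).
    The remaining n - k rows form a Cauchy matrix, so any k rows of the
    stacked matrix are linearly independent.
    """
    r = n - k
    x = np.arange(1, r + 1, dtype=float)   # one node per redundant worker
    y = np.arange(k, dtype=float) + 0.5    # one node per class worker
    parity = 1.0 / (x[:, None] + y[None, :])
    return np.vstack([np.eye(k), parity])

def encode(scores, G):
    """Scalar output of each of the n workers, one float per machine."""
    return G @ scores

def decode(received, alive, G):
    """Recover the k class scores from the workers that responded."""
    k = G.shape[1]
    alive = np.asarray(alive)
    if alive.size < k:
        raise ValueError("need at least k responding workers to decode")
    # Least squares reduces to an exact solve when the outputs are noise-free.
    est, *_ = np.linalg.lstsq(G[alive], received[alive], rcond=None)
    return est

# Hypothetical example: 4 classes, 6 workers, workers 1 and 3 straggle.
scores = np.array([0.1, 2.3, -0.7, 1.5])
G = make_generator(4, 6)
outputs = encode(scores, G)
alive = [0, 2, 4, 5]
recovered = decode(outputs, alive, G)
predicted_class = int(np.argmax(recovered))
```

In this sketch each machine still sends a single floating-point number, and the prediction survives the loss of up to n - k workers. With real sub-networks the redundant outputs would only approximate such linear combinations, so the recovery would be approximate rather than exact.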