This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning task. BIDMach makes full use of both CPU and GPU acceleration (through a sister library BID-Mat), and requires only modest hardware (commodity GPUs). One of the chal-lenges of reaching this level of performance is the allocation barrier. While it is simple and expedient to allocate and recycle matrix (or graph) objects in ex-pressions, this approach is too slow to match the arithmetic throughput possible on either GPUs or CPUs. In this paper we describe a caching approach that al-lows code with complex matrix (graph) expressions to run at massive scale, i.e. multi-terabyte data, with zero memory allocation after initial start-up. We present a number of new benchmarks that leverage this approach. 1
Discussion(0)
No comments yet. Be the first to comment.