Serving ML prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process.
Wan Jinyu, Zheng Sun, Zhang Xiang, Bai Yu, Tsai Chengying, Paul Kim Ho Chu, Senlin Huang, Yi Jiao, Leng Yongbin, Biaobin Li, Jingyi Li, Nan Li, Lu Xiaohan, Meng Cai, Peng Yuemei, Sheng Wang, Chengyi Zhang