Paper ID | IMT-CIF-2.9
Paper Title | BEYOND FLOPS IN LOW-RANK COMPRESSION OF NEURAL NETWORKS: OPTIMIZING DEVICE-SPECIFIC INFERENCE RUNTIME
Authors | Yerlan Idelbayev, Miguel Á. Carreira-Perpiñán, University of California, Merced, United States
Session | IMT-CIF-2: Computational Imaging 2
Location | Area I
Session Time | Wednesday, 22 September, 14:30 - 16:00
Presentation Time | Wednesday, 22 September, 14:30 - 16:00
Presentation | Poster
Topic | Computational Imaging Methods and Models: Sparse and Low Rank Models
Abstract | Neural network compression has become an important practical step when deploying trained models. We consider the problem of low-rank compression of neural networks with the goal of optimizing the measured inference time. Given a neural network and a target device to run it on, we want to find the matrix ranks and the weight values of the compressed model so that the network runs as fast as possible on the device while achieving the best task performance (e.g., classification accuracy). This is a hard optimization problem involving weights, ranks, and device constraints. To tackle it, we first build a simple yet accurate model of the on-device runtime that requires only a few measurements. We then give a suitable formulation of the optimization problem involving the proposed runtime model and solve it using alternating optimization. We validate our approach on various neural networks and show that, by using our estimated runtime model, we achieve better task performance than FLOPs-based methods under the same runtime budget on the actual device.
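The abstract describes the runtime model only at a high level (a simple model fit from a few on-device measurements). The sketch below is a minimal illustration of how such a device-specific model could be fit and used, assuming, purely for illustration, a per-layer runtime that is affine in the retained rank and fit by least squares; the function names, the affine form, and the example numbers are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): fit a per-layer runtime model
# t(r) ~= a*r + b from a handful of on-device timings, then sum the per-layer
# estimates to predict the runtime of a candidate rank assignment.
import numpy as np

def fit_layer_runtime_model(ranks_measured, times_measured):
    """Least-squares fit of t(r) ~= a*r + b for one layer from (rank, time) pairs."""
    A = np.stack([np.asarray(ranks_measured, dtype=float),
                  np.ones(len(ranks_measured))], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(times_measured, dtype=float), rcond=None)
    a, b = coeffs
    return a, b

def estimate_network_runtime(layer_models, rank_assignment):
    """Sum the per-layer runtime estimates for a candidate rank assignment."""
    return sum(a * r + b for (a, b), r in zip(layer_models, rank_assignment))

# Example: two layers, each timed at a few ranks on the device (made-up numbers,
# e.g. milliseconds from a hypothetical benchmarking routine).
layer_models = [
    fit_layer_runtime_model([8, 32, 128], [0.11, 0.35, 1.40]),
    fit_layer_runtime_model([8, 32, 128], [0.05, 0.18, 0.70]),
]
print(estimate_network_runtime(layer_models, [64, 16]))  # estimated total runtime
```

An estimate of this kind can stand in for a FLOPs count inside the rank-selection objective, which is the substitution the abstract argues for when targeting measured runtime on a specific device.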