case study
Published: 20 March 2023

Efficient audio-based CNNs via filter pruning

Dr Arshdeep Singh, a machine learning researcher in sound, works with Professor Mark D Plumbley on the “AI for sound” (AI4S) project within the Centre for Vision, Speech and Signal Processing (CVSSP). Their work focuses on designing efficient and sustainable artificial intelligence and machine learning (AI-ML) models.

The issue

Recent work in artificial intelligence (AI) relies on convolutional neural networks (CNNs) [1, 2], which perform remarkably well compared with other existing methods. However, the large size and high computational cost of CNNs are a bottleneck to deploying them on resource-constrained devices such as smartphones. Moreover, training CNNs for many hours increases CO2 emissions. For instance, a computing device (NVIDIA GPU RTX-2080 Ti) used to train CNNs for 48 hours generates CO2 equivalent to that emitted by an average car driven for 13 miles. We estimated these emissions with an openly available tool [Link-1].
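
The kind of estimate described above can be sketched as energy used multiplied by the carbon intensity of the electricity supply. The sketch below is only an illustration of that arithmetic: the function names and the three constants (GPU power draw, grid carbon intensity and per-mile car emissions) are assumptions chosen for the example, not values taken from the tool in [Link-1]. Only the 48-hour training time comes from the text.

```python
def training_emissions_kg(power_watts, hours, kg_co2_per_kwh):
    """Estimate CO2 (kg) as energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * kg_co2_per_kwh


def equivalent_car_miles(emissions_kg, kg_co2_per_mile):
    """Convert a CO2 mass into the distance a car would drive to emit it."""
    return emissions_kg / kg_co2_per_mile


# Assumed inputs (illustrative only, not taken from the tool in [Link-1]).
GPU_POWER_WATTS = 250.0   # assumed sustained board power of the GPU
GRID_KG_PER_KWH = 0.432   # assumed carbon intensity of the electricity supply
CAR_KG_PER_MILE = 0.4     # assumed emissions of an average car per mile

emissions = training_emissions_kg(GPU_POWER_WATTS, 48, GRID_KG_PER_KWH)
miles = equivalent_car_miles(emissions, CAR_KG_PER_MILE)
print(f"{emissions:.2f} kg CO2, about {miles:.1f} car miles")
```

With these assumed inputs the sketch lands close to the 13 miles quoted above, but the actual tool may use different inputs and a more detailed model.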

Therefore, we aimed to compress CNNs:

  1. To reduce the computational complexity for faster inference.
  2. To reduce memory footprints for using underlying resources effectively.
  3. To reduce the number of computations during training by determining how many training examples are sufficient when fine-tuning the compressed CNNs to achieve performance similar to that of the uncompressed CNNs trained on all examples.

The solution

One way to compress CNNs is “pruning”, where unimportant filters are explicitly removed from the original network to build a compact, pruned network. After pruning, the pruned network is fine-tuned to recover the lost performance. This study proposes a cosine distance-based greedy algorithm [3] that prunes similar filters in filter space, applied to openly available CNNs designed for audio scene classification [Link-2]. We further improve the efficiency of the proposed algorithm [3] by reducing the computation time of pruning [4].
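
To make the idea of pruning similar filters more concrete, here is a minimal sketch of a greedy, cosine distance-based selection for a single convolutional layer. It is not the published implementation from [3] or [Link-3]. It assumes each filter has already been flattened into a list of weights, and the rule for choosing which filter of a similar pair to drop (the one with the smaller L1 norm) is a hypothetical choice made only for this example. The function names are also illustrative.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two flattened filters."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an all-zero filter is treated as dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def greedy_similarity_prune(filters, num_to_prune):
    """Return the sorted indices of the filters to keep in one layer."""
    n = len(filters)
    if not 0 <= num_to_prune < n:
        raise ValueError("num_to_prune must be in the range [0, len(filters))")
    # Compute all pairwise distances once and visit the closest pairs first.
    pairs = sorted(
        (cosine_distance(filters[i], filters[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    pruned = set()
    for _, i, j in pairs:
        if len(pruned) == num_to_prune:
            break
        if i in pruned or j in pruned:
            continue  # one filter of this pair has already been removed
        # Hypothetical tie-break: drop the filter with the smaller L1 norm.
        l1_i = sum(abs(a) for a in filters[i])
        l1_j = sum(abs(a) for a in filters[j])
        pruned.add(i if l1_i < l1_j else j)
    return [k for k in range(n) if k not in pruned]


# Toy layer with four flattened filters; filters 0 and 1 are nearly parallel.
toy_filters = [[1.0, 0.0, 0.0], [2.0, 0.01, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(greedy_similarity_prune(toy_filters, 1))  # prints [1, 2, 3]
```

In a real network, removing a filter from one layer also removes the matching input channel from the next layer, and the pruned network is then fine-tuned as described above.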

The outcome

We find that the proposed pruning method reduces the number of computations per inference by 27% and the memory requirement by 25%, with less than a 1% drop in accuracy. When fine-tuning the pruned CNNs, using 25% fewer training examples gives performance similar to that obtained using all examples. We have made the proposed algorithm openly available [Link-3] for reproducibility and provide a video presentation [Link-4] explaining the methodology and results of our published work [3].

In addition, we reduce the computation time of the proposed pruning method by a factor of three without degrading performance [4, Link-5].
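
Reductions in computations and memory such as those reported above are typically measured by counting multiply-accumulate operations (MACs) and parameters before and after pruning. The sketch below shows that bookkeeping for a plain stack of 2D convolutional layers, ignoring biases. The function name and the toy layer sizes are hypothetical and do not describe the networks evaluated in [3], so the percentages it prints are not the published results.

```python
def conv_stack_cost(in_channels, out_channels_per_layer, kernel_size, output_sizes):
    """Return (parameters, MACs) for a stack of 2D conv layers, ignoring biases."""
    params = 0
    macs = 0
    c_in = in_channels
    for c_out, (height, width) in zip(out_channels_per_layer, output_sizes):
        layer_params = c_out * c_in * kernel_size * kernel_size
        params += layer_params
        # Every output position applies every filter weight exactly once.
        macs += layer_params * height * width
        c_in = c_out  # pruned filters also shrink the next layer's input
    return params, macs


# Hypothetical toy network: two 3x3 conv layers on a single-channel input.
sizes = [(32, 32), (16, 16)]
base_params, base_macs = conv_stack_cost(1, [16, 32], 3, sizes)
pruned_params, pruned_macs = conv_stack_cost(1, [12, 24], 3, sizes)

print(f"parameter reduction: {100 * (1 - pruned_params / base_params):.1f}%")
print(f"MAC reduction: {100 * (1 - pruned_macs / base_macs):.1f}%")
```

Because removing a filter shrinks both its own layer and the input of the following layer, the savings grow faster than the fraction of filters removed.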

Open research practices/URL links

The proposed work uses the following Open Research practices:

See the corresponding poster


[1] Q. Kong et al., “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.

[2] I. Martín-Morató et al., “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in DCASE Workshop, pp. 85–89, 2021.

[3] A. Singh and M. D. Plumbley, “A passive similarity-based CNN filter pruning for efficient acoustic scene classification,” in INTERSPEECH, pp. 2433–2437, 2022.

[4] A. Singh and M. D. Plumbley, “Efficient similarity-based passive filter pruning for compressing CNNs,” accepted for ICASSP 2023.

Contact details

Arshdeep Singh (1) and Mark D. Plumbley (2)

1: Department of Computer Science and Electrical Engineering, University of Surrey, UK

2: EPSRC Fellow in the “AI for sound” project, Professor of Signal Processing, University of Surrey, UK

Contact information (Arshdeep Singh)

Lead author job title: Research Fellow A

Lead author faculty: Faculty of Engineering and Physical Sciences

Lead author email:

Lead author ORCID: