Harmonic Convolution

Deep Audio Priors Emerge from Harmonic Convolutional Networks

Zhoutong Zhang 1    Yunyun Wang 1, 2    Chuang Gan 3    Jiajun Wu 1, 4, 5
Joshua B. Tenenbaum 1    Antonio Torralba 1    William T. Freeman 1, 5

1 MIT CSAIL 2 IIIS, Tsinghua University 3 MIT-IBM Watson Lab 4 Stanford University 5 Google Research
[Video: fitting comparison of U-Net [1], Wave-U-Net [2], and Harmonic Convolution against the target signal]

Replacing the convolution operation in U-Net [1] with our Harmonic Convolution introduces an inductive bias that helps the network capture audio signal priors. The video above shows the fitting process for a toy signal (100 + 200 + 300 Hz sinusoids with background noise), using white noise as input.
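For concreteness, the toy target used in this fitting demo can be synthesized in a few lines. This is a minimal NumPy sketch; the sample rate, duration, and noise level are our own illustrative choices, not values from the paper:

```python
import numpy as np

def toy_signal(sr=1000, duration=1.0, noise_std=0.1, seed=0):
    """Sum of 100/200/300 Hz sinusoids plus Gaussian background noise."""
    t = np.arange(int(sr * duration)) / sr
    clean = sum(np.sin(2 * np.pi * f * t) for f in (100, 200, 300))
    noise = np.random.default_rng(seed).normal(0.0, noise_std, t.shape)
    return clean + noise

x = toy_signal()
# With a 1 s window at sr = 1000, the FFT has 1 Hz resolution,
# so the three tones land exactly on bins 100, 200, and 300.
mag = np.abs(np.fft.rfft(x))
```

In the deep-prior setup, a network is then trained to map a fixed white-noise input to this noisy signal; the harmonic components are fit early in optimization, before the background noise.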


Convolutional neural networks (CNNs) excel in image recognition and generation. Among many efforts to explain their effectiveness, experiments show that CNNs carry strong inductive biases that capture natural image priors. Do deep networks also have inductive biases for audio signals? In this paper, we empirically show that current network architectures for audio processing show little evidence of capturing such priors. We propose Harmonic Convolution, an operation that helps deep networks model priors in audio signals by explicitly exploiting harmonic structure. This is done by supporting the kernels on sets of harmonic series instead of on local neighborhoods, as convolutional kernels are. We show that networks using Harmonic Convolution can reliably model audio priors and achieve high performance on unsupervised audio restoration. With Harmonic Convolution, they also achieve better generalization on supervised musical source separation.


We introduce an operation called Harmonic Convolution, which defines the support of convolution kernels on harmonic series. An illustration is shown below:

A standard convolution kernel is supported on a local neighborhood, as shown in (a). To exploit harmonic structure, Harmonic Convolution instead defines the kernel support on harmonic series, as shown in (b), (c), and (d). An additional hyperparameter, called the anchor, specifies which order of harmonic the output location is interpreted as. In (b), anchor = 1 indicates that the output frequency is interpreted as the base frequency. Consequently, when anchor = 2, as shown in (c), the output frequency is interpreted as the second-order harmonic. In (d), the output is interpreted as the fourth harmonic.
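To make the kernel-support idea concrete, here is a simplified single-channel NumPy sketch. It convolves over frequency only, uses nearest-bin rounding for the harmonic locations, and omits the time axis, multiple channels, and the interpolation details of the full operation; the function name and signature are our own, not from the released code:

```python
import numpy as np

def harmonic_conv_1d(spec, weights, anchor=1):
    """Simplified Harmonic Convolution over the frequency axis.

    spec    : (F, T) magnitude spectrogram.
    weights : length-K kernel; weights[k-1] multiplies the k-th harmonic.
    anchor  : output bin f is interpreted as the `anchor`-th harmonic,
              so the implied base frequency is f / anchor.
    """
    F, T = spec.shape
    out = np.zeros_like(spec)
    for f in range(1, F):                 # skip the DC bin
        base = f / anchor                 # implied fundamental for this output bin
        for k, w in enumerate(weights, start=1):
            idx = int(round(k * base))    # nearest bin of the k-th harmonic
            if idx < F:
                out[f] += w * spec[idx]
    return out

# A spectrum with energy at bins 2, 4, 6 (a harmonic stack on base bin 2):
spec = np.zeros((16, 1))
spec[[2, 4, 6], 0] = 1.0
out = harmonic_conv_1d(spec, weights=[1.0, 1.0, 1.0], anchor=1)
# out[2, 0] aggregates the energy at bins 2, 4, and 6
```

With anchor = 1, output bin 2 reads its own harmonic stack (bins 2, 4, 6); with anchor = 2, output bin 4 is interpreted as the second harmonic of base bin 2 and gathers the same stack.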


We show various results on unsupervised music/speech denoising, quantization recovery, and musical instrument separation. Please click on the spectrograms to hear the results. Using headphones is highly recommended. More results can be found here.

[Interactive player: select an experiment type and compare Noisy, Clean, Ours, DNP [3], Wave-U-Net [2], and U-Net [1]]


Deep Audio Priors Emerge from Harmonic Convolutional Networks
Zhoutong Zhang, Yunyun Wang, Chuang Gan, Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba, and William T. Freeman.
ICLR 2020  |   Paper  |   Slides  |   Code (Coming Soon)  |  Supplementary Material

@inproceedings{zhang2020deep,
            title = {Deep Audio Priors Emerge From Harmonic Convolutional Networks},
            author = {Zhang, Zhoutong and Wang, Yunyun and Gan, Chuang and Wu, Jiajun and Tenenbaum, Joshua B. and Torralba, Antonio and Freeman, William T.},
            booktitle = {International Conference on Learning Representations (ICLR)},
            year = {2020}
}


[1] Ronneberger et al., U-Net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.

[2] Stoller et al., Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. ISMIR, 2018.

[3] Michelashvili et al., Audio Denoising with Deep Network Prior. arXiv, 2018.

[4] Yu and Koltun, Multi-scale context aggregation by dilated convolutions. ICLR, 2016.