TECHNOLOGICAL BASIS OF “INDUSTRY 4.0”

Sequence-level knowledge distillation for image captioning model compression

Department of Information Technologies, Vilnius Gediminas Technical University, Vilnius, Lithuania

Abstract

Image captioning is one of the most important tasks at the intersection of natural language processing (NLP) and computer vision (CV). Many papers are dedicated to improving the quality of image captioning models. However, compressing such models so that they can run on mobile devices remains underexplored. Moreover, knowledge distillation, an important technique that is widely used for model compression, is mentioned in almost none of these papers. To fill this gap, we applied the most efficient knowledge distillation approaches to several state-of-the-art image captioning architectures.

