NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement (2404.05669v1)

Published 8 Apr 2024 in cs.CV

Abstract: Real-world documents may suffer various forms of degradation, often resulting in lower accuracy in optical character recognition (OCR) systems. A preprocessing step that eliminates noise while preserving text and key document features is therefore essential. In this paper, we propose NAF-DPM, a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents. While DPMs are recognized for the high quality of their generated images, they are also known for their long inference times. To mitigate this problem, we equip the DPM with an efficient nonlinear activation-free (NAF) network and employ as sampler a fast ordinary differential equation (ODE) solver, which can converge in a few iterations. To better preserve text characters, we introduce an additional differentiable module based on convolutional recurrent neural networks that simulates the behavior of an OCR system during training. Experiments conducted on various datasets showcase the superiority of our approach, achieving state-of-the-art performance on pixel-level and perceptual similarity metrics. Furthermore, the results demonstrate a notable reduction in the character errors made by OCR systems when transcribing real-world document images enhanced by our framework. Code and pre-trained models are available at https://github.com/ispamm/NAF-DPM.
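
The abstract describes three moving parts: a conditional DPM that maps a degraded scan toward its clean counterpart, an activation-free denoising network, and a differentiable CRNN that stands in for an OCR engine so that character-level errors can be penalized during training. Below is a minimal PyTorch sketch of how such a combined objective could be wired together. Every module size, class name, vocabulary, and the lam weighting are illustrative assumptions for this sketch, not the paper's actual configuration; the authors' real architecture and losses are in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGate(nn.Module):
    # Activation-free nonlinearity in the NAFNet style: split the channel
    # dimension in half and multiply the halves elementwise.
    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return a * b

class TinyNAFDenoiser(nn.Module):
    # Toy stand-in for the NAF denoising network. It receives the noisy
    # target concatenated with the degraded input (both grayscale) and
    # predicts the noise. Timestep conditioning is omitted for brevity;
    # a real model would embed t and inject it into every block.
    def __init__(self, ch=32):
        super().__init__()
        self.inp = nn.Conv2d(2, ch, 3, padding=1)
        self.mid = nn.Conv2d(ch, 2 * ch, 3, padding=1)
        self.gate = SimpleGate()
        self.out = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x_noisy, cond, t):
        h = self.inp(torch.cat([x_noisy, cond], dim=1))
        h = self.gate(self.mid(h))
        return self.out(h)

class TinyCRNN(nn.Module):
    # Differentiable OCR surrogate: conv features -> BiLSTM -> per-column
    # log-probabilities, suitable for a CTC loss. The 37-class vocabulary
    # (blank + digits + letters) is an assumption for this sketch.
    def __init__(self, n_classes=37):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the height axis
        )
        self.rnn = nn.LSTM(64, 64, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, img):
        f = self.conv(img).squeeze(2).transpose(1, 2)  # (B, W, 64)
        f, _ = self.rnn(f)
        return self.fc(f).log_softmax(-1)              # (B, W, classes)

def training_step(denoiser, crnn, clean, degraded, labels, label_lens,
                  alphas_cumprod, lam=0.1):
    # Combined objective: standard noise-prediction loss plus a CTC loss
    # computed on the one-step estimate of the clean image, so gradients
    # from the OCR surrogate reach the denoiser.
    B = clean.size(0)
    t = torch.randint(0, len(alphas_cumprod), (B,), device=clean.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(clean)
    x_t = a.sqrt() * clean + (1 - a).sqrt() * noise

    eps_hat = denoiser(x_t, degraded, t)
    l_diffusion = F.mse_loss(eps_hat, noise)

    # One-step x0 estimate from the predicted noise (the usual DDPM identity).
    x0_hat = (x_t - (1 - a).sqrt() * eps_hat) / a.sqrt()
    log_probs = crnn(x0_hat.clamp(-1, 1)).transpose(0, 1)  # (W, B, classes)
    in_lens = torch.full((B,), log_probs.size(0), dtype=torch.long)
    l_ocr = F.ctc_loss(log_probs, labels, in_lens, label_lens, blank=0)

    return l_diffusion + lam * l_ocr

The fast sampler is not shown: at inference time, the ODE-solver sampler described in the abstract replaces the long ancestral reverse chain and converges in a few iterations, which is where the bulk of the speedup over a vanilla DPM comes from.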

