Shotit: compute-efficient image-to-video search engine for the cloud (2404.12169v1)
Abstract: With the rapid growth of information technology, users are exposed to massive amounts of online data, including images, music, and video. This has created a strong need for effective corresponding search services, such as image, music, and video search. Most of these services are keyword-based: users enter keywords to find related images, music, and videos. There are also image-to-image search services that let users find similar images from a single input image. Because videos are essentially sequences of image frames, similar videos can likewise be searched with a single input image or screenshot. This paper targets that scenario and provides an efficient method and implementation. We present Shotit, a cloud-native image-to-video search engine tailored to this search scenario with a compute-efficient approach. One main limitation in this scenario is the scale of the dataset. A typical image-to-image search engine handles only one-to-one relationships; colloquially, one image corresponds to another single image. Image-to-video search, by contrast, multiplies the data: a 24-minute video, for example, generates roughly 20,000 image frames. As the number of videos grows, the dataset size explodes. A compute-efficient approach is therefore needed, and the system design should cater to the cloud-native trend. By choosing an emerging technology, the vector database, as its backbone, Shotit meets both requirements well. Experiments on two datasets, a 50-thousand-scale Blender Open Movie dataset and a 50-million-scale proprietary TV genre dataset, run on a 4-core, 32 GB RAM Intel Xeon Gold 6271C cloud machine with object storage, demonstrate the effectiveness of Shotit. A demo on the Blender Open Movie dataset is illustrated in this paper.
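The scale argument and the search idea in the abstract can be illustrated with a small sketch. It is not Shotit's implementation: the helper names (`frame_count`, `l2`, `search`), the toy three-dimensional vectors, and the brute-force scan standing in for a vector database are all assumptions made for illustration. The abstract does not state a frame rate, so the sketch only derives the rate implied by "roughly 20,000 frames in 24 minutes" and compares it with a 24 fps assumption.

```python
from math import sqrt

def frame_count(minutes: float, fps: float) -> int:
    """Frames produced by a video of the given length at the given frame rate."""
    return round(minutes * 60 * fps)

def l2(a, b) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index, query, top_k=1):
    """Brute-force nearest-neighbour search (a stand-in for a vector database).

    `index` is a list of (video_id, second, vector) entries, one per frame.
    Returns the (video_id, second) pairs of the top_k closest frames.
    """
    ranked = sorted(index, key=lambda item: l2(item[2], query))
    return [(video_id, second) for video_id, second, _ in ranked[:top_k]]

# Frame rate implied by the abstract's figure (about 13.9 fps).
implied_fps = 20_000 / (24 * 60)

# Under an assumed 24 fps, the same video would yield 34,560 frames.
frames_at_24fps = frame_count(24, 24)

# Hypothetical toy index: each frame vector remembers its source video and time.
index = [
    ("video_a", 0.0, [0.9, 0.1, 0.0]),
    ("video_a", 1.0, [0.8, 0.2, 0.1]),
    ("video_b", 0.0, [0.1, 0.9, 0.3]),
    ("video_b", 1.0, [0.0, 0.8, 0.5]),
]

# A query screenshot, already converted to a feature vector (hypothetical values).
best_match = search(index, [0.05, 0.85, 0.35])
```

Because every indexed frame carries its video identifier and timestamp, a single nearest-neighbour hit is enough to point back to a video and a position in it. The linear scan above costs one distance computation per indexed frame, which is why a dataset of tens of millions of frames calls for an index structure such as those a vector database provides.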