Selecting the Best Fit Software Programming Languages: Using BERT for File Format Detection

Authors

  • Jize Xiong, Computer Information Technology, Northern Arizona University, Flagstaff, USA
  • Chufeng Jiang, Computer Science, The University of Texas at Austin, Fremont, USA
  • Zhiming Zhao, Computer Science, East China University of Science and Technology, Shanghai, China
  • Yuxin Qiao, Computer Information Technology, Northern Arizona University, Flagstaff, USA
  • Ning Zhang, Computer Science, University of Birmingham, Dubai, United Arab Emirates
  • Mingyang Feng, Computer Information Technology, Northern Arizona University, Flagstaff, USA
  • Xiaosong Wang, Computer Network Technology, Xuzhou University of Technology, Xuzhou, China

DOI:

https://doi.org/10.53469/jtpes.2024.04(06).03

Keywords:

Programming Language Classification, Multi-class Classification, Bidirectional Encoder Representations from Transformers (BERT)

Abstract

The detection and classification of programming languages and file formats are crucial in contexts such as software analysis, code management, and cybersecurity. Despite significant research effort, existing methods often struggle with the diversity and complexity of modern programming environments. This paper addresses these challenges with a BERT-based multi-class classification model that assigns input text to one of 21 programming languages. The method uses BERT’s contextual language representations to capture the syntactic and semantic patterns that distinguish programming languages, thereby improving detection accuracy and flexibility. We review related work, including feature-based methods, structural and content-based techniques, and deep learning applications. While these methods have achieved notable success, they frequently lack generalizability and adaptability to new and evolving file types and languages. Our BERT-based model addresses these limitations by offering a scalable and robust solution for programming language classification, and it demonstrates superior performance across diverse datasets. This research contributes a more versatile and accurate framework for programming language and file format detection, applicable to various real-world scenarios.
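To make the multi-class step concrete, the following is a minimal sketch of a generic classification head: a linear layer followed by softmax and argmax over 21 classes. It is not the authors' implementation. The abstract does not describe the architecture or list the 21 languages, so the label names, the toy hidden size, the random weights, and the stand-in for the encoder output are all assumptions made for illustration.

```python
import math
import random

NUM_CLASSES = 21  # the abstract states 21 target languages; their names are not listed
HIDDEN_SIZE = 8   # toy size chosen for illustration only

# Hypothetical placeholder labels, since the source does not enumerate the languages.
LABELS = [f"language_{i}" for i in range(NUM_CLASSES)]


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Apply a linear layer to a pooled encoder vector, then softmax and argmax."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Stand-in for the encoder output. In a real model this would be the encoder's
# pooled representation of the input text; here it is random numbers (assumption).
rng = random.Random(0)
pooled_vector = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

label, probs = classify(pooled_vector, weights, bias)
print(label, round(sum(probs), 6))
```

In a trained model the weights and bias would be learned jointly with the encoder during fine-tuning, and the predicted class is simply the one with the highest softmax probability.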

Published

2024-07-08

How to Cite

Xiong, J., Jiang, C., Zhao, Z., Qiao, Y., Zhang, N., Feng, M., & Wang, X. (2024). Selecting the Best Fit Software Programming Languages: Using BERT for File Format Detection. Journal of Theory and Practice of Engineering Science, 4(06), 20–28. https://doi.org/10.53469/jtpes.2024.04(06).03