A Smart Model for Categorization of GitHub Repositories

Keywords

GitHub
Categorization
CNN
Deep Learning
Source Code

How to Cite

Aslam, M. M., Muhammad Farhan, Sana Yaseen, Ahmad Raza, Muhammad Javed Iqbal, & Muhammad Munwar Iqbal. (2022). A Smart Model for Categorization of GitHub Repositories. KIET Journal of Computing and Information Sciences, 6(1), 50-64. https://doi.org/10.51153/kjcis.v6i1.149

Abstract

Several datasets of source code are available on the World Wide Web. Their files are usually grouped into application categories by programming language. Many open-source code repositories are now public on GitHub, where users can upload the source code they develop, distribute it to other users, let other programmers update or change the program over time, and announce specific software applications. This study proposes a machine learning-based model for classifying source code. Machine learning algorithms are needed to train the model and validate its predictions for the required tasks. In the future, related software applications could be categorized across multiple source code bases. The model is trained to organize source code written in multiple programming languages according to the problem it addresses. Training datasets can be obtained from GitHub. The main goal of this study is to explore different solutions for software classification, using source code to confirm the classification of multilingual software. The outcomes reported in the results section are encouraging. The method shows that the source code type can be determined using the criteria described above. The software's high precision and quick response time make it well suited to most realistic applications. It achieved 97 percent accuracy when identifying three programming languages.

