Abstract
The term “Big Data” refers to large volumes of information, usually measured in terabytes or petabytes, and includes both structured and unstructured data. Unstructured data is typically text-heavy but may also contain facts, dates, and numbers. To be used effectively, this unstructured information must be processed and organized, and a taxonomy is considered a powerful way of organizing it. Various techniques for automatic taxonomy generation have been proposed in the past. However, the volume of big data now exceeds the processing capabilities of these traditional techniques. Meeting this challenge requires an extensible and scalable technique that accelerates taxonomy generation and its evolution as new data arrives, and thereby caters to large amounts of unstructured big data. This paper proposes a technique for both taxonomy generation and evolution on the Apache Spark infrastructure. The proposed technique is evaluated on a text dataset from the computing domain. The results show that it outperforms existing techniques in terms of time and quality metrics, and that the use of a MapReduce environment resolves the scalability issues of the current taxonomy generation and evolution process.