Technical Report
Systems Engineering Methods Business and Analytics
-
Systems Engineering and Systems Management Transformation
Report Number: SERC-2019-TR-016
Publication Date: 2019-11-15
Project:
Systems Engineering Business and Analytics
Principal Investigators:
Dr. K. P. Subbalakshmi
Co-Principal Investigators:
Dr. William Rouse
In recent times, data has emerged as the most valuable commodity, earning the moniker of “the new oil”. The flip side of this abundance is that analyzing and understanding the data has become an increasingly complicated task. Under these circumstances, it becomes important to develop knowledge discovery tools that can digest these vast amounts of data into actionable intelligence. Such tools can significantly reduce the time it takes to solve problems of interest to ARDEC and its customers.
The goal of this project is to develop and compare natural language processing (NLP) based tools that extract keyphrases from scientific literature of interest to CCDC. Keyphrase extraction is an important first step in several downstream NLP tasks, including summarization, opinion mining, and trend analysis.
We start with the keyphrases provided by the CCDC subject matter expert (SME):
- Additive technology
- Chemical additive to solutions
- Cryogenic milling
- Microfluidics
- Nanopowder
- Nanoscale
The project was executed in the following steps:
- Extract meaningful datasets from scientific literature relevant to CCDC
- Extract topics using Latent Dirichlet Allocation (LDA) methods from these datasets
- Using the LDA topics as a base, add keyphrase extraction methods on top to extract keyphrases. Specifically, we used the Rapid Automatic Keyword Extraction (RAKE) algorithm [1] and Position Rank Analysis (PRA) [4].
- Compare LDA-RAKE, LDA-PRA and PRA on the datasets obtained.
- Stress test these algorithms using standard datasets like the NUS dataset.
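To make the keyphrase extraction step more concrete, the following is a minimal sketch of the basic RAKE scoring idea [1] only: candidate phrases are split at stopwords and punctuation, each word is scored by degree divided by frequency, and a phrase is scored by the sum of its word scores. It does not reproduce the report's LDA-RAKE or LDA-PRA combinations, which are not specified in this summary. The stopword list, function names, and sample sentence are illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; practical RAKE
# implementations use a much larger list).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "of", "on", "or", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Any run of punctuation (except hyphens) is a hard phrase boundary.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Rank candidate phrases by the sum of their word scores (degree / frequency)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample text built from the SME keyphrases listed above.
sample = ("Cryogenic milling of nanopowder. "
          "Cryogenic milling is used for nanoscale additive technology.")
ranked = rake_scores(sample)
```

Because word scores grow with phrase length, plain RAKE tends to favor longer candidates, which is one reason to combine it with other signals such as topic information.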
Conclusion: Several keyphrase extraction methods were compared, using several metrics, on the problem of identifying relevant keyphrases from scientific documents. LDA, a topic modeling mechanism, was used as a component in two of these methods. We proposed two methods, LDA-RAKE and LDA-PRA, that use LDA in different ways with RAKE and PRA respectively. Extensive data collection resulted in the creation of two new datasets for the journals of interest to CCDC. Experiments indicate that LDA-PRA performs well on standard datasets like NUS in terms of precision, recall and F-score. For the specialized CCDC dataset, we find that LDA-PRA does not perform as well as PRA alone. We believe this may be due to the specific nature of the dataset or the number of topics picked by LDA; further experiments will be necessary to determine this. As a by-product of this work, we also identified a subset of papers for the SME’s consideration based only on the keyphrases provided by the SME. Using this method, the workload of the SME can be cut down significantly, since the SME only has to read through a subset of the vast data repository. For example, in some cases, the papers of interest can be whittled down to 4 from the full dataset of 684 papers.
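The paper-filtering by-product can be illustrated with a short sketch. The report does not state here exactly how papers were matched against the SME keyphrases, so the exact-match rule, the `min_hits` threshold, and the toy paper collection below are all assumptions made for illustration; only the keyphrase list comes from the report.

```python
# SME-provided keyphrases, as listed in this report.
SME_KEYPHRASES = [
    "additive technology",
    "chemical additive to solutions",
    "cryogenic milling",
    "microfluidics",
    "nanopowder",
    "nanoscale",
]


def papers_matching(papers, keyphrases, min_hits=1):
    """Return (title, matched phrases) for papers whose extracted
    keyphrases overlap the SME keyphrases at least `min_hits` times.

    `papers` maps a paper title to its list of extracted keyphrases.
    Matching is a case-insensitive exact comparison (an assumption).
    """
    wanted = {k.lower() for k in keyphrases}
    selected = []
    for title, extracted in papers.items():
        hits = sorted(wanted & {k.lower() for k in extracted})
        if len(hits) >= min_hits:
            selected.append((title, hits))
    return selected


# Hypothetical toy collection; titles and keyphrases are invented.
toy_papers = {
    "paper A": ["Cryogenic milling", "ball mill"],
    "paper B": ["graph theory"],
    "paper C": ["nanoscale", "microfluidics"],
}
shortlist = papers_matching(toy_papers, SME_KEYPHRASES)
strict_shortlist = papers_matching(toy_papers, SME_KEYPHRASES, min_hits=2)
```

Raising `min_hits` shrinks the shortlist further, which is one simple way a large repository could be narrowed to a handful of papers for the SME to read.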