Peer-Reviewed

MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec

Received: 15 July 2019     Accepted: 12 August 2019     Published: 28 August 2019
Abstract

Text representation is central to text processing. Scientific papers have pronounced structural features: their internal components (titles, abstracts, keywords, main text, and so on) differ in importance, and external structural features such as topics and authors also carry value for analysis. However, most traditional methods for analyzing scientific papers rely on keyword co-occurrence and citation links, which capture only part of the available information. Little research has addressed the textual information and external structural information of scientific papers, which has limited deeper exploration of their inherent regularities. This paper therefore proposes the Multi-Layers Paragraph Vector (MLPV), a text representation method for scientific papers based on Doc2vec and on both the internal and external structural information of the papers, and constructs five text representation models: PV-NO, PV-TOP, PV-TAKM, MLPV and MLPV-PSO. The results show that the MLPV model performs much better than the PV-NO, PV-TOP and PV-TAKM models. Its average accuracy is higher and more stable, reaching 91.71%, which supports its validity. The optimized MLPV-PSO model, built on the MLPV model, achieves an accuracy 3.33% higher than the MLPV model, which supports the effectiveness of the optimization algorithm.
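The abstract does not spell out how MLPV combines the structural components, so the following is only a minimal sketch of one plausible reading: each internal component (title, abstract, keywords, main text) has its own paragraph vector, and a document vector is formed as a weighted sum. The function names, the toy vectors and the weight values are all assumptions made for illustration; they are not taken from the paper, and in practice the component vectors would come from a trained Doc2vec model.

```python
import math


def combine_components(component_vectors, weights):
    """Weighted sum of per-component vectors, L2-normalized.

    component_vectors: dict mapping a component name to its vector.
    weights: dict mapping a component name to a non-negative weight
             (hypothetical values; the paper's weights are not given here).
    """
    dim = len(next(iter(component_vectors.values())))
    doc = [0.0] * dim
    for name, vec in component_vectors.items():
        w = weights.get(name, 0.0)  # components without a weight are ignored
        for i, x in enumerate(vec):
            doc[i] += w * x
    norm = math.sqrt(sum(x * x for x in doc))
    return [x / norm for x in doc] if norm > 0 else doc


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Toy 2-dimensional vectors standing in for Doc2vec output (assumed data).
paper = {
    "title": [1.0, 0.0],
    "abstract": [0.0, 1.0],
    "keywords": [1.0, 1.0],
    "main_text": [0.5, 0.5],
}
# Hypothetical importance weights for the internal components.
weights = {"title": 0.4, "abstract": 0.3, "keywords": 0.2, "main_text": 0.1}

doc_vector = combine_components(paper, weights)
print(doc_vector, cosine(doc_vector, paper["title"]))
```

Under this reading, an optimizer such as particle swarm optimization could search over the weight values to maximize downstream classification accuracy, which is how the MLPV-PSO variant might relate to MLPV; that link is likewise an assumption based on the model names.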

Published in American Journal of Information Science and Technology (Volume 3, Issue 3)
DOI 10.11648/j.ajist.20190303.12
Page(s) 62-71
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2019. Published by Science Publishing Group

Keywords

MLPV Model, Scientific Papers, Text Representation, Doc2vec, Structural Features

Cite This Article
  • APA Style

    Yonghe Lu, Yuanyuan Zhai, Jiayi Luo, Yongshan Chen. (2019). MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec. American Journal of Information Science and Technology, 3(3), 62-71. https://doi.org/10.11648/j.ajist.20190303.12

    ACS Style

    Yonghe Lu; Yuanyuan Zhai; Jiayi Luo; Yongshan Chen. MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec. Am. J. Inf. Sci. Technol. 2019, 3(3), 62-71. doi: 10.11648/j.ajist.20190303.12

    AMA Style

    Yonghe Lu, Yuanyuan Zhai, Jiayi Luo, Yongshan Chen. MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec. Am J Inf Sci Technol. 2019;3(3):62-71. doi: 10.11648/j.ajist.20190303.12

  • @article{10.11648/j.ajist.20190303.12,
      author = {Yonghe Lu and Yuanyuan Zhai and Jiayi Luo and Yongshan Chen},
      title = {MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec},
      journal = {American Journal of Information Science and Technology},
      volume = {3},
      number = {3},
      pages = {62-71},
      doi = {10.11648/j.ajist.20190303.12},
      url = {https://doi.org/10.11648/j.ajist.20190303.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajist.20190303.12},
     year = {2019}
    }
    

  • TY  - JOUR
    T1  - MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec
    AU  - Yonghe Lu
    AU  - Yuanyuan Zhai
    AU  - Jiayi Luo
    AU  - Yongshan Chen
    Y1  - 2019/08/28
    PY  - 2019
    N1  - https://doi.org/10.11648/j.ajist.20190303.12
    DO  - 10.11648/j.ajist.20190303.12
    T2  - American Journal of Information Science and Technology
    JF  - American Journal of Information Science and Technology
    JO  - American Journal of Information Science and Technology
    SP  - 62
    EP  - 71
    PB  - Science Publishing Group
    SN  - 2640-0588
    UR  - https://doi.org/10.11648/j.ajist.20190303.12
    VL  - 3
    IS  - 3
    ER  - 

Author Information
  • Yonghe Lu, School of Information Management, Sun Yat-sen University, Guangzhou, China

  • Yuanyuan Zhai, School of Information Management, Sun Yat-sen University, Guangzhou, China

  • Jiayi Luo, School of Information Management, Sun Yat-sen University, Guangzhou, China

  • Yongshan Chen, School of Information Management, Sun Yat-sen University, Guangzhou, China
