publication . Conference object . 2019

Automatic Identification and Normalisation of Physical Measurements in Scientific Literature

Foppiano, Luca; Romary, Laurent; Ishii, Masashi; Tanifuji, Mikiko
Open Access English
  • Published: 23 Sep 2019
  • Publisher: HAL CCSD
  • Country: France
Abstract
We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, which aim to interpret unstructured information and make it accessible, are building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine follows a cascade architecture, in which each model is specialised in one specific task. The models are trained with the Conditional Random Field (CRF) algorithm [12] to extract quantities (atomic values, intervals and lists), units (for quantities such as length or weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), which extracts semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science (NIMS) in Japan, it is used in an ongoing project to discover new superconducting materials. Normalised material characteristics (such as critical temperature and pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Article 24, 1–4.
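The normalisation step described in the abstract can be illustrated with a minimal Python sketch. It is not the Grobid-quantities implementation: the conversion table, the word list, and the names `parse_value` and `normalise` are hypothetical and chosen only for illustration. The sketch assumes a value and a unit have already been identified as separate strings, parses the three value representations mentioned above (numeric, alphabetic, scientific notation), and converts the result to an SI unit.

```python
import re
from dataclasses import dataclass

# Hypothetical conversion table: unit symbol -> (SI unit, factor, offset).
# A real system would rely on a full unit library, not a hand-written dict.
SI_CONVERSIONS = {
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "kPa": ("Pa", 1000.0, 0.0),
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
}

# Hypothetical, deliberately tiny lexicon for alphabetic values.
ALPHABETIC = {
    "one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0, "five": 5.0,
    "six": 6.0, "seven": 7.0, "eight": 8.0, "nine": 9.0, "ten": 10.0,
}

# Scientific notation written as "<mantissa> x 10^<exponent>".
SCIENTIFIC = re.compile(
    r"^([+-]?\d+(?:\.\d+)?)\s*[x×]\s*10\s*\^\s*([+-]?\d+)$"
)


def parse_value(raw: str) -> float:
    """Parse a numeric, alphabetic, or scientific-notation value."""
    text = raw.strip().lower()
    if text in ALPHABETIC:
        return ALPHABETIC[text]
    match = SCIENTIFIC.match(text)
    if match:
        return float(match.group(1)) * 10 ** int(match.group(2))
    return float(text)  # raises ValueError for unsupported input


@dataclass
class Normalised:
    value: float
    unit: str


def normalise(raw_value: str, unit: str) -> Normalised:
    """Convert a raw value and unit symbol to the corresponding SI unit."""
    si_unit, factor, offset = SI_CONVERSIONS[unit]  # KeyError if unknown
    return Normalised(parse_value(raw_value) * factor + offset, si_unit)
```

For example, `normalise("2.5", "km")` yields a value of 2500.0 in metres, and `normalise("1.5 x 10^2", "kPa")` yields 150000.0 in pascals. In the system described above, the value and unit strings would come from the CRF models rather than being supplied by hand.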
Fields of Science and Technology classification (FOS)
03 medical and health sciences, 0302 clinical medicine, 030220 oncology & carcinogenesis, 05 social sciences, 0509 other social sciences, 050904 information & library sciences
Subjects
free text keywords: Units of measurements, Physical quantities, Measurements, Text and data mining, Machine Learning, Document analysis, Applied computing, Document metadata, TDM, [INFO]Computer Science [cs], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Scientific literature, Parsing, computer.software_genre, computer, Context (language use), Conditional random field, Materials informatics, Scientific notation, International System of Units, Computer science, Information retrieval, Identification (information)

[1] Milan Agatonovic, Niraj Aswani, Kalina Bontcheva, Hamish Cunningham, Thomas Heitz, Yaoyong Li, Ian Roberts, and Valentin Tablan. 2008. Large-scale, parallel automatic patent annotation. In Proceedings of the 1st ACM workshop on Patent information retrieval. ACM, 1-8.

[2] Skopinava AM and Lobanov BM. 2013. Processing of quantitative expressions with units of measurement in scientific texts as applied to Belarusian and Russian text-to-speech synthesis. (2013).

[3] Hidir Aras, René Hackl-Sommer, Michael Schwantner, and Mustafa Sofean. 2014. Applications and Challenges of Text Mining with Patents. In IPaMin@KONVENS.

[4] Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthélemy, and Mathieu Roche. [n. d.]. How to Extract Unit of Measure in Scientific Documents?

[5] Contributors [n. d.]. Units of Measurement. https://github.com/unitsofmeasurement.

[6] Contributors 2008 - 2019. GROBID (GeneRation Of BIbliographic Data). https://github.com/kermitt2/grobid. swh:1:dir:6a298c1b2008913d62e01e5bc967510500f80710.

[7] André Dazy. 2014. ISTEX: a powerful project for scientific and technical electronic resources archives. Insights 27, 3 (2014).

[8] Thaer M Dieb, Masaharu Yoshioka, Shinjiro Hara, and Marcus C Newton. 2015. Framework for automatic information extraction from research papers on nanocrystal devices. Beilstein journal of nanotechnology 6, 1 (2015), 1872-1882.

[9] Luca Foppiano, M. Dieb Thaer, Akira Suzuki, and Masashi Ishii. 2019. Proposal for Automatic Extraction Framework of Superconductors Related Information from Scientific Literature. In Letters and Technology News, vol. 119, no. 66, SC2019-1 (no.66), Vol. 119. Tsukuba, 1-5. ISSN: 2432-6380.

[10] Kyle Hundman and Chris A Mattmann. 2017. Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

[11] Yanna Shen Kang and Mehmet Kayaalp. 2013. Extracting laboratory test information from biomedical text. Journal of pathology informatics 4 (Aug. 2013), 23-23. https://doi.org/10.4103/2153-3539.117450

[12] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).

[13] Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In International conference on theory and practice of digital libraries. Springer, 473-474.
