Procedural Information Extraction (IE)
Weekly Meetings
Our next weekly meeting will be held from 10:00 to 11:00 U.S. Central time (UTC-05:00) on Thu 1 Apr 2021, in the Zoom channel https://ksu.zoom.us/j/6575247268. The project's Trello board is at https://trello.com/b/YD0ozS8K.
Project Description
Recipe Extraction Project
This research focuses on extracting procedural information, in the form of recipes, from published scientific literature, with application to nanomaterials synthesis. From our overall goal of producing recipes from free text, we derive the technical objectives of a system organized as a pipeline of stages: document acquisition and filtering, payload extraction, recipe step extraction as a relationship extraction task, recipe assembly, and presentation through an information retrieval interface with question answering (QA) functionality. This system meets computational information and knowledge management (CIKM) requirements of metadata-driven payload extraction, named entity extraction, and relationship extraction from text. Results, key novel contributions, and significant open problems derived from this work center on attributing holistic quality measures for the overall system to specific machine learning and inference stages of the pipeline, each with its own performance measures.
Project Pipeline

Figure: Pipeline Stages
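The stages named in the project description can be pictured as functions applied in sequence to a document. The following is a minimal sketch of that composition only, not the project's implementation: the `Document` container, `make_stage`, and `run_pipeline` are hypothetical names, and each stage is a placeholder that merely records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container passed from stage to stage."""
    text: str
    trace: List[str] = field(default_factory=list)


# A stage takes a document and returns a (new) document.
Stage = Callable[[Document], Document]

# Stage names taken from the project description.
STAGE_NAMES = [
    "document acquisition and filtering",
    "payload extraction",
    "recipe step extraction",
    "recipe assembly",
    "presentation (IR interface with QA)",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(doc: Document) -> Document:
        return Document(doc.text, doc.trace + [name])
    return stage


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding its output to the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    Document("raw text of one paper"),
    [make_stage(name) for name in STAGE_NAMES],
)
print(result.trace)
```

Keeping every stage behind the same signature is one way to measure each stage separately, which is what attributing overall quality to individual stages would require.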
Keywords
artificial intelligence, machine learning, natural language processing (NLP), information extraction, computer vision, topic modeling
Methods
Our methods include semi-supervised machine learning for the PDF filtering and payload extraction tasks, followed by structured extraction and data transformation tasks: section extraction, extraction of recipe steps as information tuples, and finally assembly of recipes. Measurable objective criteria for extraction quality include precision and recall of recipe steps, ordering constraints, and QA accuracy, precision, and recall.
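As an illustration of the step-level criteria, precision and recall can be computed by comparing a set of extracted recipe-step tuples against a hand-annotated gold set. This is a sketch under stated assumptions: the `RecipeStep` fields (action, material, condition), the exact-match comparison, and the example steps are all made up for illustration and are not the project's schema or data.

```python
from typing import NamedTuple, Set, Tuple


class RecipeStep(NamedTuple):
    """Hypothetical information tuple for one recipe step."""
    action: str
    material: str
    condition: str


def precision_recall(
    predicted: Set[RecipeStep], gold: Set[RecipeStep]
) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted steps against gold steps."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example data, for illustration only.
gold = {
    RecipeStep("dissolve", "HAuCl4", "in water"),
    RecipeStep("add", "sodium citrate", "at boiling"),
    RecipeStep("stir", "solution", "15 min"),
}
predicted = {
    RecipeStep("dissolve", "HAuCl4", "in water"),
    RecipeStep("add", "sodium citrate", "at boiling"),
    RecipeStep("heat", "solution", "100 C"),
    RecipeStep("cool", "solution", "room temperature"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four predicted steps match the gold set, and two of the three gold steps are recovered. Exact tuple matching is the strictest choice; a partial-match scheme would score near-misses differently.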
Team Members
Current Project Personnel
- Huichen Yang, Project Manager
- Derek Christensen, Member of Technical Staff
- Aneesh Duraiswaamy (Ph.D. program)
- Yihong Theis, Graduate Research Assistant (Ph.D. program)
- Timothy Tucker, Graduate Research Assistant (M.S. program)
- Aidan Harries, Undergraduate Research Programmer
- Tinashe Sekabanja, Undergraduate Research Programmer
- Wesley Baldwin, Undergraduate Research Programmer
Alumni
- Sneha Gullapalli - M.S. 2018
- Alice Lam - B.S. 2018
- Shelby Coen - B.S. 2018
- Carlos Aguirre - B.S. 2019
- Jordan Roth - B.S. 2019
- Maria Fernanda De La Torre - B.S. 2019
- Paula Mendez - B.S. 2019
- Caleb Martin - B.S. 2023
- Mark Ferguson - B.S. 2022
- Rahul Reddy Karne - M.S. 2019
- Yihong Theis - M.S. 2019
- Kyu Seok Lee - M.S. 2020
- Marissa Shivers - M.S. 2022
- Luann Jung - Pre-Undergraduate, summer 2017
- Stephanie Fu - Pre-Undergraduate, summer 2017 - spring 2018
- Bhavin Koirala - Pre-Undergraduate, summer 2018
Data Sets
To be posted.
Source Code
References
Background and Related Work
- Hawizy, Lezan, et al. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics 3.1 (2011): 17.
- Kim, Edward, et al. Machine-learned and codified synthesis parameters of oxide materials. Scientific Data 4 (2017): 170127.
- Kim, Edward, et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chemistry of Materials 29.21 (2017): 9436-9444.
KDD Lab Publications
- Martin, C., Yang, H., & Hsu, W. (2022). KDDIE at SemEval-2022 Task 11: Using DeBERTa for Named Entity Recognition. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, USA, July 14-15, 2022.
- Yang, H. & Hsu, W. (2022). Transformer-based Approach for Document Layout Understanding. In Proceedings of the 29th IEEE International Conference on Image Processing (ICIP 2022), Bordeaux, France, October 16-19, 2022.
- Yang, H., Aguirre, C. A., & Hsu, W. (2022). PIEKM: ML-based Procedural Information Extraction and Knowledge Management System for Materials Science Literature. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations (AACL-IJCNLP 2022), November 20-23, 2022, virtual conference.
- Yang, H. & Hsu, W. H. (2021). Automatic Metadata Information Extraction from Scientific Literature using Deep Neural Networks. In Proceedings of the 14th International Conference on Machine Vision (ICMV 2021), Rome, Italy (virtual conference), November 8-12, 2021.
- Yang, H. & Hsu, W. H. (2021). Named Entity Recognition from Procedural Text on Materials Synthesis using an Attention-Based Approach. Proceedings of the 2021 Workshop on Scientific Document Understanding at the 35th International Conference of the Association for the Advancement of Artificial Intelligence (SDU@AAAI-21), virtual conference.
- Yang, H. & Hsu, W. (2020). Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks. Proceedings of the 25th International Conference on Pattern Recognition (ICPR 2020), Milan, Italy, January 10-15, 2021.
- Yang, H., Aguirre, C. A., De La Torre, M. F., Christensen, D., Bobadilla, L., Davich, E., Roth, J., Luo, L., Theis, Y., Lam, A., Han, T. Y.-J., Buttler, D., & Hsu, W. H. (2019). Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science. Proceedings of the 2nd International Conference on Document Analysis and Recognition Workshop on Open Services and Tools for Document Analysis (ICDAR-OST 2019), Sydney, Australia, September 21, 2019.
- De La Torre, M. F., Aguirre, C. A., Anshutz, B., & Hsu, W. (2018). MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications. Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018): International Conference on Knowledge Discovery and Information Retrieval (KDIR 2018), Seville, Spain, September 18-20, 2018.
- Aguirre, C. A., Coen, S., De La Torre, M. F., Hsu, W. H., & Rys, M. (2018). Towards Faster Annotation Interfaces for Learning to Filter in Information Extraction and Search. Proceedings of the 2nd ACM Intelligent User Interfaces (IUI) Workshop on Exploratory Search and Interactive Data Analytics (ESIDA 2018), Tokyo, Japan, March 11, 2018.
- Aguirre, C. A., Gullapalli, S., De La Torre, M. F., Lam, A., Weese, J. L., & Hsu, W. H. (2017). Learning to Filter Documents for Information Extraction using Rapid Annotation. Proceedings of the 1st International Conference on Machine Learning and Data Science (MLDS 2017), Noida, India, December 14-15, 2017.
Last updated by rotclanny on Aug 20, 2023