About the Natural Language (NL) Division


The Natural Language (NL) division of the KDD Lab focuses on text understanding and multilingual cognitive services, especially multilingual speech recognition and extractive summarization. The division is currently directed by Yihong Theis (Spring 2023 - present). The inaugural division lead was Dr. Huichen Yang (2016 - Spring 2022), now a postdoctoral fellow at CSIRO in Sydney, Australia.

Information Extraction Project

This research focuses on extracting procedural information, in the form of recipes, from published scientific literature, with application to nanomaterials synthesis. From our overall goal of producing recipes from free text, we derive the technical objectives of a system organized as a pipeline of stages: document acquisition and filtering, payload extraction, recipe step extraction as a relationship extraction task, recipe assembly, and presentation through an information retrieval interface with question answering (QA) functionality. The system meets computational information and knowledge management (CIKM) requirements for metadata-driven payload extraction, named entity extraction, and relationship extraction from text. The results, key novel contributions, and significant open problems of this work center on attributing holistic quality measures of the overall system to specific machine learning and inference stages of the pipeline, each of which has its own performance measures.
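The staged structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the keyword filter, the "Methods:" marker, and the regular expression are hypothetical stand-ins for the learned components of the actual system, and the retrieval and QA interface is omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    """One extracted relation: an action applied to a target (hypothetical schema)."""
    action: str
    target: str


# Toy pattern standing in for a trained relationship extraction model.
ACTION_PATTERN = re.compile(
    r"\b(dissolve|heat|stir|wash|dry)\b\s+(?:the\s+)?(\w+)", re.IGNORECASE
)


def acquire_and_filter(documents):
    """Document acquisition and filtering: keep documents that mention synthesis."""
    return [d for d in documents if "synthesis" in d["text"].lower()]


def extract_payload(document):
    """Payload extraction: text after a 'Methods:' marker, or the full text if absent."""
    _, marker, rest = document["text"].partition("Methods:")
    return rest.strip() if marker else document["text"]


def extract_steps(payload):
    """Recipe step extraction: each (action, target) match is one relation."""
    return [
        Step(m.group(1).lower(), m.group(2).lower())
        for m in ACTION_PATTERN.finditer(payload)
    ]


def assemble_recipe(steps):
    """Recipe assembly: number the steps in order of appearance."""
    return [f"{i}. {s.action} {s.target}" for i, s in enumerate(steps, start=1)]


def run_pipeline(documents):
    """Run every stage and return a recipe per retained document id."""
    return {
        d["id"]: assemble_recipe(extract_steps(extract_payload(d)))
        for d in acquire_and_filter(documents)
    }
```

Keeping each stage as a separate function mirrors the attribution problem mentioned above: each stage can be evaluated with its own performance measure before end-to-end quality is assessed.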

Topical Knowledge Map

Topic modeling is a form of unsupervised machine learning in the area of natural language processing. Topic modeling has many applications such as text mining, social media analytics, and information retrieval and discovery. It enables information organization and discovery of knowledge from large amounts of unstructured data.

Our goal is to create and visualize a knowledge map of existing theses and dissertations using K-State's current ETDR collection. We will build a front-end system to automatically extract, process, and visualize scientific papers, clustering them by topic. This will enable other students to discover similar papers in the same "knowledge space" and make them aware of their peers' work.
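As a rough illustration of how documents in the same "knowledge space" might be found, the sketch below ranks documents by TF-IDF cosine similarity. This is a simpler stand-in technique, not a topic model and not the method used in the project; the function names and the sample abstracts in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(abstracts):
    """Map each document id to a sparse {term: weight} TF-IDF vector."""
    tokens = {doc_id: tokenize(text) for doc_id, text in abstracts.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        vectors[doc_id] = {
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(doc_id, vectors):
    """Return the id of the other document closest to doc_id."""
    others = [d for d in vectors if d != doc_id]
    return max(others, key=lambda d: cosine(vectors[doc_id], vectors[d]))
```

For example, given abstracts keyed "t1" ("topic models for text mining"), "t2" ("dynamic topic models of text streams"), and "t3" ("wheat yield under drought stress"), `most_similar("t1", tfidf_vectors(abstracts))` selects "t2", because "t1" shares no terms with "t3".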

We hope to build on this work in the future by implementing several variations of topic models and visualizations, such as hierarchical topic modeling, dynamic topic models, and a version of previous Ph.D. work in our lab that combines a continuous-time dynamic topic model with an online Hierarchical Dirichlet Process model.

Code-Mix Speech Recognition

Speech recognition is the ability to understand human speech and convert it to readable text. Many home assistant products, such as Amazon Alexa, Google Home, and Apple Siri, now help with daily routines. These cognitive service assistants support only monolingual speech. As the world develops and the internet grows, more and more people speak more than one language, and some speak more than three. For example, in a language competition on social media in 2023, a participant used four languages in a single spoken paragraph. It is therefore necessary to study how to train a machine to recognize code-mixed speech.

Code-mixed speech can be divided into two kinds:

  1. Intrasentence: the speaker switches languages within a sentence.
  2. Intersentence: the speaker switches languages from one sentence to the next.
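The two kinds above can be told apart mechanically once each token carries a language tag. The sketch below assumes the tags already exist (for example, from a language identification step that is not shown); the function name, the input layout, and the "monolingual" label are hypothetical choices for illustration.

```python
def classify_code_mixing(tagged_sentences):
    """Classify an utterance from per-token language tags.

    tagged_sentences: a list of sentences, each a list of language codes,
    one code per token, e.g. [["en", "en", "zh"], ["en"]].
    Returns "intrasentence", "intersentence", or "monolingual".
    """
    # Any sentence containing more than one language is an intrasentence switch.
    if any(len(set(sentence)) > 1 for sentence in tagged_sentences):
        return "intrasentence"
    # Otherwise every sentence is monolingual; compare languages across sentences.
    languages = {sentence[0] for sentence in tagged_sentences if sentence}
    return "intersentence" if len(languages) > 1 else "monolingual"
```

In this sketch an utterance that contains both kinds of switching is reported as "intrasentence", since the within-sentence check runs first.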

The goal of this project is to collect natural code-mixed speech data, release it as open source, and train a model that can recognize such speech.

Threat Intelligence

This project aims to identify and discover cyber threats to computing systems in real time by applying machine learning to datasets gathered from various open online sources, such as online social networks, security blogs, and technical forums. The goals of this project are:

  1. Detect cyber threat events in real time.
  2. Help the community deal with new attack techniques and vulnerabilities.
  3. Use open online networks effectively as a source of cyber threat information.
  4. Release this project as an open source tool so that the expert community can also contribute.

List of Large Language Models (LLMs) Maintained by the KDD Lab

This list is under construction. The currently planned list includes EleutherAI GPT-J, Google Flan-T5, Meta Llama 2, EleutherAI Pythia, and TII Falcon. For more information, please see the website for each model. NLP division members and systems staff, please see the home card for KDD Lab LLM installations.

Last updated by vinnysun1 on Nov 30, 2023