Chenghao Xiao

Postgraduate Student in the Department of Computer Science

What do you do?

I am a natural language processing researcher who focuses on information retrieval and cognition-inspired language model pretraining. I care about improving the efficiency and the equity of information acquisition for all with my research. At the moment, I am working on equipping large language models with retrievers that have reasoning abilities, and on learning natural language semantics with computer vision models.

How are you involved in this area of science? 

My research mainly lies in two areas:

1) Training representation models that facilitate accurate information retrieval. In this area, I have proposed training frameworks that achieve state-of-the-art results on standard information retrieval benchmarks. Recently, I proposed a novel benchmark that assesses the reasoning abilities of retrieval models, which has been widely recognised by the information retrieval community.

2) I also work on cognition-inspired language model pretraining. In this area, I proposed a framework that won the BabyLM Challenge, a competition that limits training data to roughly the number of words a human encounters in a lifetime. There, we proposed the Contextualizer training strategy and reached performance comparable to state-of-the-art language models with 1/300 of their training resources. More recently, I have also been working on pretraining computer vision models to learn textual semantics, grounding the learning in more modalities.

What do you love about this topic?

I would specifically like to discuss the first area here. To facilitate information retrieval, we need to train a "representation" model that compresses texts of arbitrary length into a fixed-size vector. The beauty of this compression has never stopped fascinating me. Whether it is the single word "word" or the longest book in the world, we can represent both with vectors of, say, 768 dimensions, and understand them in the same semantic space. This is the most beautiful thing I have seen.
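As a minimal sketch of what this looks like in practice (assuming the open-source sentence-transformers library and its all-mpnet-base-v2 model, which happens to output 768-dimensional vectors; this particular model is only an illustration, not one from my own work), a single word and a long passage can be embedded and then compared in the same space:

    # Minimal illustration: texts of very different lengths are compressed into
    # vectors of the same fixed size and compared in one semantic space.
    # Assumes the open-source sentence-transformers package; the model name is
    # an illustrative choice, not a model from the research described above.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional embeddings

    short_text = "word"
    long_text = (
        "A much longer passage, perhaps an entire chapter of a book, describing "
        "how dense retrieval systems compress meaning into a single vector."
    )

    # Both inputs map to vectors of the same dimensionality, regardless of length.
    embeddings = model.encode([short_text, long_text])
    print(embeddings.shape)  # (2, 768)

    # Because they live in the same space, they can be compared directly,
    # for example with cosine similarity.
    print(float(util.cos_sim(embeddings[0], embeddings[1])))

However long or short the input, the output is a point in the same 768-dimensional space, which is what makes retrieval by vector similarity possible.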

How does this work deliver real-world impact?

I can see both lines of work (information retrieval and low-resource language model pretraining) bringing impact by helping low-resource languages. With a low-resource pretraining framework, languages with very limited available corpora can be understood by a language model. This will help knowledge acquisition in those countries and regions (by facilitating retrieval and recommendation systems for the languages used), and it will help preserve endangered languages by compressing and preserving their knowledge in numerical representations and passing it on to the next generations.

Find out more

Take a look at the Department of Computer Science at Durham, explore their work and discover opportunities to get involved.

