During my MSc dissertation, after leveraging LLMs to identify domain owners by analysing scraped privacy policy texts, a new challenge emerged: the models struggled to accurately identify owners from non-English privacy policy texts, often returning arbitrary names instead. This issue was evident across key evaluation metrics, including accuracy and precision.

My research addresses a critical gap in the field of AI Existential Safety: understanding and improving the safety of large language models (LLMs) in multilingual contexts. Current LLM evaluations are predominantly English-based, leading to a narrow view of these models' safety and capabilities. This project seeks to expand the understanding of LLM safety across languages, exploring how token-based language encoding affects LLM reasoning, alignment, and robustness. The rapidly expanding capabilities of LLMs necessitate rigorous safety evaluations to prevent potential risks to global communities, especially those in non-English-speaking regions. Multilinguality adds complexity to AI safety challenges by creating multiple "versions" of safety that vary between languages. This research is motivated by a desire to build globally safe AI systems that honor diverse cultural norms while resisting adversarial manipulation.
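To make the per-language comparison concrete, here is a minimal sketch of how accuracy and precision could be computed separately for each language of the privacy policy texts. It is not the actual dissertation pipeline: the record fields, the abstention handling, and the exact-match comparison are illustrative assumptions.

```python
# Minimal sketch of the per-language evaluation described above.
# Assumes a list of records with ground-truth owners and model-extracted
# owners; field names and matching rules are illustrative, not the actual pipeline.
from collections import defaultdict

def per_language_metrics(records):
    """records: iterable of dicts with 'lang', 'true_owner', 'predicted_owner'."""
    by_lang = defaultdict(lambda: {"tp": 0, "fp": 0, "total": 0})
    for r in records:
        stats = by_lang[r["lang"]]
        stats["total"] += 1
        if r["predicted_owner"] is None:
            continue  # model abstained: counts against accuracy, not precision
        if r["predicted_owner"].strip().lower() == r["true_owner"].strip().lower():
            stats["tp"] += 1
        else:
            stats["fp"] += 1  # an arbitrary "random name" lands here
    results = {}
    for lang, s in by_lang.items():
        accuracy = s["tp"] / s["total"] if s["total"] else 0.0
        answered = s["tp"] + s["fp"]
        precision = s["tp"] / answered if answered else 0.0
        results[lang] = {"accuracy": accuracy, "precision": precision}
    return results

# Example: compare an English slice against a non-English slice
records = [
    {"lang": "en", "true_owner": "Acme Ltd", "predicted_owner": "Acme Ltd"},
    {"lang": "de", "true_owner": "Beispiel GmbH", "predicted_owner": "Max Mustermann"},
]
print(per_language_metrics(records))
```

A persistent gap between the English and non-English rows of such a table is exactly the kind of evidence that motivates the objectives below.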
Research Objectives
Assess Multilingual Safety: Investigate how well existing LLM safety evaluations perform across languages, hypothesizing that substantial safety and bias variations will emerge (see the sketch after this list).
Develop Cross-Language Safety Interventions: Design methodologies to reinforce safety and reduce vulnerabilities across different linguistic contexts.
Enhance Representation Engineering: Use insights from multilingual variations to advance the creation of a language-agnostic safety framework that could contribute to robust, jailbreak-resistant LLMs.
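As a rough illustration of Objective 1, the sketch below compares refusal rates on the same prompt set translated into several languages. Everything here is an assumption for illustration: query_model stands in for whichever LLM API is eventually used, and the keyword-based refusal heuristic is a deliberately simple placeholder for a proper safety classifier.

```python
# Minimal sketch of a cross-language safety comparison (Objective 1).
# All names are illustrative; `query_model` is a stand-in for a real LLM call,
# and the refusal heuristic is a simple placeholder, not a validated judge.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def refusal_rate(prompts: list[str], query_model: Callable[[str], str]) -> float:
    """Fraction of prompts the model refuses, judged by a keyword heuristic."""
    refusals = sum(
        any(marker in query_model(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return refusals / len(prompts) if prompts else 0.0

def compare_languages(prompt_sets: dict[str, list[str]],
                      query_model: Callable[[str], str]) -> dict[str, float]:
    """prompt_sets maps a language code to the same prompt set, translated."""
    return {lang: refusal_rate(ps, query_model) for lang, ps in prompt_sets.items()}

# Example usage with a stub model; a large gap between languages would support
# the hypothesis that safety behaviour varies by language.
if __name__ == "__main__":
    stub = lambda p: "I'm sorry, I can't help with that."
    print(compare_languages({"en": ["..."], "sw": ["..."]}, stub))
```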
It will be used to fund my studies as a DPhil student at the University of Oxford.
I will work independently, under supervision.
There is no failure criterion: either way, the results will show how different languages in LLMs might affect AI alignment.
Nothing so far.