Arabic AI Breakthrough Could Benefit Millions of Language Speakers Worldwide

Guests attend the Global AI 2020 (Artificial Intelligence) Summit in the Saudi capital Riyadh on October 21, 2020. Photo by FAYEZ NURELDINE/AFP via Getty Images

A group of academics, researchers, and engineers from the United Arab Emirates (UAE) recently unveiled a large language model (LLM) built for Arabic speakers worldwide that, according to its developers, may pave the way for LLMs in other languages that are "underrepresented in mainstream AI."

"Jais," named after the tallest mountain in the United Arab Emirates, was developed through a partnership among Silicon Valley-based Cerebras Systems, the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi, and Inception, a division of UAE-based AI firm G42, according to CNN.

Although ChatGPT, Meta's LLaMA, and other LLMs support Arabic, Timothy Baldwin, acting provost and professor of natural language processing at MBZUAI, claims that they were primarily trained on English data from the internet.

Surpassing "What Anyone Else Has Been Able to Achieve for Arabic"

Instead, Jais made use of English and Arabic datasets, concentrating on Middle Eastern content, enabling it to surpass "what anyone else has been able to achieve for Arabic," according to Baldwin.

Most content on the internet is written in languages that use the Latin alphabet, with English by far the most common. According to Mohammed Soliman, head of the strategic technologies and cybersecurity program at the Middle East Institute in Washington, DC, this means that the largest datasets are in those languages.

English-language datasets used to train language models are typically Western-centric.

Arabic is the sixth most spoken language in the world, and training a language model for it is made harder by its "constellation" of dialects, according to Baldwin.

Local dialects are frequently used on blogs and social media, while Modern Standard Arabic is mainly used for official documents and formal writing. Because it was trained on a variety of material, Jais can typically switch between dialects.

Google's Bard can now also understand queries in Modern Standard Arabic, as well as in more than a dozen Arabic dialects, including Egyptian and Saudi colloquial Arabic.

Jais currently has 13 billion parameters, and a 30-billion-parameter version is under development. By comparison, GPT-3.5, the model behind ChatGPT, has about 175 billion parameters, according to OpenAI. Parameter count alone, however, is not always a measure of a language model's accuracy.

Adherence to UAE Government Regulations

Jais, like other generative AI models, requires instruction tuning to avoid producing "toxic" or "harmful" outputs. It won't produce anything that could result in harm to oneself or others or that suggests addiction. On subjects like drug use and homosexuality, the responses it produces follow regional laws and customs.

MBZUAI conducted "various dialogues" regarding responsible AI with the UAE government and other institutions, which were taken into consideration when creating Jais.

The UAE has increased its efforts to create generative AI systems. In 2017, it became the first nation in the world to appoint a minister of AI. The region's largest generative AI model, Falcon, was unveiled by Abu Dhabi's Advanced Technology Research Council and the Technology Innovation Institute (TII) in March, with a new iteration released in September.

Although it isn't yet available in Arabic, Falcon is more capable than Jais in English, has 180 billion parameters, and outperforms rivals like Meta's LLaMA 2 in reasoning, coding, and knowledge tests, according to TII. Falcon and Jais, in contrast to Google's Bard and ChatGPT, are open-source, which means that anyone can use or modify their code.