Welcome to another edition of AI News.
The Gap Between Open and Proprietary LLMs
"The False Promise of Imitating Proprietary LLMs" by researchers from Berkeley suggests that open-source alternatives can't match proprietary models like ChatGPT in terms of capabilities, regardless of the high ratings received from crowd workers after imitation training. They suggest that focusing on enhancing base capabilities through scaling and pretraining could be more impactful than gathering more imitation data. Read more
Evaluating Factual Precision
"FactScore: Fine-grained atomic evaluation of factual precision in long form text generation" offers an automated way to assess the factual precision of language models. The authors observe that factual error rates are higher when generating content about rare entities or facts mentioned later in the text. Among LLMs, GPT-4 exhibits substantially higher factual accuracy than StableLM. Read more
Gorilla: API calls with LLMs
A collaborative effort between Berkeley and Microsoft has led to the creation of "Gorilla: Large Language Model Connected with Massive APIs". Fine-tuned from a LLaMA base model, Gorilla surpasses GPT-4 at writing correct API calls. Read more
Evaluating Extreme Risks
Model evaluation isn't just about performance; it is also crucial for managing extreme risks. A study led by DeepMind, "Model evaluation for extreme risks", sheds light on how to evaluate the potential extreme dangers posed by advanced models. Read more
Chatbot Arena Leaderboard
In the chatbot domain, the Chatbot Arena leaderboard has seen some interesting updates, with PaLM 2 lagging behind models from OpenAI, Anthropic, and LMSYS Org. There are caveats, however: the PaLM 2 available through the API is likely not the latest version, and it often underperforms because it refuses to answer. Read more
GPT-4 Surpassing RL Algorithms
"SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning" suggests that GPT-4 can significantly outperform Reinforcement Learning (RL)-based approaches in open-ended games such as Crafter. Read more
Making Finetuning More Efficient
The paper "QLoRA: Efficient Finetuning of Quantized LLMs" shows how to dramatically cut memory usage when finetuning LLMs without sacrificing performance: gradients are backpropagated through a frozen, 4-bit-quantized base model into lightweight LoRA adapters, enough to finetune a 65B-parameter model on a single 48GB GPU. Read more
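For readers who want to try this, here is a hedged sketch of a QLoRA-style setup using the transformers, peft, and bitsandbytes stack; the checkpoint name and LoRA hyperparameters are illustrative choices, not the paper's exact recipe.

```python
# Sketch of a 4-bit QLoRA-style finetuning setup (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat, introduced by QLoRA
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
# Freeze the 4-bit base model and train only small LoRA adapters.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```

The memory savings come from holding the base weights in 4 bits while only the tiny adapter matrices receive gradients and optimizer state.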
Improving Factuality through Multiagent Debate
"Improving Factuality and Reasoning in Language Models through Multiagent Debate" explores an innovative method of enhancing factual validity and mathematical reasoning in LLMs by leveraging multiple LLM instances in a debate format. Read more
Hallucinations in LLMs
"How Language Model Hallucinations Can Snowball" investigates how early mistakes made by LLMs can lead to 'hallucination snowballing', causing a cascade of errors in longer responses. Read more
AlignScore: A New Metric for Factual Consistency
Lastly, "AlignScore: A metric for evaluating factual consistency" introduces a unified alignment function to evaluate the factual consistency of generated content. Read more
Quality Diversity through AI Feedback
CarperAI's latest piece, "Quality Diversity through AI Feedback", introduces a new approach to producing diverse, high-quality solutions across a design space. In particular, it proposes using language models to assess both the quality and the diversity characteristics of candidate solutions, as sketched below. The approach is put to the test on generating poetry and movie reviews, offering interesting insights into AI-assisted creativity.
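The mechanics resemble a MAP-Elites-style quality-diversity loop in which language model calls replace hand-written fitness functions and behaviour descriptors. A rough sketch, with `mutate`, `judge_quality`, and `judge_bin` as stand-ins for LM calls (my naming, not CarperAI's):

```python
# Rough sketch of quality-diversity search with AI feedback:
# mutate proposes a variation, judge_quality scores it, and
# judge_bin maps it to a diversity bin. All three stand in for LM calls.
import random

def qd_with_ai_feedback(seed: str, mutate, judge_quality, judge_bin, steps: int = 100):
    archive: dict[str, tuple[float, str]] = {}   # bin -> (best score, solution)
    archive[judge_bin(seed)] = (judge_quality(seed), seed)
    for _ in range(steps):
        # Mutate a random elite and keep the child if it improves its bin.
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        score, bin_ = judge_quality(child), judge_bin(child)
        if bin_ not in archive or score > archive[bin_][0]:
            archive[bin_] = (score, child)
    return archive  # one high-quality solution per diversity bin
```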
Data-constrained Language Models
As models continue to scale, a natural question arises: what happens when we run out of training data? "Scaling Data-constrained Language Models" sets out to answer this by training over 400 models and fitting a new data-constrained scaling law that generalizes the Chinchilla scaling law to repeated data. The study offers intriguing insights into data-usage efficiency and the returns on additional compute.
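As a toy illustration of the qualitative finding (the decay constant here is invented for illustration, not the paper's fitted value): if each additional epoch over the same tokens is worth exponentially less, the effective amount of data saturates as repetitions grow.

```python
# Toy illustration only: repeated epochs contribute exponentially
# decaying "effective" tokens, so effective data saturates.
import math

def effective_tokens(unique_tokens: float, epochs: int, decay: float = 15.0) -> float:
    # Epoch e is worth exp(-(e - 1) / decay) of a fresh epoch.
    return unique_tokens * sum(math.exp(-(e - 1) / decay) for e in range(1, epochs + 1))

print(f"{effective_tokens(1e9, 1):.3g}")    # 1 epoch: full value, 1e+09
print(f"{effective_tokens(1e9, 4):.3g}")    # 4 epochs: a bit under 4e+09
print(f"{effective_tokens(1e9, 40):.3g}")   # 40 epochs: far less than 4e+10
```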
In Other News...
ChatGPT and the Legal Realm
An unusual case saw a lawyer apologize for submitting fake court citations generated by ChatGPT. CNN's report underlines the need for careful AI use and regulation in professional fields.
Tiny Corp Takes on NVIDIA
George Hotz's Tiny Corp is aiming to rival NVIDIA by leveraging the tinygrad stack. More details here.
Open, Active, and Responsible AGI Research
As calls to pause AI research grow louder, Forbes reports on LAION's petition to governments to keep AGI research open, active, and responsible.
AI & Leadership
UK Prime Minister Rishi Sunak met with AI leaders from Anthropic, DeepMind, and OpenAI to discuss future developments. More here.
Nvidia Joins the $1 Trillion Club
Bolstered by booming demand for AI hardware, Nvidia has briefly joined the trillion-dollar valuation club.
Mitigating AI Risk
A single-sentence statement organized by the Center for AI Safety has attracted many signatures from the AI research community. The statement reads:
"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."
I signed the statement.
It is probably fair to say that there remains significant disagreement within the ML/AI community about the level of risk posed by AI.
Useful Resources
The safetensors library, which provides a safe, pickle-free format for storing model weights, has undergone an external security audit commissioned by Hugging Face, Stability AI, and EleutherAI. It will soon become the default serialization format in the transformers library.
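For context, safetensors stores raw tensor data plus a small JSON header, so loading a checkpoint cannot execute arbitrary code the way pickle-based formats can. A quick round-trip example:

```python
# Round-trip a PyTorch state dict through the safetensors format.
import torch
from safetensors.torch import save_file, load_file

weights = {"embedding.weight": torch.randn(10, 4), "lm_head.weight": torch.randn(4, 10)}
save_file(weights, "model.safetensors")     # raw tensors + JSON header, no pickle
restored = load_file("model.safetensors")
assert torch.equal(weights["lm_head.weight"], restored["lm_head.weight"])
```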
OpenAI's Andrej Karpathy gave an insightful overview at Microsoft Build of how GPT models are currently trained. Find the full video here.
Book recommendation
This week, I recommend "The Structure of Scientific Revolutions" by Thomas S. Kuhn. A classic account of how science advances through paradigm shifts, it is a must-read for anyone interested in the dynamics of scientific discovery.
Filtir - fact-checking AI outputs
Lastly, I'm working with colleagues on a project called Filtir with the goal of catching AI hallucinations. If you’re interested in finding out more, we’re on Discord here.
If you prefer video summaries, you can find a video version of the newsletter here: