April 2024's Major Open LLM Releases and a Comparison of PPO and DPO for Alignment



April 2024 brought significant advances in Large Language Models (LLMs). Four major open LLMs were released: Mixtral 8x22B by Mistral AI, Meta AI's Llama 3, Microsoft's Phi-3, and Apple's OpenELM. In addition, new research compared two key methods for aligning LLMs with human preferences and safety standards: Proximal Policy Optimization (PPO), a reinforcement learning algorithm, and Direct Preference Optimization (DPO), a reward-model-free alternative. This blog gives an overview of these releases and examines the implications of the comparative research.

Key LLM Releases

Mixtral 8x22B by Mistral AI

Mixtral 8x22B, developed by Mistral AI, is a sparse mixture-of-experts (MoE) model. In each transformer block, the single feed-forward module is replaced by 8 expert feed-forward networks, and a router sends each token to only a small subset of them (2 experts per token in the Mixtral family). Because only the selected experts run for a given token, the number of active parameters per token is much lower than the total parameter count, which lets the model deliver high performance at a comparatively low inference cost.
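
The routing idea can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral AI's implementation: the experts are hypothetical stand-ins that only scale their input, the router scores are hard-coded rather than learned, and top-2 selection with a softmax over the selected scores is assumed.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    # Hypothetical stand-in for an expert feed-forward network.
    return lambda x: [scale * v for v in x]

def moe_layer(token, router_logits, experts, top_k=2):
    # Select the top_k experts with the highest router scores.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Renormalize the selected scores so the gate weights sum to 1.
    gates = softmax([router_logits[i] for i in top])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, top

experts = [make_expert(s) for s in range(1, 9)]  # 8 toy experts
token = [1.0, -2.0, 0.5]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]  # made-up scores
output, chosen = moe_layer(token, router_logits, experts)
print("experts used:", chosen)
```

Only 2 of the 8 experts do any work for this token, which is why the active parameter count stays well below the total.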

Llama 3 by Meta AI

Meta AI's Llama 3 continues the line of its predecessor, Llama 2, with several enhancements. It was released in 8 billion and 70 billion parameter sizes, with a much larger variant of roughly 400 billion parameters still in development. Llama 3 was trained on 15 trillion tokens, a significantly larger dataset than Llama 2's. The architecture remains similar to Llama 2, with a larger vocabulary and grouped-query attention now also used in the smaller model, and the result is a substantial performance improvement. Training on this much data goes far beyond what the Chinchilla scaling laws would consider compute-optimal for models of this size, and the continued gains show that training well past that point still improves model capabilities.
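
To put the dataset size in perspective, the short calculation below compares Llama 3's 15 trillion tokens with a Chinchilla-style budget. It assumes the commonly quoted rule of thumb of about 20 training tokens per parameter, which is an approximation of the Chinchilla result and not an exact law; the function names are illustrative.

```python
# Rule-of-thumb assumption: about 20 training tokens per model parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20

def tokens_per_param(n_params, n_tokens):
    # How many training tokens the model sees per parameter.
    return n_tokens / n_params

def chinchilla_optimal_tokens(n_params):
    # Approximate compute-optimal token budget under the rule of thumb.
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

llama3_tokens = 15e12  # 15 trillion tokens, as stated above
for name, n_params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    ratio = tokens_per_param(n_params, llama3_tokens)
    multiple = llama3_tokens / chinchilla_optimal_tokens(n_params)
    print(f"{name}: {ratio:.0f} tokens per parameter, "
          f"about {multiple:.0f}x the rule-of-thumb budget")
```

Under this assumption, the 8B model sees far more data per parameter than the 70B model, so it is the one that departs most from the compute-optimal recipe.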

Phi-3 by Microsoft

Phi-3 by Microsoft stands out for its emphasis on data quality over quantity. It was trained on 3.3 trillion tokens, far fewer than Llama 3's 15 trillion. The training set combines heavily filtered web data with synthetic data. Despite the smaller quantity of training data, this careful curation allows Phi-3's small models to outperform some larger competitors on various benchmarks. The curation techniques include selective web scraping, removal of low-quality data, and generation of high-quality synthetic data.

OpenELM by Apple

Apple's OpenELM is designed with mobile device deployment in mind, with models ranging from 270 million to 3 billion parameters. OpenELM uses a layer-wise scaling strategy: instead of giving every transformer layer the same configuration, it varies the layer dimensions, such as the number of attention heads and the feed-forward width, across the model's depth. This allocates the parameter budget more effectively and improves performance without a proportional increase in model size, which is particularly advantageous for mobile and resource-constrained devices. Practical applications include real-time language processing on smartphones and other mobile devices, where computational resources are limited but efficiency and speed are critical.
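
A minimal sketch of layer-wise scaling follows. It assumes the per-layer settings are linearly interpolated from the first layer to the last; the function name, the scaling ranges, and the model dimensions are hypothetical values chosen for illustration and are not OpenELM's actual configuration.

```python
def layerwise_config(num_layers, d_model, head_dim, alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    # alpha scales the attention width, beta scales the feed-forward width.
    # Both are interpolated linearly from the first layer to the last.
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        n_heads = max(1, round(a * d_model / head_dim))
        d_ffn = int(round(b * d_model))
        configs.append({"layer": i, "n_heads": n_heads, "d_ffn": d_ffn})
    return configs

# Toy model: 4 layers, model width 1024, 64 dimensions per attention head.
configs = layerwise_config(num_layers=4, d_model=1024, head_dim=64)
for cfg in configs:
    print(cfg)
```

Early layers end up narrower and later layers wider, so the same overall parameter budget is distributed unevenly across depth instead of being repeated identically in every layer.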

Reinforcement Learning in LLM Alignment

A recent comprehensive study compared the effectiveness of two widely used methods, PPO and DPO, for aligning LLMs with human preferences and safety standards. Here is an overview of each method:

Proximal Policy Optimization (PPO)

PPO has been the traditional algorithm for reinforcement learning from human feedback (RLHF). It requires a separate reward model, trained on human preference data, which scores the LLM's responses. PPO then updates the LLM so that it produces responses that earn higher scores, which brings it closer to human preferences. Despite its complexity and its need for an auxiliary reward model, PPO has proven effective in achieving high alignment performance.
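
The core of PPO is its clipped surrogate objective, sketched below in plain Python. This is a simplified illustration, not a full RLHF pipeline: it assumes the advantages have already been computed (in RLHF they are typically derived from reward-model scores), and the function name and the clipping value of 0.2 are illustrative choices.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    # logp_new / logp_old: log-probabilities of the sampled actions (tokens)
    # under the updated policy and the policy that generated the samples.
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)  # quantity to maximize

# Toy numbers: the new policy raises the probability of a rewarded action.
print(ppo_clipped_objective([-0.5], [-1.5], [1.0]))
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is the "proximal" part of the name.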

Direct Preference Optimization (DPO)

DPO offers a reward-model-free alternative to PPO. It streamlines alignment by training the LLM directly on pairs of preferred and rejected responses with a classification-like objective. This simplicity makes it an attractive option, particularly for newer LLMs, because it removes the need for additional models and processes. In the study, PPO generally outperforms DPO in alignment performance, but DPO's ease of use and solid performance make it a strong practical choice. Both methods play important roles in the ongoing effort to refine LLM alignment techniques.
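
The classification-like objective can be written as a short function. The sketch below computes the DPO loss for a single preference pair from summed response log-probabilities; the function and argument names and the value of beta are illustrative, and in practice the log-probabilities would come from the LLM being trained and a frozen reference copy of it.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy shifted toward the preferred response: the loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss has the same shape as binary cross-entropy on "preferred versus rejected", which is why no separate reward model or sampling loop is needed during training.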


April 2024 was a landmark month for LLM advances, marked by the release of innovative models and new research into alignment methods. Mixtral 8x22B, Llama 3, Phi-3, and OpenELM each offer distinct benefits: Mixtral's efficient resource usage, Llama 3's extensive training dataset, Phi-3's reliance on high-quality data, and OpenELM's mobile-optimized architecture. The comparison of PPO and DPO, in turn, provides valuable insight into how best to align LLMs with human values.

Frequently Asked Questions

What specific data curation techniques were employed in Phi-3's heavily filtered web data and synthetic data?

The data curation techniques for Phi-3 involved selective web scraping to gather high-quality data, removal of low-quality or irrelevant content, and generation of high-quality synthetic data to augment the training dataset. This rigorous approach raises the quality of the training data and contributes to the model's strong performance compared to models trained on larger datasets.

In what practical scenarios or applications might the layer-wise scaling strategy from OpenELM be particularly advantageous?

The layer-wise scaling strategy from OpenELM is particularly advantageous where computational resources are limited but performance still matters. Practical applications include real-time language processing on mobile devices, smart home assistants, and other IoT devices. The strategy improves model performance without a proportional increase in model size, making it well suited to resource-constrained environments.

What are the potential implications of further expanding training datasets beyond the current 15 trillion tokens for models like Llama 3?

Expanding training datasets beyond the current 15 trillion tokens could lead to even more significant performance improvements for models like Llama 3. Larger datasets can provide more linguistic nuances, diverse contexts, and varied examples for the model to learn from, potentially enhancing its ability to understand and generate human-like text. However, this also raises questions about the diminishing returns of dataset expansions and the computational and environmental costs associated with training on such large scales.