Reflection 70B saga continues as training data provider releases post-mortem report
On September 5th, 2024, Matt Shumer, co-founder and CEO of the startup Hyperwrite AI (also known as OthersideAI), took to the social network X to post the bombshell news that he had fine-tuned a version of Meta’s open source Llama 3.1-70B into an even more performant large language model (LLM) known as Reflection 70B. Based on alleged third-party benchmarking test results he published, it was so performant that it was “the world’s top open-source model,” according to his post.
I’m excited to announce Reflection 70B, the world’s top open-source model. Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes. 405B coming next week – we expect it to be the best model in the world. Built w/ @GlaiveAI. Read on ⬇️: pic.twitter.com/kZPW1plJuo
- Matt Shumer (@mattshumer_) September 5, 2024
However, shortly after its release, third-party evaluators in the AI research and hosting community struggled to reproduce the claimed results, leading to accusations of fraud. Researchers cited discrepancies between the announced benchmark results and their independent tests, sparking a wave of criticism on social platforms such as Reddit and X.
In response to these concerns, Shumer pledged he would conduct a review of the issues alongside Sahil Chaudhary, founder of Glaive, the AI startup whose synthetic data Shumer claimed he had trained Reflection 70B on, and in which he later revealed he had invested what he called a small amount.
Now, nearly a month later, Chaudhary last night released a post-mortem report on his Glaive AI blog about the Reflection 70B model and published resources for the open-source AI community to test the model and his training process on their own.
He says that while he was unable to reproduce all of the same benchmarks, he “found a bug in the initial code,” resulting in several results appearing higher than what he has found on recent tests of Reflection 70B. However, other benchmark results appear higher than before, adding to the mystery.
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data. Today, I’m sharing model artifacts to reproduce the initial claims and a post-mortem to address…
- Sahil Chaudhary (@csahil28) October 2, 2024
As Chaudhary wrote in the post: “There were a lot of mistakes made by us in the way we launched the model, and handled the problems reported by the community. I understand that things like these have a significant negative effect on the open source ecosystem, and I’d like to apologize for that. I hope that this adds some clarity to what happened, and is a step in the direction of regaining the lost trust. I have released all of the assets required to independently verify the benchmarks and use this model.”
Sharing model artifacts
To restore transparency and rebuild trust, Chaudhary shared several resources to help the community replicate the Reflection 70B benchmarks. These include:
Model weights: Available on Hugging Face, providing the pre-trained version of Reflection 70B.
Training data: Released for public access, enabling independent tests on the dataset used to fine-tune the model.
Training scripts and evaluation code: Available on GitHub, these scripts allow for reproduction of the model’s training and evaluation process.
These resources aim to clarify how the model was developed and offer a path for the community to validate the original performance claims.
Reproducing the benchmarks
In his post-mortem, Chaudhary explained that a major issue with reproducing the initial benchmark results stemmed from a bug in the evaluation code.
This bug caused inflated scores in certain tasks, such as MATH and GSM8K, due to an error in how the system handled responses from an external API. The corrected benchmarks show lower, but still strong, performance on most tasks relative to the initial report.
The updated benchmark results for Reflection 70B are as follows:
MMLU: 90.94%
GPQA: 55.6%
HumanEval: 89.02%
MATH: 70.8%
GSM8K: 95.22%
IFEVAL: 87.63%
Compare that to the originally stated performance of:
MMLU: 89.9%
GPQA: 55.3%
HumanEval: 91%
MATH: 79.7%
GSM8K: 99.2%
IFEVAL: 90.13%
Although most of the revised scores are not as high as those initially reported (MMLU and GPQA came in slightly higher), Chaudhary asserts that they are more accurate reflections of the model’s capabilities. He also addressed concerns about dataset contamination, confirming that tests showed no significant overlap between the training data and benchmark sets.
Reflecting on a hasty release
Chaudhary admitted that the decision to release Reflection 70B was made hastily, driven by enthusiasm for the model’s performance on reasoning-based tasks. He noted that the launch lacked sufficient testing, particularly regarding the compatibility of the model files, and that he and Shumer had not verified whether the model could be easily downloaded and run by the community.
“We shouldn’t have launched without testing, and with the tall claims of having the best open-source model,” Chaudhary wrote.
He also acknowledged that more transparency was needed, especially regarding the model’s strengths and weaknesses. While Reflection 70B excels at reasoning tasks, it struggles in areas like creativity and general user interaction, a fact that was not communicated at launch.
Clarifying API confusion
One of the more serious accusations involved the suspicion that the Reflection 70B API was simply relaying outputs from Anthropic’s Claude model. Users reported strange behavior in the model’s outputs, including responses that seemed to reference Claude directly.
Chaudhary addressed these concerns, saying that although some of these behaviors were reproducible, there was no use of Claude APIs or any form of word filtering in the Reflection 70B model. He reiterated that the API was run on Glaive AI’s compute infrastructure, and that Matt Shumer had no access to the code or servers used during this period.
Looking ahead
In closing, Chaudhary emphasized his commitment to transparency and expressed his hope that this post-mortem and the release of model artifacts will help restore trust in the project. He also confirmed that