There's Too Much Happening (Part 2)
There is so much going on in AI right now that we needed a part 2!
Recall that my self-assigned assignment was to understand a single post about important announcements that happened in the field over the last week. Part 1 included an extended metaphor about running and an explanation of the first two announcements. Today we'll cover the rest of the announcements, minus the honorable mentions. Extra credit if you send me explanations of those.
To kick things off, Google DeepMind has some ideas about "self-training"
Say you want to train an LLM to do some task, such as solving math problems or writing code. You'll need to build one that understands both language and your task. You can accomplish that with these steps:
Step 1: Train the model on large swaths of the internet
You need to build a distributed system that can munge through lots of data and train the model. The resulting model will have an understanding of language, but is not likely to be good at the task you care about.
Step 2: Fine-tune the model for the task you care about
You need to generate a dataset of good and bad examples for the task you care about to “teach” the model to do the task well. Often, this step involves paying humans to evaluate the output of the model and give “reinforcing” signals like how you’d train a puppy. (“Bad model, we do not call users mean names”).
The first step was initially a research problem, then became an engineering problem and is now mostly a money problem. Researchers have figured out what settings to use to train an LLM, engineers have figured out how to keep these processes running, and now the biggest problem is to get someone to write you a check to buy GPUs.
The second step is an operational problem that is much harder to scale. Even if you have a lot of money to hire a lot of humans to annotate a lot of data, the work to coordinate large-scale human annotation projects is non-trivial. This is why people are interested in finding an alternative to Step 2 that is “fully automated” (i.e. human-free). This paper offers one such alternative for the cases where you have an evaluation dataset for your task that can return “binary labels” (0 or 1).
Not all tasks have a "true answer", but let's consider a case when they do. Say you want your LLM to solve math problems and you have a test set of math problems and their solutions. That test set might look something like this:
[
  Question 1: some math problem, Answer 1: 10,
  Question 2: some math problem, Answer 2: 1,
  Question 3: some math problem, Answer 3: -3.4,
  ...
  Question N: some math problem, Answer N: 8.0
]
You could feed your model all of the questions and evaluate whether it produced the correct answer. If it did, you give that sample the “binary label” of 1 and if not, you label it 0.
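Here's a rough sketch of what that scoring step could look like in Python. The `generate_answer` helper is hypothetical (not from the paper); it just stands in for however you prompt the model and pull out its final answer.

```python
# A minimal sketch of the binary-labeling step, assuming a hypothetical
# generate_answer(model, question) helper that returns the model's answer
# as a string. Nothing here is taken from the paper's actual code.

def binary_label(model, test_set, tolerance=1e-6):
    """Label each (question, reference_answer) pair 1 if the model got it right, else 0."""
    labeled_samples = []
    for question, reference_answer in test_set:
        model_answer = generate_answer(model, question)  # hypothetical helper
        try:
            correct = abs(float(model_answer) - float(reference_answer)) < tolerance
        except ValueError:
            correct = False  # the model produced something that isn't a number
        labeled_samples.append((question, model_answer, 1 if correct else 0))
    return labeled_samples
```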
The “self-training” idea proposed in this paper is to iteratively fine-tune the model using this auto-generated dataset. The fine-tuning steps are:
- Run your test set through your LLM and score the samples using binary feedback.
- Fine-tune the model on these samples, i.e. update the weights to make the model more likely to output the samples that were correct and less likely to output the samples that were incorrect.
- Repeat
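Putting those steps together, the loop might look something like the sketch below. `binary_label` is the scoring function sketched earlier and `fine_tune` stands in for whatever supervised fine-tuning routine you already have; both are hypothetical placeholders, not the authors' code.

```python
# A sketch of the iterative self-training loop described above.

def self_train(model, test_set, num_iterations=3):
    for _ in range(num_iterations):
        # 1. Generate answers and score them with binary feedback (0 or 1).
        labeled_samples = binary_label(model, test_set)
        # 2. Fine-tune on these samples, using the binary label as the signal:
        #    push the model toward its correct outputs and away from incorrect ones.
        model = fine_tune(model, labeled_samples)
        # 3. Repeat with the updated model.
    return model
```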
They call this technique “self-training with feedback” and they claim it “significantly surpasses fine-tuning only on human data” in experiments with one LLM and two datasets related to mathematical problem solving and code generation. 💥
Welcome to the world, Phi-2 2.7B
There’s a race to make large language models smaller, and Microsoft just stepped into first place with the release of the 2.7 billion parameter Phi-2 model.
Why smaller? Because large models are unwieldy to work with and expensive to run. To give you a sense of scale, the Mixtral 8x7B model we talked about in Part 1 has 47 billion parameters and requires over 90 GB of VRAM to run with full precision (or 30 GB with 4-bit quantization). For our non-technical readers, that’s large enough to be annoying and expensive.
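Those numbers come from back-of-the-envelope math: parameter count times bytes per parameter, plus some headroom for activations and the key-value cache (treating "full precision" here as 16-bit weights).

```python
# Back-of-the-envelope weight memory for Mixtral 8x7B (ignores activation
# and KV-cache overhead, which is why real usage comes in a bit higher).
params = 47e9                  # approximate parameter count

fp16_gb = params * 2 / 1e9     # 16-bit weights: 2 bytes per parameter  -> ~94 GB
int4_gb = params * 0.5 / 1e9   # 4-bit weights: 0.5 bytes per parameter -> ~24 GB

print(f"16-bit: ~{fp16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")
```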
In comparison, Phi-2 is roughly 5 GB and has only 2.7 billion parameters. The larger language models are still more powerful, but Microsoft claims that Phi-2 is the best model with less than 13 billion parameters. It’s a good model to try if you want to play around with an LLM locally or in Colab. 💥
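If you want to kick the tires, something along these lines should work with the Hugging Face transformers library. `microsoft/phi-2` is the model ID on the Hub; the prompt and generation settings here are just placeholders.

```python
# Minimal sketch of loading Phi-2 with Hugging Face transformers.
# Requires a recent transformers release (older versions may need
# trust_remote_code=True) and accelerate for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,  # ~5 GB of weights in half precision
    device_map="auto",          # put it on a GPU if one is available
)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```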
Finally, LLM360.
This is an initiative from a company called Petuum to release “fully transparent” LLMs that include details about the training process used to build the final model artifact. I’m not clear why they are doing this, but it’s cool!
From the paper:
Most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process.
To kick off the initiative, they’ve released two 7 billion parameter LLMs pre-trained from scratch, including their training code, data, intermediate checkpoints, and analyses. You can download their text generation and code generation models at llm360.ai.
These aren’t the top-performing open-source models out there (according to their own analysis), but they’re in the mix and I commend their effort to level the playing field. Some of the magic required to get these models working is still passed along by word of mouth within an insular research community that has historically been, let’s say, not the most inclusive. While I imagine the incentives behind this project have more to do with a business model play than DEI, and I'm skeptical of any company that has taken $90 million from SoftBank, I welcome the effort. 💥
And that’s all folks! You should have more than enough to play around with over the holidays.