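Training Data Synthesis (Part 1)

Posted on 2024-11-09 | Categories: CS, NLP, LLM

[This post is also published on the WeChat official account and Zhihu account of the same name, and on the personal blog linsight.cn]

Training recipes for large models have largely settled by now, so the most important problem is getting data. High-quality real-world data works well, but it is expensive and scarce, so synthetic data has become an important route. Recent specialized models, such as math models, code models, and reading-comprehension models, basically all use large amounts of synthetic data. After several rounds of iteration, the synthetic data and the models trained on it in these domains feed back into the next generation of general-purpose models, a bootstrapping loop that lifts itself off the ground. Llama-3 did exactly this.

I have been working on improving code ability lately, so it is well worth studying methods for (code) data synthesis.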

big picture

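I happened to find a newly published survey, "A Survey on Data Synthesis and Augmentation for Large Language Models", which goes through 250 papers on data generation for LLMs and categorizes and summarizes them. Let's use this survey to see what approaches exist.

First, the survey splits data generation into two broad classes: data augmentation and data synthesis.

Data augmentation is a "data -> similar data" approach that preserves the salient features of the original data along the way. The most typical example is the set of distortions applied to images in CV, such as rotation, flipping, and histogram adjustment, none of which destroys the important content of the original image. Beyond that, using a strong large model to label unlabeled data, via CoT for example, is also a form of data augmentation, and it can be combined with human annotation to further improve quality and efficiency.

Data synthesis creates entirely new data. The survey divides it into three categories:
- general model distillation: use a strong general-purpose model such as GPT-4 to generate data that can improve a weaker model; the Phi series is the typical example
- domain model distillation: in some domains such as math or code, a general-purpose model may not be good enough, so a specialized model is needed to generate the data
- model self-improvement: for example, generating text of different difficulty or different style from existing text data

[Figure: works for each of the above categories, arranged on a timeline]

Data augmentation and data synthesis are also applied differently across the LM lifecycle. The survey divides the lifecycle into six stages: (1) data preparation, (2) pretraining, (3) finetuning, (4) instruction-tuning, (5) preference alignment, and (6) application. (Personally I feel (1) and (2) could be merged, and likewise (3) and (4).)

[Figure: the methods at each lifecycle stage, organized and classified]

Here I mainly care about the text pretraining stage and about code-related content.

The survey lists the related work for the data preparation and pretraining stages.

From these, I picked some important or relatively recent works: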
- OSS-Instruct
- Case2Code
- TinyStories
- Iterative Question Composing
- Generator prompts
- MathInstruct
- SciLitLLM
- TRAIT
- Persona Hub
- AceCoder
- Repocoder
- Evol-Instruct

Persona Hub

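Paper: "Scaling Synthetic Data Creation with 1,000,000,000 Personas"

Date: June 2024

Organization: Tencent

The problems data synthesis has to solve can be summarized as three:
- Diversity: real data comes from different real-world scenarios and different people, so it is highly diverse, whereas synthetic data is constrained by the characteristics of the prompt and the model, so its diversity is limited
- Consistency: the distribution of the synthetic data should match that of real data; otherwise performance drops sharply when inference encounters inputs from a different distribution
- High quality: while being diverse and fitting the real distribution, synthetic data should also have high-quality content as far as possible (real data contains plenty of low-quality content too, but that content has been shown to be of little value)

This Tencent paper mainly addresses diversity. Generally speaking, without sampling, a fixed prompt fed to a fixed LLM yields only one fixed output. Since the LLM cannot be swapped out at scale, getting diverse outputs requires changing the prompt.

Existing methods for improving diversity in synthetic data basically fall into two paradigms:
- instance-driven: use a seed corpus to obtain diverse prompts. Representative works are "Self-instruct: Aligning language models with self-generated instructions" and "Metamath: Bootstrap your own mathematical questions for large language models". With this approach, prompt diversity is limited by the size of the seed corpus.
- key-point-driven: increase prompt diversity through permutations and combinations along key dimensions. Representative works are "Synthetic data (almost) from scratch: Generalized instruction tuning for language models" and "Key-point-driven data synthesis with its enhancement on mathematical reasoning". For general-purpose data, however, there can be a great many key dimensions, which takes a lot of human effort, so this approach is better suited to domain-specific data synthesis such as math.

Whichever method is used, it amounts to feeding the model a "random number", except that this random number is not a numeral but can be thought of as a "stringified" random number.

So Tencent proposes a persona-driven data synthesis method (to supply exactly this kind of stringified random number). A persona description can look like this: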
- a moving company driver
- a chemical kinetics researcher

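The model is then asked to create data that fits a given persona, with a prompt of the form "create {data} with {persona}".

The advantage of this approach is that the persona dimension does not interfere with the rest of the prompt, so it can be combined with almost any data synthesis method. As long as enough personas are synthesized, in theory one could obtain diversity matching that of the real world.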
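To make the "persona as a stringified random number" idea concrete, here is a minimal sketch of crossing personas with task prompts. The template wording and the `build_prompts` helper are assumptions made for illustration; they are not the paper's actual prompts.

```python
import itertools

# Hypothetical task templates; "{persona}" is the only persona-specific slot.
TASK_TEMPLATES = {
    "math": "Create a challenging math problem with the following persona: {persona}",
    "instruction": "Guess a prompt that the following persona might ask an assistant: {persona}",
}

def build_prompts(personas, templates=TASK_TEMPLATES):
    """Cross every persona with every task template."""
    prompts = []
    for persona, (task, template) in itertools.product(personas, templates.items()):
        prompts.append({
            "task": task,
            "persona": persona,
            "prompt": template.format(persona=persona),
        })
    return prompts

personas = ["a moving company driver", "a chemical kinetics researcher"]
for item in build_prompts(personas):
    print(item["task"], "->", item["prompt"])
```

Because the persona only fills one slot, the same persona list can be reused with any other task template without touching the rest of the prompt.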

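Obtaining personas

The first question is how to obtain sufficiently diverse personas. The paper gives two methods:
- text-to-persona
- persona-to-persona

1. text-to-persona

Build personas from massive web data: take an arbitrary document, then prompt the model with "who is likely to read / write / like / dislike this text" and have it describe the corresponding persona.

In practice, asking the model for finer-grained descriptions works better. The granularity and type of the persona description also depend heavily on the input text; for instance, when the input is a math or physics document, the resulting persona description tends to be quite fine-grained.

The output persona description can be natural language or structured text, depending on what is needed.

2. persona-to-persona

Although diverse web text can yield a wide variety of personas, some may still be missed. So besides generating personas from text, more personas can be derived from existing ones. For example, a "child" persona can be inferred from the persona of a nurse at a children's hospital (a patient-caregiver relationship). Similarly, a "beggar" can be derived from the persona of a shelter worker (an assistance relationship), and a "behind-the-scenes film crew member" can be derived from the persona of a film's lead actor (a co-worker relationship).

Following the six degrees of separation theory, the paper runs six iterations of persona-to-persona relationship expansion on each persona obtained via text-to-persona, further enriching persona diversity.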
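The six-round expansion can be pictured as a bounded breadth-first walk over persona relationships. The sketch below is only an illustration: `related` is a hypothetical callable standing in for the LLM call that proposes related personas, and the toy relationship table is made up.

```python
def expand_personas(seeds, related, rounds=6):
    """Expand seed personas through `rounds` hops of persona-to-persona relations.

    `related(persona)` is a hypothetical stand-in for an LLM call that returns
    personas closely related to the given one.
    """
    seen = set(seeds)
    ordered = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for persona in frontier:
            for new_persona in related(persona):
                if new_persona not in seen:
                    seen.add(new_persona)
                    ordered.append(new_persona)
                    next_frontier.append(new_persona)
        if not next_frontier:
            break  # nothing new was discovered in this round
        frontier = next_frontier
    return ordered

# Toy relationship table used instead of a real model (illustrative only).
toy_relations = {
    "a nurse at a children's hospital": ["a child patient"],
    "a child patient": ["a parent of a young child"],
}
print(expand_personas(["a nurse at a children's hospital"],
                      lambda p: toy_relations.get(p, [])))
```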

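After obtaining a large number of personas, deduplication is needed. The paper uses two methods.

1. MinHash

Persona descriptions are generally short, so MinHash deduplication simply uses 1-grams and a signature size of 128, with the similarity threshold set to 0.9.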
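A minimal, standard-library-only sketch of MinHash deduplication with these settings follows. It assumes whitespace-separated words as the 1-grams and uses seeded BLAKE2b hashes as the hash family; both choices are assumptions for illustration, and the pairwise comparison is quadratic, whereas a large-scale pipeline would normally add locality-sensitive hashing.

```python
import hashlib

NUM_PERM = 128    # signature size from the text
THRESHOLD = 0.9   # similarity threshold from the text

def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed plays the role of one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list:
    """Signature over the set of lowercase word 1-grams."""
    tokens = set(text.lower().split()) or {""}
    return [min(_hash(tok, seed) for tok in tokens) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup_minhash(personas, threshold: float = THRESHOLD):
    """Keep a persona only if it is below the threshold against all kept ones."""
    kept, kept_sigs = [], []
    for persona in personas:
        sig = minhash_signature(persona)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(persona)
            kept_sigs.append(sig)
    return kept

print(dedup_minhash([
    "a moving company driver",
    "A moving company driver",
    "a chemical kinetics researcher",
]))
```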

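2. Embedding

Use an embedding model, such as OpenAI's text-embedding-3-small, to compute the similarity between persona descriptions, then filter by threshold; the similarity threshold here is also set to 0.9.

The similarity threshold for filtering can be set according to need. For example, when not many personas are needed but higher diversity is required, a stricter (that is, lower) similarity threshold can be chosen, keeping a smaller set of personas that differ more from one another.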
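The effect of the threshold can be seen in a small sketch of embedding-based filtering. The vectors below are made-up toy values, not real model embeddings, and `dedup_by_embedding` is a hypothetical helper; in practice the vectors would come from an embedding model.

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedup_by_embedding(personas, embeddings, threshold=0.9):
    """Greedy filter: drop a persona whose similarity to any kept one reaches the threshold."""
    kept = []  # indices of personas kept so far
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [personas[i] for i in kept]

personas = [
    "a nurse at a children's hospital",
    "a pediatrician",
    "a moving company driver",
]
# Toy 3-d vectors chosen by hand for illustration only.
toy_embeddings = [[1.0, 0.1, 0.0], [0.7, 0.7, 0.0], [0.0, 0.2, 1.0]]

print(dedup_by_embedding(personas, toy_embeddings, threshold=0.9))
print(dedup_by_embedding(personas, toy_embeddings, threshold=0.5))
```

With these toy vectors the first two personas have a cosine similarity of about 0.77, so a threshold of 0.9 keeps all three, while the stricter threshold of 0.5 drops the second one.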

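Using personas

Once personas are available, the question is how to use them. A persona can be inserted into zero-shot or few-shot prompts.

Persona information can be used to synthesize different kinds of data.

1. Math

For different personas, the model produces data of different difficulty and different types.

2. Logical reasoning

3. Instruction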

OSS-Instruct

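Paper: "Magicoder: Empowering Code Generation with OSS-Instruct"

Date: December 2023

Magicoder uses the OSS-INSTRUCT method (OSS = open-source code snippets) to synthesize 75k instruction samples, and obtains good results.

The OSS-INSTRUCT pipeline is as follows:

- First, seed code snippets are collected from open-source code
- For each code document, 1-15 consecutive lines are randomly extracted as the seed snippet
- Each snippet is fed into the prompt template below to obtain a coding problem and a solution

Text version of the prompt: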

You are exceptionally skilled at crafting high-quality programming problems and
offering precise solutions.
Please gain inspiration from the following random code snippet to create a
high-quality programming problem. Present your output in two distinct sections:
[Problem Description] and [Solution].
Code snippet for inspiration:
```
{code}
```
Guidelines for each section:
1. [Problem Description]: This should be **completely self-contained**, providing
all the contextual information one needs to understand and solve the problem.
Assume common programming knowledge, but ensure that any specific context,
variables, or code snippets pertinent to this problem are explicitly included.
2. [Solution]: Offer a comprehensive, **correct** solution that accurately
addresses the [Problem Description] you provided.
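The seed-extraction step (1-15 random consecutive lines) and the template filling can be sketched as below. The function names are hypothetical, the template is abbreviated to its first sentence and the snippet slot, and the source file is a made-up example.

```python
import random

MAX_SEED_LINES = 15  # upper bound on snippet length, from the text
FENCE = "`" * 3      # code fence used around the snippet in the prompt

def extract_seed_snippet(source: str, rng: random.Random,
                         max_lines: int = MAX_SEED_LINES) -> str:
    """Pick 1..max_lines consecutive lines at a random position in the document."""
    lines = source.splitlines()
    if not lines:
        return ""
    length = rng.randint(1, min(max_lines, len(lines)))
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])

def build_prompt(snippet: str) -> str:
    """Fill an abbreviated version of the prompt template with the seed snippet."""
    return (
        "Please gain inspiration from the following random code snippet to create a "
        "high-quality programming problem.\n"
        "Code snippet for inspiration:\n"
        f"{FENCE}\n{snippet}\n{FENCE}\n"
    )

rng = random.Random(0)  # fixed seed so the sketch is reproducible
document = "\n".join(f"x{i} = {i} * 2" for i in range(40))
print(build_prompt(extract_seed_snippet(document, rng)))
```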

[Figure: some generated samples]

Case2Code

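Paper: "Case2Code: Learning Inductive Reasoning with Synthetic Data"

Date: July 2024

This paper observes that in the code domain, deductive reasoning data is fairly common while inductive reasoning data is rare, which leaves models weak at inductive reasoning.

It therefore proposes Case2Code, a task tied to the inductive reasoning ability of LLMs.

Case2Code requires the model under test to induce the underlying program logic from given inputs and outputs of the code.

Case2Code data is synthesized, with the following framework.

First, programs are collected with a rule-based filter; then an LLM writes example inputs for these programs, and a code interpreter is used to obtain the outputs for those inputs. Finally, low-quality programs are filtered out based on the outputs, yielding high-quality (program, input, output) triples.

1. Obtaining programs

Python functions are extracted from The Stack dataset with a parsing tool, keeping those that satisfy the following rules:
- passes a syntax check
- has one or more input parameters and a return value
- does not depend on third-party packages or external I/O operations

Functions that meet these rules can easily be run with a code interpreter.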
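A rough approximation of these three rules can be written with Python's `ast` module. This is only a sketch under stated assumptions: the `IO_CALLS` blocklist is illustrative, an `import` inside the function body is used as a proxy for external dependencies, and the paper's actual filter may differ.

```python
import ast

IO_CALLS = {"open", "input"}  # assumed, illustrative blocklist of I/O builtins

def keep_function(source: str) -> bool:
    """Return True if `source` is a single function passing the three rules."""
    # Rule 1: the code must parse.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        return False
    func = tree.body[0]

    # Rule 2: at least one parameter and at least one `return <value>`.
    args = func.args
    has_params = bool(args.posonlyargs or args.args or args.kwonlyargs
                      or args.vararg or args.kwarg)
    returns_value = any(isinstance(node, ast.Return) and node.value is not None
                        for node in ast.walk(func))

    # Rule 3 (approximation): no imports and no direct calls to I/O builtins.
    uses_import = any(isinstance(node, (ast.Import, ast.ImportFrom))
                      for node in ast.walk(func))
    does_io = any(isinstance(node, ast.Call)
                  and isinstance(node.func, ast.Name)
                  and node.func.id in IO_CALLS
                  for node in ast.walk(func))
    return has_params and returns_value and not uses_import and not does_io

print(keep_function("def add(a, b):\n    return a + b\n"))
print(keep_function("def read(path):\n    return open(path).read()\n"))
```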

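2. Generating inputs

An LLM writes inputs for the collected Python functions. The paper finds that a relatively small LLM suffices for this step, which improves efficiency and lowers cost.

The prompt for generating inputs: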

Given the function, first analyze the types of the function arguments, then write
10 different example inputs for the function, each example should be a dict with
function arguments' names and their values.
Output format:
```python
examples = [
    dict(argname=argvalue),
    ....
]
```
Function:
```python
def test_func(a: int, b: str) -> str:
    return str(a) + b
```
Examples:
```python
examples = [
    dict(a=1, b='a'),
    dict(a=2, b='b'),
    dict(a=3, b='c'),
    dict(a=4, b='d'),
    dict(a=5, b='e'),
    dict(a=6, b='f'),
    dict(a=7, b='g'),
    dict(a=8, b='h'),
    dict(a=9, b='i'),
    dict(a=10, b='j'),
]
```
Function:
```python
{code}
```
Examples:

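3. Obtaining outputs

The functions are run on the generated inputs with a code interpreter. After the outputs are obtained, filtering rules are applied to screen out invalid inputs or functions. For example, if a function's output stays the same as the input changes, the function is probably problematic and gets filtered out.

In addition, functions with very long output values are filtered out, so that Case2Code data does not exceed the LLM's context window (though context windows are long nowadays, so this should be uncommon).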

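4. Post-processing

Finally, the function and its input-output pairs are assembled into a prompt. For a case with n input-output pairs, m <= n pairs are sampled as the observation set.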
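Sampling the observation set and assembling a training sample can be sketched as follows. The prompt wording is a made-up single template (the paper uses several styles), and `build_case2code_sample` is a hypothetical helper; each case is assumed to be a pair of an argument dict and the printed output.

```python
import random

def build_case2code_sample(source, cases, rng, m=None):
    """Sample m <= n input/output pairs and pair the prompt with the target code."""
    if not cases:
        raise ValueError("need at least one input/output pair")
    n = len(cases)
    m = rng.randint(1, n) if m is None else min(m, n)
    observed = rng.sample(cases, m)  # observation set, sampled without replacement

    lines = ["Given the following input/output examples, "
             "write the function that produces them.", ""]
    for kwargs, output in observed:
        args = ", ".join(f"{name}={value!r}" for name, value in kwargs.items())
        lines.append(f"Input: {args}")
        lines.append(f"Output: {output}")
    return {"prompt": "\n".join(lines), "target": source}

source = "def test_func(a: int, b: str) -> str:\n    return str(a) + b\n"
cases = [({"a": i, "b": "x"}, repr(str(i) + "x")) for i in range(1, 6)]
sample = build_case2code_sample(source, cases, random.Random(0), m=3)
print(sample["prompt"])
```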

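The paper also finds that prompt diversity greatly affects how well the model's reasoning performance generalizes, so 10 prompt templates in different styles were built. Training with diverse prompts works better.

[Figure: an example of an assembled prompt]

Adding Case2Code data during either pretraining or finetuning improves the model's code ability.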

TinyStories

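Paper: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?"

Date: May 2023

Organization: Microsoft

This paper is fairly old by now. It starts from the observation that relatively small LLMs (around 125M parameters, for example) generate poorly and often fail to produce coherent, consistent content. To study this, the authors used GPT-3.5 and GPT-4 to synthesize the TinyStories dataset, which contains only words that a typical 3- to 4-year-old child usually understands. Models trained on this dataset can generate fluent, consistent stories even with only about ten million parameters.

TinyStories data is synthesized roughly as follows. First, a vocabulary of 1500 basic words is collected, split into nouns, verbs, and adjectives. Each time, three words are picked from this vocabulary (one noun, one verb, one adjective), and the model is asked to write a story using these three words. This makes the stories diverse and covers a large number of ways of combining different concepts.

Here is an example prompt: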

Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand.
The story should use the verb "decorate", the noun "thunder" and the adjective "ancient". The story
should have the following features: the story should contain at least one dialogue, the story has a bad ending.
Remember to only use simple words!

With this prompt, GPT-3.5 gives the following result:

Once upon a time, in an ancient house, there lived a girl named Lily. She loved to decorate her room with pretty things. One
day, she found a big box in the attic. She opened it and saw many shiny decorations. Lily was very happy and decided to use
them in her room.
As Lily was decorating her room, the sky outside became dark. There was a loud thunder sound, and Lily got scared. She ran
to her mom and said, "Mommy, the thunder is so loud!" Her mom hugged her and said, "Don’t worry, it will pass soon."
But the thunder did not stop. It got louder and louder, and the ancient house started to shake. Suddenly, the roof fell down
on the pretty decorations. Lily was sad because her room was not pretty anymore. The end.
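The word-sampling procedure described above can be sketched as follows. The tiny vocabulary is a stand-in for the real 1500-word list, and sampling a random subset of story features is an assumption made here to reproduce the shape of the example prompt.

```python
import random

# Tiny illustrative vocabulary; the real list has about 1500 basic words.
VOCAB = {
    "noun": ["thunder", "park", "butterfly"],
    "verb": ["decorate", "build", "jump"],
    "adjective": ["ancient", "beautiful", "shiny"],
}
# Feature phrases taken from the example prompt; random sampling is assumed.
FEATURES = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
]

def build_story_prompt(rng: random.Random) -> str:
    """Sample one noun, one verb, one adjective and fill the story prompt."""
    noun = rng.choice(VOCAB["noun"])
    verb = rng.choice(VOCAB["verb"])
    adjective = rng.choice(VOCAB["adjective"])
    features = rng.sample(FEATURES, rng.randint(0, len(FEATURES)))

    prompt = (
        "Write a short story (3-5 paragraphs) which only uses very simple words "
        "that a 3 year old child would likely understand. "
        f'The story should use the verb "{verb}", the noun "{noun}" '
        f'and the adjective "{adjective}". '
    )
    if features:
        prompt += ("The story should have the following features: "
                   + ", ".join(features) + ". ")
    return prompt + "Remember to only use simple words!"

rng = random.Random(0)  # fixed seed for reproducibility
for _ in range(2):
    print(build_story_prompt(rng))
```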

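Besides generating coherent language, instruction following is also a key capability for LLMs. For this the TinyStories-Instruct dataset was built. Each story in the dataset begins with a set of instructions, which come in four types:
- A list of words to be included in the story
- A sentence that should appear in the story
- Features, such as dialogue, conflict, plot twist, bad ending
- A short summary of the story (1-2 lines)

An example: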

Summary: Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind.
They find beauty in the broken sandcastle and play happily with a butterfly.
Features: Dialogue, Foreshadowing, Twist
Sentence: One day, she went to the park and saw a beautiful butterfly.
Words: disagree, network, beautiful
Story: Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she went to the park and saw a beautiful butterfly. She tried to catch it, but it flew away. Lily’s friend, Timmy, came to play with her.
They decided to build a sandcastle, but they disagreed on how to build it. Lily wanted to make it tall, while Timmy wanted to
make it wide. They argued for a while, but finally agreed to make it both tall and wide. As they were building the sandcastle,
they noticed a group of ants working together. They realized that the ants were building a network of tunnels. Suddenly, a gust
of wind came and knocked over their sandcastle. They were sad, but then they saw the butterfly again. It landed on the remains
of their sandcastle and they realized that it was still beautiful, even in its broken state. They smiled and played together happily.

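Summary

In the end, data synthesis has to solve these problems:
- Diversity: real data comes from different real-world scenarios and different people, so it is highly diverse, whereas synthetic data is constrained by the characteristics of the prompt and the model, so its diversity is limited
- Consistency: the distribution of the synthetic data should match that of real data; otherwise performance drops sharply when inference encounters inputs from a different distribution
- High quality: while being diverse and fitting the real distribution, synthetic data should also have high-quality content as far as possible (real data contains plenty of low-quality content too, but that content has been shown to be of little value)

Consistency comes from how the prompt is written and from the comprehension ability of the LLM used for synthesis. Quality is usually improved with a post-hoc filtering model, or with methods such as execution feedback. Diversity is the part that tests the synthesis design the most, and it is also the most important part when the data volume is large.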


Reference

【1】A Survey on Data Synthesis and Augmentation for Large Language Models https://arxiv.org/abs/2410.12896
【2】Scaling Synthetic Data Creation with 1,000,000,000 Personas https://arxiv.org/abs/2406.20094
【3】Magicoder: Empowering Code Generation with OSS-Instruct https://arxiv.org/abs/2312.02120
【4】Case2Code: Learning Inductive Reasoning with Synthetic Data https://arxiv.org/pdf/2407.12504
【5】TinyStories: How Small Can Language Models Be and Still Speak Coherent English? https://arxiv.org/abs/2305.07759