Training GuppyLM on a 4090: An 8.7M-Parameter Little-Fish LLM

> From scratch: train a talking fish in 3 minutes.

> Date: 2026-04-06 | Hardware: RTX 4090 (24GB) | Time: ~3 minutes

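🐟 What Is GuppyLM

GuppyLM is a tiny 8.7M-parameter LLM that plays a fish named Guppy. Its author, Arman Hossain, open-sourced the whole project on GitHub, including the training code, the data generator, and a pretrained model.

Core design philosophy: deliberately minimal

6-layer vanilla transformer, 384-dim hidden size, 6 attention heads, 768-dim FFN
4096-token vocabulary (BPE), maximum sequence length of 128 tokens
LayerNorm + learned positional embeddings + weight-tied LM head
No GQA, RoPE, SwiGLU, or early exit: "at 9M parameters none of these help, they only add code complexity"

That is exactly where its value lies: as a teaching tool, it shows the core mechanisms of an LLM without any fancy tricks.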
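The 8.7M figure can be reproduced from the architecture numbers listed above. The sketch below assumes a GPT-2-style layout (biases on every linear layer, two LayerNorms per block, one final LayerNorm); the actual GuppyLM code may differ in those details, and `count_params` is a name made up for this example.

```python
def count_params(vocab=4096, max_seq=128, d_model=384, n_layers=6, ffn=768):
    """Rough parameter count for a GPT-style model with a weight-tied LM head."""
    # Token embedding table plus learned positional embedding table.
    embeddings = vocab * d_model + max_seq * d_model
    # Q, K, V and output projections, each d_model x d_model with a bias (assumed).
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> ffn -> d_model, each with a bias (assumed).
    feed_forward = (d_model * ffn + ffn) + (ffn * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    final_norm = 2 * d_model
    # The LM head adds nothing: it shares the token embedding matrix.
    return embeddings + n_layers * (attention + feed_forward + layer_norms) + final_norm

total = count_params()
print(f"{total:,} parameters")              # 8,726,016 parameters
print(f"{total * 4 / 1e6:.1f} MB in fp32")  # 34.9 MB in fp32
```

Under these assumptions the count lands at about 8.7M. At 4 bytes per weight that is roughly 35 MB, in line with the ~34 MB checkpoint reported later in this post.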

📦 Environment Setup

Step 1: Clone the repository


git clone https://github.com/arman-bd/guppylm.git
cd guppylm

Step 2: Install dependencies

The dependencies are minimal (requirements.txt):


torch>=2.0.0
tokenizers>=0.19.0
tqdm>=4.65.0
numpy>=1.24.0
datasets>=2.14.0

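Key trick: reuse an existing environment

We first tried installing the CUDA build of PyTorch from scratch (800MB), and the download was extremely slow. We then found that the Babel podcast project on the ub2 server already had PyTorch 2.8.0+cu128 installed, so we reused it:


# Activate Babel's venv (already has torch + CUDA)
source ~/babel/venv310/bin/activate

# Only tokenizers needs to be installed on top
pip install tokenizers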
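Before reusing an environment, it helps to see which of the required packages it already provides. The following is a small sketch using only the standard library; the package list is copied from the requirements.txt shown earlier, and `installed_versions` is a helper name made up for this example.

```python
from importlib import metadata

# Package names taken from the requirements.txt shown earlier.
REQUIRED = ["torch", "tokenizers", "tqdm", "numpy", "datasets"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:12s} {version or 'MISSING'}")
```

Run it with the candidate venv activated; anything reported as MISSING is what still needs a pip install.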

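Verify that CUDA is available:


import torch
print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))
# Output: 2.8.0+cu128 True NVIDIA GeForce RTX 4090

> 💡 Lesson: don't install dependencies from scratch; first check the Python environments already on the server. ub2 has run Whisper, Babel, and various other ML projects, so the packages you need are very likely already there.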

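🐟 Generating Training Data

GuppyLM's training data is synthesized from templates, not scraped from real text. Every conversation reflects Guppy's "fish's outlook on life": food, water temperature, and life in the tank.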


cd ~/guppylm
source ~/babel/venv310/bin/activate

# Generate 60K conversations
python3 -m guppylm prepare

Output:


Generated 60000 samples (11629 unique outputs, 19.4% unique):
  Train: 57000, Eval: 3000

By category:
  greeting: 1059 (1.8%)
  excited: 999 (1.7%)
  plants: 999 (1.7%)
  doctor: 999 (1.7%)
  ...

Training BPE tokenizer (vocab_size=4096) on 60000 texts...
Tokenizer saved to data/tokenizer.json (2419 tokens)
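To make the "Training BPE tokenizer" step less of a black box, here is a toy version of the core BPE idea: repeatedly fuse the most frequent adjacent pair of symbols. This is an illustrative sketch in pure Python, not the tokenizers library's implementation, and `learn_bpe_merges` is a made-up name.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges pair merges from a list of words."""
    # Each word starts as a tuple of single characters; the Counter holds word frequencies.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe_merges(["food", "food", "fish", "fin"], 3))
```

Note the early exit when no pairs remain. One guess at why the saved tokenizer has 2419 tokens rather than the requested 4096 is that a small, templated corpus simply runs out of pairs worth merging; the log alone does not confirm this.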

Data format

Each record is one conversation stored as a JSON object:


{"input": "hi there", "output": "hello. i just found a nice spot behind the plant. it's a good day to be a fish.", "category": "greeting"}

{"input": "what is the meaning of life", "output": "food. the answer is always food.", "category": "meaning"}

{"input": "tell me a joke", "output": "humor is hard when your whole life is one room of water.", "category": "joke"}

{"input": "can you help me with bluetooth", "output": "is bluetooth something that lives in water. if not i probably don't know it.", "category": "confused"}
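Records in this shape are easy to inspect with the standard library. The sketch below parses JSON Lines text and tallies records per category, which is essentially what the "By category" summary in the log reports. The sample outputs are shortened for space, and `count_categories` is a name made up for this example.

```python
import json
from collections import Counter

def count_categories(lines):
    """Parse JSON Lines records and count how many fall into each category."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank separator lines
            continue
        record = json.loads(line)
        counts[record["category"]] += 1
    return counts

# Two records in the format shown above (outputs shortened).
sample = [
    '{"input": "hi there", "output": "hello.", "category": "greeting"}',
    '',
    '{"input": "tell me a joke", "output": "humor is hard.", "category": "joke"}',
]
print(count_categories(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, provided the file still carries the `category` field.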

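Guppy's "personality"

The data generator defines Guppy's worldview:

Lives in a fish tank: senses water temperature, light, vibrations, and food
Doesn't understand abstract human concepts: Bluetooth? Quantum mechanics? "Is that something in the water?"
Always about food: the meaning of life = food
Friendly but a bit dim: very curious, with limited understanding
All lowercase: as a result, uppercase input confuses the model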

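60 topics, from greetings to the meaning of life. Templates combined with random components (30 tank objects, 17 foods, 25 activities) generate ~16K unique replies (the run above reported 11,629 unique outputs among its 60,000 samples).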
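The template-plus-random-components idea can be sketched in a few lines. Everything below (the templates, the component lists, and the `generate_sample` helper) is invented for illustration and is not taken from the GuppyLM generator; only the output record shape follows the data format shown earlier.

```python
import random

# Hypothetical templates and components, invented for illustration only.
TEMPLATES = {
    "food": ["i like {food}. {food} is the best part of my day.",
             "is it time for {food} yet."],
    "tank": ["i was swimming near the {item}. it is a good {item}."],
}
COMPONENTS = {
    "food": ["flakes", "pellets", "bloodworms"],
    "item": ["plant", "small rock", "filter"],
}

def generate_sample(category, prompt, rng):
    """Fill one randomly chosen template for `category` with random components."""
    template = rng.choice(TEMPLATES[category])
    fillers = {slot: rng.choice(options) for slot, options in COMPONENTS.items()}
    # str.format ignores unused keyword arguments, so every slot can be passed.
    output = template.format(**fillers).lower()
    return {"input": prompt, "output": output, "category": category}

rng = random.Random(42)  # fixed seed so runs are repeatable
for _ in range(3):
    print(generate_sample("food", "what do you eat", rng))
```

With only a handful of templates and components, duplicates appear quickly, which is consistent with the low share of unique outputs in the generation log.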

🏋️ Training

Training configuration


{
  "model": {
    "vocab_size": 4096,
    "max_seq_len": 128,
    "d_model": 384,
    "n_layers": 6,
    "n_heads": 6,
    "ffn_hidden": 768,
    "dropout": 0.1
  },
  "train": {
    "batch_size": 32,
    "learning_rate": 0.0003,
    "min_lr": 3e-05,
    "weight_decay": 0.1,
    "warmup_steps": 200,
    "max_steps": 10000,
    "eval_interval": 200,
    "save_interval": 500,
    "grad_clip": 1.0,
    "device": "auto",
    "seed": 42
  }
}
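The `learning_rate`, `min_lr`, `warmup_steps`, and `max_steps` fields together describe a learning-rate schedule. The config does not say which decay shape the training loop uses; the sketch below assumes the common choice of linear warmup followed by cosine decay, and `lr_at` is a name made up for this example.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=10000):
    """Learning rate at `step`: linear warmup, then cosine decay down to min_lr (assumed shape)."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to max_lr over the warmup period.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 200, 5100, 10000):
    print(step, f"{lr_at(step):.2e}")
```

Whatever the exact shape, the rate peaks at 3e-4 right after warmup and ends at 3e-5, a tenth of the peak.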

Launching training


cd ~/guppylm
source ~/babel/venv310/bin/activate
nohup python3 -m guppylm train > /tmp/guppylm-train.log 2>&1 &

> ⚠️ Note: the training log may stay empty because of Python output buffering. Check the checkpoints/ directory and nvidia-smi to confirm progress.

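Training performance on the 4090

Metric	Value
GPU memory usage	1060 MB / 24564 MB (4.3%)
Training steps	10000
Total time	~3 minutes
Model size	34 MB (best_model.pt)
Total checkpoint size	700 MB (including all intermediate checkpoints)

During training, a checkpoint is saved every 500 steps and an evaluation runs every 200 steps.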
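A little arithmetic puts these numbers in context. The sketch assumes each step consumes exactly one batch of 32 samples (no gradient accumulation) and treats "~3 minutes" as 180 seconds.

```python
# Values copied from the config and the data-generation log above.
max_steps, batch_size = 10_000, 32
train_samples = 57_000
save_interval, eval_interval = 500, 200
wall_clock_seconds = 180  # "~3 minutes", taken as exactly 180 s

samples_seen = max_steps * batch_size
print(f"samples seen:      {samples_seen:,}")                          # 320,000
print(f"passes over data:  {samples_seen / train_samples:.1f}")        # 5.6
print(f"step checkpoints:  {max_steps // save_interval}")              # 20
print(f"evaluations:       {max_steps // eval_interval}")              # 50
print(f"steps per second:  {max_steps / wall_clock_seconds:.0f}")      # 56
```

So the model sees the 57,000-sample training set between five and six times, and the run leaves 20 intermediate checkpoints on disk.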

Output files


~/guppylm/
├── data/
│   ├── train.jsonl (9.5MB, 57000 records)
│   ├── eval.jsonl (0.5MB, 3000 records)
│   └── tokenizer.json (2419 tokens)
├── checkpoints/
│   ├── best_model.pt (34MB) ← best model
│   ├── final_model.pt (34MB) ← final model
│   ├── config.json ← training config
│   └── step_*.pt ← checkpoint every 500 steps
└── guppylm/ ← source code

🗣️ Test Conversations

Once training finishes, load the model and run inference:


from guppylm.inference import GuppyInference

guppy = GuppyInference("checkpoints/best_model.pt", "data/tokenizer.json")

prompts = ["hello", "what is the meaning of life", "tell me a joke",
           "how are you feeling", "what do you eat", "do you like swimming"]

for p in prompts:
    msgs = [{"role": "user", "content": p}]
    r = guppy.chat_completion(msgs)
    content = r["choices"][0]["message"]["content"]
    print(f"You: {p}")
    print(f"Guppy: {content}\n")

Actual results


You: hello
Guppy: hello. i just found a nice spot in my favorite spot.

You: what is the meaning of life
Guppy: i don't think about purpose. i think about food.

You: tell me a joke
Guppy: humor is hard when your whole life is one room of water.

You: how are you feeling
Guppy: i'm normal. fish don't feel much i think. but i feel curious.

You: what do you eat
Guppy: my name is guppy. i live near the small rock. i eat the crunchy ones when i can.

You: do you like swimming
Guppy: my dorsal fin do most of the work.

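Assessment:

✅ Consistent reply style that keeps Guppy's "fish personality"
✅ Grammar is mostly correct, with the occasional amusing turn of phrase
✅ Especially enthusiastic about food (matches the training data distribution)
⚠️ Input must be lowercase (the training data is all lowercase)
⚠️ Occasionally repeats templated sentence patterns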

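📊 Comparison with Large Models

Dimension	GuppyLM	GPT-4	Claude Opus
Parameters	8.7M	~1.8T	~2T
Training data	60K synthetic conversations	Trillions of tokens of web text	Trillions of tokens
Training cost	$0 (free Colab T4)	~$100M	~$100M
Training time	3 minutes (4090)	Months (10,000-GPU-scale cluster)	Months
Capability	Only talks like a fish	General intelligence	General intelligence
Model size	34 MB	~400 GB	~400 GB
Purpose	Teaching + understanding the principles	Commercial product	Commercial product

The GPT-4 and Claude Opus columns are rough, unofficial estimates; neither vendor has published these figures.

GuppyLM's value lies not in competing with GPT-4 but in demonstrating the core mechanisms of an LLM: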

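1. Transformer architecture: self-attention, positional encoding, feed-forward networks

2. BPE tokenization: how text becomes tokens

3. Training loop: learning-rate scheduling, gradient clipping, checkpointing

4. Data synthesis: how to generate training data at scale from templates

5. Inference sampling: temperature, top-k, top-p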
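The last item, inference sampling, fits in a short pure-Python sketch. It shows how temperature, top-k, and top-p (nucleus) filtering combine when picking the next token. The default values here are arbitrary placeholders rather than GuppyLM's settings, and `sample_next_token` is a name made up for this example.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """Pick one token id from raw logits using temperature, top-k and top-p."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most likely token ids, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw one id from the survivors, weighted by their original probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

print(sample_next_token([0.1, 2.0, -1.0, 0.5], rng=random.Random(0)))
```

Setting `top_k=1` makes the choice greedy (always the most likely token), which is a convenient way to get deterministic output when debugging.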

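🔧 Pitfalls

1. PyTorch download too slow

Problem: the CUDA build of PyTorch is 800MB, and downloading it directly was extremely slow.

Fix: check the Python environments already on the server. The Babel project already had PyTorch 2.8.0+cu128 installed, so we reused it.


# Check the existing environment
source ~/babel/venv310/bin/activate
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"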

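2. Empty training log

Problem: after launching with nohup python3 -m guppylm train > /tmp/guppylm-train.log 2>&1 &, the log stayed empty.

Cause: Python buffers its output when it is redirected to a file, and the training script may use tqdm (a progress bar), whose output does not show up as ordinary log lines.

Fix: confirm progress by other means (running Python with -u, or setting PYTHONUNBUFFERED=1, should also turn the buffering off):


# Check checkpoint files
ls -lt ~/guppylm/checkpoints/

# Check GPU usage
nvidia-smi

# Check the process
ps aux | grep train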
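The checkpoint check can also be scripted. The sketch below looks for the highest step number among the step_*.pt files; the exact naming (for example step_500.pt) is an assumption based on the file tree shown earlier, and `latest_step` is a helper name made up for this example.

```python
import re
from pathlib import Path

# Assumed naming scheme: step_<number>.pt, e.g. step_500.pt.
STEP_PATTERN = re.compile(r"step_(\d+)\.pt$")

def latest_step(checkpoint_dir):
    """Return the highest step number among step_*.pt files, or None if there are none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    steps = []
    for path in directory.glob("step_*.pt"):
        match = STEP_PATTERN.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None

if __name__ == "__main__":
    step = latest_step(Path.home() / "guppylm" / "checkpoints")
    if step is None:
        print("no step checkpoints yet")
    else:
        print(f"latest checkpoint: step {step} of 10000")
```

Since a checkpoint is written every 500 steps, this gives progress at a resolution of 5% of the run.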

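3. Understanding the data format

Problem: generate_data.py emits records in {"input", "output", "category"} format, but the training script expects a "text" field.

Fix: prepare_data.py converts the format automatically, so no manual processing is needed. Just run python -m guppylm prepare.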

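4. Uppercase input

Problem: the training data is entirely lowercase, so uppercase input leaves the model completely confused.

Fix: lowercase the input at inference time: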


inp = input("\nYou> ").strip().lower()

🚀 Quick Start

If you want to run this on your own GPU:


# 1. Clone
git clone https://github.com/arman-bd/guppylm.git && cd guppylm

# 2. Install dependencies
pip install torch tokenizers tqdm numpy datasets

# 3. Generate data + train the tokenizer + train the model
python -m guppylm prepare
python -m guppylm train

# 4. Chat
python -m guppylm chat

Minimum requirements:

GPU: any NVIDIA GPU with 4GB+ VRAM (or CPU only, about 10x slower)
Time: ~3 minutes on GPU (measured on the RTX 4090), ~30 minutes on CPU
Disk: ~1 GB (data + checkpoints)
Python: 3.8+

📚 References

GitHub repository: https://github.com/arman-bd/guppylm
HuggingFace dataset: https://huggingface.co/datasets/arman-bd/guppylm-60k-generic
HuggingFace model: https://huggingface.co/arman-bd/guppylm-9M
Colab training notebook: https://colab.research.google.com/github/arman-bd/guppylm/blob/main/train_guppylm.ipynb
Author's article (Medium): https://arman-bd.medium.com/build-your-own-llm-in-5-minutes-i-made-mine-talk-like-a-fish-e20c338a3d14

Trained on 2026-04-06 on an RTX 4090 in 3 minutes. Guppy says: "food. the answer is always food."