Background:
Contents of the LoRA-fine-tuned gemma-3-1b directory:
ls sft_lora_models/unsloth_gemma-3-1b-it
adapter_config.json chat_template.jinja tokenizer_config.json
adapter_model.safetensors README.md tokenizer.json
added_tokens.json special_tokens_map.json tokenizer.model
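Before comparing outputs, it is worth checking what the adapter directory declares about itself. Below is a minimal sketch (my own check, assuming only the files listed above) that reads adapter_config.json and prints which base model the adapter was trained against and which modules it targets:

import json

LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"

with open(f"{LORA_DIR}/adapter_config.json") as f:
    cfg = json.load(f)

# Standard keys written by PEFT when the adapter was saved
print("base_model_name_or_path:", cfg.get("base_model_name_or_path"))
print("target_modules:", cfg.get("target_modules"))
print("r / lora_alpha:", cfg.get("r"), "/", cfg.get("lora_alpha"))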
File: test_lora.py (first version)
# -*- coding: utf-8 -*-
# Note: test the LoRA using the same loading approach as evaluate_gsm8k.py
import torch
from unsloth import FastLanguageModel
# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048
q = "A clown needs to buy 12 red noses. He already has 5. Each red nose costs $2. How much money does the clown need to buy the rest of the noses?\n\nLet's think step by step and output the final answer within \\boxed{}."
# =====================================================================
# Method 1: base model
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")
# =====================================================================
# Method 2: LoRA model (loaded directly from LORA_DIR, same as evaluate_gsm8k.py)
# =====================================================================
lora, tok2 = FastLanguageModel.from_pretrained(
    model_name=LORA_DIR,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(lora)
with torch.no_grad():
    out_lora = lora.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("LORA ===>\n", tok2.decode(out_lora[0], skip_special_tokens=True), "\n")
The two outputs are completely identical:
(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 00:59:59 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
LORA ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
Unsloth does try to detect whether the given path is a PEFT adapter, but that detection apparently failed here (I have not yet stepped through it in a debugger to confirm why):
class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        rope_scaling = None,
        fix_tokenizer = True,
        trust_remote_code = False,
        use_gradient_checkpointing = "unsloth",
        resize_model_vocab = None,
        revision = None,
        use_exact_model_name = False,
        fast_inference = False, # uses vLLM
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = True,
        *args, **kwargs,
    ):
        ...
        try:
            peft_config = PeftConfig.from_pretrained(
                model_name,
                token = token,
                revision = revision,
                trust_remote_code = trust_remote_code,
            )
            is_peft = True
        except Exception as error:
            peft_error = str(error)
            is_peft = False
        pass
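That detection can be reproduced in isolation. Here is a small sketch (my own check, not part of Unsloth) that calls PeftConfig.from_pretrained directly on the adapter directory, mirroring the try/except above; if this succeeds, the problem is more likely in how Unsloth handles the result than in the adapter files themselves:

from peft import PeftConfig

LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"

try:
    cfg = PeftConfig.from_pretrained(LORA_DIR)
    print("is_peft = True")
    print("peft_type:", cfg.peft_type)
    print("base_model_name_or_path:", cfg.base_model_name_or_path)
except Exception as error:
    print("is_peft = False:", error)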
After I shared part of the code and the logs with GPT-5 and with gemini-2.5-pro at low temperature, both suspected the automatic adapter loading was the problem and suggested a revised test_lora.py; with this version the two outputs now differ.
File: test_lora.py
# -*- coding: utf-8 -*-
# Note: revised version that applies the LoRA adapter explicitly with PeftModel instead of relying on auto-loading
import torch
from unsloth import FastLanguageModel
from peft import PeftModel  # << 1. import PeftModel
# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048
q = "Tom has 8 apples. He eats 3. How many are left?\n\nLet's think step by step and output the final answer within \\boxed{}."
# =====================================================================
# Method 1: base model (unchanged from before)
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")
# =====================================================================
# Method 2: LoRA model (corrected loading logic)
# =====================================================================
# 2. Instead of reloading the model, apply the LoRA weights onto the already-loaded 'base' model
print("Applying the LoRA adapter to the base model...")
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
# The base model was already set up with for_inference; no need to repeat it here
with torch.no_grad():
    # 3. Run inference with the new 'lora_model' object
    out_lora = lora_model.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
# 4. Decode with the original 'tok' tokenizer
print("LORA ===>\n", tok.decode(out_lora[0], skip_special_tokens=True), "\n")
(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 01:12:48 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
Applying the LoRA adapter to the base model...
LORA ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
8 - 3 = <<8-3=5>>5
There are 5 apples left.
\boxed{5}
Summary:
Although Unsloth can auto-load a LoRA adapter when given its directory, that mechanism can apparently fail for reasons I have not yet identified. The safer approach is to load the adapter explicitly with the standard PEFT API:
from peft import PeftModel
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
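To confirm the adapter is really attached before running a full evaluation, a small sanity check can help (check_adapter is a hypothetical helper, not part of the original scripts): count the lora_ parameters PEFT injected and compare logits with the adapter enabled and disabled.

import torch

def check_adapter(lora_model, tok, text="Hello"):
    # Count parameters injected by PEFT; zero means no adapter was attached.
    n_lora = sum(p.numel() for n, p in lora_model.named_parameters() if "lora_" in n)
    print("LoRA parameters injected:", n_lora)

    device = next(lora_model.parameters()).device
    inputs = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        with lora_model.disable_adapter():  # PEFT context manager: adapter temporarily off
            base_logits = lora_model(**inputs).logits
        lora_logits = lora_model(**inputs).logits  # adapter on again
    # A non-zero difference confirms the LoRA weights change the forward pass.
    print("max |logit diff|:", (lora_logits - base_logits).abs().max().item())

check_adapter(lora_model, tok)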