Background:
Directory of the LoRA-fine-tuned gemma3-1b:

ls sft_lora_models/unsloth_gemma-3-1b-it
adapter_config.json        chat_template.jinja      tokenizer_config.json
adapter_model.safetensors  README.md                tokenizer.json
added_tokens.json          special_tokens_map.json  tokenizer.model
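
As a quick sanity check (a minimal sketch, not part of the original debugging session), the adapter config can be inspected directly; the keys read below are standard fields of a PEFT adapter_config.json:

import json

# Sketch: confirm the directory really contains a LoRA adapter, and see which
# base model and target modules it was trained against.
with open("sft_lora_models/unsloth_gemma-3-1b-it/adapter_config.json") as f:
    cfg = json.load(f)
print(cfg.get("base_model_name_or_path"), cfg.get("r"), cfg.get("target_modules"))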
File: test_lora.py (original version)

# -*- coding: utf-8 -*-
# Note: test the LoRA with the same loading approach as evaluate_gsm8k.py

import torch
from unsloth import FastLanguageModel

# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048

q = "A clown needs to buy 12 red noses. He already has 5. Each red nose costs $2. How much money does the clown need to buy the rest of the noses?\n\nLet's think step by step and output the final answer within \\boxed{}."

# =====================================================================
# Approach 1: base model
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)

with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)

print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")

# =====================================================================
# Approach 2: LoRA model (loaded directly from LORA_DIR, same as evaluate_gsm8k.py)
# =====================================================================
lora, tok2 = FastLanguageModel.from_pretrained(
    model_name=LORA_DIR,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(lora)

with torch.no_grad():
    out_lora = lora.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)

print("LORA ===>\n", tok2.decode(out_lora[0], skip_special_tokens=True), "\n")

The two outputs are completely identical:

(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py 
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 00:59:59 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
   \\   /|    NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
 user
Tom has 8 apples. He eats 3. How many are left?

Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.

Final Answer: The final answer is $\boxed{5}$ 

==((====))==  Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
   \\   /|    NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
LORA ===>
 user
Tom has 8 apples. He eats 3. How many are left?

Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.

Final Answer: The final answer is $\boxed{5}$
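
Rather than eyeballing the logs, the two generations can be compared programmatically at the end of the script (a small sketch using out_base / out_lora from the first test_lora.py):

# Sketch: compare the two generations token-by-token instead of reading logs.
# torch.equal returns False on a shape mismatch, so no length check is needed.
print("identical:", torch.equal(out_base[0], out_lora[0]))  # -> True here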

Unsloth does try to detect whether the given path is a PEFT model, but that check apparently failed here (I have not yet stepped through it in a debugger to see why):

class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name                 = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length             = 2048,
        dtype                      = None,
        load_in_4bit               = True,
        load_in_8bit               = False,
        full_finetuning            = False,
        token                      = None,
        device_map                 = "sequential",
        rope_scaling               = None,
        fix_tokenizer              = True,
        trust_remote_code          = False,
        use_gradient_checkpointing = "unsloth",
        resize_model_vocab         = None,
        revision                   = None,
        use_exact_model_name       = False,

        fast_inference             = False, # uses vLLM
        gpu_memory_utilization     = 0.5,
        float8_kv_cache            = False,
        random_state               = 3407,
        max_lora_rank              = 64,
        disable_log_stats          = True,
        *args, **kwargs,
    ):
        ...
        try:
            peft_config = PeftConfig.from_pretrained(
                model_name,
                token = token,
                revision = revision,
                trust_remote_code = trust_remote_code,
            )
            is_peft = True
        except Exception as error:
            peft_error = str(error)
            is_peft = False
        pass
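
Before stepping through Unsloth internals, that detection step can be reproduced in isolation (a sketch using the standard peft API; if this call succeeds, the failure must be somewhere downstream of the detection):

from peft import PeftConfig

# Sketch: run Unsloth's PEFT-detection step by hand against the adapter dir.
try:
    cfg = PeftConfig.from_pretrained("./sft_lora_models/unsloth_gemma-3-1b-it")
    print("PEFT adapter detected; base model:", cfg.base_model_name_or_path)
except Exception as e:
    print("PeftConfig.from_pretrained failed:", e)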

After I gave part of my code and the logs to GPT-5 and to gemini-2.5-pro at low temperature, both suspected the automatic adapter loading was at fault and rewrote the test script. With the revised version, the outputs now differ.

File: test_lora.py (revised version)

# -*- coding: utf-8 -*-
# Note: revised version -- apply the LoRA adapter explicitly via PeftModel

import torch
from unsloth import FastLanguageModel
from peft import PeftModel  # << 1. import PeftModel

# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048

q = "Tom has 8 apples. He eats 3. How many are left?\n\nLet's think step by step and output the final answer within \\boxed{}."

# =====================================================================
# Approach 1: base model (unchanged from the first version)
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)

with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)

print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")


# =====================================================================
# Approach 2: LoRA model (corrected loading logic)
# =====================================================================
# 2. Instead of reloading the model, apply the LoRA weights to the already-loaded 'base' model
print("Applying the LoRA adapter to the base model...")
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
# The base model was already set up with for_inference; no need to repeat it here

with torch.no_grad():
    # 3. Run inference with the new 'lora_model' object
    out_lora = lora_model.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)

# 4. Decode with the original 'tok' tokenizer
print("LORA ===>\n", tok.decode(out_lora[0], skip_special_tokens=True), "\n")

(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py 
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 01:12:48 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
   \\   /|    NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
 user
Tom has 8 apples. He eats 3. How many are left?

Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.

Final Answer: The final answer is $\boxed{5}$ 

Applying the LoRA adapter to the base model...
LORA ===>
 user
Tom has 8 apples. He eats 3. How many are left?

Let's think step by step and output the final answer within \boxed{}.
model
8 - 3 = <<8-3=5>>5
There are 5 apples left.
\boxed{5}

Summary:
Unsloth does offer automatic LoRA loading, but it can fail for reasons I have not yet pinned down. The recommendation is to load the adapter with the standard peft approach instead:

from peft import PeftModel
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
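
As a final check (a sketch, not from the original run), it is worth confirming that the attached adapter is non-trivial. PEFT initializes the lora_B matrices to zero, so after SFT they should contain non-zero entries:

# Sketch: confirm the adapter is attached and actually trained.
n_lora = sum("lora_" in n for n, _ in lora_model.named_parameters())
max_b = max(p.abs().max().item()
            for n, p in lora_model.named_parameters() if "lora_B" in n)
print(f"LoRA tensors: {n_lora}, max |lora_B| entry: {max_b:.4e}")

For deployment, the adapter can also be folded into the base weights with lora_model.merge_and_unload() (standard peft API), which returns a plain standalone model.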