Background:
Contents of the LoRA-fine-tuned gemma-3-1b directory:
ls sft_lora_models/unsloth_gemma-3-1b-it
adapter_config.json chat_template.jinja tokenizer_config.json
adapter_model.safetensors README.md tokenizer.json
added_tokens.json special_tokens_map.json tokenizer.model
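Before comparing outputs, it is worth checking what the adapter directory declares about itself. Below is a minimal sketch (my own check, assuming only the files listed above) that reads adapter_config.json and prints which base model the adapter was trained against and which modules it targets:

import json

LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"

with open(f"{LORA_DIR}/adapter_config.json") as f:
    cfg = json.load(f)

# Standard keys written by PEFT when the adapter was saved
print("base_model_name_or_path:", cfg.get("base_model_name_or_path"))
print("target_modules:", cfg.get("target_modules"))
print("r / lora_alpha:", cfg.get("r"), "/", cfg.get("lora_alpha"))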
File: test_lora.py (first version)
# -*- coding: utf-8 -*-
# Note: test the LoRA using the same loading approach as evaluate_gsm8k.py
import torch
from unsloth import FastLanguageModel
# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048
q = "A clown needs to buy 12 red noses. He already has 5. Each red nose costs $2. How much money does the clown need to buy the rest of the noses?\n\nLet's think step by step and output the final answer within \\boxed{}."
# =====================================================================
# Method 1: base model
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")
# =====================================================================
# Method 2: LoRA model (loaded directly from LORA_DIR, same as evaluate_gsm8k.py)
# =====================================================================
lora, tok2 = FastLanguageModel.from_pretrained(
    model_name=LORA_DIR,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(lora)
with torch.no_grad():
    out_lora = lora.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("LORA ===>\n", tok2.decode(out_lora[0], skip_special_tokens=True), "\n")
The two outputs are completely identical:
(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 00:59:59 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
LORA ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
Unsloth does try to detect whether the given path is a PEFT adapter, but that detection apparently failed here (I have not yet stepped through it in a debugger to confirm why):
class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        rope_scaling = None,
        fix_tokenizer = True,
        trust_remote_code = False,
        use_gradient_checkpointing = "unsloth",
        resize_model_vocab = None,
        revision = None,
        use_exact_model_name = False,
        fast_inference = False, # uses vLLM
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = True,
        *args, **kwargs,
    ):
        ...
        try:
            peft_config = PeftConfig.from_pretrained(
                model_name,
                token = token,
                revision = revision,
                trust_remote_code = trust_remote_code,
            )
            is_peft = True
        except Exception as error:
            peft_error = str(error)
            is_peft = False
        pass
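That detection can be reproduced in isolation. Here is a small sketch (my own check, not part of Unsloth) that calls PeftConfig.from_pretrained directly on the adapter directory, mirroring the try/except above; if this succeeds, the problem is more likely in how Unsloth handles the result than in the adapter files themselves:

from peft import PeftConfig

LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"

try:
    cfg = PeftConfig.from_pretrained(LORA_DIR)
    print("is_peft = True")
    print("peft_type:", cfg.peft_type)
    print("base_model_name_or_path:", cfg.base_model_name_or_path)
except Exception as error:
    print("is_peft = False:", error)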
After I shared part of the code and the logs with GPT-5 and with gemini-2.5-pro at low temperature, both suspected the automatic adapter loading was the problem and suggested a revised test_lora.py; with this version the two outputs now differ.
File: test_lora.py
# -*- coding: utf-8 -*-
# Note: revised version that applies the LoRA adapter explicitly with PeftModel instead of relying on auto-loading
import torch
from unsloth import FastLanguageModel
from peft import PeftModel  # << 1. import PeftModel
# 1. Path configuration
BASE = "unsloth/gemma-3-1b-it"
LORA_DIR = "./sft_lora_models/unsloth_gemma-3-1b-it"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
MAXLEN = 2048
q = "Tom has 8 apples. He eats 3. How many are left?\n\nLet's think step by step and output the final answer within \\boxed{}."
# =====================================================================
# Method 1: base model (unchanged from before)
# =====================================================================
base, tok = FastLanguageModel.from_pretrained(
    model_name=BASE,
    max_seq_length=MAXLEN,
    dtype=DTYPE,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": q}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out_base = base.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
print("BASE ===>\n", tok.decode(out_base[0], skip_special_tokens=True), "\n")
# =====================================================================
# Method 2: LoRA model (corrected loading logic)
# =====================================================================
# 2. Instead of reloading the model, apply the LoRA weights onto the already-loaded 'base' model
print("Applying the LoRA adapter to the base model...")
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
# The base model was already set up with for_inference; no need to repeat it here
with torch.no_grad():
    # 3. Run inference with the new 'lora_model' object
    out_lora = lora_model.generate(**inputs, max_new_tokens=128, top_p=1, do_sample=False)
# 4. Decode with the original 'tok' tokenizer
print("LORA ===>\n", tok.decode(out_lora[0], skip_special_tokens=True), "\n")
(tryLLM) root@DESKTOP-AFQV2GT:/proj/tryLLM# python test_lora.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 01:12:48 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.3. vLLM: 0.10.1.1.
\\ /| NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 47.988 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
BASE ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
Tom starts with 8 apples.
He eats 3 apples.
The number of apples left is 8 - 3 = 5.
So the answer is 5.
Final Answer: The final answer is $\boxed{5}$
Applying the LoRA adapter to the base model...
LORA ===>
user
Tom has 8 apples. He eats 3. How many are left?
Let's think step by step and output the final answer within \boxed{}.
model
8 - 3 = <<8-3=5>>5
There are 5 apples left.
\boxed{5}
Summary:
Although Unsloth can auto-load a LoRA adapter when given its directory, that mechanism can apparently fail for reasons I have not yet identified. The safer approach is to load the adapter explicitly with the standard PEFT API:
from peft import PeftModel
lora_model = PeftModel.from_pretrained(base, LORA_DIR)
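To confirm the adapter is really attached before running a full evaluation, a small sanity check can help (check_adapter is a hypothetical helper, not part of the original scripts): count the lora_ parameters PEFT injected and compare logits with the adapter enabled and disabled.

import torch

def check_adapter(lora_model, tok, text="Hello"):
    # Count parameters injected by PEFT; zero means no adapter was attached.
    n_lora = sum(p.numel() for n, p in lora_model.named_parameters() if "lora_" in n)
    print("LoRA parameters injected:", n_lora)

    device = next(lora_model.parameters()).device
    inputs = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        with lora_model.disable_adapter():  # PEFT context manager: adapter temporarily off
            base_logits = lora_model(**inputs).logits
        lora_logits = lora_model(**inputs).logits  # adapter on again
    # A non-zero difference confirms the LoRA weights change the forward pass.
    print("max |logit diff|:", (lora_logits - base_logits).abs().max().item())

check_adapter(lora_model, tok)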