Background

There seem to be plenty of write-ups about the prompts online, but very few about the output formats.
This post records those formats. Whatever you run the model with, the output format is the same.

The commands come from deepseek-ocr.

The test image (image.png) is the first page of the paper "End-to-End Test-Time Training for Long Context".
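The examples below use the ollama CLI, but the same prompts can be sent over ollama's HTTP API. A minimal Python sketch, assuming a default local install listening on 127.0.0.1:11434; note that over the API the image goes in the `images` field as base64, rather than as a path inside the prompt:

```python
import base64
import json
import urllib.request

# Read the test image and base64-encode it for the API.
with open("/path/to/image", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

# Same "Free OCR." prompt as the CLI examples below.
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps({
        "model": "deepseek-ocr",
        "prompt": "Free OCR.",
        "images": [img_b64],
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```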

Grounding the layout with coordinates

ollama run deepseek-ocr "/path/to/image\n<|grounding|>Given the layout of the image."
<|ref|>title<|/ref|><|det|>[[244, 140, 752, 162]]<|/det|>  

<|ref|>text<|/ref|><|det|>[[182, 187, 816, 258]]<|/det|>  

<|ref|>title<|/ref|><|det|>[[468, 294, 528, 308]]<|/det|>  

<|ref|>text<|/ref|><|det|>[[182, 315, 816, 480]]<|/det|>  

<|ref|>image<|/ref|><|det|>[[161, 507, 840, 716]]<|/det|>
<|ref|>image_caption<|/ref|><|det|>[[143, 735, 855, 829]]<|/det|>
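Each `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` pair is easy to pull apart with a regular expression. A minimal parsing sketch, assuming the coordinates are normalized to a 0-1000 range of the page (scale by the real image size to recover pixels) and that each `<|det|>` holds a single box, as in the output above:

```python
import re

# One grounding pair: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
GROUNDING = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<coords>[\d,\s]+)\]\]<\|/det\|>"
)

def parse_grounding(output: str, img_w: int, img_h: int):
    """Yield (label, pixel_bbox) pairs from a grounding-mode response."""
    for m in GROUNDING.finditer(output):
        x1, y1, x2, y2 = (int(v) for v in m.group("coords").split(","))
        # Assumption: det coordinates are 0-1000 fractions of the page size.
        yield m.group("label"), (
            x1 * img_w // 1000, y1 * img_h // 1000,
            x2 * img_w // 1000, y2 * img_h // 1000,
        )

# e.g. the first line above, on a hypothetical 1700x2200 px page:
sample = "<|ref|>title<|/ref|><|det|>[[244, 140, 752, 162]]<|/det|>"
print(list(parse_grounding(sample, 1700, 2200)))
```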

Free OCR

ollama run deepseek-ocr "/path/to/image\nFree OCR."

The output is almost pure plain text (with a little Markdown).

End-to-End Test-Time Training for Long Context

Arnuv Tandon\*1,3, Karan Dalal\*1,4, Xinhao Li\*5, Daniel Koceja\*3, Marcel Rød\*3, Sam Buchanan4, Xiaolong Wang5, Jure Leskovec3, Sanmi Koyejo3, Tatsunori 
Hashimoto3, Carlos Guestrin3, Jed McCaleb1, Yejin Choi2, Yu Sun\*2,3
1 Astera Institute 2 NVIDIA 3 Stanford University 4 UC Berkeley 5 UC San Diego

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard 
architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, 
compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. 
Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in 
contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method 
(TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar 
to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7× faster than full attention for 128K context. Our code is publicly 
available.

Figure 1. Scaling with context length, in terms of test loss (left) and latency (right). **Left:** Our method (TTT-E2E) turns the worst line (green) into the best 
(blue) at 128K context length. Loss $\Delta$ (↓), the $y$-value, is computed as (loss of the reported method) – (loss of Transformer with full attention), so loss 
$\Delta$ of full attention itself (orange) is the flat line at $y = 0$. While other methods produce worse loss $\Delta$ in longer context, TTT-E2E maintains the same 
advantage over full attention. All models have 3B parameters and are trained with 164B tokens. **Right:** Similar to SWA and the RNN baselines, TTT-E2E has constant 
inference latency regardless of context length, making it 2.7× faster than full attention for 128K context on an H100.

\* Core contributors. See statement of contributions before references.
Correspondence to: arnuv@stanford.edu, kdalal@berkeley.edu, yusun@cs.stanford.edu.

Locate images?

The output appears to be nearly identical to that of the first command; only a few coordinates differ by a pixel or two.

ollama run deepseek-ocr "/path/to/image\nParse the figure."
<|ref|>title<|/ref|><|det|>[[244, 140, 752, 161]]<|/det|>

<|ref|>text<|/ref|><|det|>[[182, 189, 816, 258]]<|/det|>

<|ref|>title<|/ref|><|det|>[[468, 295, 528, 307]]<|/det|>

<|ref|>text<|/ref|><|det|>[[184, 316, 814, 480]]<|/det|>

<|ref|>image<|/ref|><|det|>[[161, 507, 840, 716]]<|/det|>

<|ref|>image_caption<|/ref|><|det|>[[143, 735, 855, 829]]<|/det|>
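A practical use for these boxes is cropping the detected `image` region out of the page, for example with Pillow. A sketch reusing the hypothetical `parse_grounding` helper from above, where `raw_output` stands for the model's response text:

```python
from PIL import Image  # pip install pillow

page = Image.open("/path/to/image")
for label, box in parse_grounding(raw_output, *page.size):
    if label == "image":
        # box is (left, upper, right, lower) in pixels.
        page.crop(box).save("figure1.png")
```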

Extract plain text

ollama run deepseek-ocr "/path/to/image\nExtract the text in the image."

Images are ignored; only the text is output.

End-to-End Test-Time Training for Long Context

Arnuv Tandon\*1,3, Karan Dalal\*1,4, Xinhao Li\*5, Daniel Koceja\*3, Marcel Rød\*3, Sam Buchanan4, Xiaolong Wang5, Jure Leskovec3, Sanmi Koyejo3, Tatsunori 
Hashimoto3, Carlos Guestrin3, Jed McCaleb1, Yejin Choi2, Yu Sun\*2,3

1 Astera Institute 2 NVIDIA 3 Stanford University 4 UC Berkeley 5 UC San Diego

Abstract

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard 
architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, 
compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. 
Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in 
contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method 
(TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar 
to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7× faster than full attention for 128K context. Our code is publicly 
available.

Figure 1. Scaling with context length, in terms of test loss (left) and latency (right). **Left:** Our method (TTT-E2E) turns the worst line (green) into the best 
(blue) at 128K context length. Loss \(\Delta\) (\(\downarrow\)), the \(y\)-value, is computed as (loss of the reported method) – (loss of Transformer with full 
attention), so loss \(\Delta\) of full attention itself (orange) is the flat line at \(y = 0\). While other methods produce worse loss \(\Delta\) in longer context, 
TTT-E2E maintains the same advantage over full attention. All models have 3B parameters and are trained with 164B tokens. **Right:** Similar to SWA and the RNN 
baselines, TTT-E2E has constant inference latency regardless of context length, making it 2.7× faster than full attention for 128K context on an H100.

* Core contributors. See statement of contributions before references.

Correspondence to: arnuv@stanford.edu, kdalal@berkeley.edu, yusun@cs.stanford.edu.

Convert to Markdown

ollama run deepseek-ocr "/path/to/image\n<|grounding|>Convert the document to markdown."

This is the most complete output, including both the grounding tokens and the text each grounded region contains.

<|ref|>title<|/ref|><|det|>[[243, 138, 752, 161]]<|/det|>
# End-to-End Test-Time Training for Long Context  

<|ref|>text<|/ref|><|det|>[[181, 186, 816, 258]]<|/det|>
Arnuv Tandon \(^{*,1,3}\) , Karan Dalal \(^{*,1,4}\) , Xinhao Li \(^{*,5}\) , Daniel Koceja \(^{*,3}\) , Marcel Rod \(^{*,3}\) , Sam Buchanan \(^{4}\) , Xiaolong 
Wang \(^{5}\) , Jure Leskovec \(^{3}\) , Sanmi Koyejo \(^{3}\) , Tatsunori Hashimoto \(^{3}\) , Carlos Guestrin \(^{3}\) , Jed McCaleb \(^{1}\) , Yejin Choi \(^{2}\) 
, Yu Sun \(^{*,2,3}\) \(^{1}\) Astera Institute \(^{2}\) NVIDIA \(^{3}\) Stanford University \(^{4}\) UC Berkeley \(^{5}\) UC San Diego  

<|ref|>sub_title<|/ref|><|det|>[[468, 294, 527, 308]]<|/det|>
## Abstract  

<|ref|>text<|/ref|><|det|>[[183, 314, 814, 480]]<|/det|>
We formulate long- context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard 
architecture - a Transformer with sliding- window attention. However, our model continues learning at test time via next- token prediction on the given context, 
compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta- learning at training time. 
Overall, our method, a form of Test- Time Training (TTT), is End- to- End (E2E) both at test time (via next- token prediction) and training time (via meta- 
learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, 
our method (TTT- E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. 
However, similar to RNNs, TTT- E2E has constant inference latency regardless of context length, making it \(2.7 \times\) faster than full attention for 128K context. 
Our code is publicly available.  

<|ref|>image<|/ref|><|det|>[[159, 506, 840, 716]]<|/det|>
<|ref|>image_caption<|/ref|><|det|>[[143, 734, 855, 830]]<|/det|>
<center>Figure 1. Scaling with context length, in terms of test loss (left) and latency (right). Left: Our method (TTT-E2E) turns the worst line (green) into the 
best (blue) at 128K context length. Loss \(\Delta (\downarrow)\) , the \(y\) -value, is computed as (loss of the reported method) – (loss of Transformer with full 
attention), so loss \(\Delta\) of full attention itself (orange) is the flat line at \(y = 0\) . While other methods produce worse loss \(\Delta\) in longer context, 
TTT-E2E maintains the same advantage over full attention. All models have 3B parameters and are trained with 164B tokens. Right: Similar to SWA and the RNN 
baselines, TTT-E2E has constant inference latency regardless of context length, making it \(2.7 \times\) faster than full attention for 128K context on an H100. 
</center>
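When only the clean Markdown is wanted, the grounding tokens can be stripped back out with a single substitution. A sketch; note that `image` regions leave no placeholder behind, since no text body follows them:

```python
import re

def strip_grounding(output: str) -> str:
    """Drop every <|ref|>...<|/ref|><|det|>...<|/det|> pair, keep the Markdown."""
    return re.sub(
        r"<\|ref\|>.*?<\|/ref\|><\|det\|>.*?<\|/det\|>\s*", "", output
    ).strip()

# print(strip_grounding(raw_output))
```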