Skip to main content

Pix2Text

caution
  • table-ocr 可能会错误的识别出 table spanning cell
    • 导致一列少了内容
# 目前最高只支持 Python 3.12
# CnSTD>=1.2.1, CnOCR>=2.2.2.1, transformers>=4.37.0
# pip install pix2text[multilingual] # 多语言 - 除了 简体中文和英文以外
pip install pix2text -i https://mirrors.aliyun.com/pypi/simple

# for GPU
pip uninstall onnxruntime
pip install onnxruntime-gpu

# --file-type [pdf|page|text_formula|formula|text]
# -i, --img-file-or-dir TEXT
p2t predict \
-l en,ch_sim --disable-formula --enable-table \
--resized-shape 768 \
--file-type pdf \
-i docs/examples/test-doc.pdf \
-o output-md \
--save-debug-res output-debug

# 启动 HTTP 服务
p2t serve -l en,ch_sim -H 0.0.0.0 -p 8503

curl -X POST \
-F "file_type=page" \
-F "resized_shape=768" \
-F "embed_sep= $,$ " \
-F "isolated_sep=$$\n, \n$$" \
-F "image=@docs/examples/page2.png;type=image/jpeg" \
http://0.0.0.0:8503/pix2text

Notes

  • Pix2Text
    • #from_config(total_configs:{layout,text_formula,table},enable_table,enable_formula)
    • recognize
    • recognize_pdf
    • recognize_page
    • recognize_text
    • recognize_text_formula
  • TableOCR
    • recognize(out_cells,out_objects,out_html,out_csv,out_markdown)
    • AutoModelForObjectDetection.from_pretrained("$HOME/.pix2text/1.1/table-ocr")
  • DocYoloLayoutParser
    • layout 识别
  • LayoutLMv3LayoutParser
    • 之前的 layout 识别
  • 数据目录 - data_dir(), root - PIX2TEXT_HOME=$HOME/.pix2text
    • model 目录 data_dir()/MODEL_VERSION 例如 ~/.pix2text/1.1
    • layout-docyolo/
    • mfd-onnx/
    • mfr-onnx/
    • table-rec/
  • ~/.cnstd - breezedeus/CnSTD - 基于 RapidOCR 集成 PPOCRv4
    • CN STD - 中文文本检测
  • ~/.cnocr - breezedeus/CnOCR
    • CN OCR - 中文文本识别
  • PIX2TEXT_DOWNLOAD_SOURCE=HF
from torchvision import transforms

# TableOCR 的输入处理
structure_transform = transforms.Compose(
[
MaxResize(1000),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
]
)

table-rec

CN OCR

# pip install cnocr[ort-gpu]
pip install cnocr[ort-cpu] -i https://mirrors.aliyun.com/pypi/simple