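Lately I've been wanting to go after PDFs: extract every image in a PDF and, at the same time, get each image's position on the page.
I had found earlier that the PyMuPDF package is really handy, so I wrote this by following the official tutorial: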
from pathlib import Path
import fitz

pdf_file = fitz.Document(filename="/home/haoyu.love/pdf1.pdf")
for page in pdf_file:
    for page_image in page.getImageList(full=True):
        xref = page_image[0]  # check if this xref was handled already?
        bbox = page.getImageBbox(page_image)
        pix = fitz.Pixmap(pdf_file, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (page.number, xref))
        else:  # CMYK needs to be converted to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)  # make RGB pixmap copy
            pix1.writePNG("p%s-%s.png" % (page.number, xref))
The key point here is the full=True argument.
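However, I found that for some PDF files (academic papers in particular) I could never get the bbox. Stepping into the source code, I found a check in there: if the argument passed in is a tuple, its last item has to be 0, meaning "referenced directly by the page". For the PDFs that failed, though, the last item was always 13 (no idea what that means), so an error was raised.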
So I tried fooling that check:
from pathlib import Path
import fitz

pdf_file = fitz.Document(filename="/home/haoyu.love/pdf1.pdf")
for page in pdf_file:
    for page_image in page.getImageList(full=True):
        xref = page_image[0]  # check if this xref was handled already?
        bbox = page.getImageBbox((*page_image[:-1], 0))  # Here
        pix = fitz.Pixmap(pdf_file, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (page.number, xref))
        else:  # CMYK needs to be converted to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)  # make RGB pixmap copy
            pix1.writePNG("p%s-%s.png" % (page.number, xref))
Well, as it turns out, fooling the check is all it takes~
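The tuple trick can be pulled out into a tiny helper so the intent is easier to read. This is only a rough sketch: the name as_direct_reference and the sample tuple are made up for illustration and are not part of PyMuPDF, and it assumes the entry is a plain tuple whose last field is the one the check looks at.

```python
def as_direct_reference(image_item):
    """Return a copy of an image-list entry with its last field set to 0.

    `image_item` is assumed to be one tuple from page.getImageList(full=True).
    The helper name is hypothetical; it is not a PyMuPDF function.
    """
    # Keep every field except the last one, then append 0,
    # the value the check treats as "referenced directly by the page".
    return (*image_item[:-1], 0)


# Made-up entry for illustration only; real entries come from getImageList(full=True).
sample_item = (42, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode", 13)
patched_item = as_direct_reference(sample_item)
print(patched_item[-1])  # prints 0
```

In the loop above it would be used as page.getImageBbox(as_direct_reference(page_image)). The original tuple is left untouched, so xref = page_image[0] keeps working as before.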