OCR Layout Analysis -- PaddleOCR (Document Parsing and Extraction in Python)

么逗我 · Posted on 2024-9-4 00:26:45

1. Create a new conda environment

    # Run the following in a terminal to create an environment named paddle_env.
    # The Tsinghua mirror is used to speed up the download.
    conda create --name paddle_env python=3.8 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/  # this is a single command

2. Activate the new conda environment

    # Activate the paddle_env environment
    conda activate paddle_env
    # Check which python is in use
    whereis python

3. Install PaddlePaddle

1. CUDA9 or CUDA10

    python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple

2. CPU

    python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple

4. Install the PaddleOCR whl package

    pip install "paddleocr>=2.0.1"  # version 2.0.1+ is recommended

5. Usage in code

paddleocr uses the PP-OCRv4 models by default. To add a model you trained yourself, add its download link and fields to paddleocr and rebuild the package.

5.1 Full pipeline: detection + direction classifier + recognition

    from paddleocr import PaddleOCR, draw_ocr

    # PaddleOCR currently supports Chinese and English, English, French, German, Korean, and Japanese.
    # Switch languages with the lang parameter: `ch`, `en`, `french`, `german`, `korean`, `japan`.
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs/11.jpg'
    result = ocr.ocr(img_path, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

The result is a list. Each item contains the text box, the text, and the recognition confidence:

    [[[24.0, 36.0], [304.0, 34.0], [304.0, 72.0], [24.0, 74.0]], ['纯臻营养护发素', 0.964739]]
    [[[24.0, 80.0], [172.0, 80.0], [172.0, 104.0], [24.0, 104.0]], ['产品信息/参数', 0.98069626]]
    [[[24.0, 109.0], [333.0, 109.0], [333.0, 136.0], [24.0, 136.0]], ['（45元/每公斤，100公斤起订）', 0.9676722]]
    ......

The visualization is saved to result.jpg.

5.2 Detection + recognition

    from paddleocr import PaddleOCR, draw_ocr

    ocr = PaddleOCR()  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs/11.jpg'
    result = ocr.ocr(img_path, cls=False)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

The result is a list. Each item contains the text box, the text, and the recognition confidence:

    [[[24.0, 36.0], [304.0, 34.0], [304.0, 72.0], [24.0, 74.0]], ['纯臻营养护发素', 0.964739]]
    [[[24.0, 80.0], [172.0, 80.0], [172.0, 104.0], [24.0, 104.0]], ['产品信息/参数', 0.98069626]]
    [[[24.0, 109.0], [333.0, 109.0], [333.0, 136.0], [24.0, 136.0]], ['（45元/每公斤，100公斤起订）', 0.9676722]]
    ......

The visualization is saved to result.jpg.

5.3 Direction classifier + recognition

    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True)  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs_words/ch/word_1.jpg'
    result = ocr.ocr(img_path, det=False, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

The result is a list. Each item contains only the recognized text and the recognition confidence:

    ['韩国小馆', 0.9907421]

5.4 Detection only

    from paddleocr import PaddleOCR, draw_ocr

    ocr = PaddleOCR()  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs/11.jpg'
    result = ocr.ocr(img_path, rec=False)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    image = Image.open(img_path).convert('RGB')
    im_show = draw_ocr(image, result, txts=None, scores=None, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

The result is a list. Each item contains only the text box:

    [[26.0, 457.0], [137.0, 457.0], [137.0, 477.0], [26.0, 477.0]]
    [[25.0, 425.0], [372.0, 425.0], [372.0, 448.0], [25.0, 448.0]]
    [[128.0, 397.0], [273.0, 397.0], [273.0, 414.0], [128.0, 414.0]]
    ......

The visualization is saved to result.jpg.

5.5 Recognition only

    from paddleocr import PaddleOCR

    ocr = PaddleOCR()  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs_words/ch/word_1.jpg'
    result = ocr.ocr(img_path, det=False)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

The result is a list. Each item contains only the recognized text and the recognition confidence:

    ['韩国小馆', 0.9907421]

5.6 Direction classifier only

    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True)  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs_words/ch/word_1.jpg'
    result = ocr.ocr(img_path, det=False, rec=False, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

The result is a list. Each item contains only the classification result and the classification confidence:

    ['0', 0.9999924]

6. Custom models

When the built-in models do not meet your needs, you can use models you trained yourself. First, follow the model export guide to convert the detection, classification, and recognition models into inference models, then use them as shown below.

6.1 Usage in code

    from paddleocr import PaddleOCR, draw_ocr

    # Each model directory must contain the model and params files
    ocr = PaddleOCR(det_model_dir='{your_det_model_dir}',
                    rec_model_dir='{your_rec_model_dir}',
                    rec_char_dict_path='{your_rec_char_dict_path}',
                    cls_model_dir='{your_cls_model_dir}',
                    use_angle_cls=True)
    img_path = 'PaddleOCR/doc/imgs/11.jpg'
    result = ocr.ocr(img_path, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

6.2 Usage from the command line

    paddleocr --image_dir PaddleOCR/doc/imgs/11.jpg --det_model_dir {your_det_model_dir} --rec_model_dir {your_rec_model_dir} --rec_char_dict_path {your_rec_char_dict_path} --cls_model_dir {your_cls_model_dir} --use_angle_cls true

7. Using a network image or a numpy array as input

7.1 Network image

7.1.1 Usage in code

    from paddleocr import PaddleOCR, draw_ocr, download_with_progressbar

    # PaddleOCR currently supports Chinese and English, English, French, German, Korean, and Japanese.
    # Switch languages with the lang parameter: `ch`, `en`, `french`, `german`, `korean`, `japan`.
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # need to run only once to download and load model into memory
    img_path = 'http://n.sinaimg.cn/ent/transform/w630h933/20171222/o111-fypvuqf1838418.jpg'
    result = ocr.ocr(img_path, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    download_with_progressbar(img_path, 'tmp.jpg')
    image = Image.open('tmp.jpg').convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

7.1.2 Command-line mode

    paddleocr --image_dir http://n.sinaimg.cn/ent/transform/w630h933/20171222/o111-fypvuqf1838418.jpg --use_angle_cls=true

7.2 Numpy array

    import cv2
    from paddleocr import PaddleOCR, draw_ocr

    # PaddleOCR currently supports Chinese and English, English, French, German, Korean, and Japanese.
    # Switch languages with the lang parameter: `ch`, `en`, `french`, `german`, `korean`, `japan`.
    ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # need to run only once to download and load model into memory
    img_path = 'PaddleOCR/doc/imgs/11.jpg'
    img = cv2.imread(img_path)
    # img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # uncomment if your own trained model supports grayscale images
    result = ocr.ocr(img, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    from PIL import Image
    result = result[0]
    image = Image.open(img_path).convert('RGB')
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result.jpg')

8. PDF file as input

8.1 Command-line mode

Use the page_num parameter to run inference on only the first few pages. The default is 0, which means all pages are processed.

    paddleocr --image_dir ./xxx.pdf --use_angle_cls true --use_gpu false --page_num 2

8.2 Usage in code

    from paddleocr import PaddleOCR, draw_ocr

    # The languages PaddleOCR supports can be switched with the lang parameter,
    # for example `ch`, `en`, `fr`, `german`, `korean`, `japan`.
    ocr = PaddleOCR(use_angle_cls=True, lang="ch", page_num=2)  # need to run only once to download and load model into memory
    img_path = './xxx.pdf'
    result = ocr.ocr(img_path, cls=True)
    for idx in range(len(result)):
        res = result[idx]
        for line in res:
            print(line)

    # Display the result
    import fitz
    from PIL import Image
    import cv2
    import numpy as np

    imgs = []
    with fitz.open(img_path) as pdf:
        # Older PyMuPDF releases spell these pageCount and getPixmap
        for pg in range(0, pdf.page_count):
            page = pdf[pg]
            mat = fitz.Matrix(2, 2)
            pm = page.get_pixmap(matrix=mat, alpha=False)
            # if width or height > 2000 pixels, don't enlarge the image
            if pm.width > 2000 or pm.height > 2000:
                pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
            img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
            img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            imgs.append(img)
    for idx in range(len(result)):
        res = result[idx]
        image = imgs[idx]
        boxes = [line[0] for line in res]
        txts = [line[1][0] for line in res]
        scores = [line[1][1] for line in res]
        im_show = draw_ocr(image, boxes, txts, scores, font_path='doc/fonts/simfang.ttf')
        im_show = Image.fromarray(im_show)
        im_show.save('result_page_{}.jpg'.format(idx))

Parameter description references:

- Layout Analysis -- the Amazing OCR Tool PaddleOCR
- PaddlePaddle/PaddleOCR
- PaddleOCR environment setup
- PaddleOCR quick start
- Layout Analysis -- Notes on Open-Source OCR Projects (backup)
- paddleocr package usage guide
- CAPTCHA recognition with OCR
- Image search based on image similarity
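The result format described in section 5.1 (each item holds a text box, the text, and a confidence score) is easier to work with once it is flattened into named records. Below is a minimal sketch of that idea in plain Python. `OcrLine` and `parse_page` are hypothetical helper names, not part of paddleocr, and the sketch assumes each item has the shape `[box, [text, score]]` shown in the sample output. It also treats a page value of `None` as "no detections", which is a defensive assumption.

```python
from typing import List, NamedTuple, Optional, Sequence


class OcrLine(NamedTuple):
    """One recognized text line (hypothetical helper type, not part of paddleocr)."""
    text: str
    score: float
    box: List[List[float]]


def parse_page(page: Optional[Sequence], min_score: float = 0.0) -> List[OcrLine]:
    """Flatten one page of OCR output into OcrLine records.

    Assumes each item looks like [box, [text, score]], as in the sample
    output of section 5.1. A page of None is treated as empty.
    """
    lines = []
    for box, (text, score) in page or []:
        # Drop lines whose recognition confidence is below the threshold.
        if score >= min_score:
            corners = [[float(x), float(y)] for x, y in box]
            lines.append(OcrLine(text, float(score), corners))
    # Sort top-to-bottom, then left-to-right, using the first (top-left) corner.
    lines.sort(key=lambda item: (item.box[0][1], item.box[0][0]))
    return lines


# Sample data copied from the output shown in section 5.1 (deliberately shuffled).
sample_page = [
    [[[24.0, 80.0], [172.0, 80.0], [172.0, 104.0], [24.0, 104.0]], ['产品信息/参数', 0.98069626]],
    [[[24.0, 36.0], [304.0, 34.0], [304.0, 72.0], [24.0, 74.0]], ['纯臻营养护发素', 0.964739]],
    [[[24.0, 109.0], [333.0, 109.0], [333.0, 136.0], [24.0, 136.0]], ['（45元/每公斤，100公斤起订）', 0.9676722]],
]

parsed = parse_page(sample_page, min_score=0.965)
for line in parsed:
    print(f"{line.score:.3f}  {line.text}")
```

With a real run, the same helper could be applied to each page as `parse_page(result[idx])`. The threshold 0.965 is only chosen here so that one of the three sample lines is filtered out.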
