Parsing Automatic Numbering in Word Documents with Python

Posted on 2024-9-9 20:25:11
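For background on automatic numbering, see Microsoft's article "Working with Numbered Lists in Open XML WordprocessingML" (《在OpenXMLWordprocessingML中使用编号列表》): https://learn.microsoft.com/zh-cn/previous-versions/office/ee922775(v=office.14)

The python-docx library cannot resolve a Word document's automatic numbering directly, because the mechanism is fairly involved. Still, we want Python to read the text that each automatic number renders to.

Basic parsing principle

To test this, we create a document with numbered paragraphs:

[Screenshot of the test document]

First, look at how a numbered paragraph is stored in the main document XML:

    from docx import Document

    doc = Document(r"编号测试1.docx")
    for paragraph in doc.paragraphs:
        # Dump the raw XML of the first paragraph only
        print(paragraph._element.xml)
        break

Result:

[XML output not preserved; the paragraph's text is 第一章]

The Microsoft article explains the key parts: the w:numPr element holds the numbering properties. Its w:ilvl child gives the numbering level, counted from zero, and its w:numId child is an index into the numbering part. A w:numId of 0 means the numbering has been removed and the paragraph is not a list item. So a paragraph carries automatic numbering if it has a w:numPr whose w:numId is not 0.

Next we need the numbering definition behind each w:numId. This is stored in \word\numbering.xml inside the zip package; see the example in the Microsoft article:

[Example numbering.xml from the Microsoft article]

w:numbering contains two kinds of nodes, w:num and w:abstractNum. Each w:num records the abstractNumId that a numId points to, and each w:abstractNum records the numbering format for that abstractNumId, including the style of every level.

python-docx already parses w:num, so it can be read directly. It does not parse w:abstractNum, so we have to parse that XML ourselves.

The following code maps each numId to its abstractNumId:

    from docx import Document

    doc = Document(r"编号测试1.docx")
    numbering_part = doc.part.numbering_part._element
    # numId -> abstractNumId, using the w:num nodes python-docx has already parsed
    numId2abstractId = {num.numId: num.abstractNumId.val for num in numbering_part.num_lst}

Next we parse the w:abstractNum nodes. The python-docx source shows that it uses lxml's etree for XML parsing. A first version of the parsing code:

    from docx.oxml.ns import qn

    abstractNumId2style = {}
    for abstractNumIdTag in numbering_part.findall(qn("w:abstractNum")):
        abstractNumId = abstractNumIdTag.get(qn("w:abstractNumId"))
        for lvlTag in abstractNumIdTag.findall(qn("w:lvl")):
            ilvl = lvlTag.get(qn("w:ilvl"))
            # Collect every direct child that has a w:val attribute, keyed by local tag name
            style = {tag.tag[tag.tag.rfind("}") + 1:]: tag.get(qn("w:val"))
                     for tag in lvlTag.xpath("./*[@w:val]", namespaces=numbering_part.nsmap)}
            abstractNumId2style[(int(abstractNumId), int(ilvl))] = style
    print(abstractNumId2style)

Note: the qn function in docx.oxml.ns expands the w: prefix into the full namespace name, but it cannot process an XPath expression. For XPath, pass the namespace map through the namespaces argument instead.

As an alternative, you can strip all namespaces from a node before parsing it:

    def remove_namespace(node):
        # Strip the "{namespace}" prefix from the tag itself
        node_tag = node.tag
        if '}' in node_tag:
            node.tag = node_tag[node_tag.rfind("}") + 1:]
        # Strip it from every attribute name as well
        for attr_key in list(node.attrib):
            if '}' in attr_key:
                new_attr_key = attr_key[attr_key.rfind("}") + 1:]
                node.attrib[new_attr_key] = node.attrib.pop(attr_key)
        for child in node:
            remove_namespace(child)
        return node

This removes the namespaces of the target node and all its descendants recursively.

The parsing code prints the numbering attributes for every level of every abstract definition:

    {(0, 0): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'},
     (0, 1): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'},
     (0, 2): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'},
     (0, 3): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.', 'lvlJc': 'left'},
     (0, 4): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.', 'lvlJc': 'left'},
     (0, 5): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.', 'lvlJc': 'left'},
     (0, 6): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.', 'lvlJc': 'left'},
     (0, 7): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.%8.', 'lvlJc': 'left'},
     (0, 8): {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.%4.%5.%6.%7.%8.%9.', 'lvlJc': 'left'}}

So far we have only tested the most basic numeric numbering. For some numbering types the level node has no direct w:numFmt child, and the parsing code needs targeted adjustments.

The Microsoft article mentions that when one level of a multilevel list is customized, a w:lvlOverride node appears inside w:num. In my repeated tests with WPS it never appeared. This may be specific to how WPS writes files: Word can emit w:lvlOverride (for example with w:startOverride when numbering is restarted). We also do not deliberately customize a single level of a multilevel list, so this case is not handled here.

We also need the list suffix controlled by the w:suff element, that is, the whitespace between the list number and the paragraph text. It can be a tab, a space, or nothing. The handling code is:

    {"space": " ", "nothing": ""}.get(style.get("suff"), "\t")

Handling multilevel numbering

First, read the numbering style that applies to each paragraph:

    for paragraph in doc.paragraphs:
        # pPr is None for paragraphs without any paragraph properties
        pPr = paragraph._element.pPr
        numpr = pPr.numPr if pPr is not None else None
        if numpr is not None and numpr.numId.val != 0:
            numId = numpr.numId.val
            ilvl = numpr.ilvl.val
            abstractId = numId2abstractId[numId]
            style = abstractNumId2style[(abstractId, ilvl)]
            print(style)
        print(paragraph.text)

Result:

    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
    第一章
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'}
    第一节
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.', 'lvlJc': 'left'}
    第二节
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'}
    第一条
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.%2.%3.', 'lvlJc': 'left'}
    第二条
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
    第二章
    {'start': '1', 'numFmt': 'decimal', 'lvlText': '%1.', 'lvlJc': 'left'}
    第三章

We need a counter that records how many times each style has occurred, so that the matching number can be generated:

    cache = {}
    for paragraph in doc.paragraphs:
        # pPr is None for paragraphs without any paragraph properties
        pPr = paragraph._element.pPr
        numpr = pPr.numPr if pPr is not None else None
        lvlText = ""
        if numpr is not None and numpr.numId.val != 0:
            numId = numpr.numId.val
            ilvl = numpr.ilvl.val
            abstractId = numId2abstractId[numId]
            style = abstractNumId2style[(abstractId, ilvl)]
            # Advance this level's counter, or initialize it from w:start
            if (abstractId, ilvl) in cache:
                cache[(abstractId, ilvl)] += 1
            else:
                cache[(abstractId, ilvl)] = int(style["start"])
            lvlText = style.get("lvlText")
            # Substitute %1, %2, ... with the current counters of levels 0..ilvl
            for i in range(0, ilvl + 1):
                lvlText = lvlText.replace(f'%{i + 1}', str(cache[(abstractId, i)]))
            suff_text = {"space": " ", "nothing": ""}.get(style.get("suff"), "\t")
            lvlText += suff_text
        print(lvlText + paragraph.text)

Result:

    1. 第一章
    1.1. 第一节
    1.2. 第二节
    1.2.1. 第一条
    1.2.2. 第二条
    2. 第二章
    3. 第三章

Note that this simple counter never resets the deeper levels when a higher level advances; the test document does not exercise that case.

Generating other types of numbering

To support as many numbering types as possible, I created the following test file:

[Screenshot of the test file with many numbering types]

There is no need to produce the actual circled digits; for circled numbering we just take the corresponding integer. Apart from three Japanese numbering types, the example covers almost every numbering type.

Number formats padded to three or more digits need attention, because their XML is somewhat special, for example:

[XML example not preserved]

Based on this, the format-parsing code is adjusted as follows:

    abstractNumId2style = {}
    for abstractNumIdTag in numbering_part.findall(qn("w:abstractNum")):
        abstractNumId = abstractNumIdTag.get(qn("w:abstractNumId"))
        for lvlTag in abstractNumIdTag.findall(qn("w:lvl")):
            ilvl = lvlTag.get(qn("w:ilvl"))
            style = {tag.tag[tag.tag.rfind("}") + 1:]: tag.get(qn("w:val"))
                     for tag in lvlTag.xpath("./*[@w:val]", namespaces=numbering_part.nsmap)}
            if "numFmt" not in style:
                # No direct w:numFmt: look inside mc:AlternateContent instead
                numFmtVal = lvlTag.xpath("./mc:AlternateContent/mc:Fallback/w:numFmt/@w:val",
                                         namespaces=numbering_part.nsmap)
                if numFmtVal and numFmtVal[0] == "decimal":
                    numFmt_format = lvlTag.xpath("./mc:AlternateContent/mc:Choice/w:numFmt/@w:format",
                                                 namespaces=numbering_part.nsmap)
                    if numFmt_format:
                        # Name the format "decimal" plus its first pattern entry, e.g. decimal001
                        style["numFmt"] = "decimal" + numFmt_format[0].split(",")[0]
            if style.get("numFmt") == "decimalZero":
                style["numFmt"] = "decimal01"
            abstractNumId2style[(int(abstractNumId), int(ilvl))] = style

So far I have only found this decimal-based custom format, so only it is handled; any other custom type is treated as having no automatic numbering. Also, since the three-digit integer format is now named decimal001, the two-digit decimalZero is renamed to decimal01 for consistency.

The numFmt values found in this test file are:

    bullet, cardinalText, chineseCounting, chineseLegalSimplified, decimal, decimalEnclosedCircleChinese, ideographTraditional, ideographZodiac, lowerLetter, lowerRoman, ordinal, ordinalText, upperLetter, upperRoman

Next, we write dedicated functions for the conversions that are likely to be more involved.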
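By contrast, the decimal family needs no dedicated conversion: with the naming used in this post (decimal, decimal01, decimal001), the number of "0" characters in the name gives the padding width. A minimal sketch under that assumption; the helper name format_decimal is hypothetical and not part of python-docx:

```python
def format_decimal(pos: int, num_fmt: str) -> str:
    """Zero-pad pos according to a 'decimal', 'decimal01' or 'decimal001' style name.

    Assumes the naming convention of this post: each '0' in the name adds one
    digit of padding. Names such as 'decimalEnclosedCircleChinese' contain no
    '0' and therefore fall back to the plain integer.
    """
    if not num_fmt.startswith("decimal"):
        raise ValueError(f"not a decimal-based format: {num_fmt!r}")
    width = num_fmt.count("0") + 1
    return str(pos).zfill(width)


print(format_decimal(7, "decimal"))
print(format_decimal(7, "decimal01"))
print(format_decimal(7, "decimal001"))
```

This is also why decimalZero has to be renamed first: the literal name contains no "0" character, so it would not be padded.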
Converting a positive integer to uppercase letters

The code is:

    def int2upperLetter(num):
        # Bijective base-26: 1 -> A, 26 -> Z, 27 -> AA
        result = []
        while num > 0:
            num -= 1
            remainder = num % 26
            result.append(chr(remainder + ord('A')))
            num //= 26
        return "".join(reversed(result))

Converting a positive integer to Roman numerals

    def int2upperRoman(num):
        t = [
            (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
            (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
            (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')
        ]
        roman_num = ''
        i = 0
        while num > 0:
            val, syb = t[i]
            # Emit this symbol as many times as it fits, then move to the next one
            for _ in range(num // val):
                roman_num += syb
                num -= val
            i += 1
        return roman_num

Converting a positive integer to English cardinal words

Parts of the following listings did not survive in the post; the gaps are marked [...].

    def int2cardinalText(num):
        if not isinstance(num, int) or num [...] 999999999999:
            raise ValueError("Invalid number: must be a positive integer within four digits")
        base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
                "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
                "Seventeen", "Eighteen", "Nineteen"]
        tens = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        thousands = ["", "Thousand", "Million", "Billion"]

        def two_digits(n):
            if n [...]

        # [...] the rest of two_digits and the start of three_digits are missing

            [...] 0:
                result += two_digits(rest)
            return result.strip()

        if num [...]

        # [...] missing up to the loop that splits num into three-digit chunks

            [...] 0:
            num, remainder = divmod(num, 1000)
            chunks.append(remainder)
        words = []
        # Walk the chunks from most to least significant
        for i in range(len(chunks) - 1, -1, -1):
            if chunks[i] == 0:
                continue
            chunk_word = three_digits(chunks[i])
            if thousands[i]:
                chunk_word += f" {thousands[i]}"
            words.append(chunk_word)
        words = " ".join(words).lower()
        return words[0].upper() + words[1:]

Converting a positive integer to English ordinal words

    def int2ordinalText(num):
        if not isinstance(num, int) or num [...] 999999:
            raise ValueError("Invalid number: must be a positive integer within four digits")
        base = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
                "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
                "Seventeen", "Eighteen", "Nineteen"]
        baseth = ['Zeroth', 'First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
                  'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth', 'Fourteenth',
                  'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth', 'Nineteenth', 'Twentieth']
        tens = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]
        tensth = ["", "", "Twentieth", "Thirtieth", "Fortieth", "Fiftieth", "Sixtieth",
                  "Seventieth", "Eightieth", "Ninetieth"]

        def two_digits(n):
            if n [...]

        # [...] the rest of two_digits and the split into thousands are missing

        [...] 0:
            if num == 0:
                return f"{int2cardinalText(thousand)} thousandth"
            result.append(f"{int2cardinalText(thousand)} thousand")
        hundred, num = divmod(num, 100)
        if hundred > 0:
            if num == 0:
                result.append(f"{base[hundred]} hundredth")
                return " ".join(result)
            result.append(f"{base[hundred]} hundred")
        result.append(two_digits(num))
        result = " ".join(result).lower()
        return result[0].upper() + result[1:]

This reuses the cardinal conversion defined above.

Converting a positive integer to Chinese numerals

    import re

    def int2Chinese(num, ch_num, units):
        if not (0 [...]

The rest of int2Chinese and whatever followed it are missing. The post resumes inside a class, referred to in its own code as WithNumberDocxReader, which wraps the conversions above as static methods and keeps the per-level counters in self.cnt:

        # [...] the start of the class is missing; the listing resumes inside int2upperLetter
            [...] > 0:
                num -= 1
                remainder = num % 26
                result.append(chr(remainder + ord('A')))
                num //= 26
            return "".join(reversed(result))

        @staticmethod
        def int2upperRoman(num):
            # Same body as the standalone int2upperRoman above

        @staticmethod
        def int2cardinalText(num):
            if not isinstance(num, int) or num [...] 999999999:
                raise ValueError("Invalid number: must be a positive integer within four digits")
            # Remainder as in the standalone int2cardinalText above, with the same gaps

        @staticmethod
        def int2ordinalText(num):
            # As the standalone int2ordinalText above, with the same gaps, except that
            # it calls WithNumberDocxReader.int2cardinalText(thousand)

        @staticmethod
        def int2Chinese(num, ch_num, units):
            if not (0 [...]

        # [...] the rest of int2Chinese and the start of the numbering method are missing

                [...] > ilvl:
                    del self.cnt[(a, b, c)]
            pos_key = (numId, ilvl, isTxbxContent)
            # Advance this position's counter, or initialize it from w:start
            if pos_key in self.cnt:
                self.cnt[pos_key] += 1
            else:
                self.cnt[pos_key] = int(style["start"])
            pos = self.cnt[pos_key]
            num_text = str(pos)
            if numFmt.startswith('decimal'):
                # decimal01 / decimal001: one digit of padding per "0" in the name
                num_text = num_text.zfill(numFmt.count("0") + 1)
            elif numFmt == 'upperRoman':
                num_text = self.int2upperRoman(pos)
            elif numFmt == 'lowerRoman':
                num_text = self.int2upperRoman(pos).lower()
            elif numFmt == 'upperLetter':
                num_text = self.int2upperLetter(pos)
            elif numFmt == 'lowerLetter':
                num_text = self.int2upperLetter(pos).lower()
            elif numFmt == 'ordinal':
                num_text = f"{pos}{'th' if 11 [...]
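The listing breaks off inside the 'ordinal' branch. Independently of how the original continued, English ordinal suffixes follow a well-known rule: numbers ending in 11, 12 or 13 take "th", and otherwise the last digit decides. A minimal sketch of that rule; the helper names ordinal_suffix and int2ordinal are hypothetical and are not taken from the post:

```python
def ordinal_suffix(pos: int) -> str:
    """Return the English ordinal suffix ('st', 'nd', 'rd' or 'th') for pos."""
    # 11, 12 and 13 (also 111, 212, ...) are irregular and always take 'th'
    if 11 <= pos % 100 <= 13:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(pos % 10, "th")


def int2ordinal(pos: int) -> str:
    """Render pos in the style of Word's 'ordinal' numFmt, e.g. 1st, 2nd, 3rd."""
    return f"{pos}{ordinal_suffix(pos)}"


for n in (1, 2, 3, 4, 11, 12, 13, 21, 22, 101, 111):
    print(int2ordinal(n))
```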
心飞设计 - Copyright: 微度网络信息技术服务中心 ( 鲁ICP备17032091号-12 )
