Automating Word Document Processing in Python with python-docx

Developer (https://www.devze.com), 2025-05-26 09:19. Source: internet. Author: 东方佑
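Contents
  • I. Introduction
  • II. Core Modules
    • 1. Copying paragraph styles and images
    • 2. Converting HTML tables to Word tables
    • 3. Template generation and dynamic styles
  • III. Complete Example Code
    • Example 1: Copying paragraph styles and images
    • Example 2: HTML table to Word
  • IV. Key Implementation Details
  • V. Application Scenarios
  • VI. Summary
  • VII. Further Notes
    • Generating a document from template styles
    • Separating template styles from text
    • Generating an editable template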

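        I. Introduction

        As demand for office automation grows, Python's python-docx library makes it possible to manipulate Word documents in depth. This article shows how to copy paragraph styles, convert HTML tables into Word tables, and dynamically generate customizable templates in code.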

        II. Core Modules

        1. Copying paragraph styles and images

        import io

        def copy_inline_shapes(new_doc, img):
            """Insert previously extracted inline images into a new paragraph."""
            new_para = new_doc.add_paragraph()
            for image_bytes, w, h in img:
                # Add each image to the new paragraph at the given width and height
                new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)
        

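        What it does: takes images extracted from the old document and copies them into the new one, with a configurable width and height.

        When to use it: documents that mix text and images and need to keep their original formatting.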
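The width and height handed to add_picture are lengths, which python-docx stores as integers in English Metric Units (EMU, 914,400 per inch). As a minimal sketch of choosing a custom size, the hypothetical helper fit_width below (not part of the original code) scales an image down to a maximum width while keeping its aspect ratio:

```python
EMU_PER_INCH = 914400  # python-docx lengths are integers in English Metric Units


def fit_width(width_emu, height_emu, max_width_inches):
    """Scale (width, height) down to max_width_inches, keeping the aspect ratio.

    Hypothetical helper for illustration; sizes that already fit are returned unchanged.
    """
    max_width_emu = int(max_width_inches * EMU_PER_INCH)
    if width_emu <= max_width_emu:
        return width_emu, height_emu
    scale = max_width_emu / width_emu
    return max_width_emu, round(height_emu * scale)


# A 10 x 5 inch image limited to 6 inches wide becomes 6 x 3 inches.
w, h = fit_width(10 * EMU_PER_INCH, 5 * EMU_PER_INCH, 6)
print(w / EMU_PER_INCH, h / EMU_PER_INCH)
```

The resulting integers could then be passed as width=w, height=h when inserting the picture.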

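        2. Converting HTML tables to Word tables

        def html_table_to_docx(doc, html_content):
            """Convert the tables in an HTML string into Word tables, handling merged cells."""
            ...  # the full implementation appears later in this article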
        

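        What it does: converts the parsed HTML table structure into a table in the Word document, supporting horizontal and vertical merges.

        Key points:

        • Parse the HTML with BeautifulSoup
        • Handle cell styles, borders, and background colors
        • Support style inheritance for multi-level headings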
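To make the merged-cell handling concrete, here is a small library-free sketch of how colspan and rowspan values can be mapped onto a rectangular grid. The function name place_cells and the tuple layout are assumptions made for illustration; the bookkeeping mirrors the used_cells matrix in the full html_table_to_docx listing.

```python
def place_cells(rows):
    """Assign grid positions to cells that may span several rows or columns.

    rows is a list of rows; each cell is a (text, colspan, rowspan) tuple.
    Returns (n_rows, n_cols, placements), where each placement is
    (text, top_row, left_col, bottom_row, right_col).
    """
    n_rows = len(rows)
    # Estimate the column count from the widest row, counting colspans
    n_cols = max((sum(colspan for _, colspan, _ in row) for row in rows), default=0)
    used = [[False] * n_cols for _ in range(n_rows)]
    placements = []
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip grid slots already taken by a cell spanning down from above
            while c < n_cols and used[r][c]:
                c += 1
            if c >= n_cols:
                break
            bottom = min(r + rowspan - 1, n_rows - 1)
            right = min(c + colspan - 1, n_cols - 1)
            for rr in range(r, bottom + 1):
                for cc in range(c, right + 1):
                    used[rr][cc] = True
            placements.append((text, r, c, bottom, right))
            c = right + 1
    return n_rows, n_cols, placements


rows = [
    [("A", 1, 2), ("B", 2, 1)],   # A spans two rows, B spans two columns
    [("C", 1, 1), ("D", 1, 1)],
]
for placement in place_cells(rows)[2]:
    print(placement)
```

Each placement's top-left and bottom-right corners correspond to the two cells that would be handed to a merge call when building the Word table.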

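        3. Template generation and dynamic styles

        def generate_template():
            doc = Document()
            for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
                for bold_flag in [True, False]:
                    # Create a paragraph for each alignment/bold combination
                    para = doc.add_paragraph()
                    para.add_run("Sample text").bold = bold_flag
                    para.alignment = align
            return doc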
        

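        What it does: dynamically generates a template document covering several alignments (left, right, center, and unset).

        Advantage: new styles can be added quickly to suit different scenarios.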

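        III. Complete Example Code

        Example 1: Copying paragraph styles and images

        def clone_document(old_s, old_p, old_ws, new_doc_path):
            new_doc = Document()
            # Image sets are consumed in document order, one per "Image_None" marker
            image_sets = iter([i["image"] for i in old_s if "image" in i])
            for para in old_p:
                for k, v in para.items():
                    if k == "Image_None":
                        copy_inline_shapes(new_doc, next(image_sets))
                    elif k == "table":
                        # v holds the table as an HTML string
                        html_table_to_docx(new_doc, v)
                    else:
                        # k is the run text, v the style key it maps to
                        style = [i for i in old_s if "style" in i and v in i["style"]]
                        style_ws = [i for i in old_ws if "style" in i and v in i["style"]]
                        clone_paragraph(style[0], k, new_doc, style_ws[0])
            new_doc.save(new_doc_path)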
        

        Example 2: HTML table to Word

        def html_table_to_docx(doc, html_content):
            soup = BeautifulSoup(html_content, 'html.parser')
            tables = soup.find_all('table')
            for table in tables:
                # Handle merged cells and style conversion (see the full listing later in this article)
                ...
        

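        IV. Key Implementation Details

        1. Style-copying strategy

        Inheritance mechanism: font, alignment, and other attributes are passed along through the run_style and style fields.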
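As a rough illustration of how those fields are looked up, the sketch below rebuilds the lookup key used in the listings (style name, an underscore, then the first word of the alignment's string form, which python-docx typically renders like CENTER (1)). The helper names style_key and find_style are assumptions for this example; plain strings stand in for python-docx style objects.

```python
def style_key(style_name, alignment):
    """Build the key that ties a piece of text to its style record."""
    return f"{style_name}_{str(alignment).split()[0]}"


def find_style(records, key):
    """Return the first style record whose "style" mapping contains key, else None."""
    for record in records:
        if key in record.get("style", {}):
            return record
    return None


# Placeholder records; in the real pipeline the values are python-docx objects.
records = [
    {"run_style": [], "style": {"Normal_None": "normal-style"}, "alignment": {"Normal_None": None}},
    {"run_style": [], "style": {"Heading 1_CENTER": "h1-style"}, "alignment": {"Heading 1_CENTER": "CENTER (1)"}},
]
key = style_key("Heading 1", "CENTER (1)")
print(key, find_style(records, key)["style"][key])
```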

        Page-break handling: is_page_break checks whether a page break follows a paragraph or table.
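In WordprocessingML a manual page break is a w:br element with w:type="page", and it normally sits inside a run (w:r) rather than directly under the paragraph, so a check that searches all descendants may be more robust than one that inspects only direct children. The sketch below uses only the standard library on a hand-written XML fragment; has_page_break is a hypothetical name, and other mechanisms such as w:pageBreakBefore or section breaks are deliberately ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_page_break(paragraph_xml):
    """Return True if the w:p element contains a w:br with w:type="page" at any depth."""
    paragraph = ET.fromstring(paragraph_xml)
    return any(
        br.get(f"{{{W}}}type") == "page"
        for br in paragraph.iter(f"{{{W}}}br")
    )


sample = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>End of page one</w:t></w:r>'
    '<w:r><w:br w:type="page"/></w:r>'
    '</w:p>'
)
print(has_page_break(sample))
```

With python-docx, the same kind of search could presumably be run over a paragraph's underlying XML element instead of a string.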

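        2. Table conversion optimizations

        Merged-cell detection: horizontal and vertical merges are identified from the tcPr element (w:gridSpan for horizontal merges, w:vMerge for vertical ones).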
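One detail worth spelling out: a w:vMerge element with no w:val attribute means "continue", so testing only for an explicit val="continue" can miss continuation cells. The following standard-library sketch reads both merge properties from a cell's XML; merge_info is a hypothetical helper and the XML strings are hand-written samples.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def merge_info(tc_xml):
    """Return (grid_span, v_merge) for a w:tc element given as an XML string.

    v_merge is 'restart', 'continue', or None when the cell is not vertically merged.
    """
    tc = ET.fromstring(tc_xml)
    tc_pr = tc.find(f"{{{W}}}tcPr")
    if tc_pr is None:
        return 1, None
    span = tc_pr.find(f"{{{W}}}gridSpan")
    grid_span = int(span.get(f"{{{W}}}val", "1")) if span is not None else 1
    v = tc_pr.find(f"{{{W}}}vMerge")
    # A w:vMerge element without w:val means "continue"
    v_merge = v.get(f"{{{W}}}val", "continue") if v is not None else None
    return grid_span, v_merge


ns = f'xmlns:w="{W}"'
top = f'<w:tc {ns}><w:tcPr><w:gridSpan w:val="3"/><w:vMerge w:val="restart"/></w:tcPr></w:tc>'
below = f'<w:tc {ns}><w:tcPr><w:vMerge/></w:tcPr></w:tc>'
print(merge_info(top), merge_info(below))
```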

        Style migration: visual attributes such as borders and background colors are preserved.
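When borders are carried over to HTML, the units need converting: the w:sz attribute of a border is measured in eighths of a point, and Word's "single" line style corresponds to CSS "solid". The helper below is a minimal sketch under those facts; border_to_css is a hypothetical name, treating the color "auto" as black is an assumption, and unknown line styles simply fall back to solid.

```python
def border_to_css(val="single", sz="4", color="auto"):
    """Translate w:val / w:sz / w:color border attributes into a CSS border value.

    w:sz is measured in eighths of a point. Only a few line styles are mapped;
    anything else falls back to 'solid' (an assumption of this sketch).
    """
    if val in ("nil", "none"):
        return "none"
    css_style = {
        "single": "solid",
        "double": "double",
        "dotted": "dotted",
        "dashed": "dashed",
    }.get(val, "solid")
    # "auto" lets the application pick the colour; black is assumed here
    css_color = "#000000" if color == "auto" else f"#{color}"
    return f"{int(sz) / 8:g}pt {css_style} {css_color}"


print(border_to_css())
print(border_to_css("double", "12", "FF0000"))
```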

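        3. Dynamic template generation

        Multi-style support: an extensible template is generated by iterating over all paragraph styles.

        Flexible configuration: users can customize page-break positions and style parameters.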

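        V. Application Scenarios

        Scenario               Solution
        Paragraph layout       Copy styles automatically and preserve formatting
        Data table export      Convert HTML to Word tables, with merged-cell support
        Report templates       Dynamically create template files containing multiple styles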

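        VI. Summary

        With the python-docx library we have implemented a complete workflow, from style copying to table conversion. Dynamically generated templates make document processing more flexible still. Whether the task is complex text-and-image layout or quickly producing documents in several styles, this approach offers an efficient path to an implementation.

        Suggestion: in practice, you can build on the python-docx Document object and iterate over all of its elements for finer-grained control. Catching exceptional cases, such as images in an invalid format, is also an important part of making the code robust.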
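One possible way to guard against bad image data, sketched here with the standard library only, is to check the leading bytes against a few well-known file signatures before inserting a picture. The names sniff_image_type and safe_images are hypothetical, and the signature table is intentionally short; wrapping the insertion call in try/except would be an alternative.

```python
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"BM": "bmp",
}


def sniff_image_type(data):
    """Return a format name based on the leading bytes, or None if unrecognised."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


def safe_images(images):
    """Yield only (bytes, width, height) entries whose bytes look like a known image."""
    for image_bytes, width, height in images:
        if sniff_image_type(image_bytes) is None:
            print("Skipping image with unrecognised format")
            continue
        yield image_bytes, width, height


candidates = [(b"not an image", 1, 1), (b"\xff\xd8\xff\xe0rest-of-file", 2, 3)]
print([(sniff_image_type(data), w, h) for data, w, h in safe_images(candidates)])
```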

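        VII. Further Notes

        Generating a document from template styles

        from docx import Document
        from docx.oxml import OxmlElement
        from docx.oxml.ns import qn
        # wan_neng_copy_word is assumed to be the module holding the style/text separation script shown later
        from wan_neng_copy_word import clone_document as get_para_style, html_table_to_docx
        import io
        
        
        # The rest is unchanged...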
        
        def copy_inline_shapes(new_doc, img):
            """Insert previously extracted inline images into a new paragraph."""
            new_para = new_doc.add_paragraph()
            for image_bytes, w, h in img:
                # Add each image to the new paragraph at the given width and height
                new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)
        
        
        def copy_paragraph_style(run_from, run_to):
            """Copy run-level formatting from one run to another."""
            run_to.bold = run_from.bold
            run_to.italic = run_from.italic
            run_to.underline = run_from.underline
            run_to.font.size = run_from.font.size
            run_to.font.color.rgb = run_from.font.color.rgb
            run_to.font.name = run_from.font.name
            run_to.font.all_caps = run_from.font.all_caps
            run_to.font.strike = run_from.font.strike
            run_to.font.shadow = run_from.font.shadow
        
        
        def is_page_break(element):
            """Return True if a paragraph has a direct w:br page-break child, or a table is followed by such a paragraph."""
            if element.tag.endswith('p'):
                for child in element:
                    # qn() needs a prefixed name, so the attribute is 'w:type'
                    if child.tag.endswith('br') and child.get(qn('w:type')) == 'page':
                        return True
            elif element.tag.endswith('tbl'):
                # A page break may follow the table, so inspect the next sibling
                next_element = element.getnext()
                if next_element is not None and next_element.tag.endswith('p'):
                    for child in next_element:
                        if child.tag.endswith('br') and child.get(qn('w:type')) == 'page':
                            return True
            return False
        
        
        def clone_paragraph(para_style, text, new_doc, para_style_ws):
            """Create a new paragraph from a stored style record, using the template's style object."""
            new_para = new_doc.add_paragraph()
            para_style_ws = list(para_style_ws["style"].values())[0]
            para_style_data = list(para_style["style"].values())[0]
            para_style_ws.font.size = para_style_data.font.size
        
            new_para.style = para_style_ws
        
            new_run = new_para.add_run(text)
            copy_paragraph_style(para_style["run_style"][0], new_run)
            new_para.alignment = list(para_style["alignment"].values())[0]
        
            return new_para
        
        
        def copy_cell_borders(old_cell, new_cell):
            """Copy a cell's border settings."""
            old_tc = old_cell._tc
            new_tc = new_cell._tc
        
            old_borders = old_tc.xpath('.//w:tcBorders')
            if old_borders:
                old_border = old_borders[0]
                new_border = OxmlElement('w:tcBorders')
        
                border_types = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
                for border_type in border_types:
                    old_element = old_border.find(f'.//w:{border_type}', namespaces={
                        'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
                    })
                    if old_element is not None:
                        new_element = OxmlElement(f'w:{border_type}')
                        for attr, value in old_element.attrib.items():
                            new_element.set(attr, value)
                        new_border.append(new_element)
        
                tc_pr = new_tc.get_or_add_tcPr()
                tc_pr.append(new_border)
        
        
        def clone_table(old_table, new_doc):
            """Create a new table from an existing one."""
            new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))
            if old_table.style:
                new_table.style = old_table.style
        
            for i, old_row in enumerate(old_table.rows):
                for j, old_cell in enumerate(old_row.cells):
                    new_cell = new_table.cell(i, j)
                    for paragraph in new_cell.paragraphs:
                        new_cell._element.remove(paragraph._element)
                    for old_paragraph in old_cell.paragraphs:
                        new_paragraph = new_cell.add_paragraph()
                        for old_run in old_paragraph.runs:
                            new_run = new_paragraph.add_run(old_run.text)
                            copy_paragraph_style(old_run, new_run)
                        new_paragraph.alignment = old_paragraph.alignment
                    copy_cell_borders(old_cell, new_cell)
        
            for i, col in enumerate(old_table.columns):
                if col.width is not None:
                    new_table.columns[i].width = col.width
        
            return new_table
        
        
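        def clone_document(old_s, old_p, old_ws, new_doc_path):
            """Rebuild a document from extracted styles (old_s), content (old_p) and template styles (old_ws)."""
            new_doc = Document()
            # Image sets are consumed in document order, one per "Image_None" marker
            image_sets = iter([i["image"] for i in old_s if "image" in i])
        
            # Copy the body content
            for para in old_p:
                if para == "br":
                    # Page-break marker emitted by the extraction step
                    new_doc.add_page_break()
                    continue
                for k, v in para.items():
                    if k == "Image_None":
                        copy_inline_shapes(new_doc, next(image_sets))
                    elif k == "table":
                        # v holds the table as an HTML string
                        html_table_to_docx(new_doc, v)
                    else:
                        # k is the run text, v the style key it maps to
                        style = [i for i in old_s if "style" in i and v in i["style"]]
                        style_ws = [i for i in old_ws if "style" in i and v in i["style"]]
                        clone_paragraph(style[0], k, new_doc, style_ws[0])
        
            new_doc.save(new_doc_path)
        
        
        # Usage example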
        if __name__ == "__main__":
            body_ws, _ = get_para_style('demo_template.docx')
            body_s, body_p = get_para_style("南山三防工作专报1.docx")
            clone_document(body_s, body_p, body_ws, 'cloned_example.docx')
        

        Separating template styles from text

        from docx.enum.text import WD_BREAK
        
        from docx import Document
        from docx.enum.text import WD_ALIGN_PARAGRAPH
        from docx.oxml import OxmlElement
        from bs4 import BeautifulSoup
        
        from docx.oxml.ns import qn
        
        def _grid_span(tc):
            """Return how many grid columns a w:tc element spans (w:gridSpan, default 1)."""
            tc_pr = tc.find(qn('w:tcPr'))
            span = tc_pr.find(qn('w:gridSpan')) if tc_pr is not None else None
            return int(span.get(qn('w:val'), '1')) if span is not None else 1
        
        
        def _v_merge(tc):
            """Return 'restart', 'continue' or None; a w:vMerge without w:val means 'continue'."""
            tc_pr = tc.find(qn('w:tcPr'))
            v_merge = tc_pr.find(qn('w:vMerge')) if tc_pr is not None else None
            return v_merge.get(qn('w:val'), 'continue') if v_merge is not None else None
        
        
        def _cell_text(tc):
            """Join the text of the cell's paragraphs, one line per paragraph."""
            lines = [''.join(t.text or '' for t in p.iter(qn('w:t'))) for p in tc.findall(qn('w:p'))]
            return '\n'.join(lines).strip()
        
        
        def docx_table_to_html(word_table):
            """Convert a python-docx table to an HTML string, keeping merges and basic cell styling."""
            soup = BeautifulSoup(features='html.parser')
            html_table = soup.new_tag('table')
        
            # Work on the raw w:tc elements: row.cells returns the merged cell for every
            # grid position it covers, which hides the continuation cells needed for rowspan.
            layouts = []
            for tr in word_table._tbl.findall(qn('w:tr')):
                layout, col = {}, 0
                for tc in tr.findall(qn('w:tc')):
                    layout[col] = tc  # starting grid column -> cell element
                    col += _grid_span(tc)
                layouts.append(layout)
        
            for row_idx, layout in enumerate(layouts):
                html_tr = soup.new_tag('tr')
        
                for col_idx, tc in layout.items():
                    merge_state = _v_merge(tc)
                    # Skip cells that continue a vertical merge started in a row above
                    if merge_state == 'continue':
                        continue
        
                    td = soup.new_tag('td')
                    td.string = _cell_text(tc)
                    td_style = ''
        
                    tc_pr = tc.find(qn('w:tcPr'))
                    if tc_pr is not None:
                        # Background colour
                        shd = tc_pr.find(qn('w:shd'))
                        if shd is not None:
                            bg_color = shd.get(qn('w:fill'))
                            if bg_color and bg_color != 'auto':
                                td_style += f'background-color:#{bg_color};'
        
                        # Alignment
                        jc = tc_pr.find(qn('w:jc'))
                        if jc is not None:
                            align = jc.get(qn('w:val'))
                            if align in ('center', 'right'):
                                td_style += f'text-align:{align};'
                            else:
                                td_style += 'text-align:left;'
        
                        # Borders
                        borders = tc_pr.find(qn('w:tcBorders'))
                        if borders is not None:
                            for border_type in ['top', 'left', 'bottom', 'right']:
                                border = borders.find(qn(f'w:{border_type}'))
                                if border is not None:
                                    color = border.get(qn('w:color'), '000000')
                                    if color == 'auto':
                                        color = '000000'
                                    size = int(border.get(qn('w:sz'), '4'))  # w:sz is in eighths of a point
                                    val = border.get(qn('w:val'), 'single')
                                    css_style = 'none' if val in ('nil', 'none') else {'single': 'solid'}.get(val, val)
                                    td_style += f'border-{border_type}:{size / 8:g}pt {css_style} #{color};'
        
                    # Horizontal merge (colspan)
                    colspan = _grid_span(tc)
                    if colspan > 1:
                        td['colspan'] = str(colspan)
        
                    # Vertical merge (rowspan): count the "continue" cells directly below
                    if merge_state == 'restart':
                        rowspan = 1
                        for below in layouts[row_idx + 1:]:
                            next_tc = below.get(col_idx)
                            if next_tc is None or _v_merge(next_tc) != 'continue':
                                break
                            rowspan += 1
                        if rowspan > 1:
                            td['rowspan'] = str(rowspan)
        
                    # Apply the collected styles plus a default padding
                    td['style'] = td_style + "padding: 5px;"
                    html_tr.append(td)
        
                html_table.append(html_tr)
        
            soup.append(html_table)
            return str(soup)
        
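        def set_cell_background(cell, color_hex):
            """Set a cell's background colour from a hex string such as '#FFCC00'."""
            color_hex = color_hex.lstrip('#')
            shading_elm = OxmlElement('w:shd')
            # w:val is required on w:shd; 'clear' means a plain fill with no pattern
            shading_elm.set(qn('w:val'), 'clear')
            shading_elm.set(qn('w:color'), 'auto')
            shading_elm.set(qn('w:fill'), color_hex)
            cell._tc.get_or_add_tcPr().append(shading_elm)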
        
        
        def html_table_to_docx(doc, html_content):
            """
            Convert the tables in an HTML string into tables in a Word document.
            :param doc: python-docx Document instance
            :param html_content: HTML string
            """
            soup = BeautifulSoup(html_content, 'html.parser')
            tables = soup.find_all('table')
            align_map = {
                'left': WD_ALIGN_PARAGRAPH.LEFT,
                'center': WD_ALIGN_PARAGRAPH.CENTER,
                'right': WD_ALIGN_PARAGRAPH.RIGHT
            }
        
            for html_table in tables:
                # Number of rows
                trs = html_table.find_all('tr')
                rows = len(trs)
        
                # Estimate the maximum number of columns (taking colspan into account)
                cols = 0
                for tr in trs:
                    col_count = 0
                    for cell in tr.find_all(['td', 'th']):
                        col_count += int(cell.get('colspan', 1))
                    cols = max(cols, col_count)
        
                if rows == 0 or cols == 0:
                    continue  # nothing to convert
        
                # Create the Word table
                table = doc.add_table(rows=rows, cols=cols)
                table.style = 'Table Grid'
        
                # Track grid positions that are already occupied (needed for merges)
                used_cells = [[False for _ in range(cols)] for _ in range(rows)]
        
                for row_idx, tr in enumerate(trs):
                    cells = tr.find_all(['td', 'th'])
                    col_idx = 0
        
                    for cell in cells:
                        while col_idx < cols and used_cells[row_idx][col_idx]:
                            col_idx += 1
        
                        if col_idx >= cols:
                            break  # avoid running past the last column
        
                        # colspan and rowspan
                        colspan = int(cell.get('colspan', 1))
                        rowspan = int(cell.get('rowspan', 1))
        
                        # Text content
                        text = cell.get_text(strip=True)
        
                        # Parse the inline style into a dict of CSS properties
                        css = {}
                        for decl in cell.get('style', '').split(';'):
                            if ':' in decl:
                                name, value = decl.split(':', 1)
                                css[name.strip().lower()] = value.strip()
        
                        # Alignment: the align attribute first, then the text-align style
                        align = cell.get('align') or css.get('text-align')
        
                        # Background colour
                        bg_color = css.get('background-color') or css.get('background')
        
                        # Target Word cell
                        word_cell = table.cell(row_idx, col_idx)
        
                        # Merge cells
                        if colspan > 1 or rowspan > 1:
                            end_row = min(row_idx + rowspan - 1, rows - 1)
                            end_col = min(col_idx + colspan - 1, cols - 1)
                            merged_cell = table.cell(row_idx, col_idx).merge(table.cell(end_row, end_col))
                            word_cell = merged_cell
        
                        # Set the text
                        para = word_cell.paragraphs[0]
                        para.text = text
        
                        # Set the alignment
                        if align in align_map:
                            para.alignment = align_map[align]
        
                        # Set the background colour (six-digit hex values only)
                        hex_color = (bg_color or '').lstrip('#')
                        if len(hex_color) == 6:
                            try:
                                int(hex_color, 16)
                            except ValueError:
                                pass  # not a plain hex colour; skip it
                            else:
                                set_cell_background(word_cell, hex_color)
        
                        # Mark the covered grid positions as used
                        for r in range(row_idx, min(row_idx + rowspan, rows)):
                            for c in range(col_idx, min(col_idx + colspan, cols)):
                                used_cells[r][c] = True
        
                        # Move to the next free column
                        col_idx += colspan
        
                # Add an empty paragraph as a separator
                doc.add_paragraph()
        
            return doc
        
        
        def copy_inline_shapes(old_paragraph):
            """Extract the inline images of a paragraph as [bytes, width, height] entries."""
            images = []
            for shape in old_paragraph._element.xpath('.//w:drawing'):
                blip = shape.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
                if blip is not None:
                    rId = blip.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
                    if rId is None:
                        continue  # linked rather than embedded image: no bytes to copy
                    image_part = old_paragraph.part.related_parts[rId]
                    image_bytes = image_part.image.blob
                    # width/height are the image's native size, not the size shown in the source document
                    images.append([image_bytes, image_part.image.width, image_part.image.height])
            return images
        
        
        def is_page_break(element):
            """Return True if a paragraph has a direct w:br page-break child, or a table is followed by such a paragraph."""
            if element.tag.endswith('p'):
                for child in element:
                    # qn() needs a prefixed name, so the attribute is 'w:type'
                    if child.tag.endswith('br') and child.get(qn('w:type')) == 'page':
                        return True
            elif element.tag.endswith('tbl'):
                # A page break may follow the table, so inspect the next sibling
                next_element = element.getnext()
                if next_element is not None and next_element.tag.endswith('p'):
                    for child in next_element:
                        if child.tag.endswith('br') and child.get(qn('w:type')) == 'page':
                            return True
            return False
        
        
        def clone_paragraph(old_para):
            """Collect a paragraph's style record and its text-to-style mappings."""
            # The key combines the style name (which identifies e.g. the heading level) and the alignment
            style_key = old_para.style.name + "_" + str(old_para.alignment).split()[0]
            style = {"run_style": [], "style": {style_key: old_para.style}}
            paras = []
            for old_run in old_para.runs:
                # Map each run's text to the style key it should be rendered with
                style["run_style"].append(old_run)
                paras.append({old_run.text: style_key})
        
            style["alignment"] = {style_key: old_para.alignment}
        
            images = copy_inline_shapes(old_para)
            if len(images):
                style["image"] = images
                paras.append({"Image_None": "Image_None"})
            return style, paras
        
        
        def clone_document(old_doc_path):
            """Split a document into style records and an ordered list of content items."""
            try:
                old_doc = Document(old_doc_path)
                paragraphs = old_doc.paragraphs
                tables = old_doc.tables
                para_index = 0
                table_index = 0
        
                body_style = []
                body_paras = []
        
                # Walk the body in document order so paragraphs and tables stay interleaved
                for element in old_doc.element.body:
                    if element.tag == qn('w:p'):
                        style, paras = clone_paragraph(paragraphs[para_index])
                        body_style.append(style)
                        body_paras += paras
                        para_index += 1
                        # Record a page-break marker after the paragraph
                        if is_page_break(element):
                            body_paras.append("br")
                    elif element.tag == qn('w:tbl'):
                        # Tables are stored as HTML strings
                        body_paras.append({"table": docx_table_to_html(tables[table_index])})
                        table_index += 1
        
                return body_style, body_paras
            except Exception as e:
                print(f"Error while copying the document: {e}")
                raise
        
        
        # Usage example
        if __name__ == "__main__":
            # Extract styles and content from the source document
            body_s, body_p = clone_document('专报1.docx')
        

        Generating an editable template

        from docx import Document
        from docx.enum.style import WD_STYLE_TYPE
        from docx.enum.text import WD_ALIGN_PARAGRAPH
        
        # Create a new Word document
        doc = Document()
        
        # Collect the available paragraph styles (other style types are skipped)
        paragraph_styles = [
            style for style in doc.styles if style.type == WD_STYLE_TYPE.PARAGRAPH
        ]
        
        # Print the styles that were found
        print(f"Found {len(paragraph_styles)} paragraph styles:")
        for style in paragraph_styles:
            print(f"- {style.name}")
        
        for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
            for bold_flag in [True, False]:
                # Add a sample paragraph for every style
                for style in paragraph_styles:
                    heading = doc.add_paragraph()
                    run = heading.add_run(f"Style name: {style.name}")
                    run.bold = bold_flag
                    para = doc.add_paragraph(f"This is a sample paragraph using the '{style.name}' style.", style=style)
                    para.alignment = align
                    # Add a separator line (optional)
                    doc.add_paragraph("-" * 40)
        
        # Save as demo_template.docx
        doc.save("demo_template.docx")
        print("\n✅ Generated a template containing all paragraph styles: demo_template.docx")
        
