开发者

PDF stream to excel

开发者 https://www.devze.com 2023-04-06 02:59 出处:网络
I am having a PDF with tables in it. The main objective is to have the similar table structure being reflected in ExcelSheet.

I am having a PDF with tables in it. The main objective is to have the similar table structure being reflected in ExcelSheet.

Reading the PDF stream with iTextSharp or PDFSharp I could get the plain text with loosing the structure of the table as in plain text the stream which previously had the coordin开发者_JAVA百科ate values for the text elements are being stripped out.

How can I deal with the stream using the coordinates to place my text values in exact positions in excel


I had the same problem of importing tabular parts of a PDF into Excel. I did the following way:

  • manually open PDF, select all and copy
  • manually change to Excel
  • start a VBA which reads the clipboard, parses the data and writes out to the sheet

problem here was that the data in the buffer is not arranged horizontally - as you would expect - but vertically. So I had to develop some code around this as well. I used a class module to impelment functions like "next word", "next line", "search for word" etc.

I am happy to share this code if it helps.

EDIT:

I make use of a MSForms.DataObject to read the Clipboard. After having created a reference to Microsoft Forms 2.0 Object library (...\system32\FM20.DLL), create a new Class Module named ClipClass and put the following code in:

Public P As Integer                   ' line pointer
Public T As String                    ' total text buffer
Public L As String                    ' current line

Public Property Get FirstLine() As String
    P = 1
    FirstLine = NextLine()
End Property

Public Property Get NextLine() As String
    L = ""
    Do Until Mid(T, P, 2) = vbCrLf
        L = L & Mid(T, P, 1)
        P = P + 1
    Loop
    NextLine = L
    P = P + 2
End Property

Public Property Get FindLine(Arg As String) As String
Dim Tmp As String

    Tmp = FirstLine()
    
    Do Until Tmp = Arg
        Tmp = NextLine()
    Loop
    FindLine = Tmp
End Property

Private Sub Class_Initialize()
Dim Buf As MSForms.DataObject
    
    Set Buf = New MSForms.DataObject   ' this object interfaces with the clipboard
    Buf.GetFromClipboard               ' copy Clipboard to Object
    T = Buf.GetText                    ' copy text from Object to string var
    L = ""
    P = 1
    Set Buf = Nothing                  ' clean up

End Sub

This gives you all the functions you need to find a string and read out lines. Now for the fun part .... in my case I have a constant string in the PDF which always is situated 3 lines above the first table cell; and all table cells are arranged col by col in the text buffer. This is the Parser which is called by a button on the Excel sheet

Sub Parse()
Dim C As ClipClass, Tmp As String, WS As Range
Dim WSRow As Integer, WSCol As Integer

    ' initialize
    Set WS = Worksheets("Table").[A1]
    Set C = New ClipClass                  ' this creates the class instance and implicitely
                                           ' fires its Initialize() code which grabs the Clipboard
    
    ' get to head of table
    Tmp = C.FindLine("identifying string before table starts")
    ' advance to one line before first table field - each field is terminated by CRLF
    Tmp = C.NextLine
    Tmp = C.NextLine

    ' PDF table is 3 col's x 7 rows organized col by col
    For WSCol = 1 To 3
        For WSRow = 1 To 7
            WS(WSRow, WSCol) = C.NextLine
        Next WSRow
    Next WSCol
End Sub


In order to achieve the same first the PDF was read using iTextSharp (also tried with PDFCLown). The individual chunks with its coordinates were fetched from the PDF. As the PDF was following a similar pattern which was an Invoice file, logically the data were fetched accordingly and then with the help of NPOI the resulting excel format was achieved.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号