Extracting ToUnicode tables from PDF

Developer (https://www.devze.com), 2023-04-11 20:58. Source: the web

Can anyone suggest an easy-to-implement way to extract ToUnicode tables from a PDF? I can extract fonts using pdfextract from MuPDF; now I'm looking for a way to extract the ToUnicode tables for those fonts.

You can modify pdfextract to extract the ToUnicode CMaps (not tables, CMaps).

You might look at the code in savefont and add something like:

obj = fz_dict_gets(dict, "ToUnicode");
if (obj)
{
    stream = obj;
}

If a ToUnicode entry is present (it is optional), you could dump its stream in much the same way the font stream is written to file:

obj = fz_dict_gets(dict, "ToUnicode");
if (obj)
{
    stream = obj;
    buf = fz_new_buffer(0);

    /* Load the ToUnicode stream into buf, as savefont does for the font stream */
    error = pdf_load_stream(&buf, xref, fz_to_num(stream), fz_to_gen(stream));
    if (error)
        die(error);

    /* Do something with the data */
}

buf->data (buf->len bytes long) would then contain the CMap, which you could write to a file or process further.
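
As a rough sketch of the "write to file" step, the helper below dumps a byte buffer to disk using only the C standard library. The name write_cmap and its parameters are hypothetical and not part of MuPDF; it assumes you pass it the buffer's data pointer and length.

```c
#include <stdio.h>

/* Hypothetical helper (not a MuPDF function): write `len` bytes of CMap
 * data to the file at `path`. Returns 0 on success, -1 on any failure. */
int write_cmap(const char *path, const unsigned char *data, size_t len)
{
    FILE *f;
    size_t written;

    /* Binary mode so the stream bytes are written unmodified */
    f = fopen(path, "wb");
    if (!f)
        return -1;

    written = fwrite(data, 1, len, f);

    /* fclose can report a deferred write error, so check it too */
    if (fclose(f) != 0)
        return -1;

    return written == len ? 0 : -1;
}
```

Inside the if (obj) block you might then call something like write_cmap(name, buf->data, buf->len), where name is an output file name of your choosing, for example one derived from the font's object number.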