Embedded fonts in PDF: copy and paste problems


When I try to copy and paste text into an MS Word document from a PDF document that has sets of fonts embedded, the result is illegible.

Several symbols are changed or disappear.

Using Adobe Acrobat, I can check which specific fonts are embedded.

  • Would installing such fonts in Microsoft Word work out?
  • If so, can I get or create the subsets of the fonts I need?
  • If not, how can I solve the problem?

You should check the PDF document's fonts first with the help of the pdffonts utility. It is part of the Xpdf package for Windows, and it can be used without installing, from a DOS box.

In order to extract text (or copy'n'paste it) from a PDF, the font should either use a standard encoding (not a custom one), or it should have a /ToUnicode table associated with it inside the PDF.

pdffonts returns a few basic information items about the fonts used by a PDF.

Example output:

    $ pdffonts -f 3 -l 5 sample.pdf
    name                      type          encoding     emb sub uni object ID
    ------------------------- ------------- ------------ --- --- --- ---------
    IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
    SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0

The command above asked for the fonts used in the page range from 3 (first page to check) to 5 (last page to check).

In the above case, both fonts used are embedded subsets (indicated by the XYZABC+ style prefixes on their names, and by the yes in the emb and sub columns).
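The subset prefix convention mentioned above (six uppercase letters followed by a plus sign) is easy to test for in a script. The following is only a minimal sketch; the function name split_subset_name is made up for this illustration and is not part of any tool.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "IADKRB+Arial-BoldMT".
SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(font_name):
    """Return (subset_tag, base_name); subset_tag is None for a non-subset font.

    Hypothetical helper, named for this example only.
    """
    match = SUBSET_PREFIX.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


if __name__ == "__main__":
    for name in ("IADKRB+Arial-BoldMT", "SSKFGJ+ArialMT", "ArialMT"):
        print(name, "->", split_subset_name(name))
```

The tag itself carries no meaning beyond marking the font as a subset, so the part after the plus sign is what tells you which typeface was used.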

The font SSKFGJ+ArialMT uses a custom encoding, and the PDF has no /ToUnicode table for this font, as indicated by the no entry in the column headed uni.

Hence it is not easy to extract the text shown with this font (extraction would require manual reverse engineering, which is possible as long as you can "read" the PDF pages).
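If you have many fonts or many files to check, the reading of the pdffonts table described above can be automated. The sketch below parses the sample output from this answer (pasted in as a string, so it runs without Xpdf installed) and flags fonts that have a custom encoding and no /ToUnicode table. The function names are invented for this example, and the rule is a simplification of the one stated above, not a complete test of extractability.

```python
import re

# Sample output copied from the example in this answer.
SAMPLE = """\
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
"""


def parse_pdffonts(output):
    """Turn pdffonts' fixed-width table into a list of dicts (hypothetical helper).

    Assumes the column layout shown above; very long font names that
    overflow their column are not handled.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the ruler line marks the extent of one column.
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    keys = [header[start:end].strip() for start, end in spans]
    fonts = []
    for row in rows:
        values = [row[start:end].strip() for start, end in spans]
        fonts.append(dict(zip(keys, values)))
    return fonts


def extraction_risky(font):
    """Simplified rule from the text: custom encoding and no /ToUnicode table."""
    return font["encoding"] == "Custom" and font["uni"] == "no"


if __name__ == "__main__":
    for font in parse_pdffonts(SAMPLE):
        status = "hard to extract" if extraction_risky(font) else "should be extractable"
        print(font["name"], "->", status)
```

In real use you would feed the function the captured output of pdffonts instead of the SAMPLE string.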

You should check first whether copy'n'pasting of the text works if you use a simple text file as the target (not an MS Word document). If it doesn't, you can forget about MS Word...
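A scripted stand-in for that plain-text check is to run pdftotext (a command line tool that ships in the same Xpdf package as pdffonts) and look at what comes out. This is an assumption on my part about how you might automate it: it is not the same code path as copy'n'paste in a viewer, but it depends on the same encoding information in the PDF. The helper names and the file name sample.pdf are placeholders.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page, last_page):
    """Build the pdftotext call (hypothetical helper for this example).

    "-" as the output file sends the text to stdout; -enc UTF-8 requests
    UTF-8 output; -f and -l select the first and last page.
    """
    return [
        "pdftotext", "-enc", "UTF-8",
        "-f", str(first_page), "-l", str(last_page),
        pdf_path, "-",
    ]


def extract_text(pdf_path, first_page, last_page):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Prints the command only; call extract_text() with a real file to run it.
    print(" ".join(pdftotext_command("sample.pdf", 3, 5)))
```

If the text returned for those pages is already garbled, pasting into MS Word will not do any better.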


  • Would installing such fonts in Microsoft Word work out?
    Very likely: no. (I cannot give a definite answer without having access to the PDF in question myself.)
  • If so, can I get or create the subsets of the fonts I need?
    You can extract the subsetted fonts from the PDF itself. (Funnily, a popular Stack Overflow answer of mine deals with this question -- dunno why people seem so crazy about extracting fonts from PDF files for other than debugging purposes...)
  • If not, how can I solve the problem?
    There is no solution other than doing it manually.

Update

Unfortunately, you cannot get the same info about the fonts used by a PDF via Acrobat or Adobe Reader. Via the menu -> File -> Properties... you can see:

  • the font names,
  • the subset info (but not the prefixes used by subset font names),
  • the encoding and
  • the font type.

But you cannot see any info about the presence of a /ToUnicode table.

