How to extract text and text coordinates from a PDF file? PDF Parsing with Text and Coordinates. PDF Text Extraction with Coordinates.

I want to extract all the text boxes and their coordinates from a PDF file. I would also like to extract text from a portion of a PDF page, selected by coordinates. Can anyone help me out?

Given a PDF file, output should look something like:

   489, 41,  "Signature"
   500, 52,  "b"
   630, 202, "a_g_i_r"

Customer #1  
-----------------------------------------------
Hi,

I was wondering if anyone could recommend a program which can extract the starting (top left) coordinates (x,y) of each word in a PDF file (and the end if possible). Ideally output would be in a format that could be easily inserted into a database.

Customer #2
-----------------------------------------------

Some of our customers want to extract text and its position from PDF pages. The positions are then used to parse values, for example to read invoice numbers from PDF files or to look for other information.

PDF Extractor SDK (PDF Parser SDK and Command Line) extracts various kinds of information from PDF files, including text content and text coordinates.

1. First, download the trial version of PDF Extractor SDK (PDF Parser SDK and Command Line) from this web page:

https://veryutils.com/pdf-extractor-sdk-pdf-parser-sdk-and-command-line

2. After downloading it, unzip it to a folder.

3. Open a CMD window. If you don't know how to open one, see the following web page:

https://veryutils.com/blog/top-10-methods-to-run-a-command-line-window-in-windows-10/

4. pdfextract.exe is a command line application. It supports the following command line options:

D:\VeryPDF_PDFExtractTool>pdfextract.exe
pdfextract.exe version 3.0
Copyright 1996-2017 VeryPDF.com Inc.
Product Name: VeryPDF PDF Extract Tool Command Line
http://www.verypdf.com
http://www.verydoc.com
http://support.verypdf.com
Email: support@verypdf.com
Usage: pdfextract.exe [options] <PDF-file>
  -f <int>           : first page to print
  -l <int>           : last page to print
  -opw <string>      : owner password (for encrypted files)
  -upw <string>      : user password (for encrypted files)
  -outfolder <string>: Set a folder to store extracted files
  -layout            : maintain original physical layout
  -textfile          : Extract text contents from PDF file
  -textpos           : Extract text and coordinates from PDF file
  -nopgbrk           : don't insert page breaks between pages
  -h                 : print usage information
  -help              : print usage information
  --help             : print usage information
  -?                 : print usage information
  -$ <string>        : input your license key
Example:
   pdfextract.exe D:\in.pdf
   pdfextract.exe -outfolder D:\out\ D:\in.pdf
   pdfextract.exe -outfolder D:\out\ D:\in.pdf
   pdfextract.exe -opw 123 -upw 456 -outfolder D:\out\ D:\in.pdf
   pdfextract.exe -outfolder D:\out\ D:\in.pdf > out.log
   pdfextract.exe -outfolder D:\out\ D:\in.pdf out.log
   pdfextract.exe D:\in.pdf out.log
   pdfextract.exe -textpos D:\in.pdf D:\out.txt
   pdfextract.exe -textpos -nopgbrk D:\in.pdf D:\out.txt
   pdfextract.exe -textfile D:\in.pdf D:\out.txt
   pdfextract.exe -layout -textfile D:\in.pdf D:\out.txt

5. You can simply run the following command line to extract all information from your PDF file:

pdfextract.exe -outfolder D:\VeryUtils\test\ D:\downloads\Test_in.pdf

6. You will find a "TextFileWithPosition.txt" file in the "D:\VeryUtils\test" folder. This text file contains the text content and coordinates of each word.

7. "PageContents.xml" is an XML file which contains the coordinates of each character.

8. Now you can write a simple PHP or Python application to read and parse the X/Y positions from these output files, and then process your PDF files accordingly.
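Steps 5 through 8 could be scripted along the following lines. This is only a minimal Python sketch under stated assumptions, not the SDK's own API. The `-outfolder` option comes from the usage text above; everything else is hypothetical. In particular, the exact layout of "TextFileWithPosition.txt" is not shown in this post, so the parser assumes each line looks like the sample in the question at the top (`489, 41,  "Signature"`) and would need adjusting to the real file. The helper names (`build_extract_command`, `run_extract`, `parse_line`, `parse_lines`, `words_in_region`, `save_to_sqlite`) and the `words` table are invented for illustration.

```python
import re
import sqlite3
import subprocess

# ASSUMPTION: one record per line, shaped like the sample in the question:
#    489, 41,  "Signature"
# The real TextFileWithPosition.txt layout may differ; adjust the pattern.
LINE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*,\s*"(.*)"\s*$')


def build_extract_command(pdf_path, out_folder, exe="pdfextract.exe"):
    """Build the argument list for: pdfextract.exe -outfolder <folder> <PDF-file>."""
    return [exe, "-outfolder", str(out_folder), str(pdf_path)]


def run_extract(pdf_path, out_folder, exe="pdfextract.exe"):
    """Run the tool (it must be installed and on PATH); return (exit code, console output)."""
    result = subprocess.run(
        build_extract_command(pdf_path, out_folder, exe),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


def parse_line(line):
    """Return (x, y, text) for a line in the assumed layout, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2)), match.group(3)


def parse_lines(lines):
    """Parse an iterable of lines (for example an open file), skipping non-matching lines."""
    return [rec for rec in map(parse_line, lines) if rec is not None]


def words_in_region(records, x_min, y_min, x_max, y_max):
    """Keep only the records whose (x, y) falls inside the given rectangle, edges included."""
    return [r for r in records if x_min <= r[0] <= x_max and y_min <= r[1] <= y_max]


def save_to_sqlite(records, conn):
    """Insert the records into a hypothetical 'words' table on an open sqlite3 connection."""
    conn.execute("CREATE TABLE IF NOT EXISTS words (x REAL, y REAL, text TEXT)")
    conn.executemany("INSERT INTO words (x, y, text) VALUES (?, ?, ?)", records)
    conn.commit()
```

A typical run might call `run_extract(r"D:\downloads\Test_in.pdf", "D:\\VeryUtils\\test\\")`, then open the generated text file and pass the file object to `parse_lines`. The command is built as a list rather than a single string so that Windows paths containing spaces do not need manual quoting.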

If you wish to extract more information from PDF files, such as hyperlinks, colorspaces, attachments, bookmarks, pictures, embedded fonts, or forms, please feel free to contact us. We are glad to assist you as soon as possible:

https://veryutils.com/contact
