PDF Step pdfToTextFilter
Description
Extracts all text content from within the current PDF document.
PDF documents can place text using a variety of mechanisms. The text may be stored as a stream of characters in the expected reading order; it may be stored out of order and rely on explicit positioning to appear in the right place; or it may be present only as graphical representations of the characters. For these reasons, this filter may not always produce the output you expect, and you may need to experiment to find settings that work for your documents.
Parameters
- description
- Required? no
- The description of this test step.
- fragSep
- Required? no, default is a single space
- The fragment separator string to use, e.g. "" or " " or "," or " | ". Only used if mode is "groupByLines".
- lineSep
- Required? no, default is platform line separator
- The line separator string to use, e.g. " " or "\n".
- mode
- Required? no, default is normal
- Deprecated: this parameter no longer has any effect.
- pageSep
- Required? no, default is [+++ NEW PAGE +++]\n
- The page separator string to use, e.g. "\n" or "------".
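As a rough sketch of how the separator parameters above might be combined, the fragment below sets lineSep and pageSep explicitly instead of relying on their defaults. The file name report.pdf is a hypothetical placeholder, and the surrounding compareToExpected and lineSeparatorFilter steps are assumed to be used as in the example under Details; the exact output depends on how the PDF stores its text.

```xml
<steps>
  <!-- report.pdf is a hypothetical document used only for illustration -->
  <invoke url="report.pdf"/>
  <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
    <!-- lineSep and pageSep use values listed as examples in the parameter descriptions -->
    <pdfToTextFilter lineSep="\n" pageSep="------" description="extract PDF text with custom separators"/>
    <lineSeparatorFilter description="normalise line separators"/>
  </compareToExpected>
</steps>
```

With these settings, page boundaries in the extracted text would be marked by ------ rather than the default [+++ NEW PAGE +++] marker.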
Details
Here is an example of using pdfToTextFilter:
pdfToTextFilter example
<steps>
  <invoke url="testDocBookmarks.pdf"/>
  <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
    <pdfToTextFilter mode="groupByLines" lineSep="\n" description="extract PDF text"/>
    <lineSeparatorFilter description="normalise line separators"/>
  </compareToExpected>
</steps>
Invoking the above steps creates a file containing something like the following:
pdfToTextFilter output
Heading One
Subheading
[+++ NEW PAGE +++]
Heading Two
[+++ NEW PAGE +++]