
PDF Step pdfToTextFilter

Description

Extracts all text content from within the current PDF document.

PDF documents can place text on a page using a variety of mechanisms. The text may be stored as a stream of characters in the expected reading order; it may be stored out of order and rely on explicit positioning to appear in the correct place; or it may be present only as graphical representations of the characters. For these reasons, this filter may not always produce the output you expect, and you may need to experiment to find what works for your documents.

Parameters

description
Required? no
The description of this test step.
fragSep
Required? no, default is a single space
The fragment separator string to use, e.g. "" or " " or "," or " | ". Only used if mode is "groupByLines".
lineSep
Required? no, default is platform line separator
The line separator string to use, e.g. " " or "\n".
mode
Required? no, default is normal
Deprecated: this parameter no longer has any effect.
pageSep
Required? no, default is [+++ NEW PAGE +++]\n
The page separator string to use, e.g. "\n" or "------".

Details

Here is an example of using pdfToTextFilter:

pdfToTextFilter example
<steps>
    <invoke url="testDocBookmarks.pdf"/>
    <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
        <pdfToTextFilter mode="groupByLines" lineSep="\n" description="extract PDF text"/>
        <lineSeparatorFilter description="normalise line separators"/>
    </compareToExpected>
</steps>

Invoking the above steps would create a file containing something like the following:

pdfToTextFilter output
Heading One
Subheading
[+++ NEW PAGE +++]
Heading Two
[+++ NEW PAGE +++]
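
The page marker shown in the output above is only the default and can be replaced through pageSep. The following is a minimal sketch of such a variation, assuming the same compareToExpected wrapper as in the example above. The URL report.pdf is a hypothetical placeholder, and the exact output will still depend on how the PDF stores its text:

```xml
<steps>
    <!-- report.pdf is a placeholder; substitute the PDF under test -->
    <invoke url="report.pdf"/>
    <compareToExpected saveFiltered="true" readFiltered="false" toFile="${expectedFile}">
        <!-- replace the default [+++ NEW PAGE +++] marker with a dashed rule -->
        <pdfToTextFilter pageSep="------" lineSep="\n" description="extract PDF text with a custom page separator"/>
        <lineSeparatorFilter description="normalise line separators"/>
    </compareToExpected>
</steps>
```

Only parameters listed under Parameters are used here. The mode attribute is omitted because it is documented as deprecated.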