Class HTMLStripCharFilter
- java.lang.Object
  - java.io.Reader
    - org.apache.lucene.analysis.CharFilter
      - org.apache.lucene.analysis.charfilter.BaseCharFilter
        - org.apache.lucene.analysis.charfilter.HTMLStripCharFilter
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, java.lang.Readable
public final class HTMLStripCharFilter extends BaseCharFilter
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
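To make the "wraps another Reader" idea concrete, here is a minimal JDK-only sketch of a Reader decorator that drops everything between `<` and `>`. It is not the scanner documented on this page: the class name `NaiveTagStrippingReader` and its `strip` helper are hypothetical, and the logic ignores entities, comments, scripts, styles, and offset correction, which the real class appears to handle judging by its fields.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical, deliberately naive tag stripper; NOT the JFlex-generated scanner documented here. */
class NaiveTagStrippingReader extends FilterReader {
    private boolean insideTag = false;

    NaiveTagStrippingReader(Reader in) {
        super(in);
    }

    /** Returns the next character that lies outside a tag, or -1 at end of input. */
    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;            // start skipping
            } else if (insideTag) {
                if (c == '>') insideTag = false; // tag closed, resume output
            } else {
                return c;                    // ordinary text character
            }
        }
        return -1;
    }

    /** Fills cbuf by repeatedly calling read(); returns -1 only if nothing could be read. */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Convenience helper for this sketch: strips a whole string in memory. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new NaiveTagStrippingReader(new StringReader(html))) {
            int c;
            while ((c = r.read()) != -1) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because the wrapper is itself a `Reader`, a caller can pass it to any consumer that expects character input, which is the same composition pattern a CharFilter relies on.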
-
-
Nested Class Summary
- private static class HTMLStripCharFilter.TextSegment
-
Field Summary
- private static int AMPERSAND
- private static int BANG
- private static char BLOCK_LEVEL_END_TAG_REPLACEMENT
- private static char BLOCK_LEVEL_START_TAG_REPLACEMENT
- private static char BR_END_TAG_REPLACEMENT
- private static char BR_START_TAG_REPLACEMENT
- private static int CDATA
- private static int CHARACTER_REFERENCE_TAIL
- private static int COMMENT
- private int cumulativeDiff
- private static int DOUBLE_QUOTED_STRING
- private static int END_TAG_TAIL_EXCLUDE
- private static int END_TAG_TAIL_INCLUDE
- private static int END_TAG_TAIL_SUBSTITUTE
- private HTMLStripCharFilter.TextSegment entitySegment
- private static CharArrayMap<java.lang.Character> entityValues
- private int eofReturnValue
- private boolean escapeBR
- private CharArraySet escapedTags
- private boolean escapeSCRIPT
- private boolean escapeSTYLE
- private static int INITIAL_INPUT_SEGMENT_SIZE
- private HTMLStripCharFilter.TextSegment inputSegment
- private long inputStart
- private static int LEFT_ANGLE_BRACKET
- private static int LEFT_ANGLE_BRACKET_SLASH
- private static int LEFT_ANGLE_BRACKET_SPACE
- private static int NUMERIC_CHARACTER
- private int outputCharCount
- private HTMLStripCharFilter.TextSegment outputSegment
- private int previousRestoreState
- private static char REPLACEMENT_CHARACTER
- private int restoreState
- private static int SCRIPT
- private static int SCRIPT_COMMENT
- private static char SCRIPT_REPLACEMENT
- private static int SERVER_SIDE_INCLUDE
- private static int SINGLE_QUOTED_STRING
- private static int START_TAG_TAIL_EXCLUDE
- private static int START_TAG_TAIL_INCLUDE
- private static int START_TAG_TAIL_SUBSTITUTE
- private static int STYLE
- private static int STYLE_COMMENT
- private static char STYLE_REPLACEMENT
- private static java.util.Map<java.lang.String,java.lang.String> upperCaseVariantsAccepted
- private long yychar: Number of characters up to the start of the matched text.
- private int yycolumn: Number of characters from the last newline up to the start of the matched text.
- private static int YYEOF: This character denotes the end of file.
- private static int YYINITIAL: Lexical states.
- private int yyline: Number of newlines encountered up to the start of the matched text.
- private static int[] ZZ_ACTION: Translates DFA states to action switch labels.
- private static java.lang.String ZZ_ACTION_PACKED_0
- private static int[] ZZ_ATTRIBUTE: ZZ_ATTRIBUTE[aState] contains the attributes of state aState.
- private static java.lang.String ZZ_ATTRIBUTE_PACKED_0
- private static int ZZ_BUFFERSIZE: Initial size of the lookahead buffer.
- private static int[] ZZ_CMAP_BLOCKS: Second-level tables for translating characters to character classes.
- private static java.lang.String ZZ_CMAP_BLOCKS_PACKED_0
- private static int[] ZZ_CMAP_TOP: Top-level table for translating characters to character classes.
- private static java.lang.String ZZ_CMAP_TOP_PACKED_0
- private static java.lang.String[] ZZ_ERROR_MSG
- private static int[] ZZ_LEXSTATE: ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
- private static int ZZ_NO_MATCH: Error code for "could not match input".
- private static int ZZ_PUSHBACK_2BIG: Error code for "pushback value was too large".
- private static int[] ZZ_ROWMAP: Translates a state to a row index in the transition table.
- private static java.lang.String ZZ_ROWMAP_PACKED_0
- private static int[] ZZ_TRANS: The transition table of the DFA.
- private static java.lang.String ZZ_TRANS_PACKED_0
- private static java.lang.String ZZ_TRANS_PACKED_1
- private static java.lang.String ZZ_TRANS_PACKED_10
- private static java.lang.String ZZ_TRANS_PACKED_11
- private static java.lang.String ZZ_TRANS_PACKED_12
- private static java.lang.String ZZ_TRANS_PACKED_13
- private static java.lang.String ZZ_TRANS_PACKED_14
- private static java.lang.String ZZ_TRANS_PACKED_2
- private static java.lang.String ZZ_TRANS_PACKED_3
- private static java.lang.String ZZ_TRANS_PACKED_4
- private static java.lang.String ZZ_TRANS_PACKED_5
- private static java.lang.String ZZ_TRANS_PACKED_6
- private static java.lang.String ZZ_TRANS_PACKED_7
- private static java.lang.String ZZ_TRANS_PACKED_8
- private static java.lang.String ZZ_TRANS_PACKED_9
- private static int ZZ_UNKNOWN_ERROR: Error code for "Unknown internal scanner error".
- private boolean zzAtBOL: Whether the scanner is currently at the beginning of a line.
- private boolean zzAtEOF: Whether the scanner is at the end of file.
- private char[] zzBuffer: This buffer contains the current text to be matched and is the source of the yytext() string.
- private int zzCurrentPos: Current text position in the buffer.
- private int zzEndRead: Marks the last character in the buffer that has been read from input.
- private boolean zzEOFDone: Whether the user-EOF-code has already been executed.
- private int zzFinalHighSurrogate
- private int zzLexicalState: Current lexical state.
- private int zzMarkedPos: Text position at the last accepting state.
- private java.io.Reader zzReader: Input device.
- private int zzStartRead: Marks the beginning of the yytext() string in the buffer.
- private int zzState: Current state of the DFA.
Fields inherited from class org.apache.lucene.analysis.CharFilter
input
-
-
Constructor Summary
- HTMLStripCharFilter(java.io.Reader in): Creates a new scanner.
- HTMLStripCharFilter(java.io.Reader in, java.util.Set<java.lang.String> escapedTags): Creates a new HTMLStripCharFilter over the provided Reader with the specified start and end tags.
-
Method Summary
- void close(): Closes the underlying input stream.
- (package private) static int getInitialBufferSize()
- private int nextChar(): Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- int read()
- int read(char[] cbuf, int off, int len)
- private boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- private void yybegin(int newState): Enters a new lexical state.
- private char yycharat(int position): Returns the character at the given position from the matched text.
- private void yyclose(): Closes the input reader.
- private int yylength(): How many characters were matched.
- private void yypushback(int number): Pushes the specified number of characters back into the input stream.
- private void yyreset(java.io.Reader reader): Resets the scanner to read from a new input stream.
- private void yyResetPosition(): Resets the input position.
- private int yystate(): Returns the current lexical state.
- private java.lang.String yytext(): Returns the text matched by the current regular expression.
- private static int zzCMap(int input): Translates raw input code points to DFA table row.
- private void zzDoEOF(): Contains user EOF-code, which will be executed exactly once, when the end of file is reached.
- private boolean zzRefill(): Refills the input buffer.
- private static void zzScanError(int errorCode): Reports an error that occurred while scanning.
- private static int[] zzUnpackAction()
- private static int zzUnpackAction(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackAttribute()
- private static int zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_blocks()
- private static int zzUnpackcmap_blocks(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_top()
- private static int zzUnpackcmap_top(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackRowMap()
- private static int zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackTrans()
- private static int zzUnpackTrans(java.lang.String packed, int offset, int[] result)
Methods inherited from class org.apache.lucene.analysis.charfilter.BaseCharFilter
addOffCorrectMap, correct, getLastCumulativeDiff
-
Methods inherited from class org.apache.lucene.analysis.CharFilter
correctOffset
-
-
-
-
Field Detail
-
YYEOF
private static final int YYEOF
This character denotes the end of file.
- See Also:
- Constant Field Values
-
ZZ_BUFFERSIZE
private static final int ZZ_BUFFERSIZE
Initial size of the lookahead buffer.
- See Also:
- Constant Field Values
-
YYINITIAL
private static final int YYINITIAL
Lexical states.
- See Also:
- Constant Field Values
-
AMPERSAND
private static final int AMPERSAND
- See Also:
- Constant Field Values
-
NUMERIC_CHARACTER
private static final int NUMERIC_CHARACTER
- See Also:
- Constant Field Values
-
CHARACTER_REFERENCE_TAIL
private static final int CHARACTER_REFERENCE_TAIL
- See Also:
- Constant Field Values
-
LEFT_ANGLE_BRACKET
private static final int LEFT_ANGLE_BRACKET
- See Also:
- Constant Field Values
-
BANG
private static final int BANG
- See Also:
- Constant Field Values
-
COMMENT
private static final int COMMENT
- See Also:
- Constant Field Values
-
SCRIPT
private static final int SCRIPT
- See Also:
- Constant Field Values
-
SCRIPT_COMMENT
private static final int SCRIPT_COMMENT
- See Also:
- Constant Field Values
-
LEFT_ANGLE_BRACKET_SLASH
private static final int LEFT_ANGLE_BRACKET_SLASH
- See Also:
- Constant Field Values
-
LEFT_ANGLE_BRACKET_SPACE
private static final int LEFT_ANGLE_BRACKET_SPACE
- See Also:
- Constant Field Values
-
CDATA
private static final int CDATA
- See Also:
- Constant Field Values
-
SERVER_SIDE_INCLUDE
private static final int SERVER_SIDE_INCLUDE
- See Also:
- Constant Field Values
-
SINGLE_QUOTED_STRING
private static final int SINGLE_QUOTED_STRING
- See Also:
- Constant Field Values
-
DOUBLE_QUOTED_STRING
private static final int DOUBLE_QUOTED_STRING
- See Also:
- Constant Field Values
-
END_TAG_TAIL_INCLUDE
private static final int END_TAG_TAIL_INCLUDE
- See Also:
- Constant Field Values
-
END_TAG_TAIL_EXCLUDE
private static final int END_TAG_TAIL_EXCLUDE
- See Also:
- Constant Field Values
-
END_TAG_TAIL_SUBSTITUTE
private static final int END_TAG_TAIL_SUBSTITUTE
- See Also:
- Constant Field Values
-
START_TAG_TAIL_INCLUDE
private static final int START_TAG_TAIL_INCLUDE
- See Also:
- Constant Field Values
-
START_TAG_TAIL_EXCLUDE
private static final int START_TAG_TAIL_EXCLUDE
- See Also:
- Constant Field Values
-
START_TAG_TAIL_SUBSTITUTE
private static final int START_TAG_TAIL_SUBSTITUTE
- See Also:
- Constant Field Values
-
STYLE
private static final int STYLE
- See Also:
- Constant Field Values
-
STYLE_COMMENT
private static final int STYLE_COMMENT
- See Also:
- Constant Field Values
-
ZZ_LEXSTATE
private static final int[] ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
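The indexing rule above can be sketched as a single array lookup. The table contents and the value assigned to `AMPERSAND` below are made up for illustration, and `startState` is a hypothetical helper, not a method of this class.

```java
/** Hypothetical illustration of the ZZ_LEXSTATE indexing convention; table values are invented. */
class LexStateLookupSketch {
    // Lexical state constants are even numbers (l = 2*k). The concrete values are assumptions.
    static final int YYINITIAL = 0;
    static final int AMPERSAND = 2;

    // Entry l is the DFA start state for lexical state l;
    // entry l + 1 is the start state for the same lexical state at the beginning of a line.
    static final int[] ZZ_LEXSTATE = {0, 5, 1, 6};

    /** Picks the DFA start state for a lexical state, taking beginning-of-line into account. */
    static int startState(int lexicalState, boolean atBeginningOfLine) {
        return ZZ_LEXSTATE[lexicalState + (atBeginningOfLine ? 1 : 0)];
    }
}
```

Keeping the constants even leaves the odd slot directly after each one free for the beginning-of-line variant, so no second table is needed.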
-
ZZ_CMAP_TOP
private static final int[] ZZ_CMAP_TOP
Top-level table for translating characters to character classes
-
ZZ_CMAP_TOP_PACKED_0
private static final java.lang.String ZZ_CMAP_TOP_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_CMAP_BLOCKS
private static final int[] ZZ_CMAP_BLOCKS
Second-level tables for translating characters to character classes
-
ZZ_CMAP_BLOCKS_PACKED_0
private static final java.lang.String ZZ_CMAP_BLOCKS_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_ACTION
private static final int[] ZZ_ACTION
Translates DFA states to action switch labels.
-
ZZ_ACTION_PACKED_0
private static final java.lang.String ZZ_ACTION_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_ROWMAP
private static final int[] ZZ_ROWMAP
Translates a state to a row index in the transition table
-
ZZ_ROWMAP_PACKED_0
private static final java.lang.String ZZ_ROWMAP_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_TRANS
private static final int[] ZZ_TRANS
The transition table of the DFA
-
ZZ_TRANS_PACKED_0
private static final java.lang.String ZZ_TRANS_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_1
private static final java.lang.String ZZ_TRANS_PACKED_1
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_2
private static final java.lang.String ZZ_TRANS_PACKED_2
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_3
private static final java.lang.String ZZ_TRANS_PACKED_3
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_4
private static final java.lang.String ZZ_TRANS_PACKED_4
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_5
private static final java.lang.String ZZ_TRANS_PACKED_5
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_6
private static final java.lang.String ZZ_TRANS_PACKED_6
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_7
private static final java.lang.String ZZ_TRANS_PACKED_7
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_8
private static final java.lang.String ZZ_TRANS_PACKED_8
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_9
private static final java.lang.String ZZ_TRANS_PACKED_9
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_10
private static final java.lang.String ZZ_TRANS_PACKED_10
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_11
private static final java.lang.String ZZ_TRANS_PACKED_11
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_12
private static final java.lang.String ZZ_TRANS_PACKED_12
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_13
private static final java.lang.String ZZ_TRANS_PACKED_13
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_14
private static final java.lang.String ZZ_TRANS_PACKED_14
- See Also:
- Constant Field Values
-
ZZ_UNKNOWN_ERROR
private static final int ZZ_UNKNOWN_ERROR
Error code for "Unknown internal scanner error".
- See Also:
- Constant Field Values
-
ZZ_NO_MATCH
private static final int ZZ_NO_MATCH
Error code for "could not match input".
- See Also:
- Constant Field Values
-
ZZ_PUSHBACK_2BIG
private static final int ZZ_PUSHBACK_2BIG
Error code for "pushback value was too large".
- See Also:
- Constant Field Values
-
ZZ_ERROR_MSG
private static final java.lang.String[] ZZ_ERROR_MSG
-
ZZ_ATTRIBUTE
private static final int[] ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState.
-
ZZ_ATTRIBUTE_PACKED_0
private static final java.lang.String ZZ_ATTRIBUTE_PACKED_0
- See Also:
- Constant Field Values
-
zzReader
private java.io.Reader zzReader
Input device.
-
zzState
private int zzState
Current state of the DFA.
-
zzLexicalState
private int zzLexicalState
Current lexical state.
-
zzBuffer
private char[] zzBuffer
This buffer contains the current text to be matched and is the source of the yytext() string.
-
zzMarkedPos
private int zzMarkedPos
Text position at the last accepting state.
-
zzCurrentPos
private int zzCurrentPos
Current text position in the buffer.
-
zzStartRead
private int zzStartRead
Marks the beginning of the yytext() string in the buffer.
-
zzEndRead
private int zzEndRead
Marks the last character in the buffer that has been read from input.
-
zzAtEOF
private boolean zzAtEOF
Whether the scanner is at the end of file.
- See Also:
yyatEOF()
-
zzFinalHighSurrogate
private int zzFinalHighSurrogate
-
yyline
private int yyline
Number of newlines encountered up to the start of the matched text.
-
yycolumn
private int yycolumn
Number of characters from the last newline up to the start of the matched text.
-
yychar
private long yychar
Number of characters up to the start of the matched text.
-
zzAtBOL
private boolean zzAtBOL
Whether the scanner is currently at the beginning of a line.
-
zzEOFDone
private boolean zzEOFDone
Whether the user-EOF-code has already been executed.
-
upperCaseVariantsAccepted
private static final java.util.Map<java.lang.String,java.lang.String> upperCaseVariantsAccepted
-
entityValues
private static final CharArrayMap<java.lang.Character> entityValues
-
INITIAL_INPUT_SEGMENT_SIZE
private static final int INITIAL_INPUT_SEGMENT_SIZE
- See Also:
- Constant Field Values
-
BLOCK_LEVEL_START_TAG_REPLACEMENT
private static final char BLOCK_LEVEL_START_TAG_REPLACEMENT
- See Also:
- Constant Field Values
-
BLOCK_LEVEL_END_TAG_REPLACEMENT
private static final char BLOCK_LEVEL_END_TAG_REPLACEMENT
- See Also:
- Constant Field Values
-
BR_START_TAG_REPLACEMENT
private static final char BR_START_TAG_REPLACEMENT
- See Also:
- Constant Field Values
-
BR_END_TAG_REPLACEMENT
private static final char BR_END_TAG_REPLACEMENT
- See Also:
- Constant Field Values
-
SCRIPT_REPLACEMENT
private static final char SCRIPT_REPLACEMENT
- See Also:
- Constant Field Values
-
STYLE_REPLACEMENT
private static final char STYLE_REPLACEMENT
- See Also:
- Constant Field Values
-
REPLACEMENT_CHARACTER
private static final char REPLACEMENT_CHARACTER
- See Also:
- Constant Field Values
-
escapedTags
private CharArraySet escapedTags
-
inputStart
private long inputStart
-
cumulativeDiff
private int cumulativeDiff
-
escapeBR
private boolean escapeBR
-
escapeSCRIPT
private boolean escapeSCRIPT
-
escapeSTYLE
private boolean escapeSTYLE
-
restoreState
private int restoreState
-
previousRestoreState
private int previousRestoreState
-
outputCharCount
private int outputCharCount
-
eofReturnValue
private int eofReturnValue
-
inputSegment
private HTMLStripCharFilter.TextSegment inputSegment
-
outputSegment
private HTMLStripCharFilter.TextSegment outputSegment
-
entitySegment
private HTMLStripCharFilter.TextSegment entitySegment
-
-
Constructor Detail
-
HTMLStripCharFilter
public HTMLStripCharFilter(java.io.Reader in, java.util.Set<java.lang.String> escapedTags)
Creates a new HTMLStripCharFilter over the provided Reader with the specified start and end tags.
- Parameters:
in - Reader to strip HTML tags from.
escapedTags - Tags in this set (both start and end tags) will not be filtered out.
-
HTMLStripCharFilter
public HTMLStripCharFilter(java.io.Reader in)
Creates a new scanner.
- Parameters:
in - the java.io.Reader to read input from.
-
-
Method Detail
-
zzUnpackcmap_top
private static int[] zzUnpackcmap_top()
-
zzUnpackcmap_top
private static int zzUnpackcmap_top(java.lang.String packed, int offset, int[] result)
-
zzUnpackcmap_blocks
private static int[] zzUnpackcmap_blocks()
-
zzUnpackcmap_blocks
private static int zzUnpackcmap_blocks(java.lang.String packed, int offset, int[] result)
-
zzUnpackAction
private static int[] zzUnpackAction()
-
zzUnpackAction
private static int zzUnpackAction(java.lang.String packed, int offset, int[] result)
-
zzUnpackRowMap
private static int[] zzUnpackRowMap()
-
zzUnpackRowMap
private static int zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
-
zzUnpackTrans
private static int[] zzUnpackTrans()
-
zzUnpackTrans
private static int zzUnpackTrans(java.lang.String packed, int offset, int[] result)
-
zzUnpackAttribute
private static int[] zzUnpackAttribute()
-
zzUnpackAttribute
private static int zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
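The unpack methods above share the signature `(String packed, int offset, int[] result)`. JFlex-generated scanners commonly store their tables as run-length encoded strings, so the following sketch assumes a `(count, value)` pair encoding. That encoding is an assumption about the generated code, not something stated on this page, and individual tables (the row map in particular) may be packed differently. The class and method names are hypothetical.

```java
/** Hypothetical sketch of run-length decoding a packed table; the encoding is assumed. */
class PackedTableSketch {
    /**
     * Expands (count, value) character pairs from packed into result,
     * starting at offset, and returns the next free index in result.
     */
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;       // read position in the packed string
        int j = offset;  // write position in the result array
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return j;
    }
}
```

Returning the next free index lets several packed strings be decoded back to back into one array, which would explain why the transition table is split across ZZ_TRANS_PACKED_0 through ZZ_TRANS_PACKED_14.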
-
read
public int read() throws java.io.IOException
- Overrides:
read in class java.io.Reader
- Throws:
java.io.IOException
-
read
public int read(char[] cbuf, int off, int len) throws java.io.IOException
- Specified by:
read in class java.io.Reader
- Throws:
java.io.IOException
-
close
public void close() throws java.io.IOException
Description copied from class: CharFilter
Closes the underlying input stream.
NOTE: The default implementation closes the input Reader, so be sure to call super.close() when overriding this method.
- Specified by:
close in interface java.lang.AutoCloseable
- Specified by:
close in interface java.io.Closeable
- Overrides:
close in class CharFilter
- Throws:
java.io.IOException
-
getInitialBufferSize
static int getInitialBufferSize()
-
zzCMap
private static int zzCMap(int input)
Translates raw input code points to DFA table row
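A two-level lookup of the kind described for ZZ_CMAP_TOP and ZZ_CMAP_BLOCKS can be sketched as follows. The page size of 256 code points, the table contents, and the names `CMAP_TOP`, `CMAP_BLOCKS` and `charClass` are assumptions for illustration; the sketch covers only code points below U+0200, whereas a real table would cover the full range.

```java
import java.util.Arrays;

/** Hypothetical two-level character-class lookup; page size and table values are assumed. */
class TwoLevelCharMapSketch {
    // Top-level table: one entry per 256-code-point page, holding the start offset of that page's block.
    static final int[] CMAP_TOP = {0, 256};

    // Second-level blocks: 256 class ids per block.
    // Block 0 covers U+0000..U+00FF, block 1 covers U+0100..U+01FF.
    static final int[] CMAP_BLOCKS = new int[512];

    static {
        CMAP_BLOCKS['<'] = 1;                   // class 1: tag opener
        CMAP_BLOCKS['&'] = 2;                   // class 2: entity opener
        Arrays.fill(CMAP_BLOCKS, 256, 512, 3);  // class 3: everything on the second page
    }

    /** Maps a code point (below U+0200 in this sketch) to its character class. */
    static int charClass(int codePoint) {
        int offset = codePoint & 0xFF;                          // position within the page
        return CMAP_BLOCKS[CMAP_TOP[codePoint >> 8] | offset];  // page start combined with position
    }
}
```

Pages with identical contents can share one block, which is the usual reason for splitting a character map into two levels.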
-
zzRefill
private boolean zzRefill() throws java.io.IOException
Refills the input buffer.
- Returns:
false iff there was new input.
- Throws:
java.io.IOException - if any I/O-Error occurs
-
yyclose
private final void yyclose() throws java.io.IOException
Closes the input reader.
- Throws:
java.io.IOException - if the reader could not be closed.
-
yyreset
private final void yyreset(java.io.Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). Lexical state is set to YYINITIAL.
Internal scan buffer is resized down to its initial length, if it has grown.
- Parameters:
reader - The new input stream.
-
yyResetPosition
private final void yyResetPosition()
Resets the input position.
-
yyatEOF
private final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
- Returns:
- whether the scanner has reached EOF.
-
yystate
private final int yystate()
Returns the current lexical state.
- Returns:
- the current lexical state.
-
yybegin
private final void yybegin(int newState)
Enters a new lexical state.
- Parameters:
newState - the new lexical state
-
yytext
private final java.lang.String yytext()
Returns the text matched by the current regular expression.
- Returns:
- the matched text.
-
yycharat
private final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
- Parameters:
position - the position of the character to fetch. A value from 0 to yylength()-1.
- Returns:
- the character at position.
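The equivalence between yycharat and yytext().charAt can be sketched on top of the buffer fields listed under Field Detail. The sketch assumes the matched text is the buffer region from zzStartRead (inclusive) to zzMarkedPos (exclusive); the buffer contents and the class name are made up.

```java
/** Hypothetical sketch of the matched-text accessors; buffer contents are invented. */
class MatchedTextSketch {
    // Field names mirror the scanner's fields; the values are assumptions for this example.
    char[] zzBuffer = "xx<br/>yy".toCharArray();
    int zzStartRead = 2;  // index of the first matched character
    int zzMarkedPos = 7;  // index one past the last matched character

    /** Allocates a new String for the matched region. */
    String yytext() {
        return new String(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
    }

    /** Reads one character straight from the buffer, without building a String. */
    char yycharat(int position) {
        return zzBuffer[zzStartRead + position];
    }

    /** Length of the matched region. */
    int yylength() {
        return zzMarkedPos - zzStartRead;
    }
}
```

In this sketch the direct buffer access avoids the String allocation that yytext() performs, which is one plausible reason for the "but faster" remark.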
-
yylength
private final int yylength()
How many characters were matched.
- Returns:
- the length of the matched text region.
-
zzScanError
private static void zzScanError(int errorCode)
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done in error fallback rules.
- Parameters:
errorCode - the code of the error message to display.
-
yypushback
private void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
- Parameters:
number - the number of characters to be read again. This number must not be greater than yylength().
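One way to picture pushback is as moving the end of the matched region backwards, so the next scan starts earlier in the buffer. The sketch below assumes that model and reuses the field names from Field Detail with invented contents. It throws IllegalArgumentException for an oversized request, which is a simplification chosen for this example; the generated scanner reports that case through zzScanError.

```java
/** Hypothetical sketch of pushback as a move of the marked position; not the generated code. */
class PushbackSketch {
    // Field names mirror the scanner's fields; the values are assumptions for this example.
    char[] zzBuffer = "<b>bold".toCharArray();
    int zzStartRead = 0;  // start of the current match
    int zzMarkedPos = 5;  // one past the end of the current match, i.e. "<b>bo" is matched

    int yylength() {
        return zzMarkedPos - zzStartRead;
    }

    String yytext() {
        return new String(zzBuffer, zzStartRead, yylength());
    }

    /** Gives back the last number matched characters so they are scanned again. */
    void yypushback(int number) {
        if (number > yylength()) {
            // Simplification: the real scanner signals this via zzScanError.
            throw new IllegalArgumentException("pushback value was too large");
        }
        zzMarkedPos -= number;
    }
}
```

After `yypushback(2)` in this sketch, the match shrinks from `<b>bo` to `<b>` and the two returned characters are available to the next scanning step.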
-
zzDoEOF
private void zzDoEOF()
Contains user EOF-code, which will be executed exactly once, when the end of file is reached.
-
nextChar
private int nextChar() throws java.io.IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- Returns:
- the next token.
- Throws:
java.io.IOException - if any I/O-Error occurs.
-
-