Lucene: Fix highlighter issue to always honor hl.fragsize


Summary
There is a bug in the Lucene highlighter code: it appends all remaining text to the last fragment. In some cases the last fragment is the most relevant one, so it is returned to the client in the highlighting section, and the highlighter outputs more characters than hl.fragsize allows.

This article describes how to fix the issue so that the highlighter always honors hl.fragsize and returns no more than hl.fragsize characters to the client in the highlighting section.

The Problem
Recently we hit a problem with the highlighter. I set hl.fragsize to 300, as shown below:
<str name="hl">on</str>
<str name="hl.fl">title,body_stored</str>
<str name="hl.fragsize">300</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.body_stored.hl.fragsize">300</str>
But the highlight section for one document still outputs more than 2000 characters.

Looking into the code, in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int), after the for loop, it appends the whole remaining text to the last fragment:
if (
  // if there is text beyond the last token considered..
  (lastEndOffset < text.length())
  &&
  // and that text is not too large...
  (text.length()<= maxDocCharsToAnalyze)
 )
{
 //append it to the last fragment
 newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();

This code is problematic: in some cases the last fragment is the most relevant section, so it is selected and returned to the client with all of the trailing text attached.
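
To make the effect concrete, here is a minimal sketch that mimics the tail handling above without depending on Lucene. The class name UnboundedTailDemo, the method lastFragmentOriginal, and its parameters are hypothetical names for illustration only; the sketch also ignores encoding and the maxDocCharsToAnalyze check, so it is a simplification rather than the real Highlighter logic.

```java
public class UnboundedTailDemo {

    // Hypothetical helper: builds the last fragment the way the original
    // tail handling does, by appending everything after lastEndOffset.
    static String lastFragmentOriginal(String text, int fragStart, int lastEndOffset) {
        StringBuilder fragment = new StringBuilder(text.substring(fragStart, lastEndOffset));
        if (lastEndOffset < text.length()) {
            // No upper bound here: the whole remainder is appended.
            fragment.append(text.substring(lastEndOffset));
        }
        return fragment.toString();
    }

    public static void main(String[] args) {
        // A short matching token followed by a long run of unmatched text.
        StringBuilder sb = new StringBuilder("match ");
        for (int i = 0; i < 2000; i++) {
            sb.append('x');
        }
        String text = sb.toString();

        // The last token ends at offset 5, but the fragment keeps growing.
        String fragment = lastFragmentOriginal(text, 0, 5);
        System.out.println("fragment length: " + fragment.length());
    }
}
```

With a fragment size of 300 in mind, the printed length is far larger than the limit, which matches what we saw in the highlighting section.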

The Solution
I made the following change to the code:
//Test what remains of the original text beyond the point where we stopped analyzing
if(lastEndOffset < text.length())
{
 if(textFragmenter instanceof SimpleFragmenter)
 {
  SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
  int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
  if(remain > 0 )
  {
   int endIndex = lastEndOffset + remain;
   if (endIndex > text.length()) {
    endIndex = text.length();
   }
   newText.append(encoder.encodeText(text.substring(lastEndOffset,
     endIndex)));
  }
 }
 else
 {
  newText.append(encoder.encodeText(text.substring(lastEndOffset)));
 }
}
currentFrag.textEndPos = newText.length();
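
The arithmetic in the SimpleFragmenter branch can be checked in isolation. The sketch below is a standalone illustration, not Lucene code: the class FragmentTailDemo, the method boundedTail, and the parameter names usedChars and fragmentSize are assumptions made for the example. Here usedChars stands for newText.length() - currentFrag.textStartPos and fragmentSize stands for the value returned by getFragmentSize().

```java
public class FragmentTailDemo {

    // Hypothetical helper: returns only as much of the text after
    // lastEndOffset as still fits into the current fragment.
    static String boundedTail(String text, int lastEndOffset, int usedChars, int fragmentSize) {
        if (lastEndOffset >= text.length()) {
            return ""; // nothing left beyond the last token
        }
        int remain = fragmentSize - usedChars;
        if (remain <= 0) {
            return ""; // the fragment is already full
        }
        // Clamp the end index so we never read past the end of the text.
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        String text = "abcdefghij";
        // 3 of 5 characters are used, so 2 more are appended.
        System.out.println(boundedTail(text, 4, 3, 5));
        // Room for 5 characters, but only 2 remain in the text.
        System.out.println(boundedTail(text, 8, 0, 5));
        // The fragment is already full, so nothing is appended.
        System.out.println(boundedTail(text, 4, 5, 5).isEmpty());
    }
}
```

One thing to keep in mind: in the patched method, newText holds encoded text while the offsets refer to the raw text, so with an encoder that expands characters the limit is approximate rather than exact.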

Resources
https://issues.apache.org/jira/browse/LUCENE-5381
