Collections in Guava


Google Guava is an open source library that provides lots of useful utilities to Java programmers. One of them is an easy way to find the difference between two maps.
MapDifference of two Maps

import java.util.Map;

import com.google.common.collect.ImmutableMap;
import com.google.common.collect.MapDifference;
import com.google.common.collect.MapDifference.ValueDifference;
import com.google.common.collect.Maps;

public void getMapDiff() {
    Map<Integer,Integer> leftMap = ImmutableMap.of(1, 1, 2, 2, 4, 40);
    Map<Integer,Integer> rightMap = ImmutableMap.<Integer,Integer> builder()
        .put(2, 2).put(3, 3).put(4, 41).build();
    MapDifference<Integer,Integer> diff = Maps.difference(leftMap, rightMap);
    
    // output: entriesInCommon:{2=2}
    System.out.println("entriesInCommon:" + diff.entriesInCommon());
    // output: entriesOnlyOnLeft:{1=1}
    System.out.println("entriesOnlyOnLeft:" + diff.entriesOnlyOnLeft());
    // output: entriesOnlyOnRight:{3=3}
    System.out.println("entriesOnlyOnRight:" + diff.entriesOnlyOnRight());
    
    Map<Integer,ValueDifference<Integer>> entriesDiffering = diff
        .entriesDiffering();
    // output: entriesDiffering: {4=(40, 41)}
    System.out.println("entriesDiffering: " + entriesDiffering);
  }
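MapDifference also exposes both sides of each differing entry via ValueDifference, and areEqual() reports whether the two maps are identical; continuing inside getMapDiff():

ValueDifference<Integer> vd = diff.entriesDiffering().get(4);
// output: 40 -> 41
System.out.println(vd.leftValue() + " -> " + vd.rightValue());
// output: false
System.out.println(diff.areEqual());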

Boost Search Relevancy Using boilerpipe, Nutch and Solr


The Problem
We use Nutch to crawl web sites and save the content into Solr for search.

A website usually applies a template that defines the header, footer, and navigation menus. Take a hypothetical storage-related documentation site as an example: the word storage appears multiple times in the header, footer, and menus.
The website also has some very simple pages such as contact-us or login. Because these pages contain very little content, it's very likely that if a user searches for storage, these two pages will be ranked highly and listed on the first page.

We want to avoid this. We would like to save the main content in a dedicated field in Solr, and boost that field.
The Solution
Change Nutch to Send Raw Content to Solr
By default, Nutch sends Solr the content with HTML tags stripped, not the raw HTML page content.
To send raw content to Solr, we have to create an extra Nutch plugin:

import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class ExtraIndexingFilter implements IndexingFilter {
  public static final String FL_RAWCONTENT = "rawcontent";
  private Configuration conf;
  private boolean indexRawContent;
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

  static {
    FIELDS.add(WebPage.Field.CONTENT);
  }
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    if (indexRawContent) {
      ByteBuffer bb = page.getContent();
      if (bb != null) {
        doc.add(FL_RAWCONTENT, new String(bb.array()));
      }
    }
    return doc;
  }
  public void setConf(Configuration conf) {
    this.conf = conf;
    indexRawContent = conf.getBoolean("index-extra.rawcontent", false);
  }
  public Configuration getConf() {
    return conf;
  }
  // Tells Nutch which WebPage fields this filter needs to read.
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }
}
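A Nutch plugin also needs a plugin.xml that registers the filter at the IndexingFilter extension point. A minimal sketch; the plugin id matches index-extra, while the package name and jar name are assumptions:
<plugin id="index-extra" name="Extra Indexing Filter" version="1.0.0">
  <runtime>
    <!-- jar name is an assumption; use whatever your build produces -->
    <library name="index-extra.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.indexer.extra" name="Extra Indexing Filter"
      point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="ExtraIndexingFilter" class="org.lifelongprogrammer.ExtraIndexingFilter"/>
  </extension>
</plugin>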
Then change nutch-site.xml: add the plugin (index-extra) to plugin.includes.
Set index-extra.rawcontent to true (the property name must match the one read in setConf), and set http.content.limit to -1 so Nutch will crawl the whole page.
<property>
  <name>index-extra.rawcontent</name>
  <value>true</value>
</property>
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>
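For reference, a sketch of the plugin.includes change; the exact value depends on which plugins your crawl already uses (this one starts from the Nutch 2.x defaults and adds index-extra):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|extra)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>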
In solrindex-mapping.xml, add:
<field dest="rawcontent" source="rawcontent" />
Using boilerpipe to Remove Boilerplate Content in Solr
Next, we will define a Solr update processor that uses boilerpipe to remove the surplus "clutter" (boilerplate, templates).
BoilerpipeProcessor uses boilerpipe to remove boilerplate from originfield and saves the stripped main content into strippedField.
import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Preconditions;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeProcessorFactory extends UpdateRequestProcessorFactory {
  private static final Logger logger = LoggerFactory
      .getLogger(BoilerpipeProcessorFactory.class);
  private boolean enabled = true;
  private String originfield, strippedField;
  private boolean removeOriginfield = true;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", true);
      if (!enabled) return;
      removeOriginfield = params.getBool("removeOriginfield", true);
      originfield = Preconditions.checkNotNull(params.get("originfield"),
          "Must set originfield.");
      
      strippedField = Preconditions.checkNotNull(params.get("strippedField"),
          "Must set strippedField.");
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    // Returning null makes Solr skip this factory when building the chain.
    if (!enabled) return null;
    return new BoilerpipeProcessor(next);
  }
  
  private class BoilerpipeProcessor extends UpdateRequestProcessor {
    public BoilerpipeProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      Collection<Object> colls = doc.getFieldValues(originfield);
      if (colls != null) {
        for (Object obj : colls) {
          if (obj != null) {
            String str = obj.toString();
            try {
              String strippedText = ArticleExtractor.getInstance().getText(str);
              doc.addField(strippedField, strippedText);
              if (removeOriginfield) {
                doc.removeField(originfield);
              }
            } catch (BoilerpipeProcessingException e) {
              logger.error("Error happened when use boilerpipe to strip text.",
                  e);
            }
          }
        }
      }
      super.processAdd(cmd);
    }
  }
}
Add the processor into the default chain in the solrconfig.xml:
<updateRequestProcessorChain name="defaultChain" default="true`">
  <processor
   class="org.lifelongprogrammer.BoilerpipeProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="originfield">rawcontent</str>
    <str name="strippedField">main_content</str>
    <bool name="removeOriginfield">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" /> 
</updateRequestProcessorChain>

Add main_content field in schema.xml:
<field name="main_content" type="text_rev" indexed="true" stored="true"  omitNorms="false" />
After all this, we can change our search handler to boost the main_content field:
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Omitted -->
    <str name="qf">main_content^10 body_stored</str>
  </lst>
</requestHandler>
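Note that qf is recognized by the dismax-family query parsers; if the handler doesn't already use one, a sketch of the extra line for the same defaults block (assuming edismax):
<str name="defType">edismax</str>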
Resources
boilerpipe library
Filtering Source Code Using boilerpipe

Text Mining: Integrate UIMA Regular Expression Annotator with Solr


UIMA RegexAnnotator
UIMA RegexAnnotator is an Apache UIMA analysis engine that uses regular expression to detect entities such as email addresses, URLs, phone numbers, zip codes or any other entity.

This article introduces how to deploy RegexAnnotator as a SOAP web service, add extra regexes to extract other types of entities, and integrate it with Solr.
Deploy RegexAnnotator as SOAP Web Service
For detailed steps, please refer to this post.
First copy all jars in %uima-addons-home%\addons\annotator\RegularExpressionAnnotator\lib\ to axis.war\WEB-INF\lib, and copy RegularExpressionAnnotator\desc\concepts.xml to axis.war\WEB-INF\classes.

Then we need to create a web services deployment descriptor. Example WSDD files are provided in the examples/deploy/soap directory of the UIMA SDK. All we need to do is copy one WSDD (for example, Deploy_NamesAndPersonTitles.wsdd), change the service name to urn:RegExAnnotator, and change resourceSpecifierPath to point to %PEARS_HOME_REPLACE_THIS%\addons\annotator\RegularExpressionAnnotator\desc\RegExAnnotator.xml.

<deployment name="RegExAnnotator">
 <service name="urn:RegExAnnotator" provider="java:RPC">
  <parameter name="resourceSpecifierPath" value="%PEARS_HOME_REPLACE_THIS%\addons/annotator/RegularExpressionAnnotator/desc/RegExAnnotator.xml"/>
 </service>
</deployment>
If we are using Tomcat, set CATALINA_HOME to the location where Tomcat is installed. If we are using another application server, we may need to update UIMA_CLASSPATH in runUimaClass.bat to include axis\WEB-INF\lib and axis\WEB-INF\classes.

Then run deploytool %FOLDER%\RegExAnnotator.wsdd to deploy the pear as a SOAP service.
Test RegexAnnotator SOAP Service in CVD
Check How to Call a UIMA Service for detail.

We need to define a SOAP service client descriptor, RegExAnnotatorSoapServiceClient.xml:
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
 <resourceType>AnalysisEngine</resourceType>
 <uri>http://localhost:8080/axis/services/urn:RegExAnnotator</uri>
 <protocol>SOAP</protocol>
</uriSpecifier>
Then in CVD, click "Run" -> "Load AE" to load RegExAnnotatorSoapServiceClient.xml, then test it.
Adding RegEx to Extract Other Types of Entities
Regular expressions can be used to extract many types of entities.
We can use the regex from this post: \(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4}) to extract North American phone numbers.
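A quick way to sanity-check the regex before wiring it into UIMA (a minimal Java sketch; note the doubled backslashes in the string literal):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneRegexTest {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("\\(?([0-9]{3})\\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})");
    Matcher m = p.matcher("Call (555) 123-4567 or 555.987.6543.");
    while (m.find()) {
      // prints: (555) 123-4567, then 555.987.6543
      System.out.println(m.group());
    }
  }
}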

To add this feature to UIMA, we create an ExtraRegExAnnotator.xml, similar to RegExAnnotator.xml except that it uses a different concept XML file and type definition:
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>org.apache.uima.annotator.regex.impl.RegExAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>ExtraRegExAnnotator</name>
    <!--configurationParameters omitted here -->
    <configurationParameterSettings>
      <nameValuePair>
        <name>ConceptFiles</name>
        <value><array><string>extra-concepts.xml</string></array></value>
      </nameValuePair>      
    </configurationParameterSettings>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>org.lifelongprogrammer.USAPhoneNumber</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
          <features>
            <featureDescription>
              <name>confidence</name>
              <description/>
              <rangeTypeName>uima.cas.Float</rangeTypeName>
            </featureDescription>            
          </features>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>org.lifelongprogrammer.USAPhoneNumber</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <!-- operationalProperties omited here -->
  </analysisEngineMetaData>
</analysisEngineDescription>
extra-concepts.xml (copy it to axis.war\WEB-INF\classes):
<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://incubator.apache.org/uima/regex"
 xsi:schemaLocation="concept.xsd">
  <concept name="usaPhoneNumberDetection">
    <rules>
      <rule regEx="\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
    matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"
    confidence="1.0" />
    </rules>
    <createAnnotations>
      <annotation id="usaPhoneNumber"
    type="org.lifelongprogrammer.USAPhoneNumber">
        <begin group="0" />
        <end group="0" />
        <setFeature name="confidence" type="Confidence" />
      </annotation>
    </createAnnotations>
  </concept>  
</conceptSet>
The web services deployment descriptor, Deploy_ExtraRegExAnnotator.wsdd, is similar to Deploy_RegExAnnotator.wsdd.

The SOAP service client descriptor, ExtraRegExAnnotatorSoapServiceClient.xml, is similar to RegExAnnotatorSoapServiceClient.xml. Load it in CVD and test it.
Integrate UIMA RegexAnnotator with Solr
Solr uses UIMAUpdateRequestProcessorFactory to send the text to the SOAP web service and parse the SOAP response when a document is added to Solr; UIMAUpdateRequestProcessorFactory then saves the UIMA-extracted information into Solr.


In order to call the SOAP web services, we first need to put the SOAP service client descriptors RegExAnnotatorSoapServiceClient.xml and ExtraRegExAnnotatorSoapServiceClient.xml in the solr/collection1/conf folder.

Then wrap the two SOAP services urn:RegExAnnotator and urn:ExtraRegExAnnotator in an aggregate analysis engine (saved as RegExAnnotatorAE.xml, which is referenced in the update chain below).
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="RegExAnnotatorService">
      <import location="RegExAnnotatorSoapServiceClient.xml"/>
    </delegateAnalysisEngine>   
    <delegateAnalysisEngine key="ExtraRegExAnnotatorService">
      <import location="ExtraRegExAnnotatorSoapServiceClient.xml"/>
    </delegateAnalysisEngine>  
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>AllRegExAnnotatorService</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
        <node>RegExAnnotatorService</node>
        <node>ExtraRegExAnnotatorService</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <!-- capabilities omitted -->
    <!-- operationalProperties omitted -->
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
Then define the update chain uima-regex in solrconfig.xml:
<updateRequestProcessorChain name="uima-regex" default="true">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        </lst>
        <str name="analysisEngine">file:///%REPLACE_THIS%\solr\collection1\conf\RegExAnnotatorAE.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
        <lst name="type">
            <str name="name">org.lifelongprogrammer.USAPhoneNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">usaphone_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.EmailAddress</str>
            <lst name="mapping">
              <str name="feature">normalizedEmail</str>
              <str name="field">email_mxf</str>
            </lst>
          </lst>          
          <lst name="type">
            <str name="name">org.apache.uima.ISBNNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">isbn_mxf</str>
            </lst>
          </lst>

          <lst name="type">
            <str name="name">org.apache.uima.MoneyAmount</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">money_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.CreditCardNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">creditcard_mxf</str>
            </lst>
            <lst name="mapping">
              <str name="feature">cardType</str>
              <str name="field">creditcardType_mxf</str>
            </lst>
          </lst>
        </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
After that we can call:
http://localhost:8080/solr/update?update.chain=uima-regex&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>
Then run http://localhost:8080/solr/select?q=id:1, and we can see that it extracts entities like phone numbers, emails, credit cards, ISBNs, etc.
Resources
Text Mining: Integrate OpenNLP, UIMA and Solr via SOAP Web Service

Using Powershell Get-ChildItem


List All Java Files in All Subfolders
gci -Recurse -filter *.java | % { $_.FullName }

Use gci | gm to check the type of $_ and its properties and methods:
$_ is of type System.IO.FileInfo and has properties like Name, BaseName, FullName, and Length.

Miscs
If we want to show hidden or system files as well, use -Force, as in the sketch below.
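A sketch combining both (same pipeline as above):
gci -Recurse -Force -filter *.java | % { $_.FullName }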

Text Mining: Integrate OpenNLP, UIMA AS and Solr


In this series, I will introduce how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
Talk about how to install UIMA, build OpenNLP pear, and run OpenNLP pear in CVD or UIMA Simple Server. 
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
Talk about how to deploy OpenNLP UIMA pear as SOAP web service, and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
Talk about how to deploy OpenNLP UIMA pear as UIMA AS Service, and integrate it with Solr.

Please refer to part 1 for how to install UIMA and build the OpenNLP UIMA pear.
Deploy OpenNLP Pear as UIMA AS Service
UIMA AS (Asynchronous Scaleout) is the next-generation scalability replacement for the Collection Processing Manager (CPM).

Download the UIMA AS binary package and unzip it, then run bin/startBroker.bat to start the ActiveMQ broker, which must be running before UIMA AS services can be deployed.

Then use deployAsyncService.bat (or deployAsyncService.sh on Linux) to deploy UIMA AS services: deployAsyncService.sh [testDD.xml] [-brokerURL url]

In order to deploy a pear, we have to use UIMA AS version 2.4.2 or newer - 2.3.1 doesn't work.

First unzip OpenNlpTextAnalyzer.pear to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer.
Create the pear descriptor opennlp.uima.OpenNlpTextAnalyzer_pear.xml in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer.
<?xml version="1.0" encoding="UTF-8"?>
<pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
    <pearPath>%PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer</pearPath>
</pearSpecifier>
Then in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer, create a UIMA AS deployment descriptor, Deploy_OpenNLP.xml, like below. We can refer to the AS deployment descriptors in uima-as-%version%-bin\examples\deploy\as.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDeploymentDescription
  xmlns="http://uima.apache.org/resourceSpecifier">
  <name>OpenNLP Text Analyzer</name>
  <description>Deploys OpenNLP text analyzer.</description>

  <deployment protocol="jms" provider="activemq">
    <service>
      <inputQueue endpoint="OpenNLP-service"
brokerURL="tcp://localhost:61616"/>
      <topDescriptor>
       <import location="opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
      </topDescriptor>
    </service>
  </deployment> 
</analysisEngineDeploymentDescription>
Then run: deployAsyncService.cmd %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\Deploy_OpenNLP.xml
Test OpenNLP Pear UIMA AS Service in CVD
Referring to uima-as-%version%-bin\examples\descriptors\as\MeetingDetectorAsyncAE.xml, we need to create a client descriptor, OpenNLPAsyncAEClient.xml; we just need to change the endpoint to OpenNLP-service.
<customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
   <resourceClassName>org.apache.uima.aae.jms_adapter.JmsAnalysisEngineServiceAdapter</resourceClassName>
   <parameters>
     <parameter name="brokerURL" value="tcp://localhost:61616"/>
     <parameter name="endpoint" value="OpenNLP-service"/>
     <parameter name="timeout" value="5000"/>
     <parameter name="getmetatimeout" value="5000"/>
     <parameter name="cpctimeout" value="5000"/>
   </parameters>
</customResourceSpecifier>
Then in CVD, click "Run" -> "Load AE" to load OpenNLPAsyncAEClient.xml, then test it.
Integrate OpenNLP-UIMA with Solr
We can use Solr's UIMAUpdateRequestProcessorFactory to send the text to the OpenNLP UIMA AS service for analysis when a document is added to Solr; UIMAUpdateRequestProcessorFactory then saves the UIMA-extracted information into Solr.

In order to call the UIMA AS service, we first need to put the service client descriptor OpenNLPAsyncAEClient.xml in the solr/collection1/conf folder.
Then wrap the UIMA AS service as part of an aggregate analysis engine.

We create an analysis engine descriptor file, AggragateOpenNLPAsyncService.xml, like below.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPAsyncAE">
      <import location="OpenNLPAsyncAEClient.xml"/>
    </delegateAnalysisEngine>    
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ExtServicesAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
         <node>OpenNLPAsyncAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs/>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Define dynamicField *_mxf in schema.xml:
<dynamicField name="*_mxf" type="text" indexed="true" stored="true"  multiValued="true"/>
Now we update solrconfig.xml to include this UIMA analysis engine.
<updateRequestProcessorChain name="opennlp-uima-as" default="true">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        </lst>
        <str name="analysisEngine">%REPLACE_THIS%\AggragateOpenNLPAsyncService.xml.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">opennlp.uima.Date</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">date_mxf</str>
            </lst>
          </lst>           
          <lst name="type">
            <str name="name">opennlp.uima.Location</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">location_mxf</str>
            </lst>
          </lst> 

          <lst name="type">
            <str name="name">opennlp.uima.Money</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">money_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">opennlp.uima.Organization</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">organization_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">opennlp.uima.Percentage</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">percentage_mxf</str>
            </lst>
          </lst> 
          <lst name="type">
            <str name="name">opennlp.uima.Sentence</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">sentence_mxf</str>
            </lst>
          </lst> 
          <lst name="type">
            <str name="name">opennlp.uima.Time</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">time_mxf</str>
            </lst>
          </lst>           
          <lst name="type">
            <str name="name">opennlp.uima.Person</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">person_mxf</str>
            </lst>
          </lst>           
        </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
After that we can call:
http://localhost:8080/solr/update?update.chain=opennlp-uima-as&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>

Then run http://localhost:8080/solr/select?q=id:1, and we can see that it extracts entities like organization, person name, location, time, date, money, percentage, etc.
Resources
UIMA Documentation Overview
UIMA Asynchronous Scaleout Documentation Overview
Refer to Re: Error deploying pear on AS 2.4.2

Text Mining: Integrate OpenNLP, UIMA and Solr via SOAP Web Service


In this series, I will introduce how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
Talk about how to install UIMA, build OpenNLP pear, and run OpenNLP pear in CVD or UIMA Simple Server. 
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
Talk about how to deploy OpenNLP UIMA pear as SOAP web service, and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
Talk about how to deploy OpenNLP UIMA pear as UIMA AS Service, and integrate it with Solr.

Please refer to part 1 for how to install UIMA and build the OpenNLP UIMA pear.
Deploy OpenNLP Pear as SOAP Web Service
Check Working with Remote Services to figure out how to deploy a UIMA component as a SOAP web service. 
First unzip the OpenNlpTextAnalyzer.pear to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer. 
Create the pear descriptor opennlp.uima.OpenNlpTextAnalyzer_pear.xml in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer like below.

<?xml version="1.0" encoding="UTF-8"?>
<pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <pearPath>%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer</pearPath>
</pearSpecifier>
Then create the web services deployment descriptor, Deploy_OpenNLP.wsdd. Example WSDD files are provided in the examples/deploy/soap directory of the UIMA SDK. All we need to do is copy one WSDD (for example, Deploy_NamesAndPersonTitles.wsdd), change the service name to urn:OpenNLP, and change resourceSpecifierPath to point to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\opennlp.uima.OpenNlpTextAnalyzer_pear.xml. Replace %PEARS_HOME_REPLACE_THIS% with the real location.
<deployment name="OpenNLP" xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:OpenNLP" provider="java:RPC">
    <paramater name="scope" value="Request"/>
    <parameter name="className" value="org.apache.uima.adapter.soap.AxisAnalysisEngineService_impl"/>
    <parameter name="allowedMethods" value="getMetaData process"/>
    <parameter name="allowedRoles" value="*"/>
    <parameter name="resourceSpecifierPath" value="%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
    <parameter name="numInstances" value="3"/>
    <parameter name="enableLogging" value="true"/>
    <!-- typeMapping omitted -->
  </service>
</deployment>
If we are using Tomcat, set CATALINA_HOME to the location where Tomcat is installed. If we are using another application server, we may need to update UIMA_CLASSPATH in runUimaClass.bat to include axis\WEB-INF\lib and axis\WEB-INF\classes.

Then run deploytool %FOLDER%\Deploy_OpenNLP.wsdd to deploy the pear as a SOAP service.
Test OpenNLP SOAP Service in CVD
Check How to Call a UIMA Service for detail.
We need to define a SOAP service client descriptor, OpenNLPSOAPServiceClient.xml:
<?xml version="1.0" encoding="UTF-8" ?> 
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
 <resourceType>AnalysisEngine</resourceType>
 <uri>http://localhost:8080/axis/services/urn:OpenNLP</uri>
 <protocol>SOAP</protocol>
</uriSpecifier>
Then in CVD, click "Run" -> "Load AE" to load OpenNLPSOAPServiceClient.xml, then test it.

Integrate OpenNLP-UIMA with Solr
We can use Solr's UIMAUpdateRequestProcessorFactory to send the text to the OpenNLP-UIMA SOAP web service for analysis when a document is added to Solr; UIMAUpdateRequestProcessorFactory then saves the UIMA-extracted information into Solr.

In order to call the SOAP web service, we first need to put the SOAP service client descriptor OpenNLPSOAPServiceClient.xml in the solr/collection1/conf folder.
Then wrap the SOAP service as part of an aggregate analysis engine.

We create an analysis engine descriptor file, AggragateOpenNLPSOAPService.xml, like below.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPSOAPService">
      <import location="OpenNLPSOAPServiceClient.xml"/>
    </delegateAnalysisEngine>    
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ExtServicesAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
         <node>OpenNLPSOAPService</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs/>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
Define dynamicField *_mxf in schema.xml:
<dynamicField name="*_mxf" type="text" indexed="true" stored="true"  multiValued="true"/>

Now we update solrconfig.xml to include this UIMA analysis engine.
<updateRequestProcessorChain name="opennlp-uima-soap" default="true">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">%REPLACE_THIS%\AggragateOpenNLPSOAPService.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">opennlp.uima.Date</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">date_mxf</str>
          </lst>
        </lst>           
        <lst name="type">
          <str name="name">opennlp.uima.Location</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">location_mxf</str>
          </lst>
        </lst> 

        <lst name="type">
          <str name="name">opennlp.uima.Money</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">money_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Organization</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">organization_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Percentage</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">percentage_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Sentence</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">sentence_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Time</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">time_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Person</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">person_mxf</str>
          </lst>
        </lst>           
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain> 
After that we can call:
http://localhost:8080/solr/update?update.chain=opennlp-uima-soap&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>

Then run http://localhost:8080/solr/select?q=id:1, and we can see that it extracts entities like organization, person name, location, time, date, money, percentage, etc.

Text Mining: Integrate OpenNLP with UIMA


In this series, I will introduce how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
Talk about how to install UIMA, build OpenNLP pear, and run OpenNLP pear in CVD or UIMA Simple Server. 
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
Talk about how to deploy OpenNLP UIMA pear as SOAP web service, and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
Talk about how to deploy OpenNLP UIMA pear as UIMA AS Service, and integrate it with Solr.

Installing the UIMA SDK
Follow README in UIMA SDK: section 2. Installation and Setup
Set JAVA_HOME, UIMA_HOME
Append %UIMA_HOME%/bin to your PATH
Run %UIMA_HOME%/bin/adjustExamplePaths.bat
Build OpenNLP UIMA Pear
Follow instruction at OpenNLP UIMA
Download latest source code, go to %opennlp_src_home%\opennlp, type mvn install.
Go to %opennlp_src_home%\opennlp-uima, type ant -f createPear.xml.

The built OpenNlpTextAnalyzer.pear would be in %opennlp_src_home%\opennlp-uima\target folder.

Run OpenNLP Pear in UIMA Cas Visual Debugger
Run set UIMA_JVM_OPTS=-Xms128M -Xmx8g to adjust the JVM heap size; we can also change this in runUimaClass.bat or add it to the system environment.

Execute runPearInstaller.bat, point it to the built OpenNlpTextAnalyzer.pear in the PEAR file field, and specify an installation directory, for example: %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer

To run the OpenNLP analysis engine, click "Run your AE in the CAS", paste some text, then click "Run" -> "Run OpenNlpTextAnalyzer", or use the shortcut key Ctrl+R.

Later, we can run cvd.bat, click "Run" -> "Load AE", browse to the location where the OpenNLP pear is installed, and select %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\opennlp.uima.OpenNlpTextAnalyzer_pear.xml; paste some text, then click "Run" -> "Run OpenNlpTextAnalyzer".
Deploy OpenNLP Pear in UIMA Simple Server
We can deploy the OpenNLP pear in a web server as a REST service. Please follow the UIMA Simple Server User Guide to build the war:
http://uima.apache.org/downloads/sandbox/simpleServerUserGuide/simpleServerUserGuide.html

Then copy OpenNlpTextAnalyzer.pear to WEB-INF/resources, add opennlp servlet in web.xml:
<servlet>
  <servlet-name>opennlp</servlet-name>
  <servlet-class>
      org.apache.uima.simpleserver.servlet.SimpleServerServlet
  </servlet-class>
  <!-- Define the path to the pear file -->
  <init-param>
      <param-name>PearPath</param-name>
      <param-value>
          WEB-INF/resources/OpenNlpTextAnalyzer.pear
      </param-value>
  </init-param>
</servlet>
<servlet-mapping>
  <servlet-name>opennlp</servlet-name>
  <url-pattern>/opennlp</url-pattern>
</servlet-mapping>
Browse to http://localhost:8080/uima-server/opennlp
Go to http://localhost:8080/uima-server/opennlp?mode=form, type some text, then hit the "Submit Query" button.

Or we can send a GET request: http://localhost:8080/uima-server/opennlp?text=some_text_here
Or send a POST request:
curl http://localhost:8080/uima-server/opennlp -X POST -d "text=some_text_here"

Resources
UIMA Documentation Overview
UIMA Asynchronous Scaleout Documentation Overview

Create Extensible Ant Script


When we write code, we need to consider flexibility, extensibility, and maintainability. These rules also apply when we write Ant scripts or batch files.
The Task
In our Solr-related project, there are multiple cores, and we may add more cores in the future. All cores share some common configuration files with the default core, collection1. Every core has its own schema.xml and solrconfig.xml.
So we put the default and shared core configuration files in solr.zip. The file structure looks like this:
solr.zip
solr/collection1/conf/schema.xml
solr/collection1/conf/solrconfig.xml
solr/core-app1/conf/schema.xml
solr/core-app1/conf/solrconfig.xml

As we may add more cores in the future, we want our Ant build.xml to stay stable: we shouldn't have to update build.xml when we add new cores.

The Solution
In the build.xml script, we use ant-contrib's foreach task to iterate over all core folders (folder names starting with core-) and execute the create-core target for each one.
<ac:foreach target="create-core" param="coreFolder">
  <path>
    <dirset id="cores" dir="solr/">
      <include name="core-*"/>
    </dirset>
  </path>
</ac:foreach>
The Complete Ant Script
<project name="SolrProject" basedir="." default="all" xmlns:ac="antlib:net.sf.antcontrib">
  <dirname  property="project-basedir"  file="${ant.file.SolrProject}"/>
  <taskdef resource="net/sf/antcontrib/antlib.xml" uri="antlib:net.sf.antcontrib">
    <classpath>
      <pathelement location="BuildTools\apache-ant-1.8.4\lib\ant-contrib-1.0b3.jar"/>
    </classpath>
  </taskdef>
  <property name="application-name" value="SolrProject" />
  <property name="main-build-path" value="${project-basedir}/build" />
  <property name="project-build-path" value="${main-build-path}/${application-name}" />
  <property name="solr-zip" value="solr.zip" />
  <target name="clean">
    <delete dir="${main-build-path}" />
    <delete dir="${main-dist-path}" />
  </target>
  <target name="init">
    <mkdir dir="${main-build-path}" />
    <mkdir dir="${project-build-path}" />
  </target>
  <target name="build-package" depends="clean, init">
    <!-- copy configuration files to default core: collection1 -->
    <unzip src="${solr-zip}" dest="${project-build-path}/solr/collection1/conf" >
      <mapper>
        <globmapper from="collection1/conf/*" to="*"/>
      </mapper>
    </unzip>
    <!-- 
    <dirset id="cores" dir="solr/">
      <include name="core-*"/>
    </dirset>
    The pathconvert task can be used to format a dirset with arbitrary separators 
    <pathconvert dirsep="/" pathsep="," property="dirs" refid="cores"/>
    <echo message="${dirs}"/>
    -->
    <ac:foreach target="create-core" param="coreFolder">
      <path>
        <dirset id="cores" dir="solr/">
          <include name="core-*"/>
        </dirset>
      </path>
    </ac:foreach>
  </target>
  <target name="create-core">
    <basename property="basename" file="${coreFolder}"/>
    <echo message="file: ${coreFolder}, basename: ${basename}" />
    <mkdir dir="${project-build-path}/solr/${basename}/conf" />
    <unzip src="${solr-zip}" dest="${project-build-path}/solr/${basename}/conf" >
      <mapper>
        <globmapper from="collection1/conf/*" to="*"/>
      </mapper>
    </unzip>
  </target>
</project>
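A usage sketch (assuming ant-contrib sits at the path configured in the taskdef above): running the default target unzips the shared collection1 configuration into build/SolrProject/solr for the default core and for every core-* folder it finds.
ant -f build.xml build-package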

Maven: Invalid maximum heap size


The Problem
I followed the stanbol tutorial to build stanbol, which says to run:
export MAVEN_OPTS="-Xmx1024M -XX:MaxPermSize=256M"

As I was building stanbol on Windows, I changed it to:
set MAVEN_OPTS="-Xmx1024M -XX:MaxPermSize=256M"

Then when I ran "mvn clean install -Dmaven.test.skip=true", it failed with this error:
Invalid maximum heap size: -Xmx1024M -XX:MaxPermSize=256M
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

The Solution
Check mvn.bat: MAVEN_OPTS is passed as JVM parameters directly:
SET MAVEN_JAVA_EXE="%JAVA_HOME%\bin\java.exe"
%MAVEN_JAVA_EXE% %MAVEN_OPTS% -classpath %CLASSWORLDS_JAR% "-Dclassworlds.conf=%M2_HOME%\bin\m2.conf" "-Dmaven.home=%M2_HOME%" %CLASSWORLDS_LAUNCHER% %MAVEN_CMD_LINE_ARGS%

If we run set MAVEN_OPTS="-Xmx1024M -XX:MaxPermSize=256M", the previous command becomes %MAVEN_JAVA_EXE% "-Xmx1024M -XX:MaxPermSize=256M" ... - on Windows, set keeps the double quotes as part of the variable's value.

If we run java "-Xmx1024M -XX:MaxPermSize=256M" -version, we will see the same error message: the whole quoted string is passed to the JVM as a single, invalid option.

The solution is simple. We can run:
set "MAVEN_OPTS=-Xmx1024M -XX:MaxPermSize=256M"
or remove the double quotes completely:
set MAVEN_OPTS=-Xmx1024M -XX:MaxPermSize=256M

Either way, the value of MAVEN_OPTS has no surrounding double quotes.
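We can verify the resulting value (it should print without surrounding quotes):
echo %MAVEN_OPTS%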

Now we know why the Windows batch file (mvn.bat) failed. But why does the Linux version succeed? Check the Linux shell script mvn: it doesn't call the java command directly, but calls exec, and on Linux the shell strips the double quotes during the export assignment, so the unquoted $MAVEN_OPTS expands into separate arguments.
exec "$JAVACMD" \
  $MAVEN_OPTS \
  -classpath "${M2_HOME}"/boot/plexus-classworlds-*.jar \
  "-Dclassworlds.conf=${M2_HOME}/bin/m2.conf" \
  "-Dmaven.home=${M2_HOME}"  \
  ${CLASSWORLDS_LAUNCHER} "$@"

Compile Using Memory Allocation Enhancements in pom.xml

Customizing Eclipse Project Names With maven-eclipse-plugin


The Problem
When learning Apache UIMA, I need to import it into Eclipse. I also need to switch between different UIMA versions, 2.4.2 and the latest (trunk), frequently.

I can use maven-eclipse-plugin to import UIMA projects into Eclipse, but I have to add a prefix to these UIMA projects so the two versions can coexist in one workspace.

The Solution
Check maven-eclipse-plugin: it provides several ways to change the Eclipse project name.
addGroupIdToProjectName
If set to true, the groupId of the artifact is appended to the name of the generated Eclipse project.
Expression: ${eclipse.addGroupIdToProjectName}

addVersionToProjectName
If set to true, the version number of the artifact is appended to the name of the generated Eclipse project. See projectNameTemplate for other options.
Expression: ${eclipse.addVersionToProjectName}

projectNameTemplate
Allows configuring the name of the Eclipse projects. This property, if set, wins over addVersionToProjectName and addGroupIdToProjectName. You can use [groupId], [artifactId] and [version] variables, e.g. [groupId].[artifactId]-[version]
Expression: ${eclipse.projectNameTemplate}

So now the solution is obvious: we can add -Declipse.addVersionToProjectName=true:
mvn -Declipse.addVersionToProjectName=true eclipse:eclipse
The project name would be updated to uimaj-core-2.4.2 or uimaj-core-2.5.1.


Or we can add -Declipse.projectNameTemplate=Prefix-[artifactId]-[version], as in the sketch below.
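For example (Prefix is a placeholder; quote the property so the shell doesn't mangle the brackets):
mvn "-Declipse.projectNameTemplate=Prefix-[artifactId]-[version]" eclipse:eclipse
The imported projects would then be named like Prefix-uimaj-core-2.4.2.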

Miscs
Use -DdownloadSources=true to tell maven to download source code.

Resource
Maven Eclipse Plugin
Get source jar files attached to Eclipse for Maven-managed dependencies

Windows C++: Handling unicode file names


The Problem
In C++, I used to use ifstream and ofstream to read and write files and to determine file existence.

But in a recent project, as file names may be Unicode, we need to use wide strings (wchar_t). It turns out that ifstream and ofstream don't work with Unicode file names.

The program is simple: it checks whether a file exists; if not, it writes some text into it; if it exists, it does nothing.
The code that doesn't work

#include <fstream>
#include <iostream>
using namespace std;

void codeDoesnotwork()
{
 wchar_t* filePath =L"E:\\tmp\\test.txt";
 ifstream infile(filePath);
 if (!infile.is_open() || 
  !infile.good())
 {
  cout << "file doesn't exist" << endl;
 }
 else
 {
  // it always goes here, even when the file doesn't exist.
  cout << "file exists" << endl;
 }

 // ofstream << or write doesn't work either.
 ofstream ofs(filePath,ios::out | ios::trunc);
 ofs<< "hello world!" << endl;
 ofs.write("1", 1);
 ofs.close();
}
This is because ifstream and ofstream don't work with wide strings.
The code that works
#include <cassert>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <iostream>
using namespace std;
typedef unsigned int uint;
void codeThatWorks()
{
 wchar_t *installFolder=L"E:\\tmp";
 wchar_t *fileName = L"test.txt";
 uint fileNameBufferSize = wcslen(installFolder) + wcslen(fileName) + 2;

 wchar_t *fileNameBuffer = new wchar_t[fileNameBufferSize];
 assert(fileNameBuffer);
 memset(fileNameBuffer, 0, fileNameBufferSize * sizeof(wchar_t)); 
 swprintf_s(fileNameBuffer, fileNameBufferSize, L"%s/%s", installFolder, fileName);

 FILE* oFile;
 oFile = _wfopen(fileNameBuffer,L"r");
 if(oFile==NULL)
 {
  cout << "file doesn't exist" << endl;
  //FILE* oFile;
  oFile = _wfopen(fileNameBuffer,L"w+");
  fprintf(oFile,"%s", "hello world");  
 }
 else
 {
  cout << "file exists" << endl;
 }
 fclose(oFile);
 delete[] fileNameBuffer;
}
Write Unicode Content to File
If we need to write Unicode content to a file, we should use _wfopen or wofstream to write wide strings. A minimal sketch (reusing the COMMON_FILE_PATH constant):
FILE* f = _wfopen(COMMON_FILE_PATH, L"w, ccs=UTF-16LE");
fwprintf(f, L"%s", L"hello wide world");
fclose(f);

Use Shlwapi.lib PathFileExists
We can use Shlwapi.lib's PathFileExists to check file existence:
BOOL exist = PathFileExists(fileNameBuffer);
Also, in order to use Shlwapi.h, we need to add the Shlwapi.lib library: right-click the project -> "Configuration Properties" -> "Linker" -> "Input" -> "Additional Dependencies" -> "Edit", then type "Shlwapi.lib" in the text box.

Resource
MSDN fopen, _wfopen
Wrote to a file using std::wofstream

Using Fiddler to Capture Http Requests of a Java Application


Task1: Using Fiddler as a Proxy to Monitor Request from a Java Application
To configure a Java application to send web traffic to Fiddler, add the following parameters to the JVM:
-Dhttp.proxyHost=proxyHost -Dhttp.proxyPort=8888

The proxy (here, Fiddler) and the Java application can run on different machines. In that case, ensure "allow remote clients to connect" is checked via "Tools" -> "Fiddler Options" -> "Connections" -> "Allow remote computers to connect", then restart Fiddler.
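Equivalently, we can set the same properties programmatically before the application makes its first request (a minimal sketch; localhost assumes Fiddler runs on the same machine):
System.setProperty("http.proxyHost", "localhost");
System.setProperty("http.proxyPort", "8888");
// repeat for HTTPS traffic if needed:
System.setProperty("https.proxyHost", "localhost");
System.setProperty("https.proxyPort", "8888");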

Use Apache HttpClient SystemDefaultHttpClient
If we are using Apache HttpClient 4.2 or newer, we can use SystemDefaultHttpClient, which honors standard system properties.

In 4.3 or newer, we can use HttpClientBuilder; note that it only honors these system properties if useSystemProperties() is called (or use HttpClients.createSystem()):
HttpClient client = HttpClientBuilder.create().useSystemProperties().build();

Task2: Monitor Requests in a Web Server
We want to monitor requests and responses in a Java web server, for example one running at server1:8080.

Solution: Using Fiddler as a Reverse Proxy
We can use Fiddler as a reverse proxy. First we change Fiddler to listen on port 8080 via "Tools" -> "Fiddler Options" -> "Connections": change the port number and allow remote computers to connect, then restart Fiddler if prompted.

Then change web server to run at another port, for example: 9090.

Create a FiddlerScript Rule
Then write a custom rule to tell Fiddler to forward requests sent to server1:8080 (the port Fiddler is listening on) to server1:9090, where the real web server is listening.

Click Rules > Customize Rules.
Inside the OnBeforeRequest handler, add a new line of code:
if (oSession.host.toLowerCase() == "server1:8080") oSession.host = "server1:9090";

Troubleshooting: Enable "Help" -> "Troubleshooting Filters..." 
If for some reason Fiddler is not capturing the request, we can enable the option "Help" -> "Troubleshooting Filters...". This shows all traffic but strikes out the requests that would be excluded by the current filters, and it adds a comment about why each request would be hidden.

Resources
The Fiddler Proxy
Using Fiddler as a Reverse Proxy

Regular Expression in Action: Remove or Merge Empty Lines


The Problem
When I read some interesting code on the internet, I like to copy it into Eclipse or NetBeans to read and run it. Code on the web is usually not well formatted: many empty lines make it harder to read.
So I like to remove empty lines to make the code look smaller and more concise.

Here regex comes into play.

Remove Empty Lines

Find: ^\s*\n
Replace: (empty)

Merging Empty Lines
Find: ^\s*\n
Replace: \n (\r\n on Windows)
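The same cleanup can be done programmatically; a minimal Java sketch using the multiline flag (?m) so ^ matches at every line start:

import java.util.regex.Pattern;

public class EmptyLines {
  public static void main(String[] args) {
    String code = "int a;\n\n\n\nint b;\n";
    // remove empty (whitespace-only) lines completely
    String removed = Pattern.compile("(?m)^\\s*\\n").matcher(code).replaceAll("");
    // merge each run of empty lines into a single empty line
    String merged = Pattern.compile("(?m)^\\s*\\n").matcher(code).replaceAll("\n");
    System.out.println(removed); // "int a;" and "int b;" on consecutive lines
    System.out.println(merged);  // "int a;" and "int b;" separated by one empty line
  }
}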

Next, use Ctrl+Shift+F in Eclipse or Alt+Shift+F in NetBeans to format the code.

Online Regex Tools
http://regex101.com/
It explains the meaning of the regular expression, and we can test it:
  • /^\s*\n/
    • ^ assert position at start of the string
    • \s* matches any whitespace character [\r\n\t\f ]
      • Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    • \n matches a line-feed (newline) character (ASCII 10)
http://www.regexplanet.com/
Here we can test our regular expression in almost all languages; it can also generate the corresponding Java or .NET string, so we don't have to manually escape special characters such as \ or " in the regex string.
