Solr Deep Pagination Problem Fixed in SOLR-5463


The Problem
We needed to iterate over millions of documents and dump them into a file. The data is spread across multiple Solr servers running on multiple virtual machines, each with just 4 GB of memory.
When I tried to run the dump task, these VMs froze completely for hours, with memory usage above 98%.

The cause is a long-standing problem in Solr:
When we request rows 1,000,000 to 1,001,000, Solr has to load 1,001,000 sorted documents from the index and then return only the last 1000.

In the case of SolrCloud, the problem gets even worse: every shard has to sort 1,001,000 documents and send them all to one destination Solr server, which then iterates over all of them to pick out the 1000 it needs.
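To make the cost concrete, here is a rough back-of-the-envelope sketch in Python. The function name docs_touched is made up for illustration, and the model is deliberately simplified: it only counts sorted entries per shard and at the merging node, as described above, and ignores what exactly is sent over the wire.

```python
def docs_touched(start, rows, shards=1):
    """Simplified model of offset-based paging cost.

    Returns (entries each shard must sort, entries the merging node receives).
    """
    per_shard = start + rows      # every shard collects its top start+rows entries
    merged = per_shard * shards   # the coordinating node sees all of them
    return per_shard, merged


print(docs_touched(1_000_000, 1000))            # (1001000, 1001000)
print(docs_touched(1_000_000, 1000, shards=3))  # (1001000, 3003000)
```

With three shards, as in the setup above, the merging node handles roughly three million entries just to return 1000 rows.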

In older releases, developers found workarounds, such as the one described in Solr Deep Pagination.

Solution: SOLR-5463 (LUCENE-3514)
The upcoming Solr 4.7 solves this problem in SOLR-5463.

The basic idea is:
Get the first 1000 rows:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=*


Parse the response to get the value of nextCursorMark:
<response>
<str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str>
</response>
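A minimal Python sketch of that parsing step, assuming the XML response format shown above, where nextCursorMark is a direct child of the response element. The helper name next_cursor_mark is hypothetical.

```python
import xml.etree.ElementTree as ET


def next_cursor_mark(xml_text):
    """Return the nextCursorMark value from a Solr XML response, or None if absent."""
    root = ET.fromstring(xml_text)
    # Look for <str name="nextCursorMark"> directly under <response>.
    node = root.find("str[@name='nextCursorMark']")
    return None if node is None else node.text


sample = """<response>
<str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str>
</response>"""
print(next_cursor_mark(sample))  # AoJ42tmu/Z4CKTQxMDMyMzEwMw==
```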
Then get the next 1000 rows [1001-2000], passing the URL-encoded nextCursorMark value as cursorMark:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D

Repeat until the nextCursorMark value stops changing, or you have collected as many docs as you need.
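The loop can be sketched in Python roughly as follows. This is only an outline under stated assumptions: fetch_page is a hypothetical callable that you supply, which sends one request with the given cursorMark and returns a (docs, nextCursorMark) pair. The HTTP call and response parsing are left out on purpose.

```python
def iterate_with_cursor(fetch_page, max_docs=None):
    """Yield documents page by page until the cursor stops changing.

    fetch_page(cursor_mark) -> (list_of_docs, next_cursor_mark)  # hypothetical helper
    max_docs: optional cap on how many documents to yield.
    """
    cursor = "*"  # "*" starts the cursor at the beginning
    yielded = 0
    while max_docs is None or yielded < max_docs:
        docs, next_cursor = fetch_page(cursor)
        for doc in docs:
            yield doc
            yielded += 1
            if max_docs is not None and yielded >= max_docs:
                return
        if next_cursor == cursor:  # cursor did not move: no more results
            return
        cursor = next_cursor


# Usage with an in-memory stand-in for Solr (three fake pages):
pages = {"*": (["a", "b"], "m1"), "m1": (["c"], "m2"), "m2": ([], "m2")}
print(list(iterate_with_cursor(lambda mark: pages[mark])))  # ['a', 'b', 'c']
```

Because the function is a generator, each page's documents can be written to the dump file as they arrive instead of being held in memory.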

Basic Usage from SOLR-5463
start must be "0" in every request that uses cursorMark
sort can be anything, but must include the uniqueKey field (as a tie breaker)
rows=N: "N" can be any number of rows you want per page
cursorMark=*: "*" denotes that you want a cursor starting at the beginning
In each subsequent request, replace the "*" value from your initial request params with the nextCursorMark value from the previous response
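These rules can be captured in a small Python helper that builds the query string. It is a sketch, not part of Solr: the function name build_cursor_query and the default uniqueKey of "id" are assumptions, so adjust unique_key to match your schema.

```python
from urllib.parse import urlencode


def build_cursor_query(q, sort, rows, cursor_mark="*", unique_key="id"):
    """Build a query string for a cursor request, enforcing the rules above."""
    # The first token of each sort clause is the field name, e.g. "id" in "id asc".
    sort_fields = [clause.split()[0] for clause in sort.split(",") if clause.strip()]
    if unique_key not in sort_fields:
        raise ValueError("sort must include the uniqueKey field: " + unique_key)
    params = {
        "q": q,
        "sort": sort,
        "rows": rows,
        "start": 0,                 # must always be 0 with cursorMark
        "cursorMark": cursor_mark,  # "*" on the first request
    }
    # urlencode percent-encodes "/" and "=" in the cursor mark.
    return urlencode(params)


print(build_cursor_query("*:*", "accesstime desc,id asc", 1000,
                         cursor_mark="AoJ42tmu/Z4CKTQxMDMyMzEwMw=="))
```

The cursor mark comes out as AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D, the same encoded form used in the second request above.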

Resources
SOLR-5463: Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
LUCENE-3514: deep paging with Sort
Coming Soon to Solr: Efficient Cursor Based Iteration of Large Result Sets
Deep Paging Problem
