This is a valid article, and is considered technically accurate up to Feb. 21, 2010, but has an
update associated with it.
While working on the indexer though, we did manage to get the total indexing time for 300,000 documents from 3 hours (on professional grade hardware) to 10 minutes (on consumer grade hardware). This article will describe how that was done.
The originally indexing algorithm was pretty dumb and naive. Essentially (at this time, we did not use the tf-idf algorithm for the weight, the weight was simply the number of times the token was in a specific zone):
Indexer #1
foreach ( document in universe ) {
foreach ( zone in document ) {
var token_list = document.zone.value;
foreach ( token_list as token ) {
var count = token.count;
var weight = tokenWeight(token.count);
// Load up the token if it exists.
var token_object = new Token(token.value);
// If the token exists, no query will be executed, otherwise, it will be inserted.
// At all times, the token_id is returned.
var token_id = token_object.write();
var sql = "INSERT INTO document_index VALUES(NULL, document_id, zone_id, token_id, count, weight)";
executeSql(sql);
}
}
}
| Document Count |
Running Time |
Memory Usage |
| 5,000 |
28.19s |
.79MB |
| 10,000 |
49.19s |
.79MB |
| 50,000 |
429.95s |
.84MB |
The running times for the Indexer #1 algorithm.
Clearly, this is incredibly slow. For the 300,000 documents we had in our universe, it created 1,000,000 tokens with an index of 10,000,000 rows. This results in a tremendous amount of processing; it requires one INSERT for each of the 10,000,000 rows, another 1,000,000 INSERT’s for each token, and a large number (at most 10,000,000) of SELECT’s to see if a token exists. Even with query logging off, this took more than 3 hours to complete, and generally locked up the server, even with a niceness value of 10-19. Although slow, this algorithm is incredibly memory efficient: performing the entire indexing used less than 1MB of memory.
The first thing we thought of was to do batch inserts. Essentially, only a single INSERT query would be executed for all tokens in a single zone (bolded lines are new):
Indexer #2
foreach ( document in universe ) {
foreach ( zone in document ) {
var token_batch_list = array();
var token_list = document.zone.value;
foreach ( token_list as token ) {
var count = token.count;
var weight = tokenWeight(token.count);
// Load up the token if it exists.
var token_object = new Token(token.value);
// If the token exists, no query will be executed, otherwise, it will be inserted.
// At all times, the token_id is returned.
var token_id = token_object.write();
var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
}
var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
executeSql(sql);
}
}
Thus, a query would look like:
INSERT INTO document_index VALUES (NULL, 1, 1, 1, 3, 1.875), (NULL, 1, 1, 2, 1, 1), (NULL, 1, 1, 3, 2, 1.5)";
Doing the first 5,000 documents with the Index #1 method took 28.19 seconds, and with this method was only a few seconds savings.
Our next thought was to batch an entire document indexing. Because the query is the same for each document regardless of zone or token, we could batch a whole document at a time. We were still executing 1 SELECT and potentially 1 INSERT query per token. Even though the SELECT’s were fast, why have them at all? Simply use the actual code to see if a token is unique, if so, insert it, if not, then load up the Token ID from a large hash map.
Indexer #3
var token_hash = array();
foreach ( document in universe ) {
var token_batch_list = array();
foreach ( zone in document ) {
var token_list = document.zone.value;
foreach ( token_list as token ) {
var count = token.count;
var weight = tokenWeight(token.count);
if ( true === isset(token_hash[token.value]) ) {
token_id = token_hash[token.value];
} else {
// The Token class was updated to not do any selection, simply insert the token value
var token = new Token(token.value);
var token_id = token.write();
token_hash[token.value] = token_id;
}
var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
}
}
var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
executeSql(sql);
}
This was our fastest method yet. Doing the first 5,000 products sped things up to 13.60 seconds. Although memory usage soared (50,000 documents resulted in about 13.28MB of memory), the speed improvements were too dramatic to ignore. Memory usage is only temporary, and with severs having many gigabytes of memory today, using 100MB+ to do a full index was more than worth it. With over 300,000 documents, it would still take about an hour to index the entire universe of documents.
| Document Count |
Running Time |
Memory Usage |
| 5,000 |
13.60s |
2.41MB |
| 10,000 |
23.85s |
2.97MB |
| 50,000 |
248.49s |
13.38MB |
The running times for the Indexer #3 algorithm.
We believed that batching N number of products at a time with the hash would produce significant running time improvements.
Indexer #4
var BATCH_NUMBER = 10;
var batch_count = 0;
var token_hash = array();
foreach ( document in universe ) {
var token_batch_list = array();
foreach ( zone in document ) {
var token_list = document.zone.value;
foreach ( token_list as token ) {
var count = token.count;
var weight = tokenWeight(token.count);
if ( true === isset(token_hash[token.value]) ) {
token_id = token_hash[token.value];
} else {
// The Token class was updated to not do any selection, simply insert the token value
var token = new Token(token.value);
var token_id = token.write();
token_hash[token.value] = token_id;
}
var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
}
}
if ( batch_count == BATCH_NUMBER ) {
var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
executeSql(sql);
batch_count = 0;
}
batch_count++;
}
This lowered the running time tremendously. We did tests with 10, 20, and 100 batch inserts at a time, with each one speeding up the indexing tremendously. (For this test, we did not include the 5,000 and 10,000 document results).
| Document Count |
Running Time |
Memory Usage |
Document Batch Count |
| 50,000 |
212.49s |
13.38MB |
10 |
| 50,000 |
203.49s |
13.44MB |
20 |
| 50,000 |
166.11s |
14.10MB |
100 |
The running times for the Indexer #4 algorithm.
Indexer #4 resulted in an improvement from 429.95s to 166.11s, a 61% reduction in running time from the Indexer #1 algorithm. Even though we had improved it this much, why bother with any SQL statements at all? It is clearly the largest bottleneck of the entire operation.
We were using a MySQL database to store the tokens and index. MySQL has a LOAD DATA INFILE ‘filename.csv’ query which loads in a comma separated file (by default) into a table. By writing two files, one with all of the tokens and one with the document index relations, we could then load these files into the tables. MySQL is very fast at this operation as it has no query parsing/caching overhead. We were already storing each token in memory, so there would be no additionally memory usage. With this, we would no longer need to batch anything either.
Indexer #5
var token_hash = array();
var token_id = 0;
var token_id_current = 0;
foreach ( document in universe ) {
foreach ( zone in document ) {
var token_list = document.zone.value;
foreach ( token_list as token ) {
var count = token.count;
var weight = tokenWeight(token.count);
if ( false === isset(token_hash[token.value]) ) {
fwrite("token.csv", "token_id,token.value");
token_hash[$token] = token_id;
token_id_current = token_id;
token_id++;
} else {
token_id_current = token_hash[token.value];
}
fwrite("document_index.csv", "NULL,document_id,zone_id,token_id_current,count,weight");
}
}
}
unset(token_hash);
executeSql("TRUNCATE `token`");
executeSql("LOAD DATA INFILE 'token.csv' INTO TABLE `token` FIELDS TERMINATED BY ','");
executeSql("TRUNCATE `document_index`");
executeSql("LOAD DATA INFILE 'document_index.csv' INTO TABLE `document_index` FIELDS TERMINATED BY ','");
delete("token.csv");
delete("document_index.csv");
By using the two CSV files, we were able to decrease the execution time for 50,000 products to 88.11s with 13.38MB of memory usage.
| Document Count |
Running Time |
Memory Usage |
| 50,000 |
88.11s |
13.38MB |
| 100,000 |
229.01s |
31.60MB |
| 300,000 |
606.91s |
118.83MB |
The running times for the Indexer #5 algorithm.
At this point, we felt the algorithm could not increase in efficiency at all. The original naive algorithm (Indexer #1) took 10,800 seconds for 300,000 documents. The Indexer #5 algorithm took 607s, a 94% increase in speed.
Conclusion
The famous quote by Donald Knuth rings true, “Preoptimization is the root of all evil.” Initially we spent time trying to reduce the time from O(n3) to something faster. After investigating the real bottlenecks, we were able to get the running time down 94% with a relatively slight increase in memory.
Note
In this article, “we” refers to myself and other programmer, Trevor Andreas.