Update: On Building an Efficient Search Indexer 1

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

This article is a short update to my article on building a fast search indexer for a simple search engine I recently wrote with a fellow programmer.

I wrote the article entirely by myself and made a large mistake. First, the indexer does not run in O(n3) time. It actually runs in O(N) time where N is the number of documents. Because for each document, there are Z number of Zones and within each zone there are T tokens, I mistakenly though it ran in O(n3) time. It’s embarrassing to me that I made such an elementary mistake in computer science. I do have a degree in computer science from a respectable university and I feel that I did a disservice by not ensuring my writing was correct before publishing it. I apologize.

Other commenter’s were correct in saying I should focus my efforts in writing the algorithm to take parallelization into account, or to simply insert the index into a memory/heap table. I will certainly consider both alternatives.

I thank the commenter’s who pointed out my errors as it only helps. Thank you.

Happy programming.

On Building an Efficient Search Indexer 1

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010, but has an update associated with it.

While working on the indexer though, we did manage to get the total indexing time for 300,000 documents from 3 hours (on professional grade hardware) to 10 minutes (on consumer grade hardware). This article will describe how that was done.

The originally indexing algorithm was pretty dumb and naive. Essentially (at this time, we did not use the tf-idf algorithm for the weight, the weight was simply the number of times the token was in a specific zone):

Indexer #1

foreach ( document in universe ) {
  foreach ( zone in document ) {
    var token_list = document.zone.value;
    foreach ( token_list as token ) {
      var count = token.count;
      var weight = tokenWeight(token.count);

      // Load up the token if it exists.
      var token_object = new Token(token.value);

      // If the token exists, no query will be executed, otherwise, it will be inserted.
      // At all times, the token_id is returned.
      var token_id = token_object.write();

      var sql = "INSERT INTO document_index VALUES(NULL, document_id, zone_id, token_id, count, weight)";
      executeSql(sql);
    }
  }
}
Document Count Running Time Memory Usage
5,000 28.19s .79MB
10,000 49.19s .79MB
50,000 429.95s .84MB

The running times for the Indexer #1 algorithm.

Clearly, this is incredibly slow. For the 300,000 documents we had in our universe, it created 1,000,000 tokens with an index of 10,000,000 rows. This results in a tremendous amount of processing; it requires one INSERT for each of the 10,000,000 rows, another 1,000,000 INSERT’s for each token, and a large number (at most 10,000,000) of SELECT’s to see if a token exists. Even with query logging off, this took more than 3 hours to complete, and generally locked up the server, even with a niceness value of 10-19. Although slow, this algorithm is incredibly memory efficient: performing the entire indexing used less than 1MB of memory.
The first thing we thought of was to do batch inserts. Essentially, only a single INSERT query would be executed for all tokens in a single zone (bolded lines are new):

Indexer #2

foreach ( document in universe ) {
  foreach ( zone in document ) {
    var token_batch_list = array();
    var token_list = document.zone.value;
    foreach ( token_list as token ) {
      var count = token.count;
      var weight = tokenWeight(token.count);

      // Load up the token if it exists.
      var token_object = new Token(token.value);

      // If the token exists, no query will be executed, otherwise, it will be inserted.
      // At all times, the token_id is returned.
      var token_id = token_object.write();

      var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
    }

    var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
    executeSql(sql);
  }
}

Thus, a query would look like:

INSERT INTO document_index VALUES (NULL, 1, 1, 1, 3, 1.875), (NULL, 1, 1, 2, 1, 1), (NULL, 1, 1, 3, 2, 1.5)";

Doing the first 5,000 documents with the Index #1 method took 28.19 seconds, and with this method was only a few seconds savings.
Our next thought was to batch an entire document indexing. Because the query is the same for each document regardless of zone or token, we could batch a whole document at a time. We were still executing 1 SELECT and potentially 1 INSERT query per token. Even though the SELECT’s were fast, why have them at all? Simply use the actual code to see if a token is unique, if so, insert it, if not, then load up the Token ID from a large hash map.

Indexer #3

var token_hash = array();
foreach ( document in universe ) {
  var token_batch_list = array();
  foreach ( zone in document ) {
    var token_list = document.zone.value;
    foreach ( token_list as token ) {
      var count = token.count;
      var weight = tokenWeight(token.count);

      if ( true === isset(token_hash[token.value]) ) {
        token_id = token_hash[token.value];
      } else {
        // The Token class was updated to not do any selection, simply insert the token value
        var token = new Token(token.value);
        var token_id = token.write();
        token_hash[token.value] = token_id;
      }

      var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
    }
  }

  var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
  executeSql(sql);
}

This was our fastest method yet. Doing the first 5,000 products sped things up to 13.60 seconds. Although memory usage soared (50,000 documents resulted in about 13.28MB of memory), the speed improvements were too dramatic to ignore. Memory usage is only temporary, and with severs having many gigabytes of memory today, using 100MB+ to do a full index was more than worth it. With over 300,000 documents, it would still take about an hour to index the entire universe of documents.

Document Count Running Time Memory Usage
5,000 13.60s 2.41MB
10,000 23.85s 2.97MB
50,000 248.49s 13.38MB

The running times for the Indexer #3 algorithm.

We believed that batching N number of products at a time with the hash would produce significant running time improvements.

Indexer #4

var BATCH_NUMBER = 10;
var batch_count = 0;
var token_hash = array();
foreach ( document in universe ) {
  var token_batch_list = array();
  foreach ( zone in document ) {
    var token_list = document.zone.value;
    foreach ( token_list as token ) {
      var count = token.count;
      var weight = tokenWeight(token.count);

      if ( true === isset(token_hash[token.value]) ) {
        token_id = token_hash[token.value];
      } else {
        // The Token class was updated to not do any selection, simply insert the token value
        var token = new Token(token.value);
        var token_id = token.write();
        token_hash[token.value] = token_id;
      }

      var token_batch_list[] = "(NULL, document_id, zone_id, token_id, count, weight)";
    }
  }

  if ( batch_count == BATCH_NUMBER ) {
    var sql = "INSERT INTO document_index VALUES " + implode(", ", token_batch_list);
    executeSql(sql);
    batch_count = 0;
  }

  batch_count++;
}

This lowered the running time tremendously. We did tests with 10, 20, and 100 batch inserts at a time, with each one speeding up the indexing tremendously. (For this test, we did not include the 5,000 and 10,000 document results).

Document Count Running Time Memory Usage Document Batch Count
50,000 212.49s 13.38MB 10
50,000 203.49s 13.44MB 20
50,000 166.11s 14.10MB 100

The running times for the Indexer #4 algorithm.

Indexer #4 resulted in an improvement from 429.95s to 166.11s, a 61% reduction in running time from the Indexer #1 algorithm. Even though we had improved it this much, why bother with any SQL statements at all? It is clearly the largest bottleneck of the entire operation.

We were using a MySQL database to store the tokens and index. MySQL has a LOAD DATA INFILE ‘filename.csv’ query which loads in a comma separated file (by default) into a table. By writing two files, one with all of the tokens and one with the document index relations, we could then load these files into the tables. MySQL is very fast at this operation as it has no query parsing/caching overhead. We were already storing each token in memory, so there would be no additionally memory usage. With this, we would no longer need to batch anything either.

Indexer #5

var token_hash = array();
var token_id = 0;
var token_id_current = 0;
foreach ( document in universe ) {
  foreach ( zone in document ) {
    var token_list = document.zone.value;
    foreach ( token_list as token ) {
      var count = token.count;
      var weight = tokenWeight(token.count);

      if ( false === isset(token_hash[token.value]) ) {
        fwrite("token.csv", "token_id,token.value");
        token_hash[$token] = token_id;
        token_id_current = token_id;
        token_id++;
      } else {
        token_id_current = token_hash[token.value];
      }

      fwrite("document_index.csv", "NULL,document_id,zone_id,token_id_current,count,weight");
    }
  }
}

unset(token_hash);

executeSql("TRUNCATE `token`");
executeSql("LOAD DATA INFILE 'token.csv' INTO TABLE `token` FIELDS TERMINATED BY ','");

executeSql("TRUNCATE `document_index`");
executeSql("LOAD DATA INFILE 'document_index.csv' INTO TABLE `document_index` FIELDS TERMINATED BY ','");

delete("token.csv");
delete("document_index.csv");

By using the two CSV files, we were able to decrease the execution time for 50,000 products to 88.11s with 13.38MB of memory usage.

Document Count Running Time Memory Usage
50,000 88.11s 13.38MB
100,000 229.01s 31.60MB
300,000 606.91s 118.83MB

The running times for the Indexer #5 algorithm.

At this point, we felt the algorithm could not increase in efficiency at all. The original naive algorithm (Indexer #1) took 10,800 seconds for 300,000 documents. The Indexer #5 algorithm took 607s, a 94% increase in speed.

Conclusion

The famous quote by Donald Knuth rings true, “Preoptimization is the root of all evil.” Initially we spent time trying to reduce the time from O(n3) to something faster. After investigating the real bottlenecks, we were able to get the running time down 94% with a relatively slight increase in memory.

Note

In this article, “we” refers to myself and other programmer, Trevor Andreas.

On Building an Efficient, Indexed Search Engine With a Word Proximity Algorithm

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

This article describes how to build a very fast, index based search engine for a group of documents taking word proximity into account with a very basic algorithm. Actual coding will be left to the reader, however, the basic algorithms will be presented, along with the theory behind them. Additionally, the 0.4 version of Artisan System will have the search engine described in this article built into it.
The searching algorithm is actually quite simple. It uses very little memory and processing power. It is even possible to have the entire algorithm take place within a database using SQL. The algorithm uses a weight based approach, where the resulting items are weighted against the search query. Thus, the overall weight for each item is computed at runtime, rather than at index time.
To create the speed of the search engine, an index must be built first. The time to index the entire set of documents, called the universe, is directly proportional to the number of documents within the index.
Pieces of this algorithm are derived from the Apache Lucene Project.

Tokens

The most basic part of any search engine is a token. A token is a single term of characters that describes a document. One document can have many tokens. In addition to a document having tokens, a search query is comprised of tokens as well. The act of splitting a group of terms into tokens is known as tokenization. The most basic delimiter when splitting a group of terms into tokens is the space character since it’s what English speakers use to denote separation of words. This article assumes English as a first language, however, the search engine supports Unicode characters as well. Thus, for this example, the space character will remain as the tokenizing character.

Tokenizing a search query must occur in the same manner that a document is tokenized. This will result in the tokens returned from the search query are the same as the document. This does not mean some preprocessing can not be done on the query before hand to parse simple regular expressions or boolean expressions. For the time being, this article assumes that no additional preprocessing will occur on the search query, and it will be tokenized identically to each document.

Zones

A document is comprised of one or more zones. A zone is simply a section of the document that holds data about the document. For example, the zones of an academic paper may be:

  • title
  • authors
  • keywords
  • abstract
  • document
  • references

Zones exist to further segregate the information about a document so that each zone can carry a different weight. Thus, tokens from the search query found in the title rank much higher than they would in the document itself, for example. Additionally, this allows one to specifically search in an exact zone of the overall document. One could easily search in the Abstract only, for instance. Essentially, each zone weight allows fine tuning when the overall search weight is calculated.

Word Proximity

Solving the word proximity issue is difficult. For example, the title Point Based Approximate Color Bleeding from Pixar Animation Studios would normally be tokenized to:

  • point
  • based
  • approximate
  • color
  • bleeding

However, there is no relationship between the location of the words to each other. If one searched with the query point color, a document entitled Point Color Pixelation Techniques should show with a higher weight than Point Based Approximate Color Bleeding. Thus, without any word proximity algorithms, the only way Point Color Pixelation Techniques would show up first is if it had a higher weight in each other zone. We want to take word proximity into account within each zone.
Complex algorithms exist to determine the word proximity at search time. These are out of the scope of this document. This article aims to provide a simple, non-mathematically proven method of searching large datasets quickly and efficiently without having advanced mathematical knowldge. An algorithm devised by Trevor Andreas will store a token as more than an individual term, but rather as a grouping of terms. Grouping the terms into unique tokens with a yield count of 2 for Point Based Approximate Color Bleeding yields:

  • point
  • based
  • approximate
  • color
  • bleeding
  • point based
  • based approximate
  • approximate color
  • color bleeding

And for Point Color Pixelation Techniques yields:

  • point
  • color
  • pixelation
  • techniques
  • point color
  • color pixelation
  • pixelation techniques

Each of these become unique tokens themselves. Because the same process tokenizes both the search query and the documents, a search query of point color tokenizes to:

  • point
  • color
  • point color

As a result, both of the documents have tokens point and color, however, only the second document has token point color associated with it. Thus, the weight of the second document will be higher than that of the first.
As a side note, if the tokenizing yield number is 3, it should include terms up to 3. Point Based Approximate Color Bleeding yields:

  • point
  • based
  • approximate
  • color
  • bleeding
  • point based
  • based approximate
  • approximate color
  • color bleeding
  • point based approximate
  • based approximate color
  • approximate color bleeding

Obviously, the higher the yield number, the more tokens exist.

Term Frequency Inverse Document Frequency

Also known as the tfi-df, Term Frequency-Inverse Document Frequency describes how common a term, or token, is to the entire corpus. The corpus is the list of all unique tokens in the universe. The tfi-df in this example is used to calculate the number of times a token is found within a document. This value is stored with the index so that it can be retrieved quickly when performing the actual search. More can be read about this topic at the Term Frequency-Inverse Document Frequency Wikipedia document.

Indexing

The most important part of the search engine is indexing. Writing a good indexer that captures all of the tokens in each zone of a document is crucial. While optimization should always be in the hindsight of every engineer, the indexer is not built for speed, it is built for correctness. The indexing algorithm in this example runs in O(n) time.

Indexing the entire universe is surprisingly simple:

var document_count = document.length;
foreach ( document in universe ) {
 foreach ( zone in document ) {
   // Perform the actual tokenization of the value in the document's zone.
   token_list = get_token_list(document.zone.value);
   if ( !token_list.empty ) {
     foreach ( token in token_list ) {
       // token.length is the number of tokens in this token entry.
       insert_into_document_to_token_table(document.id, zone.id, token.id, token.length);
     }
   }
 }
}

// Count the number of tokens in the entire corpus.
var token_count = get_total_token_count();

foreach ( document in universe ) {
  foreach ( zone in document ) {
    // Get all of the tokens in this zone of this document
    var token_list = get_document_zone_token_list(document.id, zone.id);

    // The number of total tokens within this document zone.
    var token_count = token_list.length;

    foreach ( token in token_list ) {
      // The number of times this token is used in the entire document zone.
      var indv_token_count = token.length;
      var term_freq = indv_token_count / token_count;

      // Return the number of times token X is used within this document.
      var indv_token_count = token.length;
      var inv_freq = ln(document_count / indv_token_count );
      var tf_idf = inv_freq * term_freq;

      // Updated the tf-idf in the index.
      update_tf_idf(document.id, zone.id, token.id, tf_idf);
    }
  }
}

At this point, a large index is built up with the following fields (for example):

document_id     zone_id    token_id     count     tf_idf
1               1          1            3         0.58
1               1          2            1         0.14
2               2          5            1         0.23

Calculating the Search Weight

After a search query is entered and tokenized, a list of documents without any of those terms can be quickly discarded. Finding the documents containing those terms is important. After each of those documents are found, the zone of each document must be analyzed in order to determine if the search query token is in that zone. This aids in overall weight calculation.
After each document with those tokens are found in any zone, calculating the search weight becomes:

var document_list = new Array();
foreach ( document in document_found_list ) {
  document_list[document.id] = 0;
  var weight = 0;
  foreach ( zone in document ) {
    foreach ( token in tokens_in_zone ) {
      weight += token.tf_idf;
    }
    weight = weight * zone.weight;
  }

  document_list[document.id] = weight;
}

sort(document_list);

After the document_list is sorted in descending order, one can simply display the results as desired.
The industrious programmer will quickly see that this can easily be stored in two tables in a database. One table, `document_index` would hold the document_id, zone_id, token_id, count, and tf_idf. The other table, `token` would store a list of tokens. Querying these tables and calculating the search results is nearly trivial.

Further Thoughts

Because this algorithm is not based on any mathematical proof, it can not be considered mathematically sound. However, it works very well, and is very efficient and fast. One area of improvement would be to make it have an artificial intelligence learning algorithm. Essentially, one would store the search queries, what documents were returned, and what documents the user selected. From there, the number of times a document is accessed would be included in the search algorithm to determine the overall weight.

Conclusion

Currently, most sites implement very simple search engines that are slow and do not return correct results. Using a simple query, “WHERE title LIKE ‘%some query%’” is often slow and yielding incorrect results. Presented here is an alternative algorithm that is easy to implement and will handle any type of document.

Acknowledgements

Versioned Objects in Artisan System 1

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010. Versioned objects were removed in Artisan System, but will make a re-appearance soon.

A new feature in Artisan System 0.3a1 are versioned objects. Versionable objects have been mentioned in several previous posts, however I am officially announcing them now. Additionally, I want to hear if the community in general believe they are a neat or useful idea.

A versionable object is one who stores historical information about the changes in data from within that object. It works very similarly to how source control works with source code: changes made over time are tracked, and thus, you can go back to a previous change should you break something. Versionable objects in Artisan System 0.3a1 are stylistically based off of Subversion.

Code Snippet

Currently, the only object that can be versioned right now is the Customer class. I did it there simply because I had no plans to write a versionable object, I was working on the Customer class when I had this idea, so I simply implemented it there first. I plan on pushing out the idea of a versionable object to a generic class, where one could create an object outside of the versionable object, pass it to the versionable object, and automatically track versions. Essentially:

require_once 'Artisan/Db/Adapter/Mysqli.php';
require_once 'Artisan/Customer.php';
require_once 'Artisan/Versionable.php';
 
// Assume $config_db was defined elsewhere and contains connection information
$db = new Artisan_Db_Adapter_Mysqli($config_db);
$db->connect();
 
// Load up customer ID #10
$customer = new Artisan_Customer(10);
 
// Make the customer versionbale now, loading it up at the HEAD revision
$versionable = new Artisan_Versionable($customer);
$versionable->setStorageDb($db);
 
$customer->firstname = "newfirstname";
$customer->age = 24;
 
// Must use $versionable to write the object's historical data, just like you
// must use `svn ci` to check in changes.
$versionable->write();
 
// However, you can simply call write() directly to save a small amount of changes.
$customer->write();

The above code is only a sample, and will not work now. I have not fully designed the generic Versionable object, but I believe it will work similarly to the code above. This example will show the versionability with the Customer class.

require_once 'Artisan/Db/Adapter/Mysqli.php';
require_once 'Artisan/Customer/Db.php';
 
// Assume $config_db was defined elsewhere and contains connection information
$db = new Artisan_Db_Adapter_Mysqli($config_db);
$db->connect();
 
// Create a new customer object that stores it's data in the database, this
// will eventually be changed to have all customer data stored in the database
// by default because honestly, where else are you going to store customer data?
$customer = new Artisan_Customer_Db($db);
 
// Load up customer ID #10 at the HEAD revision.
$customer->load(10);
 
$customer->lastname = "Fisher"; // Update the lastname to Fisher
$customer->age = 19;
$customer->tax_id = "19300x9d93";
 
$c_data = array(
 'favorite_book' => "The Art of Computer Programming"
);
 
$customer->setFromArray($c_data);
 
unset($customer->dateofbirth);
 
// Create a new revision, assume <em>lastname</em> and <em>age</em> have already been created,
// and thus, will get an M for Modified flag associated with them, <em>tax_id</em>
// and <em>favorite_book</em> are new fields, so they get an A flag for Added,
// and <em>dateofbirth</em> was deleted via unset(), and thus, gets a D flag for deleted.
$customer->write(); // Assume the latest revision is now 18.
 
// Load up customer ID #10 at revision 17
$customer->load(10, 17);
 
echo $customer->lastname; // echo's the value previously stored before Fisher was set as the lastname.
echo $customer->favorite_book; // echo's nothing, the field hasn't been created yet.
echo $customer->dateofbirth; // echo's December 18, 1982 for example

With each write() call, a new revision is created with the changes to the data. The Customer class makes use of an Entity-Attribute Value model. The fields associated with the Customer object are not stored as columns in the database. Instead, the fields are only stored historically with the specific customer. However, you can enforce a Customer to have a default set of fields. Each field must have a type, which can be as generic as “text” or as complex as “email”. More complex fields can have a regular expression and a hook function they must match in order to have their values set.

$customer->email_address = "vmc@leftnode.com"; // email_address is a specific field type, and thus, must pass a regex
$customer->email_address = "vmc"; // Doesn't match, value won't be set.

Versionable objects also have the ability to be cloned (similar to copied in Subversion). Thus, if one object is created from another, the primary key (in this case, the customer_id) is copied as the parent_id of the second object.

// Copies all of $customer at whatever revision it's
// loaded at with $customer2's parent_id as the id of $customer
$customer2 = clone $customer;
$customer2->write();

Speed and Efficiency

The first question asked, is, “Isn’t this very slow?” Fortunately, the answer is No, it’s surprisingly quick. In several initial load tests, a random object with 4000 revisions each with 0-20 fields updated per revision could have an arbitrary revision loaded in about 0.003 seconds. Of course, it is slower than using a database with a `customer` table having several columns, but it much, much more flexible. Again, it’s similar to normal source revision control: an overhead exists but at a nice feature improvement.

Conclusion

At first, especially for a Customer object, it may seem like overkill. However, there are times where historical information is necessary or even required. For example, in several industries, software must keep track of every change made to a piece of data (the banking industry is one such). It would be nice to build a system that stores historical data automatically.

The usefulness of this extends past just a basic Customer object. One could easily build an object for storing Wiki documents. With each write, the historical data is stored, and you could easily revert back to a previous version if necessary.

It is my hope that the new version of Artisan System will have some incredibly useful features that programmers can take advantage of. I’m particularly excited about versionable objects, and I look forward to hearing if its a good idea or not, and how it can be improved if so.

PS. I just know someone is going to let me know that Django or Rails has had functionality like this for several years.

Magento Software Security Vulnerabilities 1

Posted by Vic Cherubini on December 01, 2009

This is an invalid article, and the vulnerability has since been patched.

It’s not often that you’ll find me disparaging the work of other developers. I realize the hard work and effort that is put into developing a large scale website. Unfortunately, there is a security vulnerability in Magento Commerce in 1.2.1.1 (and previous versions) that I believe should be fixed before anyone else uses it.

Magento Commerce has been criticized in the past for being too bloated. I agree it is. A single, uncached page load takes as much as 20MB of memory, which is a bit much for a web application. While Magento can be criticized for being too bloated, many other popular apps are bloated, and that bloat can generally be solved by server tuning and hardware. The issue of Magento extends further to a much more important issue: security vulnerabilities in the administration panel.

The problem starts with an attack called a cross site request forgery, or CSRF. The brilliance with this attack is that the person executing it is generally unaware and an authorized user of the website, and thus, the web software has no way to stop the actual actions from occurring.

The attacker will generally use more social engineering than actual coding/cracking. First, I’ll explain how the request forgery works and then how to fix it.

The cracker installs Magento on their own system. They configure it properly, set them self up as the administrator, and create a fake order so they receive the layout of the email the administrator receives when a new order is created. Next, the attacker creates a real order on the system they’re looking to attack. With that, they’ll get any modifications to the order email (albeit’s the customer’s email, they’re checking any major color changes, font changes, etc). With this order email, they now have the email address of the administrator of the site.

With this knowledge, the attacker has enough information to execute the attack. The attack lies in the fact that the administration panel of the Magento installation allows you to execute CRUD (more importantly, the CUD) – Create, Read, Update, Delete – actions through a GET request rather than a POST request. Using Firebug with their own installation, the attacker can easily find the URL to, for example, duplicate or delete a product.

I’ve blurred out the location of my Magento installation, but you can easily see that the duplicating or deleting of a product is done with a simple JavaScript function that just sets the window’s URL to http://mymagentosite.com/index.php/admin/catalog_product/delete/id/1 or http://mymagentosite.com/index.php/admin/catalog_product/duplicate/id/1.

Copying one of those URL’s and pasting it into another tab will cause the action to execute! This is where the source of the problem lays. Actions such as Create, Update, and Delete should always happen through a POST request, not a GET request. If they must happen through a GET request, an interstitial page is necessary to confirm their actions.

The Attack

The attacker now has everything they need: new order email, administrator email address, and sample URL’s to attack. Here’s what they do:

  • Create a simple HTML page on their website that when accessed, draws a hidden IFRAME with the SRC attribute being the URL of the attack you want to execute. For example, if you want to duplicate a product, the page could be as simple as:
  • <iframe src=”http://mymagentosite.com/index.php/admin/catalog_product/duplicate/id/1″ style=”display: none”></iframe>
  • and installed on http://badsite.com/crack_magento.html
  • Then, through the use of a phishing attack, sends an email to the administrator masquerading as a legitimate new order email. If the attacker sends this in the morning, chances are, the administrator will already be logged into the backend of the website processing legitimate orders, meaning a session has been authenticated for them. The phishing email looks nearly identical to their emails they get regularly, and the administrator clicks on the link to go process the “new order”. In reality, because the email’s are HTML emails, the value of the link is different than the HREF. The administrator clicks a link they’ve seen hundreds of times before, but instead of going to the website, it goes to the page on http://badsite.com/crack_magento.html
  • Because the administrator is already authenticated, even if this link opens in a new tab or even window, the action will be allowed to execute! With that, they’ll automatically have a duplicate product on their store! Deleting a product works in the same way.
  • Even more frightening, the attacker could easily update the crack_magento.html page to use a loop in JavaScript to continually draw a hidden iframe 100 times to create 100 duplicate products. As a result of the size of Magento, this can do several things:
    • Create 100 products that are duplicates and erroneous.
    • Cause the server to fail under the load of continual requests.

The Fix

Surprisingly, the fix for this is simple.

  • ALL Create, Update, Delete requests must come from a POST request.
  • A same domain policy is created to only allow POST requests from the website Magento is installed on. The checking for this would happen before the rest of the page content is loaded, requiring very little overhead:
  • <?php
     
    if ( 'POST' === strtoupper($_SERVER['REQUEST_METHOD']) ) {
      $this_domain = get_this_raw_domain(); // Would return something like mymagentosite.com
      if ( $this_domain != $config['site_domain'] ) {
        exit('Hacking attack!');
      }
    }
  • Of course, the above code is a proof of concept, but would be fairly simple to implement in Magento (or any site).
  • On some forms, a hash value is in a hidden form field and this is checked against the session. This should be updated to have a random has generated on each page load in a hidden form field. The same value is stored in the session. On each page load, this hash is recreated. Next, a static hash is created when the administrator logs into the site. On each page submit, the hash from the form and the hash from the session are checked. If both of these are the same, the request continues.
  • Add a secret PIN to the subject of all emails coming from the store to the administrator. For example, if the administrator set his PIN as 82799, that value would appear in every subject of every email to him from the store. Because he’s the only one who knows the PIN, he’ll know the email is legitimate.

Conclusion

An obvious question is why would someone exploit this? This is the wrong question to ask. People do amazingly cruel things for no reason whatsoever. The question should be, “what is the attack and how do I fix it?”. Although duplicating a product is something that’s a minor nuisance, my fear lies in that if something as simple as a cross site request forgery attack is not caught, where else are major security vulnerabilities present? It is not my point to discredit or disparage the Magento developers as I believe Magento has a chance to be a great product. However, security vulnerabilities such as these are bound to be exploited. I urge you to speak with the Magento developers to ensure this will be fixed immediately.

A Situation Not So Unique

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

Often, programmers will make assumptions as to how a system will work, either based on good evidence or on their past history with how the system previously worked. Recently, I made one such assumption about the uniqueness of the products in a shopping cart that lead to some interesting results.

The company I work for was recently commissioned to do a small, single page shopping cart for another company that sells wiper blades for cars. I was given a spreadsheet/CVS file in the format:

MAKE,MODEL,YEAR,DRIVER_LENGTH,PASSENGER_LENGTH,REAR_LENGTH

The DRIVER_LENGTH, PASSENGER_LENGTH, and REAR_LENGTH values were all the length in inches of a wiper blade. The MAKE, MODEL, and YEAR combination would be a unique combination. For example, my 2008 Honda CR-V would be represented as: HONDA,CR-V,2006-2008 because the wipers from the 2006 to the 2008 models were the same. The database was set up in this fashion as well. I did not design the database as another company had and the customer wished us to use the same format. Otherwise, I would have normalized it properly to have a `make`, `model`, and `year` table. However, because there was a small amount of rows (less than 2000), those fields could be indexed and lookups would be relatively fast, so I wasn’t worried too much about performance.

The mistake I made was assuming that the spreadsheet contained unique values. The assumption was that the HONDA,CR-V,2006-2008 value would be unique in the spreadsheet because there can only be one Honda CR-V 2008 car. However, I failed to realize that the spreadsheet was created by a human, and mistakes were made.

Because the spreadsheet was nearly 2000 rows in length, I didn’t take the time (nor was it my job), to verify that each record was unique. I assumed since the customer had given it to us that it was correct.

I built the site quickly since it was a fast deadline and didn’t do much to check for correctness. On the site, a simple list of dropdowns were created to quickly select the car. First, you’d select your Make, then an AJAX call would bring up a list of Models, you’d select the Model, and with an AJAX call, a list of Years would appear. Once you select the Year, a final AJAX call would retrieve the ID associated with that row and store it in the JavaScript for later usage. Fairly simple and straightforward. However, after we launched the site, we noticed that some orders appeared to have incorrect total calculations. The order appeared to have a single product associated with it, however, the total amounts indicated a second wiper. I recreated the order on the development system and with no avail, I could not reproduce the error. Could it be a session collision where a duplicate session ID was created? No, of course not, PHP prevents that from happening. I added logging and still, I could not reproduce the error.

Several days later, my boss was attempting to use the site on the development system and added a Buick LeSabre 2001-2005 wiper to his cart. When he checked out, the product was not there! I quickly searched the database for a Buick LeSabre 2001-2005. To my surprise, there were two records for that car with identical wiper lengths! I quickly wrote a SQL query to search for other cars with duplicate entries, and to my surprise, there were 10 of them.

So here’s what happened: I assumed incorrectly that the products in the spreadsheet given to me were unique. Instead, human error accidentally duplicated a few of these, and because of my incorrect assumption, the importer I wrote simply inserted these into the database. Thus, when the final AJAX call would execute to get the product ID, a value of NULL would be returned because I incorrectly assumed that only a single record would be returned. Instead, two records were being returned, and the code I had written didn’t know how to handle that because it assumed only one record would be returned. Thus, a NULL product ID was sent back to the page. However, the product was still added to the persons cart, but when the order was finalized, it did not appear because the product ID was NULL, no record of it was written to the database.

Solving this issue was easy: either delete the duplicate records (generally not the best idea), or LIMIT the query to 1 record. I added this and instantly, everything worked fine, even when selecting a wiper with multiple records. We notified the customer and they apologized for their mistake.

Of course, the mistake was ultimately mine for incorrectly assuming the data even when I knew that a single record should be returned. It seems like a silly, beginner mistake, and it is, but I’ve definitely learned from it.

Why Writing Software Is So Difficult

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

Writing software, no matter what level, or how much, is very difficult. The most difficult thing about writing any piece of software is finishing it. Writing good software is even harder. It takes hours upon hours of time invested in the smallest of details to ensure each piece works together smoothly. It is precisely this that makes software such a rewarding and most difficult occupation.

One may be surprised, but to me, the most difficult part of writing software is actually writing it. Sitting down and typing out the code that forms the finalized program. Yes, the details can be very difficult and agonizing at times, but there are document after document written on how to write good software, so implementation details can eventually be solved.

One thing that has always amazed me is the amount of software that exists. The sheer volume is staggering. How many billions of lines of code have been written is a number that can never accurately be counted. So the question asked is often, “how do I go about getting my software noticed?” Whether it is free software like Artisan System, or proprietary closed source software, the answer is always: FINISH IT.

Several years ago when I was going to choose a college, I visited the University of Texas at Dallas, which eventually became my Alma mater. During my visit there, I sat in on a presentation by famed game programmer Tom Hall (with slightly less hair at that time). The point of his presentation was how to write a game. By far, the thing he stressed most was: FINISH IT. No one wants to play an unfinished game, and the same can be said for software in general: no one wants to use unfinished software.

At that point in my life, I was dead set on being a game developer. To me, there was no greater glory than writing games. Years later, I realized I really did not want to write games, and as the web became a very powerful platform for development, it attracted me more. I never finished any of the game software I started.

I think every developer has felt that at some point during their career: how do I go about finishing this, when is it considered done? Done can mean several things from a new stable release to the end of a service contract to the archiving of an old piece of software never to be touched again. The point being, reach that goal. For me, each new stable release (0.1, 0.2, 0.3, etc) of Artisan System means it is done, I have a finished product worthy of releasing.

This is especially difficult when starting a piece of software from scratch, such as Artisan System. Artisan System started as a desire for me to take years of PHP programming and turn it into a productive framework that other people, including myself, can use to make great web software.

Why is finishing software so difficult? Lets dissect open source software for a moment, which generally starts off as more passion than anything. Artisan System started as a passion of mine and has now flourished into a full time hobby. Hitting the 0.1 release of Artisan System was the hardest piece of software I have ever written. Why? I now had the time to set an accurate deadline of everything I wanted to include in the 0.1 release. To that day, I had never personally released a piece of software to the public. Sure, I am a professional developer by day and I write software that my employer sells, but I see writing Artisan System as different since I am unpaid and doing it as a hobby. Hitting that 0.1 release meant that I was done. I had finished something. That is not to say that Artisan System is finished as a product, not by any stretch, but I had finished that milestone. I now had a grouping of classes that worked together that people could use to build web software. Sure, there were bugs, as in most software, but for the most part, it was stable and well tested.

The 0.2 release was much, much easier to hit than the 0.1 release. I could feel it in me, I had already hit 0.1, so 0.2 must be right around the corner! It was! My drive for writing more of Artisan System was never stronger. I did small incremental releases: 0.2a and 0.2b, both getting a few hits.

Right now stop whatever piece of software you are working on and physically write down a finishing point. Again, this can be the next point release, or the fully finished product with as few bugs as possible, or anything in between. Hit that goal, let nothing get in your way of finishing it. By finishing a piece of software, you have now completed the hardest, most difficult part of ever writing any piece of software ever.

Edit: It appears there has been some positive and negative reactions to this article, which is what I wanted. I thought about it more, and here’s ultimately what I wanted to say. My father owns a a small software (no, not that epic) company, so I have always been around programmers my entire life. When I was much younger and would hang around his office, I would always see the programmers working on something complex and it absolutely blew my mind that they could manipulate a computer with code. I thought never in a million years would I be able to understand the syntax of another language, much less multiple! Never did I ever think I could understand algorithms (of course I didn’t know they were called that at the time). However, once I started to mature a bit and learn how software is written, it finally occurred to me that, yes, some algorithms are difficult, some I’ll never fully understand, yes, the syntax of some languages is archaic or difficult, but there are endless manuals to overcome that, but the most difficult part of writing any piece of software is finishing it. Just ask yourself: how many Subversion or git repositories do you have sitting out there with unfinished work, projects started and not finished? Chances are, if you’re like me and passionate about this, a lot. So my advice is simply this: see if you can finish off one of the projects. I know how much better it made me feel to release 0.1 of Artisan System.

Artisan System Code Swarm

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

Here is the codeswarm for Artisan System. I always love doing things like this. Represents the last 7 months of work.

History and Name of Artisan System

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

Artisan System has been in development for about a year and has gone through two distinct rewrites from scratch. The rewrites were as the result of underestimating the complexity of a framework. The version released today has been in development for 6 months. Development has been slow, but deliberate. Artisan System was written to be fast, small, and with an emphasis on good PHP coding standards.

Where did the name come from? Why Artisan System? In it’s first revision, the framework was named Artisan. Clearly, an Artisan is a master of their craft, and we envision PHP programmers using Artisan System as a tool to perfect their craft. After the initial versions of Artisan were written, they were named simply Artisan. Because it severely lacked a lot of features, development on this version began. The framework was renamed to Artisan System because it is meant to be used as system level libraries for web applications. In the end, a group of applications will be written with the name Artisan and they will use the Artisan System framework for their underlying structure.

Finally, we want to improve the image of PHP and PHP Programmers. Yes, there are a lot of very poorly written PHP software for myriad of reasons. However, we feel that great software can be written in PHP and we want Artisan System to prove that.

Questions, comments, and help offers can be redirected to vmc@leftnode.com

Artisan System Framework License

Posted by Vic Cherubini on December 01, 2009

This is a valid article, and is considered technically accurate up to Feb. 21, 2010

This software is provided ‘as-is’, without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.