
Fruits of Experience: Massively Scalable File Operations

August 4th, 2009

I recently met a friend, and as we were shooting the bull about what we do that is different from other software developers, the issue of scale kept cropping up, alongside all the statistics, the copious reading of dense technical reports, and the presentation of esoteric information in human-digestible form. As machine learning aficionados, my friend and I come from the school of thought that the more data you have, the better off you are. However, this truism is not limited to people like us, who build classification models over massive data repositories (a million-word corpus is considered entry stakes in our world). It is an important skill for anyone in a successful organization in the Web 2.0 world, where massive repositories of user-contributed data have to be dealt with. Let's take a relatively simple scenario:

Your team has to remove all the phone numbers from 50,000 Amazon web page templates, since many of the numbers are no longer in service and you want to route all customer contacts through a single page.

Let's simplify the problem and say that you only have to identify the pages that are likely to contain U.S. phone numbers. To simplify even further, assume we have 50,000 HTML files in a Unix directory tree under a directory called "/website". We have two days to get a list of file paths to the editorial staff: the .html files in this directory tree that appear to contain phone numbers in the format xxx-xxx-xxxx.

Assuming that the standard UNIX tools are available (variants exist for Windows as well), the solution is given below:

Step 1. Identify the Regular Expression Pattern
The work-horse command for this would be the ‘grep’ tool, and the pattern would be:

grep -e "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" FILENAME

The regular expression looks for a sequence of 3 + 3 + 4 digits, with an optional space or dash between the groups. There may also be parentheses around the area code (although that is unlikely on a site like Amazon's, which has national rather than local reach).
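
Before running anything against the real tree, the pattern can be sanity-checked against a small scratch file of made-up examples (the file name and the numbers below are purely illustrative):

printf '%s\n' 'Call 206-266-1000 for help' 'Fax: 206 266 1000' 'No number here' > /tmp/sample.txt
grep -e "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" /tmp/sample.txt
# prints the first two lines only; since each separator is optional, an
# unbroken run such as 2062661000 would match as well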

Step 2. Search Through an Entire Directory Tree
The approach in Step 1 searches only within a single file, and is useful for nailing down the correct regular expression (which will be reused in the next step). To search through every file in a directory tree, grep has to be coupled with the 'find' tool, which walks the tree and hands each matching file name to another command.

find /website -name '*.html' -exec grep "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" '{}' \; -print 2> /tmp/error.txt

Standard error is redirected to '/tmp/error.txt' so that files and directories we do not have permission to read can be identified afterwards.
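
Since the editorial staff really only needs the list of file paths, not the matching lines themselves, a slight variant of the same command can write that list straight to a file; the output file name below is just a placeholder I've chosen:

find /website -name '*.html' -exec grep -q "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" '{}' \; -print > /tmp/phone_pages.txt 2> /tmp/error.txt
# grep -q stays silent and only reports success or failure, so -print emits
# nothing but the path of each file in which the pattern was found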

Step 3. Optimizing via xargs
If this is a one-off development effort, the above should suffice. However, with 50,000 files spread across a number of directories, find will launch a separate grep process for every single file, and the overhead of those 50,000 launches adds up quickly.

Although a high-performance system with plenty of RAM may not croak on this task, it could still be unacceptably slow.

To avoid this problem, it is better to use xargs, which lets find run once and feed the file names it collects, in large batches, to a SINGLE instance of grep (or, for very long lists, a small handful of instances).

The command for this would be:

find /website -name '*.html' -type f | xargs grep "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}"
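
If any of the paths might contain spaces, or if only the file names (rather than the matching lines) are wanted, a slightly safer variant of the same idea would be:

find /website -name '*.html' -type f -print0 | xargs -0 grep -l "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}"
# -print0 and -0 separate file names with NUL bytes, so whitespace in a path
# cannot split one name into several arguments; -l makes grep print only the
# name of each file that contains a match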

With 50,000 files, I estimate the speedup would be at least 20 to 1, which matters when users are sitting at the terminal waiting for the command to finish. However, this may be overkill for a simple one-off operation, and a sharding strategy, by which each subdirectory of /website is processed one by one, may be more appropriate (a rough sketch follows).
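
Here is what that sharding loop might look like, assuming the pages sit in subdirectories one level below /website (the layout itself is an assumption on my part):

for dir in /website/*/ ; do
    # each pass handles a single subdirectory, so no one run gets too large
    find "$dir" -name '*.html' -type f -print0 | xargs -0 grep -l "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}"
done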

What I really love about this story is that it fondly reminds me of the pain I went through in the early years of my career. Problems that appear trivial to me now used to really stump me. These included simple tasks like configuring web servers, piggy-backing data transfer on preexisting protocols, compression, security schemes, and a million different optimization strategies; they all used to take me weeks or months of effort and many trips down blind alleys to figure out.

Now, as soon as I come across a problem, I can invariably match a pattern to it (based on scenarios I've experienced in the past). Even when a pattern is not obvious, at the very least I can quickly formulate an efficacious plan for arriving at the desired solution. Experience, which can be defined as the collection of your past failures and 'wasted' efforts, is useful for something after all, as long as you're willing to accept that you have a situation on your hands and that a solution is invariably available. Without that guiding hope, you're sunk!

It was Nietzsche who uttered the famous quote:

What does not kill me makes me stronger.

I can attest to the truth of this, at least from the point of view of a computer scientist who has suffered many 'near deaths', since I have a very unhealthy tendency to pick the riskiest, most challenging projects to work on. Instead of skirting around the problems like most rational people, I'm the mouse that really feels we ought to bell the cat: if you keep the bell quiet enough during deployment, arrange for the cat to be sedated, and engineer an attachment device that a panicky cat cannot take off, it just may be possible to pull it off.

... and, no, I cannot possibly be experiencing the 'second-system effect' on every project that I work on!