
Quick Computation Infusion

I remember when I had to carry out an experiment on Word Sense Disambiguation (WSD) while doing my PhD. I was dealing with collections of text (corpora) several million words in size, and when I started the WSD process, I realized that it would take me around two months to get the data I needed.

That was crazy. Luckily, Cambridge had access to a CONDOR supercomputing cluster. I parallelized my code and launched it on this cluster. After a couple of false starts, I got all the data I needed: 42 computers working over 4 days (a weekend+!) got me the 60 days' worth of data.
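As a rough back-of-the-envelope check on those numbers, the sketch below computes the speedup and parallel efficiency. It assumes the 60 days is the single-machine estimate and the 4 days is the wall-clock time on all 42 machines, false starts included; the function names are purely illustrative.

```python
def speedup(serial_days: float, parallel_days: float) -> float:
    """How many times faster the parallel run finished than the serial estimate."""
    return serial_days / parallel_days


def efficiency(serial_days: float, parallel_days: float, machines: int) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(serial_days, parallel_days) / machines


# Figures taken from the post: 60 days serial, 4 days on 42 machines.
s = speedup(60, 4)
e = efficiency(60, 4, 42)
print(f"speedup: {s:.1f}x, efficiency: {e:.0%}")
```

Under those assumptions the run was about 15 times faster than the serial estimate, well short of the ideal 42 times. That gap is unsurprising on a shared cluster, where machines come and go and restarts eat into the wall-clock time.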

Nowadays, if you have a bit of cash to burn, you can use the cloud for the same purposes. Here is a case study of an individual who (two years ago!) employed a thousand nodes on Amazon to get their data processed.


It cost them 900 for the CPU and perhaps another 800 or so for the data transfer. Not bad. That’s pretty cheap to have a thousand computers working for you!

  1. Asif Jan
    June 8th, 2010 at 12:32 | #1

    The good thing about cloud computing is that it handles the spikes in your usage, and many people would want to have that luxury/flexibility. Another important factor is eliminating the need to maintain your own cluster, and so on and so forth.

    I think so far cloud computing has been good for compute-intensive problems; the next logical step would probably be to offer attractive incentives for data-intensive problems. Even today, Amazon lets you ship your data to them in order to reduce the time and cost of data ingestion.

  2. June 8th, 2010 at 13:58 | #2

    @asif the value proposition is unquestionable when it comes to computational spikes. As you mentioned, you can also employ this infrastructure on data-intensive problems; however, most orgs don’t have such huge data stores to deal with, and I would argue that if they had an enduring task like this, they should consider investing in their own infrastructure.

    However, for one-off problems, it’s really brilliant. Indeed, Amazon has explored using human processing (via the Mechanical Turk) for such problems as well!

    – Shahzad

  3. December 13th, 2010 at 14:05 | #3

    We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine's data warehouse.
