Archive for August, 2009

Quanta+ fix for Cervisia CVS plugin

August 21st, 2009 7 comments

If you’re like me and use the latest and greatest, you can end up using software with dated links. I love my Quanta+ editor: it’s lightweight and does everything I need without a two-minute startup time.

However, it has recently been giving me an error on startup. The same error will presumably crop up when you try to use the Cervisia CVS plug-in.

“..the file kde3/ is not installed or it is not reachable.”

The solution is to go to Settings->configure plug-ins
Double click on ‘CVS Management (Cervisia)’
Set the location to /usr/lib/
and the file name to kde4/

This way, your Quanta+ installation will use the KDE4 version of Cervisia at /usr/lib/kde4/ (which is probably installed) as opposed to the older KDE3 version (which Quanta+ still thinks is in place).

Open Source Analytics and Reporting Frameworks

August 9th, 2009 No comments

I gave a talk at the FOSSLC summercamp ’09 event, on open source analytics and reporting frameworks.

The talk focussed on techniques to make data more accessible to stakeholders. I covered techniques for accessing, analysing, and visualizing data using web-based reporting frameworks, as well as programmatic means of generating natural-language English reports.

The topics covered included:

* The importance of reporting technology
* The sources of data
* Data extraction and modelling
* Data cleaning techniques
* Data transformation
* Analytics — choosing the metrics
* Reporting using web frameworks
* Reporting using Natural Language Generation
* Things that can go wrong
* Best practices

The video is available at the Free and Open Source Software Learning Center (FOSSLC).

Fruits of Experience: Massively Scalable File Operations

August 4th, 2009 No comments

I recently met a friend, and we were shooting the bull, discussing what we do that is different from other software developers. Aside from all the statistics, the copious reading of dense technical reports, and our presentation of esoteric information in human-digestible format, the issue of scale frequently crops up. As machine learning aficionados, my friend and I come from the school of thought that the more data you have, the better it is for you. However, this truism is not limited to people like us, who deal with massive data repositories for which we are trying to build classification models (a million-word corpus is considered entry stakes in our world). Handling scale is an important skill for people in any successful organization in the Web 2.0 world, where massive data repositories sourced from users have to be dealt with. Let’s take a relatively simple scenario:

Your team has to remove all the phone numbers from 50,000 Amazon web page templates, since many of the numbers are no longer in service and you want to route all customer contacts through a single page.

Let’s simplify the problem and say that you have to identify the pages having probable U.S. phone numbers in them. To simplify the problem even further, assume you have 50,000 HTML files in a Unix directory tree, under a directory called “/website”. You have two days to get a list of file paths to the editorial staff. You need to identify a list of the .html files in this directory tree that appear to contain phone numbers in the following format: xxx-xxx-xxxx.

Assuming that the standard UNIX tools are available (and variants are available for Windows), the solution is given below:

Step 1. Identify the Regular Expression Pattern
The work-horse command for this would be the ‘grep’ tool, and the pattern would be:

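grep -e "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" FILENAME

The regular expression pattern looks for a sequence of 3 + 3 + 4 digits, with each group optionally separated from the next by a space or a dash. Because the separators are optional, any unbroken run of ten digits will also match, so expect a few false positives. There may also be parentheses around the area code, which this pattern does not allow for (although they are unlikely on a site with national rather than local reach).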
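Before pointing the pattern at real files, it can help to try it on a few made-up lines. The sketch below is an illustration, not part of the original solution: it rewrites the pattern in extended syntax (grep -E), which avoids the backslashes, and adds a hypothetical variant that tolerates parentheses around the area code. The variable names and sample lines are assumptions of mine.

```shell
# Extended-syntax form of the step 1 pattern: 3 + 3 + 4 digits,
# with an optional space or dash between the groups.
phone_ere='[[:digit:]]{3}[ -]?[[:digit:]]{3}[ -]?[[:digit:]]{4}'

# Hypothetical variant that also accepts parentheses around the area code.
phone_ere_parens='\(?[[:digit:]]{3}\)?[ -]?[[:digit:]]{3}[ -]?[[:digit:]]{4}'

# Made-up sample lines, one per case of interest.
samples='Call 555-123-4567 today
Call 555 123 4567 today
Call (555) 123-4567 today
Order number 12345'

echo "Basic pattern:"
printf '%s\n' "$samples" | grep -E "$phone_ere"

echo "With optional parentheses:"
printf '%s\n' "$samples" | grep -E "$phone_ere_parens"
```

The basic pattern skips the line with the parenthesized area code, while the variant picks it up. Neither matches the five-digit order number.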

Step 2. Search Through an Entire Directory Tree
The approach in step 1 only searches within a single file, and is useful for nailing down the correct regular expression (which will be used in the next step). To search through multiple files in a directory tree, it has to be coupled with the ‘find’ tool, which also has regular-expression-based search support.

find /website -name "*.html" -exec grep "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" '{}' \; -print 2> /tmp/error.txt

The redirect of stderr to ‘/tmp/error.txt’ records the files and directories that could not be read, for example because of insufficient permissions.
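Since the editorial staff only need file paths, one possible refinement is to silence grep and let find do the printing. The following is a minimal sketch under my own assumptions: it builds a throwaway directory standing in for /website, uses the extended-syntax (grep -E) form of the pattern, and the file names are invented for the demonstration.

```shell
# Build a tiny throwaway tree standing in for /website (all names made up).
site=$(mktemp -d)
matches=$(mktemp)
errors=$(mktemp)
mkdir -p "$site/help"
echo 'Call 555-123-4567' > "$site/help/contact.html"
echo 'No numbers here'   > "$site/index.html"
echo 'Call 555-123-4567' > "$site/notes.txt"

# Extended-syntax form of the phone number pattern.
phone_ere='[[:digit:]]{3}[ -]?[[:digit:]]{3}[ -]?[[:digit:]]{4}'

# -name limits the search to .html files. grep -q prints nothing and only
# sets its exit status, so -print fires solely for files that matched.
find "$site" -type f -name '*.html' \
    -exec grep -q -E "$phone_ere" '{}' \; -print \
    > "$matches" 2> "$errors"

cat "$matches"
```

Only help/contact.html is listed: index.html has no number, and notes.txt is filtered out by -name even though it contains one.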

Step 3. Optimizing via xargs
If this is a one-off development effort, the above should suffice. However, with 50,000 files spread across a number of directories, find will launch a separate instance of grep for every single file, one after another, and the cost of starting tens of thousands of processes adds up.

Even a high-performance system that does not croak on this task could still be unacceptably slow.

To avoid this problem, use xargs, which takes the file names that find produces and packs as many of them as will fit onto each grep command line, so that only a handful of grep instances do the work instead of thousands.

The command for this would be:

find /website -name "*.html" -type f | xargs grep "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}"

With 50,000 files, I estimate that the xargs version would be faster by a factor of at least 20, which matters on a system where users are waiting after giving a command. However, this may be overkill for a simple one-off operation, and a sharding strategy (by which each subdirectory of /website is processed one by one) may be more appropriate.
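One caveat with piping plain file names into xargs is that names containing spaces get split apart. A hedged sketch of a more defensive variant follows. It assumes a find and xargs that support -print0 and -0 (GNU and BSD versions do), uses grep -l to print only the names of matching files, and runs against an invented throwaway tree rather than /website.

```shell
# Throwaway tree with a space in the path (all names are made up).
site=$(mktemp -d)
mkdir -p "$site/help pages"
echo 'Call 555-123-4567' > "$site/help pages/contact us.html"
echo 'No numbers here'   > "$site/index.html"

# Extended-syntax form of the phone number pattern.
phone_ere='[[:digit:]]{3}[ -]?[[:digit:]]{3}[ -]?[[:digit:]]{4}'

# -print0 and -0 pass NUL-terminated names, so spaces survive intact.
# grep -l lists each matching file once instead of printing matching lines.
matches=$(find "$site" -type f -name '*.html' -print0 |
    xargs -0 grep -l -E "$phone_ere")

printf '%s\n' "$matches"
```

The output is the single path to "contact us.html", which is the form the editorial staff asked for.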
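The sharding idea can be sketched as a loop over the top-level subdirectories. This is only one way to do it, under assumptions of mine: the directory names are invented, the report file is hypothetical, and files sitting directly in the top-level directory are deliberately not covered by the loop.

```shell
# Throwaway tree with three shards (all names are made up).
site=$(mktemp -d)
report=$(mktemp)
mkdir -p "$site/books" "$site/music" "$site/toys"
echo 'Call 555-123-4567' > "$site/books/help.html"
echo 'Call 555 987 6543' > "$site/music/help.html"
echo 'No numbers here'   > "$site/toys/help.html"

# Extended-syntax form of the phone number pattern.
phone_ere='[[:digit:]]{3}[ -]?[[:digit:]]{3}[ -]?[[:digit:]]{4}'

# Process one subdirectory at a time, appending matches to the report.
for shard in "$site"/*/; do
    shard=${shard%/}    # drop the trailing slash left by the glob
    # grep exits non-zero when a shard has no matches; that is not an error.
    find "$shard" -type f -name '*.html' -print0 |
        xargs -0 grep -l -E "$phone_ere" >> "$report" || true
done

cat "$report"
```

Each shard is a small, restartable unit of work, so a failure halfway through only costs you one subdirectory rather than the whole run.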

What I really love about this story is that it fondly reminds me of the pain that I went through in the early years of my career. Problems that appear so trivial to me now used to really stump me. These include simple tasks like configuring web servers, piggy-backing data transfer through preexisting protocols, compression, security schemes, and a million different strategies for optimization. They all used to take me several weeks or months of effort and lots of travels down blind alleys to figure out.

Now, as soon as I come across a problem, I can invariably match a pattern to it (based on scenarios I’ve experienced in the past). Even when a pattern is not obvious, at the very least I can very quickly formulate an efficacious plan to arrive at the desired solution. Experience, which can be defined as the collection of your failures and ‘wasted’ efforts in the past, is useful for something after all, as long as you’re willing to accept that you have a situation, and that a solution is invariably available. Without that guiding hope, you’re sunk!

It was Nietzsche who uttered the famous quote

what does not kill me, makes me stronger

I can attest to the truth of this, at least from the point of view of a computer scientist who has suffered many ‘near deaths’, as I have a very unhealthy tendency to pick the riskiest, most challenging projects to work on. Instead of skirting around the problems like most rational people, I’m the mouse that really feels that we ought to bell the cat. If you keep the bell silent enough during deployment, arrange for the cat to be sedated, and engineer an attachment device that would not be taken off by a panicky cat, it just may be possible to pull this off.

.. and, no, I cannot possibly be experiencing the ‘second system effect‘ on every project that I work on!

Open Source Search Engines: Talk from Summercamp ’09

August 3rd, 2009 No comments

My talk at FOSSLC summercamp ’09 is now available via ePresence. Spend 20 minutes, and learn about the magic behind search engines. Full money back guarantee.. (wait, you didn’t pay me anything!)

Securing Government Contracts

August 3rd, 2009 No comments

If you’re looking for clients who can pay you for your services, you really should consider dealing with the entity that you’re invariably doing business with every year, the Canadian Government (if you’re in Canada that is). After all, you do give them lots of money in taxes!

Canada’s governments account for 40% of every dollar produced in Canada. Yet many small businesses don’t even consider the public sector as a market for their goods and services, while in reality there is a strong chance that some of this business could easily flow their way. However, selling to the Government is a bit different from what we may be used to. You don’t have the usual set of end-users, influencers and decision makers that make up the bulk of your interactions with corporate and consumer clients that your firm normally deals with. Let me take this opportunity to demystify the procedure and launch you on your path to success in selling to the Government.

I’ll cover the following points:

  1. How do you find opportunities to bid for?
  2. How do you decide if this candidate opportunity is one that you’d like to bid for?
  3. How do you win the business for your firm?
  4. Some gotchas that you need to avoid

Opportunities are knocking every week
First, read the Contracts Canada Business Primer. This site is an on-line primer that spells out the basic rules and answers many questions you’ll have. Then you may want to register (you only need to do this once) in order to be considered for government business. You can do this on-line at the Supplier Registration Information (SRI) site, and you’ll need either your business number or your GST number.

Once you’ve registered, you’re eligible to bid for contracts. Congratulations!

Now mosey over to MERX, which is Canada’s official, public-sector electronic tendering service. ‘Tendering’ is a fancy word for reverse auctions, where people bid for the privilege of doing business with Her Majesty’s Government, and (normally) the lowest bid wins! MERX has a nice interface for finding opportunities, and as you’re already registered, you are eligible to download the full RFP (Request for Proposal) package for free (at least for Government opportunities).

Is this for real? Am I in the running?
The next step is to figure out if you can bid for the RFP. In most cases, you can, as there is no bar (usually) on who can bid for a project. The real question is, do you want to bid for this particular contract? Applying for a contract is a complex activity, where you need to gather together a lot of supporting information, and weave it together into a credible proposal that outlines a plan for fulfilling the service within a competitive rate. Fortunately, there is a simple checklist that you can follow:

  1. Is my company capable of offering this service?
     Only apply for those opportunities where you can complete the work you’ve committed to. The alternative is to be sued for breach of contract, which is not a pretty situation to be in.

  2. Does my company meet the evaluation criteria? (Typically section 4 of the RFP)
     Sometimes there are mandatory conditions, such as security or HRSDC certifications, requirements for financial stability, or proof of having provided similar services in the past, which you need to supply. If you cannot meet the evaluation criteria yourself, you may want to consider finding a partner to apply for this contract with, or subcontracting. You don’t have to be Sun Tzu to realise that a weak ally speaks with the voice of the stronger ally! This principle works equally well in business and in warfare. I speak more about this in the next section.

  3. Does my company accept the terms and conditions?
     Parts 1 to 5 of the RFP describe how you can apply for this opportunity, and how your bid is going to be evaluated. Part 6 contains the contract that you’ll have to agree to, and its terms are non-negotiable. You should only apply if you’re happy to sign on the dotted line; if you’re not, and you bid, all you can do is withdraw your bid before the contract is awarded!

  4. Is my company capable of winning the contract against the competition?
     MERX allows you to track the list of other companies that have downloaded the RFP. You can easily find out if you’re going mano-a-mano against Jo Bloe, or IBM.

  5. Does my company accept the payment and pricing method?
     The Government typically pays Net-30, which means 30 days after you’ve invoiced the Government or after the delivery of services has been made (whichever is later). Some departments are notorious for paying a year later! This can kill your firm if you’ve got working capital issues.

Are you a winner?
Firstly, let’s cover the basics of the bid. There are three parties to the decision-making process.

  1. The end-users are the Government employees who requested this product or service. They are completely anonymous, and you only find out who they are once the contract is awarded to you. The only way you can request information from them is via the procurement officer, who will only contact them if he or she cannot answer the question personally. The end-users provide the technical and managerial evaluation of your solution and your firm, and can (typically) only provide a pass/fail judgment. They do not determine if the contract is awarded to you, but can kill your bid if they specify that your proposal does not fulfill their requirements. This evaluation is typically done by three individuals independently.
  2. The procurement officer (typically from PWGSC) then carries out the financial evaluation, and usually assigns the project to the lowest bid unless a more complex evaluation methodology applies, which is always specified in the RFP.

Gotchas.. avoid these like the plague!
The following mistakes can kill your bid:

  • Being late to submit the bid. If the bid says 2:00pm on Friday, and you’re a minute late, tough tamales… you’re out of the running.
  • Forgetting to sign the document! Imagine that: you spend a week getting the bid together, and forget to sign. Ouch!
  • Not meeting the mandatory requirements. If the RFP requires some sort of certification, or that you provide 100% of the requested items, in a purple box with yellow polka dot pattern… well, you’d better cover all these mandatory requirements in your proposal, and ideally prove that you can provide them! If you are convinced that the requirements are unreasonable, be sure to inform the procurement office of that. Finding Ruby on Rails experts with 10 years’ experience is impossible, and procurement officers are not experts in all areas. They will be happy to amend the RFP, and provide an extension.
  • Not submitting a single-sided paper bid! If you submit an electronic bid, you’re trusting the procurement office to print out your bid, make copies, etc. Well, if a page goes missing, who is to blame? And what if that page deals with a mandatory requirement, which would effectively kill your chances of securing this contract?
  • Not separating the financial and technical aspects of this proposal (ideally in different binders, or via a cardboard separator). Remember, the end-users do not make decisions around the finances; all they do is decide if your proposal meets their requirements or not. Allowing them to view the financial info can influence their recommendations, which is why most RFPs require that the financial information is kept separate.
  • Sending the proposal to the wrong address! Confusingly, there are three different addresses on the front cover page of the RFP. The one you need to send your proposal to is marked clearly as ‘return bids to’.. but the history books are full of incidents where the bids were sent to either the procurement officer, or the service/goods destination.

If you win the bid, congratulations! You’ve made your first successful step in doing business with the Government. If you have not won, try again; there is lots of business to go around. Be sure to ask for a debrief (which is your right) to learn why you lost, and be polite when asking for it, as it always helps. If this has helped you, please be sure to comment below. If you need any further information, I’ll be happy to dig up the answers for you, just ask!