Web spider (crawler) that gathers information about websites in a specific geographic location (country).
The spider will take an initial seed list of domain names and expand it by crawling links and related sites that are judged pertinent to the same geographic area.
The factors for determining site location are:
Top-level domain; whois information for gTLDs; site IP address or DNS; geographical terms (e.g. London/England/UK/English/British); and a manual fudge factor.
"Expensive" information such as DNS and whois should be cached (with a specified expiry) to redule load.
The crawl rate should be controllable in terms of the number of sites visited per day/hour, it should be possible to restrict the hours during which the crawl operates, and the crawl process must be interruptible and resumable across reboots etc.
The crawl process should be ongoing, but a site should not be recrawled until a predetermined period of time has elapsed.
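One possible shape for this, sketched below under assumptions (the sites table, the crawl_site() function and all of the limit values are illustrative only), is a cron-driven batch that reads its entire state from the database, so stopping and restarting loses nothing:

    <?php
    // Sketch of a cron-driven crawl batch. Assumed table: sites(domain, last_crawled, ...).
    // $pdo is an open PDO connection; crawl_site() is a hypothetical function that crawls one
    // site and sets last_crawled = NOW() when it finishes.
    $sites_per_run = 200;            // example crawl-rate limit per cron run
    $recrawl_days  = 30;             // example: do not revisit a site more often than this
    $allowed_hours = range(1, 6);    // example: only crawl between 01:00 and 06:59

    if (!in_array((int) date('G'), $allowed_hours)) {
        exit(0);                     // outside the permitted crawl window; cron retries later
    }

    $sql = sprintf(
        'SELECT domain FROM sites
          WHERE last_crawled IS NULL
             OR last_crawled < DATE_SUB(NOW(), INTERVAL %d DAY)
          ORDER BY last_crawled
          LIMIT %d',
        $recrawl_days, $sites_per_run);

    foreach ($pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN) as $domain) {
        crawl_site($domain);         // progress lives in the DB, so a reboot simply resumes here
    }
    ?>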
The spider should populate the database with information such as the domain name, site title, description and keywords, plus scores for each of the geographic criteria above. It will compute a "georelevance" score as a weighted sum of these factors. It will not catalog sites that fall below a preset relevance threshold, and will not follow links on sites that fall below a second (lower) threshold.
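A minimal sketch of the intended calculation follows; the weights, thresholds and the $site_scores input are illustrative assumptions that would be tuned with the buyer, not fixed values:

    <?php
    // Sketch of the weighted-sum georelevance calculation. $site_scores is assumed to hold
    // per-factor scores in the 0..1 range, produced by the crawler for the current site.
    $weights = array(
        'tld'       => 0.30,   // top-level domain (e.g. .co.uk)
        'whois'     => 0.20,   // registrant country from cached whois (gTLDs)
        'ip_dns'    => 0.20,   // IP address / DNS geolocation
        'geo_terms' => 0.25,   // geographic terms found on the site
        'manual'    => 0.05,   // manual fudge factor
    );

    function georelevance(array $scores, array $weights)
    {
        $total = 0.0;
        foreach ($weights as $factor => $weight) {
            $total += $weight * (isset($scores[$factor]) ? $scores[$factor] : 0.0);
        }
        return $total;
    }

    $score = georelevance($site_scores, $weights);

    $catalog_threshold = 0.5;    // below this: do not catalog the site
    $follow_threshold  = 0.3;    // below this lower value: do not follow its links either

    $catalog_site = ($score >= $catalog_threshold);
    $follow_links = ($score >= $follow_threshold);
    ?>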
For IP geolocation, a third party service may be utilised.
Caching of page information or content keyword indexing is not required at this stage.
There should be a screen where the directory administrator(s) can review new sites, allocate each a category (e.g. Entertainment -> Theatre) and a geographic sub-region (e.g. State/Province -> County -> City/Town) within the directory, add manual review comments, and edit keywords and/or descriptions. Existing (reviewed) sites should not be overwritten, but updated scores should be available to prompt re-review.
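A rough sketch of how the review-side data might be stored follows (all table and column names are assumptions): the category tree and the geographic sub-regions each fit a simple parent/child table, and the reviewer's edits live in their own table so that re-crawls can update scores without overwriting them:

    <?php
    // Sketch of the review-side schema; names are assumptions, not requirements.
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS categories (
            id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            parent_id INT UNSIGNED NULL,       -- NULL = top level, e.g. Entertainment -> Theatre
            name      VARCHAR(100) NOT NULL
        )");
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS regions (
            id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            parent_id INT UNSIGNED NULL,       -- State/Province -> County -> City/Town
            name      VARCHAR(100) NOT NULL
        )");
    $pdo->exec("
        CREATE TABLE IF NOT EXISTS site_review (
            domain          VARCHAR(255) PRIMARY KEY,
            category_id     INT UNSIGNED NULL,
            region_id       INT UNSIGNED NULL,
            edited_title    VARCHAR(255) NULL, -- reviewer overrides; the crawler never touches these
            edited_keywords TEXT NULL,
            review_comments TEXT NULL,
            reviewed_at     DATETIME NULL
        )");
    ?>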
The system should produce an automatic directory (and simple keyword search facility) for use on a directory website.
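A minimal sketch of the keyword search, assuming the sites table carries title, description, keywords, a georelevance score and a reviewed flag (names are assumptions), could be as simple as:

    <?php
    // Sketch of the public keyword search. A plain LIKE match is shown for simplicity;
    // a MySQL FULLTEXT index is a natural upgrade if search volume warrants it.
    function search_directory($pdo, $query)
    {
        $like = '%' . $query . '%';
        $stmt = $pdo->prepare(
            'SELECT domain, title, description
               FROM sites
              WHERE reviewed = 1
                AND (title LIKE ? OR description LIKE ? OR keywords LIKE ?)
              ORDER BY georelevance DESC
              LIMIT 50');
        $stmt->execute(array($like, $like, $like));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
    ?>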
Capacity of the directory itself should be in excess of 100,000 sites (crawl capacity should be larger).
The robots.txt exclusion standard should be obeyed.
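A simplified sketch of a robots.txt check is shown below; the 'geospider' user-agent token is an assumption, and a production version would also cache the fetched file and honour any Crawl-delay directive:

    <?php
    // Sketch: fetch a site's robots.txt and collect the Disallow rules that apply to us.
    function disallowed_paths($domain, $agent = 'geospider')
    {
        $txt = @file_get_contents('http://' . $domain . '/robots.txt');
        if ($txt === false) {
            return array();                    // no robots.txt: everything is allowed
        }
        $applies = false;
        $rules   = array();
        foreach (preg_split('/\r\n|\r|\n/', $txt) as $line) {
            $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
            if ($line === '') continue;
            if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
                $ua      = strtolower(trim($m[1]));
                $applies = ($ua === '*' || $ua === strtolower($agent));
            } elseif ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
                if (trim($m[1]) !== '') $rules[] = trim($m[1]);
            }
        }
        return $rules;
    }

    function is_allowed($url_path, array $rules)
    {
        foreach ($rules as $prefix) {
            if (strpos($url_path, $prefix) === 0) return false;   // matches a Disallow prefix
        }
        return true;
    }
    ?>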
The system should have whitelists (sites always to visit/include) and blacklists (sites never to visit/include).
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
Application to run under RedHat Enterprise Linux 2.1 and utilise only the normally available tools.
Use of the standard LAMP environment (Linux, Apache, MySQL, PHP) is preferred, though Perl can also be used for backend functions (limited to standard CPAN modules).