Geolimited Web Spider/Directory

A web spider (crawler) that gathers information about websites in a specific geographic location (a country).

The spider will take an initial seed listing of domain names, and expand this list by crawling links and related sites that have been determined to be pertinent to the same geographic area.

The factors for determining site location are:

Top-level domain, whois information for gTLDs, site IP address or DNS, geographical terms (e.g. London/England/UK/English/British), and a manual fudge factor.
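As a rough illustration, two of these factors (top-level domain and geographical terms) could each be reduced to a score between 0 and 1. This is a minimal sketch, not part of the brief: the names `COUNTRY_TLDS`, `GEO_TERMS`, `tld_score` and `term_score` are hypothetical, and the UK values are taken only from the example terms above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical configuration for a UK-focused crawl (assumption, not from the brief).
COUNTRY_TLDS = {"uk"}
GEO_TERMS = {"london", "england", "uk", "english", "british"}


def tld_score(url: str) -> float:
    """Return 1.0 when the host ends in a target country-code TLD, else 0.0."""
    host = (urlsplit(url).hostname or "").lower()
    tld = host.rsplit(".", 1)[-1]
    return 1.0 if tld in COUNTRY_TLDS else 0.0


def term_score(text: str) -> float:
    """Return the fraction of configured geographic terms found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(GEO_TERMS & words) / len(GEO_TERMS)
```

Whois and IP geolocation would need network lookups and are left out here; they could return scores on the same 0 to 1 scale.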

"Expensive" information such as DNS and whois should be cached (with a specified expiry) to reduce load.
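One possible shape for such a cache is sketched below. It is an in-memory illustration only; a real deployment would presumably persist entries in the database. The class name `ExpiringCache` and the `fetch` callback are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ExpiringCache:
    """Cache lookup results and refetch them once they are older than the TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, object]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[str], object]) -> object:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive lookup
        value = fetch(key)  # e.g. a DNS or whois query (assumed callback)
        self._store[key] = (now + self.ttl, value)
        return value
```

Passing the clock in as a parameter keeps the expiry logic testable; the default uses the system time.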

The crawl should be controllable as regards the number of sites per day or hour. It should be possible to restrict the hours at which the crawl operates, and the crawl process must be interruptible across reboots etc.
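The hour restriction and the per-hour limit could be checked before each fetch, roughly as follows. This is a sketch under assumptions: the function names and the idea of passing in a running count of sites fetched this hour are illustrative, not specified in the brief.

```python
from datetime import datetime, time


def within_crawl_hours(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls in the allowed window; the window may wrap past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # e.g. 22:00 to 06:00


def may_fetch(sites_this_hour: int, max_per_hour: int,
              now: datetime, start: time, end: time) -> bool:
    """Allow a fetch only inside the window and below the hourly site budget."""
    return within_crawl_hours(now, start, end) and sites_this_hour < max_per_hour
```

For example, with a window of 22:00 to 06:00, a call at 23:30 is allowed and a call at noon is not.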

The crawl process should be ongoing, but it should not recrawl a site until a predetermined period of time has passed.
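One way to get this behaviour is to keep the crawl queue in a database table and select only sites that are due; because the state lives on disk, the same table would also let the crawl pick up again after a restart. The sketch below uses Python's built-in `sqlite3` purely for illustration (the brief prefers MySQL), and the table layout, function names and 30-day interval are all assumptions.

```python
import sqlite3

RECRAWL_AFTER_SECONDS = 30 * 24 * 3600  # hypothetical 30-day recrawl interval


def open_frontier(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that tracks when each site was last crawled."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sites ("
        " domain TEXT PRIMARY KEY,"
        " last_crawled REAL)"  # Unix time; NULL means never crawled
    )
    return db


def add_site(db: sqlite3.Connection, domain: str) -> None:
    db.execute("INSERT OR IGNORE INTO sites (domain) VALUES (?)", (domain,))
    db.commit()


def mark_crawled(db: sqlite3.Connection, domain: str, now: float) -> None:
    db.execute("UPDATE sites SET last_crawled = ? WHERE domain = ?", (now, domain))
    db.commit()


def due_sites(db: sqlite3.Connection, now: float, limit: int = 10) -> list:
    """Return never-crawled sites first, then sites whose interval has elapsed."""
    rows = db.execute(
        "SELECT domain FROM sites"
        " WHERE last_crawled IS NULL OR last_crawled <= ?"
        " ORDER BY last_crawled, domain"  # SQLite sorts NULLs first when ascending
        " LIMIT ?",
        (now - RECRAWL_AFTER_SECONDS, limit),
    )
    return [row[0] for row in rows]
```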

The spider should populate the database with information such as the domain name, site title, description and keywords, and scores for the above geographic criteria. It will compute a "georelevance" score based on a weighted sum of these factors. It will not catalog sites below a preset relevance threshold, and will not follow links on sites below a second (lower) threshold.
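The weighted sum and the two thresholds could look something like the sketch below. The weights, threshold values and factor names are placeholders chosen for illustration; the brief does not specify them.

```python
# Hypothetical weights per factor; they sum to 1 so the score stays within 0..1
# when every factor score is itself within 0..1.
WEIGHTS = {"tld": 0.4, "whois": 0.2, "ip": 0.2, "terms": 0.15, "manual": 0.05}

CATALOG_THRESHOLD = 0.5  # below this the site is not catalogued (assumed value)
FOLLOW_THRESHOLD = 0.3   # below this its links are not followed (assumed value)


def georelevance(scores: dict) -> float:
    """Weighted sum of the factor scores; missing factors count as 0."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


def decide(scores: dict) -> dict:
    """Apply both thresholds to one site's factor scores."""
    score = georelevance(scores)
    return {
        "score": score,
        "catalog": score >= CATALOG_THRESHOLD,
        "follow_links": score >= FOLLOW_THRESHOLD,
    }
```

With these placeholder values, a site that scores only on its top-level domain (0.4 overall) would have its links followed but would not be catalogued.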

For IP geolocation, a third party service may be utilised.

Caching of page information or content keyword indexing is not required at this stage.

There should be a screen where the directory administrator(s) can review new sites, allocate them a category (e.g. Entertainment->Theatre) and a geographic sub-region (e.g. State/Province->County->City/Town) in the directory, add manual review comments, and edit keywords and/or descriptions. Existing (reviewed) sites need not be overwritten, but updated scores should be available to prompt re-reviewing.

The system should produce an automatic directory (and simple keyword search facility) for use on a directory website.

Capacity of the directory itself should be in excess of 100,000 sites (crawl capacity should be larger).

[url removed, login to view] standard should be obeyed.

The system should have whitelists (sites always to visit/include) and blacklists (sites never to visit/include).
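A list check of this kind could run before any scoring, for instance as sketched below. The function name is hypothetical, matching is on the exact domain only, and giving the blacklist precedence when a domain is on both lists is an assumption, since the brief does not say.

```python
from typing import Optional, Set


def list_override(domain: str, whitelist: Set[str], blacklist: Set[str]) -> Optional[bool]:
    """Return True (always include), False (never include) or None (use the score)."""
    d = domain.lower().rstrip(".")  # normalise case and any trailing dot
    if d in blacklist:
        return False  # assumed precedence: blacklist wins over whitelist
    if d in whitelist:
        return True
    return None
```

A `None` result means neither list applies and the georelevance thresholds decide.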

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

## Platform

The application is to run under Red Hat Enterprise Linux 2.1 and utilise only the normally available tools.

Use of a standard LAMP environment (Linux, Apache, MySQL, PHP) is preferred, though Perl can also be used for backend functions (limited to standard CPAN modules).

Skills: Amazon Web Services, Engineering, Linux, MySQL, Perl, PHP, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

Employer information:
( 3 reviews ) Ireland

Project ID: #3274747
