# Analysis of "spider" traffic based on web logs
We require a unique analysis of search engine "spider" activity to our website. Our goal is to identify trends of which pages on our site are being indexed.
## Background
The [login to view URL] (p4a) website is a subscription-based site with a large volume of publicly available content. A long-term challenge has been to ensure good search-engine visibility and to identify trends in which pages are being indexed. According to Google Webmaster Tools, only a (large) portion of our pages is currently indexed; our goal is to have a much larger percentage indexed.
We wish to have an analysis of the existing log files to identify trends in the pages examined by the spiders, as well as in search-engine referrals. For the moment, we are interested only in Google-based traffic.
Pages in the p4A site represent antique "items" which have been sold at auction. Each is identified with an internal identifier (itemID) as well as a display version of the identifier (displayID). The URLs encode these identifiers as follows:
`/advertising/gas-stations-related/[login to view URL]`
where the displayID is the trailing "B199152" in this case (always matched by the regular expression `/^.*-([A-H]\d+)\.htm$/`).
We will provide our perl or php function to convert from the displayID to the itemID.
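The displayID extraction described above can be sketched as follows. This is an illustrative example in Python (the deliverable itself would be perl or php, and the filename in the usage line is hypothetical, since the real URLs are redacted in this brief):

```python
import re

# Regular expression from the spec: the displayID is a letter A-H
# followed by digits, immediately before the trailing ".htm".
DISPLAY_ID_RE = re.compile(r'^.*-([A-H]\d+)\.htm$')

def extract_display_id(path):
    """Return the displayID from an item URL path, or None if absent."""
    m = DISPLAY_ID_RE.match(path)
    return m.group(1) if m else None

# Hypothetical item URL; "B199152" is the displayID cited in the spec.
print(extract_display_id("/advertising/gas-stations-related/gas-pump-B199152.htm"))  # → B199152
```

The returned displayID would then be passed through the provided conversion function to obtain the itemID.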
The web server has "gd" libraries installed.
Our web logs are formatted as the following example:
```
[login to view URL] - - [31/May/2011:13:29:52 -0400] "GET /tools-measuring-devices/scales/[login to view URL] HTTP/1.1" 200 12737 [login to view URL] 66.249.71.16.1306862992103524 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +[login to view URL])" [Tms=102040]
[login to view URL] - - [31/May/2011:13:29:31 -0400] "GET /toys/construction/[login to view URL] HTTP/1.1" 200 14871 [login to view URL] 76.223.250.220.1306862890637346 "[login to view URL]" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; BO1IE8_v1;ENUS)" [Tms=88721]
```
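The two sample lines illustrate the two traffic classes we care about: a Googlebot crawl (identified by the user-agent string) and an ordinary visitor referred by Google (identified by the referrer field). A minimal classifier for one already-split log entry might look like the following Python sketch; the field names and the exact referrer pattern are assumptions, since the URLs in the samples above are redacted:

```python
import re

# A hit is "spider" traffic if the user-agent names Googlebot,
# and "referral" traffic if the referrer is a Google property.
GOOGLEBOT_RE = re.compile(r'Googlebot', re.IGNORECASE)
GOOGLE_REFERRER_RE = re.compile(r'https?://(?:www\.)?google\.', re.IGNORECASE)

def classify_hit(user_agent, referrer):
    """Return 'spider', 'referral', or None for a single log entry."""
    if GOOGLEBOT_RE.search(user_agent):
        return 'spider'
    if GOOGLE_REFERRER_RE.search(referrer):
        return 'referral'
    return None
```

In the real program this would sit behind a line parser that splits out the request path, referrer, and user-agent fields from each log line.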
## Task Requirement
The programmer shall provide a perl or php program which will use the web logs as input and produce the requested graph as output. We anticipate this program will be run from the command line on an as-needed basis, against web logs covering a variety of timeframes.
We are envisioning a graphical representation of the "spider" traffic and "referral" traffic, with the itemID on the X-axis and number of "hits" on the Y-axis: one line representing search-engine views (identified by browser string) and another line representing search-engine referrals (identified by referring URL). In the case of referring URLs, we are interested only in the fact that a particular hit was referred by Google; the particular referring URL or search pattern is not of specific interest at this time.
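The two lines of the graph reduce to two per-itemID tallies. As an illustrative Python sketch (the `records` intermediate and its `'spider'`/`'referral'` labels are assumptions about how the parsing stage would hand off its results):

```python
from collections import Counter

def aggregate_hits(records):
    """Tally hits per itemID for each traffic type.

    `records` is an iterable of (item_id, kind) pairs, where kind is
    'spider' or 'referral'; returns two Counters keyed by itemID.
    """
    spider, referral = Counter(), Counter()
    for item_id, kind in records:
        if kind == 'spider':
            spider[item_id] += 1
        elif kind == 'referral':
            referral[item_id] += 1
    return spider, referral
```

The two counters map directly onto the two requested plot series (X = itemID, Y = hits); the server's gd libraries, or any equivalent charting library, could then render them to the output image.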
## Deliverable
The programmer shall deliver a perl or php program which will run from the command line, using the web logs as input (in the format described above) and producing a graph in jpg, png, or gif format representing the number of hits as described above.
## Further information
Based on the results of this task, it is conceivable that the scope could be expanded or modified. We are also looking for a longer-term programmer to assist with similar tasks on an ongoing basis.