I am looking to have custom software or a script built that scrapes the outgoing links (backlinks) from a particular website, which we call the "seed site".
This will be in 2 parts:
Part 1 : SCRAPER
Example: Let's consider [login to view URL] as the seed site. I want to scrape all the domains that [login to view URL] links out to, that is, all the domains that [login to view URL] gives a backlink to. For example, a post about the domain "[login to view URL]" is published on bbc, so that domain has a backlink from bbc. bbc links out to thousands of sites, and I want to extract all of those sites.
This should work not just for bbc but for any seed site, that is, for any site that I enter in the software.
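To illustrate what Part 1 is asking for, here is a minimal sketch of extracting the external domains a page links out to. It is not a required design: the class name OutgoingDomainExtractor, the regex-based href matching (a real HTML parser would be more robust), the stripping of a leading "www." and the example.* domains are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the distinct external hosts linked from one page.
class OutgoingDomainExtractor {
    // Simplified matcher for href="..." / href='...' inside <a> tags (assumption:
    // good enough for a sketch; an HTML parser would handle more edge cases).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static Set<String> extract(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        String seedHost = normalize(base.getHost());
        Set<String> domains = new TreeSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the page URL before reading the host.
                URI target = base.resolve(new URI(m.group(1).trim()));
                String scheme = target.getScheme();
                String host = normalize(target.getHost());
                boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
                // Keep only http(s) links that point away from the seed host.
                if (web && host != null && !host.equals(seedHost)) {
                    domains.add(host);
                }
            } catch (URISyntaxException e) {
                // Malformed link: skip it in this sketch.
            }
        }
        return domains;
    }

    // Lower-cases the host and drops a leading "www." (subdomains are otherwise kept).
    private static String normalize(String host) {
        if (host == null) {
            return null;
        }
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }
}
```

Called with the seed page's URL and its downloaded HTML, extract returns a sorted set of hosts. Links back to the seed host, mailto: links and malformed links are skipped.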
Part 2 : Check for domain metrics by Integration with API
After it scrapes these domains, I want to check the metrics of the extracted domains, such as PA, DA, TF, etc. This means the tool should integrate with the APIs of the [login to view URL], [login to view URL] and [login to view URL] services. It should also check domain availability for registration.
I am aware that many similar scripts have been built successfully on Freelancer. I would be glad to award this project to a developer who has built one.
__________________________________________________________________________
Inputs to the tool
------------------
* Mandatory - 1 or more seed urls
* Optional - Crawl depth (Default value = 0, max value = 10)
* Optional - TLD list (Default values = [.org, .net, .com, .info, .biz]). If the user enters TLDs, append them to the default ones.
* Optional - Number of parallel threads to use. (Default value = 6)
* Optional - Proxy server configuration
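
The inputs and defaults listed above could be captured in a small configuration object. This is only a sketch under assumptions: the class name CrawlConfig, its field names and the acceptsDomain helper are hypothetical, null is used to mean "use the default", and proxy settings are left out for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical holder for the tool inputs; applies the defaults from the list above.
final class CrawlConfig {
    static final int DEFAULT_DEPTH = 0;
    static final int MAX_DEPTH = 10;
    static final int DEFAULT_THREADS = 6;
    static final List<String> DEFAULT_TLDS = List.of(".org", ".net", ".com", ".info", ".biz");

    final List<String> seedUrls;
    final int depth;
    final int threads;
    final Set<String> tlds;

    // Pass null for depth, extraTlds or threads to fall back to the default value.
    CrawlConfig(List<String> seedUrls, Integer depth, List<String> extraTlds, Integer threads) {
        if (seedUrls == null || seedUrls.isEmpty()) {
            throw new IllegalArgumentException("At least one seed URL is required");
        }
        this.seedUrls = List.copyOf(seedUrls);
        this.depth = depth == null ? DEFAULT_DEPTH : depth;
        if (this.depth < 0 || this.depth > MAX_DEPTH) {
            throw new IllegalArgumentException("Crawl depth must be between 0 and " + MAX_DEPTH);
        }
        this.threads = threads == null ? DEFAULT_THREADS : threads;
        if (this.threads < 1) {
            throw new IllegalArgumentException("Thread count must be at least 1");
        }
        // User-supplied TLDs are appended to the defaults, never replacing them.
        Set<String> merged = new LinkedHashSet<>(DEFAULT_TLDS);
        if (extraTlds != null) {
            for (String tld : extraTlds) {
                String t = tld.trim().toLowerCase(Locale.ROOT);
                if (!t.isEmpty()) {
                    merged.add(t.startsWith(".") ? t : "." + t);
                }
            }
        }
        this.tlds = merged;
    }

    // True when the domain ends with one of the configured TLDs.
    boolean acceptsDomain(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        return tlds.stream().anyMatch(d::endsWith);
    }
}
```

For example, new CrawlConfig(List.of("https://seed.example.com"), null, List.of("io"), null) would use depth 0, 6 threads and the five default TLDs plus .io. The seed URL here is a placeholder.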
Output from the tool
--------------------
* CSV file with list of domain names scraped
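
One possible shape for the CSV output is a single "domain" column with one de-duplicated, sorted domain per row. The header name, the class name DomainCsv and the sorting are assumptions for illustration, not part of the requirement.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical helper: renders scraped domains as CSV text with a one-column header.
class DomainCsv {
    static String toCsv(Collection<String> domains) {
        StringBuilder sb = new StringBuilder("domain\n");
        // TreeSet removes duplicates and sorts the domains alphabetically.
        for (String d : new TreeSet<>(domains)) {
            sb.append(escape(d)).append('\n');
        }
        return sb.toString();
    }

    // Quotes a field only when it contains a comma, a quote or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

The returned text can then be written to disk, for instance with java.nio.file.Files.writeString (Java 11+) and a file name of your choice.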
Requirements
-------------
SCRAPER :
1) Take 1 or more seed urls as input via UI field or from a file
2) Take the crawl/scrape depth (e.g., 1, 2, 3 and so forth), to be set in a parameter field
3) Take the TLDs from a list, to be set in a parameter field (.org, .net, .com, .info, .biz; the customer needs to be able to add more TLDs of their choice)
4) It also needs to work with subdomains
5) Crawl the urls for backlinks, showing the progress (for example, a count of the processed urls) so the customer knows that something is happening and the tool is working
6) If the backlink is invalid (e.g., HTTP 404 not found), write it to a separate file
7) If the depth is 0, crawl only the seed url and domain. If the depth is 1, crawl backlink domain [login to view URL] depth is 3 count backlinks of the backlinks, and so forth.
8) Possibility to use proxies (to be set in a parameter field for proxies)
9) Use multiple threads to scrape
10) Save the invalid backlinks to a CSV file
11) Build a web application using JSP which will run on Tomcat. The WordPress site / pop-up window
a) should display the status of the scraping
b) should work in all browsers
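
The depth, threading, progress and invalid-link requirements above could fit together roughly as follows. This is a sketch under assumptions, not a prescribed design: the class name DepthLimitedCrawler is hypothetical, depth 0 is read as "fetch only the seeds", and the fetchLinks function stands in for the real HTTP download and link extraction, returning an empty Optional for an invalid URL (e.g., HTTP 404).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical level-by-level crawler; each depth level is fetched in parallel.
class DepthLimitedCrawler {
    final Set<String> visited = ConcurrentHashMap.newKeySet();    // URLs already fetched
    final Set<String> discovered = ConcurrentHashMap.newKeySet(); // links found on fetched pages
    final Set<String> invalid = ConcurrentHashMap.newKeySet();    // URLs that could not be fetched

    void crawl(List<String> seeds, int maxDepth, int threads,
               Function<String, Optional<List<String>>> fetchLinks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Set<String> frontier = new LinkedHashSet<>(seeds);
            for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                List<String> batch = new ArrayList<>();
                List<Future<Optional<List<String>>>> pending = new ArrayList<>();
                for (String url : frontier) {
                    if (visited.add(url)) { // skip URLs fetched at an earlier level
                        Callable<Optional<List<String>>> task = () -> fetchLinks.apply(url);
                        batch.add(url);
                        pending.add(pool.submit(task));
                    }
                }
                Set<String> next = new LinkedHashSet<>();
                for (int i = 0; i < batch.size(); i++) {
                    Optional<List<String>> links = await(pending.get(i));
                    if (links.isPresent()) {
                        discovered.addAll(links.get());
                        next.addAll(links.get());
                    } else {
                        invalid.add(batch.get(i)); // later written to the separate file
                    }
                }
                // Simple progress line so the user can see that the crawl is advancing.
                System.out.printf("depth %d: %d urls processed so far%n", depth, visited.size());
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
    }

    // Waits for one fetch; a failed or interrupted fetch is treated as invalid.
    private static <T> Optional<T> await(Future<Optional<T>> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        }
    }
}
```

Passing the fetcher in as a function keeps the crawl logic testable without network access. A proxy setting would belong inside the real fetcher rather than in this loop.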
DOMAIN METRICS CHECKER
1) Upload all the domains via the UI or a text file
2) It should check Moz (DA, PA) and Majestic (Trust Flow and Citation Flow), and check domain availability
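
One way to keep the metrics checker independent of any single provider is to put each service behind a small interface and merge the results per domain. Everything here is hypothetical: the MetricSource interface, the class name and the metric keys are illustrative only, and no real Moz, Majestic or registrar API call is shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregator: one result row per domain, merged from all providers.
class DomainMetricsChecker {
    // Assumed abstraction over one provider (a metrics API or an availability check).
    interface MetricSource {
        Map<String, String> lookup(String domain);
    }

    static Map<String, Map<String, String>> check(List<String> domains, List<MetricSource> sources) {
        Map<String, Map<String, String>> report = new LinkedHashMap<>();
        for (String domain : domains) {
            Map<String, String> row = new LinkedHashMap<>();
            for (MetricSource source : sources) {
                try {
                    row.putAll(source.lookup(domain));
                } catch (RuntimeException e) {
                    // One failing provider should not abort the whole batch.
                    row.put("error", String.valueOf(e.getMessage()));
                }
            }
            report.put(domain, row);
        }
        return report;
    }
}
```

Each real integration would then be one MetricSource implementation that calls the provider's documented API, with its own credentials and rate limits.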
Deliverables & Scopes
---------------------
Following are the deliverables the developer will provide to the employer:
1) A standalone Java program that does the scraping
2) A web page to enter the inputs mentioned above
Example of such an existing and working domain scraper:
[login to view URL]