Please tell me:
1) Your chosen development language (Python, Java, JavaScript, C#, etc.)
2) Are you able to develop without a browser, or with a headless browser if necessary? (Once the RDP session closes, the GUI terminates.)
3) If you are using Python, please tell me which libraries you are proficient with (for example: BS4, Scrapy, Selenium, lxml, Requests). If you are using Java, let me know which libraries you use (for example: jsoup, Selenium). If you are using another language, let me know which libraries you use for it.
Project #1:
• 7 root sites and 48 subsites in total.
• The script will be run on Windows Server 2016.
• As you know, the RDP session will terminate, so the script will sometimes need to use a headless browser in the background.
• The script takes a list of names and generates direct links for 99%-100% of them (and crawls the remaining sites). There will be many different input files; the format always remains the same, but the data/names will differ.
• All of the data is in a table on the site
• All output formats and documentation are written
• Basic features such as enabling/disabling sites, custom crawl delay, pause, play, skip, on-screen status display, and custom timeout limits/retry attempts are required.
• Proxy rotation functionality is required.
• 1 site has a login.
• Should be optimized for efficient use of memory and CPU, and should use API links when possible.
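As a rough illustration of the link-generation and proxy-rotation requirements above, here is a minimal Python sketch. The URL template, the proxy strings, and the function names (`build_links`, `plan_requests`) are all hypothetical; each real site would need its own link pattern, and the actual request logic is left out.

```python
from itertools import cycle
from urllib.parse import quote_plus

# Hypothetical URL template; every real site would have its own pattern.
URL_TEMPLATE = "https://example.com/search?name={name}"


def build_links(names, template=URL_TEMPLATE):
    """Return one direct link per non-empty name, URL-encoding each name."""
    return [
        template.format(name=quote_plus(name.strip()))
        for name in names
        if name.strip()
    ]


def plan_requests(names, proxies):
    """Pair each generated link with the next proxy in round-robin order."""
    rotation = cycle(proxies)  # loops over the proxy list indefinitely
    return [(url, next(rotation)) for url in build_links(names)]
```

Each `(url, proxy)` pair could then be handed to whatever HTTP client the script uses; blank lines in the input file are skipped so they do not consume a proxy slot.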
Project #2:
• 5 root sites, 0 subsites.
• The script will submit the same input file to the sites and use each site's "download to Excel" feature.
I am the project manager and a Windows System/Networking Administrator with strong IT expertise, high project feedback, and 5 years of experience here. I'll provide plenty of testing and system resources, such as a few Windows VPSs. Contact me if you are serious about the project. Python is preferred but not required. Long-term work and more projects are available. An unfinished script is available.
With respect to this project I would like to present myself as a candidate for your consideration.
1) Development language: Java
2) Yes, I am capable of developing without a browser.
3) I will be using Java
Hello,
Greetings!
I am an individual developer having experience of 3+ years in scraping.
With a proven track record of successful achievements, I am pleased to present my application for your consideration as a Freelancer.
Please have a look at my profile and portfolio to get an idea of my capabilities and some of my previous work on Freelancer.
This bid is approximate. I'll give you the exact budget and price after discussing the details with you.
Many Thanks & Regards,
Rishi A.
Hi, we are all senior software developers. We've just checked your project requirements and are able to carry out your project. Please contact us to discuss more project details...
Develop a Scraper Script - Project 1: 50 Sites & Project 2: 5 sites
BeautifulSoup, Java, Python, Scrapy, Web Scraping
Hi,
1. I will use Python.
2. Scripts can be run from the command line, so this will run easily on Windows Server.
3. I am comfortable with both Beautiful Soup (bs4) and Scrapy. I can use either one, whichever you prefer.
I can take care of proxies and the other things mentioned in the project description.
Let's discuss it over chat.
Thanks
I've done some scraping and data integration projects as part of my work. I've also implemented end-to-end browser tests for some projects, using Selenium.
I have more Java experience than Python, but for this I would use Python (Requests + bs4). wget has a recursive mode that might also be useful for this. If there is AJAX or other dynamic content on the page, then it might be necessary to use Selenium or something with headless Chrome. In that case I would probably go with Java. SQLite or some other persistence layer might make sense if pause/retry is needed, or if you want the program to continue after a restart.
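A minimal sketch of the SQLite persistence idea, using Python's standard `sqlite3` module. The `jobs` table, its columns, and the helper names are assumptions made for illustration, not part of any existing script. The point is only that pending work survives a pause or restart when it lives in a database file instead of in memory.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the job store; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn, urls):
    """Add URLs as pending; URLs already stored keep their current status."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def mark_done(conn, url):
    """Record that a URL was scraped so it is skipped after a restart."""
    conn.execute("UPDATE jobs SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def pending(conn):
    """Return the URLs still to be scraped, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM jobs WHERE status = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]
```

On startup the script would call `enqueue` with the full link list and then loop over `pending(conn)`, so re-running it after an interruption only touches the URLs that were not finished.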
I've used rate limiting and retry policies in Java before, so I think I would know how to approach those tasks.
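For reference, a retry policy with exponential backoff can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch` stands in for whatever function actually downloads a page, the default retry count and delay are arbitrary, and catching every `Exception` is a simplification that a real script would narrow to network errors.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**attempt and retry.

    `fetch` is a hypothetical callable supplied by the caller. `sleep` is
    injectable so the waiting can be replaced in tests.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # simplification: narrow this in real code
            last_error = exc
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Setting `base_delay` from the user's custom crawl delay and `retries` from the configured retry limit would tie this to the settings requested in the project description.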