Devex is seeking a data mining, data parsing and web-scraping expert to assist in a project focused on linking together related data in our system and tying it to data available on the web.
Qualified candidates will be provided with more detailed information. The core skills needed to successfully complete this project include:
1. Database, data mapping, and parsing skills to extract data from a variety of file types (XML, PDF, HTML) to populate tables with information about international projects.
2. Advanced string comparison logic to match company names between this data and the Devex company database in cases where the strings do not match exactly.
3. XML and advanced text parsing skills to go through PDFs and HTML files, extract the necessary data, and store it in the database.
4. Web-scraping skills for the few cases where the data is stored in an online database.
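As an illustration of the name matching described in item 2, here is a minimal sketch in Python using only the standard library. It is not a prescribed approach: the suffix list, the 0.85 threshold, and the function names (normalize, similarity, best_match) are all assumptions made for the example, and the real data may call for different rules.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore; extend it as the data requires.
LEGAL_SUFFIXES = {"inc", "incorporated", "ltd", "limited", "llc",
                  "corp", "corporation", "co", "gmbh", "plc"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]+", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a, b):
    """Return a score between 0.0 and 1.0 for two company names."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        # Avoid treating two names that normalize to nothing as identical.
        return 0.0
    return SequenceMatcher(None, na, nb).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate, or None.

    The threshold is an assumed starting point and would need tuning
    against a hand-checked sample of real records.
    """
    scored = [(c, similarity(name, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

For example, best_match("ACME Incorporated", ["Acme, Inc.", "Apex Ltd"]) selects "Acme, Inc." because both names normalize to the same string. Scores near the threshold are best queued for manual review rather than accepted automatically.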
I estimate the level of effort for the project to be 3-5 weeks. It will cover data gathering, parsing, and matching for 5 different data sets (source information will be provided to the successful bidder). The number of sources per format is as follows:
XML: 1
PDF: 1
HTML: 1
Excel: 1
Web-scraping: 1
Each source will have to go through 3 steps:
1. Process the data: extract it and organize it in one database. NOTE: The data architecture is already complete. The output of this phase would be one (or more) parsers per institution. We are agnostic about the technology used, but we use Ruby on Rails for our other applications.
2. Tie the information in the resultant database of projects to company records in the Devex database (this could be done by matching on URL or email domain, or by using basic string comparisons of the company name). Access will be given to the necessary tables from our database.
3. Tie the information in the resultant database of projects to research reports in the Devex database. These are the reports that notify companies of the availability of the contract. Once the companies create proposals and bid, one or more companies win the contract. I would like to tie the award to the project notice in this phase.
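The URL or email domain matching mentioned in step 2 could look roughly like the following Python sketch. The shape of the company records (pairs of id and website) and the function names are assumptions for illustration only; the actual Devex tables are not described here.

```python
from urllib.parse import urlparse


def domain_of(value):
    """Extract a lowercase host from an email address or URL.

    A leading 'www.' is dropped. Returns None if no host can be found.
    """
    value = value.strip().lower()
    if not value:
        return None
    if "@" in value and "://" not in value:
        # Treat as an email address: keep the part after the last '@'.
        host = value.rsplit("@", 1)[1]
    else:
        # urlparse only fills in hostname when a scheme or '//' is present.
        if "://" not in value:
            value = "//" + value
        host = urlparse(value).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host or None


def build_domain_index(companies):
    """Map domain -> company id.

    `companies` is assumed to be an iterable of (company_id, website)
    pairs; the first company seen for a domain wins.
    """
    index = {}
    for company_id, website in companies:
        domain = domain_of(website or "")
        if domain:
            index.setdefault(domain, company_id)
    return index


def match_by_domain(contact, index):
    """Look up a project's contact URL or email in the domain index."""
    domain = domain_of(contact)
    return index.get(domain) if domain else None
```

With an index built from [(1, "http://www.example.org")], a project contact of "jane@example.org" would resolve to company 1. Domains of free email providers would need to be excluded before matching, and subdomains are not collapsed in this sketch, so string comparison of the company name remains a useful fallback.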
Thanks,
Kami
Dear Customer:
I have written several applications with web-scraping capabilities in which complex data had to be extracted and stored in databases for later processing. It would be helpful, though, to know a few more details about the scope of the project, the size of the data sources, etc.
Regards,