Extract data from files compressed in gz archives

$30-250 USD

Completed
Posted almost 8 years ago

Paid on delivery
I need a script that can scan a large number of .gz files as fast as possible and extract certain data from them. The script must do this without having to decompress all the files (there are over 40,000 of them!). Speed is critical. I would prefer a solution in Python but am open to other languages, especially if they will run faster. Also, the source files must not be modified; the script should only check them for matches and write the results to a "results" file.

Every .gz file contains a single JSON file holding information on up to 750 different keywords/items, so the script needs to be able to recurse through the file and match the data found with the correct keyword.

Here is an example. The data I need to locate is entries in the JSON file that include the "images" tag. Here is a sample:

    {
      "date": "2016-06-10",
      "gl": "us",
      "hl": "en",
      "custom_id": "2",
      "keyword": "cat drinking water",
      "data": {
        "1": {
          "pos": 1,
          "href": "[login to view URL]",
          "title": "",
          "description": "",
          "tags": [ "images" ]
        },
        "2": {
          "pos": 2,
          "href": "[login to view URL]",
          "title": "Amazing Slow Motion Cat Drinking - YouTube",
          "description": "Discover the beauty of a cat in super slow motion thanks to a high definition ... Giant 6ft Water Balloon - The ...",
          "tags": [ "video" ]
        },
        "3": {
          "pos": 3,
          "href": "[login to view URL]",
          "title": "Problems With a Cat Drinking Excessive Water - Pets",
          "description": "Perhaps better known as finicky eaters, cats aren't prolific water drinkers. If your cat is drinking a lot of water , it could be a sign of a serious health issue including \u00a0...",
          "tags": []
    ...

As you can see, the first entry under "data" has the "images" tag.

Once the "images" tag is found, I need to get this information from the "top" of the entry, as seen in the example above:

    "gl": "us", "hl": "en", "custom_id": "2", "keyword": "cat drinking water",

This information should be saved to a text file in this format: "GL;HL;ID;keyword". So, based on this example, it would save this to the file: "US;EN;2;cat drinking water"

For every match found in all the .gz files, a new line should be written to the "results" file so everything is stored in a single file. I have attached a sample .gz file you can test with.

I need this completed as soon as possible, so how soon you can begin and complete work will factor into my bid selection. Thanks, and feel free to ask questions!
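The matching rule described above can be sketched in Python roughly as follows. This is only an illustration, not the solution that was delivered: the function names are hypothetical, and it assumes that each decompressed payload is either a single record shaped like the sample or a list of such records, and that one line is written per record containing at least one "images" entry. The file is decompressed in memory by `gzip.open`, so nothing is written back to the source.

```python
import gzip
import json


def record_has_images(record):
    """Return True if any entry under record["data"] carries the "images" tag."""
    entries = record.get("data", {})
    return any("images" in entry.get("tags", []) for entry in entries.values())


def format_match(record):
    """Build the "GL;HL;ID;keyword" line, upper-casing gl and hl as in the example."""
    return ";".join([
        str(record["gl"]).upper(),
        str(record["hl"]).upper(),
        str(record["custom_id"]),
        str(record["keyword"]),
    ])


def scan_gz(path):
    """Decompress one .gz file in memory and return its result lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)
    # Assumption: the payload is either one record or a list of records.
    records = payload if isinstance(payload, list) else [payload]
    return [format_match(r) for r in records if record_has_images(r)]
```

Calling `scan_gz` on a file shaped like the sample would return the lines to append to the "results" file. If the real files nest the records differently, only the line that builds `records` would need to change.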
Project ID: 10742030

About the project

16 proposals
Remote project
Active 8 years ago

Awarded to:
I think a mixture of Python (to parse the JSON and build results) and a shell script (to decompress) will work best for the files. Only one sample file was attached. I will run the test on, say, 10 files to determine the average time consumed per file and then, if required, I will parallelize the scripts to decompress and parse multiple files simultaneously. I have 6+ years of experience in Python and Linux tools. I work as a full-time employee at Google. To know more about me, visit: [login to view URL] (www [dot] ashishkedia [dot] me)
$83 USD in 1 day
4.9 (10 reviews)
3.9
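
The parallel decompress-and-parse step mentioned in the awarded proposal might look roughly like the sketch below. It is not the freelancer's actual script: it uses Python's `concurrent.futures` instead of a shell script, the names `scan_one` and `scan_all` are hypothetical, and it assumes each payload is one record (or a list of records) shaped like the sample in the project description.

```python
import gzip
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def scan_one(path):
    """Decompress one .gz file in memory and return its "GL;HL;ID;keyword" lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)
    # Assumption: the payload is either one record or a list of records.
    records = payload if isinstance(payload, list) else [payload]
    lines = []
    for rec in records:
        entries = rec.get("data", {}).values()
        if any("images" in entry.get("tags", []) for entry in entries):
            lines.append(";".join([
                str(rec["gl"]).upper(),
                str(rec["hl"]).upper(),
                str(rec["custom_id"]),
                str(rec["keyword"]),
            ]))
    return lines


def scan_all(source_dir, results_path, workers=None,
             executor_cls=ProcessPoolExecutor):
    """Scan every .gz file under source_dir and write all matches to one file.

    Source files are only read; the single results file is written by the
    parent process, so workers never compete for it.
    """
    paths = sorted(str(p) for p in Path(source_dir).rglob("*.gz"))
    written = 0
    with executor_cls(max_workers=workers) as pool, \
            open(results_path, "w", encoding="utf-8") as out:
        # map() yields results in input order, so the output is deterministic.
        for lines in pool.map(scan_one, paths, chunksize=64):
            for line in lines:
                out.write(line + "\n")
                written += 1
    return written
```

A call such as `scan_all("archives", "results.txt")` (both paths are placeholders) would process the files across all CPU cores. On platforms that start worker processes by spawning, that call belongs under an `if __name__ == "__main__":` guard. Whether this beats a single process depends on disk speed, so timing it on a small batch first, as the proposal suggests, seems sensible.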
16 freelancers are bidding an average of $240 USD for this job

I can do this, no problem.
$199 USD in 3 days
5.0 (157 reviews)
8.4

Web scraping expert. I use Python. My scripts work on Windows, Mac, or Linux, but Linux is preferable. I can schedule scripts on a server if required. I have more than 100 finished projects (Google scraping, Facebook scraping, Yellow Pages, LinkedIn, Amazon, webshops, and other sites with lists of any items). I can scrape secured and protected sites; my crawlers can log in through forms, emulate AJAX requests, etc. If a site blocks my IP, I can use a proxy or Tor. I can try to get past CAPTCHAs on a site in automatic or manual mode. I can export data to JSON, XML, CSV (Excel), or any database (MySQL, MongoDB, MSSQL, etc.). I can develop a web interface for managing running scripts (start, stop, etc.) using PHP, HTML, and JS.
$200 USD in 3 days
4.8 (103 reviews)
6.2

Hello, Thank you very much for this Web Scraping Project. I read through the job details extremely carefully and understand your requirements, so I am absolutely sure that I can do the project very well. I can complete this Web Scraping project on time and within your budget. I have worked on similar Web Scraping projects, and I am confident I can exceed your expectations. Please click on Chat and reply to me to see demo work or discuss more details. Regards, Feroz Ahmed. See my feedback: www.freelancer.com/u/ferozstk.html
$100 USD in 3 days
4.8 (47 reviews)
5.5

Dear Sir/Ma'am, I am a Web Research, Data Entry, and Web Scraping expert. I checked and understood your requirements. I can handle this job very well, to your satisfaction. I can find and extract the information from different websites into an Excel sheet. I am ready to hear about the project in more detail now. I have always built long-term collaborations with my clients through hard work and quality output for a reasonable price. If you have questions or doubts about anything, please feel free to ask me. Sincerely, Mir
$250 USD in 5 days
4.9 (27 reviews)
5.1

A proposal has not yet been provided
$222 USD in 3 days
4.9 (15 reviews)
4.3

Hi. I can write such a script in Python. I hope each JSON file is small enough to fit into memory for JSON decoding. I have already built a prototype and have a result for your example file.
$111 USD in 0 days
5.0 (14 reviews)
4.3

Hi, I am an experienced Python programmer and can offer you my solution, starting work on it today. There is no way to know the content of the files without unzipping them (whoever says otherwise is fooling you), but we only need to unzip the ones the script is working on at a given moment; if space complexity is your concern, it shouldn't be. Looking forward to hearing from you to discuss full solution details. If necessary, I can create a proof of concept with a timing report chart before the bid is accepted. Yours sincerely, Ivan
$200 USD in 2 days
5.0 (14 reviews)
3.8

I have 7+ years of work experience in data collection, bulk email campaigns, Excel VBA, and Internet research at IT companies here. I can create crawlers and scrape data from sites using C++, Python, and Perl as per your requirements, delivering results in Excel, with multiple IP rotation. I have dealt successfully with presidents, directors, and managers of US, UK, and Australian companies on web design and development projects, and I have good communication and writing skills. I am well versed in the Internet, MS Office applications, and phone etiquette, along with the latest technologies. I can accept your payment terms.
$155 USD in 2 days
3.9 (6 reviews)
4.5

Dear sir or madam, I have more than 5 years of experience in PHP programming. I know how to process gz files, how to analyze JSON data, etc. I can handle this project in a few hours. Kind regards, Alen
$200 USD in 1 day
5.0 (3 reviews)
2.3

I have been doing exactly this kind of work throughout my professional career. The bid is low because I am just getting started on Freelancer, but the work will be of very high quality.
$98 USD in 3 days
0.0 (0 reviews)
0.0

About this client

Andalusia, United States
5.0
172
Payment method verified
Member since Jul 9, 2012

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)