Data crunching: 150 mailing list text files to UTF8 MySQL database

$30-50 USD

Closed
Posted more than 18 years ago

Paid on delivery
Goal: A PHP script that converts about 150 text files, containing posts from 10 years of 2 newsgroups, into a single MySQL table in UTF8 format for use with a search engine. All files to convert will be provided at the start of the project.

Details: I'm trying to create a single MySQL table from 10 years of newsgroup postings. The postings are spread over 150 text files, each containing one month of posts. There are a few issues, such as encoding and other conditions, as described in the deliverables.

## Deliverables

A PHP script that converts all .txt files in the same directory to a single MySQL table dump file with the following fields:

(1) sequential ID of post
(2) mailing list ID
(3) name of author
(4) email address of author
(5) date/time of post
(6) actual encoding of original post
(7) title of post
(8) full text of post
(9) full text of post with quoted text removed (for searching)

Issues:

(a) Because several mailing list systems were used, both the separator between posts and the format of each post's headers differ from file to file. There are maybe 5 such formats in total. As an example, some of the files needing conversion are here: [login to view URL]

(b) The posts are mostly in the SJIS encoding. However, several are in EUC or ISO-2022-JP. The _actual_ encoding of each post needs to be checked, and the post needs to be converted to UTF8 before being stored in the database. This may be the trickiest part of the project, so make sure that you are comfortable with multi-byte Japanese encodings. For example, if you open the files found at the above website in a Web browser, some will only render properly when SJIS is selected as the encoding, and others only when ISO-2022-JP is selected. The actual encoding of each post needs to be determined and stored as field (6).

(c) All email addresses need to be obscured. For example, "someguy[at][login to view URL]" would need to be changed to "someguy[at]g...". This applies to the email address field (4) as well as to both full text fields (8) and (9). Note that [at] has been used here in place of the at sign, due to the RAC site restrictions.

(d) The dates and times of all posts need to be unified to the format used by MySQL, for sorting. This is stored in field (5).

(e) The full text field without quoted portions (9) is the same as the original text, but with all lines beginning with ">" removed, and all lines following a line with "----- Original Message -----" removed. You will need to be creative to find a good way to remove these portions, but 95% is acceptable.

## Platform

PHP 5
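Issues (b) through (e) can be sketched roughly as follows. This is a minimal illustration in Python, not the PHP 5 deliverable the project asks for; in PHP the mbstring functions `mb_detect_encoding` and `mb_convert_encoding` would play a similar role. All function names here are hypothetical. The sketch assumes that "EUC" means EUC-JP, that date headers follow the RFC 2822 style (other formats would need their own parsers), that the "----- Original Message -----" line itself is dropped along with what follows it, and that an obscured address keeps only the first character of the domain.

```python
import re
from email.utils import parsedate_to_datetime

QUOTE_MARKER = "----- Original Message -----"
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9])[A-Za-z0-9.-]*")
# Labels stored in field (6), mapped to Python codec names.
CODECS = {"ASCII": "ascii", "ISO-2022-JP": "iso2022_jp",
          "EUC-JP": "euc_jp", "SJIS": "cp932"}


def detect_encoding(raw: bytes) -> str:
    """Guess the encoding of one post by trial decoding (heuristic)."""
    seven_bit = all(b < 0x80 for b in raw)
    if seven_bit and b"\x1b" not in raw:
        return "ASCII"
    # ISO-2022-JP is 7-bit with escape sequences; the others use high bytes.
    # EUC-JP is tried before SJIS because typical SJIS text fails a strict
    # EUC-JP decode, while the reverse is less reliable.
    candidates = ["ISO-2022-JP"] if seven_bit else ["EUC-JP", "SJIS"]
    for label in candidates:
        try:
            raw.decode(CODECS[label])
            return label
        except UnicodeDecodeError:
            continue
    return "UNKNOWN"


def to_text(raw: bytes):
    """Return (encoding label, decoded text); encode as UTF-8 when writing."""
    label = detect_encoding(raw)
    # Assumption: fall back to SJIS, the most common encoding in the archive.
    return label, raw.decode(CODECS.get(label, "cp932"), errors="replace")


def obscure_emails(text: str) -> str:
    """Replace user@domain.tld with user@d... everywhere in the text."""
    return EMAIL_RE.sub(r"\1@\2...", text)


def to_mysql_datetime(header_value: str) -> str:
    """Convert an RFC 2822 style date to 'YYYY-MM-DD HH:MM:SS'.

    The wall-clock time of the post is kept; the UTC offset is discarded.
    """
    return parsedate_to_datetime(header_value).strftime("%Y-%m-%d %H:%M:%S")


def strip_quoted(text: str) -> str:
    """Drop '>' quoted lines and everything from the quote marker onward."""
    kept = []
    for line in text.splitlines():
        if QUOTE_MARKER in line:
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept)
```

Trial decoding is only a heuristic: short posts, or SJIS posts made up mostly of half-width katakana, can decode cleanly under more than one encoding, so a sample of the results should be checked by eye.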
Project ID: 3808531

About the project

6 proposals
Remote project
Active 19 years ago

Awarded to:
See private message.
$21.24 USD in 10 days
3.6 (6 reviews)
1.4
6 freelancers are bidding an average of $29 USD for this job
See private message.
$42.50 USD in 10 days
4.6 (84 reviews)
5.6
See private message.
$23.80 USD in 10 days
4.9 (46 reviews)
5.4
See private message.
$38.25 USD in 10 days
4.9 (12 reviews)
2.7
See private message.
$42.50 USD in 10 days
5.0 (1 review)
0.8
See private message.
$8.50 USD in 10 days
0.0 (0 reviews)
0.0

About the client

United States
5.0
1
Member since Jul 16, 2005
