Develop an algorithm to remove repeating text

Almost every website repeats some text on each page: for example the header and footer, and perhaps a sidebar.

Usually the important text on a page is the text that is unique to that page.

For example, if you look at these two websites, you can see there is duplicate text on both pages (mostly at the top and bottom of the pages) which is not important:

[url removed, login to view]

[url removed, login to view]

The important text is mostly the unique job description text.

I need you to develop an algorithm in Python (you can use a library; it doesn't need to be original code) that detects duplicate text. If you imagine the HTML from the two links above merged into a single document, your code would remove the header and footer (and perhaps some other text) because that text appears more than once in the document.
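
One possible reading of this requirement, sketched with only the standard library: split each page into visible text blocks, count on how many pages each block appears, and drop the blocks that appear on too many pages. The names `TextBlockParser`, `extract_blocks`, `remove_repeated_text` and the `max_share` threshold are assumptions made for illustration; they are not part of the posting.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace so identical text compares equal.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def extract_blocks(html):
    """Return the list of visible text blocks in one HTML page."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def remove_repeated_text(pages, max_share=0.5):
    """For each page, keep only blocks found on at most max_share of all pages.

    max_share is a hypothetical tuning knob: with two pages and the default
    of 0.5, any block present on both pages is treated as repeating text.
    """
    per_page = [extract_blocks(html) for html in pages]
    # Count each block once per page, however often it repeats within a page.
    page_counts = Counter(b for blocks in per_page for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if page_counts[b] <= limit] for blocks in per_page]
```

Exact matching only catches blocks that are word-for-word identical; a header that embeds a date or a user name would need fuzzy matching instead.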

Any questions, just ask.

I am not interested in a WordPress website. Thanks.

Skills: Python, Web Scraping


Employer information:
(0 reviews) Netherlands

Project ID: #12155198

23 freelancers are bidding an average of €148 for this job


Hello, I'd be glad to implement the desired Python tool for you. Looking forward to chatting with you soon for more details. Best regards,

€94 EUR in 2 days
(117 reviews)

Hi, I can do this for you. Please send a message in the PMB for details. Best regards, flashsaiful

€155 EUR in 3 days
(118 reviews)

Hello there, my name is Daniel and I would love to help you out with this project. I have a lot of experience parsing texts in order to obtain useful information, so I think that can be applied here to identify duplicate ...

€198 EUR in 5 days
(81 reviews)
€88 EUR in 3 days
(34 reviews)
€155 EUR in 3 days
(41 reviews)

Hello! My name is Mehnaz Bashir. I am writing in response to your project. After carefully reviewing the experience requirements and skills required for the job, I feel that I am a suitable match for the job. I have ...

€30 EUR in 3 days
(18 reviews)

Hi, I am well suited to this kind of task and can take good care of this project. In fact, I have already done work related to this job before. We can use regex and difflib to compare the two pages. Let me know the best of you ...

€249 EUR in 5 days
(23 reviews)
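
The difflib idea mentioned in the bid above could look roughly like the sketch below. It assumes the page text has already been extracted to plain text with one block per line, and it uses only `difflib.SequenceMatcher` (no regex). The function name `unique_lines` is hypothetical and is not taken from the bid.

```python
import difflib


def unique_lines(page_a, page_b):
    """Return the lines of each page that are not shared with the other page."""
    a = [ln.strip() for ln in page_a.splitlines() if ln.strip()]
    b = [ln.strip() for ln in page_b.splitlines() if ln.strip()]
    # autojunk=False keeps very frequent lines from being ignored on long pages.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    only_a, only_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "replace", "delete" and "insert" spans hold page-specific text.
            only_a.extend(a[i1:i2])
            only_b.extend(b[j1:j2])
    return only_a, only_b
```

Because `SequenceMatcher` aligns the two sequences in order, this works best when the shared header and footer appear in the same order on both pages.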

Hi, I can do this using Python. I have done something similar to Wikipedia. The exact solution will depend on how many pages you need

€277 EUR in 3 days
(10 reviews)

Hello, I am very interested in your project. I am an expert in C#, PHP, ASP.NET, ASP, jQuery, JavaScript, Python, etc. I have much experience with frameworks such as WordPress, Yii, Magento, Bootstrap and CakePHP. I ...

€222 EUR in 3 days
(7 reviews)

Hello, thank you for the posting. I checked the sites and would like to collaborate with you on this task. Regarding texts, we can use native Python libraries like BeautifulSoup or urllib and for the desired ...

€111 EUR in 3 days
(5 reviews)

To whom it may concern, if I understood you correctly, I take both pages, compare them, and everything that is the same would be deleted, and the rest would be merged into one page? I am at your disposal for further ques ...

€250 EUR in 10 days
(2 reviews)

Hey, I can write such code by scraping the links, splitting the text into tokens and then comparing them. But are you interested in HTML DOM browsing? In your example link that means scraping everything in this tag: <articl ...

€70 EUR in 3 days
(4 reviews)

Greetings, you're looking for a Python programmer to develop a web scraping tool to scrape the details of a job from the website mentioned in the project details. Talking about a perfect match, I am a core Python pro ...

€90 EUR in 2 days
(1 review)
€222 EUR in 3 days
(0 reviews)

Hi, I have a lot of experience building web scrapers and have done plenty of work in Python. I can build the required algorithm (I have built a duplicate text detector in the past). Also, if you provide me with the con ...

€88 EUR in 2 days
(0 reviews)

I have been learning and using Python for the past two years. I have completed the Python Specialization on Coursera by the University of Michigan. I've used multiple Python libraries for creating a variety of programs. I'm ...

€55 EUR in 5 days
(0 reviews)

Hello, I have multiple solutions for your task. At first, we could extract only the unique text, which is a good solution for checking lots of websites with one tool. But such a wide implementation will be resulting for sur ...

€110 EUR in 14 days
(0 reviews)

I am a Python programmer who has written lots of projects. Python is the best language for your targeted project. I will be pleased if you accept. Thank you.

€44 EUR in 3 days
(0 reviews)

I think I understood what you want. I'm still not sure what output format you are looking for. Do you want HTML output or just text files? An algorithm for this task may not be perfect if the pages' layout/tags a ...

€222 EUR in 3 days
(0 reviews)

A proposal has not yet been provided

€155 EUR in 4 days
(0 reviews)