Develop algorithm to remove repeating text

Closed

Most websites repeat some text on every page: for example the header and footer, and perhaps a sidebar.

Usually the important text on the page is the unique text on the page.

For example, if you look at these two websites, you can see there is duplicate text on both pages (mostly at the top and bottom of the pages) which is not important:

[url removed, login to view]

[url removed, login to view]

The important text is mostly the unique job description text.

I need you to develop an algorithm in Python (you can use a library; it doesn't need to be original code) which is able to detect duplicate text. So if you could imagine we merged the HTML from the two links above into a single document, your code would remove the header and footer (and perhaps some other text) due to it being duplicate text in the document.
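One way the request above could be approached is sketched below, using only the Python standard library. It is a minimal illustration, not the client's or any bidder's solution: it assumes the pages are already downloaded as HTML strings, treats each text node as one "block", and drops any block that appears on more than one page. The names `TextBlockParser`, `text_blocks` and `unique_text`, and the sample pages, are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of visible text blocks found in one HTML string."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_text(pages):
    """For each page, keep only the blocks that occur on no other page."""
    per_page = [text_blocks(html) for html in pages]
    # Count each block once per page, so repeats inside one page do not count.
    page_counts = Counter(b for blocks in per_page for b in set(blocks))
    return [[b for b in blocks if page_counts[b] == 1] for blocks in per_page]


# Hypothetical sample pages sharing a header and a footer.
page_a = ("<header>ACME Jobs</header><script>var x = 1;</script>"
          "<p>Python developer wanted.</p><footer>Contact us</footer>")
page_b = ("<header>ACME Jobs</header>"
          "<p>Java developer wanted.</p><footer>Contact us</footer>")

result = unique_text([page_a, page_b])
print(result)
```

Exact matching like this only removes text that is identical across pages; boilerplate that varies slightly (dates, counters) would need fuzzy matching or a frequency threshold over many more pages.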

Any questions, just ask.

I am not interested in a Wordpress website. Thanks.

Skills: Python, Web Scraping

Project ID: #12155198

24 freelancers are bidding an average of €143 for this job.

flashsaiful

Hi, I can do this for you. Please send a message in the PMB for details. Best regards, flashsaiful

€155 EUR in 3 days
(112 reviews)
6.4
DanielVizcaya91

Hello there, my name is Daniel and I would love to help you out with this project. I have a lot of experience parsing texts in order to obtain useful information, so I think that can be applied here to identify duplicate…

€198 EUR in 5 days
(58 reviews)
6.4
lkhelladi

Hello, I'd be glad to implement the desired Python tool for you. Looking forward to chatting with you soon for more details. Best regards,

€94 EUR in 2 days
(39 reviews)
5.4
€155 EUR in 3 days
(29 reviews)
5.0
cracken

Hi, I am well suited to this kind of task and can take good care of this project. In fact, I have already done work related to this job before. We can use regex and difflib to compare the two sets of data. Let me know the best of you…

€249 EUR in 5 days
(12 reviews)
4.4
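The difflib idea mentioned in the bid above could look roughly like the sketch below. This is an illustration only, not the bidder's code: it assumes each page has already been reduced to a list of text lines, and the function name `drop_common_lines` and the sample lines are hypothetical.

```python
from difflib import SequenceMatcher


def drop_common_lines(lines_a, lines_b):
    """Remove the lines that difflib aligns as identical in both pages."""
    matcher = SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    common_a, common_b = set(), set()
    # Each match covers `size` consecutive equal lines starting at a and b.
    for match in matcher.get_matching_blocks():
        common_a.update(range(match.a, match.a + match.size))
        common_b.update(range(match.b, match.b + match.size))
    unique_a = [line for i, line in enumerate(lines_a) if i not in common_a]
    unique_b = [line for i, line in enumerate(lines_b) if i not in common_b]
    return unique_a, unique_b


# Hypothetical pages, already split into text lines.
lines_a = ["ACME Jobs", "Home | About", "Python developer wanted.", "Contact us"]
lines_b = ["ACME Jobs", "Home | About", "Java developer wanted.", "Contact us"]

unique_a, unique_b = drop_common_lines(lines_a, lines_b)
print(unique_a)
print(unique_b)
```

Because SequenceMatcher aligns lines in order, shared text that appears in a different position on each page may not be matched; a set-based comparison would catch that case at the cost of ignoring order.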
adilhussain0411

Hello! My name is Mehnaz Bashir. I am writing in response to your project. After carefully reviewing the experience requirements and skills required for the job, I feel that I am a suitable match for the job. I have…

€30 EUR in 3 days
(9 reviews)
4.2
some235one

Hi, I can do this using Python. I have done something similar with Wikipedia. The exact solution will depend on how many pages you need.

€277 EUR in 3 days
(9 reviews)
4.0
Gnus

Hey, I can write such code by scraping the links, structuring the text into tokens and then comparing them. But are you interested in HTML DOM browsing? In your example link, that means scraping everything in this tag: <articl…

€70 EUR in 3 days
(4 reviews)
3.6
MacJeremy

To whom it may concern, if I understood you correctly, I take both pages, compare them, and everything that is the same would be deleted, and the rest would be merged into one page? I am at your disposal for further ques…

€250 EUR in 10 days
(2 reviews)
3.4
€155 EUR in 3 days
(1 review)
2.9
€155 EUR in 3 days
(2 reviews)
3.0
€88 EUR in 3 days
(4 reviews)
2.5
Orpiv

Hello, I would like to introduce our company, Orpiv Tech. We have done projects like yours before; we can show you our past work, or you can check out our portfolio. We can develop an algorithm in Py…

€40 EUR in 3 days
(1 review)
2.9
drishinfotech

Hello, thank you for the posting. I checked the sites and would like to collaborate with you on this task. Regarding the texts, we can use Python libraries like BeautifulSoup or urllib and for the desired…

€111 EUR in 3 days
(1 review)
0.8
phourxx

Greetings, you're looking for a Python programmer to develop a web scraping tool to scrape the details of a job from the website mentioned in the project details. Talking about a perfect match, I am a core Python pro…

€90 EUR in 2 days
(1 review)
0.6
dichotamous

A proposal has not yet been provided

€155 EUR in 4 days
(0 reviews)
0.0
jsbot

Can we go with Selenium in Java? (You'll get a better robot with Selenium if the language is not a concern.) We've scraped many websites with Selenium, including e-commerce giants Amazon and Flipkart. For any query on Automati…

€222 EUR in 53 days
(0 reviews)
0.0
yuvalkainan

A proposal has not yet been provided

€133 EUR in 3 days
(0 reviews)
0.0
DevoirTechsoft

Hello, we have studied the requirements and found that they match our skills. We have an enthusiastic team with years of experience in HTML, CSS, UI design, PHP+MySQL, JavaScript, jQuery, AJAX, e-commerce…

€155 EUR in 3 days
(0 reviews)
0.0
ngemzinou

I think I understand what you want. I'm still not sure what output format you are looking for. Do you want HTML output or just text files? An algorithm for this task may not be perfect if the pages' layout/tags a…

€222 EUR in 3 days
(0 reviews)
0.0