Develop algorithm to remove repeating text

This project received 25 bids from talented freelancers, with an average bid of €144 EUR.

Project Budget
€30 - €250 EUR
Project Description

Most websites repeat some text on every page: for example the header, the footer, and perhaps a sidebar.

Usually the important text on the page is the unique text on the page.

For example, if you look at these two web pages, you can see duplicate text on both (mostly at the top and bottom of each page), which is not important:

[url removed, login to view]

[url removed, login to view]

The important text is mostly the unique job description text.

I need you to develop an algorithm in Python (you can use a library; it doesn't need to be original code) that can detect duplicate text. If you imagine that we merged the HTML from the two links above into a single document, your code would remove the header and footer (and perhaps some other text) because they would appear as duplicate text in that document.
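To make the requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that "duplicate text" means a text chunk between HTML tags that occurs more than once in the merged document, and it uses only the Python standard library. The names `TextBlockParser` and `remove_duplicate_blocks`, and the sample pages, are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace so that formatting differences do not hide duplicates.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def remove_duplicate_blocks(merged_html):
    """Return the text chunks that occur exactly once in the merged HTML.

    Assumption: a chunk that occurs more than once (header, footer, sidebar)
    is boilerplate, and every copy of it is dropped.
    """
    parser = TextBlockParser()
    parser.feed(merged_html)
    parser.close()
    counts = Counter(parser.blocks)
    return [block for block in parser.blocks if counts[block] == 1]


# Hypothetical sample pages standing in for the two links above.
page_a = "<div>Acme Jobs</div><p>Python developer needed.</p><div>Copyright Acme</div>"
page_b = "<div>Acme Jobs</div><p>Logo designer needed.</p><div>Copyright Acme</div>"

unique_text = remove_duplicate_blocks(page_a + page_b)
print(unique_text)  # ['Python developer needed.', 'Logo designer needed.']
```

Exact matching on whole chunks is the simplest choice. Real pages often have near-duplicates (a date or user name inside the header, for instance), so a fuzzy comparison such as `difflib.SequenceMatcher` from the standard library might be needed on top of this.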

Any questions, just ask.

I am not interested in a WordPress website. Thanks.

