
Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script"]): 
        script.extract()
    text = soup.get_text()
    return text
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print (cleaned)

This is working to remove the script tags, but I also need to strip out the CSS styling code and the remaining HTML markup from the output.

  • What is your expected output? – Commented Jun 1, 2015 at 3:57

6 Answers


It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
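As a quick sanity check, you can run the updated function on the testhtml string from the question. Note that get_text() concatenates adjacent strings without a separator, so the body text and the <h1> text run together; the printed result should look something like the comment below:

cleaned = cleanMe(testhtml)
print(cleaned)
# THIS IS AN EXAMPLE I need this text capturedAnd this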

You can use decompose() to completely remove the tags from the document and the stripped_strings generator to retrieve the tag content.

def clean_me(html):
    soup = BeautifulSoup(html, "html.parser")  # pass a parser explicitly to avoid bs4's "no parser specified" warning
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

>>> clean_me(testhtml) 
'THIS IS AN EXAMPLE I need this text captured And this'

Removing specified tags and comments in a clean manner. Thanks to Kim Hyesung for this code.

from bs4 import BeautifulSoup
from bs4 import Comment

def cleanMe(html):
    soup = BeautifulSoup(html, "html5lib")
    # strip scripts, stylesheets, meta tags and noscript blocks
    for tag in soup(["script", "style", "meta", "noscript"]):
        tag.extract()
    # strip HTML comments as well
    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    return soup
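Note that this version returns the cleaned soup object rather than a string, so to get plain text you can call get_text() on the result, for example:

text = cleanMe(testhtml).get_text()
print(text)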

Using lxml instead:

# Requirements: pip install lxml

import lxml.html.clean


def cleanme(content):
    cleaner = lxml.html.clean.Cleaner(
        allow_tags=[''],
        remove_unknown_tags=False,
        style=True,
    )
    html = lxml.html.document_fromstring(content)
    html_clean = cleaner.clean_html(html)
    return html_clean.text_content().strip()

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print (cleaned)

If you want a quick and dirty solution, you can use:

re.sub(r'<[^>]*?>', '', value)

This is roughly the equivalent of strip_tags in PHP. Is that what you want?
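For completeness, a minimal self-contained sketch of that approach (the strip_tags helper name is just for illustration). Keep in mind that a regex like this only removes the tags themselves, so the contents of <script> and <style> blocks are left behind in the output:

import re

def strip_tags(value):
    # drop anything that looks like an HTML tag, keeping the text between tags
    return re.sub(r'<[^>]*?>', '', value)

print(strip_tags(testhtml))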


Another implementation in addition to Styvane's answer. If you want to extract a lot of text, check out selectolax; it is much faster than lxml.

Code and example:

from bs4 import BeautifulSoup

def clean_me(html):
    soup = BeautifulSoup(html, 'lxml')

    body = soup.body
    if body is None:
        return None

    # removing everything besides text
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    plain_text = body.get_text(separator='\n').strip()
    print(plain_text)

clean_me(testhtml)
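For comparison, here is a minimal sketch of the same cleanup using selectolax instead of BeautifulSoup (this assumes the selectolax HTMLParser API; install with pip install selectolax):

from selectolax.parser import HTMLParser

def clean_me_selectolax(html):
    tree = HTMLParser(html)
    # drop script and style nodes before extracting text
    for node in tree.css('script'):
        node.decompose()
    for node in tree.css('style'):
        node.decompose()
    if tree.body is None:
        return None
    return tree.body.text(separator='\n', strip=True)

print(clean_me_selectolax(testhtml))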
