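This is a way to extract the text displayed on a web page while stripping out the HTML tags.
It relies mainly on Python's re library. Straight to the code: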
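from urllib import request
import re


def get_html(url):
    # Download the page and decode the bytes as UTF-8.
    wp = request.urlopen(url)
    content = wp.read().decode(encoding='utf-8')
    print(content)
    # Capture the inner HTML of every <p> element.
    # [^>]* also matches a bare <p>, and re.S lets "." span line breaks.
    first = re.findall(r"<p\b[^>]*>(.*?)</p>", content, re.S)
    print(first)
    # Open the output file once, not once per character, and close it reliably.
    with open('test.htm', 'a', encoding='utf-8') as k:
        for x in first:
            # Remove any tags nested inside the paragraph.
            text = re.sub(r'<[^>]+>', "", x)
            k.write(text + "\n")
            print(text)


url = "https://mp.weixin.qq.com/s/JHoOoOH-3795hb3Y9LV7Gg"
get_html(url)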
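To try the tag-stripping step without a network request, here is a minimal sketch that runs the same two-regex idea on an inline string. The names `SAMPLE`, `P_BLOCK`, `TAG`, and `extract_paragraphs` are made up for illustration and are not part of the script above. The `html.unescape` call is an extra step, added on the assumption that you also want entities such as `&amp;` decoded.

```python
import html
import re

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE = '<div><p class="a">Hello <b>world</b></p>\n<p>Tom &amp; Jerry</p></div>'

# Inner HTML of each <p> element; re.S lets "." match newlines.
P_BLOCK = re.compile(r"<p\b[^>]*>(.*?)</p>", re.S | re.I)
# Any single tag, e.g. <b> or </span>.
TAG = re.compile(r"<[^>]+>")


def extract_paragraphs(page):
    """Return the visible text of each <p> element found in page."""
    paragraphs = []
    for inner in P_BLOCK.findall(page):
        text = TAG.sub("", inner)    # drop tags nested inside the paragraph
        text = html.unescape(text)   # decode entities (an added assumption)
        paragraphs.append(text.strip())
    return paragraphs


print(extract_paragraphs(SAMPLE))
```

Regular expressions are fine for simple, regular markup like this. For nested or malformed HTML, a real parser such as the standard library's `html.parser` is likely to be more robust.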
Some sample results: