html_downloader
from urllib import request

def download(url):
    # Fetch a page and return its raw bytes, or None on a non-200 response.
    if url is None:
        return None
    response = request.urlopen(url)
    if response.getcode() != 200:
        return None
    return response.read()
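One caveat: request.urlopen raises URLError (or a timeout, which is an OSError subclass) rather than returning a status code when the network fails, so a bare call can crash the crawl loop. Below is a minimal hardened sketch; download_safe and the 10-second timeout are my additions, not part of the original module:

from urllib import request, error

def download_safe(url, timeout=10):
    # Hypothetical variant: same contract as download(), but survives network errors.
    if url is None:
        return None
    try:
        response = request.urlopen(url, timeout=timeout)
    except (error.URLError, OSError):
        return None
    if response.getcode() != 200:
        return None
    return response.read()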
html_outputer
data_list = []

def collect_data(data):
    data_list.append(data)

def output_html():
    # Specify UTF-8 explicitly so Chinese titles are written correctly on every platform.
    fout = open('output.html', 'w', encoding='utf-8')
    fout.write('<html>')
    fout.write('<body>')
    fout.write('<table>')
    # One table row per collected record.
    for dataitem in data_list:
        fout.write('<tr>')
        fout.write('<td>%s</td>' % dataitem['url'])
        fout.write('<td>%s</td>' % dataitem['title'])
        fout.write('<td>%s</td>' % dataitem['datetime'])
        fout.write('<td>%s</td>' % dataitem['visitcount'])
        fout.write('</tr>')
    fout.write('</table>')
    fout.write('</body>')
    fout.write('</html>')
    fout.close()
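If a scraped title ever contains '<' or '&' it will corrupt the generated table. A hedged alternative using a context manager and html.escape (output_html_escaped is a name I introduce for illustration; it is not part of the original tutorial):

import html

def output_html_escaped(path='output.html'):
    # The with block closes the file even if a write raises midway.
    with open(path, 'w', encoding='utf-8') as fout:
        fout.write('<html><body><table>\n')
        for item in data_list:
            cells = (item['url'], item['title'], item['datetime'], item['visitcount'])
            # html.escape neutralises <, > and & inside scraped text.
            fout.write('<tr>' + ''.join('<td>%s</td>' % html.escape(str(c)) for c in cells) + '</tr>\n')
        fout.write('</table></body></html>\n')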
html_parser
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_new_urls(page_url, soup):
    new_urls = set()
    # The original pattern was missing backslashes ("/d+/d+/w+"), so it matched
    # the literal letters d and w instead of digits and word characters.
    links = soup.find_all('a', href=re.compile(r"/\d+/\d+/\w+/page\.htm"))
    for link in links:
        new_url = link['href']
        new_full_url = urljoin(page_url, new_url)
        new_urls.add(new_full_url)
    return new_urls

def get_new_data(page_url, soup):
    res_data = {}
    title_node = soup.find('h1', class_='arti-title')
    if title_node is None:
        return res_data
    res_data['title'] = title_node.get_text()
    datetime_node = soup.find('span', class_='arti-update')
    res_data['datetime'] = datetime_node.get_text()
    visitcount_node = soup.find('span', class_='WP_VisitCount')
    res_data['visitcount'] = visitcount_node.get_text()
    res_data['url'] = page_url
    return res_data

def parse(page_url, html_cont):
    if page_url is None or html_cont is None:
        # Return a pair so the caller's tuple unpacking never fails.
        return None, None
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    new_urls = get_new_urls(page_url, soup)
    new_data = get_new_data(page_url, soup)
    return new_urls, new_data
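A quick way to sanity-check the parser without touching the network is to feed it a hand-written page that mimics the site's markup. The sample below is invented for illustration; real article pages on news.zzuli.edu.cn may differ:

sample = b'''<html><body>
<h1 class="arti-title">Example title</h1>
<span class="arti-update">2021-06-01</span>
<span class="WP_VisitCount">123</span>
<a href="/2021/0601/c12a34567/page.htm">next article</a>
</body></html>'''

urls, data = parse('http://news.zzuli.edu.cn/index.htm', sample)
print(urls)           # {'http://news.zzuli.edu.cn/2021/0601/c12a34567/page.htm'}
print(data['title'])  # Example title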
spider_main
import urls_manager, html_downloader, html_parser, html_outputer

def craw(root_url):
    count = 1
    urls_manager.add_new_url(root_url)
    # Start the crawl loop.
    while urls_manager.has_new_url():
        new_url = urls_manager.get_new_url()
        print('craw %d : %s' % (count, new_url))
        html_cont = html_downloader.download(new_url)
        new_urls, new_data = html_parser.parse(new_url, html_cont)
        urls_manager.add_new_urls(new_urls)
        if new_data:
            html_outputer.collect_data(new_data)
        # Stop after ten pages for this demo.
        if count == 10:
            break
        count = count + 1
    html_outputer.output_html()

if __name__ == '__main__':
    root_url = 'http://news.zzuli.edu.cn/'
    craw(root_url)
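If the ten-page cap is ever raised, it is polite to pause between requests so the news server is not hammered. One way is a small throttled wrapper; polite_download and the half-second delay are my own suggestions, not part of the original:

import time

def polite_download(url, delay=0.5):
    # Hypothetical throttled wrapper around html_downloader.download:
    # sleep before each request, capping the rate at roughly 2 requests/second.
    time.sleep(delay)
    return html_downloader.download(url)

craw() could then call polite_download(new_url) in place of html_downloader.download(new_url).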
test_64
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the lacie link')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
link_node = soup.find('a', href=re.compile(r'ill'))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the p paragraph text')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())
urls_manager
new_urls = set()
old_urls = set()

def add_new_url(url):
    if url is None:
        return
    if url not in new_urls and url not in old_urls:
        new_urls.add(url)

def add_new_urls(urls):
    if urls is None or len(urls) == 0:
        return
    for url in urls:
        add_new_url(url)

def get_new_url():
    new_url = new_urls.pop()
    old_urls.add(new_url)
    return new_url

def has_new_url():
    return len(new_urls) != 0
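A quick interactive check of the de-duplication behaviour (the URL below is just an example):

add_new_url('http://news.zzuli.edu.cn/2021/0601/c12a34567/page.htm')
add_new_url('http://news.zzuli.edu.cn/2021/0601/c12a34567/page.htm')  # duplicate, ignored
print(has_new_url())   # True: one URL is waiting
print(get_new_url())   # pops the URL and moves it to old_urls
print(has_new_url())   # False: nothing left, and re-adding the same URL is a no-op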
That's all for this article. I hope it helps, and please keep following it145.com for more content!