A small multithreaded crawler: testing shows three entries repeated more than 40 times

2015-05-18 15:58:24 +08:00
 cc7756789
<pre class="brush:python;toolbar:false">import requests
from bs4 import BeautifulSoup
import threading

# Note: Python 2 syntax (print statement, apply).

# Build the list of pages to crawl: the board index plus pages 1-49.
url_num = 0
url_list = ['http://ubuntuforums.org/forumdisplay.php?f=333',]
for x in range(1, 50):
    url_num += 1
    raw_url = 'http://ubuntuforums.org/forumdisplay.php?f=333&amp;page=%d' % url_num
    url_list.append(raw_url)


class MyThread(threading.Thread):
    def __init__(self, func, args, name=""):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.name = name

    def run(self):
        apply(self.func, self.args)


def running(url):
    # Fetch one page and append every thread link on it to the output file.
    # lock.acquire()
    html = requests.get(url)
    if html.status_code == 200:
        html_text = html.text
        soup = BeautifulSoup(html_text)
        with open('/home/zhg/Pictures/cao.txt', 'a+') as f:
            for link in soup.find_all('a', 'title'):
                s = 'http://ubuntuforums.org/' + str(link.get('href')) + ' ' + str(link.get_text().encode('utf-8'))
                f.writelines(s)
                f.writelines('\n')
    # lock.release()


if __name__ == '__main__':
    # One thread per page; wait for all of them before checking the file.
    thread_list = [ MyThread(running, (url, ), running.__name__) for url in url_list ]
    for t in thread_list:
        t.setDaemon(True)
        t.start()
    for i in thread_list:
        i.join()
    print "process ended"
    # Report every line that occurs more than once in the output file.
    with open('/home/zhg/Pictures/cao.txt', 'r') as f:
        f_list = f.readlines()
        set_list = set(f_list)
        for x in set_list:
            if f_list.count(x) &gt; 1:
                print "the &lt;%s&gt; has found &lt;%d&gt;" % (x, f_list.count(x))</pre>




Result:
process ended
the <http://ubuntuforums.org/showthread.php?t=2229766&s=bb6cd917fa0c28a6e9cb02be35fa4379 Forums Staff recommendations on WUBI
> has found <49>
the <http://ubuntuforums.org/showthread.php?t=1946145&s=bb6cd917fa0c28a6e9cb02be35fa4379 I upgraded, and now I have this error...
> has found <49>
the <http://ubuntuforums.org/showthread.php?t=1743535&s=bb6cd917fa0c28a6e9cb02be35fa4379 Graphics Resolution- Upgrade /Blank Screen after reboot
> has found <49>
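
The duplicate check at the end of the script could also be written with `collections.Counter`. This is only a sketch in Python 3 (the original script is Python 2), and the helper name `find_duplicates` is made up for illustration. It strips the trailing newline that `readlines()` keeps on each line, which is why the `>` in the output above lands on its own line.

```python
from collections import Counter


def find_duplicates(lines):
    """Return {line: count} for every line that occurs more than once."""
    # readlines() keeps the trailing "\n", so strip it before counting.
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = ["url-a title-a\n", "url-b title-b\n", "url-a title-a\n"]
    for line, n in find_duplicates(sample).items():
        print("the <%s> has found <%d>" % (line, n))
```

Counting once with `Counter` avoids calling `list.count` for every distinct line, which matters little for 50 pages but grows quickly on larger crawls.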
3572 views
Node: Python
8 replies
cc7756789
2015-05-18 15:59:13 +08:00
I wanted to see whether I could insert syntax highlighting. Uh...

import requests
from bs4 import BeautifulSoup
import threading

url_num = 0
url_list = ['http://ubuntuforums.org/forumdisplay.php?f=333',]
for x in range(1, 50):
    url_num += 1
    raw_url = 'http://ubuntuforums.org/forumdisplay.php?f=333&page=%d' % url_num
    url_list.append(raw_url)


class MyThread(threading.Thread):
    def __init__(self, func, args, name=""):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args
        self.name = name

    def run(self):
        apply(self.func, self.args)


def running(url):
    # lock.acquire()
    html = requests.get(url)
    if html.status_code == 200:
        html_text = html.text
        soup = BeautifulSoup(html_text)
        with open('/home/zhg/Pictures/cao.txt', 'a+') as f:
            for link in soup.find_all('a', 'title'):
                s = 'http://ubuntuforums.org/' + str(link.get('href')) + ' ' + str(link.get_text().encode('utf-8'))
                f.writelines(s)
                f.writelines('\n')
    # lock.release()


if __name__ == '__main__':
    thread_list = [ MyThread(running, (url, ), running.__name__) for url in url_list ]
    for t in thread_list:
        t.setDaemon(True)
        t.start()
    for i in thread_list:
        i.join()
    print "process ended"
    with open('/home/zhg/Pictures/cao.txt', 'r') as f:
        f_list = f.readlines()
        set_list = set(f_list)
        for x in set_list:
            if f_list.count(x) > 1:
                print "the <%s> has found <%d>" % (x, f_list.count(x))
fy
2015-05-18 16:08:38 +08:00
The OP is obviously not a VIP user. Watch this:

```python
def foo():
    pass
```
Earthman
2015-05-18 16:43:47 +08:00
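@fy You're just as clueless. V2EX replies don't support Markdown; only new topics do. If a topic's formatting is broken, you can ask an admin to edit it, or repost it (not recommended).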
cc7756789
2015-05-18 17:09:49 +08:00
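Found the problem: the site I'm crawling has sticky threads, which show up on every listing page, so the sticky entries were recorded n times.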
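
One way to drop such repeats is to key each link on its thread id rather than on the whole line. The sketch below is Python 3 and assumes the `t` query parameter seen in the output above identifies a thread; the helper names `thread_id` and `dedupe_links` are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def thread_id(url):
    """Return the value of the `t` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("t")
    return values[0] if values else None


def dedupe_links(links):
    """Keep the first (url, title) pair seen for each thread, preserving order."""
    seen = set()
    unique = []
    for url, title in links:
        # Fall back to the full URL when there is no `t` parameter.
        key = thread_id(url) or url
        if key not in seen:
            seen.add(key)
            unique.append((url, title))
    return unique
```

Keying on `t` also ignores the `s=` session token, so the same thread fetched in a different session would still count as a repeat.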
chairuosen
2015-05-18 17:15:31 +08:00
There's a thing called gist.
RihcardLu
2015-05-18 18:14:24 +08:00
@fy @Earthman That cracked me up.
kingname
2015-05-18 18:28:29 +08:00
Use Scrapy; it comes with duplicate filtering built in.
withrock
2015-05-19 10:02:43 +08:00
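Crawl with multiple threads or processes, push everything onto a single queue, and run one dedicated process that takes the crawled data off the queue and writes it to storage.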
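
The queue layout described above might look roughly like this in Python 3. To keep the sketch self-contained it uses a writer thread instead of a separate process, and `fetch` is a hypothetical stand-in that returns fake rows rather than calling `requests`. Deduplicating inside the writer is an extra assumption, not part of the suggestion itself.

```python
import queue
import threading

SENTINEL = None  # tells the writer that no more rows will arrive


def fetch(page):
    """Hypothetical stand-in for downloading and parsing one page."""
    return ["sticky-thread", "thread-on-page-%d" % page]


def worker(page, out_queue):
    """Producer: fetch one page and put each row on the shared queue."""
    for row in fetch(page):
        out_queue.put(row)


def writer(in_queue, store):
    """Single consumer: the only code that touches `store`."""
    seen = set()
    while True:
        row = in_queue.get()
        if row is SENTINEL:
            break
        if row not in seen:
            seen.add(row)
            store.append(row)


def crawl(pages):
    q = queue.Queue()
    store = []
    consumer = threading.Thread(target=writer, args=(q, store))
    consumer.start()
    workers = [threading.Thread(target=worker, args=(p, q)) for p in pages]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    q.put(SENTINEL)  # all producers are done, so let the writer exit
    consumer.join()
    return store
```

Because only the writer touches the output, the crawler threads need no lock, and the sentinel is queued only after every producer has finished.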
