
A small multithreaded crawler: testing showed that three entries were each repeated more than 40 times

    cc7756789 · 2015-05-18 15:58:24 +08:00 · 3572 views
    # Python 2 script: crawl the forum listing pages in parallel threads,
    # append every thread link to a text file, then report duplicated lines.
    import requests
    from bs4 import BeautifulSoup
    import threading

    # Build the list of listing pages to fetch.
    url_num = 0
    url_list = ['http://ubuntuforums.org/forumdisplay.php?f=333',]
    for x in range(1, 50):
        url_num += 1
        raw_url = 'http://ubuntuforums.org/forumdisplay.php?f=333&page=%d' % url_num
        url_list.append(raw_url)

    class MyThread(threading.Thread):
        def __init__(self, func, args, name=""):
            threading.Thread.__init__(self)
            self.func = func
            self.args = args
            self.name = name

        def run(self):
            apply(self.func, self.args)

    def running(url):
        # lock.acquire()
        html = requests.get(url)
        if html.status_code == 200:
            html_text = html.text
            soup = BeautifulSoup(html_text)
            # Append "URL title" for every thread link found on this page.
            with open('/home/zhg/Pictures/cao.txt', 'a+') as f:
                for link in soup.find_all('a', 'title'):
                    s = 'http://ubuntuforums.org/' + str(link.get('href')) + ' ' + str(link.get_text().encode('utf-8'))
                    f.writelines(s)
                    f.writelines('\n')
        # lock.release()

    if __name__ == '__main__':
        # One thread per listing page.
        thread_list = [ MyThread(running, (url, ), running.__name__) for url in url_list ]
        for t in thread_list:
            t.setDaemon(True)
            t.start()
        for i in thread_list:
            i.join()
        print "process ended"
        # Report every line that appears more than once in the output file.
        with open('/home/zhg/Pictures/cao.txt', 'r') as f:
            f_list = f.readlines()
        set_list = set(f_list)
        for x in set_list:
            if f_list.count(x) > 1:
                print "the <%s> has found <%d>" % (x, f_list.count(x))




    Result:
    process ended
    the <http://ubuntuforums.org/showthread.php?t=2229766&s=bb6cd917fa0c28a6e9cb02be35fa4379 Forums Staff recommendations on WUBI
    > has found <49>
    the <http://ubuntuforums.org/showthread.php?t=1946145&s=bb6cd917fa0c28a6e9cb02be35fa4379 I upgraded, and now I have this error...
    > has found <49>
    the <http://ubuntuforums.org/showthread.php?t=1743535&s=bb6cd917fa0c28a6e9cb02be35fa4379 Graphics Resolution- Upgrade /Blank Screen after reboot
    > has found <49>
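The duplicate report at the end of the script calls `f_list.count(x)` for every distinct line, so the whole list is rescanned each time. A minimal sketch of the same report using `collections.Counter`, written for Python 3 (the script above is Python 2). The function name `find_duplicates` and the sample lines are made up for illustration. Stripping the trailing newline also keeps the closing `>` on the same line as the URL.

```python
from collections import Counter


def find_duplicates(lines):
    """Return {line: count} for every line that occurs more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if n > 1}


# Hypothetical sample standing in for the contents of the output file.
sample = [
    "showthread.php?t=1 Sticky thread\n",
    "showthread.php?t=2 Ordinary thread\n",
    "showthread.php?t=1 Sticky thread\n",
]

for line, n in find_duplicates(sample).items():
    print("the <%s> was found <%d> times" % (line, n))
```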
    8 replies · last reply 2015-05-19 10:02:43 +08:00
    #1 · cc7756789 (OP) · 2015-05-18 15:59:13 +08:00
    I wanted to see whether I could get syntax highlighting to work. Uh...

    #2 · fy · 2015-05-18 16:08:38 +08:00
    The OP is obviously not a VIP user. Watch me:

    ```python

    def foo():
        pass
    ```
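    #3 · Earthman · 2015-05-18 16:43:47 +08:00
    @fy You're just as clueless. V2EX replies don't support Markdown, only new topics do. If a topic's formatting is broken, you can ask an admin to edit it, or repost it (not recommended).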
    #4 · cc7756789 (OP) · 2015-05-18 17:09:49 +08:00
    Found the problem: the site being crawled has pinned (sticky) threads, so the pinned entries were recorded again and again, n times.
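One way to keep pinned threads from being recorded again for every listing page is to remember which thread ids have already been written. A rough sketch in Python 3, assuming the `t` query parameter identifies a thread (as the URLs in the output suggest) and that the `s` parameter is only a session token. `thread_id`, `SeenThreads`, and the sample hrefs are hypothetical and not part of the original script.

```python
import threading
from urllib.parse import parse_qs, urlparse


def thread_id(href):
    """Extract the value of the `t` query parameter, or None if absent."""
    values = parse_qs(urlparse(href).query).get("t")
    return values[0] if values else None


class SeenThreads:
    """Thread-safe set of thread ids that have already been recorded."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def add_if_new(self, href):
        """Return True the first time a thread id is seen, False afterwards."""
        tid = thread_id(href)
        if tid is None:
            return False
        with self._lock:
            if tid in self._seen:
                return False
            self._seen.add(tid)
            return True


seen = SeenThreads()
hrefs = [
    "showthread.php?t=100&s=aaa",  # sticky thread as listed on page 1
    "showthread.php?t=200&s=aaa",
    "showthread.php?t=100&s=bbb",  # same sticky thread listed again on page 2
]
unique = [h for h in hrefs if seen.add_if_new(h)]
print(unique)
```

In the crawler, each worker would call `seen.add_if_new(href)` before writing a line, so a sticky thread is written only by whichever page reaches it first.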
    #5 · chairuosen · 2015-05-18 17:15:31 +08:00
    There's a thing called gist.
    #6 · RihcardLu · 2015-05-18 18:14:24 +08:00
    @fy @Earthman That cracked me up.
    #7 · kingname · 2015-05-18 18:28:29 +08:00
    Use Scrapy. It comes with deduplication built in.
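    #8 · withrock · 2015-05-19 10:02:43 +08:00
    Crawl with multiple threads or processes, push everything onto a single queue, and have one process take the scraped data off the queue and write it to storage.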
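The single-writer idea could look roughly like the following. This is only a sketch in Python 3 that uses threads and `queue.Queue` rather than separate processes. `fetch` is a hypothetical stand-in for downloading and parsing one listing page, and the `sink` list stands in for a file or database.

```python
import queue
import threading

SENTINEL = None  # tells the writer that no more items will arrive


def fetch(page):
    """Hypothetical stand-in for downloading and parsing one listing page."""
    return ["page %d, thread %d" % (page, i) for i in range(3)]


def worker(page, results):
    """Producer: put every scraped item onto the shared queue."""
    for item in fetch(page):
        results.put(item)


def writer(results, sink):
    """Single consumer: the only code that touches the output."""
    while True:
        item = results.get()
        if item is SENTINEL:
            break
        sink.append(item)  # in a real crawler: write to a file or database


results = queue.Queue()
sink = []

writer_thread = threading.Thread(target=writer, args=(results, sink))
writer_thread.start()

workers = [threading.Thread(target=worker, args=(p, results)) for p in range(1, 5)]
for t in workers:
    t.start()
for t in workers:
    t.join()

results.put(SENTINEL)  # all producers are done, so stop the writer
writer_thread.join()
print(len(sink))
```

Because only the writer touches the output, the workers need no lock around file access, and any deduplication can live in that one place.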