

A multithreaded crawler exercise in Python, written in a function-based style

Author: 鄭曉 Category: Python Published: 2014-10-29 17:35 Views: 6,764 Comments (1)

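This is a crawler I wrote for practice. It fetches every link under a target site and records the problem links (the failing URL, the page that linked to it, and the HTTP status code). The number of threads is configurable, and the program starts a child thread to maintain the current thread count. It used to work somewhat better; now the more I change it, the more bugs and problems appear.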
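The record of problem links described above could be kept in a structure like the one below. This is only a sketch, and it assumes Python 3, whereas the listing in this post is Python 2. The `record` helper and the `/missing` and `/gone` paths are made up for illustration. `defaultdict(list)` stands in for the `has_key` check used in the listing.

```python
from collections import defaultdict

# status code -> list of (url, referrer) pairs.
# A status code seen for the first time starts with an empty list.
url_err = defaultdict(list)


def record(status, url, referrer):
    """Store one crawl result under its HTTP status code (hypothetical helper)."""
    url_err[status].append((url, referrer))


# Illustrative data only: the paths below are invented.
record(200, 'http://www.xxxxxx.com/', 'none')
record(404, 'http://www.xxxxxx.com/missing', 'http://www.xxxxxx.com/')
record(404, 'http://www.xxxxxx.com/gone', 'http://www.xxxxxx.com/')

# Summarise how many URLs ended up under each status code.
for status in sorted(url_err):
    print(status, len(url_err[status]))
```

Storing the same `(url, referrer)` pair for every status code, including 200, keeps the entries uniform when they are read back later.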

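Problems found so far:
1. Thread management. I first used the traditional approach, three for loops that create a fixed number of threads, but found that when a thread raises an exception it terminates and the total thread count drops. So I switched to an infinite loop that keeps monitoring the number of active threads. While it runs, the number in the thread names keeps growing, and each thread I create seems to exit after running the function once.
2. The url_new list holds the URLs waiting to be fetched, and it still ends up with duplicates. With function-based threads, synchronization between threads does not seem to work well.
3. The lock is wrong. I still have not got the hang of using a lock: when I found duplicate URLs I wrapped the entire def in the lock, and the problems keep coming.
4. The target site is still hard-coded in the program.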
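One possible way to approach problems 2 and 3 is to keep a set of every URL ever queued and to hold the lock only while that set and the pending list are touched, not around the whole function. The sketch below assumes Python 3. The `UrlFrontier` class and the `/page…` URLs are hypothetical names for illustration and are not part of the program in this post.

```python
import threading


class UrlFrontier:
    """Pending (referrer, url) pairs plus a 'seen' set, guarded by one lock."""

    def __init__(self, start_url):
        self._lock = threading.Lock()
        self._pending = [('none', start_url)]  # same pair layout as url_new
        self._seen = {start_url}               # every URL that was ever queued

    def add(self, referrer, url):
        """Queue url unless it was queued before; return True if it was added."""
        with self._lock:                       # lock held only for the bookkeeping
            if url in self._seen:
                return False
            self._seen.add(url)
            self._pending.append((referrer, url))
            return True

    def pop(self):
        """Return the next (referrer, url) pair, or None if nothing is pending."""
        with self._lock:
            return self._pending.pop(0) if self._pending else None


frontier = UrlFrontier('http://www.xxxxxx.com/')


def worker():
    # Every thread tries to queue the same 100 invented URLs.
    for i in range(100):
        frontier.add('http://www.xxxxxx.com/', 'http://www.xxxxxx.com/page%d' % i)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the membership test and the append happen under the same lock, two threads cannot both decide that a URL is new. A set lookup is also cheaper than scanning a list with `not in`.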
Still to do: rewrite it in an object-oriented style.

#encoding: gb2312
import urllib2
import threading
import logging
import re
import sys
import os
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf-8")

# Logging setup
FILE = os.getcwd()
logging.basicConfig(filename=os.path.join(FILE, 'log.txt'),level=logging.DEBUG)

# Queue of tasks waiting to be fetched, as (referrer, url) pairs
url_new = [('none','http://www.xxxxxx.com/')]
# Finished tasks
url_old = []
# Finished tasks grouped by HTTP status code
url_err = {200:[]}
# Locks
lock = threading.Lock()
lock2= threading.Lock()


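# Main function run by each thread:
# take one URL from the task list and fetch it,
# parse the page, drop duplicates, and put the URLs found back on the task list,
# record the access status of the current URL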

def geturl():

	global url_new

	try:

		while True:

			lock.acquire()

			if len(url_new)<=0:
				lock.release()
				continue
			url_t = url_new.pop(0)
			url = url_t[1]
			try:
				req = urllib2.urlopen(url)
			except urllib2.HTTPError, e:
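				# Record it in the list for this status code
				if url_err.has_key(e.code):
					url_err[e.code].append((url,url_t[0]))
				else:
					url_err[e.code] = [(url,url_t[0])]
				with open('log.html', 'a+') as f:
					f.write(str(e.code)+': '+url+', referrer: '+url_t[0]+'\n')
				# Release the lock before the next iteration: threading.Lock is not
				# reentrant, so acquiring it again while still held would block forever
				lock.release()
				continue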

			else:
				url_err[200].append(url)
				with open('log.html', 'a+') as f:
					f.write('200: '+url+', referrer: '+url_t[0]+'\n')
			# Record it in the visited list

			url_old.append(url)

			# Start extracting URLs from the page

			soup = BeautifulSoup(req.read().decode('UTF-8', 'ignore'))

			alink= soup.find_all('a', attrs={'href':re.compile(".*?xxxxxx.*?")})

			tmp_url = []

			for a in alink:

				href = a.get('href')

				tmp_url.append(a.get('href') if a.get('href').find('http:')>=0 else 'http://www.xxxxxx.com'+a.get('href'))

			tmp_url= {}.fromkeys(tmp_url).keys()

			for link in tmp_url:

				if link not in url_old:

					url_new.append((url, link))

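			# Drop duplicate URLs from the task list, keeping the first occurrence.
			# The list is rebuilt instead of deleting by index inside the loop,
			# which skips entries and eventually raises IndexError.
			tmp = []
			deduped = []
			for item in url_new:
				if item[1] not in tmp:
					tmp.append(item[1])
					deduped.append(item)
			url_new = deduped
			#url_new = {}.fromkeys(url_new).keys()
			# Print some status information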

			os.system('cls')

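			# currentThread() names the running thread; Thread() would create a new, unused one
			print threading.currentThread().getName()+": active threads: "+str(threading.activeCount())+", tasks remaining: "+str(len(url_new))+", visited: "+str(len(url_old))

			for k in url_err.keys():

				print str(k)+': '+str(len(url_err[k]))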
			lock.release()

	except Exception as e:

		logging.debug(str(e))

		lock.release()
# Thread-count check: an infinite loop keeps checking the number of active threads
# and creates and starts new threads when there are too few

def threadcheck(num):

	t=threading.Thread(target=geturl)

	t.start()

	t.join()
# Main function

def main():

	"""Initially: create 200 threads

	for i in xrange(190):

		t = threading.Thread(target=geturl)

		threads.append(t)

	for i in xrange(190):

		threads[i].start()

	for i in xrange(190):

		threads[i].join()"""

	t = threading.Thread(target=threadcheck, args=(10,))

	t.start()

	t.join()

	#geturl(url_new.pop(0))

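# Entry point
if __name__ == '__main__':
	main()
	# raw_input only waits for Enter; Python 2's input() would evaluate what is typed
	raw_input('Finished crawling the whole site!')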
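For problem 1, one common alternative to watching the thread count is a fixed pool of workers that read from a `queue.Queue` and catch exceptions inside their loop, so a bad page does not end the thread. The sketch below assumes Python 3 and is not the program above. The `crawl` function, its `fetch` parameter, and the small `site` dictionary are hypothetical stand-ins so that the example runs without network access.

```python
import queue
import threading


def crawl(start_url, fetch, num_threads=4):
    """Crawl from start_url with a fixed pool of worker threads.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the links found on the page or raises on failure.
    Returns (visited, failed).
    """
    tasks = queue.Queue()
    seen = {start_url}
    visited, failed = [], []
    lock = threading.Lock()              # guards seen, visited and failed

    def worker():
        while True:
            url = tasks.get()
            if url is None:              # sentinel: no more work, exit the thread
                tasks.task_done()
                return
            try:
                links = fetch(url)       # slow I/O happens outside the lock
                with lock:
                    visited.append(url)
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            tasks.put(link)
            except Exception as exc:     # a failing page must not kill the worker
                with lock:
                    failed.append((url, str(exc)))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.put(start_url)
    tasks.join()                         # returns once every queued URL is processed
    for _ in threads:
        tasks.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    return visited, failed


# A tiny in-memory "site" used instead of real HTTP requests.
site = {
    '/': ['/a', '/b'],
    '/a': ['/b', '/c'],
    '/b': ['/'],
}


def fake_fetch(url):
    return site[url]                     # KeyError for '/c' stands in for an HTTP error


visited, failed = crawl('/', fake_fetch)
print(sorted(visited), [url for url, _ in failed])
```

With this layout the thread count stays at `num_threads` for the whole run, and `tasks.join()` gives a clear point at which the crawl is finished.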


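This article is licensed under the Creative Commons Attribution-NonCommercial 3.0 China Mainland license. When reposting, please credit the source and include the corresponding link.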

Permalink: http://yjfs.org.cn/python-threading-pachong1.html

Comments on this post: 1 so far

阿里百秀, posted 2014-11-28 14:35:

A very practical technical article.
