A beginner practice crawler: use requests + BeautifulSoup to scrape posts from http://v2ex.com, including the post title, author, and time; the main goal is to practice regular expressions:
#Exercise 02: 2014-11-01
#http://baidu.lecai.com/lottery/draw/list/50
#Goal: from the lottery site http://baidu.lecai.com/lottery/draw/list/50, grab the draw date, draw number, winning numbers, and sales amount for every 双色球 (Double Color Ball) draw since the game started
#Observation: the site has data for 2003~2014, and each year's listing URL follows the pattern http://baidu.lecai.com/lottery/draw/list/50?d=2003-01-01
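A minimal sketch of that URL pattern: only the year changes between pages, so the twelve listing URLs (2003 through 2014) can be built up front. The `BASE` name here is just for illustration, not from the original script.

```python
# Sketch: the yearly listing URLs differ only in the year component.
BASE = "http://baidu.lecai.com/lottery/draw/list/50?d={}-01-01"
urls = [BASE.format(year) for year in range(2003, 2015)]
print(urls[0])   # listing page for 2003
print(urls[-1])  # listing page for 2014
```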
import re

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Draw date, e.g. <td class="td1">2013-12-31</td>
    reg_lottery_date = re.compile(r'<td class="td1">(.*?)</td>')
    # Draw number, e.g. <a href="/lottery/draw/view/50?phase=2013154">2013154</a>
    reg_lottery_qihao = re.compile(r'<a href="(.*?)">(.*?)</a>')
    # Winning numbers: 6 red balls (ball_1) and 1 blue ball (ball_2), e.g. <span class="ball_1">07</span>
    reg_lottery_haoma = re.compile(r'<span class="ball_[12]">(.*?)</span>')
    # Sales amount, e.g. <td class="td4">67,728,980</td>
    reg_lottery_amount = re.compile(r'<td class="td4">(.*?)</td>')
    with open('双色球2003-2014历史数据.txt', 'w', encoding='utf-8') as output:
        for year in range(2003, 2015):
            # One listing page per year
            req_html_doc = requests.get("http://baidu.lecai.com/lottery/draw/list/50?d=%d-01-01" % year).text
            my_soup = BeautifulSoup(req_html_doc, 'html.parser')
            for each in my_soup.find_all('tr'):
                row = str(each)
                lottery_date = reg_lottery_date.findall(row)
                lottery_qihao = reg_lottery_qihao.findall(row)
                lottery_haoma = reg_lottery_haoma.findall(row)
                lottery_amount = reg_lottery_amount.findall(row)
                if lottery_qihao:  # skip rows (e.g. the table header) with no draw number
                    out_line = str(lottery_date) + lottery_qihao[0][1] + str(lottery_haoma) + str(lottery_amount) + '\n'
                    output.write(out_line)
The results are saved to the file 双色球2003-2014历史数据.txt
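Since the point of the exercise is the regular expressions, here is a self-contained check of the same patterns against sample fragments like those quoted in the comments above. The `row` string is an illustrative snippet, not fetched from the site; note the non-greedy `(.*?)` groups, which keep each match from running past its own closing tag when several cells sit on one line.

```python
import re

# Illustrative row fragments modeled on the comments in the script above.
row = ('<td class="td1">2013-12-31</td>'
      '<a href="/lottery/draw/view/50?phase=2013154">2013154</a>'
      '<span class="ball_1">07</span><span class="ball_2">12</span>'
      '<td class="td4">67,728,980</td>')

date = re.findall(r'<td class="td1">(.*?)</td>', row)            # draw date
qihao = re.findall(r'<a href="(.*?)">(.*?)</a>', row)            # (link, draw number)
balls = re.findall(r'<span class="ball_[12]">(.*?)</span>', row)  # red + blue balls
amount = re.findall(r'<td class="td4">(.*?)</td>', row)           # sales amount

print(date, qihao[0][1], balls, amount)
```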