Batch-downloading MP3 files from a pingshu (storytelling) site

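Every year when I go home, downloading pingshu (traditional storytelling) and opera recordings for my grandfather seems to be a chore I can't escape.

So let's write a script to download them automatically.

The site is fairly decent: almost every pingshu recording comes with a download link. It does have one simple anti-scraping measure, though: the MP3 file names can't be enumerated.

But while the file names can't be enumerated... the download pages can!

So it's simple: walk through the download pages, extract the MP3 link from each one, and fetch the file with urllib.request.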
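As a rough sketch of the enumeration step: the script in this post implies that episode 1 lives at the base URL itself and episode n at the base URL plus "n.html". The helper name `page_url` is made up for illustration, and the URL pattern is an assumption taken from that script rather than anything the site documents.

```python
def page_url(base_url, episode):
    """Build the download-page URL for one episode (hypothetical helper).

    Assumed pattern: episode 1 is the base URL, episode n is base_url + "n.html".
    """
    # The parentheses around the conditional expression are required;
    # without them Python parses it as (base_url + "") if ... else "n.html".
    return base_url + ("" if episode == 1 else "{}.html".format(episode))


if __name__ == "__main__":
    base = "http://www.zgpingshu.com/down/575/"
    for episode in range(1, 4):
        print(page_url(base, episode))
```

Each of these pages can then be fetched and parsed for its MP3 link.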

The script is rather ugly; after all, it took 20 minutes from analysis to implementation.

base_url = "http://www.zgpingshu.com/down/575/"  # To get this: find the first episode in the list, click "下载" (Download), and copy the resulting URL
album = "白眉大侠"
author = "单田芳"
total = 320

workers = 10

import urllib.request
from concurrent import futures
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

store_dir = Path("down") / "{}-{}".format(author, album)
store_dir.mkdir(exist_ok=True, parents=True)


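def get_mp3(ith):
    """Download episode `ith` and return a one-line status string."""
    success = True
    try:
        # Episode 1 lives at base_url itself; later episodes at base_url + "<n>.html".
        # The parentheses matter: without them the conditional swallows base_url.
        url = base_url + ("" if ith == 1 else "{}.html".format(ith))
        response = requests.get(url, timeout=10, allow_redirects=True)
        html = BeautifulSoup(response.text, "html5lib")
        audio_file_element = html.find(id="down")  # element whose href is the MP3 link

        audio_file_url = audio_file_element['href']
        audio_file = str(store_dir / "{}-{:03}.mp3".format(album, ith))

        urllib.request.urlretrieve(audio_file_url, audio_file)
    except Exception:  # any failure (network error, missing link) marks the episode as failed
        success = False

    txt = "{} {}".format(ith, "OK" if success else "Fail")

    return txt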


with futures.ThreadPoolExecutor(max_workers=workers) as executor:
    jobs = {
        executor.submit(get_mp3, i): str(i)
        for i in range(1, total + 1)
    }
    for future in tqdm(futures.as_completed(jobs), total=total, desc=album):
        tqdm.write(future.result())

    print("All is well")

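What's still missing is retrying on failure and pausing for a random interval between requests. But the server holds up well: even 10 threads downloading at once works fine. Straightforward sites like this are rare these days, so cherish it while it lasts.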
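For the two missing pieces, retrying and random pauses, a minimal sketch might look like the following. It is not part of the original script: the helper name `with_retries` and its parameters (`attempts`, `min_pause`, `max_pause`, `sleep`) are all made up for illustration, and the default values are arbitrary.

```python
import random
import time


def with_retries(task, attempts=3, min_pause=1.0, max_pause=3.0, sleep=time.sleep):
    """Call task() until it succeeds or the attempts run out (hypothetical helper).

    After each failed attempt except the last, pause for a random duration
    in [min_pause, max_pause] seconds so requests are spread out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:  # e.g. network error or missing link
            last_error = error
            if attempt < attempts:
                sleep(random.uniform(min_pause, max_pause))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

In the script, the download call could then be wrapped as something like `with_retries(lambda: urllib.request.urlretrieve(audio_file_url, audio_file))`. The `sleep` parameter exists only so the pause can be swapped out when testing.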

In one night I downloaded 15 GB, about 2,600 episodes. Very efficient.

I hope the webmaster doesn't come after me...
