[Python] A First Look at Selenium Automation

Requirement

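My wife needs to take the Putonghua (Mandarin) proficiency test soon, but as a non-student ("social") candidate she has a hard time getting a registration slot. (The last time Wuhan held a Putonghua test session open to social candidates was in December of last year.) So she asked me to write a program that monitors the registration information for the test. A perfect job for my Python!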

Process

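To monitor the registration information, we first need a target website. After some searching I found two: the Putonghua test notice page of Central China Normal University (CCNU), https://www.ccnu.edu.cn/shfw/pthcs.htm , and the Hubei Putonghua proficiency test online registration system, http://hubeibm.cltt.org/pscweb/signUp.html . The CCNU page is static and easy to handle: parse the response text into an HTML tree with Python's lxml library, then extract the data we need with XPath.
A first attempt at parsing the Hubei registration system with XPath returned data that was clearly unusable (I suspect some anti-scraping measure is in place). So I settled for the next best thing: use Python + Selenium to take a screenshot of the page, compress the image, encode it as base64, and push it to a WeChat enterprise account. That is enough to keep an eye on the registration system. Done!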
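The lxml + XPath step for a static page can be sketched roughly as follows. This is only an illustration: the HTML snippet, its text, and the helper name `extract_notice` are made up for the example and do not reflect the real CCNU page; only the two `id` values are borrowed from the XPath expressions used later in this post.

```python
from lxml import etree

# Made-up sample markup, NOT the real CCNU page.
SAMPLE_HTML = """
<html><body>
  <div id="container">
    <h3><strong><span> Sample notice title </span></strong></h3>
    <div id="vsb_content_6">
      <p>First paragraph.</p>
      <p>Last paragraph of the notice.</p>
    </div>
  </div>
</body></html>
"""


def extract_notice(html_text):
    """Return (title, last paragraph) or None if the expected nodes are missing."""
    root = etree.HTML(html_text)
    # Relative paths with // tolerate layout changes better than long absolute paths.
    title_nodes = root.xpath('//*[@id="container"]//h3//span')
    paragraphs = root.xpath('//*[@id="vsb_content_6"]/p')
    if not title_nodes or not paragraphs:
        return None
    # itertext() collects text even when it is wrapped in nested tags.
    title = "".join(title_nodes[0].itertext()).strip()
    last_paragraph = "".join(paragraphs[-1].itertext()).strip()
    return title, last_paragraph


if __name__ == "__main__":
    print(extract_notice(SAMPLE_HTML))
```

Checking for empty result lists before indexing avoids an IndexError when the page layout changes, which is the most likely way a scraper like this breaks.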
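The "screenshot to base64 to message" idea boils down to embedding the image bytes in a Markdown data URI. A minimal sketch, assuming the push service renders Markdown images; the helper name `to_markdown_image` is hypothetical and not part of any library:

```python
import base64


def to_markdown_image(png_bytes, alt="image"):
    """Embed raw PNG bytes in a Markdown image tag via a base64 data URI."""
    # b64encode returns bytes; decode to str so no b'...' wrapper leaks into the text.
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "![{}](data:image/png;base64,{})".format(alt, encoded)


if __name__ == "__main__":
    # Placeholder bytes stand in for a real screenshot file.
    print(to_markdown_image(b"\x89PNG placeholder"))
```

Base64 inflates the payload by about one third, which is why the screenshot is compressed before it is sent.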

Development

Environment

Based on the analysis above, this project needs the following third-party libraries. Install each of them with pip install:

requests==2.25.1
lxml==4.6.3
APScheduler==3.7.0
selenium==3.141.0
pillow==8.2.0

Code

import base64
import io
import os
import time
import requests
from PIL import Image
from lxml import etree
from apscheduler.schedulers.blocking import BlockingScheduler
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def send_message(title, content):
    url = 'https://sctapi.ftqq.com/SCT3143xxxxxxxxxcHKuDrRablnsELSCi.send'
    param = {
        'title': title,
        'desp': content
    }
    requests.post(url, data=param)


def job():
    url = "https://www.ccnu.edu.cn/shfw/pthcs.htm"
    try:
        text = requests.get(url).content
        html = etree.HTML(text)
        title = html.xpath('//*[@id="container"]/div[3]/div/div[2]/div/div/div/h3/strong/span')[0].text.strip()
        date = html.xpath('//*[@id="vsb_content_6"]/p[53]')[0].text.strip()
        content = '{} \r\n' \
                  '{} \r\n' \
                  '{}'.format(title, date, url)
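        send_message("Putonghua test notice", content)
    except Exception as e:
        # Log the actual error instead of a generic message.
        print("failed to fetch the notice: {}".format(e))
        send_message("Putonghua test notice", "The scraper failed to fetch the notice, please open {} manually".format(url))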


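def compress_image_bs4(b64, mb=190, k=0.9):
    """Shrink a base64-encoded image until it fits the target size.

    The image is scaled down by the factor k on each pass, so its dimensions change.
    :param b64: base64-encoded image, as bytes or str
    :param mb: target size in KB (despite the parameter name)
    :param k: scale factor applied to width and height on each pass
    :return: base64-encoded image as a str
    """
    f = base64.b64decode(b64)
    with io.BytesIO(f) as im:
        o_size = len(im.getvalue()) // 1024
        if o_size <= mb:
            # Always return str, so the result can be formatted into text safely.
            return b64.decode('utf8') if isinstance(b64, bytes) else b64
        im_out = im
        while o_size > mb:
            img = Image.open(im_out)
            x, y = img.size
            # max(1, ...) keeps the size valid even for very small images.
            out = img.resize((max(1, int(x * k)), max(1, int(y * k))), Image.ANTIALIAS)
            im_out.close()
            im_out = io.BytesIO()
            out.save(im_out, 'png')
            o_size = len(im_out.getvalue()) // 1024
        b64 = base64.b64encode(im_out.getvalue())
        im_out.close()
        return str(b64, encoding='utf8')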


def job2():
    url = "http://hubeibm.cltt.org/pscweb/signUp.html"
    driver = webdriver.Remote(
        command_executor="http://10.122.100.146:4444/wd/hub",
        desired_capabilities=DesiredCapabilities.CHROME
    )
    driver.get(url)
    wh = driver.find_element_by_xpath('/html/body/div[2]/div[2]/div/div[1]/online-widget/div/div[1]/ul/li[2]/a')
    wh.click()
    time.sleep(1)
    pic_name = r'{}_screen.png'.format('test')
    driver.save_screenshot(pic_name)

    with open(pic_name, "rb") as f:
        base64_data = base64.b64encode(f.read())
        base64_data = compress_image_bs4(base64_data, 30, 0.9)
    content = "Putonghua test screenshot" \
              "![image](data:image/png;base64,{})".format(base64_data)
    send_message("Putonghua test screenshot", content)

    driver.quit()  # end the whole WebDriver session, not just the current window
    os.remove(pic_name)


if __name__ == '__main__':
    job2()
    job()
    scheduler = BlockingScheduler()
    scheduler.add_job(job, 'interval', seconds=10800)
    scheduler.add_job(job2, 'interval', seconds=10800)
    scheduler.start()
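As written, the jobs push a message every three hours whether or not anything changed. One possible refinement, sketched here as an assumption rather than something the code above does, is to remember a fingerprint of the last notice and only notify on change. The names `fingerprint` and `ChangeDetector` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of the text, ignoring surrounding whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


class ChangeDetector:
    """Remembers the last fingerprint seen and reports whether new text differs."""

    def __init__(self):
        self._last = None

    def has_changed(self, text):
        current = fingerprint(text)
        changed = current != self._last
        self._last = current
        return changed


if __name__ == "__main__":
    detector = ChangeDetector()
    for notice in ["notice A", "notice A", "notice B"]:
        # In job(), send_message would only be called when this is True.
        print(notice, detector.has_changed(notice))
```

The state lives in memory only, so the first check after a restart always reports a change; persisting the digest to a file would avoid that.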

Productionizing

Dockerizing

FROM python:3.7
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["python"]
CMD ["./main.py"]

sudo docker run -d -p4444:4444 selenium/standalone-chrome:latest # start the standalone Selenium Chrome server
mkdir pthcs
cd pthcs
#1. Write the Dockerfile content above into a file named Dockerfile
#2. Copy the Python code above into main.py
#3. Write the dependencies above into requirements.txt
sudo docker build -t pthcs:v0.1 .  # build the image
sudo docker run -d pthcs:v0.1 # start the service in the background
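Since main.py hard-codes the Selenium server address, the image only works on one network. A small sketch of reading the address from the environment instead; the variable name `SELENIUM_HUB_URL`, the default value, and the helper `get_hub_url` are assumptions made for this example:

```python
import os

# Assumed default, only useful when running main.py directly on the Docker host.
DEFAULT_HUB_URL = "http://localhost:4444/wd/hub"


def get_hub_url(environ=os.environ):
    """Return the Selenium server URL from the environment, or the default."""
    return environ.get("SELENIUM_HUB_URL", DEFAULT_HUB_URL)


if __name__ == "__main__":
    # The result would be passed as command_executor to webdriver.Remote.
    print(get_hub_url())
```

The value could then be supplied at start-up with docker's -e option, for example: sudo docker run -d -e SELENIUM_HUB_URL=http://10.122.100.146:4444/wd/hub pthcs:v0.1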

Closing remarks

This is just a record of everyday code and the content is rather scattered, so please go easy on me...