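I have recently been scraping the CCTV5 live-stream link. Because the m3u8 link expires after a while, it has to be refreshed on the server on a schedule.
The server only has a plain shell with no GUI, so I had to work out how to run selenium without a display.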
Install Firefox
sudo apt-get remove --purge firefox
sudo apt-get install firefox
Install Xvfb
sudo apt-get install xvfb
Install the Flash plugin
sudo apt-get install flashplugin-installer
Install selenium and pyvirtualdisplay
pip install selenium
pip install pyvirtualdisplay
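Before running the scraper, it can help to confirm that the pieces installed above are actually visible to Python. The following is only a minimal sketch, assuming Python 3 and assuming the binaries are named firefox and Xvfb on the PATH; the helper names missing_executables and missing_modules are made up for this illustration.

```python
import importlib.util
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Binary names are assumptions; adjust them if your system differs.
    print("missing executables:", missing_executables(["firefox", "Xvfb"]))
    print("missing modules:", missing_modules(["selenium", "pyvirtualdisplay"]))
```

Two empty lists mean everything was found. Anything listed points back to the install step that needs another look.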
Code example:
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

cctv5 = 'http://www.myp2pch.net/tiantian2.html?c=cctv5&w=800&h=600'


def get_cctv5_meu8(url, iframe):
    # Start a virtual X display (Xvfb) so Firefox can run without a screen
    display = Display()
    display.start()
    driver = None
    try:
        driver = webdriver.Firefox()
        driver.get(url)
        # Wait for the iframe, then switch into it
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, iframe))
        )
        driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, iframe))
        # Wait for the player
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div#player object"))
        )
        # Read the stream address from the player's flashvars parameter
        player = driver.find_element(By.CSS_SELECTOR, 'div#player object')
        m3u8 = player.find_element(By.CSS_SELECTOR, 'param[name=flashvars]').get_attribute('value')
        # Keep everything after 'a=' and URL-decode it
        m3u8 = m3u8[m3u8.index('a=') + 2:]
        return unquote(m3u8)
    except Exception as e:
        print(e)
        return None
    finally:
        # Always release the browser and the virtual display
        if driver is not None:
            driver.quit()
        display.stop()
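The function above takes everything after the first 'a=' in the flashvars value and URL-decodes it. If the flashvars value happens to be a query string (an assumption, since the page's actual format is not shown here), the same step could be sketched with the standard library parser, which avoids matching 'a=' inside another key such as 'data='. This is Python 3; the helper name extract_stream_url and the sample value are made up for illustration.

```python
from urllib.parse import parse_qs


def extract_stream_url(flashvars, key="a"):
    """Return the decoded value of `key` from a query-string style value, or None."""
    values = parse_qs(flashvars).get(key)
    return values[0] if values else None


if __name__ == "__main__":
    # Made-up flashvars value, not taken from the real page
    sample = "w=800&a=http%3A%2F%2Fexample.com%2Flive%2Fcctv5.m3u8"
    print(extract_stream_url(sample))
```

One trade-off: if the stream address itself contains an unencoded '&', the parser cuts it off at that point, whereas the original slice keeps the rest of the string. Which behaviour is right depends on how the page actually writes flashvars.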
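Because this job scrapes the CCTV5 live-stream source, the Flash plugin has to work, so I chose a real browser. If Flash support is not needed, selenium + PhantomJS is another option for headless scraping. PhantomJS is a scriptable headless WebKit browser (a standalone program rather than a Node.js library, though it can conveniently be installed through npm). As a JavaScript execution engine it is lighter than a full browser, and driving it works much the same way as driving a regular browser.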
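Since the m3u8 link expires, the scrape has to be repeated on a schedule, as mentioned at the start. The following is only a rough sketch of that refresh step in Python 3. The names refresh_link and run_forever, the output file, and the 600-second interval are all assumptions for illustration; fetch stands for any callable that returns the link or None on failure, which is how the scraping function above behaves.

```python
import os
import time


def refresh_link(fetch, path):
    """Call fetch() and, if it returns a link, write it to `path`.

    Returns True when the file was updated, False when fetch() gave nothing.
    """
    url = fetch()
    if not url:
        return False  # keep the previous link instead of overwriting it
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        handle.write(url + "\n")
    os.replace(tmp, path)  # swap the new file into place in one step
    return True


def run_forever(fetch, path, interval_seconds=600):
    """Refresh the link repeatedly; the interval is an arbitrary guess."""
    while True:
        refresh_link(fetch, path)
        time.sleep(interval_seconds)
```

Writing to a temporary file and renaming it means a reader of the file never sees a half-written link. A cron job that calls refresh_link once per run would do the same job as run_forever.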