Becoming a Junior Data Analyst | Python and Data Science Applications

Web Scraping

郭耀仁

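Outline

  • Environment setup for web scraping
  • Prerequisites for web scraping
  • Core tasks of web scraping
  • Scraping web data in JSON format
  • Scraping web data in XML format
  • Scraping web data in HTML format
  • Browser automation
  • Further reading
  • Assignment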

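Environment setup for web scraping

Our development environments are

  • Google Colaboratory in the browser
  • Miniconda on the local machine

We use Google Colaboratory throughout, except for "Browser automation", which requires the local machine

Using Google Colaboratory in the browser

  1. Sign in to a Google account
  2. Go to https://colab.research.google.com/
  3. Create a new Python 3 notebook

Upgrading beautifulsoup4 on Google Colaboratory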

!pip install -U beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
     |████████████████████████████████| 102kB 6.0MB/s 
Collecting soupsieve>=1.2
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
  Found existing installation: beautifulsoup4 4.6.3
    Uninstalling beautifulsoup4-4.6.3:
      Successfully uninstalled beautifulsoup4-4.6.3
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5

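Prerequisites for web scraping

A web page is built from a combination of a markup language, style sheets, and a programming language

  • Markup language: HTML
  • Style sheets: CSS
  • Programming language: JavaScript

An analogy: what a building is made of

  • Reinforced concrete: HTML
  • Interior finishing and partitions: CSS
  • Plumbing and wiring: JavaScript

HTML is responsible for the structure of the page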

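Web pages fall into two broad categories

  1. Static: no back-end program or database behind the page; changing the content means opening the page files in an editor
  2. Dynamic: also called a web application (web app); backed by a back-end program and a database

A basic HTML document has two parts

  1. head: settings and information about the page, plus the resource files the page needs
  2. body: the content the user sees in the browser

A basic HTML document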

<!DOCTYPE html>
<html>
  <head>

  </head>

  <body>

  </body>
</html>

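CSS is responsible for the styling of the page

Cascading Style Sheets (CSS) is a language for adding style (fonts, spacing, colors, and so on) to HTML; it is defined and maintained by the W3C.

Source: https://zh.wikipedia.org/zh-tw/%E5%B1%82%E5%8F%A0%E6%A0%B7%E5%BC%8F%E8%A1%A8

How a CSS rule is written

  • Selector
  • Property
  • Property value

How selectors are declared

  • A single element: #id
  • A group of elements: .class, tag name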
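
The difference between the three selector forms can be sketched in Python. The snippet below is only an illustration built on the standard library's html.parser, not a real CSS engine: the SimpleSelector class, the select() helper, and the sample HTML are hypothetical names made up for this example, and only a bare #id, .class, or tag name is understood.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collect the start tags that match one simple selector: #id, .class, or a tag name."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # An element may carry several space-separated class names.
        classes = (attributes.get("class") or "").split()
        if self.selector.startswith("#"):
            matched = attributes.get("id") == self.selector[1:]
        elif self.selector.startswith("."):
            matched = self.selector[1:] in classes
        else:
            matched = tag == self.selector
        if matched:
            self.matches.append(tag)


def select(html, selector):
    """Return the tag names of every element matched by the selector."""
    parser = SimpleSelector(selector)
    parser.feed(html)
    return parser.matches


# Hypothetical sample page: one id, one class shared by two elements.
sample_html = (
    '<h1 id="title">Movies</h1>'
    '<p class="plot">First</p>'
    '<p class="plot">Second</p>'
    "<p>Third</p>"
)

print(select(sample_html, "#title"))  # an id points at a single element
print(select(sample_html, ".plot"))   # a class can be shared by several elements
print(select(sample_html, "p"))       # a tag name matches every element of that type
```

In the sample, #title matches one element, .plot matches two, and p matches three, which is why an id is the natural choice for a single element and a class or tag name for a group.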

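Core tasks of web scraping

Taking stock of the core tasks

We use Python's rich set of packages, together with Chrome extensions and the Chrome developer tools, to carry out two core tasks:

  1. Requesting data
  2. Parsing data

HTTP

Hypertext Transfer Protocol (HTTP) is an application-layer protocol for transmitting hypermedia documents such as HTML. It was designed for communication between browsers and servers, but it can be used for other purposes as well. HTTP follows the classic client-server model: the client opens a connection to send a request, then waits to receive a response.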

Source: https://developer.mozilla.org/zh-TW/docs/Web/HTTP

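HTTP defines a set of request methods that indicate the action to perform on a given resource. The two most relevant to web scraping are:

  • GET
  • POST

Requesting data is a two-way exchange

  • What the browser sends to the web server is called an HTTP request
  • What the web server sends back to the browser is called an HTTP response
  • The request method of an HTTP request tells the web server what the browser wants it to do
  • Once the data has been received, the browser displays the response body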

Tools needed for requesting data

Chrome DevTools

Chrome DevTools is a set of web development and testing tools built into Google Chrome.

Source: https://developers.google.com/web/tools/chrome-devtools/?hl=zh-TW

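Web scraping relies heavily on the Network tab of Chrome DevTools

Use the Network tab to see which files are requested and downloaded

Select Network, then reload the page and watch the requests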

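The data we want to scrape usually shows up under one of these two file categories

  • XHR (XMLHttpRequest)
  • Doc

Quick JavaScript Switcher can help tell which one applies

It turns JavaScript on and off quickly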

Source: https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje

Once the data is located, inspect its details

  • Headers
    • General
    • Response Headers
    • Request Headers
    • Query String Parameters(if any)
    • Form Data(if any, for POST)
  • Preview
  • Response
  • Cookies

Commonly used requests functions

  • requests.get(): sends a GET request (downloads a resource); often used with query string parameters
  • requests.post(): sends a POST request (submits data); used with form data
In [1]:
import requests

request_url = "https://www.imdb.com/"
response = requests.get(request_url)
In [2]:
request_url = "https://mops.twse.com.tw/mops/web/t05st10_ifrs"
response = requests.post(request_url)
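
How query string parameters and form data end up in a request can be inspected without touching the network. The sketch below builds two requests with requests.Request(...).prepare() and prints what would be sent; the example.com URLs and the "q" parameter are placeholders assumed for illustration, and the form fields are sample values only.

```python
import requests

# Build the requests without sending them, so nothing goes over the network.
get_request = requests.Request(
    "GET",
    "https://example.com/search",   # placeholder URL, assumed for illustration
    params={"q": "python"},         # ends up in the query string
).prepare()

post_request = requests.Request(
    "POST",
    "https://example.com/form",     # placeholder URL, assumed for illustration
    data={"commandid": "GetTown", "cityid": "01"},  # ends up in the form-encoded body
).prepare()

print(get_request.method, get_request.url)
print(post_request.method, post_request.body)
print(post_request.headers["Content-Type"])
```

The GET parameters are appended to the URL after a question mark, while the POST data travels in the request body, which matches the Query String Parameters and Form Data entries shown in the Network tab.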

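Methods and attributes of the response (the Response class)

  • response.status_code: the HTTP status code
  • response.json(): parses the response body as JSON into Python data structures (list, dict)
  • response.content: the response body as bytes
  • response.text: the response body as str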
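
The relationship between bytes, str, and parsed data can be sketched with the standard library alone. This is an analogy rather than the requests internals: the sample body is made up, and UTF-8 is assumed as the encoding, whereas requests chooses the encoding for .text from the response itself.

```python
import json

# Raw body as it would arrive over the network: UTF-8 encoded bytes (sample data).
content = '{"TownID": "01", "TownName": "松山區"}'.encode("utf-8")

text = content.decode("utf-8")  # bytes -> str, comparable to response.text
parsed = json.loads(text)       # str -> dict, comparable to response.json()

print(type(content).__name__, type(text).__name__, type(parsed).__name__)
print(parsed["TownName"])
```

Printing the bytes object directly would show escaped byte values for the Chinese characters; decoding first makes them readable.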

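Check the data format in the Preview and Response tabs of the request details

  • If the data is JSON: call the response's .json() method and work with the resulting Python data structures directly
  • If the data is XML: take the response's .content attribute and parse it with lxml and XPath
  • If the data is HTML: take the response's .text attribute and parse it with bs4 and CSS selectors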

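Scraping web data in JSON format

What is JSON?

JavaScript Object Notation (JSON) is a standard format for representing structured data as JavaScript objects, commonly used to present and transmit data on websites.

Source: mozilla.org

JSON is a data format that follows JavaScript object syntax and was popularized by Douglas Crockford. Although JSON is based on JavaScript syntax, it can be used independently of JavaScript, and many programming environments can read (parse) and generate JSON.

Source: mozilla.org

How is JSON parsed in Python, and how do the types map?

  • Python ships with the standard library module json for parsing
  • A JSON object maps to a Python dict
  • A JSON array of objects maps to a Python list of dict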
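
The mapping above can be checked with a small sketch using the standard library json module. The two JSON strings are hand-written samples, not output from any API.

```python
import json

# Hand-written sample JSON text (note the lowercase true, as JSON spells it).
json_object = '{"fullName": "Boston Celtics", "isNBAFranchise": true, "divName": "Atlantic"}'
json_array = '[{"tricode": "BOS"}, {"tricode": "BKN"}]'

sample_team = json.loads(json_object)   # JSON object -> dict
sample_teams = json.loads(json_array)   # JSON array of objects -> list of dict

print(type(sample_team).__name__, sample_team["isNBAFranchise"])
print(type(sample_teams).__name__, [t["tricode"] for t in sample_teams])
```

JSON true becomes the Python bool True, and each element of the array is an ordinary dict that can be indexed by key.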

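Example of web data in JSON format

A Chrome extension that helps browse JSON data

JSON View

Steps for scraping web data in JSON format

  • Request the data with requests
  • Call the response's .json() method, for example response.json()
  • Summarize the result as needed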
In [3]:
request_url = "http://data.nba.net/prod/v2/2019/teams.json"
response = requests.get(request_url)
teams = response.json()
print(type(teams))
print(teams)
<class 'dict'>
{'_internal': {'pubDateTime': '2019-06-26 06:00:23.891 EDT', 'igorPath': 'cron,1561543218800,1561543218800|router,1561543218800,1561543218922|domUpdater,1561543219144,1561543219858|feedProducer,1561543221917,1561543224371', 'xslt': 'NBA/xsl/league/roster/marty_teams_list.xsl', 'xsltForceRecompile': 'true', 'xsltInCache': 'false', 'xsltCompileTimeMillis': '1545', 'xsltTransformTimeMillis': '540', 'consolidatedDomKey': 'qamanual__transform__marty_teams_list__5498140551604', 'endToEndTimeMillis': '5571'}, 'league': {'standard': [{'isNBAFranchise': False, 'isAllStar': False, 'city': 'Croatia', 'altCityName': 'Croatia', 'fullName': 'Team Croatia', 'tricode': 'CRO', 'teamId': '70', 'nickname': 'Croatia', 'urlName': 'croatia', 'teamShortName': 'Croatia', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'China', 'altCityName': 'China', 'fullName': 'Team China', 'tricode': 'CHN', 'teamId': '45', 'nickname': 'China', 'urlName': 'china', 'teamShortName': 'China', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Adelaide', 'altCityName': 'Adelaide', 'fullName': 'Adelaide 36ers', 'tricode': 'ADL', 'teamId': '15019', 'nickname': '36ers', 'urlName': '36ers', 'teamShortName': 'Adelaide', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Atlanta', 'altCityName': 'Atlanta', 'fullName': 'Atlanta Hawks', 'tricode': 'ATL', 'teamId': '1610612737', 'nickname': 'Hawks', 'urlName': 'hawks', 'teamShortName': 'Atlanta', 'confName': 'East', 'divName': 'Southeast'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Away', 'altCityName': 'Away', 'fullName': 'Away Away', 'tricode': 'AWY', 'teamId': '1610616840', 'nickname': 'Away', 'urlName': 'away', 'teamShortName': 'Away', 'confName': 'East', 'divName': 'East'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Beijing', 'altCityName': 'Beijing', 'fullName': 'Beijing Ducks', 'tricode': 'BJD', 'teamId': 
'15021', 'nickname': 'Ducks', 'urlName': 'ducks', 'teamShortName': 'Beijing', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Boston', 'altCityName': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': '1610612738', 'nickname': 'Celtics', 'urlName': 'celtics', 'teamShortName': 'Boston', 'confName': 'East', 'divName': 'Atlantic'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Brooklyn', 'altCityName': 'Brooklyn', 'fullName': 'Brooklyn Nets', 'tricode': 'BKN', 'teamId': '1610612751', 'nickname': 'Nets', 'urlName': 'nets', 'teamShortName': 'Brooklyn', 'confName': 'East', 'divName': 'Atlantic'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Charlotte', 'altCityName': 'Charlotte', 'fullName': 'Charlotte Hornets', 'tricode': 'CHA', 'teamId': '1610612766', 'nickname': 'Hornets', 'urlName': 'hornets', 'teamShortName': 'Charlotte', 'confName': 'East', 'divName': 'Southeast'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Buenos Aires', 'altCityName': 'Buenos Aires', 'fullName': 'San Lorenzo de Almagro', 'tricode': 'SLA', 'teamId': '12330', 'nickname': 'San Lorenzo', 'urlName': 'san_lorenzo', 'teamShortName': 'San Lorenzo', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Chicago', 'altCityName': 'Chicago', 'fullName': 'Chicago Bulls', 'tricode': 'CHI', 'teamId': '1610612741', 'nickname': 'Bulls', 'urlName': 'bulls', 'teamShortName': 'Chicago', 'confName': 'East', 'divName': 'Central'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Cleveland', 'altCityName': 'Cleveland', 'fullName': 'Cleveland Cavaliers', 'tricode': 'CLE', 'teamId': '1610612739', 'nickname': 'Cavaliers', 'urlName': 'cavaliers', 'teamShortName': 'Cleveland', 'confName': 'East', 'divName': 'Central'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Dallas', 'altCityName': 'Dallas', 'fullName': 'Dallas Mavericks', 'tricode': 'DAL', 'teamId': '1610612742', 'nickname': 
'Mavericks', 'urlName': 'mavericks', 'teamShortName': 'Dallas', 'confName': 'West', 'divName': 'Southwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Denver', 'altCityName': 'Denver', 'fullName': 'Denver Nuggets', 'tricode': 'DEN', 'teamId': '1610612743', 'nickname': 'Nuggets', 'urlName': 'nuggets', 'teamShortName': 'Denver', 'confName': 'West', 'divName': 'Northwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Detroit', 'altCityName': 'Detroit', 'fullName': 'Detroit Pistons', 'tricode': 'DET', 'teamId': '1610612765', 'nickname': 'Pistons', 'urlName': 'pistons', 'teamShortName': 'Detroit', 'confName': 'East', 'divName': 'Central'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Franca', 'altCityName': 'Franca', 'fullName': 'SESI/Franca', 'tricode': 'FRA', 'teamId': '12332', 'nickname': 'Franca', 'urlName': 'franca', 'teamShortName': 'Franca', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Golden State', 'altCityName': 'Golden State', 'fullName': 'Golden State Warriors', 'tricode': 'GSW', 'teamId': '1610612744', 'nickname': 'Warriors', 'urlName': 'warriors', 'teamShortName': 'Golden State', 'confName': 'West', 'divName': 'Pacific'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Guangzhou', 'altCityName': 'Guangzhou', 'fullName': 'Guangzhou Long-Lions', 'tricode': 'GUA', 'teamId': '15018', 'nickname': 'Long-Lions', 'urlName': 'long-lions', 'teamShortName': 'Guangzhou', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Haifa', 'altCityName': 'Haifa', 'fullName': 'Haifa Maccabi Haifa', 'tricode': 'MAC', 'teamId': '93', 'nickname': 'Maccabi Haifa', 'urlName': 'maccabi_haifa', 'teamShortName': 'Maccabi Haifa', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Home', 'altCityName': 'Home', 'fullName': 'Home Home', 'tricode': 'HME', 'teamId': '1610616839', 'nickname': 'Home', 'urlName': 'home', 
'teamShortName': 'Home', 'confName': 'East', 'divName': 'East'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Houston', 'altCityName': 'Houston', 'fullName': 'Houston Rockets', 'tricode': 'HOU', 'teamId': '1610612745', 'nickname': 'Rockets', 'urlName': 'rockets', 'teamShortName': 'Houston', 'confName': 'West', 'divName': 'Southwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Indiana', 'altCityName': 'Indiana', 'fullName': 'Indiana Pacers', 'tricode': 'IND', 'teamId': '1610612754', 'nickname': 'Pacers', 'urlName': 'pacers', 'teamShortName': 'Indiana', 'confName': 'East', 'divName': 'Central'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'LA', 'altCityName': 'LA Clippers', 'fullName': 'LA Clippers', 'tricode': 'LAC', 'teamId': '1610612746', 'nickname': 'Clippers', 'urlName': 'clippers', 'teamShortName': 'LA Clippers', 'confName': 'West', 'divName': 'Pacific'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Los Angeles', 'altCityName': 'Los Angeles Lakers', 'fullName': 'Los Angeles Lakers', 'tricode': 'LAL', 'teamId': '1610612747', 'nickname': 'Lakers', 'urlName': 'lakers', 'teamShortName': 'L.A. 
Lakers', 'confName': 'West', 'divName': 'Pacific'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Melbourne', 'altCityName': 'Melbourne', 'fullName': 'Melbourne United', 'tricode': 'MEL', 'teamId': '15016', 'nickname': 'United', 'urlName': 'united', 'teamShortName': 'Melbourne', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Memphis', 'altCityName': 'Memphis', 'fullName': 'Memphis Grizzlies', 'tricode': 'MEM', 'teamId': '1610612763', 'nickname': 'Grizzlies', 'urlName': 'grizzlies', 'teamShortName': 'Memphis', 'confName': 'West', 'divName': 'Southwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Miami', 'altCityName': 'Miami', 'fullName': 'Miami Heat', 'tricode': 'MIA', 'teamId': '1610612748', 'nickname': 'Heat', 'urlName': 'heat', 'teamShortName': 'Miami', 'confName': 'East', 'divName': 'Southeast'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Milwaukee', 'altCityName': 'Milwaukee', 'fullName': 'Milwaukee Bucks', 'tricode': 'MIL', 'teamId': '1610612749', 'nickname': 'Bucks', 'urlName': 'bucks', 'teamShortName': 'Milwaukee', 'confName': 'East', 'divName': 'Central'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Minnesota', 'altCityName': 'Minnesota', 'fullName': 'Minnesota Timberwolves', 'tricode': 'MIN', 'teamId': '1610612750', 'nickname': 'Timberwolves', 'urlName': 'timberwolves', 'teamShortName': 'Minnesota', 'confName': 'West', 'divName': 'Northwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'New Orleans', 'altCityName': 'New Orleans', 'fullName': 'New Orleans Pelicans', 'tricode': 'NOP', 'teamId': '1610612740', 'nickname': 'Pelicans', 'urlName': 'pelicans', 'teamShortName': 'New Orleans', 'confName': 'West', 'divName': 'Southwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'New York', 'altCityName': 'New York', 'fullName': 'New York Knicks', 'tricode': 'NYK', 'teamId': '1610612752', 'nickname': 'Knicks', 'urlName': 'knicks', 'teamShortName': 'New York', 
'confName': 'East', 'divName': 'Atlantic'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'New Zealand', 'altCityName': 'New Zealand', 'fullName': 'New Zealand Breakers', 'tricode': 'NZB', 'teamId': '15020', 'nickname': 'Breakers', 'urlName': 'breakers', 'teamShortName': 'New Zealand', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Oklahoma City', 'altCityName': 'Oklahoma City', 'fullName': 'Oklahoma City Thunder', 'tricode': 'OKC', 'teamId': '1610612760', 'nickname': 'Thunder', 'urlName': 'thunder', 'teamShortName': 'Oklahoma City', 'confName': 'West', 'divName': 'Northwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Orlando', 'altCityName': 'Orlando', 'fullName': 'Orlando Magic', 'tricode': 'ORL', 'teamId': '1610612753', 'nickname': 'Magic', 'urlName': 'magic', 'teamShortName': 'Orlando', 'confName': 'East', 'divName': 'Southeast'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Perth', 'altCityName': 'Perth', 'fullName': 'Perth Wildcats', 'tricode': 'PER', 'teamId': '104', 'nickname': 'Wildcats', 'urlName': 'wildcats', 'teamShortName': 'Perth', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Philadelphia', 'altCityName': 'Philadelphia', 'fullName': 'Philadelphia 76ers', 'tricode': 'PHI', 'teamId': '1610612755', 'nickname': '76ers', 'urlName': 'sixers', 'teamShortName': 'Philadelphia', 'confName': 'East', 'divName': 'Atlantic'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Phoenix', 'altCityName': 'Phoenix', 'fullName': 'Phoenix Suns', 'tricode': 'PHX', 'teamId': '1610612756', 'nickname': 'Suns', 'urlName': 'suns', 'teamShortName': 'Phoenix', 'confName': 'West', 'divName': 'Pacific'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Portland', 'altCityName': 'Portland', 'fullName': 'Portland Trail Blazers', 'tricode': 'POR', 'teamId': '1610612757', 'nickname': 'Trail Blazers', 'urlName': 'blazers', 'teamShortName': 'Portland', 
'confName': 'West', 'divName': 'Northwest'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Rio de Janeiro', 'altCityName': 'Rio de Janeiro', 'fullName': 'Rio de Janeiro Flamengo', 'tricode': 'FLA', 'teamId': '12325', 'nickname': 'Flamengo', 'urlName': 'flamengo', 'teamShortName': 'Flamengo', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Sacramento', 'altCityName': 'Sacramento', 'fullName': 'Sacramento Kings', 'tricode': 'SAC', 'teamId': '1610612758', 'nickname': 'Kings', 'urlName': 'kings', 'teamShortName': 'Sacramento', 'confName': 'West', 'divName': 'Pacific'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'San Antonio', 'altCityName': 'San Antonio', 'fullName': 'San Antonio Spurs', 'tricode': 'SAS', 'teamId': '1610612759', 'nickname': 'Spurs', 'urlName': 'spurs', 'teamShortName': 'San Antonio', 'confName': 'West', 'divName': 'Southwest'}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Shanghai', 'altCityName': 'Shanghai', 'fullName': 'Shanghai Sharks', 'tricode': 'SDS', 'teamId': '12329', 'nickname': 'Sharks', 'urlName': 'shanghai_sharks', 'teamShortName': 'Shanghai', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Sydney', 'altCityName': 'Sydney', 'fullName': 'Sydney Kings', 'tricode': 'SYD', 'teamId': '15015', 'nickname': 'Kings', 'urlName': 'sydkings', 'teamShortName': 'Sydney', 'confName': 'Intl', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Team', 'altCityName': 'Team', 'fullName': 'All-Stars', 'tricode': 'EST', 'teamId': '1699999999', 'nickname': 'All-Stars', 'urlName': 'assn_away', 'confName': 'East', 'divName': 'East'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Team', 'altCityName': 'Team', 'fullName': 'All-Stars', 'tricode': 'WST', 'teamId': '1699999998', 'nickname': 'All-Stars', 'urlName': 'assn_home', 'confName': 'West', 'divName': 'West'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Team 
Giannis', 'altCityName': 'Team Giannis', 'fullName': 'Team Giannis', 'tricode': 'GNS', 'teamId': '1610616833', 'nickname': 'Team Giannis', 'urlName': 'team_giannis', 'teamShortName': 'Team Giannis', 'confName': 'East', 'divName': 'East'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'Team LeBron', 'altCityName': 'Team LeBron', 'fullName': 'Team LeBron', 'tricode': 'LBN', 'teamId': '1610616834', 'nickname': 'Team LeBron', 'urlName': 'team_lebron', 'teamShortName': 'Team LeBron', 'confName': 'West', 'divName': 'West'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Toronto', 'altCityName': 'Toronto', 'fullName': 'Toronto Raptors', 'tricode': 'TOR', 'teamId': '1610612761', 'nickname': 'Raptors', 'urlName': 'raptors', 'teamShortName': 'Toronto', 'confName': 'East', 'divName': 'Atlantic'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'USA', 'altCityName': 'USA', 'fullName': 'USA', 'tricode': 'USA', 'teamId': '1610616843', 'nickname': 'USA', 'urlName': 'usa', 'teamShortName': 'USA', 'confName': 'East', 'divName': 'East'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Utah', 'altCityName': 'Utah', 'fullName': 'Utah Jazz', 'tricode': 'UTA', 'teamId': '1610612762', 'nickname': 'Jazz', 'urlName': 'jazz', 'teamShortName': 'Utah', 'confName': 'West', 'divName': 'Northwest'}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Washington', 'altCityName': 'Washington', 'fullName': 'Washington Wizards', 'tricode': 'WAS', 'teamId': '1610612764', 'nickname': 'Wizards', 'urlName': 'wizards', 'teamShortName': 'Washington', 'confName': 'East', 'divName': 'Southeast'}, {'isNBAFranchise': False, 'isAllStar': True, 'city': 'World', 'altCityName': 'World', 'fullName': 'World', 'tricode': 'WLD', 'teamId': '1610616844', 'nickname': 'World', 'urlName': 'world', 'teamShortName': 'World', 'confName': 'East', 'divName': 'East'}], 'africa': [{'isNBAFranchise': False, 'isAllStar': False, 'city': 'Team', 'altCityName': 'Team', 'fullName': 'Team USA', 'tricode': 
'USA', 'teamId': '22', 'nickname': 'USA', 'urlName': 'nhs_usa', 'teamShortName': 'USA', 'confName': '', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Team', 'altCityName': 'Team', 'fullName': 'Team World', 'tricode': 'WLD', 'teamId': '21', 'nickname': 'World', 'urlName': 'nhs_world', 'teamShortName': 'World', 'confName': '', 'divName': ''}], 'sacramento': [{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Golden State', 'altCityName': 'Golden State', 'fullName': 'Golden State Warriors', 'tricode': 'GSW', 'teamId': '1610612744', 'nickname': 'Warriors', 'urlName': 'warriors', 'teamShortName': 'Golden State', 'confName': 'Sacramento', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Los Angeles', 'altCityName': 'Los Angeles Lakers', 'fullName': 'Los Angeles Lakers', 'tricode': 'LAL', 'teamId': '1610612747', 'nickname': 'Lakers', 'urlName': 'lakers', 'teamShortName': 'L.A. Lakers', 'confName': 'Sacramento', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Miami', 'altCityName': 'Miami', 'fullName': 'Miami Heat', 'tricode': 'MIA', 'teamId': '1610612748', 'nickname': 'Heat', 'urlName': 'heat', 'teamShortName': 'Miami', 'confName': 'Sacramento', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Sacramento', 'altCityName': 'Sacramento', 'fullName': 'Sacramento Kings', 'tricode': 'SAC', 'teamId': '1610612758', 'nickname': 'Kings', 'urlName': 'kings', 'teamShortName': 'Sacramento', 'confName': 'Sacramento', 'divName': ''}], 'vegas': [{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Atlanta', 'altCityName': 'Atlanta', 'fullName': 'Atlanta Hawks', 'tricode': 'ATL', 'teamId': '1610612737', 'nickname': 'Hawks', 'urlName': 'hawks', 'teamShortName': 'Atlanta', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Boston', 'altCityName': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': '1610612738', 'nickname': 'Celtics', 
'urlName': 'celtics', 'teamShortName': 'Boston', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Brooklyn', 'altCityName': 'Brooklyn', 'fullName': 'Brooklyn Nets', 'tricode': 'BKN', 'teamId': '1610612751', 'nickname': 'Nets', 'urlName': 'nets', 'teamShortName': 'Brooklyn', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Charlotte', 'altCityName': 'Charlotte', 'fullName': 'Charlotte Hornets', 'tricode': 'CHA', 'teamId': '1610612766', 'nickname': 'Hornets', 'urlName': 'hornets', 'teamShortName': 'Charlotte', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Chicago', 'altCityName': 'Chicago', 'fullName': 'Chicago Bulls', 'tricode': 'CHI', 'teamId': '1610612741', 'nickname': 'Bulls', 'urlName': 'bulls', 'teamShortName': 'Chicago', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'China', 'altCityName': 'China', 'fullName': 'Team China', 'tricode': 'CHN', 'teamId': '45', 'nickname': 'China', 'urlName': 'china', 'teamShortName': 'China', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Cleveland', 'altCityName': 'Cleveland', 'fullName': 'Cleveland Cavaliers', 'tricode': 'CLE', 'teamId': '1610612739', 'nickname': 'Cavaliers', 'urlName': 'cavaliers', 'teamShortName': 'Cleveland', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': False, 'isAllStar': False, 'city': 'Croatia', 'altCityName': 'Croatia', 'fullName': 'Team Croatia', 'tricode': 'CRO', 'teamId': '70', 'nickname': 'Croatia', 'urlName': 'croatia', 'teamShortName': 'Croatia', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Dallas', 'altCityName': 'Dallas', 'fullName': 'Dallas Mavericks', 'tricode': 'DAL', 'teamId': '1610612742', 'nickname': 'Mavericks', 'urlName': 'mavericks', 'teamShortName': 'Dallas', 'confName': 'summer', 'divName': ''}, 
{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Denver', 'altCityName': 'Denver', 'fullName': 'Denver Nuggets', 'tricode': 'DEN', 'teamId': '1610612743', 'nickname': 'Nuggets', 'urlName': 'nuggets', 'teamShortName': 'Denver', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Detroit', 'altCityName': 'Detroit', 'fullName': 'Detroit Pistons', 'tricode': 'DET', 'teamId': '1610612765', 'nickname': 'Pistons', 'urlName': 'pistons', 'teamShortName': 'Detroit', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Golden State', 'altCityName': 'Golden State', 'fullName': 'Golden State Warriors', 'tricode': 'GSW', 'teamId': '1610612744', 'nickname': 'Warriors', 'urlName': 'warriors', 'teamShortName': 'Golden State', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Houston', 'altCityName': 'Houston', 'fullName': 'Houston Rockets', 'tricode': 'HOU', 'teamId': '1610612745', 'nickname': 'Rockets', 'urlName': 'rockets', 'teamShortName': 'Houston', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Indiana', 'altCityName': 'Indiana', 'fullName': 'Indiana Pacers', 'tricode': 'IND', 'teamId': '1610612754', 'nickname': 'Pacers', 'urlName': 'pacers', 'teamShortName': 'Indiana', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'LA', 'altCityName': 'LA Clippers', 'fullName': 'LA Clippers', 'tricode': 'LAC', 'teamId': '1610612746', 'nickname': 'Clippers', 'urlName': 'clippers', 'teamShortName': 'LA Clippers', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Los Angeles', 'altCityName': 'Los Angeles Lakers', 'fullName': 'Los Angeles Lakers', 'tricode': 'LAL', 'teamId': '1610612747', 'nickname': 'Lakers', 'urlName': 'lakers', 'teamShortName': 'L.A. 
Lakers', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Memphis', 'altCityName': 'Memphis', 'fullName': 'Memphis Grizzlies', 'tricode': 'MEM', 'teamId': '1610612763', 'nickname': 'Grizzlies', 'urlName': 'grizzlies', 'teamShortName': 'Memphis', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Miami', 'altCityName': 'Miami', 'fullName': 'Miami Heat', 'tricode': 'MIA', 'teamId': '1610612748', 'nickname': 'Heat', 'urlName': 'heat', 'teamShortName': 'Miami', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Milwaukee', 'altCityName': 'Milwaukee', 'fullName': 'Milwaukee Bucks', 'tricode': 'MIL', 'teamId': '1610612749', 'nickname': 'Bucks', 'urlName': 'bucks', 'teamShortName': 'Milwaukee', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Minnesota', 'altCityName': 'Minnesota', 'fullName': 'Minnesota Timberwolves', 'tricode': 'MIN', 'teamId': '1610612750', 'nickname': 'Timberwolves', 'urlName': 'timberwolves', 'teamShortName': 'Minnesota', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'New Orleans', 'altCityName': 'New Orleans', 'fullName': 'New Orleans Pelicans', 'tricode': 'NOP', 'teamId': '1610612740', 'nickname': 'Pelicans', 'urlName': 'pelicans', 'teamShortName': 'New Orleans', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'New York', 'altCityName': 'New York', 'fullName': 'New York Knicks', 'tricode': 'NYK', 'teamId': '1610612752', 'nickname': 'Knicks', 'urlName': 'knicks', 'teamShortName': 'New York', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Oklahoma City', 'altCityName': 'Oklahoma City', 'fullName': 'Oklahoma City Thunder', 'tricode': 'OKC', 'teamId': '1610612760', 'nickname': 'Thunder', 'urlName': 'thunder', 'teamShortName': 'Oklahoma City', 'confName': 
'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Orlando', 'altCityName': 'Orlando', 'fullName': 'Orlando Magic', 'tricode': 'ORL', 'teamId': '1610612753', 'nickname': 'Magic', 'urlName': 'magic', 'teamShortName': 'Orlando', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Philadelphia', 'altCityName': 'Philadelphia', 'fullName': 'Philadelphia 76ers', 'tricode': 'PHI', 'teamId': '1610612755', 'nickname': '76ers', 'urlName': 'sixers', 'teamShortName': 'Philadelphia', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Phoenix', 'altCityName': 'Phoenix', 'fullName': 'Phoenix Suns', 'tricode': 'PHX', 'teamId': '1610612756', 'nickname': 'Suns', 'urlName': 'suns', 'teamShortName': 'Phoenix', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Portland', 'altCityName': 'Portland', 'fullName': 'Portland Trail Blazers', 'tricode': 'POR', 'teamId': '1610612757', 'nickname': 'Trail Blazers', 'urlName': 'blazers', 'teamShortName': 'Portland', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Sacramento', 'altCityName': 'Sacramento', 'fullName': 'Sacramento Kings', 'tricode': 'SAC', 'teamId': '1610612758', 'nickname': 'Kings', 'urlName': 'kings', 'teamShortName': 'Sacramento', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'San Antonio', 'altCityName': 'San Antonio', 'fullName': 'San Antonio Spurs', 'tricode': 'SAS', 'teamId': '1610612759', 'nickname': 'Spurs', 'urlName': 'spurs', 'teamShortName': 'San Antonio', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Toronto', 'altCityName': 'Toronto', 'fullName': 'Toronto Raptors', 'tricode': 'TOR', 'teamId': '1610612761', 'nickname': 'Raptors', 'urlName': 'raptors', 'teamShortName': 'Toronto', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': 
True, 'isAllStar': False, 'city': 'Utah', 'altCityName': 'Utah', 'fullName': 'Utah Jazz', 'tricode': 'UTA', 'teamId': '1610612762', 'nickname': 'Jazz', 'urlName': 'jazz', 'teamShortName': 'Utah', 'confName': 'summer', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Washington', 'altCityName': 'Washington', 'fullName': 'Washington Wizards', 'tricode': 'WAS', 'teamId': '1610612764', 'nickname': 'Wizards', 'urlName': 'wizards', 'teamShortName': 'Washington', 'confName': 'summer', 'divName': ''}], 'utah': [{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Cleveland', 'altCityName': 'Cleveland', 'fullName': 'Cleveland Cavaliers', 'tricode': 'CLE', 'teamId': '1610612739', 'nickname': 'Cavaliers', 'urlName': 'cavaliers', 'teamShortName': 'Cleveland', 'confName': 'Utah', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Memphis', 'altCityName': 'Memphis', 'fullName': 'Memphis Grizzlies', 'tricode': 'MEM', 'teamId': '1610612763', 'nickname': 'Grizzlies', 'urlName': 'grizzlies', 'teamShortName': 'Memphis', 'confName': 'Utah', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'San Antonio', 'altCityName': 'San Antonio', 'fullName': 'San Antonio Spurs', 'tricode': 'SAS', 'teamId': '1610612759', 'nickname': 'Spurs', 'urlName': 'spurs', 'teamShortName': 'San Antonio', 'confName': 'Utah', 'divName': ''}, {'isNBAFranchise': True, 'isAllStar': False, 'city': 'Utah', 'altCityName': 'Utah', 'fullName': 'Utah Jazz', 'tricode': 'UTA', 'teamId': '1610612762', 'nickname': 'Jazz', 'urlName': 'jazz', 'teamShortName': 'Utah', 'confName': 'Utah', 'divName': ''}]}}

In-class exercise: how many teams does the NBA have in the 2019-2020 season?

In [4]:
import requests

request_url = "http://data.nba.net/prod/v2/2019/teams.json"
response = requests.get(request_url)
response_json = response.json()
teams = response_json["league"]["standard"]
# Continue from here...
In [6]:
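# n_nba_teams is computed in the exercise step above (that cell is not shown)
print("The 2019-2020 NBA season has {} teams".format(n_nba_teams))
The 2019-2020 NBA season has 30 teams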
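
One possible way to continue the exercise is to keep only the entries flagged as NBA franchises and count them. The sketch below runs on a tiny hand-written sample so it needs no network access; it assumes that the isNBAFranchise field is what separates the league's teams from international and All-Star entries, which is consistent with the printed data but is not stated in the original.

```python
# Four entries copied in shortened form from the printed "standard" list.
sample_teams = [
    {"fullName": "Team Croatia", "isNBAFranchise": False},
    {"fullName": "Adelaide 36ers", "isNBAFranchise": False},
    {"fullName": "Atlanta Hawks", "isNBAFranchise": True},
    {"fullName": "Boston Celtics", "isNBAFranchise": True},
]

# Keep only the franchises, then count them.
nba_franchises = [team for team in sample_teams if team["isNBAFranchise"]]
n_sample_franchises = len(nba_franchises)

print(n_sample_franchises)
print([team["fullName"] for team in nba_franchises])
```

Applied to the full teams list from the request, the same filter would produce the count used in the print statement.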

In-class exercise: how many teams belong to the Atlantic and Southwest divisions, and what are their names?

In [7]:
team_dict = {}
for t in teams:
    div = t["divName"]
    full_name = t["fullName"]
    if div in team_dict:
        team_dict[div].append(full_name)
    else:
        team_dict[div] = [full_name]
# Continue from here...
In [9]:
# n_as_teams is computed in the exercise step above (that cell is not shown)
print("Atlantic and Southwest have {} teams in total:".format(n_as_teams))
print("Atlantic: {}".format(team_dict["Atlantic"]))
print("Southwest: {}".format(team_dict["Southwest"]))
Atlantic and Southwest have 10 teams in total:
Atlantic: ['Boston Celtics', 'Brooklyn Nets', 'New York Knicks', 'Philadelphia 76ers', 'Toronto Raptors']
Southwest: ['Dallas Mavericks', 'Houston Rockets', 'Memphis Grizzlies', 'New Orleans Pelicans', 'San Antonio Spurs']
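
The grouping loop in this exercise can also be written with collections.defaultdict, which removes the if/else branch. This is a sketch on a hypothetical four-team subset, not the full response; group_by_division is an illustrative name.

```python
from collections import defaultdict

# Hypothetical subset of the team records; the real list has one dict per team.
sample_teams = [
    {"fullName": "Boston Celtics", "divName": "Atlantic"},
    {"fullName": "Brooklyn Nets", "divName": "Atlantic"},
    {"fullName": "Dallas Mavericks", "divName": "Southwest"},
    {"fullName": "Utah Jazz", "divName": "Northwest"},
]


def group_by_division(teams):
    """Map each division name to the list of team names in it."""
    groups = defaultdict(list)
    for t in teams:
        groups[t["divName"]].append(t["fullName"])
    return dict(groups)


team_dict = group_by_division(sample_teams)
# Missing divisions count as zero teams instead of raising KeyError.
n_as_teams = len(team_dict.get("Atlantic", [])) + len(team_dict.get("Southwest", []))
print(n_as_teams)  # 3
```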

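Scraping web data in XML format

What is XML?

Extensible Markup Language (XML) is a markup format designed to be easy for people to read and easy for programs to parse. Like JSON, it is widely used to present and transmit data on websites.

Source: https://www.w3schools.com/xml/

Steps for scraping web data in XML format

  • Request the data with requests
  • Read the response's .content attribute, for example response.content
  • Parse it with lxml and XPath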

What is XPath?

XPath (XML Path Language) is a query language for locating specific pieces of information in an XML document.

Source: https://www.w3schools.com/xml/xpath_intro.asp

In [10]:
import requests

# A POST request carries its form data in the request body
form_data = {
    "commandid": "GetTown",
    "cityid": "01"
}
request_url = "https://emap.pcsc.com.tw/EMapSDK.aspx"
response = requests.post(request_url, data=form_data)
print(response.status_code)
200

Using the .content attribute

In [11]:
response_content = response.content
print(response_content)
b'<?xml version="1.0" encoding="utf-8"?><iMapSDKOutput><MessageID>00000</MessageID><CommandID>GetTown</CommandID><Status>\xe9\x80\xa3\xe7\xb7\x9a\xe6\x88\x90\xe5\x8a\x9f</Status><TimeStamp>2020/1/15 \xe4\xb8\x8b\xe5\x8d\x88 02:56:33</TimeStamp><GeoPosition><TownID>01</TownID><TownName>\xe6\x9d\xbe\xe5\xb1\xb1\xe5\x8d\x80</TownName><X>121577218</X><Y>25049837</Y></GeoPosition><GeoPosition><TownID>02</TownID><TownName>\xe4\xbf\xa1\xe7\xbe\xa9\xe5\x8d\x80</TownName><X>121567161</X><Y>25033147</Y></GeoPosition><GeoPosition><TownID>03</TownID><TownName>\xe5\xa4\xa7\xe5\xae\x89\xe5\x8d\x80</TownName><X>121534593</X><Y>25026482</Y></GeoPosition><GeoPosition><TownID>04</TownID><TownName>\xe4\xb8\xad\xe5\xb1\xb1\xe5\x8d\x80</TownName><X>121533655</X><Y>25064427</Y></GeoPosition><GeoPosition><TownID>05</TownID><TownName>\xe4\xb8\xad\xe6\xad\xa3\xe5\x8d\x80</TownName><X>121518245</X><Y>25032251</Y></GeoPosition><GeoPosition><TownID>06</TownID><TownName>\xe5\xa4\xa7\xe5\x90\x8c\xe5\x8d\x80</TownName><X>121515830</X><Y>25066142</Y></GeoPosition><GeoPosition><TownID>07</TownID><TownName>\xe8\x90\xac\xe8\x8f\xaf\xe5\x8d\x80</TownName><X>121499745</X><Y>25034807</Y></GeoPosition><GeoPosition><TownID>08</TownID><TownName>\xe6\x96\x87\xe5\xb1\xb1\xe5\x8d\x80</TownName><X>121570280</X><Y>24989800</Y></GeoPosition><GeoPosition><TownID>09</TownID><TownName>\xe5\x8d\x97\xe6\xb8\xaf\xe5\x8d\x80</TownName><X>121607043</X><Y>25054684</Y></GeoPosition><GeoPosition><TownID>10</TownID><TownName>\xe5\x85\xa7\xe6\xb9\x96\xe5\x8d\x80</TownName><X>121589471</X><Y>25069353</Y></GeoPosition><GeoPosition><TownID>11</TownID><TownName>\xe5\xa3\xab\xe6\x9e\x97\xe5\x8d\x80</TownName><X>121525380</X><Y>25090430</Y></GeoPosition><GeoPosition><TownID>12</TownID><TownName>\xe5\x8c\x97\xe6\x8a\x95\xe5\x8d\x80</TownName><X>121503066</X><Y>25132054</Y></GeoPosition></iMapSDKOutput>'
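
The \x.. escapes in the printed bytes are UTF-8 encoded Chinese text. A minimal sketch that decodes two byte sequences copied from the output above (the variable names are illustrative):

```python
# Byte sequences copied from the <Status> and first <TownName> elements above.
status_bytes = b"\xe9\x80\xa3\xe7\xb7\x9a\xe6\x88\x90\xe5\x8a\x9f"
town_bytes = b"\xe6\x9d\xbe\xe5\xb1\xb1\xe5\x8d\x80"

# Decoding with UTF-8 turns the raw bytes back into readable text.
status = status_bytes.decode("utf-8")
town = town_bytes.decode("utf-8")
print(status, town)
```

An XML parser performs this decoding itself, using the encoding named in the XML declaration, which is why the raw .content bytes can be handed to it directly.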

Inspecting the XML tree structure in the Preview tab of the developer tools: districts

XPath for the TownName tag

  • /iMapSDKOutput/GeoPosition/TownName
  • //TownName
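
The two paths above select the same elements. The sketch below shows this on a shortened copy of the response, using the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath, and its paths are written relative to the root element, so the two expressions are adapted accordingly.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the GetTown response: only the first two districts are kept.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<iMapSDKOutput>"
    "<GeoPosition><TownID>01</TownID><TownName>松山區</TownName></GeoPosition>"
    "<GeoPosition><TownID>02</TownID><TownName>信義區</TownName></GeoPosition>"
    "</iMapSDKOutput>"
).encode("utf-8")

root = ET.fromstring(xml_bytes)  # root is the <iMapSDKOutput> element

# Step-by-step path, written relative to the root element.
by_full_path = [e.text for e in root.findall("./GeoPosition/TownName")]
# Descendant search, the counterpart of //TownName.
by_descendant = [e.text for e in root.findall(".//TownName")]

print(by_full_path == by_descendant)  # True
```

The step-by-step path is stricter: it only matches TownName elements that sit directly under a GeoPosition child of the root, while the descendant search matches them at any depth.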

Inspecting the XML tree structure in the Preview tab of the developer tools: road sections

XPath for the rd_name_1 tag

  • /iMapSDKOutput/RoadName/rd_name_1
  • //rd_name_1

XPath for the section_1 tag

  • /iMapSDKOutput/RoadName/section_1
  • //section_1

Inspecting the XML tree structure in the Preview tab of the developer tools: store information

XPath for the POIName tag

  • /iMapSDKOutput/GeoPosition/POIName
  • //POIName

Parsing district information with lxml

In [12]:
from lxml import etree
from io import BytesIO

file = BytesIO(response_content)
tree = etree.parse(file)
town_names = [t.text for t in tree.xpath("//TownName")]  # the full path /iMapSDKOutput/GeoPosition/TownName works as well
print(town_names)
['松山區', '信義區', '大安區', '中山區', '中正區', '大同區', '萬華區', '文山區', '南港區', '內湖區', '士林區', '北投區']

Scraping all store information for Taipei City

In [13]:
import time
import random

tp_711_stores = {}
for town in town_names:
    form_data = {
        "commandid": "SearchStore",
        "city": "台北市",
        "town": town
    }
    r = requests.post("https://emap.pcsc.com.tw/EMapSDK.aspx", data=form_data)
    f = BytesIO(r.content)
    tree = etree.parse(f)
    poi_ids = [t.text.strip() for t in tree.xpath("//POIID")]
    poi_names = [t.text for t in tree.xpath("//POIName")]
    lons = [float(t.text)/1000000 for t in tree.xpath("//X")]
    lats = [float(t.text)/1000000 for t in tree.xpath("//Y")]
    adds = [t.text for t in tree.xpath("//Address")]
    tp_711_stores[town] = []
    for poi_id, poi_name, lon, lat, add in zip(poi_ids, poi_names, lons, lats, adds):
        store_info = {
            "POIID": poi_id,
            "POIName": poi_name,
            "Longitude": lon,
            "Latitude": lat,
            "Address": add
        }
        tp_711_stores[town].append(store_info)
    time.sleep(random.randint(1, 6))
    print("Scraping {}".format(town))
Scraping 松山區
Scraping 信義區
Scraping 大安區
Scraping 中山區
Scraping 中正區
Scraping 大同區
Scraping 萬華區
Scraping 文山區
Scraping 南港區
Scraping 內湖區
Scraping 士林區
Scraping 北投區
In [14]:
print(tp_711_stores["松山區"][0])
print(tp_711_stores["信義區"][0])
print(tp_711_stores["大安區"][0])
{'POIID': '170945', 'POIName': '上弘', 'Longitude': 121.548287390895, 'Latitude': 25.056390968531797, 'Address': '台北市松山區敦化北路168號B2'}
{'POIID': '167651', 'POIName': '一零一', 'Longitude': 121.565077, 'Latitude': 25.033373, 'Address': '台北市信義區信義路五段7號35樓'}
{'POIID': '153319', 'POIName': '大台', 'Longitude': 121.53261437826, 'Latitude': 25.0179598345753, 'Address': '台北市大安區羅斯福路三段283巷14弄16號1樓'}
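
The loop above collects five parallel lists and zips them, which relies on every store having every field. An alternative sketch reads the fields one GeoPosition element at a time, so the values of one store always stay together. It uses the standard library's xml.etree.ElementTree rather than lxml. The sample record is rebuilt from the printed output, the nesting of the fields under GeoPosition is an assumption based on the POIName path shown earlier, and parse_stores is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Illustrative record rebuilt from the printed output above. The trailing
# space in POIID is added here only to show what .strip() is for.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<iMapSDKOutput><GeoPosition>"
    "<POIID>167651 </POIID><POIName>一零一</POIName>"
    "<X>121565077</X><Y>25033373</Y>"
    "<Address>台北市信義區信義路五段7號35樓</Address>"
    "</GeoPosition></iMapSDKOutput>"
).encode("utf-8")


def parse_stores(content):
    """Build one dict per GeoPosition element, keeping its fields together."""
    stores = []
    for geo in ET.fromstring(content).findall("./GeoPosition"):
        stores.append({
            "POIID": geo.findtext("POIID", default="").strip(),
            "POIName": geo.findtext("POIName", default=""),
            # Coordinates are scaled by one million in the response.
            "Longitude": float(geo.findtext("X")) / 1_000_000,
            "Latitude": float(geo.findtext("Y")) / 1_000_000,
            "Address": geo.findtext("Address", default=""),
        })
    return stores


print(parse_stores(xml_bytes)[0]["Longitude"])  # 121.565077
```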

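Scraping web data in HTML format

Steps for scraping web data in HTML format

  • Request the data with requests
  • Read the response's .text attribute, for example response.text
  • Parse it with bs4, using tag names or CSS selectors

Common ways to locate data in HTML

  • The HTML tag name
  • The id assigned to an HTML tag
  • The class assigned to an HTML tag
  • The CSS selector of the element that holds the data
  • The XPath of the element that holds the data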

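A Chrome extension that helps locate CSS selectors

SelectorGadget

How to use SelectorGadget

  1. Click the SelectorGadget extension icon
  2. Watch the CSS selector that SelectorGadget displays
  3. Move the mouse over the element you want to locate
  4. Left-click the data you want to locate; the number after Clear shows how many elements are selected
  5. Click any elements you do not want (they are then marked in red), keeping an eye on the CSS selector and the number after Clear

Demonstrating SelectorGadget with Avengers: Endgame (2019)

  • Movie title
  • Movie poster
  • Rating
  • Genres
  • Cast

Commonly used bs4 functions

BeautifulSoup(): creates a BeautifulSoup object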

In [15]:
# !pip install -U BeautifulSoup4
from bs4 import BeautifulSoup

request_url = "https://www.imdb.com/title/tt4154796"
response = requests.get(request_url)
response_text = response.text
soup = BeautifulSoup(response_text, "lxml")  # name the parser explicitly
print(type(soup))
<class 'bs4.BeautifulSoup'>

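Commonly used methods

  • soup.find(): returns the first element that matches a tag name
  • soup.find_all(): returns all elements that match a tag name
  • soup.select(): returns all elements that match a CSS selector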
In [16]:
print(soup.find("h1"))
print(type(soup.find("h1")))
print(soup.find("h1").text)
print(soup.select("strong span"))
print(float(soup.select("strong span")[0].text))
<h1 class="">復仇者聯盟:終局之戰 <span id="titleYear">(<a href="/year/2019/">2019</a>)</span> </h1>
<class 'bs4.element.Tag'>
復仇者聯盟:終局之戰 (2019) 
[<span itemprop="ratingValue">8.5</span>]
8.5

Commonly used element.Tag attributes and methods

  • element.Tag.text: the text content of the tag
  • element.Tag.get(attr): the value of the given attribute of the tag
In [17]:
print(len(soup.find_all("img")))
print(soup.find_all("img")[2])
print(soup.find_all("img")[2].get("alt"))
print(soup.find_all("img")[2].get("src"))
78
<img class="pro_logo" src="https://m.media-amazon.com/images/G/01/imdb/IMDbConsumerSiteProTitleViews/images/logo/pro_logo_dark-3176609149._CB455053166_.png"/>
None
https://m.media-amazon.com/images/G/01/imdb/IMDbConsumerSiteProTitleViews/images/logo/pro_logo_dark-3176609149._CB455053166_.png
In [18]:
print(soup.select("strong span"))
print(float(soup.select("strong span")[0].text))
[<span itemprop="ratingValue">8.5</span>]
8.5

In-class exercise: use requests with bs4 to extract the genres of Avengers: Endgame (2019)

In [19]:
response = requests.get("https://www.imdb.com/title/tt4154796")
soup = BeautifulSoup(response.text, "lxml")  # name the parser explicitly
# Continue from here...
In [21]:
print(genre)
['Action', 'Adventure', 'Drama']

In-class exercise: use requests with bs4 to extract the cast of Avengers: Endgame (2019)

In [22]:
response = requests.get("https://www.imdb.com/title/tt4154796")
soup = BeautifulSoup(response.text, "lxml")  # name the parser explicitly
# Continue from here...
In [24]:
print(cast)
['Robert Downey Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Don Cheadle', 'Paul Rudd', 'Benedict Cumberbatch', 'Chadwick Boseman', 'Brie Larson', 'Tom Holland', 'Karen Gillan', 'Zoe Saldana', 'Evangeline Lilly']

In-class exercise: define a function get_movie_data(movie_url)

In [34]:
get_movie_data("https://www.imdb.com/title/tt4154796")
Out[34]:
{'movieTitle': '復仇者聯盟:終局之戰(2019)',
 'moviePoster': 'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg',
 'movieRating': 8.5,
 'movieGenre': ['Action', 'Adventure', 'Drama'],
 'movieCast': ['Robert Downey Jr.',
  'Chris Evans',
  'Mark Ruffalo',
  'Chris Hemsworth',
  'Scarlett Johansson',
  'Jeremy Renner',
  'Don Cheadle',
  'Paul Rudd',
  'Benedict Cumberbatch',
  'Chadwick Boseman',
  'Brie Larson',
  'Tom Holland',
  'Karen Gillan',
  'Zoe Saldana',
  'Evangeline Lilly']}

Making get_movie_data() more convenient to use

Pass params to get()

query_string_parameters = {
    'q': 'Avengers: Endgame',
    'ref_': 'nv_sr_sm'
}


In [35]:
query_string_parameters = {
    'q': 'Avengers: Endgame',
    'ref_': 'nv_sr_sm'
}
request_url = "https://www.imdb.com/find"
response = requests.get(request_url, params=query_string_parameters)
print(response.status_code)
200
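
The params dictionary ends up as a query string appended to the URL. The sketch below builds such a URL with the standard library's urllib.parse.urlencode to show roughly what is requested. It is an approximation for illustration; the exact encoding that requests applies may differ in detail.

```python
from urllib.parse import urlencode

query_string_parameters = {
    "q": "Avengers: Endgame",
    "ref_": "nv_sr_sm",
}
request_url = "https://www.imdb.com/find"

# Percent-encode the parameters and append them after a question mark.
full_url = request_url + "?" + urlencode(query_string_parameters)
print(full_url)
# https://www.imdb.com/find?q=Avengers%3A+Endgame&ref_=nv_sr_sm
```

With requests, response.url shows the URL that was actually requested.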

Use the CSS selector .result_text > a to extract all search results

In [36]:
soup = BeautifulSoup(response.text, "lxml")  # name the parser explicitly
result_hrefs = [e.get("href") for e in soup.select(".result_text > a")]
print(result_hrefs)
['/title/tt4154796/', '/title/tt10258872/', '/title/tt9827182/', '/title/tt10025738/', '/title/tt10042140/', '/title/tt10022970/', '/title/tt10778688/', '/title/tt10213650/', '/search/keyword?keywords=reference-to-avengers-endgame', '/search/keyword?keywords=reference-to-%27avengers-endgame%27-2019']

URL of the movie page for the closest search result

In [37]:
movie_url = "https://www.imdb.com" + result_hrefs[0]
print(movie_url)
https://www.imdb.com/title/tt4154796/
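
Taking result_hrefs[0] works here because the first result is a title page, but the list also contains /search/ links. A slightly more defensive sketch picks the first /title/ link and joins it to the site root with urllib.parse.urljoin. The function name first_title_url is made up for illustration, and the list is a few entries taken from the output above.

```python
from urllib.parse import urljoin

# A few entries from the result_hrefs list printed above.
result_hrefs = [
    "/title/tt4154796/",
    "/title/tt10258872/",
    "/search/keyword?keywords=reference-to-avengers-endgame",
]


def first_title_url(hrefs, base="https://www.imdb.com"):
    """Return the absolute URL of the first /title/ link, or None."""
    for href in hrefs:
        if href.startswith("/title/"):
            return urljoin(base, href)
    return None


print(first_title_url(result_hrefs))
# https://www.imdb.com/title/tt4154796/
```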

In-class exercise: define a function get_movie_data(movie_title)

In [39]:
get_movie_data("Avengers: Endgame (2019)")
Out[39]:
{'movieTitle': '復仇者聯盟:終局之戰 (2019)',
 'moviePoster': 'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg',
 'movieRating': 8.5,
 'movieGenre': ['Action', 'Adventure', 'Drama'],
 'movieCast': ['Robert Downey Jr.',
  'Chris Evans',
  'Mark Ruffalo',
  'Chris Hemsworth',
  'Scarlett Johansson',
  'Jeremy Renner',
  'Don Cheadle',
  'Paul Rudd',
  'Benedict Cumberbatch',
  'Chadwick Boseman',
  'Brie Larson',
  'Tom Holland',
  'Karen Gillan',
  'Zoe Saldana',
  'Evangeline Lilly']}

Sometimes a request sent with requests has to carry cookies; otherwise the response will not contain the expected data

In [40]:
response = requests.get("https://www.ptt.cc/bbs/Gossiping/index.html")
print(response.text)
<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">




	</head>
    <body>
		
<div class="bbs-screen bbs-content">
    <div class="over18-notice">
        <p>本網站已依網站內容分級規定處理</p>

        <p>警告︰您即將進入之看板內容需滿十八歲方可瀏覽。</p>

        <p>若您尚未年滿十八歲,請點選離開。若您已滿十八歲,亦不可將本區之內容派發、傳閱、出售、出租、交給或借予年齡未滿18歲的人士瀏覽,或將本網站內容向該人士出示、播放或放映。</p>
    </div>
</div>

<div class="bbs-screen bbs-content center clear">
    <form action="/ask/over18" method="post">
        <input type="hidden" name="from" value="/bbs/Gossiping/index.html">
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="yes" value="yes">我同意,我已年滿十八歲<br><small>進入</small></button>
        </div>
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="no" value="no">未滿十八歲或不同意本條款<br><small>離開</small></button>
        </div>
    </form>
</div>

		

<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>


		
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js"></script>

    </body>
</html>
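
A scraper can check which page it received before trying to parse it. This is a minimal sketch, assuming the confirmation page can be recognised by the over18-notice class seen in the output above. The function name is_age_gate is made up for illustration, and the two strings are short stand-ins for full pages.

```python
def is_age_gate(html):
    """Return True if the page is the over-18 confirmation notice."""
    return 'class="over18-notice"' in html


# Short stand-ins for the two kinds of page; real responses are much longer.
gate_page = '<div class="over18-notice"><p>...</p></div>'
board_page = '<div class="r-list-container action-bar-margin bbs-screen"></div>'

print(is_age_gate(gate_page), is_age_gate(board_page))  # True False
```

When the check returns True, the request needs to be repeated with the cookie that the site sets after the confirmation button is clicked.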

In [41]:
response = requests.get("http://www.fantasy-sky.com/ContentList.aspx?section=002")
soup = BeautifulSoup(response.text, "lxml")  # name the parser explicitly
movie_titles = [i.text for i in soup.select(".movies-name")]
print(movie_titles)
['唐頓莊園', '星際救援', '雙子殺手', '牠:第二章', '黑魔女2', '電流大戰', '屍樂園:髒比雙拼', '金翅雀', '玩命關頭:特別行動', '全面攻佔3:天使救援', '舞孃騙很大', '盧斯', '瞞天機密', '弒婚遊戲', '獅子王', '五月天人生無限公司', '花椒之味', '下半場', '情牽拉麵茶', '光', '追龍II:賊王', '流浪地球', '一定要結婚嗎', '亡命之途', '柴公園', '東京喰種', '匿名的畫作', '殺手寓言', '新聞記者', '小委託人', '驅魔使者', '辛巴', '門當護不對', '電影哆啦A夢:大雄的月球探測記', '極限逃生', '陪審團', '跳痛先生', '出發吧!我的脫單假期', '偵兇']

Inspecting cookies in the developer tools

In [42]:
import requests

response = requests.get("https://www.ptt.cc/bbs/Gossiping/index.html", cookies={'over18': '1'})
print(response.text)
<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 Gossiping 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">




	</head>
    <body>
		
<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/bbs/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" href="/contact.html">聯絡資訊</a>
	</div>
</div>

<div id="main-container">
	<div id="action-bar-container">
		<div class="action-bar">
			<div class="btn-group btn-group-dir">
				<a class="btn selected" href="/bbs/Gossiping/index.html">看板</a>
				<a class="btn" href="/man/Gossiping/index.html">精華區</a>
			</div>
			<div class="btn-group btn-group-paging">
				<a class="btn wide" href="/bbs/Gossiping/index1.html">最舊</a>
				<a class="btn wide" href="/bbs/Gossiping/index39176.html">&lsaquo; 上頁</a>
				<a class="btn wide disabled">下頁 &rsaquo;</a>
				<a class="btn wide" href="/bbs/Gossiping/index.html">最新</a>
			</div>
		</div>
	</div>

	<div class="r-list-container action-bar-margin bbs-screen">
		<div class="search-bar">
			<form type="get" action="search" id="search-bar">
				<input class="query" type="text" name="q" value="" placeholder="搜尋文章&#x22ef;">
			</form>
		</div>

		
		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">1</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055284.A.87A.html">[問卦] 在微信上收到長輩的文要怎回</a>
			
			</div>
			<div class="meta">
				<div class="author">leolivein</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E5%9C%A8%E5%BE%AE%E4%BF%A1%E4%B8%8A%E6%94%B6%E5%88%B0%E9%95%B7%E8%BC%A9%E7%9A%84%E6%96%87%E8%A6%81%E6%80%8E%E5%9B%9E">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aleolivein">搜尋看板內 leolivein 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">5</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055313.A.1D1.html">[問卦] 要如何擁有一堆無腦粉絲的八卦</a>
			
			</div>
			<div class="meta">
				<div class="author">meblessme</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E8%A6%81%E5%A6%82%E4%BD%95%E6%93%81%E6%9C%89%E4%B8%80%E5%A0%86%E7%84%A1%E8%85%A6%E7%B2%89%E7%B5%B2%E7%9A%84%E5%85%AB%E5%8D%A6">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Ameblessme">搜尋看板內 meblessme 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">4</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055315.A.976.html">[問卦] 有沒有南澳鄉的八卦</a>
			
			</div>
			<div class="meta">
				<div class="author">azt911231</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E6%9C%89%E6%B2%92%E6%9C%89%E5%8D%97%E6%BE%B3%E9%84%89%E7%9A%84%E5%85%AB%E5%8D%A6">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aazt911231">搜尋看板內 azt911231 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">7</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055318.A.F06.html">[問卦] 出生在哪個國家最爽</a>
			
			</div>
			<div class="meta">
				<div class="author">paulabxz123</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E5%87%BA%E7%94%9F%E5%9C%A8%E5%93%AA%E5%80%8B%E5%9C%8B%E5%AE%B6%E6%9C%80%E7%88%BD">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Apaulabxz123">搜尋看板內 paulabxz123 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">1</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055333.A.0BA.html">[問卦] 捷克獵人的八卦?</a>
			
			</div>
			<div class="meta">
				<div class="author">Clarence</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E6%8D%B7%E5%85%8B%E7%8D%B5%E4%BA%BA%E7%9A%84%E5%85%AB%E5%8D%A6%EF%BC%9F">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3AClarence">搜尋看板內 Clarence 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f3">11</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055357.A.A27.html">[新聞] 上海譴責布拉格友台柯文哲:雙城論壇續辦</a>
			
			</div>
			<div class="meta">
				<div class="author">safefree</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E4%B8%8A%E6%B5%B7%E8%AD%B4%E8%B2%AC%E5%B8%83%E6%8B%89%E6%A0%BC%E5%8F%8B%E5%8F%B0%E6%9F%AF%E6%96%87%E5%93%B2%EF%BC%9A%E9%9B%99%E5%9F%8E%E8%AB%96%E5%A3%87%E7%BA%8C%E8%BE%A6">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Asafefree">搜尋看板內 safefree 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055362.A.530.html">[新聞] 掃蕩伊斯蘭好戰分子 德國警方分兵多路搜</a>
			
			</div>
			<div class="meta">
				<div class="author">dragonjj</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E6%8E%83%E8%95%A9%E4%BC%8A%E6%96%AF%E8%98%AD%E5%A5%BD%E6%88%B0%E5%88%86%E5%AD%90&#43;%E5%BE%B7%E5%9C%8B%E8%AD%A6%E6%96%B9%E5%88%86%E5%85%B5%E5%A4%9A%E8%B7%AF%E6%90%9C">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Adragonjj">搜尋看板內 dragonjj 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055384.A.77C.html">[問卦] 小隻馬同事露出肩帶</a>
			
			</div>
			<div class="meta">
				<div class="author">ComeThrough</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E5%B0%8F%E9%9A%BB%E9%A6%AC%E5%90%8C%E4%BA%8B%E9%9C%B2%E5%87%BA%E8%82%A9%E5%B8%B6">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3AComeThrough">搜尋看板內 ComeThrough 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f3">75</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055386.A.3AE.html">[爆卦] 黃淵夏:反滲透法第一天白狼被約談</a>
			
			</div>
			<div class="meta">
				<div class="author">GO19870325</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E7%88%86%E5%8D%A6%5D&#43;%E9%BB%83%E6%B7%B5%E5%A4%8F%3A%E5%8F%8D%E6%BB%B2%E9%80%8F%E6%B3%95%E7%AC%AC%E4%B8%80%E5%A4%A9%E7%99%BD%E7%8B%BC%E8%A2%AB%E7%B4%84%E8%AB%87">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3AGO19870325">搜尋看板內 GO19870325 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055387.A.9E3.html">Re: [問卦] 中國為何這麼容易出現零號病人阿</a>
			
			</div>
			<div class="meta">
				<div class="author">cdcardabc</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E4%B8%AD%E5%9C%8B%E7%82%BA%E4%BD%95%E9%80%99%E9%BA%BC%E5%AE%B9%E6%98%93%E5%87%BA%E7%8F%BE%E9%9B%B6%E8%99%9F%E7%97%85%E4%BA%BA%E9%98%BF">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Acdcardabc">搜尋看板內 cdcardabc 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f3">41</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055469.A.CA2.html">[新聞] 蔡總統10:40發表談話 將公布施行「反滲</a>
			
			</div>
			<div class="meta">
				<div class="author">Gaffky</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E8%94%A1%E7%B8%BD%E7%B5%B110%EF%BC%9A40%E7%99%BC%E8%A1%A8%E8%AB%87%E8%A9%B1&#43;%E5%B0%87%E5%85%AC%E5%B8%83%E6%96%BD%E8%A1%8C%E3%80%8C%E5%8F%8D%E6%BB%B2">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3AGaffky">搜尋看板內 Gaffky 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">4</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055478.A.3E9.html">[問卦] 故宮有什麼必看的啊?</a>
			
			</div>
			<div class="meta">
				<div class="author">joe911joeop</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E6%95%85%E5%AE%AE%E6%9C%89%E4%BB%80%E9%BA%BC%E5%BF%85%E7%9C%8B%E7%9A%84%E5%95%8A%EF%BC%9F">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Ajoe911joeop">搜尋看板內 joe911joeop 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">2</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055496.A.D81.html">[問卦] 衛生紙為什麼要兩張?</a>
			
			</div>
			<div class="meta">
				<div class="author">LAKobeBryant</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E8%A1%9B%E7%94%9F%E7%B4%99%E7%82%BA%E4%BB%80%E9%BA%BC%E8%A6%81%E5%85%A9%E5%BC%B5%EF%BC%9F">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3ALAKobeBryant">搜尋看板內 LAKobeBryant 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055537.A.4B6.html">[問卦] 牛寺哥是不是很可憐那</a>
			
			</div>
			<div class="meta">
				<div class="author">taker627</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E7%89%9B%E5%AF%BA%E5%93%A5%E6%98%AF%E4%B8%8D%E6%98%AF%E5%BE%88%E5%8F%AF%E6%86%90%E9%82%A3">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Ataker627">搜尋看板內 taker627 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">2</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055562.A.D0D.html">Re: [問卦] 柯粉變多了嗎?</a>
			
			</div>
			<div class="meta">
				<div class="author">opfish</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E6%9F%AF%E7%B2%89%E8%AE%8A%E5%A4%9A%E4%BA%86%E5%97%8E%EF%BC%9F">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aopfish">搜尋看板內 opfish 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">2</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055634.A.C0A.html">Re: [問卦] 沒服過兵役是不是就沒資格喊台獨?</a>
			
			</div>
			<div class="meta">
				<div class="author">klm</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E6%B2%92%E6%9C%8D%E9%81%8E%E5%85%B5%E5%BD%B9%E6%98%AF%E4%B8%8D%E6%98%AF%E5%B0%B1%E6%B2%92%E8%B3%87%E6%A0%BC%E5%96%8A%E5%8F%B0%E7%8D%A8%3F">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aklm">搜尋看板內 klm 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">3</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055655.A.2DA.html">[問卦] 可憐吶~慈濟竟然放棄line改用telegram</a>
			
			</div>
			<div class="meta">
				<div class="author">TellthEtRee</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E5%8F%AF%E6%86%90%E5%90%B6%EF%BD%9E%E6%85%88%E6%BF%9F%E7%AB%9F%E7%84%B6%E6%94%BE%E6%A3%84line%E6%94%B9%E7%94%A8telegram">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3ATellthEtRee">搜尋看板內 TellthEtRee 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">5</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055679.A.281.html">[問卦] 請問高雄市長今天有上班嗎</a>
			
			</div>
			<div class="meta">
				<div class="author">cococat1028</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%95%8F%E5%8D%A6%5D&#43;%E8%AB%8B%E5%95%8F%E9%AB%98%E9%9B%84%E5%B8%82%E9%95%B7%E4%BB%8A%E5%A4%A9%E6%9C%89%E4%B8%8A%E7%8F%AD%E5%97%8E">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Acococat1028">搜尋看板內 cococat1028 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">2</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055687.A.B60.html">Re: [討論] 劉仕傑臉書:對不起,我看不下去。 </a>
			
			</div>
			<div class="meta">
				<div class="author">kuluma</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D&#43;%E5%8A%89%E4%BB%95%E5%82%91%E8%87%89%E6%9B%B8%3A%E5%B0%8D%E4%B8%8D%E8%B5%B7%EF%BC%8C%E6%88%91%E7%9C%8B%E4%B8%8D%E4%B8%8B%E5%8E%BB%E3%80%82&#43;">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Akuluma">搜尋看板內 kuluma 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
            
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">2</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1579055798.A.62A.html">Re: [新聞] 上海解除布拉格姊妹市 柯文哲:中國無權</a>
			
			</div>
			<div class="meta">
				<div class="author">jiouje</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E4%B8%8A%E6%B5%B7%E8%A7%A3%E9%99%A4%E5%B8%83%E6%8B%89%E6%A0%BC%E5%A7%8A%E5%A6%B9%E5%B8%82&#43;%E6%9F%AF%E6%96%87%E5%93%B2%EF%BC%9A%E4%B8%AD%E5%9C%8B%E7%84%A1%E6%AC%8A">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Ajiouje">搜尋看板內 jiouje 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/15</div>
				<div class="mark"></div>
			</div>
		</div>

		
        
        <div class="r-list-sep"></div>
            
                
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1566347622.A.9C7.html">[公告] 八卦板板規(2019.08.21)</a>
			
			</div>
			<div class="meta">
				<div class="author">arsonlolita</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D&#43;%E5%85%AB%E5%8D%A6%E6%9D%BF%E6%9D%BF%E8%A6%8F%282019.08.21%29">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aarsonlolita">搜尋看板內 arsonlolita 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 8/21</div>
				<div class="mark">!</div>
			</div>
		</div>

            
                
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f3">57</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1578271293.A.A78.html">[協尋] 車禍過世 1/2 甲提南路立新一街 </a>
			
			</div>
			<div class="meta">
				<div class="author">arsonlolita</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%8D%94%E5%B0%8B%5D&#43;%E8%BB%8A%E7%A6%8D%E9%81%8E%E4%B8%96&#43;1%2F2&#43;%E7%94%B2%E6%8F%90%E5%8D%97%E8%B7%AF%E7%AB%8B%E6%96%B0%E4%B8%80%E8%A1%97&#43;">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aarsonlolita">搜尋看板內 arsonlolita 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/06</div>
				<div class="mark"></div>
			</div>
		</div>

            
                
        
        
		<div class="r-ent">
			<div class="nrec"></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1577812250.A.592.html">[公告] 赤鴻飛羽,一月份置底閒聊文</a>
			
			</div>
			<div class="meta">
				<div class="author">Bignana</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D&#43;%E8%B5%A4%E9%B4%BB%E9%A3%9B%E7%BE%BD%EF%BC%8C%E4%B8%80%E6%9C%88%E4%BB%BD%E7%BD%AE%E5%BA%95%E9%96%92%E8%81%8A%E6%96%87">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3ABignana">搜尋看板內 Bignana 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/01</div>
				<div class="mark">M</div>
			</div>
		</div>

            
                
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f2">8</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1578694625.A.277.html">[協尋] 1/8晚間北市光復橋車禍</a>
			
			</div>
			<div class="meta">
				<div class="author">DirKuan</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%8D%94%E5%B0%8B%5D&#43;1%2F8%E6%99%9A%E9%96%93%E5%8C%97%E5%B8%82%E5%85%89%E5%BE%A9%E6%A9%8B%E8%BB%8A%E7%A6%8D">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3ADirKuan">搜尋看板內 DirKuan 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/11</div>
				<div class="mark"></div>
			</div>
		</div>

            
                
        
        
		<div class="r-ent">
			<div class="nrec"><span class="hl f3">10</span></div>
			<div class="title">
			
				<a href="/bbs/Gossiping/M.1578961532.A.51E.html">[協尋] 高雄左營區 行車記錄器 </a>
			
			</div>
			<div class="meta">
				<div class="author">arsonlolita</div>
				<div class="article-menu">
					
					<div class="trigger">&#x22ef;</div>
					<div class="dropdown">
						<div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%8D%94%E5%B0%8B%5D&#43;%E9%AB%98%E9%9B%84%E5%B7%A6%E7%87%9F%E5%8D%80&#43;%E8%A1%8C%E8%BB%8A%E8%A8%98%E9%8C%84%E5%99%A8&#43;">搜尋同標題文章</a></div>
						
						<div class="item"><a href="/bbs/Gossiping/search?q=author%3Aarsonlolita">搜尋看板內 arsonlolita 的文章</a></div>
						
					</div>
					
				</div>
				<div class="date"> 1/14</div>
				<div class="mark"></div>
			</div>
		</div>

            
        
	</div>

    
<div class="bbs-screen bbs-footer-message">本網站已依台灣網站內容分級規定處理。此區域為限制級,未滿十八歲者不得瀏覽。</div>

</div>

		

<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>


		
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js"></script>

    </body>
</html>

In [43]:
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.fantasy-sky.com/ContentList.aspx?section=002", cookies={'COOKIE_LANGUAGE': 'en'})
soup = BeautifulSoup(response.text, "html.parser") # name the parser explicitly to avoid bs4's "no parser specified" warning
movie_titles = [i.text for i in soup.select(".movies-name")]
print(movie_titles)
['Downton Abbey', 'Ad Astra', 'Gemini Man', 'It Chapter Two', 'Maleficent: Mistress of Evil', 'The Current War', 'Zombieland: Double Tap', 'The Goldfinch', 'Fast & Furious Presents: Hobbs…', 'Angel Has Fallen', 'Hustlers', 'Luce', 'Official Secrets', 'Ready or Not', 'Disney’s The Lion King', 'Mayday Life', 'Fagara', 'We Are Champions', 'Ramen Shop', 'Guang', 'Chasing The Dragon II…', 'The Wandering Earth', 'Marriage Hunting Beauty', 'Paradise Next', 'Shiba-Park', "Tokyo Ghoul 'S'", 'One Last Deal', 'The Fable', 'The Journalist', 'My First Client', 'The Divine Fury', 'Simmba', 'Cold Feet', "Doraemon: Nobita's…", 'EXIT', 'Juror 8', 'The Man Who Feels No Pain', 'Our Happy Holiday', 'The Invisible Witness']

In-class exercise: scrape the full list of China Airlines in-flight movies

In [44]:
ca_movie_urls = ["http://www.fantasy-sky.com/ContentList.aspx?section=002&category=0020{}".format(i) for i in range(1, 5)]
# Continue from here ...
In [46]:
print(ca_movie_titles)
['Downton Abbey', 'Ad Astra', 'Gemini Man', 'It Chapter Two', 'Maleficent: Mistress of Evil', 'The Current War', 'Zombieland: Double Tap', 'The Goldfinch', 'Fast & Furious Presents: Hobbs…', 'Angel Has Fallen', 'Hustlers', 'Luce', 'Official Secrets', 'Ready or Not', 'Disney’s The Lion King', 'Mayday Life', 'Fagara', 'We Are Champions', 'Ramen Shop', 'Guang', 'Chasing The Dragon II…', 'The Wandering Earth', 'Marriage Hunting Beauty', 'Paradise Next', 'Shiba-Park', "Tokyo Ghoul 'S'", 'One Last Deal', 'The Fable', 'The Journalist', 'My First Client', 'The Divine Fury', 'Simmba', 'Cold Feet', "Doraemon: Nobita's…", 'EXIT', 'Juror 8', 'The Man Who Feels No Pain', 'Our Happy Holiday', 'The Invisible Witness', 'DISNEY AND PIXAR’S Inside Out', 'Up', 'Toy Story 2', 'Toy Story 3', 'The Peanuts Movie', 'Shark Tale', 'The Lego Batman Movie', 'Toy Story', "Tim Burton's Corpse Bride", 'Smallfoot', 'Ice Age: Collision Course', 'Ferdinand', 'Railroad Tigers', 'So Young', 'A Simple Life', 'Beyond Beauty - Taiwan from Above', 'Dying To Survive', 'Infernal Affairs', 'Millennium Mambo', 'The Golden Era', 'Three Times', 'The Wedding Banquet', 'Cloud In The Wind', 'Fall in Love at First Kiss', 'Integrity', 'Still Human', 'Run for Dream', 'Love The Way You Are', 'More Than Blue', 'Stolen Identity', "Long Day's Journey Into Night", 'Shadow', 'Tracey', 'Masquerade Hotel', 'The Confidence Man JP: The Movie', 'Inseparable Bros', 'Cheer Boys!!', 'The White Storm 2 – Drug Lords', 'The 12th Man', 'Kingdom', "Jupiter's Moon", 'Simpel…', 'A Real Vermeer', 'My Hero Academia: Two Heroes', "Midsummer's Equation", 'Money', 'Gold', 'The Gangster, The Cop, The Devil', 'My Extraordinary Summer with Tess', 'The Lady Improper', 'Another World', 'Who You Think I Am', 'The Conductor', 'The Shiny Shrimps', 'All About Me', 'A Long Goodbye', 'Miss & Mrs Cops', 'Capernaum', 'The Disaster Artist', 'Black Swan', 'Crazy Heart', 'Moulin Rouge', 'The Devil Wears Prada', 'Walk the Line', 'The Hobbit: The Battle Of…', 
'Disney’s Maleficent', 'Zombieland', "Bridget Jones's Baby", 'American Made', 'Home Again', 'Johnny English Reborn', 'Romeo + Juliet', 'The Great Wall', 'Spider-ManTM: Far From Home', 'Pokémon Detective Pikachu', 'Avengers: Infinity War', 'The Avengers', 'Shaft', 'Love Actually', 'Before I Fall', 'Godzilla', 'Superman Returns', 'Invictus', 'Chef', 'Spider-ManTM: Homecoming', 'Avengers: Age of Ultron', 'London Has Fallen', 'Never Let Me Go', 'John Wick', 'The Book of Henry', "A Dog's Purpose", 'The Lost City of Z', 'Love the Coopers', 'Runner Runner', 'The Intern', 'Café Society', 'Sherlock Holmes: A Game of Shadows', 'Deepwater Horizon', 'I, Daniel Blake', 'Captain America: Civil War', 'Iron Man', 'Iron Man 2', 'Iron Man 3', 'The Pianist', 'The Curious Case of Benjamin Button', 'Australia', 'The Tree Of Life', 'The Bucket List', 'The Legend of Tarzan', 'Furious 7', 'The Fate of the Furious', 'The Book Thief', 'Crazy, Stupid, Love.', 'The Holiday', 'The Mummy', 'Unstoppable', 'Straight Outta Compton', 'The Drop', 'The Judge', 'X-Men', 'X-Men: First Class', 'X-Men: Days of Future Past', 'X-Men: Apocalypse', 'Wrath of the Titans', 'Why Him?', 'The Shawshank Redemption', 'Wonder Woman']
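
One possible way to continue the exercise, offered as a minimal sketch rather than the reference solution: loop over the category URLs, scrape each page, and extend a single list. `scrape_page` is a hypothetical helper standing in for the request-and-select steps of the single-page cell (`requests.get`, `BeautifulSoup`, `select(".movies-name")`); the stand-in below only returns canned data so the looping logic can run offline.

```python
from typing import Callable, Iterable, List


def collect_titles(urls: Iterable[str], scrape_page: Callable[[str], List[str]]) -> List[str]:
    """Scrape every URL in order and return all titles in one flat list."""
    all_titles: List[str] = []
    for url in urls:
        # extend (not append) keeps the result flat instead of a list of lists
        all_titles.extend(scrape_page(url))
    return all_titles


# Hypothetical stand-in for the real scraper: maps a URL to canned titles
fake_pages = {
    "page-1": ["Downton Abbey", "Ad Astra"],
    "page-2": ["Up", "Toy Story"],
}
demo_titles = collect_titles(fake_pages, fake_pages.get)
print(demo_titles)
```

With a real `scrape_page`, the call would look like `collect_titles(ca_movie_urls, scrape_page)`.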

In-class exercise: find the highest-rated China Airlines in-flight movie

In [50]:
print(ca_movie_titles[best_movie_index])
The Shawshank Redemption
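
The lookup printed above depends on `best_movie_index`, whose computation the slide leaves out. A minimal sketch, assuming a list of ratings aligned position by position with the titles, where `None` marks a movie whose rating could not be found. `index_of_best`, `demo_titles`, and `demo_ratings` are made-up names and illustration data, not scraped results.

```python
from typing import List, Optional


def index_of_best(ratings: List[Optional[float]]) -> int:
    """Return the position of the highest rating, skipping missing (None) values."""
    rated = [(rating, i) for i, rating in enumerate(ratings) if rating is not None]
    if not rated:
        raise ValueError("no ratings available")
    # Compare on the rating only; on ties max() keeps the earliest position
    best_rating, best_index = max(rated, key=lambda pair: pair[0])
    return best_index


# Made-up illustration data, not actual ratings
demo_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
demo_ratings = [7.4, None, 9.3, 9.3]
best_movie_index = index_of_best(demo_ratings)
print(demo_titles[best_movie_index])
```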

Web Scraping in a Nutshell

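  • Requesting data
    • Use Quick JavaScript Switcher to tell whether the data is delivered under XHR or Doc
    • Check Preview/Response in Chrome DevTools to confirm the data format
    • Inspect the request's Request URL/Request Method/Query String Parameters/Form Data/Cookies in Chrome DevTools
    • Send the request with requests and receive the response
  • Parsing data
    • JSON: call the response's .json() method, then work directly with the resulting Python data structures
    • XML: take the response's .content attribute and parse it with lxml and XPath
    • HTML: take the response's .text attribute and parse it with bs4 and CSS selectors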
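
The three parsing routes can be put side by side in one function. This is a minimal sketch under stated assumptions: `parse_response` is a hypothetical helper, and choosing the route from the `Content-Type` header is an assumption of this example (the slides decide by inspecting the response in Chrome DevTools). The imports sit inside the branches so that only the parser actually needed has to be installed.

```python
def parse_response(response):
    """Pick a parsing route from the Content-Type header of a requests-style response."""
    content_type = response.headers.get("Content-Type", "").lower()
    if "json" in content_type:
        # JSON: .json() already returns dicts and lists
        return response.json()
    if "xml" in content_type and "html" not in content_type:
        # XML: parse the raw bytes with lxml, then query with .xpath(...)
        from lxml import etree
        return etree.fromstring(response.content)
    # HTML (default): parse the decoded text with bs4, then query with .select(...)
    from bs4 import BeautifulSoup
    return BeautifulSoup(response.text, "html.parser")


class FakeJsonResponse:
    """Hypothetical stand-in for a requests.Response that carries JSON."""
    headers = {"Content-Type": "application/json; charset=utf-8"}

    def json(self):
        return {"title": "Up"}


print(parse_response(FakeJsonResponse()))
```

With a real response the call would be `parse_response(requests.get(url))`.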

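Browser automation

While working out how to make get_movie_data() more convenient, we carried out the following steps by hand

  1. Go to the https://www.imdb.com/ home page
  2. Type in the movie title
  3. Click search
  4. Click the Movie category tab
  5. Click the most relevant search result

These steps can be automated with selenium!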

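What is Selenium

  • Selenium is a solution for automated browser testing
  • Python calls the browser driver through Selenium WebDriver, and the browser driver in turn controls the browser
  • Support is best for the two mainstream browsers, Google Chrome and Mozilla Firefox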

Selenium setup: remove unneeded Python versions from the classroom computers

  • The Python.org distribution
  • Anaconda installations located outside the user directory

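Selenium setup: installing Miniconda

  1. Go to the Miniconda download page and pick the Python 3.X installer that matches your operating system
  2. Click Next as prompted
  3. Choose the installation path
  4. Click I Agree as prompted
  5. Wait for the installation to finish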

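Selenium setup: creating the environment

  1. Open Anaconda Prompt
  2. Update conda
  3. Install jupyter
  4. Create an environment
  5. Activate the environment
  6. Install packages
  7. Create a Jupyter Notebook kernel (with the environment already activated)
  8. Deactivate the environment
  9. Open Jupyter Notebook

Open Anaconda Prompt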

Update conda

# run in command line
(base) conda update conda

Install jupyter

# run in command line
(base) conda install jupyter

List the available environments

# run in command line
(base) conda env list

Create an environment

# run in command line
(base) conda create --name <env_name> python=3.7

Activate the environment

# run in command line
(base) conda activate <env_name>
# conda deactivate # return to the original (base)

Install packages

# run in command line
(env_name) conda install ipykernel requests lxml beautifulsoup4 selenium

What these packages are for

  • Environment
    • ipykernel
  • Web scraping
    • requests
    • lxml
    • beautifulsoup4
    • selenium

Create a Jupyter Notebook kernel (with the environment already activated)

# run in command line
(env_name) python -m ipykernel install --user --name <kernel_name> --display-name "Python Web Scraping"

List the available Jupyter Notebook kernels

# run in command line
(env_name) jupyter kernelspec list

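Selenium setup: Chrome

  • Download the latest browser from the official Chrome website
  • Download the latest browser driver, ChromeDriver (its version should match the installed Chrome)
  • After downloading, unzip it to a path you know well so it is easy to reference later

Selenium setup: Firefox

  • Download the latest browser from the official Firefox website
  • Download the latest browser driver, geckodriver
  • After downloading, unzip it to a path you know well so it is easy to reference later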

Test whether Chrome is set up

Use code to drive the Chrome browser through ChromeDriver: go to the IMDb home page, print its URL, then close the browser

In [ ]:
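from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_path = "c:/YOUR/PATH/TO/CHROMEDRIVER"
imdb_home = "https://www.imdb.com/"
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 used webdriver.Chrome(executable_path=driver_path) instead
driver = webdriver.Chrome(service=Service(executable_path=driver_path)) # Use Chrome
driver.get(imdb_home)
print(driver.current_url)
driver.quit() # close the browser and stop the driver process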

Test whether Firefox is set up

Use code to drive the Firefox browser through geckodriver: go to the IMDb home page, print its URL, then close the browser

In [ ]:
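from selenium import webdriver
from selenium.webdriver.firefox.service import Service

driver_path = "c:/YOUR/PATH/TO/GECKODRIVER"
imdb_home = "https://www.imdb.com/"
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 used webdriver.Firefox(executable_path=driver_path) instead
driver = webdriver.Firefox(service=Service(executable_path=driver_path)) # Use Firefox
driver.get(imdb_home)
print(driver.current_url)
driver.quit() # close the browser and stop the driver process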

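Frequently used driver methods and attributes

  • driver.get(): go to the given URL
  • driver.find_element_by_css_selector(): locate the search box, the search button, or a search result link by CSS selector (returns the first match)
  • driver.find_elements_by_css_selector(): same as above, but returns a list of every match
  • driver.find_element_by_xpath(): locate the search box, the search button, or a search result link by XPath (returns the first match)
  • driver.find_elements_by_xpath(): same as above, but returns a list of every match
  • driver.current_url: the URL the browser is currently on

Note: the find_element_by_* helpers are the Selenium 3 spelling. Selenium 4 deprecated them and later 4.x releases removed them; the replacement is driver.find_element(By.CSS_SELECTOR, ...) or driver.find_element(By.XPATH, ...), with By imported from selenium.webdriver.common.by.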

A Chrome extension that helps inspect XPath

XPath Helper

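How to use XPath Helper

  • Click the XPath Helper extension icon
  • Watch the XPath on the left of the XPath Helper panel and the matched data on the right
  • Hold the shift key and move the mouse over the element you want to locate
  • Try shortening the XPath: trim it from the front and replace the removed part with //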
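
The shortening idea in the last step can be tried offline. This is a minimal sketch that swaps in Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath, in place of XPath Helper and the browser; in the browser the shortened path would start with `//` rather than `.//`. The markup and the `movie-title` class name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a real page
page = """
<html>
  <body>
    <div id="main">
      <div class="movie-title">
        <h1>Avengers: Endgame (2019)</h1>
      </div>
    </div>
  </body>
</html>
"""
root = ET.fromstring(page.strip())

# Full path from the root element, the kind of path a tool reports first
full_path = "./body/div/div/h1"
# Shortened path: the leading steps are dropped and replaced with //
short_path = ".//div[@class='movie-title']/h1"

print(root.find(full_path).text)
print(root.find(short_path).text)
```

Both paths select the same element; the shortened one keeps working if wrapper elements are added or removed above the `div`.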

Demonstrating XPath Helper with Avengers: Endgame (2019)

  • Movie title
  • Movie poster
  • Rating
  • Genre
  • Cast

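Frequently used element methods and attributes

  • element.send_keys(): type text
  • element.click(): click a search button or a link
  • element.text: the text inside the tag
  • element.get_attribute(ATTR): the value of the given attribute of the tag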

In-class exercise: implement get_movie_data(movie_title) with selenium

In [ ]:
get_movie_data("Avengers: Endgame (2019)")

In-class exercise: use selenium to scrape the data of the four Avengers movies

avengers_movies = ["The Avengers (2012)", "Avengers: Age of Ultron (2015)", "Avengers: Infinity War (2018)", "Avengers: Endgame (2019)"]
In [ ]:
print(avengers_movie_data)

Export the scraped movie data

In [ ]:
import json

with open("avengers.json", "w") as f:
    json.dump(avengers_movie_data, f)
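
If the scraped values contain non-ASCII text, it can help to write the file as UTF-8 and read it back to confirm the round trip. A minimal sketch; `demo_movie_data`, its keys, and the file name are made-up placeholders, not the actual scraped structure.

```python
import json
import os
import tempfile

# Made-up placeholder data, not real scraped values
demo_movie_data = [{"title": "Café Society", "genres": ["Comedy", "Drama"]}]

out_path = os.path.join(tempfile.gettempdir(), "demo_movies.json")
with open(out_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    json.dump(demo_movie_data, f, ensure_ascii=False, indent=2)

with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == demo_movie_data)
```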

Assignment

Scrape the list of release dates for Avengers: Endgame (2019). Which date has the most releases, and how many countries released the film on that day?

In [52]:
ans()
Out[52]:
{'22 April 2019': 1,
 '23 April 2019': 1,
 '24 April 2019': 33,
 '25 April 2019': 23,
 '26 April 2019': 14,
 '28 April 2019': 1,
 '29 April 2019': 1,
 '28 June 2019': 3,
 '29 June 2019': 1,
 '4 July 2019': 1,
 '12 July 2019': 2,
 '26 July 2019': 1,
 '2 September 2019': 1}
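
The counting step behind a result like the dictionary above can be sketched with `collections.Counter`, assuming the release dates have already been scraped into a list of strings, one per country. `demo_dates` is made-up illustration data, not the actual release list, and the scraping itself is left to the assignment.

```python
from collections import Counter

# Made-up illustration data: one date string per country
demo_dates = [
    "24 April 2019",
    "25 April 2019",
    "24 April 2019",
    "26 April 2019",
    "24 April 2019",
]

date_counts = Counter(demo_dates)
# most_common(1) returns a list holding the single (date, count) pair with the highest count
top_date, n_countries = date_counts.most_common(1)[0]
print(dict(date_counts))
print(top_date, n_countries)
```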