URLs and HTTP


Downloading a file with urllib

  # Import package
  from urllib.request import urlretrieve
  # Import pandas
  import pandas as pd
  # Assign url of file: url
  url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
  # Save file locally
  urlretrieve(url, 'winequality-red.csv')
  # Read file into a DataFrame and print its head
  df = pd.read_csv('winequality-red.csv', sep=';')
  print(df.head())
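
As an aside, pd.read_csv can also read the remote file directly from its URL, so saving a local copy is only needed if you want to keep the file:

  # Read the remote CSV straight into a DataFrame (no local file needed)
  df = pd.read_csv(url, sep=';')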

How to make a request

To get the contents of a web page, you must first send a request to that page to gain access.

Via urllib


  # Import packages
  from urllib.request import urlopen, Request
  # Specify the url
  url = "http://www.datacamp.com/teach/documentation"
  # This packages the request
  request = Request(url)
  # Sends the request and catches the response: response
  response = urlopen(request)
  # Extract the response: html
  html = response.read()
  # Print the html
  print(html)
  # Be polite and close the response!
  response.close()
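
In Python 3 the response returned by urlopen() is also a context manager, so a with block closes it for you:

  # The with statement closes the response even if an error occurs
  with urlopen(Request(url)) as response:
      html = response.read()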

Via requests


  # Import package
  import requests
  # Specify the url: url
  url = "http://www.datacamp.com/teach/documentation"
  # Packages the request, send the request and catch the response: r
  r = requests.get(url)
  # Extract the response: text
  text = r.text
  # Print the html
  print(text)
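
One thing worth adding here: requests does not raise on HTTP errors by default, so it is good practice to check the response status, for example:

  # 200 means success; raise_for_status() raises on 4xx/5xx responses
  print(r.status_code)
  r.raise_for_status()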

Getting HTML content with bs4

PS: for detailed usage of the bs4 library, see 010_爬虫学习004-006.

  # Import packages
  import requests
  from bs4 import BeautifulSoup
  # Specify url
  url = 'https://www.python.org/~guido/'
  # Package the request, send the request and catch the response: r
  r = requests.get(url)
  # Extracts the response as html: html_doc
  html_doc = r.text
  # Create a BeautifulSoup object from the HTML: soup
  soup = BeautifulSoup(html_doc, 'html.parser')
  # Print the title of Guido's webpage
  print(soup.title)
  # Find all 'a' tags (which define hyperlinks): a_tags
  a_tags = soup.find_all('a')
  # Print the URLs to the shell
  for link in a_tags:
      print(link.get('href'))
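
Besides navigating tags, BeautifulSoup can also strip the markup entirely, for example:

  # Extract only the human-readable text of the page
  print(soup.get_text())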

Getting information through APIs


Generally speaking, APIs transfer data in JSON format.

JSON

JSON uses typed key-value pairs, which makes it well suited for programs to consume (it can be used directly as part of a program).

  # Import package
  import json
  # Load JSON: json_data
  with open("a_movie.json") as json_file:
      json_data = json.load(json_file)
  # Print each key-value pair in json_data
  for k in json_data.keys():
      print(k + ': ', json_data[k])
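
To see the "typed" point in action, here is a minimal sketch with made-up data, parsing a JSON string into native Python types:

  # json.loads turns a JSON string into a dict with typed values (made-up record)
  record = json.loads('{"title": "Snakes on a Plane", "year": 2006, "seen": true}')
  print(type(record['title']), type(record['year']), type(record['seen']))
  # <class 'str'> <class 'int'> <class 'bool'>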

The process of getting data from an API


  # Import package
  import requests
  # Assign URL to variable: url
  url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
  # Package the request, send the request and catch the response: r
  r = requests.get(url)
  # Decode the JSON data into a dictionary: json_data
  json_data = r.json()
  # Print the Wikipedia page extract
  pizza_extract = json_data['query']['pages']['24768']['extract']
  print(pizza_extract)
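
The page id '24768' is hard-coded and only holds for this particular title. A slightly more robust sketch, relying on the same response structure, iterates over whatever page ids come back:

  # The 'pages' dict is keyed by page id, which varies per title
  for page_id, page in json_data['query']['pages'].items():
      print(page['extract'])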

Using the Twitter API for analysis

  • Create a Twitter account
  • Obtain the tokens, API keys, and related secrets
  • Twitter offers many APIs; choose a suitable one (here, the streaming API's public stream)
  • Use GET statuses/sample to get a small random sample of public statuses; to capture the full volume of actual public statuses, use the Firehose API
  • Receive the data as JSON files

Code walkthrough:
1) Gain authorization by authenticating with the tokens, API keys, and related secrets.
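A minimal sketch of this step, assuming the tweepy library (classic 3.x API); the credential strings are placeholders:

  # Import package (tweepy is an assumption; the notes do not name the library)
  import tweepy
  # Placeholder credentials -- substitute the values from your Twitter developer account
  consumer_key = "YOUR_CONSUMER_KEY"
  consumer_secret = "YOUR_CONSUMER_SECRET"
  access_token = "YOUR_ACCESS_TOKEN"
  access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
  # Build the OAuth handler and attach the access token
  auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  auth.set_access_token(access_token, access_token_secret)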

2) Define a class that listens to the data stream.
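A minimal sketch of such a listener, again assuming tweepy 3.x; the output file name and the 100-tweet cap are arbitrary choices:

  import json

  # Listener that writes incoming tweets to a file, one JSON object per line
  class MyStreamListener(tweepy.StreamListener):
      def __init__(self, api=None):
          super(MyStreamListener, self).__init__(api)
          self.num_tweets = 0
          self.file = open("tweets.txt", "w")

      def on_status(self, status):
          # status._json holds the raw JSON payload of the tweet
          self.file.write(json.dumps(status._json) + '\n')
          self.num_tweets += 1
          if self.num_tweets < 100:
              return True
          self.file.close()
          return False  # returning False disconnects the stream

      def on_error(self, status_code):
          print(status_code)
          return False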

3) Capture the specified data stream.
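And a sketch of starting that stream, filtered on the keywords that are counted later in this section:

  # Create the stream object (tweepy assumed, see step 1) and filter it on keywords
  l = MyStreamListener()
  stream = tweepy.Stream(auth, l)
  stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])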

Hands-on: fetching random Twitter statuses and analyzing them

Loading the Twitter data

  # Import package
  import json
  # String of path to file: tweets_data_path
  tweets_data_path = 'tweets.txt'
  # Initialize empty list to store tweets: tweets_data
  tweets_data = []
  # Open connection to file
  tweets_file = open(tweets_data_path, "r")
  # Read in tweets and store in list: tweets_data
  for line in tweets_file:
      tweet = json.loads(line)
      tweets_data.append(tweet)
  # Close connection to file
  tweets_file.close()

Loading into a pandas DataFrame

  # Import package
  import pandas as pd
  # Build DataFrame of tweet texts and languages
  df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
  # Print head of DataFrame
  print(df.head())

Analysis: counting word occurrences
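
The loop below relies on a helper word_in_text() that these notes never define. A minimal sketch of it, assuming a simple case-insensitive regex match (the original may differ, e.g. by requiring word boundaries):

  import re

  # Sketch only; the notes' original definition is not shown
  def word_in_text(word, text):
      # True if word occurs anywhere in text, ignoring case
      return re.search(word.lower(), text.lower()) is not None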

  # Initialize list to store tweet counts
  [clinton, trump, sanders, cruz] = [0, 0, 0, 0]
  # Iterate through df, counting the number of tweets in which
  # each candidate is mentioned
  for index, row in df.iterrows():
      clinton += word_in_text('clinton', row['text'])
      trump += word_in_text('trump', row['text'])
      sanders += word_in_text('sanders', row['text'])
      cruz += word_in_text('cruz', row['text'])

Plotting with matplotlib and seaborn

  # Import packages
  import matplotlib.pyplot as plt
  import seaborn as sns
  # Set seaborn style
  sns.set(color_codes=True)
  # Create a list of labels: cd
  cd = ['clinton', 'trump', 'sanders', 'cruz']
  # Plot the bar chart (keyword arguments keep this working on newer seaborn)
  ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
  ax.set(ylabel="count")
  plt.show()
