## Web scraping

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In IPython, you can install it with:

```ipython
!pip install beautifulsoup4
```

In [3]:
import requests
from bs4 import BeautifulSoup

In [1]:
!pip install beautifulsoup4



## Simple HTML

Here is an example HTML file. How do we extract the items Coffee, Tea, and Coke?

```html
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li>Tea</li>
            <li>Coke</li>
        </ul>
    </body>
</html>
```

Imagine we want to get the items in the list. The ul tag indicates an unordered list. We’ll then want to get each list item (list items are in li tags). Specifically, we’ll want to extract the text inside each list item. To do this, we’ll use the following code, where `example` is the HTML of the page.



```python
soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")
```

You’ll notice that items is a list of three items, since there are three list items in the unordered list. You’ll also see that items[0].text will give you the text of the first list item!

In [4]:
example = """
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li test = 'heheda'>Tea</li>
            <li favorate = 'hehe'>
                <span style="color:blue">Coke</span>
            </li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")

In [8]:
soup.find("ul").find_all("li")

[<li>Coffee</li>,
 <li test="heheda">Tea</li>,
 <li favorate="hehe">
 <span style="color:blue">Coke</span>
 </li>]

In [9]:
items[2].text

'\nCoke\n'

In [11]:
items[2].text.replace('\n', '')

'Coke'

In [12]:
it = []
for item in items:
    it.append(item.text.strip('\n'))

In [13]:
it

['Coffee', 'Tea', 'Coke']
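As an alternative to `replace` and `strip('\n')`, Beautiful Soup's `get_text(strip=True)` trims the surrounding whitespace in one call. Below is a minimal sketch, assuming a shortened copy of the `example` markup above; the names `snippet` and `names` are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the example markup above.
snippet = """
<ul>
    <li>Coffee</li>
    <li test="heheda">Tea</li>
    <li>
        <span style="color:blue">Coke</span>
    </li>
</ul>
"""

soup = BeautifulSoup(snippet, "html.parser")
items = soup.find("ul").find_all("li")

# get_text(strip=True) strips whitespace around each text fragment,
# so the nested <span> no longer leaves '\n' around 'Coke'.
names = [item.get_text(strip=True) for item in items]

# Attributes can be read with .get(); it returns None when the attribute is missing.
tea_attr = items[1].get("test")

print(names)
print(tea_attr)
```

Unlike `.text`, `get_text(strip=True)` also removes leading and trailing spaces, which may or may not be what you want when whitespace is meaningful.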

## Course website scraping

| Time         | Food                                   |   Calorie |
| :----------- | :------------------------------------- | --------: |
| breakfast    | egg, milk, cereal, avocado             |       600 |
| lunch        | chicken breast, brown rice, lettuce    |       700 |
| dinner       | steak, sweet potato, broccoli          |       800 |


The page [https://mlqmlq.github.io/STAT628/pages/d8.html](https://mlqmlq.github.io/STAT628/pages/d8.html) contains the table shown above. How do we get all the food and calorie values from this table?

The following is the source code for this table. You can view it with Google Chrome's `Inspect` or `View Page Source`.

```html
<table rules="groups">
  <thead>
    <tr>
      <th style="text-align: left">Time</th>
      <th style="text-align: left">Food</th>
      <th style="text-align: right">Calorie</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">breakfast</td>
      <td style="text-align: left">egg, milk, cereal, avocado</td>
      <td style="text-align: right">600</td>
    </tr>
    <tr>
      <td style="text-align: left">lunch</td>
      <td style="text-align: left">chicken breast, brown rice, lettuce</td>
      <td style="text-align: right">700</td>
    </tr>
    <tr>
      <td style="text-align: left">dinner</td>
      <td style="text-align: left">steak, sweet potato, broccoli</td>
      <td style="text-align: right">800</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: left"> </td>
      <td style="text-align: right"> </td>
    </tr>
  </tbody>
</table>
```

We use `requests` to fetch the webpage and `BeautifulSoup` to parse the HTML text for us.

Sometimes a request returns an error page, such as a [404 page](http://mlqmlq.github.io/stat628/pages/notes0309.html). We can check the response's `status_code` attribute to detect this; see the list of [status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
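To make a status code easier to read, the standard library's `http.HTTPStatus` can translate it into its standard phrase. This is a small sketch that needs no network access; the helper name `describe_status` is hypothetical, and in the notebook you would pass it `req_page.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (not a success)'."""
    # HTTPStatus raises ValueError for codes the standard library does not know.
    status = HTTPStatus(code)
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

`requests` responses also provide `raise_for_status()`, which raises an exception for 4xx and 5xx codes if you would rather fail early than inspect the code by hand.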

In [14]:
url = "https://mlqmlq.github.io/STAT628/pages/d8.html"
req_page = requests.get(url)

In [15]:
req_page.status_code ## Success

200

In [16]:
req_page.content ## Raw contents

b'\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <title>Discussion 8</title>\n    <meta name="author" content="Linquan Ma">\n\n    <!-- Enable responsive viewport -->\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n\n    <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n    <!--[if lt IE 9]>\n      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>\n    <![endif]-->\n\n    <!-- Le styles -->\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" rel="stylesheet" type="text/css" media="all">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" rel="stylesheet" type="text/css" media="all">\n\n    <!-- Le fav and touch icons -->\n\n    <!-- atom & rss feed -->\n    <link href="http://mlqmlq.github.io/

In [17]:
page_content = req_page.content
page = BeautifulSoup(page_content, 'html.parser')  # Parse the HTML with BeautifulSoup so the page is easy to navigate
page


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Discussion 8</title>
<meta content="Linquan Ma" name="author"/>
<!-- Enable responsive viewport -->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
<!-- Le styles -->
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" media="all" rel="stylesheet" type="text/css"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" media="all" rel="stylesheet" type="text/css"/>
<!-- Le fav and touch icons -->
<!-- atom & rss feed -->
<link href="http://mlqmlq.github.io/STAT628nil" rel="alternate" title="Sitewide ATOM Feed" type="application/atom+xml"

In [37]:
page.find("table")

<table rules="groups">
<thead>
<tr>
<th style="text-align: left">Time</th>
<th style="text-align: left">Food</th>
<th style="text-align: right">Calorie</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">breakfast</td>
<td style="text-align: left">egg, milk, cereal, avocado</td>
<td style="text-align: right">600</td>
</tr>
<tr>
<td style="text-align: left">lunch</td>
<td style="text-align: left">chicken breast, brown rice, lettuce</td>
<td style="text-align: right">700</td>
</tr>
<tr>
<td style="text-align: left">dinner</td>
<td style="text-align: left">steak, sweet potato, broccoli</td>
<td style="text-align: right">800</td>
</tr>
</tbody>
<tbody>
<tr>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: right"> </td>
</tr>
</tbody>
</table>

In [18]:
page.find_all("table")

[<table rules="groups">
 <thead>
 <tr>
 <th style="text-align: left">Time</th>
 <th style="text-align: left">Food</th>
 <th style="text-align: right">Calorie</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td style="text-align: left">breakfast</td>
 <td style="text-align: left">egg, milk, cereal, avocado</td>
 <td style="text-align: right">600</td>
 </tr>
 <tr>
 <td style="text-align: left">lunch</td>
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>
 <td style="text-align: right">700</td>
 </tr>
 <tr>
 <td style="text-align: left">dinner</td>
 <td style="text-align: left">steak, sweet potato, broccoli</td>
 <td style="text-align: right">800</td>
 </tr>
 </tbody>
 <tbody>
 <tr>
 <td style="text-align: left"> </td>
 <td style="text-align: left"> </td>
 <td style="text-align: right"> </td>
 </tr>
 </tbody>
 </table>]

In [19]:
table_part = page.find("table")

In [20]:
table_part.find_all("th")

[<th style="text-align: left">Time</th>,
 <th style="text-align: left">Food</th>,
 <th style="text-align: right">Calorie</th>]

In [24]:
table_part.find_all("td")

[<td style="text-align: left">breakfast</td>,
 <td style="text-align: left">egg, milk, cereal, avocado</td>,
 <td style="text-align: right">600</td>,
 <td style="text-align: left">lunch</td>,
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>,
 <td style="text-align: right">700</td>,
 <td style="text-align: left">dinner</td>,
 <td style="text-align: left">steak, sweet potato, broccoli</td>,
 <td style="text-align: right">800</td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: right"> </td>]

In [25]:
import numpy as np
import pandas as pd
content = [x.text for x in table_part.find_all("td")]
values = np.array(content[:9]).reshape(3,3)
values

array([['breakfast', 'egg, milk, cereal, avocado', '600'],
       ['lunch', 'chicken breast, brown rice, lettuce', '700'],
       ['dinner', 'steak, sweet potato, broccoli', '800']], dtype='<U35')

In [26]:
index = [x.text for x in table_part.find_all("th")]
pd.DataFrame(data=values, columns=index)

|    | Time      | Food                                | Calorie |
| -: | :-------- | :---------------------------------- | ------: |
| 0  | breakfast | egg, milk, cereal, avocado          |     600 |
| 1  | lunch     | chicken breast, brown rice, lettuce |     700 |
| 2  | dinner    | steak, sweet potato, broccoli       |     800 |
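The `content[:9]` slice and `reshape(3,3)` above hard-code the size of the table. A sketch of a row-by-row alternative follows, assuming a trimmed copy of the table markup shown earlier (styles removed); it skips the header row and the blank padding row without knowing the table size in advance.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table markup shown earlier, including the blank trailing row.
table_html = """
<table>
  <thead>
    <tr><th>Time</th><th>Food</th><th>Calorie</th></tr>
  </thead>
  <tbody>
    <tr><td>breakfast</td><td>egg, milk, cereal, avocado</td><td>600</td></tr>
    <tr><td>lunch</td><td>chicken breast, brown rice, lettuce</td><td>700</td></tr>
    <tr><td>dinner</td><td>steak, sweet potato, broccoli</td><td>800</td></tr>
  </tbody>
  <tbody>
    <tr><td> </td><td> </td><td> </td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(table_html, "html.parser").find("table")
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # The header row has no <td> cells and the padding row has only empty ones,
    # so any(cells) is False for both and they are skipped.
    if any(cells):
        rows.append(cells)

print(header)
print(rows)
```

The result can then be passed to pandas as `pd.DataFrame(rows, columns=header)`.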


## Practice

This is from [Brown CS1951A scraping](https://cs.brown.edu/courses/csci1951-a/assignments/scraping.html)

To get started, we’re going to want to collect some data on the most active stocks in the market. Conveniently, Yahoo Finance [publishes this exact data](https://finance.yahoo.com/most-active). To collect this data, you’ll make use of web scraping.

For purposes of this assignment, we've made a copy of this page to keep the data static. Note, some of the data in our static copy is intentionally modified from real stock data to ensure you've handled edge cases. As such, you will scrape from this URL: [https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html](https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html)

Before scraping, you'll need your code to access this webpage. You should make use of the `requests` library to make an HTTP request and collect the HTML. If you're not familiar with the `requests` library, you can read about it [here](http://docs.python-requests.org/en/master).

Once you have accessed the HTML and assigned it to some variable, you'll want to scrape it, collecting the following for each stock in the table.

* company name
* price
* market cap
* percentage daily change

You'll use Beautiful Soup, a Python package, to scrape the HTML. This will require looking at the HTML structure of the Yahoo Finance page. You can select various HTML elements on a page by tag name, class name, and/or id. Using [inspect element](https://zapier.com/blog/inspect-element-tutorial/) on your web browser, you can check what HTML tags and classes contain the relevant information.
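The three ways of selecting elements mentioned above (tag name, class name, id) can be sketched as follows. The markup, the id `quotes`, and the class names `name` and `price` are made up for illustration; the real page uses different ones, which you would find with inspect element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id and class names are illustrative, not the real page's.
html = """
<table id="quotes">
  <tr class="stock-row">
    <td class="name">General Electric Company</td>
    <td class="price">10.19</td>
  </tr>
  <tr class="stock-row">
    <td class="name">Advanced Micro Devices, Inc.</td>
    <td class="price">24.13</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table = soup.find(id="quotes")   # select by id
rows = table.find_all("tr")      # select by tag name

# Select by tag name and class name together; note the trailing underscore in
# class_, needed because `class` is a reserved word in Python.
names = [td.get_text(strip=True) for td in table.find_all("td", class_="name")]
prices = [td.get_text(strip=True) for td in table.find_all("td", class_="price")]

print(names)
print(prices)
```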

In [29]:
"""Your code here"""
url = "https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html"
page = BeautifulSoup(requests.get(url).content, 'html.parser')

In [31]:
len(page.find_all("table"))

1

In [32]:
table_part = page.find("table")

In [37]:
colnames = [x.text for x in table_part.find_all("th")]

In [38]:
colnames

['Symbol',
 'Name',
 'Price (Intraday)',
 'Change',
 '% Change',
 'Volume',
 'Avg Vol (3 month)',
 'Market Cap',
 'PE Ratio (TTM)',
 '52 Week Range']

In [41]:
values = [x.text for x in table_part.find_all("td")]
values = np.array(values).reshape(-1, 10)

In [43]:
stock= pd.DataFrame(data=values, columns=colnames) 
stock

|     | Symbol | Name | Price (Intraday) | Change | % Change | Volume | Avg Vol (3 month) | Market Cap | PE Ratio (TTM) | 52 Week Range |
| --: | :----- | :--- | ---------------: | -----: | -------: | -----: | ----------------: | ---------: | -------------: | :------------ |
| 0   | GE     | General Electric Company | 10.19 | +0.01 | +0.05% | 100.948M | 132.455M | 88.677B | | |
| 1   | P      | Pandora Media, Inc. | 0.0000 | -8.3800 | -100.00% | 76.19M | 12.809M | 0 | -0.00 | |
| 2   | AMD    | Advanced Micro Devices, Inc. | 24.13 | -0.38 | -1.55% | 69.435M | 102.925M | 24.116M | 75.41 | |
| 3   | CRON   | Cronos Group Inc. | 23.25 | +2.44 | +11.73% | 67.145M | 12.146M | 4.111B | | |
| 4   | ACB    | Aurora Cannabis Inc. | 8.03 | +0.63 | +8.51% | 66.479M | 16.618M | 7.92B | 37.18 | |
| ... | ...    | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95  | GM     | General Motors Company | 38.93 | +0.15 | +0.39% | 7.082M | 11.799M | 54.946B | 74.58 | |
| 96  | KGC    | Kinross Gold Corporation | 3.3500 | -0.0200 | -0.59% | 6.996M | 16.997M | 4.113B | 19.36 | |
| 97  | FLEX   | Flex Ltd. | 9.42 | +0.02 | +0.21% | 7.296M | 7.975M | 4.96B | 36.23 | |
| 98  | BABA   | Alibaba Group Holding Limited | 166.70 | -1.27 | -0.76% | 6.82M | 17.708M | 432.116B | 47.67 | |


In [44]:
#company name
#price
#market cap
#percentage daily change
stock[['Name', 'Price (Intraday)', 'Market Cap', '% Change']]

|     | Name                          | Price (Intraday) | Market Cap | % Change |
| --: | :---------------------------- | ---------------: | ---------: | -------: |
| 0   | General Electric Company      |            10.19 |    88.677B |   +0.05% |
| 1   | Pandora Media, Inc.           |           0.0000 |          0 | -100.00% |
| 2   | Advanced Micro Devices, Inc.  |            24.13 |    24.116M |   -1.55% |
| 3   | Cronos Group Inc.             |            23.25 |     4.111B |  +11.73% |
| 4   | Aurora Cannabis Inc.          |             8.03 |      7.92B |   +8.51% |
| ... | ...                           |              ... |        ... |      ... |
| 95  | General Motors Company        |            38.93 |    54.946B |   +0.39% |
| 96  | Kinross Gold Corporation      |           3.3500 |     4.113B |   -0.59% |
| 97  | Flex Ltd.                     |             9.42 |      4.96B |   +0.21% |
| 98  | Alibaba Group Holding Limited |           166.70 |   432.116B |   -0.76% |


## Scraping Covid Data

The Wikipedia page [https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data](https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data) has regularly updated Covid data for each country.

We’re going to want to collect the Covid table:

* Construct a Pandas dataframe containing the following columns: **Cases, Deaths, Recoveries, Death Rate, Recovery Rate**

**Hint**: Worldwide data are contained in `th` instead of `td`. You need to address that issue. 
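One way to handle the mix of `th` and `td` cells is to pass a list of tag names to `find_all`, so both kinds of cell are collected in document order, one row at a time. The sketch below uses a heavily simplified, hypothetical version of the table markup (the real Wikipedia table has more columns and nesting); the numbers are copied from the outputs in this notebook.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: the World row uses <th> for every cell,
# while country rows use <th> for the name and <td> for the numbers.
html = """
<table>
  <tr><th>Location</th><th>Cases</th><th>Deaths</th><th>Recov.</th></tr>
  <tr><th>World</th><th>56,166,387</th><th>1,347,778</th><th>36,091,767</th></tr>
  <tr><th>United States</th><td>11,592,276</td><td>253,940</td><td>6,098,053</td></tr>
  <tr><th>Spain</th><td>1,525,341</td><td>42,039</td><td>No data</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # A list of names matches either tag, keeping the cells in page order.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, body = rows[0], rows[1:]
print(header)
print(body)
```

Working row by row avoids position-based slices into a flat list of headers, which break as soon as the page layout changes.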

In [None]:
"""Your code here"""

In [45]:
url = "https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data"
page = BeautifulSoup(requests.get(url).content, 'html.parser')

In [50]:
page.find_all('table')[0]
headers = page.find_all('table')[0].find_all('th')

In [58]:
headers = [x.text.replace('\n', '') for x in headers]

In [55]:
values = [x.text.replace('\n', '') for x in page.find_all('table')[0].find_all('td')]

In [56]:
values

['11,592,276',
 '253,940',
 '6,098,053',
 '[9]',
 '8,912,907',
 '130,993',
 '8,335,109',
 '[10]',
 '5,947,403',
 '167,497',
 '5,389,863',
 '[11][12]',
 '2,065,138',
 '46,698',
 '145,391',
 '[13][14]',
 '1,991,998',
 '34,387',
 '1,501,083',
 '[15]',
 '1,525,341',
 '42,039',
 'No data',
 '[16]',
 '1,430,341',
 '53,274',
 'No data',
 '[18]',
 '1,339,324',
 '36,347',
 '1,156,461',
 '[20]',
 '1,272,352',
 '47,217',
 '481,967',
 '[21]',
 '1,218,003',
 '34,563',
 '1,125,184',
 '[22]',
 '1,015,071',
 '99,528',
 '762,025',
 '[23]',
 '939,931',
 '35,317',
 '867,306',
 '[24][25]',
 '854,533',
 '13,236',
 '546,372',
 '[27][26]',
 '801,894',
 '42,941',
 '576,983',
 '[28]',
 '772,823',
 '11,451',
 '342,883',
 '[29]',
 '757,144',
 '20,556',
 '701,534',
 '[30][31]',
 '570,153',
 '10,112',
 '259,079',
 '[32][33]',
 '540,605',
 '14,839',
 'No data',
 '[35][36]',
 '534,558',
 '14,897',
 '510,766',
 '[40]',
 '526,852',
 '11,795',
 '455,176',
 '[41]',
 '478,720',
 '15,503',
 '402,347',
 '[42]',
 '475,284',

In [70]:
headers[7:11]

['56,166,387', '1,347,778', '36,091,767', '[2]']

In [60]:
colnames = headers[1:5]

In [61]:
colnames

['Cases[b]', 'Deaths[c]', 'Recov.[d]', 'Ref.']

In [62]:
rownames = [headers[6]] + headers[12::2]

In [63]:
rownames

['World[e]',
 'United States[f]',
 'India',
 'Brazil',
 'France[g]',
 'Russia[h]',
 'Spain[i]',
 'United Kingdom[j]',
 'Argentina[k]',
 'Italy',
 'Colombia',
 'Mexico',
 'Peru',
 'Germany[l]',
 'Iran',
 'Poland',
 'South Africa',
 'Ukraine[m]',
 'Belgium[n]',
 'Chile[o]',
 'Iraq',
 'Indonesia',
 'Czech Republic',
 'Netherlands[p]',
 'Bangladesh',
 'Turkey[q]',
 'Philippines',
 'Romania',
 'Pakistan',
 'Saudi Arabia',
 'Israel[r]',
 'Canada[s]',
 'Morocco[t]',
 'Switzerland[u]',
 'Portugal',
 'Austria',
 'Nepal',
 'Sweden',
 'Ecuador',
 'Jordan',
 'Hungary',
 'United Arab Emirates',
 'Panama',
 'Bolivia',
 'Kuwait',
 'Qatar',
 'Dominican Republic',
 'Costa Rica',
 'Kazakhstan',
 'Oman',
 'Japan[v]',
 'Armenia',
 'Belarus',
 'Guatemala',
 'Egypt[w]',
 'Bulgaria',
 'Lebanon',
 'Ethiopia',
 'Honduras',
 'Venezuela',
 'Serbia[x]',
 'Moldova[y]',
 'Croatia',
 'Slovakia',
 'Georgia[z]',
 'China[aa]',
 'Bahrain',
 'Tunisia',
 'Greece',
 'Azerbaijan[ab]',
 'Bosnia and Herzegovina',
 'Libya',
 '

In [66]:
rownames = [x.split('[')[0] for x in rownames]
rownames

['World',
 'United States',
 'India',
 'Brazil',
 'France',
 'Russia',
 'Spain',
 'United Kingdom',
 'Argentina',
 'Italy',
 'Colombia',
 'Mexico',
 'Peru',
 'Germany',
 'Iran',
 'Poland',
 'South Africa',
 'Ukraine',
 'Belgium',
 'Chile',
 'Iraq',
 'Indonesia',
 'Czech Republic',
 'Netherlands',
 'Bangladesh',
 'Turkey',
 'Philippines',
 'Romania',
 'Pakistan',
 'Saudi Arabia',
 'Israel',
 'Canada',
 'Morocco',
 'Switzerland',
 'Portugal',
 'Austria',
 'Nepal',
 'Sweden',
 'Ecuador',
 'Jordan',
 'Hungary',
 'United Arab Emirates',
 'Panama',
 'Bolivia',
 'Kuwait',
 'Qatar',
 'Dominican Republic',
 'Costa Rica',
 'Kazakhstan',
 'Oman',
 'Japan',
 'Armenia',
 'Belarus',
 'Guatemala',
 'Egypt',
 'Bulgaria',
 'Lebanon',
 'Ethiopia',
 'Honduras',
 'Venezuela',
 'Serbia',
 'Moldova',
 'Croatia',
 'Slovakia',
 'Georgia',
 'China',
 'Bahrain',
 'Tunisia',
 'Greece',
 'Azerbaijan',
 'Bosnia and Herzegovina',
 'Libya',
 'Paraguay',
 'Myanmar',
 'Kenya',
 'Uzbekistan',
 'Algeria',
 'Ireland',
 '

In [73]:
values = np.array(headers[7:11] + values[:(len(values)-2)]).reshape(-1, 4)

In [74]:
covid = pd.DataFrame(data = values, index=rownames, columns=colnames)

In [75]:
covid

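|                   | Cases[b] | Deaths[c] | Recov.[d] | Ref.     |
| :---------------- | -------: | --------: | --------: | :------- |
| World             | 56166387 |   1347778 |  36091767 | [2]      |
| United States     | 11592276 |    253940 |   6098053 | [9]      |
| India             |  8912907 |    130993 |   8335109 | [10]     |
| Brazil            |  5947403 |    167497 |   5389863 | [11][12] |
| France            |  2065138 |     46698 |    145391 | [13][14] |
| ...               |      ... |       ... |       ... | ...      |
| Marshall Islands  |        2 |         0 |         2 | [334]    |
| Wallis and Futuna |        2 |         0 |         1 | [335]    |
| Samoa             |        1 |         0 |         0 | [336]    |
| Vanuatu           |        1 |         0 |         0 | [337]    |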


In [76]:
covid.dtypes

Cases[b]     object
Deaths[c]    object
Recov.[d]    object
Ref.         object
dtype: object

In [77]:
def func(x):
    return x.replace(',', '')

In [79]:
covid = covid.applymap(func)

In [80]:
covid.head(5)

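|               | Cases[b] | Deaths[c] | Recov.[d] | Ref.     |
| :------------ | -------: | --------: | --------: | :------- |
| World         | 56166387 |   1347778 |  36091767 | [2]      |
| United States | 11592276 |    253940 |   6098053 | [9]      |
| India         |  8912907 |    130993 |   8335109 | [10]     |
| Brazil        |  5947403 |    167497 |   5389863 | [11][12] |
| France        |  2065138 |     46698 |    145391 | [13][14] |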


In [84]:
covid = covid.replace(to_replace='No data', value=np.nan)

In [86]:
covid = covid.drop(columns=['Ref.'])

In [87]:
covid = covid.apply(pd.to_numeric, axis=1)

In [88]:
covid

|                   |   Cases[b] | Deaths[c] |  Recov.[d] |
| :---------------- | ---------: | --------: | ---------: |
| World             | 56166387.0 | 1347778.0 | 36091767.0 |
| United States     | 11592276.0 |  253940.0 |  6098053.0 |
| India             |  8912907.0 |  130993.0 |  8335109.0 |
| Brazil            |  5947403.0 |  167497.0 |  5389863.0 |
| France            |  2065138.0 |   46698.0 |   145391.0 |
| ...               |        ... |       ... |        ... |
| Marshall Islands  |        2.0 |       0.0 |        2.0 |
| Wallis and Futuna |        2.0 |       0.0 |        1.0 |
| Samoa             |        1.0 |       0.0 |        0.0 |
| Vanuatu           |        1.0 |       0.0 |        0.0 |
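The comma removal, the `'No data'` replacement, and the numeric conversion above can also be combined into one step per column with `pd.to_numeric(..., errors="coerce")`. This is a sketch on a two-row stand-in for the scraped table; the frame `raw` and the helper `to_number` are illustrative names, and the column labels are shortened.

```python
import pandas as pd

# Two-row stand-in for the scraped table, with values copied from the output above.
raw = pd.DataFrame(
    {
        "Cases": ["11,592,276", "1,525,341"],
        "Deaths": ["253,940", "42,039"],
        "Recov.": ["6,098,053", "No data"],
    },
    index=["United States", "Spain"],
)


def to_number(column):
    # Drop thousands separators, then turn anything that is still not a number
    # (such as 'No data') into NaN instead of raising an error.
    return pd.to_numeric(column.str.replace(",", "", regex=False), errors="coerce")


# DataFrame.apply passes each column to the function as a Series.
clean = raw.apply(to_number)
print(clean)
print(clean.dtypes)
```

Be aware that `errors="coerce"` silently converts every unparseable cell to `NaN`, so it is worth checking how many missing values appear afterwards.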


In [None]:
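# Assumption: each rate is taken as a share of confirmed cases.
covid['deathRate'] = covid['Deaths[c]'] / covid['Cases[b]']
covid['recoveryRate'] = covid['Recov.[d]'] / covid['Cases[b]']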