## Web scraping

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In IPython, you can install it with:

```ipython
!pip install beautifulsoup4
```

In [13]:
import json
import requests
from bs4 import BeautifulSoup

## Simple HTML

Here is a simple HTML file. How do we extract the list items Coffee, Tea, and Coke?

```html
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li>Tea</li>
            <li>Coke</li>
        </ul>
    </body>
</html>
```

Imagine we want to get the items in the list. The ul tag indicates an unordered list. We’ll then want to get each list item (list items are in li tags). Specifically, we’ll want to extract the text inside each list item. To do this, we’ll use the following code, where `example` is the HTML of the page.



```python
soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")
```

You’ll notice that items is a list of three items, since there are three list items in the unordered list. You’ll also see that items[0].text will give you the text of the first list item!

In [14]:
example = """
<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li test = 'heheda'>Tea</li>
            <li favorate = 'hehe'>
                <span style="color:blue">Coke</span>
            </li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(example, 'html.parser')
items = soup.find("ul").find_all("li")

In [15]:
str(items[2].text)

'\nCoke\n'

In [16]:
items[2].text.strip()

'Coke'

In [17]:
it = [item.text.strip() for item in items]  # strip surrounding whitespace from each item

In [18]:
it

['Coffee', 'Tea', 'Coke']
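
As a possible alternative to combining `.text` with `.strip()`, the sketch below uses Beautiful Soup's `get_text(strip=True)`, which trims whitespace around each text fragment inside a tag. The `snippet` string and the `drinks` variable are made up for illustration and only mirror the example above.

```python
from bs4 import BeautifulSoup

# Illustrative HTML modeled on the example above (not fetched from any site).
snippet = """
<ul>
    <li>Coffee</li>
    <li test="heheda">Tea</li>
    <li>
        <span style="color:blue">Coke</span>
    </li>
</ul>
"""

soup = BeautifulSoup(snippet, "html.parser")

# get_text(strip=True) strips whitespace from every text fragment in the tag,
# so indentation and nested tags do not leak into the result.
drinks = [li.get_text(strip=True) for li in soup.find("ul").find_all("li")]
print(drinks)

# Tag attributes support dictionary-style access; .get returns None if absent.
second_item = soup.find_all("li")[1]
print(second_item.get("test"))
```

Because the stripping happens per fragment, this approach does not depend on how the HTML happens to be indented.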

## Course website scraping

| Time         | Food                                   |   Calorie |
| :----------- | :------------------------------------- | --------: |
| breakfast    | egg, milk, cereal, avocado             |       600 |
| lunch        | chicken breast, brown rice, lettuce    |       700 |
| dinner       | steak, sweet potato, broccoli          |       800 |


The page [https://mlqmlq.github.io/STAT628/pages/d8.html](https://mlqmlq.github.io/STAT628/pages/d8.html) contains the table shown above. How do we get all the food and calorie values from this table?

The following is the source code for this table. You can view it in Google Chrome with `Inspect` or `View Page Source`.

```html
<table rules="groups">
  <thead>
    <tr>
      <th style="text-align: left">Time</th>
      <th style="text-align: left">Food</th>
      <th style="text-align: right">Calorie</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">breakfast</td>
      <td style="text-align: left">egg, milk, cereal, avocado</td>
      <td style="text-align: right">600</td>
    </tr>
    <tr>
      <td style="text-align: left">lunch</td>
      <td style="text-align: left">chicken breast, brown rice, lettuce</td>
      <td style="text-align: right">700</td>
    </tr>
    <tr>
      <td style="text-align: left">dinner</td>
      <td style="text-align: left">steak, sweet potato, broccoli</td>
      <td style="text-align: right">800</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: left"> </td>
      <td style="text-align: right"> </td>
    </tr>
  </tbody>
</table>
```

We use `requests` to fetch the webpage and `BeautifulSoup` to parse the HTML text for us.

Sometimes a request returns a [404 page](http://mlqmlq.github.io/stat628/pages/notes0309.html). We can check the `status_code` attribute of the response to detect this; see the list of [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

In [19]:
url = "https://mlqmlq.github.io/STAT628/pages/d8.html"
req_page = requests.get(url)

In [20]:
req_page.status_code ## Success

200
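
To make the status check concrete, here is a minimal sketch that interprets a status code using only the standard library's `http.HTTPStatus`. The helper names `describe_status` and `is_success` are hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not defined in the standard library enum.
        return f"{code} (unknown status)"


def is_success(code: int) -> bool:
    """Treat any 2xx code as a successful request."""
    return 200 <= code < 300


for code in (200, 404):
    outcome = "success" if is_success(code) else "failed"
    print(describe_status(code), "->", outcome)
```

With a real response, one could call `is_success(req_page.status_code)` before parsing. `requests` also provides `raise_for_status()`, which raises an exception for 4xx and 5xx responses.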

In [21]:
req_page.content ## Raw contents

b'\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <title>Discussion 8</title>\n    <meta name="author" content="Linquan Ma">\n\n    <!-- Enable responsive viewport -->\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n\n    <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n    <!--[if lt IE 9]>\n      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>\n    <![endif]-->\n\n    <!-- Le styles -->\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" rel="stylesheet" type="text/css" media="all">\n    <link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" rel="stylesheet" type="text/css" media="all">\n\n    <!-- Le fav and touch icons -->\n\n    <!-- atom & rss feed -->\n    <link href="http://mlqmlq.github.io/

In [22]:
page_content = req_page.content
page = BeautifulSoup(page_content, 'html.parser')  # Parse the HTML with Beautiful Soup so the page is easy to navigate
page


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Discussion 8</title>
<meta content="Linquan Ma" name="author"/>
<!-- Enable responsive viewport -->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
<!-- Le styles -->
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css" rel="stylesheet"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/style.css?body=1" media="all" rel="stylesheet" type="text/css"/>
<link href="http://mlqmlq.github.io/STAT628/assets/themes/twitter/css/main.css" media="all" rel="stylesheet" type="text/css"/>
<!-- Le fav and touch icons -->
<!-- atom & rss feed -->
<link href="http://mlqmlq.github.io/STAT628nil" rel="alternate" title="Sitewide ATOM Feed" type="application/atom+xml"

In [23]:
page.find("table")

<table rules="groups">
<thead>
<tr>
<th style="text-align: left">Time</th>
<th style="text-align: left">Food</th>
<th style="text-align: right">Calorie</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">breakfast</td>
<td style="text-align: left">egg, milk, cereal, avocado</td>
<td style="text-align: right">600</td>
</tr>
<tr>
<td style="text-align: left">lunch</td>
<td style="text-align: left">chicken breast, brown rice, lettuce</td>
<td style="text-align: right">700</td>
</tr>
<tr>
<td style="text-align: left">dinner</td>
<td style="text-align: left">steak, sweet potato, broccoli</td>
<td style="text-align: right">800</td>
</tr>
</tbody>
<tbody>
<tr>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: right"> </td>
</tr>
</tbody>
</table>

In [24]:
page.find_all("table")

[<table rules="groups">
 <thead>
 <tr>
 <th style="text-align: left">Time</th>
 <th style="text-align: left">Food</th>
 <th style="text-align: right">Calorie</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td style="text-align: left">breakfast</td>
 <td style="text-align: left">egg, milk, cereal, avocado</td>
 <td style="text-align: right">600</td>
 </tr>
 <tr>
 <td style="text-align: left">lunch</td>
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>
 <td style="text-align: right">700</td>
 </tr>
 <tr>
 <td style="text-align: left">dinner</td>
 <td style="text-align: left">steak, sweet potato, broccoli</td>
 <td style="text-align: right">800</td>
 </tr>
 </tbody>
 <tbody>
 <tr>
 <td style="text-align: left"> </td>
 <td style="text-align: left"> </td>
 <td style="text-align: right"> </td>
 </tr>
 </tbody>
 </table>]

In [25]:
table_part = page.find("table")

In [26]:
table_part.find_all("th")

[<th style="text-align: left">Time</th>,
 <th style="text-align: left">Food</th>,
 <th style="text-align: right">Calorie</th>]

In [27]:
table_part.find_all("td")

[<td style="text-align: left">breakfast</td>,
 <td style="text-align: left">egg, milk, cereal, avocado</td>,
 <td style="text-align: right">600</td>,
 <td style="text-align: left">lunch</td>,
 <td style="text-align: left">chicken breast, brown rice, lettuce</td>,
 <td style="text-align: right">700</td>,
 <td style="text-align: left">dinner</td>,
 <td style="text-align: left">steak, sweet potato, broccoli</td>,
 <td style="text-align: right">800</td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: left"> </td>,
 <td style="text-align: right"> </td>]

In [28]:
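import numpy as np
import pandas as pd
content = [x.text for x in table_part.find_all("td")]
values = np.array(content[:9]).reshape(3, 3)  # the first 9 cells are the 3 x 3 data; the rest belong to the blank last row
colnames = [x.text for x in table_part.find_all("th")]  # the <th> cells hold the column names
pd.DataFrame(data=values, columns=colnames)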

|     | Time      | Food                                | Calorie |
| --: | :-------- | :---------------------------------- | ------: |
|   0 | breakfast | egg, milk, cereal, avocado          |     600 |
|   1 | lunch     | chicken breast, brown rice, lettuce |     700 |
|   2 | dinner    | steak, sweet potato, broccoli       |     800 |
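
The slice `content[:9]` relies on knowing the table size in advance. As a hedged alternative, the sketch below walks the table row by row and skips rows without data, so the blank last row does not need to be sliced off by hand. The `table_html` string is a shortened, illustrative copy of the table source shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the table source (two data rows plus the blank row).
table_html = """
<table>
  <thead>
    <tr><th>Time</th><th>Food</th><th>Calorie</th></tr>
  </thead>
  <tbody>
    <tr><td>breakfast</td><td>egg, milk, cereal, avocado</td><td>600</td></tr>
    <tr><td>lunch</td><td>chicken breast, brown rice, lettuce</td><td>700</td></tr>
  </tbody>
  <tbody>
    <tr><td> </td><td> </td><td> </td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(table_html, "html.parser").find("table")
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Skip the header row (no <td> cells) and rows whose cells are all empty.
    if any(cells):
        rows.append(cells)

# One dictionary per data row, keyed by column name.
records = [dict(zip(header, row)) for row in rows]
print(records)
```

Passing `rows` and `header` to `pd.DataFrame(rows, columns=header)` would give the same kind of data frame as above.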


## Practice

This exercise is from the [Brown CSCI 1951-A scraping assignment](https://cs.brown.edu/courses/csci1951-a/assignments/scraping.html).

To get started, we’re going to want to collect some data on the most active stocks in the market. Conveniently, Yahoo Finance [publishes this exact data](https://finance.yahoo.com/most-active). To collect this data, you’ll make use of web scraping.

For purposes of this assignment, we've made a copy of this page to keep the data static. Note, some of the data in our static copy is intentionally modified from real stock data to ensure you've handled edge cases. As such, you will scrape from this URL: [https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html](https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html)

Before scraping, you'll need your code to access this webpage. You should make use of the `requests` library to make an HTTP request and collect the HTML. If you're not familiar with the `requests` library, you can read about it [here](http://docs.python-requests.org/en/master).

Once you have accessed the HTML and assigned it to some variable, you'll want to scrape it, collecting the following for each stock in the table.

* company name
* price
* market cap
* percentage daily change

You'll use Beautiful Soup, a Python package, to scrape the HTML. This will require looking at the HTML structure of the Yahoo Finance page. You can select various HTML elements on a page by tag name, class name, and/or id. Using [inspect element](https://zapier.com/blog/inspect-element-tutorial/) on your web browser, you can check what HTML tags and classes contain the relevant information.

In [32]:
"""Your code here"""
url = "https://cs.brown.edu/courses/csci1951-a/resources/yahoo_finance.html"
req_page = requests.get(url)
page_content = req_page.content
page = BeautifulSoup(page_content, 'html.parser')  # Parse the HTML with Beautiful Soup so the page is easy to navigate

In [33]:
len(page.find_all("table"))

1

In [34]:
table_part = page.find("table")

In [35]:
table_part.find_all("th")

[<th class="Ta(start) Pstart(6px) Pend(10px) Bgc(white) Fz(xs) Va(m) Py(5px)! Fw(400)! Ta(start) Start(0) Pend(10px) Pos(st) Bgc(white) Ta(start)!" data-reactid="45"><label class="Ta(c) Pos(r) Va(tb) Pend(5px) D(n)--print" data-reactid="46"><input class="Pos(a) V(h)" data-reactid="47" type="checkbox"/><svg class="Va(m)! H(16px) W(16px) Stk($plusGray)! Fill($plusGray)! Cur(p)" data-icon="checkbox-unchecked" data-reactid="48" height="16" style="fill:#000;stroke:#000;stroke-width:0;vertical-align:bottom;" viewbox="0 0 24 24" width="16"><path d="M3 3h18v18H3V3zm19-2H2c-.553 0-1 .448-1 1v20c0 .552.447 1 1 1h20c.552 0 1-.448 1-1V2c0-.552-.448-1-1-1z" data-reactid="49"></path></svg></label><span data-reactid="50">Symbol</span><div class="W(3px) Pend(5px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n)" data-reactid="51"></div></th>,
 <th class="Ta(start) Px(10px) Bgc(white) Fz(xs) Va(m) Py(5px)! Cur(p) Bgc($extraLightBlue):h Fw(400)!" data-reactid="52"><span data-reactid

In [36]:
colnames = [x.text for x in table_part.find_all("th")]
colnames

['Symbol',
 'Name',
 'Price (Intraday)',
 'Change',
 '% Change',
 'Volume',
 'Avg Vol (3 month)',
 'Market Cap',
 'PE Ratio (TTM)',
 '52 Week Range']

In [38]:
values = [x.text for x in table_part.find_all("td")]
values = np.array(values).reshape(-1,len(colnames))
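
The call `reshape(-1, len(colnames))` lets NumPy infer the number of rows, which only works when the number of cells is an exact multiple of the number of columns. The sketch below shows the same idea in plain Python, with an explicit check. The helper name `chunk_rows` and the sample `cells` list are hypothetical.

```python
def chunk_rows(flat, n_cols):
    """Split a flat list of cell values into rows of n_cols items each."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if len(flat) % n_cols != 0:
        # A ragged table (missing or extra cells) cannot be reshaped cleanly.
        raise ValueError(f"{len(flat)} cells cannot be split into rows of {n_cols}")
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


# Illustrative flat list: symbol and price for three stocks.
cells = ["GE", "10.19", "P", "0.0000", "AMD", "24.13"]
print(chunk_rows(cells, 2))
```

If the check fails on a real page, it usually means some rows have a different number of cells, and a row-by-row parse is the safer route.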

In [41]:
stock = pd.DataFrame(data=values, columns=colnames)
stock

|     | Symbol | Name | Price (Intraday) | Change | % Change | Volume | Avg Vol (3 month) | Market Cap | PE Ratio (TTM) | 52 Week Range |
| --: | :----- | :--- | ---------------: | -----: | -------: | -----: | ----------------: | ---------: | -------------: | :------------ |
| 0 | GE | General Electric Company | 10.19 | +0.01 | +0.05% | 100.948M | 132.455M | 88.677B | | |
| 1 | P | Pandora Media, Inc. | 0.0000 | -8.3800 | -100.00% | 76.19M | 12.809M | 0 | -0.00 | |
| 2 | AMD | Advanced Micro Devices, Inc. | 24.13 | -0.38 | -1.55% | 69.435M | 102.925M | 24.116M | 75.41 | |
| 3 | CRON | Cronos Group Inc. | 23.25 | +2.44 | +11.73% | 67.145M | 12.146M | 4.111B | | |
| 4 | ACB | Aurora Cannabis Inc. | 8.03 | +0.63 | +8.51% | 66.479M | 16.618M | 7.92B | 37.18 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | GM | General Motors Company | 38.93 | +0.15 | +0.39% | 7.082M | 11.799M | 54.946B | 74.58 | |
| 96 | KGC | Kinross Gold Corporation | 3.3500 | -0.0200 | -0.59% | 6.996M | 16.997M | 4.113B | 19.36 | |
| 97 | FLEX | Flex Ltd. | 9.42 | +0.02 | +0.21% | 7.296M | 7.975M | 4.96B | 36.23 | |
| 98 | BABA | Alibaba Group Holding Limited | 166.70 | -1.27 | -0.76% | 6.82M | 17.708M | 432.116B | 47.67 | |


In [43]:
# company name, price, market cap, percentage daily change
df = stock[['Name', 'Price (Intraday)', 'Market Cap', '% Change']]
df

|     | Name | Price (Intraday) | Market Cap | % Change |
| --: | :--- | ---------------: | ---------: | -------: |
| 0 | General Electric Company | 10.19 | 88.677B | +0.05% |
| 1 | Pandora Media, Inc. | 0.0000 | 0 | -100.00% |
| 2 | Advanced Micro Devices, Inc. | 24.13 | 24.116M | -1.55% |
| 3 | Cronos Group Inc. | 23.25 | 4.111B | +11.73% |
| 4 | Aurora Cannabis Inc. | 8.03 | 7.92B | +8.51% |
| ... | ... | ... | ... | ... |
| 95 | General Motors Company | 38.93 | 54.946B | +0.39% |
| 96 | Kinross Gold Corporation | 3.3500 | 4.113B | -0.59% |
| 97 | Flex Ltd. | 9.42 | 4.96B | +0.21% |
| 98 | Alibaba Group Holding Limited | 166.70 | 432.116B | -0.76% |
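
The scraped values are still strings such as `88.677B` or `+0.05%`. The sketch below shows one way to turn them into numbers. The helper names `parse_abbreviated` and `parse_percent` are hypothetical, and the `K` and `T` suffixes are assumptions, since only `M` and `B` appear in the output above.

```python
# Multipliers for magnitude suffixes. "M" and "B" appear in the table;
# "K" and "T" are assumed for completeness.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated(text: str) -> float:
    """Convert strings like '88.677B' or '24.116M' to a float; plain numbers pass through."""
    text = text.strip().replace(",", "")
    if text and text[-1].upper() in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1].upper()]
    return float(text)  # raises ValueError for empty or non-numeric cells


def parse_percent(text: str) -> float:
    """Convert strings like '+0.05%' to the number of percentage points (0.05)."""
    return float(text.strip().rstrip("%").replace(",", ""))


print(parse_abbreviated("88.677B"))
print(parse_abbreviated("0"))
print(parse_percent("-100.00%"))
```

Empty cells, like the missing PE ratios above, raise `ValueError` here, so they would need to be handled separately before applying these helpers to a whole column.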


## Scraping Covid Data

The Wikipedia page [https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data](https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data) has regularly updated COVID-19 data for each country.

We’re going to want to collect the Covid table:

* Construct a Pandas dataframe containing the following columns: **Cases, Deaths, Recoveries, Death Rate, Recovery Rate**

**Hint**: Worldwide data are contained in `th` instead of `td`. You need to address that issue. 
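
One way to address the hint is to collect `th` and `td` cells together for each row, since `find_all` accepts a list of tag names. The sketch below uses a made-up miniature table (`mini_table`, with invented numbers and an invented location) that imitates the layout in which the summary row uses only `th` cells.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the layout: the "World" row stores its numbers in <th> cells.
mini_table = """
<table>
  <tr><th>Location</th><th>Cases</th><th>Deaths</th></tr>
  <tr><th>World</th><th>100</th><th>5</th></tr>
  <tr><th>Atlantis</th><td>60</td><td>3</td></tr>
</table>
"""

table = BeautifulSoup(mini_table, "html.parser").find("table")

# Passing a list of tag names matches either tag, in document order,
# so every row yields its label and its values regardless of cell type.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
header, body = rows[0], rows[1:]
print(header)
print(body)
```

The real Wikipedia table also contains extra cells (for example, empty ones and a reference column), so a row-wise parse there would still need some filtering.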

In [44]:
"""Your code here"""
url = "https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data"
req_page = requests.get(url)
page_content = req_page.content
page = BeautifulSoup(page_content, 'html.parser')  # Parse the HTML with Beautiful Soup so the page is easy to navigate

In [45]:
len(page.find_all("table"))

26

In [49]:
page.find_all("table")[0]

<table class="wikitable plainrowheaders sortable" id="thetable" style="text-align:right; margin:0 0 0.5em 1em;width:97%;">
<caption><div class="covid-show-table" style="font-size:80%;font-weight:500;"><a href="#covid-19-pandemic-data">[show all]</a></div><div class="covid-collapse-table" style="font-size:80%;font-weight:500;float: right;"><a href="#void">[collapse]</a></div><div class="plainlinks hlist navbar mini" style="float:left; text-align:left"><ul><li class="nv-view"><a class="mw-selflink selflink"><abbr title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:COVID-19_pandemic_data" title="Template talk:COVID-19 pandemic data"><abbr title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Template:COVID-19_pandemic_data&amp;action=edit"><abbr title="Edit this template">e</abbr></a></li></ul></div><div style="font-size:114%;margin:0 4em"><span style="font-size:

In [73]:
table_part = page.find_all("table")[0]
headers = [x.text.strip('\n') for x in table_part.find_all("th")]
headers

['Location[a]',
 'Cases[b]',
 'Deaths[c]',
 'Recov.[d]',
 'Ref.',
 '',
 'World[e]',
 '56,024,130',
 '1,345,639',
 '35,969,240',
 '[2]',
 '',
 'United States[f]',
 '',
 'India',
 '',
 'Brazil',
 '',
 'France[g]',
 '',
 'Russia[h]',
 '',
 'Spain[i]',
 '',
 'United Kingdom[j]',
 '',
 'Argentina[k]',
 '',
 'Italy',
 '',
 'Colombia',
 '',
 'Mexico',
 '',
 'Peru',
 '',
 'Germany[l]',
 '',
 'Iran',
 '',
 'Poland',
 '',
 'South Africa',
 '',
 'Ukraine[m]',
 '',
 'Belgium[n]',
 '',
 'Chile[o]',
 '',
 'Iraq',
 '',
 'Indonesia',
 '',
 'Czech Republic',
 '',
 'Netherlands[p]',
 '',
 'Bangladesh',
 '',
 'Turkey[q]',
 '',
 'Philippines',
 '',
 'Romania',
 '',
 'Pakistan',
 '',
 'Saudi Arabia',
 '',
 'Israel[r]',
 '',
 'Canada[s]',
 '',
 'Morocco[t]',
 '',
 'Switzerland[u]',
 '',
 'Portugal',
 '',
 'Austria',
 '',
 'Nepal',
 '',
 'Ecuador',
 '',
 'Sweden',
 '',
 'Jordan',
 '',
 'United Arab Emirates',
 '',
 'Panama',
 '',
 'Hungary',
 '',
 'Bolivia',
 '',
 'Kuwait',
 '',
 'Qatar',
 '',
 'Dominican Re

In [112]:
values = [x.text.strip('\n') for x in table_part.find_all("td")]
values

['11,546,233',
 '253,423',
 '6,095,998',
 '[9]',
 '8,912,907',
 '130,993',
 '8,335,109',
 '[10]',
 '5,945,849',
 '167,455',
 '5,389,863',
 '[11][12]',
 '2,065,138',
 '46,698',
 '145,391',
 '[13][14]',
 '1,991,998',
 '34,387',
 '1,501,083',
 '[15]',
 '1,525,341',
 '42,039',
 'No data',
 '[16]',
 '1,430,341',
 '53,274',
 'No data',
 '[18]',
 '1,328,992',
 '36,106',
 '1,148,820',
 '[20]',
 '1,272,352',
 '47,217',
 '481,967',
 '[21]',
 '1,211,128',
 '34,381',
 '1,118,902',
 '[22]',
 '1,011,153',
 '99,026',
 '757,951',
 '[23]',
 '939,931',
 '35,317',
 '867,306',
 '[24][25]',
 '847,988',
 '13,396',
 '546,372',
 '[27][26]',
 '801,894',
 '42,941',
 '576,983',
 '[28]',
 '772,823',
 '11,451',
 '342,883',
 '[29]',
 '757,144',
 '20,556',
 '701,534',
 '[30][31]',
 '570,153',
 '10,112',
 '259,079',
 '[32][33]',
 '540,605',
 '14,839',
 'No data',
 '[35][36]',
 '534,558',
 '14,897',
 '510,766',
 '[40]',
 '526,852',
 '11,795',
 '455,176',
 '[41]',
 '478,720',
 '15,503',
 '402,347',
 '[42]',
 '472,250',

In [115]:
values = np.array(headers[7:11] + values[:-2]).reshape(-1, 4)  # the World row is stored in <th> cells (headers[7:11]), so prepend it
values

array([['56,024,130', '1,345,639', '35,969,240', '[2]'],
       ['11,546,233', '253,423', '6,095,998', '[9]'],
       ['8,912,907', '130,993', '8,335,109', '[10]'],
       ['5,945,849', '167,455', '5,389,863', '[11][12]'],
       ['2,065,138', '46,698', '145,391', '[13][14]'],
       ['1,991,998', '34,387', '1,501,083', '[15]'],
       ['1,525,341', '42,039', 'No data', '[16]'],
       ['1,430,341', '53,274', 'No data', '[18]'],
       ['1,328,992', '36,106', '1,148,820', '[20]'],
       ['1,272,352', '47,217', '481,967', '[21]'],
       ['1,211,128', '34,381', '1,118,902', '[22]'],
       ['1,011,153', '99,026', '757,951', '[23]'],
       ['939,931', '35,317', '867,306', '[24][25]'],
       ['847,988', '13,396', '546,372', '[27][26]'],
       ['801,894', '42,941', '576,983', '[28]'],
       ['772,823', '11,451', '342,883', '[29]'],
       ['757,144', '20,556', '701,534', '[30][31]'],
       ['570,153', '10,112', '259,079', '[32][33]'],
       ['540,605', '14,839', 'No data', '[35][36]

In [85]:
colnames = headers[1:5]
colnames 

['Cases[b]', 'Deaths[c]', 'Recov.[d]', 'Ref.']

In [86]:
colnames = [x.split('[')[0] for x in colnames]
colnames

['Cases', 'Deaths', 'Recov.', 'Ref.']
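
Splitting on `[` works when the footnote marker is at the end of the label. As a hedged alternative, the sketch below removes every bracketed marker with a regular expression, wherever it occurs. The names `FOOTNOTE` and `strip_footnotes` are hypothetical.

```python
import re

# Matches a bracketed footnote marker such as "[e]" or "[aa]".
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(label: str) -> str:
    """Remove all bracketed markers from a label and trim surrounding whitespace."""
    return FOOTNOTE.sub("", label).strip()


labels = ["World[e]", "Georgia[aa]", "Recov.[d]", "India"]
cleaned = [strip_footnotes(x) for x in labels]
print(cleaned)
```

For the labels in this table, both approaches give the same result; the regular expression only differs when text follows the marker.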

In [117]:
rownames = [headers[6]] + headers[12::2]
rownames 

['World[e]',
 'United States[f]',
 'India',
 'Brazil',
 'France[g]',
 'Russia[h]',
 'Spain[i]',
 'United Kingdom[j]',
 'Argentina[k]',
 'Italy',
 'Colombia',
 'Mexico',
 'Peru',
 'Germany[l]',
 'Iran',
 'Poland',
 'South Africa',
 'Ukraine[m]',
 'Belgium[n]',
 'Chile[o]',
 'Iraq',
 'Indonesia',
 'Czech Republic',
 'Netherlands[p]',
 'Bangladesh',
 'Turkey[q]',
 'Philippines',
 'Romania',
 'Pakistan',
 'Saudi Arabia',
 'Israel[r]',
 'Canada[s]',
 'Morocco[t]',
 'Switzerland[u]',
 'Portugal',
 'Austria',
 'Nepal',
 'Ecuador',
 'Sweden',
 'Jordan',
 'United Arab Emirates',
 'Panama',
 'Hungary',
 'Bolivia',
 'Kuwait',
 'Qatar',
 'Dominican Republic',
 'Costa Rica',
 'Kazakhstan',
 'Oman',
 'Japan[v]',
 'Armenia',
 'Belarus',
 'Guatemala',
 'Egypt[w]',
 'Lebanon',
 'Bulgaria',
 'Ethiopia',
 'Honduras',
 'Venezuela',
 'Serbia[x]',
 'Moldova[y]',
 'Croatia',
 'Slovakia',
 'China[z]',
 'Bahrain',
 'Georgia[aa]',
 'Greece',
 'Tunisia',
 'Azerbaijan[ab]',
 'Bosnia and Herzegovina',
 'Libya',
 '

In [118]:
rownames = [x.split('[')[0] for x in rownames]
rownames

['World',
 'United States',
 'India',
 'Brazil',
 'France',
 'Russia',
 'Spain',
 'United Kingdom',
 'Argentina',
 'Italy',
 'Colombia',
 'Mexico',
 'Peru',
 'Germany',
 'Iran',
 'Poland',
 'South Africa',
 'Ukraine',
 'Belgium',
 'Chile',
 'Iraq',
 'Indonesia',
 'Czech Republic',
 'Netherlands',
 'Bangladesh',
 'Turkey',
 'Philippines',
 'Romania',
 'Pakistan',
 'Saudi Arabia',
 'Israel',
 'Canada',
 'Morocco',
 'Switzerland',
 'Portugal',
 'Austria',
 'Nepal',
 'Ecuador',
 'Sweden',
 'Jordan',
 'United Arab Emirates',
 'Panama',
 'Hungary',
 'Bolivia',
 'Kuwait',
 'Qatar',
 'Dominican Republic',
 'Costa Rica',
 'Kazakhstan',
 'Oman',
 'Japan',
 'Armenia',
 'Belarus',
 'Guatemala',
 'Egypt',
 'Lebanon',
 'Bulgaria',
 'Ethiopia',
 'Honduras',
 'Venezuela',
 'Serbia',
 'Moldova',
 'Croatia',
 'Slovakia',
 'China',
 'Bahrain',
 'Georgia',
 'Greece',
 'Tunisia',
 'Azerbaijan',
 'Bosnia and Herzegovina',
 'Libya',
 'Myanmar',
 'Paraguay',
 'Kenya',
 'Uzbekistan',
 'Algeria',
 'Ireland',
 '

In [156]:
covid = pd.DataFrame(data = values, index=rownames, columns=colnames)
covid = covid.drop(['Ref.'], axis=1)
covid.tail()

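|                   | Cases   | Deaths  | Recov.  |
| :---------------- | ------: | ------: | ------: |
| Marshall Islands  | 2       | 0       | 2       |
| Wallis and Futuna | 2       | 0       | 1       |
| Samoa             | 1       | 0       | 0       |
| Vanuatu           | 1       | 0       | 0       |
| Tanzania          | No data | No data | No data |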


In [157]:
covid.dtypes

Cases     object
Deaths    object
Recov.    object
dtype: object

In [158]:
def func(x):
    return x.replace(',', '')

In [160]:
covid = covid.applymap(func)
covid

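|                   | Cases    | Deaths  | Recov.   |
| :---------------- | -------: | ------: | -------: |
| World             | 56024130 | 1345639 | 35969240 |
| United States     | 11546233 | 253423  | 6095998  |
| India             | 8912907  | 130993  | 8335109  |
| Brazil            | 5945849  | 167455  | 5389863  |
| France            | 2065138  | 46698   | 145391   |
| ...               | ...      | ...     | ...      |
| Marshall Islands  | 2        | 0       | 2        |
| Wallis and Futuna | 2        | 0       | 1        |
| Samoa             | 1        | 0       | 0        |
| Vanuatu           | 1        | 0       | 0        |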


In [164]:
covid = covid.replace(to_replace='No data', value=np.nan)
covid

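|                   | Cases    | Deaths  | Recov.   |
| :---------------- | -------: | ------: | -------: |
| World             | 56024130 | 1345639 | 35969240 |
| United States     | 11546233 | 253423  | 6095998  |
| India             | 8912907  | 130993  | 8335109  |
| Brazil            | 5945849  | 167455  | 5389863  |
| France            | 2065138  | 46698   | 145391   |
| ...               | ...      | ...     | ...      |
| Marshall Islands  | 2        | 0       | 2        |
| Wallis and Futuna | 2        | 0       | 1        |
| Samoa             | 1        | 0       | 0        |
| Vanuatu           | 1        | 0       | 0        |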


In [174]:
covid = covid.apply(pd.to_numeric, axis=1)

In [175]:
covid['recoveryRate'] = covid['Recov.']/covid['Cases']

In [176]:
covid['deathRate'] = covid['Deaths']/covid['Cases']

In [177]:
covid

|                   | Cases      | Deaths    | Recov.     | recoveryRate | deathRate |
| :---------------- | ---------: | --------: | ---------: | -----------: | --------: |
| World             | 56024130.0 | 1345639.0 | 35969240.0 | 0.642031     | 0.024019  |
| United States     | 11546233.0 | 253423.0  | 6095998.0  | 0.527964     | 0.021949  |
| India             | 8912907.0  | 130993.0  | 8335109.0  | 0.935173     | 0.014697  |
| Brazil            | 5945849.0  | 167455.0  | 5389863.0  | 0.906492     | 0.028163  |
| France            | 2065138.0  | 46698.0   | 145391.0   | 0.070403     | 0.022613  |
| ...               | ...        | ...       | ...        | ...          | ...       |
| Marshall Islands  | 2.0        | 0.0       | 2.0        | 1.000000     | 0.000000  |
| Wallis and Futuna | 2.0        | 0.0       | 1.0        | 0.500000     | 0.000000  |
| Samoa             | 1.0        | 0.0       | 0.0        | 0.000000     | 0.000000  |
| Vanuatu           | 1.0        | 0.0       | 0.0        | 0.000000     | 0.000000  |
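
The cleaning steps above (removing commas, replacing `No data`, converting to numbers) can also be combined. The sketch below is one possible shortcut, assuming `pd.to_numeric` with `errors="coerce"` is acceptable for turning any non-numeric cell into `NaN`. The `raw` frame holds three rows copied from the scraped values, and the helper name `to_number` is hypothetical.

```python
import pandas as pd

# Three rows copied from the scraped values above, still as strings.
raw = pd.DataFrame(
    {
        "Cases": ["1,525,341", "2", "No data"],
        "Deaths": ["42,039", "0", "No data"],
    },
    index=["Spain", "Marshall Islands", "Tanzania"],
)


def to_number(col: pd.Series) -> pd.Series:
    """Drop thousands separators, then convert; unparseable cells become NaN."""
    return pd.to_numeric(col.str.replace(",", "", regex=False), errors="coerce")


clean = raw.apply(to_number)  # applied column by column
clean["deathRate"] = clean["Deaths"] / clean["Cases"]
print(clean)
```

Note that `errors="coerce"` silently converts any unexpected string to `NaN`, so it is worth checking how many missing values appear after the conversion.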
