COVID Analysis & Visualization


Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, which has forced people in every part of the world to quarantine themselves at home.
We have studied, examined, and analyzed the COVID-19 outbreak and visualized the outcome in t

Everywhere on the internet, in blogs, news, social media, and so on, we keep seeing the same clichés, such as "2020 is the worst possible year", "The year which destroyed my career is 2020", "Living at my home for the past 5 months", or an unprofessional comment like "2020 sux", which most social media content creators also use as a meme. Furthermore, any news of a celebrity's death, a natural disaster, or a deadly explosion leads again to the global statement that "2020 is the worst year possible". Yet these things have happened before, so why does everyone feel that 2020 is one of the most detrimental years ever? One word is all it takes to answer this question: COVID!


COVID-19 is a disease caused by SARS-CoV-2 that can trigger what doctors call a respiratory tract infection. COVID has been spreading across the globe for the past six months, which gave rise to the pandemic we are facing right now. In this blog we analyze how COVID-19 has impacted the world by looking at specific details such as the number of countries affected, active cases by month, closed cases, Mortality Rate vs Recovery Rate, Growth Factor, the top 15 affected countries, and more.

Before going into details regarding the analysis, we shall look closely at the dataset we have:

As you can see, the dataset has the columns ObservationDate, Province/State, Country/Region, Last Update, Confirmed, Deaths, and Recovered, which we will use for our analysis. The observation dates run through July 8th, 2020.

So now, let's start our analysis in a Jupyter Notebook running Python 3.7.4.

We start by including the necessary libraries and importing the CSV data.
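As a minimal sketch of this loading step: the post does not name the CSV file, so the example below reads a tiny in-memory stand-in instead. The `SNo` column, the helper `load_covid_data`, and all numbers are made-up placeholders; only the other column names come from the dataset described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (its name is not given in the post).
# The numbers are toy values, not real case counts.
SAMPLE_CSV = """SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,07/07/2020,New York,US,2020-07-08 05:33:48,100,5,30
2,07/08/2020,New York,US,2020-07-09 04:34:21,120,6,40
3,07/08/2020,,Italy,2020-07-09 04:34:21,45,4,25
"""


def load_covid_data(source):
    """Read the CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


covid = load_covid_data(io.StringIO(SAMPLE_CSV))
print(covid.head())    # first rows, to check the columns
print(covid.dtypes)    # column types, to see what needs converting
```

With the real data, you would pass the path of the downloaded CSV file to `load_covid_data` instead of the `StringIO` object.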

More information regarding the dataset:

We now convert ObservationDate to a datetime type and group the data by Country/Region and ObservationDate, which will let us analyze it date by date.
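The conversion and grouping could look roughly like the sketch below. The toy rows, the variable names `grouped_country` and `datewise`, and the assumed `MM/DD/YYYY` date format are illustrative assumptions, not taken from the original notebook.

```python
import pandas as pd

# Toy rows standing in for the real dataset (made-up numbers).
covid = pd.DataFrame({
    "ObservationDate": ["07/07/2020", "07/07/2020", "07/08/2020", "07/08/2020"],
    "Country/Region": ["US", "Italy", "US", "Italy"],
    "Confirmed": [100, 40, 120, 45],
    "Deaths": [5, 4, 6, 4],
    "Recovered": [30, 20, 40, 25],
})

# Parse the date strings (assumed to be MM/DD/YYYY) into real datetimes.
covid["ObservationDate"] = pd.to_datetime(covid["ObservationDate"], format="%m/%d/%Y")

case_columns = ["Confirmed", "Deaths", "Recovered"]

# One row per (country, date) pair.
grouped_country = covid.groupby(["Country/Region", "ObservationDate"])[case_columns].sum()

# One row per date, summed over all countries.
datewise = covid.groupby("ObservationDate")[case_columns].sum()
print(datewise)
```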

We now calculate the active cases by subtracting the number of deaths and recovered cases from the confirmed cases.

i.e. Active Cases = Confirmed - (Deaths + Recovered)
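In pandas this formula is a single column operation. The sketch below applies it to a small date-wise table with made-up numbers; the name `datewise` is an assumption about how the grouped data is stored.

```python
import pandas as pd

# Toy date-wise totals (made-up numbers).
datewise = pd.DataFrame(
    {"Confirmed": [100, 150], "Deaths": [5, 8], "Recovered": [30, 60]},
    index=pd.to_datetime(["2020-07-07", "2020-07-08"]),
)

# Active Cases = Confirmed - (Deaths + Recovered)
datewise["Active Cases"] = datewise["Confirmed"] - (
    datewise["Deaths"] + datewise["Recovered"]
)
print(datewise)
```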

Next, we analyze some basic information, such as the total number of affected countries, the confirmed/recovered/death/active cases across the globe, the approximate number of confirmed/recovered/death/active cases per day, and more. We begin by creating a new DataFrame as below:

Using this DataFrame, we compute the required figures as below:
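A rough sketch of how such summary figures might be derived is shown below. It assumes the counts are cumulative, so the last date holds the running totals, and it approximates the per-day figure as the total divided by the number of observed days. The toy data and the `summary` dictionary are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the real dataset (made-up numbers).
covid = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-07-07", "2020-07-07", "2020-07-08", "2020-07-08"]
    ),
    "Country/Region": ["US", "Italy", "US", "Italy"],
    "Confirmed": [100, 40, 120, 45],
    "Deaths": [5, 4, 6, 4],
    "Recovered": [30, 20, 40, 25],
})

datewise = covid.groupby("ObservationDate")[["Confirmed", "Deaths", "Recovered"]].sum()
datewise["Active Cases"] = datewise["Confirmed"] - (
    datewise["Deaths"] + datewise["Recovered"]
)

# Assumption: counts are cumulative, so the last row holds the totals so far.
latest = datewise.iloc[-1]
n_days = len(datewise)

summary = {
    "Affected countries": covid["Country/Region"].nunique(),
    "Confirmed": latest["Confirmed"],
    "Deaths": latest["Deaths"],
    "Recovered": latest["Recovered"],
    "Active": latest["Active Cases"],
    "Confirmed per day (approx.)": latest["Confirmed"] / n_days,
}
for name, value in summary.items():
    print(f"{name}: {value}")
```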

Now we visualize the active and closed cases for different months by plotting a bar graph of the respective case counts against the months, which we grouped previously.
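The table behind such a bar graph could be built as sketched below. Two points are assumptions rather than something the post states: closed cases are taken to be Deaths + Recovered, and, because the counts are cumulative, each month is represented by its last observation. The numbers are made up.

```python
import pandas as pd

# Toy date-wise totals spanning two months (made-up numbers).
datewise = pd.DataFrame(
    {
        "Confirmed": [100, 110, 150, 165],
        "Deaths": [5, 6, 8, 10],
        "Recovered": [30, 35, 60, 70],
    },
    index=pd.to_datetime(["2020-06-29", "2020-06-30", "2020-07-07", "2020-07-08"]),
)

datewise["Active Cases"] = datewise["Confirmed"] - (
    datewise["Deaths"] + datewise["Recovered"]
)
# Assumption: a closed case is one that ended in death or recovery.
datewise["Closed Cases"] = datewise["Deaths"] + datewise["Recovered"]

# Group by calendar month and keep the last (most recent) value of each month.
monthly = datewise.groupby(datewise.index.month)[
    ["Active Cases", "Closed Cases"]
].last()
print(monthly)
```

If matplotlib is installed, `monthly.plot(kind="bar")` would then draw the two series side by side for each month.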

We follow similar steps to analyze the United States alone.

We start by creating a DataFrame named ‘US_Data’ which holds only the US rows separated out from the original dataset.

Much as before, we add a new column for the active cases in the United States by subtracting the number of deaths and recovered cases from the confirmed cases.

i.e. US_Data["Active Cases"] = US_Data["Confirmed"] - (US_Data["Deaths"] + US_Data["Recovered"])

Then we aggregate the Confirmed/Recovered/Deaths and Active Cases, grouped by state, and create a new DataFrame named ‘State_US_data’.

Using this DataFrame, we visualize the top 20 states with the highest number of active cases by plotting a graph as below:
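The US steps above can be sketched as follows. The names `US_Data` and `State_US_data` come from the post; the toy rows, the assumption that the country is labelled `"US"` in Country/Region, and the choice to aggregate only the latest date (since counts are taken to be cumulative) are mine.

```python
import pandas as pd

# Toy rows standing in for the real dataset (made-up numbers).
covid = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-07-07", "2020-07-07", "2020-07-08", "2020-07-08", "2020-07-08"]
    ),
    "Province/State": ["New York", "Texas", "New York", "Texas", None],
    "Country/Region": ["US", "US", "US", "US", "Italy"],
    "Confirmed": [100, 80, 110, 95, 45],
    "Deaths": [10, 2, 11, 3, 4],
    "Recovered": [30, 20, 35, 22, 25],
})

# Keep only the US rows (assumes the label "US"); copy so we can add columns.
US_Data = covid[covid["Country/Region"] == "US"].copy()
US_Data["Active Cases"] = US_Data["Confirmed"] - (
    US_Data["Deaths"] + US_Data["Recovered"]
)

# Assumption: counts are cumulative, so aggregate only the latest date.
latest_us = US_Data[US_Data["ObservationDate"] == US_Data["ObservationDate"].max()]
State_US_data = (
    latest_us.groupby("Province/State")[
        ["Confirmed", "Recovered", "Deaths", "Active Cases"]
    ]
    .sum()
    .sort_values("Active Cases", ascending=False)
)

top_states = State_US_data.head(20)  # top 20 states by active cases
print(top_states)
```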

Now we use the same DataFrame to plot the distribution of cases in the United States by month.

Mortality Rate vs Recovery Rate:

So firstly, what is Mortality Rate?

According to Wikipedia: Mortality Rate, or death rate, is a measure of the number of deaths in a particular population, scaled to the size of that population, per unit of time.

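So we need a DataFrame in which we can calculate the Mortality Rate and visualize it in a graph.

We use the already existing datewise DataFrame and add a new column named ‘Mortality Rate’ by dividing death cases by confirmed cases and multiplying the result by 100. Note that this divides by confirmed cases rather than by the whole population, so it is closer to what is usually called the case fatality rate.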

i.e.: Mortality Rate = (Deaths/Confirmed)*100

We do the same for the Recovery Rate.

i.e.: Recovery Rate = (Recovered/Confirmed)*100
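Both formulas translate directly into column arithmetic. The sketch below uses a toy `datewise` table with made-up numbers.

```python
import pandas as pd

# Toy date-wise totals (made-up numbers).
datewise = pd.DataFrame(
    {"Confirmed": [100, 200], "Deaths": [5, 8], "Recovered": [30, 100]},
    index=pd.to_datetime(["2020-07-07", "2020-07-08"]),
)

# Mortality Rate = (Deaths / Confirmed) * 100
datewise["Mortality Rate"] = datewise["Deaths"] / datewise["Confirmed"] * 100
# Recovery Rate = (Recovered / Confirmed) * 100
datewise["Recovery Rate"] = datewise["Recovered"] / datewise["Confirmed"] * 100

print(datewise[["Mortality Rate", "Recovery Rate"]])
```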

Now let us visualize our analysis:

Now we plot the daily increase in cases and the cases distributed over 7 days, i.e. per week.
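One possible reading of this step, sketched with made-up cumulative counts: the daily increase is the day-to-day difference of the cumulative series, and the weekly figure is the sum of those differences per calendar week. The weekly binning (weeks ending on Sunday, pandas' default for `"W"`) is an assumption; the original notebook may group the days differently.

```python
import pandas as pd

# Toy cumulative confirmed counts for 8 consecutive days (made-up numbers).
confirmed = pd.Series(
    [100, 110, 125, 145, 170, 200, 240, 290],
    index=pd.date_range("2020-07-01", periods=8, freq="D"),
    name="Confirmed",
)

# Daily increase: difference between consecutive days (first day is NaN).
daily_increase = confirmed.diff()

# Weekly increase: sum of daily increases per week (weeks end on Sunday).
weekly_increase = daily_increase.resample("W").sum()

print(daily_increase)
print(weekly_increase)
```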


The growth factor is the factor by which a quantity multiplies itself over time. The formula used is:

Formula: Growth Factor = new (Confirmed, Recovered, Deaths) on a given day / new (Confirmed, Recovered, Deaths) on the previous day
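As a small sketch of this formula for confirmed cases, using made-up cumulative counts: the new cases per day come from `diff()`, and dividing by the same series shifted by one day gives the growth factor.

```python
import pandas as pd

# Toy cumulative confirmed counts (made-up numbers).
confirmed = pd.Series(
    [100, 110, 125, 145, 170],
    index=pd.date_range("2020-07-01", periods=5, freq="D"),
    name="Confirmed",
)

# New cases reported on each day.
new_cases = confirmed.diff()

# Growth factor = new cases today / new cases on the previous day.
growth_factor = new_cases / new_cases.shift(1)
print(growth_factor)
```

A growth factor above 1 means the daily increase is itself growing, while a value below 1 means it is slowing down. Days that follow a day with zero new cases produce a division by zero and need separate handling.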

Now let us visualize the growth factor:

Similarly, for active and closed cases:


Now we move to the fun part: the country-wise analysis.

Let us begin by modifying the DataFrame, grouping the dataset by country.

As we did for the global DataFrame, we follow the same process for the country-wise DataFrame to generate new columns for Active Cases, Mortality Rate, and Recovery Rate.

Then we collect the information for the last 24 hours (i.e. July 8th) and create a DataFrame from it so we can start analyzing and plotting the countries that have the highest number of cases.
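The country-wise steps above might look like the sketch below. The names `countrywise` and `last_24h` are hypothetical, the rows are made up, and two assumptions are built in: counts are cumulative, and "the last 24 hours" means the change between the two most recent observation dates of each country.

```python
import pandas as pd

# Toy rows standing in for the real dataset (made-up numbers).
covid = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-07-07", "2020-07-07", "2020-07-08", "2020-07-08"]
    ),
    "Country/Region": ["US", "Italy", "US", "Italy"],
    "Confirmed": [100, 40, 120, 45],
    "Deaths": [5, 4, 6, 4],
    "Recovered": [30, 20, 40, 25],
})
cols = ["Confirmed", "Deaths", "Recovered"]

# Country-wise totals on the latest date (counts assumed cumulative).
latest = covid[covid["ObservationDate"] == covid["ObservationDate"].max()]
countrywise = latest.groupby("Country/Region")[cols].sum()
countrywise["Active Cases"] = countrywise["Confirmed"] - (
    countrywise["Deaths"] + countrywise["Recovered"]
)
countrywise["Mortality Rate"] = countrywise["Deaths"] / countrywise["Confirmed"] * 100
countrywise["Recovery Rate"] = countrywise["Recovered"] / countrywise["Confirmed"] * 100

# Change over the last day: per-country difference between consecutive dates,
# keeping only the most recent difference for each country.
by_country_date = covid.groupby(["Country/Region", "ObservationDate"])[cols].sum()
last_24h = (
    by_country_date.groupby(level="Country/Region").diff()
    .groupby(level="Country/Region").last()
)

top_confirmed_24h = last_24h.sort_values("Confirmed", ascending=False).head(15)
print(countrywise)
print(top_confirmed_24h)
```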

Using the analyzed data, we visualize several rankings, such as:

1. Top 15 countries with the highest number of Confirmed/Recovered/Death cases within the last 24 hours:

2. Top 15 countries as per the number of Confirmed/Death cases:

3. Top 15 countries according to Mortality/Recovery Rate:

4. Top 15 countries with the most Active/Closed cases:

5. Top 25 countries with maximum Survival Probability having more than 1000 Confirmed cases, and the bottom 15 countries as per Survival Probability:
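Rankings like these can be produced with `nlargest` and `nsmallest` on the country-wise table. The post does not define Survival Probability, so the sketch below assumes it is 100 minus the Mortality Rate; that definition, the helper `rank_by_survival`, and the toy numbers are all assumptions.

```python
import pandas as pd

# Toy country-wise totals (made-up numbers and placeholder country names).
countrywise = pd.DataFrame(
    {"Confirmed": [5000, 2000, 500], "Deaths": [250, 20, 0]},
    index=pd.Index(["A", "B", "C"], name="Country/Region"),
)
countrywise["Mortality Rate"] = countrywise["Deaths"] / countrywise["Confirmed"] * 100
# Assumed definition: Survival Probability = 100 - Mortality Rate.
countrywise["Survival Probability"] = 100 - countrywise["Mortality Rate"]


def rank_by_survival(table, n=25, min_confirmed=1000, best=True):
    """Return the n best (or worst) countries by survival probability,
    restricted to countries with more than min_confirmed confirmed cases."""
    eligible = table[table["Confirmed"] > min_confirmed]
    if best:
        return eligible.nlargest(n, "Survival Probability")
    return eligible.nsmallest(n, "Survival Probability")


top_survival = rank_by_survival(countrywise, n=25)
bottom_survival = rank_by_survival(countrywise, n=15, best=False)
top_confirmed = countrywise.nlargest(15, "Confirmed")  # same pattern for other columns
print(top_survival)
```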

I understand that this has been a long analysis. I hope it was worth your valuable time.

If you want to have a look at the entire code, please visit my GitHub page: