Description of the dataset

Visual Reality
Mar 19, 2021

We decided to work with data on energy and its relationship with wealth. Our main aim is to observe trends in energy production and consumption across the world, in particular how prevalent the different energy sources have been in production and in consumption over the last half century. We would also like to investigate how these quantities depend on other variables such as GDP, population size and GDP per capita. The data are available at the following link: https://github.com/owid/energy-data

The data set consists of 117 variables. An explanation of each variable can be found via this link: https://tinyurl.com/8s8rke

These variables can be broadly grouped into 4 classes: source-specific energy production, source-specific energy consumption, population size and GDP, all reported annually from 1965 to 2019, a span of 54 years. The main data set has 10134 observations for 242 countries, and each row holds the annual data for one country. Besides country code, country and year, the other 114 variables are numerical energy data for the different sources (e.g., coal, electricity, gas, oil, solar, wind), such as absolute consumption, consumption per capita, absolute electricity generation and relative source-specific electricity production. The tabular representation of the variables is given below. Overall energy levels are given in TWh (terawatt-hours) and per-capita energy levels are given in kWh (kilowatt-hours).
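The counts quoted above (rows, columns, countries, year range) are easy to re-check once the data are loaded. The following is a minimal sketch, assuming the table has identifier columns named `iso_code`, `country` and `year`; those names, the helper `summarize_panel` and the toy table are illustrative assumptions, not something taken from this post.

```python
import pandas as pd

# Assumed identifier columns; adjust if the CSV names them differently.
ID_COLUMNS = ["iso_code", "country", "year"]


def summarize_panel(df: pd.DataFrame) -> dict:
    """Return basic counts for a table with one row per country and year."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_countries": df["country"].nunique(),
        "first_year": int(df["year"].min()),
        "last_year": int(df["year"].max()),
        # Every column that is not an identifier is treated as energy data.
        "n_data_columns": len([c for c in df.columns if c not in ID_COLUMNS]),
    }


# Tiny made-up table so the sketch runs without downloading anything.
toy = pd.DataFrame(
    {
        "iso_code": ["AAA", "AAA", "BBB"],
        "country": ["Aland", "Aland", "Bland"],
        "year": [1965, 1966, 1965],
        "coal_consumption": [1.0, 1.5, None],
    }
)
print(summarize_panel(toy))
```

On the real data, the same function could be applied to the result of `pd.read_csv(...)` on a local copy of the CSV from the repository linked above. The repository is updated over time, so the counts may differ from the ones reported here.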

Each row in the table represents an energy source (biofuels, coal, etc.) and each column represents a type of data for the corresponding energy source.

· % electricity: What percentage of an energy source is used for generating electricity.
· % growth: The percentage change (positive or negative) in the consumption of an energy source, compared to the previous year.
· % share: What percentage of the overall energy level an energy source constitutes.
· Consumption: Overall consumption of an energy source.
· Consumption per capita: Consumption of an energy source per person.
· Electricity generation: Electricity generated from an energy source.
· Electricity per capita: Electricity generated from an energy source per person.
· Production: Total energy production from an energy source.
· Production per capita: Energy produced from an energy source per person.
· Annual production change: The observed change in the production of an energy source compared to the previous year, given both in percentages and in absolute energy levels.
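To make the relationship between some of these column types concrete, here is a minimal sketch of how per-capita consumption, % growth and % share could be derived from absolute values. It assumes totals in TWh, a `population` column and a column holding the overall energy level; the column names, the helper `add_derived_columns` and the toy numbers are hypothetical, and the published data set may compute these quantities differently.

```python
import pandas as pd

KWH_PER_TWH = 1e9  # 1 TWh = 1,000,000,000 kWh


def add_derived_columns(df: pd.DataFrame, source: str, total: str) -> pd.DataFrame:
    """Add per-capita, growth and share columns for one energy source.

    `source` is a column prefix such as "coal"; `total` is the column that
    holds the overall energy level (both are assumed names).
    """
    out = df.sort_values(["country", "year"]).copy()
    cons = f"{source}_consumption"

    # Consumption per capita: convert TWh to kWh, then divide by population.
    out[f"{source}_cons_per_capita_kwh"] = out[cons] * KWH_PER_TWH / out["population"]

    # % growth: change relative to the previous year, within each country.
    previous = out.groupby("country")[cons].shift(1)
    out[f"{source}_cons_growth_pct"] = (out[cons] - previous) / previous * 100

    # % share: the source's consumption as a percentage of the overall level.
    out[f"{source}_share_pct"] = out[cons] / out[total] * 100
    return out


# Made-up numbers for a single hypothetical country.
toy = pd.DataFrame(
    {
        "country": ["Aland", "Aland"],
        "year": [2000, 2001],
        "population": [1_000_000, 1_000_000],
        "coal_consumption": [100.0, 110.0],  # TWh
        "primary_energy_consumption": [200.0, 220.0],  # TWh
    }
)
result = add_derived_columns(toy, "coal", "primary_energy_consumption")
print(result.filter(like="coal_").round(2))
```

The first year of each country has no previous year to compare against, so its growth value is missing by construction. This is one reason the growth columns always contain some missing values.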

Missing Data:

Only coal, gas and oil have information on all of the variables explained above. However, we should not limit our analysis to these 3 energy sources: we can still compare different energy sources using the variables we have. A second major issue is missing values within rows. For example, there is no information on the percentage change in biofuel consumption prior to 2010, and energy per GDP values are missing from 2017 onwards. Overall, there are many missing values for each country across different variables. If we consider our data matrix with dimensions of 117x10134 (1185678 cells), only around 51% of the cells have values. There are only 125 rows with no missing values, so complete case analysis is out of the question. How we deal with missing data will be the topic of a future blog post.
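The figures in this section (total cells, share of filled cells, number of complete rows) can be computed with a few lines of pandas. The sketch below is one way to do it; the helper `missingness_summary` and the toy table are illustrative, and running it on a newer version of the data will not necessarily reproduce the 51% and 125 reported above.

```python
import pandas as pd


def missingness_summary(df: pd.DataFrame) -> dict:
    """Summarize how much of a table is filled in."""
    n_cells = df.shape[0] * df.shape[1]
    n_filled = int(df.notna().sum().sum())
    return {
        "n_cells": n_cells,
        "share_filled": n_filled / n_cells,
        # Rows usable in a complete case analysis (no missing value at all).
        "n_complete_rows": int(df.notna().all(axis=1).sum()),
        # Per-column share of missing values, worst columns first.
        "share_missing_by_column": df.isna().mean().sort_values(ascending=False),
    }


# Cell count for the dimensions quoted in the text.
assert 117 * 10134 == 1_185_678

# Made-up table: 12 cells, 2 of them missing.
toy = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, 4.0],
        "b": [1.0, 2.0, None, 4.0],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)
summary = missingness_summary(toy)
print(summary["n_cells"], summary["n_complete_rows"])
print(round(summary["share_filled"], 3))
print(summary["share_missing_by_column"])
```

The per-column breakdown is useful for spotting variables such as the biofuel growth column, which are empty for whole ranges of years rather than missing at random.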
