Introduction and Problem:
Global warming is a central topic in environmental protection because it has continued to worsen in recent years. It results from human practices such as the emission of greenhouse gases. Global warming raises the temperature of the oceans and the earth’s surface, which causes polar ice caps to melt, sea levels to rise, and precipitation patterns to become abnormal, for example flash floods, excessive snow, or desertification (Rinkesh, 2017). Given these effects, global warming will lead to disaster for all human beings if no effective action is taken. The objective of this project is to identify the most distinguishing factors that contribute to global warming by applying several feature selection methods. The scope of the analysis is the relationship between housing characteristics, equipment characteristics, and homeowners’ behavior on the one hand and the energy consumption of residential homes on the other. The project goal is to find the most relevant factors that directly drive higher energy consumption and to give suggestions for reducing greenhouse gas emissions.
The data is collected by the American Housing Survey (AHS) and describes residence characteristics, appliance characteristics, and homeowners’ behaviors together with the energy consumption of residential homes. The full AHS microdata contains 41,503 housing units; each unit is described by 23 topics, 87 subtopics, and 1,851 dummy and numeric variables, which gives a high-dimensional feature space. It provides current information on a wide range of housing subjects, including mortgages and other housing costs, and energy usage. Gas amount (GASAMT) is the dependent variable, and the remaining variables are treated as independent variables in the following analysis.
Data Transformation Methodology:
Before the dataset can be used to accomplish the project goal, it must first be pre-processed. Because the dataset is extracted from survey results, it contains many NA values caused by missing or skipped questions. Simply deleting every column that contains an NA is not a viable approach, because then all columns would be deleted. Instead, I use subset selection and low-variance feature removal to address the NA problem.
Subset selection – reduce the NA rows
The file contains a large number of NA values. One workable way to deal with them is to delete rows that contain more than 40% NA values, because a large share of NAs can lead to wrong predictions when choosing distinguishing factors. After this step, the number of columns is reduced to 637.
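The 40% rule can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the project's actual preprocessing code: the helper name `drop_sparse`, the toy values, and the column names other than GASAMT are hypothetical. The helper takes an `axis` argument so the same rule can be applied along rows or along columns; which one matches the original preprocessing is an assumption.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, max_na_fraction=0.40, axis=0):
    """Drop rows (axis=0) or columns (axis=1) whose NA share exceeds the limit."""
    if axis == 0:
        na_share = df.isna().mean(axis=1)  # NA fraction of each row
        return df.loc[na_share <= max_na_fraction]
    na_share = df.isna().mean(axis=0)      # NA fraction of each column
    return df.loc[:, na_share <= max_na_fraction]


# Toy survey extract (hypothetical values; only GASAMT is named in the report).
toy = pd.DataFrame({
    "GASAMT": [40.0, 55.0, np.nan, 30.0],
    "KITCHENS": [1.0, np.nan, np.nan, 1.0],
    "WASHER": [1.0, np.nan, np.nan, 2.0],
    "MOSTLY_NA": [np.nan, np.nan, np.nan, 5.0],
})

rows_kept = drop_sparse(toy, axis=0)  # keeps housing units with at most 40% NA
cols_kept = drop_sparse(toy, axis=1)  # keeps variables with at most 40% NA
```

Filtering by share of missing values, rather than dropping anything that contains a single NA, keeps most of the survey usable while still removing the sparsest entries.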
Remove Low Variance Features
VarianceThreshold is a simple baseline approach that removes all features whose variance does not meet a threshold. Low variance means that housing units give nearly the same answer to the same question; such features can be regarded as noise with little analytical value. Removing low-variance features reduces the number of variables from 637 to 544. Removing NAs also makes the dataset smaller, which shortens processing time and supports more accurate results when choosing factors for energy prediction.
The final goal of this project is to use distinguishing factors to predict gas energy consumption. Thus, when selecting distinguishing factors, a multiple linear regression (MLR) model is used to select the most related factors as independent variables for predicting the dependent variable, energy consumption, for both gas and electricity.
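A minimal sketch of this step with scikit-learn's VarianceThreshold is shown here. The toy matrix and the threshold value of 0.1 are assumptions for illustration; the report does not state which threshold was used.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: column 0 is constant, columns 1 and 2 vary.
X = np.array([
    [1, 0, 3.0],
    [1, 1, 1.0],
    [1, 0, 4.0],
    [1, 1, 2.0],
])

# Keep only features whose variance is above the (assumed) threshold.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

# Indices of the columns that survived the filter.
kept = selector.get_support(indices=True)
```

The constant column carries no information about differences between housing units, so it is the one removed in this toy case.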
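An MLR with an intercept can be written as
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon $$
where $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the regression coefficients, and $\varepsilon$ is the error term.
Gas amount (GASAMT) is the dependent variable, and the remaining variables are set as independent variables in the MLR.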
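To make the model concrete, a multiple linear regression with an intercept can be fit by ordinary least squares. The sketch below uses synthetic data with made-up coefficients in place of the AHS variables, so every number in it is an assumption.

```python
import numpy as np

# Synthetic stand-in for the survey: 200 units, 3 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])  # hypothetical coefficients
y = 4.0 + X @ true_beta + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, betas = coef[0], coef[1:]
```

The fitted coefficients are the quantities that the feature selection methods in the following sections rank.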
Algorithm 1: SelectFromModel
SelectFromModel is a meta-transformer in scikit-learn (Python) that can be used with any estimator that exposes a coefficient (coef_) or feature importance (feature_importances_) attribute after fitting.
Python Result on Gas Amount:
The result is somewhat overwhelming because SelectFromModel lists every factor that has a strong relation with gas amount, but it still gives an overall picture of which factors drive gas consumption.
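How SelectFromModel picks features can be sketched on synthetic data. The choice of LinearRegression as the estimator and of "mean" as the threshold are assumptions for this illustration, and the data is made up; the report does not list the settings that were actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 5.0 * X[:, 0] + 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# Keep features whose absolute coefficient is at least the mean
# absolute coefficient across all features.
selector = SelectFromModel(LinearRegression(), threshold="mean")
selector.fit(X, y)

selected = selector.get_support(indices=True)
```

Because the number of selected features depends on the threshold rather than on a fixed count, the output can be long on a wide dataset. Coefficient-based selection also assumes the features are on comparable scales.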
Algorithm 2: Recursive Feature Elimination (RFE):
RFE, available in scikit-learn (Python), selects features by recursively considering smaller and smaller sets of features. The number of features to keep is set by the ‘n_features_to_select’ parameter. Here I set this parameter to 10, which means only the top 10 factors related to gas consumption are returned.
Python Result on Gas Amount:
Compared with SelectFromModel, the result is clearer because n_features_to_select limits the output. RFE returns exactly the number of features requested.
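The elimination procedure can be sketched with scikit-learn's RFE class. The estimator (LinearRegression), the synthetic data, and the choice of 2 features for this toy case are assumptions; the analysis above uses a larger value on the AHS data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 1 and 4 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 1] - 3.0 * X[:, 4] + rng.normal(scale=0.1, size=300)

# Refit the model repeatedly, dropping the weakest feature each round
# (step=1) until only n_features_to_select remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)

top = np.flatnonzero(rfe.support_)  # indices of the retained features
ranks = rfe.ranking_                # 1 for retained, larger for dropped earlier
```

The `ranking_` attribute also shows the order in which the remaining features were eliminated, which can help when deciding how many features to keep.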
Algorithm 3: relaimpo and caret packages:
The calc.relimp function from the relaimpo package in R is used to calculate the relative importance of each independent variable for the dependent variable. The variables with high ranking values are selected as distinguishing factors. I choose two types of metric, as listed below:
The result based on type lmg (the R^2 contribution averaged over orderings among regressors):
The result based on type betasq (the squared standardized coefficient):
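The two metrics can be illustrated with a small NumPy re-implementation. This is a sketch of the underlying definitions, not the relaimpo code itself: the function names `r_squared`, `lmg`, and `betasq` and the synthetic data are hypothetical, and the brute-force loop over all orderings is only practical for a handful of regressors.

```python
import itertools
import numpy as np


def r_squared(X, y, cols):
    """R^2 of an OLS fit (with intercept) of y on the given columns of X."""
    if not cols:
        return 0.0
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def lmg(X, y):
    """Average each regressor's sequential R^2 gain over all orderings."""
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(itertools.permutations(range(p)))
    for order in orderings:
        included, previous = [], 0.0
        for j in order:
            included.append(j)
            current = r_squared(X, y, included)
            shares[j] += current - previous  # gain from adding regressor j
            previous = current
    return shares / len(orderings)


def betasq(X, y):
    """Squared coefficients of the regression on standardized variables."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return coef ** 2


# Synthetic data: feature 0 matters most, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lmg_shares = lmg(X, y)
betasq_values = betasq(X, y)
full_r2 = r_squared(X, y, [0, 1, 2])
```

The lmg shares add up to the R^2 of the full model, which makes them readable as a decomposition of explained variance; betasq values do not have this property when regressors are correlated.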
In conclusion, three algorithms are applied to select the most distinguishing factors for predicting gas energy consumption. SelectFromModel lists all independent variables that have a strong relation with the gas consumption amount; 94 variables are selected with this method. RFE works like a narrowed-down version of SelectFromModel because it selects a specific number of independent variables (9 for this analysis) by setting the number of features to keep. The relaimpo method ranks variables by the relative importance of each independent variable for the dependent variable. After comparing the results of each method, I decide to use RFE to select the top 9 distinguishing factors: kitchens, washer, heat fuel, roof hole, electricity amount, oil amount, other amounts, trash amount, and water amount. These are the 9 variables most strongly related to gas consumption in residential buildings.
Analysis and Recommendation
Using the RFE model, the top 9 distinguishing factors are selected: kitchens, washer, heat fuel, roof hole, electricity amount, oil amount, other amounts, trash amount, and water amount. Among these factors, kitchens and heat fuel are the main drivers of gas consumption, because cooking and heating are directly fueled by gas. The remaining factors affect gas consumption indirectly. For example, the roof hole variable describes the condition of the residential building. In winter, when people heat their homes with fuel, they waste more fuel if the building has roof holes. Also, when people cook, their washer use, water amount, trash amount, and so on can increase indirectly. As mentioned at the beginning, global warming is the result of human practices such as the emission of greenhouse gases. These 9 factors can increase gas consumption and thereby contribute to global warming.
Preventive action should be taken by environmental organizations, city construction organizations, and residents:
- Environmental organizations should find alternative energy sources to replace gas, so that total gas consumption is reduced.
- City construction organizations should pay more attention to construction design and use durable interior and exterior materials for residential buildings, which can also reduce gas consumption.
- Building residents should turn off fuel-burning and electrical equipment before they leave home.
All three recommendations can help reduce gas consumption and thereby contribute to preventing global warming.
Rinkesh. (2017). Environmental Problems. Conserve Energy Future. Retrieved from https://www.conserve-energy-future.com/15-current-environmental-problems.php#abh_posts
U.S. Census Bureau. American Housing Survey (AHS). Retrieved from https://www.census.gov/programs-surveys/ahs.html