Using linear regression and ANN techniques in determining variable importance

Author(s)

Mbandi, Aderiana Mutheu

Date Issued

2009

Type

Thesis

Publisher

Cape Peninsula University of Technology

Abstract

The use of neural networks in chemical engineering is well documented. There has
also been an increase in research concerned with the explanatory capacity of neural
networks, although this has been hindered by the perception of Artificial Neural
Networks (ANNs) as a black-box technology.

Determining variable importance in complex systems that have many variables, as
found in the fields of ecology, water treatment, petrochemical production, and
metallurgy, would reduce the number of variables used in optimisation exercises,
easing the complexity of the model and ultimately saving money. In the process
engineering field, the use of data to optimise processes is limited if some degree
of process understanding is not present.

The project objective is to develop a methodology that uses Artificial Neural Network
(ANN) technology and Multiple Linear Regression (MLR) to identify explanatory
variables in a dataset and their importance to process outputs. The methodology is
tested using data that exhibit defined and well-known numeric relationships. The
numeric relationships are represented by four equations.

The research project assesses the relative importance of the independent variables
by using the “dropping method” on a regression model and on ANNs. Regression,
traditionally used to determine variable contribution, could be unsuccessful if a
highly nonlinear relationship exists. ANNs could address this shortcoming.
For differentiation, the explanatory variables that do not contribute significantly
towards the output are named “suspect variables”. Ultimately the suspect
variables identified in the regression model and the ANN should be the same, assuming
a good regression model and network.

The dummy variables introduced to the four equations are successfully identified as
suspect variables. Furthermore, the degree of variable importance was determined
using linear regression and ANN models. As the complexity of the equations increased,
the accuracy of the linear regression models decreased, and suspect variables were
no longer correctly identified. The complexity of the equations does not affect the
accuracy of the ANN model, and the suspect variables are correctly identified.

The use of R² and average error in establishing a criterion for identifying suspect
variables is explored. It is established that the cumulative variable importance
percentage (additive percentage) has to be below 5% for the explanatory variable to
be considered a suspect variable. Combining linear regression and ANN provides
insight into the importance of explanatory variables, and suspect variables and
their contribution can be determined. Once identified, suspect variables can be
eliminated from the model, simplifying it and increasing its accuracy.
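
As an illustration only, the dropping method and the 5% criterion described above might be sketched as follows for the linear regression case. This is not the thesis's own implementation: the synthetic equation, the function names `fit_r2` and `dropping_importance`, and the definition of importance as each variable's percentage share of the total loss in R² are all assumptions made for this example, since the abstract does not define how the additive percentage is computed.

```python
import numpy as np


def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def dropping_importance(X, y):
    """Drop each variable in turn, refit, and record the loss in R^2.

    Returns each variable's share of the total loss as a percentage
    (an assumed definition of importance, not taken from the thesis).
    """
    full = fit_r2(X, y)
    drops = np.array([
        max(full - fit_r2(np.delete(X, j, axis=1), y), 0.0)
        for j in range(X.shape[1])
    ])
    total = drops.sum()
    return 100.0 * drops / total if total > 0 else np.zeros_like(drops)


# Hypothetical test equation: three real variables plus one dummy (column 3)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

importance = dropping_importance(X, y)

# Assumed reading of the 5% criterion: flag variables whose share is below 5%
suspect = [j for j, pct in enumerate(importance) if pct < 5.0]
print("importance (%):", np.round(importance, 2))
print("suspect variables:", suspect)
```

The same loop could in principle be applied to an ANN by retraining the network on each reduced input set and comparing its error against the full model, which is where the abstract reports the two techniques diverging as the equations become more nonlinear.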

Additional information

Thesis (MTech (Chemical Engineering))--Cape Peninsula University of Technology, 2009

Subjects

Regression analysis

Neural networks (Comp...

Artificial intelligen...

File(s)

Name

Mbandi_am_MTech_chem_eng_2009

Size

1.42 MB

Format

Adobe PDF

Checksum

(MD5):9aec7459f4ec9c3b40949a1a42427691