Using regression coefficients in a model
Moderator: Jim Duggan

 Posts: 16
 Joined: Sat Jul 18, 2009 5:22 pm
Using regression coefficients in a model
Hi,
We are trying to integrate (spatial) regression analysis with an SD model for a public health project.
From spatial regression analysis, say:
fraction happy people in neighborhood = constant + 0.05(greenspace index) - 0.16(dog poop density)
y = c + m1(a) + m2(b)
m1 is the regression coefficient for the effect of a one-point increase in (a) on (y), etc.
And I want to include greenspace and dog poop density in a model as separate policy variables that can be changed exogenously.
Might it be an okay interpretation to use a lookup table for greenspace with the equation y(a) = c + m1(a),
and a second lookup table with the equation y(b) = c + m2(b),
such that the inputs to the lookups would be greenspace (a) and dog poop density (b), respectively,
and the output from the lookup tables would be a new 'indicated fraction happy people in neighborhood', used to feed a goal-gap structure with the other input being 'actual fraction happy people in neighborhood'?
1. Does this use of a regression coefficient make sense?
And, if so, the second question is (and goes back to a recent discussion on this forum)...
2. Should we multiply or should we add y(a) and y(b) to create a combined 'indicated fraction happy people in neighborhood' to feed the goal-gap structure?
My guess is that in this case we should add them, as that's the relationship specified by the regression equation.
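A minimal sketch of the additive option, under stated assumptions: only the two slope magnitudes come from the equation above; the constant, the negative sign on dog poop density, the input values, and all function names are hypothetical. It suggests that if each lookup carries the constant c, adding the two outputs counts c twice, so one copy has to be subtracted to recover the regression.

```python
# Hypothetical values: only the slope magnitudes come from the post.
C = 0.60     # regression constant (assumed)
M1 = 0.05    # coefficient on greenspace index
M2 = -0.16   # coefficient on dog poop density (negative sign assumed)


def indicated_fraction(a, b):
    """Full regression: y = c + m1*a + m2*b."""
    return C + M1 * a + M2 * b


def y_a(a):
    """Single-variable lookup for greenspace: y(a) = c + m1*a."""
    return C + M1 * a


def y_b(b):
    """Single-variable lookup for dog poop density: y(b) = c + m2*b."""
    return C + M2 * b


def combined_from_lookups(a, b):
    """Add the two lookups, then remove the constant that was counted twice."""
    return y_a(a) + y_b(b) - C


a, b = 2.0, 1.0                       # made-up input levels
naive_sum = y_a(a) + y_b(b)           # too high by exactly C
print(naive_sum, combined_from_lookups(a, b), indicated_fraction(a, b))
```

An equivalent design would keep the constant in only one place and let each lookup return just its own contribution, m1*a or m2*b.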
I can post a sample model tomorrow if it helps clarify the question.
Thoughts? It doesn't need to be a perfect interpretation, we're just trying to put these spatial regression coefficients to good use by integrating them into an SD model....
Thank you!
Best wishes,
Sarah

 Site Admin
 Posts: 179
 Joined: Sat Dec 27, 2008 8:09 pm
Re: Using regression coefficients in a model
Hi Sarah,
What you can do with a cross-sectional (or in this case spatial) formulation inside of an aggregate model depends strongly on the exact formulation being used. If the formulation is linear, then you can just move the spatial integration outside of the equation, and what is true for each neighborhood is true in aggregate for the entire community. That is, if
average happiness in sector = a + b * average green space per person in sector
then
average happiness in community = a + b * average green space per person in community
So that exact same equation applies.
Once you leave linearity this does not work. That is, if y = f(x), then the expected value of y is not equal to f(expected value of x) unless f is linear. It is, however, possible to make assumptions about the distribution of x and come to some meaningful approximations. For example, in your case, you might assume that the exogenous green space effect is proportional across all locations, so that doubling green space would double it everywhere. In this case you could derive computationally the relationship between average green space and average happiness. If there are 2 effects, you might do this as a two-dimensional lookup surface.
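To make the aggregation point concrete, here is a small sketch, not taken from any actual model: the green space values, both response functions, and every name are invented for illustration. It compares the average of f(x) with f(average of x) for a linear and a nonlinear f, then traces an aggregate curve under the assumption that green space is scaled by the same factor everywhere.

```python
from statistics import mean

# Hypothetical green space values for five neighbourhoods (made up).
green_space = [10.0, 20.0, 30.0, 40.0, 100.0]


def linear(x):
    """a + b*x with assumed a and b."""
    return 0.2 + 0.005 * x


def saturating(x):
    """An assumed nonlinear (concave) response."""
    return x / (x + 50.0)


def mean_of_f(f, xs):
    """Average happiness computed neighbourhood by neighbourhood."""
    return mean(f(x) for x in xs)


def f_of_mean(f, xs):
    """Happiness computed from the community-average input."""
    return f(mean(xs))


def aggregate_curve(f, xs, scales):
    """(average x, average y) pairs when every location is scaled alike."""
    return [(s * mean(xs), mean_of_f(f, [s * x for x in xs])) for s in scales]


for f in (linear, saturating):
    print(f.__name__, mean_of_f(f, green_space), f_of_mean(f, green_space))

# Points that could populate an aggregate lookup for the nonlinear case.
print(aggregate_curve(saturating, green_space, [0.5, 1.0, 2.0]))
```

For the linear function the two numbers agree; for the saturating one the average of f(x) falls below f(average x), which is why the aggregate relationship has to be derived from the distribution rather than copied from the per-neighbourhood equation.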
Can't say much more without specifics.

 Posts: 20
 Joined: Mon Jan 12, 2009 3:39 pm
 Location: University at Albany, SUNY
 Contact:
Re: Using regression coefficients in a model
Good morning.
A clarification, please. I'm not familiar with spatial integration, and I don't know the specifics of the two independent variables you use as examples, except from an empirical perspective. When using regression results as the basis for a lookup, the betas (coefficients of the independent variables) have dimensions. In a linear regression relating income to education, such as "wage = alpha + (beta*education) + u", beta carries an implied dimension of "wages/education". Yet using dimensioned variables as input to a table goes against our considered good practice.
If the regression betas are going to be used in lookup tables, I expect that it would be advantageous to have the dependent datum normalized against a standard so that the inputs to the table are dimensionless.
What do you think?
Of course, if the model or its users gain faith in the model's dynamic implications by referencing established literature that includes regression parameters, that's a plus to consider.
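The normalization suggested above can be sketched as follows. This is only an illustration: the coefficient values, the reference point, and all names are assumptions, not figures from any study. It rewrites wage = alpha + beta*education around a reference education level so that the function takes a dimensionless input and returns a dimensionless output.

```python
# Hypothetical regression results (assumed, for illustration only).
ALPHA = 12000.0   # wages
BETA = 2500.0     # wages per year of education

# Reference point used for normalization (assumed).
EDU0 = 12.0                    # years of education
WAGE0 = ALPHA + BETA * EDU0    # wage predicted at the reference point

# Dimensionless sensitivity of relative wage to relative education.
SENSITIVITY = BETA * EDU0 / WAGE0


def relative_wage(relative_education):
    """Dimensionless in, dimensionless out: wage/WAGE0 as a function of edu/EDU0."""
    return 1.0 + SENSITIVITY * (relative_education - 1.0)


def wage(education):
    """Recover the dimensioned prediction from the normalized form."""
    return WAGE0 * relative_wage(education / EDU0)


print(wage(16.0), ALPHA + BETA * 16.0)
```

Both expressions give the same prediction; the normalized form only moves the units out of the table and into the reference values.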
Best,
Eliot Rich

 Posts: 16
 Joined: Sat Jul 18, 2009 5:22 pm
Re: Using regression coefficients in a model
Thank you for the responses. I will try to clarify, as I now have a bit more clarity myself on what's happening.
First, the regression (not my work) is a geographically weighted regression. It looks at 200+ neighbourhoods and finds some significant variables correlated(?) with a specific public health issue. I will ask my colleague if we can try and publish this at some time in the next few months so I don't need to be so vague...sorry...
Now, the regression did not use time-series data...I believe because of small sample sizes in the neighbourhoods, it aggregated 7 years of data into one equation:
y(x) = constant + alpha1*variable1 + alpha2*variable2 + alpha3*variable3 + ... (plus residuals I believe, which are the error deviations from the trendline?)
y(x) is a fraction of the population with a disease. The variables contribute to the disease.
Well, in the end we have not used lookup tables. Instead, we simply use the regression equation. So, we start the model in equilibrium, where the actual y(x), or fraction of the population with the disease, is equal to the regression-indicated y(x). Then, we change a variable...and therefore there is a gap between the actual and the indicated y(x)...so, the stocks adjust in a first-order adjustment process, classic goal-gap behaviour.
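A rough sketch of that structure, assuming Euler integration and entirely made-up coefficients, inputs, and adjustment time (none of these numbers or names come from the actual regression or model):

```python
# Assumed regression: indicated fraction = constant + sum(coefficient * variable).
CONSTANT = 0.10
COEFFS = {"variable1": 0.002, "variable2": 0.001}


def indicated_fraction(inputs):
    """Regression-indicated fraction of the population with the disease."""
    return CONSTANT + sum(COEFFS[name] * value for name, value in inputs.items())


def simulate(indicated, initial, adjustment_time, dt=0.25, steps=40):
    """Euler integration of d(actual)/dt = (indicated - actual) / adjustment_time."""
    actual = initial
    path = [actual]
    for _ in range(steps):
        actual += dt * (indicated - actual) / adjustment_time
        path.append(actual)
    return path


baseline = {"variable1": 50.0, "variable2": 40.0}
policy = dict(baseline, variable2=20.0)    # exogenous change to one variable

start = indicated_fraction(baseline)       # equilibrium: actual equals indicated
target = indicated_fraction(policy)        # new goal after the change
path = simulate(target, start, adjustment_time=2.0)
print(start, target, path[-1])
```

The stock moves from the old indicated value toward the new one at a rate set by the adjustment time; in a subscripted model the same calculation would simply repeat for each neighbourhood.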
I am not sure that the part of Bob's response related to aggregation is relevant to this particular problem, because we have simply subscripted the same structure by the 200+ neighbourhoods and there is no interaction between the neighbourhoods. Although I'm sure that his feedback will be useful in the future (and hopefully I am not missing something important by discounting it in this case?) Again, I will try and supply an example model soon.
Also, I am not sure I understand: if y = f(x) then the expected value of y is not equal to f(expected value of x) unless f is linear...I need to think this over, though I do believe it relevant to the model we are working on.
In an attempt to respond to Eliot's comment...hmmm. By "using dimensioned variables as input to a table goes against our considered good practice" do you mean that using a dimensioned variable as an input to a lookup table sneaks around our good practice of unit consistency? Incredibly, I've never considered this!
Eliot, in the end, I believe that all of the variable1, variable2, etc. are indexed values, e.g. between 0 and 100 where 0 = no green space and 100 = maximum green space imaginable (although there's not necessarily a neighbourhood with a value of 100 for green space).
But not all of the independent variables are indexed...for example, one refers to the average level of education in the neighbourhood...so using the regression equation in the way I described is not really unit-consistent, is it?
The purpose of the SD model is somewhat secondary to the regression...the geographically weighted regression is the core of the research, and we are trying to implement the regression coefficients as exogenous inputs into a simple SD model of disease progression because of the power of even a simple stock and flow model and the desire to add behaviour-over-time.
I hope this helps clarify. I welcome your comments although I probably need more time to digest them and really understand the implications of what you've offered. Again, I apologize for not posting a model. I will do so at my earliest convenience.
Thank you.
Best wishes,
Sarah

 Posts: 152
 Joined: Thu Jan 15, 2009 6:55 pm
 Location: Bozeman, MT
 Contact:
Re: Using regression coefficients in a model
A few thoughts:
1. Using the regression equation directly is probably the most transparent and convenient thing to do. However, you might need to modify the relationships to handle extreme conditions. There could easily be combinations of inputs that don't occur in the original data that produce physically impossible outputs. A typical approach would be to take a linear relationship (which might yield impossibly negative happiness at some point) and transform it with a logistic, so that it preserves the central slope, while constraining the extremes to a fixed range.
2. Units aren't really a problem. In y = a*x, the regression coefficient (a) has units (y per x). However, it's often helpful to put things in a normalized form, like y/y0 = a*(x/x0).
3. Whether you can use the regression directly seems like it boils down to the same question as whether the regression is right. In both cases, a per-cell, aggregate data approach assumes that the time constant of poop->happiness relationships is short with respect to the measurement intervals and that cells don't interact. The latter means that spatial decay of effects is rapid with respect to cell size. If those are approximately true, then everything's OK. One should be able to detect whether this is true in the regression by looking at temporal and spatial correlations of residuals. Also, you could repeat the regression, including values from adjacent cells as explanatory vars for the cell under consideration.
4. I think Bob's point pertains to the relationship between aggregate and disaggregate data in general, which might be important in a number of circumstances. This might crop up, for example, if your grid cells are large, but there are important subgrid-scale processes going on. That might give rise to nonlinearities or tipping point dynamics that invalidate the assumptions of the regression.
5. An alternative to plugging the regression into the cell structure would be to build the spatial model, including extreme conditions constraints, nonlinearities, and feedback among cells, and calibrate it to the data directly. This would be better than the regression, because you wouldn't have to make so many restrictive assumptions, though it might also be computationally burdensome. Even if you couldn't get the spatial-dynamic model to run fast enough to calibrate, you could at least use it to generate synthetic data, and use the data to test the regression approach to see if the answers make sense.
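The logistic transformation in point 1 could look something like the sketch below. It is one possible construction, not a standard recipe: the function name, the choice of reference point x0, and the coefficient values are all assumptions. The logistic is anchored so that it matches the linear relationship's value and slope at x0 while staying between 0 and 1 everywhere.

```python
import math


def bounded_response(x, c, m, x0):
    """Logistic curve matching the line y = c + m*x in value and slope at x0."""
    y0 = c + m * x0
    if not 0.0 < y0 < 1.0:
        raise ValueError("the reference point must give a fraction strictly between 0 and 1")
    # Log-odds at the reference point, plus a slope chosen so that dy/dx = m at x0.
    z = math.log(y0 / (1.0 - y0)) + m * (x - x0) / (y0 * (1.0 - y0))
    # 0.5 * (1 + tanh(z / 2)) equals 1 / (1 + exp(-z)) but cannot overflow.
    return 0.5 * (1.0 + math.tanh(z / 2.0))


# Assumed coefficients: constant 0.5, slope -0.16, reference point x0 = 0.
c, m, x0 = 0.5, -0.16, 0.0
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, c + m * x, bounded_response(x, c, m, x0))
```

Near x0 the two columns track each other; at x = 10 the linear form has gone negative while the logistic version remains a small positive fraction.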
Tom
Blog: http://blog.metasd.com
Model library: http://models.metasd.com
Work: http://ventanasystems.com/ & http://vensim.com/

 Site Admin
 Posts: 179
 Joined: Sat Dec 27, 2008 8:09 pm
Re: Using regression coefficients in a model
Hi Sarah,
With a linear model, using the regression equation directly should work fine, but!
The "but" relates to the fact that the underlying regression equations used data from different times, and this means it assumes the relationship is static. Since the purpose of the SD component is to demonstrate the implications of having a dynamic component, using this directly seems inappropriate.
It is probably better in this case to use the conceptual formulation underlying the regression (this is typically different from the regression equation itself, since regression equations involve accommodation to available data) along with a parametrization that is in the ballpark of the regression equations. Then just show the implications of time-based behavior and see if you can infer from that which way the regression coefficients might be biased, and any implications this would have for policies based on the regression equation. The purpose here is to add a dimension to the static analysis that is effectively qualitative, even though it is a numerical simulation informing a statistical model.
Hope that helps.