Dr. Schneider, a trained statistician and teacher in St. Tammany Parish, has analyzed the Louisiana version of VAM that represents the basis for COMPASS, our new teacher evaluation requirement that uses student test scores as 50% of the teacher evaluation equation. She has provided this are-you-as-smart-as-an-eighth-grader explanation of VAM for us. It can be used as a powerful argument against the legislation passed last year that is being instituted in all schools this year.
READ IT! STUDY IT! ASK QUESTIONS! I will pass on any comments or questions to Dr. Schneider for a response. It is our responsibility to educate other teachers, parents, and legislators about the TRUTH of this devastating piece of legislation.
Value Added Modeling (VAM) and “Reform”: Under the Microscope
Mercedes K. Schneider, Ph.D.
December 28, 2012
Introduction
I am writing this paper with legislators in mind as my
audience, though the contents here are useful for any persons of influence to
read and digest. My primary goal is to
explain the shortcomings of value added modeling (VAM), particularly as such
relates to the Noell and Gleason VAM study presented to the Louisiana
legislature in February 2011 (link
here: http://www.regents.doa.louisiana.gov/assets/docs/TeacherPreparation/LegilsativeValueAddedReportFeb2011FINAL.pdf
). However, I also discuss broader
educational issues as a means of highlighting the limitations of the current
educational “reform” movement in general since it is this current “reform”
environment that promotes tenuous issues such as VAM.
My Background
In this age of self-promoting “reformers,” many perpetuate
the idea that education and experience are irrelevant, especially as concerns
imposing policy upon classroom instruction.
I oppose such a view because a “no experience or training necessary”
view allows those with superficial credentials to promote themselves as experts
when they are not. For this reason I
think it necessary to briefly inform my reading audience of my background,
including my professional experience and credentials.
I am a product of the St. Bernard Parish Public Schools
(1972-85). I hold three traditionally
earned college degrees, all in the field of education: B.S., secondary
education, English and German (LSU, 1985-91); M.Ed., guidance and counseling
(West Georgia, 1996-98); and Ph.D., applied statistics and research methods
(Northern Colorado, 1998-2002). I have
18 full-time years in the classroom and a couple of part-time years, as well. My classroom teaching experience ranges from 7th grade to graduate level, and I have taught several
subjects (English, German, teacher prep, and statistics), in regular ed
classes, alternative school, and special ed settings. I have taught in Louisiana, Georgia,
Colorado, and Indiana. For the past six
years I have taught sophomore English and a teacher prep course in St. Tammany
Parish, at Slidell High School.
As for my research experience, what is especially pertinent
to the discourse that follows is that I was a manuscript reviewer for the Journal
of Mental Health Counseling, the official professional, blind-review
research journal of the American Mental Health Counselors Association (AMHCA).
I later became the associate editor of research for the same journal. For four years (reviewer, 2006-08; associate
editor, 2008-10), I reviewed statistical research studies for their quality,
clarity, pertinence, and completeness for a nationally recognized professional
organization. It is with this experience
in mind that I approached my reading of Noell and Gleason’s VAM study.
Education “Reform” Without Firsthand
Involvement
I find it interesting that for all of the talk of the need
to “reform” education, those who would impose policy upon schools do not
themselves spend time in those schools.
For all of his hot rhetoric against public school teachers who he says
earn a paycheck “for breathing,” Bobby Jindal has yet to spend a couple of
weeks substituting in any public school.
The same is true of John White, with his very limited two or three years
as a nontraditional, Teach for America temporary teacher. John White has not spent time substituting in
any Louisiana classroom. I am also
unaware of any effort on the part of legislators or BESE members to spend
regular time substituting in the public school classrooms in their respective
districts. I applaud Treasurer John
Kennedy for the time he spends each year substituting in public school
classrooms. Unfortunately, Treasurer
Kennedy is the exception, not the rule.
Like those who have the power to alter educational policy,
the VAM study is also removed from the immediacy of the classroom. When those in power are removed from those
over whom they exercise power, there will be a dangerous disconnect between
reality and appearance. To the
legislators reading this, let me say, that disconnect is a rift that will only
widen unless you invest some of your time becoming involved in the actual
school day as a means of informing your educational policy decisions.
Misquoting Teacher Influence
As noted in the Educators for All article, “What if Failing Schools… Aren’t?” (http://educatorsforall.org/blog/2012/3/8/why-schools-fail-or-what-if-failing-schoolsarent.html), Dr. Tabitha Grossman, Senior Policy Analyst at the National Governors Association, misquoted the research on teacher influence on student outcomes: “Teacher effectiveness is the primary
influence on student achievement.” This
misquote feeds “reformer” rhetoric, as it places responsibility for student
outcomes squarely on the teacher.
However, the source Grossman cites actually says, "Teachers have the most immediate in-school effect
on student success." There is a
world of difference between the misquote and the actual finding. “The primary influence” is a global statement
that makes teacher influence sound as though it supersedes even parental
influence. In contrast, “the most immediate
in-school effect” limits teacher influence to the classroom.
Grossman also promoted the idea
that “Research shows that teacher quality is the primary influence on student
achievement." This also is a misquote of an Organization for Economic
Cooperation and Development (OECD) report.
The correct text, as noted in “What if Failing Schools… Aren’t?”, is as
follows:
"Three broad conclusions emerge from research on
student learning. The first and most solidly based finding is that the
largest source of variation in student learning is attributable to differences
in what students bring to school – their abilities and attitudes, and family
and community background. Such factors are difficult for policy makers to
influence, at least in the short-run. (Emphasis added)
"The
second broad conclusion is that of those variables that are potentially open to
policy influence, factors to do with teachers and teaching are the most
important influences on student learning. In particular, the broad consensus is
that “teacher quality” is the single most important school variable
influencing student achievement." (Emphasis added.)
It
is clear from the correct information above that teachers are not “the primary
influence” on student learning. This
certainly brings into question the “value” in “value added modeling” since a
major VAM premise is that the teacher directly controls the student outcome in
the form of a standardized test score.
“Louisiana Ranks 49th in the Nation in
Education”
VAM
attempts to rank teachers according to their “effectiveness” on student
standardized test scores. The idea here
is to discover what teachers “rank” lowest and rid the schools of such
teachers. Then perhaps Louisiana will
“rise in the rankings” on its “race to the top” as the schools are freed from
problematic teachers.
A
word is necessary regarding the nature of ranked data.
First,
if a group is ranked, someone must fall at the bottom of the ranking. So, to say that “Louisiana ranks 49th
in the nation in education” says little about the quality of education in the
classroom. Where there is a ranking,
there must be a bottom. I can walk into
any gifted classroom and rank the students according to IQ. The fact that all are highly intelligent is
irrelevant to the fact that when I rank the students via IQ, someone will be
last. And even in a gifted classroom, there is a stigma attached to being
ranked “last.”
Second,
the nature of rankings also makes the notion of “racing to the top” a foolish
one since all cannot be ranked as top schools.
Third,
the use of rankings can put unnecessary labels on people and institutions and
panic the public into believing a crisis exists if one is ranked “at the
bottom.”
These
points said, rankings do have their place, for rankings can be used to establish
a relationship between two concepts. For
example, I have done a comparison of 1) state educational rankings based on graduation rates with 2) state rankings based on the percentage of families below the poverty level, and I arrived at an interesting conclusion: The relationship between the two rankings is strong. The “reformers” note that “poverty is used as
an excuse” to not improve education.
However, if one examines the state’s own Recovery School District (RSD),
its schools “rank” second lowest of all districts based upon 2012 district
letter grades. The only districts to “earn” an F in 2012 are the RSD-LA and St.
Helena. Interestingly, St. Helena has
the highest percentage of students qualifying for free meals (see “What if Failing
Schools… Aren’t?” link for more info).
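For readers who want to see how such a relationship between two rankings can be quantified, here is a minimal sketch using a Spearman rank correlation. The numbers below are illustrative placeholders only, not the actual graduation-rate or poverty figures, and the code is mine, not part of any cited study.

```python
# Minimal sketch: quantifying the relationship between two state rankings
# with a Spearman rank correlation. The values below are illustrative
# placeholders, NOT the actual graduation-rate or poverty figures.
from scipy.stats import spearmanr

# Hypothetical figures for six states: graduation rate (%) and the
# percentage of families below the poverty level.
graduation_rate = [88.1, 84.5, 79.9, 77.2, 74.8, 71.3]
family_poverty  = [ 8.2, 10.4, 15.1, 13.6, 17.8, 19.9]

# spearmanr ranks each list internally and correlates the ranks.
rho, p_value = spearmanr(graduation_rate, family_poverty)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
# A coefficient near -1 means the two rankings are nearly mirror images:
# the more family poverty, the lower the state's graduation-rate ranking.
```

The closer the coefficient is to -1, the more strongly the two rankings move in opposite directions: states with higher family poverty tend to sit lower in the graduation-rate ranking.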
The
“reformers” who purport that poverty is an excuse cannot seem to “make” RSD
anything other than a “failing” district.
The state promoted the school letter grade system as a means of ranking
schools and districts, yet the state-run RSD does not include school letter
grades on its website. Both Jindal and
the LDOE website publicize the state-run RSD as a “success,” yet RSD has a
disproportionate number of Ds and Fs. Jindal even counts schools with a C as
“failing,” yet RSD has only a handful of Cs because most state-run RSD
schools have not “achieved” even a C ranking. None has achieved an A ranking. Given that issues of poverty are being
ignored, it is not surprising that the state-run RSD schools remain
overwhelmingly “failures” according to the state’s own letter grade ranking
system.
Notice
that Jindal, White, BESE, and Dobard (RSD superintendent) attempt to hide the
poor RSD rankings while continuing to attempt to benefit from promoting
Louisiana as “ranked at the bottom” educationally. If one is loud about the Louisiana 49th
ranking but silent about the low RSD ranking, one is able to both promote the
façade of RSD “success” while fostering the “blame game” of public school
“failure.” The “success” of VAM hinges on this image of public school “failure.”
VAM Teacher Rankings
If
carefully examined and understood, the Noell and Gleason VAM pilot study
reveals the failure of the VAM as a supposed measure of teacher performance. In
their VAM study, George Noell and Beth Gleason rank math and ELA teachers into
five categories based upon student LEAP-21 and iLEAP scores for grades 3 through 9:
“Bottom 1-10%,” “Bottom 11-20%,” “Middle 21-80%,” “Top 81-90%,” and “Top
91-99%.” Based upon the headings of the data tables on pages 12 and 13 of their study, it is clear that the teachers are being ranked (labeled) using student standardized test scores.
The two tables of data (one for math teachers, the other for ELA teachers) measure the stability of the iLEAP and LEAP-21 tests as gauges of teacher performance. Attention to this reporting is very important because the results in these two tables demonstrate just how unstable these two standardized tests are at assessing teacher quality/performance/effectiveness. It is also crucial to note that based upon a reading of the explanation offered by Noell and Gleason, one might not readily see this instability; Noell and Gleason report “moderate stability across years” (pg. 13). In truth, the results are erratic. Here are the tables, followed by a brief, accurate interpretation:
Table 5. Stability of Teacher Ranking in Mathematics across 2008-2009 to 2009-2010

                      2009-2010 Rank
2008-2009 Rank        Bottom 1%-10%   Bottom 11%-20%   Middle 21%-80%    Top 81%-90%    Top 91%-99%
Bottom 1%-10%        *26.8% (135)      18.5% (93)       46.2% (233)       4.4% (22)      4.2% (21)
Bottom 11%-20%        14.8% (71)      *15.6% (75)       62.1% (298)       5.4% (26)      2.1% (10)
Middle 21%-80%        10.0% (508)       9.9% (504)     *64.0% (3,258)     9.3% (475)     6.8% (348)
Top 81%-90%            2.9% (14)        4.6% (22)       54.0% (259)     *22.1% (106)    16.5% (79)
Top 91%-99%            1.8% (8)         1.5% (7)        35.1% (160)      15.8% (72)    *45.8% (209)

(* marks the diagonal: teachers re-ranked in the same category in both years.)
Table 6. Stability of Teacher Ranking in English Language Arts across 2008-2009 to 2009-2010

                      2009-2010 Rank
2008-2009 Rank        Bottom 1%-10%   Bottom 11%-20%   Middle 21%-80%    Top 81%-90%    Top 91%-99%
Bottom 1%-10%        *22.3% (126)      17.5% (99)       52.7% (298)       4.9% (28)      2.7% (15)
Bottom 11%-20%        17.1% (92)      *15.2% (82)       59.7% (321)       5.0% (27)      3.0% (16)
Middle 21%-80%         9.9% (575)       9.8% (566)     *63.2% (3,656)     9.5% (551)     7.6% (437)
Top 81%-90%            3.2% (17)        6.1% (33)       55.4% (298)     *17.7% (95)     17.7% (95)
Top 91%-99%            4.5% (23)        2.7% (14)       37.1% (190)      18.2% (93)    *37.5% (192)

(* marks the diagonal: teachers re-ranked in the same category in both years.)
Each cell in the tables reports the percentage of teachers in the study (and, in parentheses, the actual number of teachers) who were first ranked one way using 2008-09 student test scores (read the row label at the left) and then ranked either the same way (the diagonal, marked with an asterisk) or a different way (all other cells) using 2009-10 student test scores (read the column label at the top). For example, the 4.5% (23 teachers) in Table 6 (immediately above this text) represents the percentage of ELA teachers originally ranked in 2008-09 in the top 91-99% (row at the left) but re-ranked in 2009-10 in the bottom 1-10% (column at the top), even though the teachers changed nothing in their teaching. Thus, these two tables show how poorly the standardized tests classify teachers (2008-09) and then reclassify them (2009-10) into their original rankings. Tables 5 and 6 are a test of the
consistency of using standardized tests to classify teachers. It is like
standing on a bathroom scale; reading your weight; stepping off (no change in
your weight); then, stepping on the scale again to determine how consistent
the scale is at measuring your weight.
Thus, if the standardized tests are stable (consistent) measures,
they will reclassify teachers into their original rankings with a high level of
accuracy. This high level of
accuracy is critical if school systems are told they must use standardized
tests to determine employment and merit pay decisions.
I have marked with an asterisk the cells on the diagonals of both tables to show just how unstable these two standardized tests are at classifying and then correctly reclassifying teachers. If the iLEAP and LEAP-21 were stable, the marked percentages on the diagonals of both tables would be very high, almost perfect (99%).
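To see exactly where the diagonal percentages come from, here is a short sketch that recomputes the year-to-year agreement directly from the raw teacher counts in Table 5. The counts are copied from the table above; the code and variable names are mine, offered only as an illustration.

```python
# Recomputing the diagonal (year-to-year agreement) of Table 5 from its
# raw teacher counts. Rows = 2008-09 rank, columns = 2009-10 rank.
# Counts are copied from Table 5 above; variable names are my own.
table5_counts = [
    [135,  93,  233,  22,  21],   # Bottom 1-10%
    [ 71,  75,  298,  26,  10],   # Bottom 11-20%
    [508, 504, 3258, 475, 348],   # Middle 21-80%
    [ 14,  22,  259, 106,  79],   # Top 81-90%
    [  8,   7,  160,  72, 209],   # Top 91-99%
]
labels = ["Bottom 1-10%", "Bottom 11-20%", "Middle 21-80%",
          "Top 81-90%", "Top 91-99%"]

same_rank_total = 0
teacher_total = 0
for i, row in enumerate(table5_counts):
    row_total = sum(row)
    same = row[i]                 # diagonal cell: re-ranked the same way
    print(f"{labels[i]:>14}: {same / row_total:.1%} re-ranked in the same category")
    same_rank_total += same
    teacher_total += row_total

print(f"Overall agreement: {same_rank_total / teacher_total:.1%}")
# Prints roughly 26.8%, 15.6%, 64.0%, 22.1%, 45.8% for the rows, and an
# overall agreement of about 54% -- far from the near-perfect consistency
# one would demand of a measure used for employment decisions.
```

Even with the 60-percentile-wide middle band inflating the agreement, barely more than half of the math teachers land in the same category two years in a row.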
Here is what we see from the
diagonal in Table 5:
If a math teacher is originally ranked in the lowest category, then without altering his or her teaching, that teacher will be re-ranked in the lowest category only 26.8% of the time. Likewise, a math teacher originally ranked in the highest category would be re-ranked in the highest group only 45.8% of the time, even if she/he continued to teach the same way.
Now, let’s examine some info
off of the diagonal for Table 5:
A math teacher originally
ranked in the highest category will be re-ranked in the middle category 35.1%
of the time and re-ranked in the lowest category 1.8% of the time. These alterations in ranking are out of
the teacher’s control and do not reflect any change in teaching. Even though 1.8% might seem low, notice that
in the study alone, this represented 8 math teachers, 8 real human beings, who
could potentially lose their jobs and face the stigma of being labeled “low
performers.”
As we did for Table 5, let’s
consider the diagonal for Table 6:
If an ELA teacher is originally ranked in the lowest category, then without altering his or her teaching, that teacher will be re-ranked in the lowest category only 22.3% of the time. Likewise, an ELA teacher originally ranked in the highest category would be re-ranked in the highest group only 37.5% of the time, even if she/he continued to teach the same way.
As we did for Table 5, let’s
examine some info off of the diagonal for Table 6:
An ELA teacher originally
ranked in the highest category will be re-ranked in the middle category 37.1%
of the time and re-ranked in the lowest category 4.5% of the time. These alterations in ranking are out of
the teacher’s control and do not reflect any change in teaching. Even though 4.5% might seem low, notice
that in the study alone, this represented 23 ELA teachers who could potentially
lose their jobs and face the stigma of being labeled “low performers.”
Other info based upon Tables
5 and 6:
54.2% of math teachers and
62.5% of ELA teachers originally ranked in the top 10% (and likely eligible for
merit money) would have lost their status due to faulty measurement. In the other direction, 601 math teachers who were not originally ranked in the lowest 10% (71 + 508 + 14 + 8 teachers from the other categories) were re-ranked into the lowest 10% the following year; for ELA teachers, the corresponding number misclassified into the lowest category from other, higher categories is 707 (92 + 575 + 17 + 23). Keep in mind
that being ranked into the lowest category brings with it the threat of job
loss.
Conclusions Based on Tables
5 and 6:
Given the capriciousness of
the teacher rankings as represented in Tables 5 and 6, it is illogical to
assume that teachers were ranked correctly to begin with. Let’s reconsider the bathroom scale
example: If I stand on and step off a
bathroom scale twice in sequence, and the scale reports my weight as 158 and
172, respectively, why should I believe that the first measure of 158 is my
“true” weight?
I would discard the bathroom
scale.
By the same logic, if teacher
rankings are erratic from measurement time 1 to measurement time 2, why should
one believe that the first measure is an accurate reflection of teacher
performance?
Time to throw out the VAM.
Given the instability of
teacher rankings based upon standardized test scores, achieving “tenure” as it
is now defined (5 out of 6 years as rated “highly effective”) is almost a
statistical impossibility. John White
plans to make some exception for teachers ranked as “highly effective” to
circumvent the issue of being subsequently rated as “ineffective,” a situation
faced by teachers such as those from South Highlands Elementary Magnet School
in Caddo Parish; however, he has offered no plans for the teachers of other
rankings who will be erroneously re-ranked as “ineffective.” The solution is to discard the entire system
and not enter into situations whereby BESE is asked to make “patchwork”
exceptions to a model replete with classification weaknesses.
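As a rough back-of-the-envelope illustration of why five “highly effective” ratings in six years is so improbable under an unstable measure, the sketch below computes a simple binomial probability. It assumes, purely for illustration, that the years are independent and that a teacher’s chance of landing in the top category in any single year is the 45.8% top-rank retention rate from Table 5; neither assumption comes from the Noell and Gleason study.

```python
# Back-of-the-envelope sketch: probability of being rated in the top
# category at least 5 times in 6 years, ASSUMING independent years and a
# fixed per-year probability. The 0.458 figure is borrowed from the Table 5
# top-rank retention rate purely as an illustrative stand-in.
from math import comb

p_top = 0.458          # assumed per-year chance of a top rating
n_years = 6
prob = sum(comb(n_years, k) * p_top**k * (1 - p_top)**(n_years - k)
           for k in range(5, n_years + 1))
print(f"P(at least 5 of 6 years rated in the top category) = {prob:.1%}")
# About 7.5% -- even a teacher who lands in the top category nearly half
# the time would rarely string together the 5-of-6 record required.
```

Under those illustrative assumptions, even a teacher who reaches the top category nearly half the time would achieve the required 5-of-6 record only about 7.5% of the time.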
Even George Noell confirmed
issues of VAM inaccuracy. For the
actual email from George Noell regarding VAM classification inaccuracy, see http://www.lae.org/cms/Stability+of+the+Louisiana+Value+Added+Model+%28VAM%29/263.html
; for additional VAM inaccuracy info, including information from other states, see http://www.louisianaeducator.blogspot.com/2012_09_30_archive.html.
Data Quality Limitations of the
VAM Study
Based upon its inability to rank teachers with any
precision, the VAM is useless. However, the outcome inaccuracies are not the
only fatal flaws in the Noell and Gleason VAM study. Another notable limitation
involves issues of data quality. No
study is better than its data. Indeed, the three data issues I will discuss
here render the results of the study limited at best and untrustworthy and
unusable at worst.
Nonrandom Selection of Schools
When I attended the April meeting in St. Tammany where John
White addressed St. Tammany faculty and administration, more than once teachers
expressed concerns about the teacher evaluation pilot study. In all cases, John White asked, “Did your
school choose to participate in the study?” as a means of ending discussion on
the topic. I remember thinking then,
“Self-selection into the research study is poor research practice because it
limits generalizability of results.”
Let me explain.
The schools that participated in the Noell and Gleason VAM
study were “self-selected”; that is, these schools were not randomly selected
to participate but instead decided on their own to participate (or their
district administration decided for them).
This is poor research practice; in order to generalize the results to
all Louisiana schools, the participating schools should have been randomly chosen
from a pool of all Louisiana schools.
Random selection is a foundation of statistics/probability. Games of chance follow the laws of
probability, and gambling establishments are careful to guard these laws since
laws of probability guarantee that the house will win overall. I cannot walk up to a poker table and ask to
see the deck in order to intentionally remove certain cards, or put them in
order to my liking, or even to see them in order to choose my hand. These actions violate the rules of probability,
and the outcome of the game is “not fair” since I did not play with a complete
and/or randomly shuffled deck.
The “outcome of the game” in the VAM study is the ranking
and re-ranking of the teachers as represented in previously discussed Tables 5
and 6. The “tampered deck” is the group
of self-selecting schools. The “randomly
shuffled deck” should have been the correctly chosen, randomly selected schools
from among all Louisiana schools. The
outcome is affected because the results of the study cannot be extended to
apply to all Louisiana schools but instead to only the specific schools
participating in the study. In the language of statistics, the results of the
study “cannot generalize” to all Louisiana schools.
Even if this study had consistently ranked and re-ranked
teachers (which it did not), the result should not have been applied to any
schools except the ones included in this study.
Given what I know about “reformer-speak,” let me add that it is
unfounded speculation to conclude that VAM stability is “only poor for the
tested schools but not for the untested schools.”
Unverified Data
Regarding
the results in Tables 5 and 6, Noell and Gleason comment, “It is anticipated that a full set of
verified rosters may produce more stable results” (pg. 12). This brief comment is an amazing admission,
one that connects to a previous statement on page 8: “The comparative analyses between years
described below are based on unverified rosters for 2007-2008 and
2008-2009. It is the authors’ hypothesis that when two years of verified
rosters are available, the relationship between consecutive years may be
strengthened as error variance associated with inaccurate student-teacher
links is removed.” (Emphasis added.)
The researchers were uncertain whether teachers were correctly
matched to students. Such
uncertainty about data integrity renders the results of this study
useless.
Up to
this point in this discourse, the Noell and Gleason VAM study has three serious
(and two fatal) limitations: erratic classification (fatal); nonrandom selection of participating schools (limiting); and inaccurate teacher-student information (fatal).
“Shared Students”
A
third fatal data limitation involves Noell and Gleason’s treatment of data on
shared students, those students taught by more than one math or ELA teacher
during the course of a school year. As
Noell and Gleason note on page 10: “Students who had multiple teachers in a
content area were retained in the dataset for their promotional path for each
teacher, but were weighted in
proportion
to the number of teachers they had in that subject. So for example, if a
student had
two
mathematics teachers, the student would have a 0.5 weight in contributing to
each teacher’s assessment result.” (Emphasis added.)
In
other words, Noell and Gleason divided a student’s score and “gave an equal
part” to count towards each teacher’s evaluation. This decision defeats the goal of
assessing each teacher: How is one to distinguish “good” from “poor” teachers
if the score is equally shared? An equal sharing of the student’s score
presumes that all teachers contribute equally to the student’s education, that
all contributing teachers were “equally good.”
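A tiny sketch of the weighting rule quoted above makes the problem visible: because every teacher of a shared student receives the same fractional share of the same score, the shared students contribute nothing that could distinguish one teacher from another. The function below is my own illustrative rendering of the rule, not code from the study.

```python
# Illustrative rendering of the "shared student" weighting rule quoted above:
# a student with k teachers in a subject contributes with weight 1/k to each
# teacher. This is my own sketch, not code from the Noell and Gleason study.

def shared_contributions(student_score, teacher_ids):
    """Return each teacher's weighted contribution from one shared student."""
    weight = 1.0 / len(teacher_ids)
    return {teacher: (student_score, weight) for teacher in teacher_ids}

# One student, EOC score of 710, taught by two English II teachers.
print(shared_contributions(710, ["Teacher A", "Teacher B"]))
# {'Teacher A': (710, 0.5), 'Teacher B': (710, 0.5)}
# Both teachers receive an identical (score, weight) pair, so nothing in the
# shared student's data can tell a "good" teacher from a "poor" one.
```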
I
teach a number of students who will have me for English II for only the first
semester. They will have another English
II teacher for the second semester. It
is nonsensical to say on one hand that my colleague and I are being assessed as
independent teachers and on the other to equally divide the measure that
supposedly determines the quality of our teaching, the student’s EOC (End of
Course) English II score, between the two of us.
Equally dividing the standardized test scores of shared students among the two or more teachers who taught them is yet another fatal flaw in the Noell and Gleason VAM study. And there’s
more. In the next section, I address the
final flaw for this paper: the selected analysis, HLM.
Hierarchical Linear Modeling (HLM)
HLM
is the statistical analysis used in VAM studies. I would like to present a brief, basic explanation of why HLM cannot work as a means of determining teacher contribution to a student’s standardized test score. Please keep in mind my discussion under the previous section, “Misquoting Teacher Influence”; that section shows how faulty the premise of connecting student standardized test scores to teacher performance is in the first place. With that reminder in place, I offer this section as a means of understanding the limitation of the HLM analysis itself in determining teacher contribution based upon the student’s score.
HLM
is a “nested” procedure. That is, it
assumes a layering to the data. Think of
it as concentric circles, one inside of the next. In an educational context, the student is
“inside” of the class, which is “inside” of the school, which is “inside” of
the district. HLM offers numeric info
for each of these levels as an entire level. In VAM, part of a student’s score is unique
to the student, and part is attributed to his/her class, and part to the
school, and part to the district (if researchers choose to measure the school
and district). This is key: The part of a student’s score attributed to the
student’s class is not connected to any specific component of the
classroom environment, including the teacher. HLM is not that precise. It can
only offer numeric info on each level as a whole level.
In
their analysis, Noell and Gleason presume that “teacher” and “class” are
interchangeable. They are not. The teacher is not the “class.” Furthermore, attempts to isolate the teacher
from all other possible influences that form the conglomerate “class” are
futile since it is not possible using HLM to control for every other
classroom influence and “leave” only the teacher influence as directly
connected to the student test score.
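To illustrate the nesting argument, here is a small simulation, entirely my own construction and not the Noell and Gleason analysis. Each student score is built from a school effect, a class effect, and student-level noise, and the class effect is itself a bundle of teacher, peer, and resource influences. An analysis that estimates a class-level component can at best recover that bundled class effect; nothing in the scores separates the teacher’s slice from the rest.

```python
# My own toy simulation of nested ("hierarchical") data, NOT the Noell and
# Gleason analysis. Each student score = school effect + class effect +
# student noise, where the class effect bundles teacher, peers, and resources.
import random

random.seed(1)

def simulate_class(school_effect):
    # The class-level effect is a bundle of influences that arrive together.
    teacher_effect = random.gauss(0, 5)
    peer_effect = random.gauss(0, 5)
    resource_effect = random.gauss(0, 5)
    class_effect = teacher_effect + peer_effect + resource_effect
    scores = [school_effect + class_effect + random.gauss(0, 10)
              for _ in range(25)]          # 25 students in the class
    return teacher_effect, class_effect, scores

school_effect = random.gauss(0, 5)
teacher_effect, class_effect, scores = simulate_class(school_effect)

# The best a class-level estimate can do is recover the combined class effect
# (approximated here by the class mean minus the school effect).
estimated_class_effect = sum(scores) / len(scores) - school_effect
print(f"True class effect (teacher + peers + resources): {class_effect:6.2f}")
print(f"Estimated class-level effect:                    {estimated_class_effect:6.2f}")
print(f"True teacher slice (not recoverable from scores): {teacher_effect:6.2f}")
# The estimate tracks the bundled class effect, but nothing in the scores
# isolates the teacher's slice from the peer and resource slices.
```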
When
careers are on the line, precision of measurement and analysis is a must.
I can give myself a haircut using garden shears, but it just
isn’t going to be that precise. I can
say I’ve done it before; that the haircut must happen now; that the shears
really will work because shears are blades and blades are able to cut and my
hair needs cutting.
The teacher assessment is the haircut; HLM is the garden
shears.
Teacher assessment based on student test scores: That’s an
attempt to make my hair grow backwards.
I can’t blame HLM for that one.
Concluding Remarks
To the legislators and others who have read this paper: Thank you.
Please seriously consider the content presented here. Please reread the contents so as to absorb
their import. Yes, teachers influence
their students. However, teachers are
not “the” primary influence in student lives.
Yes, teachers should be evaluated.
However, attempting to connect teacher performance to student
standardized test scores cannot work and will not improve education in
America.
VAM does not work; it cannot work and needs to be discarded.
The necessary first step to truly improve American education
is for policy makers to investigate the situation firsthand by involving
themselves directly in the day-to-day operations of the classroom. This direct involvement can be accomplished
by substitute teaching for even one week per year. Such experience would enable legislators to
more critically and insightfully weigh that which might erroneously be labeled
as “education reform.”