Dr. Schneider, a trained statistician and teacher in St. Tammany Parish, has analyzed the Louisiana version of VAM that represents the basis for COMPASS, our new teacher evaluation requirement that uses student test scores as 50% of the teacher evaluation equation. She has provided this are-you-as-smart-as-an-eighth-grader explanation of VAM for us. It can be used as a powerful argument against the legislation passed last year that is being instituted in all schools this year.
READ IT! STUDY IT! ASK QUESTIONS! I will pass on any comments or questions to Dr. Schneider for a response. It is our responsibility to educate other teachers, parents, and legislators about the TRUTH of this devastating piece of legislation.
Value Added Modeling (VAM) and “Reform”:
Under the Microscope
Mercedes K. Schneider, Ph.D.
December 28, 2012
I am writing this paper with legislators in mind as my audience, though the contents here are useful for any persons of influence to read and digest. My primary goal is to explain the shortcomings of value added modeling (VAM), particularly as such relates to the Noell and Gleason VAM study presented to the Louisiana legislature in February 2011 (link here: http://www.regents.doa.louisiana.gov/assets/docs/TeacherPreparation/LegilsativeValueAddedReportFeb2011FINAL.pdf ). However, I also discuss broader educational issues as a means of highlighting the limitations of the current educational “reform” movement in general since it is this current “reform” environment that promotes tenuous issues such as VAM.
In this age of self-promoting “reformers,” many perpetuate the idea that education and experience are irrelevant, especially as concerns imposing policy upon classroom instruction. I oppose such a view because a “no experience or training necessary” view allows those with superficial credentials to promote themselves as experts when they are not. For this reason I think it necessary to briefly inform my reading audience of my background, including my professional experience and credentials.
I am a product of the St. Bernard Parish Public Schools (1972-85). I hold three traditionally earned college degrees, all in the field of education: B.S., secondary education, English and German (LSU, 1985-91); M.Ed., guidance and counseling (West Georgia, 1996-98); and Ph.D., applied statistics and research methods (Northern Colorado, 1998-2002). I have 18 full-time years in the classroom and a couple of part-time years, as well. My classroom teaching experience ranges from 7th grade to graduate level, and I have taught several subjects (English, German, teacher prep, and statistics) in regular ed classes, alternative school, and special ed settings. I have taught in Louisiana, Georgia, Colorado, and Indiana. For the past six years I have taught sophomore English and a teacher prep course in St. Tammany Parish, at Slidell High School.
As for my research experience, what is especially pertinent to the discourse that follows is that I was a manuscript reviewer for the Journal of Mental Health Counseling, the official professional, blind-review research journal of the American Mental Health Counselors Association (AMHCA). I later became the associate editor of research for the same journal. For four years (reviewer, 2006-08; associate editor, 2008-10), I reviewed statistical research studies for their quality, clarity, pertinence, and completeness for a nationally recognized professional organization. It is with this experience in mind that I approached my reading of Noell and Gleason’s VAM study.
Education “Reform” Without Firsthand Involvement
I find it interesting that for all of the talk of the need to “reform” education, those who would impose policy upon schools do not themselves spend time in those schools. For all of his hot rhetoric against public school teachers who he says earn a paycheck “for breathing,” Bobby Jindal has yet to spend a couple of weeks substituting in any public school. The same is true of John White, with his very limited two or three years as a nontraditional, Teach for America temporary teacher. John White has not spent time substituting in any Louisiana classroom. I am also unaware of any effort on the part of legislators or BESE members to spend regular time substituting in the public school classrooms in their respective districts. I applaud Treasurer John Kennedy for the time he spends each year substituting in public school classrooms. Unfortunately, Treasurer Kennedy is the exception, not the rule.
Like those who have the power to alter educational policy, the VAM study is also removed from the immediacy of the classroom. When those in power are removed from those over whom they exercise power, there will be a dangerous disconnect between reality and appearance. To the legislators reading this, let me say, that disconnect is a rift that will only widen unless you invest some of your time becoming involved in the actual school day as a means of informing your educational policy decisions.
Misquoting Teacher Influence
As noted in the Education for All article, “What if Failing Schools… Aren’t?” (http://educatorsforall.org/blog/2012/3/8/why-schools-fail-or-what-if-failing-schoolsarent.html ), Dr. Tabitha Grossman, Senior Policy Analyst at the National Governors Association, misquoted the role of teacher influence on student outcomes: “Teacher effectiveness is the primary influence on student achievement.” This misquote feeds “reformer” rhetoric, as it places responsibility for student outcomes squarely on the teacher. However, the source Grossman cites actually says, "Teachers have the most immediate in-school effect on student success." There is a world of difference between the misquote and the actual finding. “The primary influence” is a global statement that makes teacher influence sound as though it supersedes even parental influence. In contrast, “the most immediate in-school effect” limits teacher influence to the classroom.
Grossman also promoted the idea that “Research shows that teacher quality is the primary influence on student achievement." This also is a misquote of an Organization for Economic Cooperation and Development (OECD) report. The correct text, as noted in “What if Failing Schools… Aren’t?”, is as follows:
"Three broad conclusions emerge from research on student learning. The first and most solidly based finding is that the largest source of variation in student learning is attributable to differences in what students bring to school – their abilities and attitudes, and family and community background. Such factors are difficult for policy makers to influence, at least in the short-run. (Emphasis added)
"The second broad conclusion is that of those variables that are potentially open to policy influence, factors to do with teachers and teaching are the most important influences on student learning. In particular, the broad consensus is that “teacher quality” is the single most important school variable influencing student achievement." (Emphasis added.)
It is clear from the correct information above that teachers are not “the primary influence” on student learning. This certainly brings into question the “value” in “value added modeling” since a major VAM premise is that the teacher directly controls the student outcome in the form of a standardized test score.
“Louisiana Ranks 49th in the Nation in Education”
VAM attempts to rank teachers according to their “effectiveness” on student standardized test scores. The idea here is to discover what teachers “rank” lowest and rid the schools of such teachers. Then perhaps Louisiana will “rise in the rankings” on its “race to the top” as the schools are freed from problematic teachers.
A word is necessary regarding the nature of ranked data.
First, if a group is ranked, someone must fall at the bottom of the ranking. So, to say that “Louisiana ranks 49th in the nation in education” says little about the quality of education in the classroom. Where there is a ranking, there must be a bottom. I can walk into any gifted classroom and rank the students according to IQ. The fact that all are highly intelligent is irrelevant to the fact that when I rank the students via IQ, someone will be last. And even in a gifted classroom, there is a stigma attached to being ranked “last.”
Second, the nature of rankings also makes the notion of “racing to the top” a foolish one since all cannot be ranked as top schools.
Third, the use of rankings can put unnecessary labels on people and institutions and panic the public into believing a crisis exists if one is ranked “at the bottom.”
These points said, rankings do have their place, for rankings can be used to establish a relationship between two concepts. For example, I have done a comparison of 1) state educational rankings based on graduation rates with 2) the percentage of families below poverty level, and I arrived at an interesting conclusion: The relationship between the two rankings is strong. The “reformers” note that “poverty is used as an excuse” to not improve education. However, if one examines the state’s own Recovery School District (RSD), its schools “rank” second lowest of all districts based upon 2012 district letter grades. The only districts to “earn” an F in 2012 are the RSD-LA and St. Helena. Interestingly, St. Helena has the highest percentage of students qualifying for free meals (see “What if Failing Schools… Aren’t?” link for more info).
The “reformers” who purport that poverty is an excuse cannot seem to “make” RSD anything other than a “failing” district. The state promoted the school letter grade system as a means of ranking schools and districts, yet the state-run RSD does not include school letter grades on its website. Both Jindal and the LDOE website publicize the state-run RSD as a “success,” yet RSD has a disproportionate number of Ds and Fs. Jindal even counts schools with a C as “failing,” yet RSD has only a handful of Cs because most state-run RSD schools have not “achieved” even a C ranking. None has achieved an A ranking. Given that issues of poverty are being ignored, it is not surprising that the state-run RSD schools remain overwhelmingly “failures” according to the state’s own letter grade ranking system.
Notice that Jindal, White, BESE, and Dobard (RSD superintendent) attempt to hide the poor RSD rankings while continuing to attempt to benefit from promoting Louisiana as “ranked at the bottom” educationally. If one is loud about the Louisiana 49th ranking but silent about the low RSD ranking, one is able both to promote the façade of RSD “success” and to foster the “blame game” of public school “failure.” The “success” of VAM hinges on this image of public school “failure.”
VAM Teacher Rankings
If carefully examined and understood, the Noell and Gleason VAM pilot study reveals the failure of the VAM as a supposed measure of teacher performance. In their VAM study, George Noell and Beth Gleason rank math and ELA teachers into 5 categories based upon student LEAP-21 and iLEAP scores for grades 3 thru 9: “Bottom 1-10%,” “Bottom 11-20%,” “Middle 21-80%,” “Top 81-90%,” and “Top 91-99%.” Based upon their headliner for the data tables on pages 12 and 13 of their study, it is clear that the teachers are being ranked (labeled) using student standardized test scores.
The two tables of data (one for math teachers and the other for ELA teachers) are measures of the stability of using the iLEAP and LEAP-21 tests as measures of teacher performance. Attention to this reporting is very important because the results of these two tables demonstrate just how unstable these two standardized tests are at assessing teacher quality/performance/effectiveness. It is also crucial to note that based upon a reading of the explanation offered by Noell and Gleason, one might not readily see this instability; Noell and Gleason report “moderate stability across years” (pg. 13). In truth, the results are erratic. Here are the tables, followed by a brief, accurate interpretation:
Table 5. Stability of Teacher Ranking in Mathematics across 2008-2009 to 2009-2010

| 2008-2009 Rank (rows) / 2009-2010 Rank (columns) | Bottom 1% - 10% | Bottom 11% - 20% | Middle 21% - 80% | Top 81% - 90% | Top 91% - 99% |
|---|---|---|---|---|---|
| Bottom 1% - 10% | **26.8% (135)** | 18.5% (93) | 46.2% (233) | 4.4% (22) | 4.2% (21) |
| Bottom 11% - 20% | 14.8% (71) | **15.6% (75)** | 62.1% (298) | 5.4% (26) | 2.1% (10) |
| Middle 21% - 80% | 10.0% (508) | 9.9% (504) | **64.0% (3,258)** | 9.3% (475) | 6.8% (348) |
| Top 81% - 90% | 2.9% (14) | 4.6% (22) | 54.0% (259) | **22.1% (106)** | 16.5% (79) |
| Top 91% - 99% | 1.8% (8) | 1.5% (7) | 35.1% (160) | 15.8% (72) | **45.8% (209)** |
Table 6. Stability of Teacher Ranking in English Language Arts across 2008-2009 to 2009-2010

| 2008-2009 Rank (rows) / 2009-2010 Rank (columns) | Bottom 1% - 10% | Bottom 11% - 20% | Middle 21% - 80% | Top 81% - 90% | Top 91% - 99% |
|---|---|---|---|---|---|
| Bottom 1% - 10% | **22.3% (126)** | 17.5% (99) | 52.7% (298) | 4.9% (28) | 2.7% (15) |
| Bottom 11% - 20% | 17.1% (92) | **15.2% (82)** | 59.7% (321) | 5.0% (27) | 3.0% (16) |
| Middle 21% - 80% | 9.9% (575) | 9.8% (566) | **63.2% (3,656)** | 9.5% (551) | 7.6% (437) |
| Top 81% - 90% | 3.2% (17) | 6.1% (33) | 55.4% (298) | **17.7% (95)** | 17.7% (95) |
| Top 91% - 99% | 4.5% (23) | 2.7% (14) | 37.1% (190) | 18.2% (93) | **37.5% (192)** |
Each cell in the tables gives the percentage (and, in parentheses, the actual number) of teachers in the study who were first ranked one way using 2008-09 student test scores (reading to the left) and then ranked either the same way (bolded diagonal) or a different way (all numbers not bolded) using 2009-10 student test scores (reading at the top). For example, the percentage 4.5% (23 teachers) in Table 6 (immediately above this text) represents the percentage of ELA teachers originally ranked in 2008-09 in the top 91-99% (reading to the left) but re-ranked in 2009-10 in the bottom 1-10% (reading at the top of the column), assuming the teachers changed nothing in their teaching. Thus, these two tables show how poorly the standardized tests classify teachers (2008-09) and then reclassify teachers (2009-10) into their original rankings. Tables 5 and 6 are a test of the consistency of using standardized tests to classify teachers. It is like standing on a bathroom scale; reading your weight; stepping off (no change in your weight); then stepping on the scale again to determine how consistent the scale is at measuring your weight. Thus, if the standardized tests are stable (consistent) measures, they will reclassify teachers into their original rankings with a high level of accuracy. This high level of accuracy is critical if school systems are told they must use standardized tests to determine employment and merit pay decisions.
I have bolded the cells on the diagonals of both tables to show just how unstable these two standardized tests are at classifying then correctly reclassifying teachers. If the iLEAP and LEAP-21 were stable, then the bolded percentages on the diagonals of both tables would be very high, almost perfect (99%).
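The diagonal percentages can be recomputed from the head counts in Table 5. The following is a minimal sketch, assuming the counts are transcribed correctly from the table; the function and variable names are illustrative and not part of the Noell and Gleason study.

```python
# Head counts transcribed from Table 5 (math).
# Rows are 2008-09 ranks and columns are 2009-10 ranks, lowest category first.
CATEGORIES = [
    "Bottom 1-10%", "Bottom 11-20%", "Middle 21-80%", "Top 81-90%", "Top 91-99%",
]
TABLE5_COUNTS = [
    [135, 93, 233, 22, 21],
    [71, 75, 298, 26, 10],
    [508, 504, 3258, 475, 348],
    [14, 22, 259, 106, 79],
    [8, 7, 160, 72, 209],
]


def row_percentages(counts):
    """Convert each row of head counts to percentages of that row's total."""
    return [[round(100 * c / sum(row), 1) for c in row] for row in counts]


def diagonal(matrix):
    """Return the cells where the 2008-09 rank equals the 2009-10 rank."""
    return [matrix[i][i] for i in range(len(matrix))]


percentages = row_percentages(TABLE5_COUNTS)
stay_rates = diagonal(percentages)

for name, rate in zip(CATEGORIES, stay_rates):
    print(f"{name}: {rate}% received the same rank in both years")
```

A perfectly stable measure would put every teacher on the diagonal, so each of these five values would be 100%.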
Here is what we see from the diagonal in Table 5:
If a math teacher is originally ranked as the lowest, without altering his or her teaching, the teacher will likely be re-ranked in the lowest category only 26.8% of the time. Conversely, without altering his/her teaching, a math teacher ranked as the highest would likely be re-ranked in the highest group only 45.8% of the time even if she/he continued to teach the same way.
Now, let’s examine some info off of the diagonal for Table 5:
A math teacher originally ranked in the highest category will be re-ranked in the middle category 35.1% of the time and re-ranked in the lowest category 1.8% of the time. These alterations in ranking are out of the teacher’s control and do not reflect any change in teaching. Even though 1.8% might seem low, notice that in the study alone, this represented 8 math teachers, 8 real human beings, who could potentially lose their jobs and face the stigma of being labeled “low performers.”
As we did for Table 5, let’s consider the diagonal for Table 6:
If an ELA teacher is originally ranked as the lowest, without altering his or her teaching, the teacher will likely be re-ranked in the lowest category only 22.3% of the time. Conversely, without altering his/her teaching, an ELA teacher ranked as the highest would likely be re-ranked in the highest group only 37.5% of the time even if she/he continued to teach the same way.
As we did for Table 5, let’s examine some info off of the diagonal for Table 6:
An ELA teacher originally ranked in the highest category will be re-ranked in the middle category 37.1% of the time and re-ranked in the lowest category 4.5% of the time. These alterations in ranking are out of the teacher’s control and do not reflect any change in teaching. Even though 4.5% might seem low, notice that in the study alone, this represented 23 ELA teachers who could potentially lose their jobs and face the stigma of being labeled “low performers.”
Other info based upon Tables 5 and 6:
54.2% of math teachers and 62.5% of ELA teachers originally ranked in the top 10% (and likely eligible for merit money) would have lost their status due to faulty measurement; these figures are 100% minus the diagonal values of 45.8% and 37.5%. Teachers not originally ranked in the lowest 10% were also re-ranked into the lowest 10%: summing the first-column percentages for the four higher categories gives 29.5 for math (14.8 + 10.0 + 2.9 + 1.8) and 34.7 for ELA (17.1 + 9.9 + 3.2 + 4.5). In head counts, 601 math teachers and 707 ELA teachers in the study were moved into the lowest category from other, higher categories. Keep in mind that being ranked into the lowest category brings with it the threat of job loss.
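These figures can be rederived from the head counts in Tables 5 and 6. The sketch below assumes the counts are transcribed correctly; the helper names are illustrative only.

```python
# Head counts transcribed from Tables 5 (math) and 6 (ELA).
# Rows are 2008-09 ranks and columns are 2009-10 ranks, lowest category first.
MATH = [
    [135, 93, 233, 22, 21],
    [71, 75, 298, 26, 10],
    [508, 504, 3258, 475, 348],
    [14, 22, 259, 106, 79],
    [8, 7, 160, 72, 209],
]
ELA = [
    [126, 99, 298, 28, 15],
    [92, 82, 321, 27, 16],
    [575, 566, 3656, 551, 437],
    [17, 33, 298, 95, 95],
    [23, 14, 190, 93, 192],
]


def row_percent(counts, row, col):
    """Percentage of the teachers in `row` who landed in column `col`."""
    return round(100 * counts[row][col] / sum(counts[row]), 1)


def lost_top_status(counts):
    """Percent of top-category teachers not re-ranked into the top category."""
    top = len(counts) - 1
    return round(100 - row_percent(counts, top, top), 1)


def moved_into_bottom(counts):
    """Sum of first-column percentages, and head count, for the higher rows."""
    higher_rows = range(1, len(counts))
    percent_sum = round(sum(row_percent(counts, r, 0) for r in higher_rows), 1)
    head_count = sum(counts[r][0] for r in higher_rows)
    return percent_sum, head_count


for subject, counts in (("Math", MATH), ("ELA", ELA)):
    percent_sum, head_count = moved_into_bottom(counts)
    print(subject, lost_top_status(counts), percent_sum, head_count)
```

Run as written, the first-column percentages sum to 29.5 for math and 34.7 for ELA. Because each row has a different total, such a sum is not a percentage of any single group of teachers; the head counts (601 math teachers and 707 ELA teachers) are the more direct figure.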
Conclusions Based on Tables 5 and 6:
Given the capriciousness of the teacher rankings as represented in Tables 5 and 6, it is illogical to assume that teachers were ranked correctly to begin with. Let’s reconsider the bathroom scale example: If I stand on and step off a bathroom scale twice in sequence, and the scale reports my weight as 158 and 172, respectively, why should I believe that the first measure of 158 is my “true” weight?
I would discard the bathroom scale.
By the same logic, if teacher rankings are erratic from measurement time 1 to measurement time 2, why should one believe that the first measure is an accurate reflection of teacher performance?
Time to throw out the VAM.
Given the instability of teacher rankings based upon standardized test scores, achieving “tenure” as it is now defined (5 out of 6 years as rated “highly effective”) is almost a statistical impossibility. John White plans to make some exception for teachers ranked as “highly effective” to circumvent the issue of being subsequently rated as “ineffective,” a situation faced by teachers such as those from South Highlands Elementary Magnet School in Caddo Parish; however, he has offered no plans for the teachers of other rankings who will be erroneously re-ranked as “ineffective.” The solution is to discard the entire system and not enter into situations whereby BESE is asked to make “patchwork” exceptions to a model replete with classification weaknesses.
Even George Noell confirmed issues of VAM inaccuracy. For the actual email from George Noell regarding VAM classification inaccuracy, see http://www.lae.org/cms/Stability+of+the+Louisiana+Value+Added+Model+%28VAM%29/263.html ; for additional VAM inaccuracy info, including VAM inaccuracy info from other states, see http://www.louisianaeducator.blogspot.com/2012_09_30_archive.html .
Data Quality Limitations of the VAM Study
Based upon its inability to rank teachers with any precision, the VAM is useless. However, the outcome inaccuracies are not the only fatal flaws in the Noell and Gleason VAM study. Another notable limitation involves issues of data quality. No study is better than its data. Indeed, the three data issues I will discuss here render the results of the study as limited at best and untrustworthy and unusable at worst.
Nonrandom Selection of Schools
When I attended the April meeting in St. Tammany where John White addressed St. Tammany faculty and administration, more than once teachers expressed concerns about the teacher evaluation pilot study. In all cases, John White asked, “Did your school choose to participate in the study?” as a means of ending discussion on the topic. I remember thinking then, “Self-selection into the research study is poor research practice because it limits generalizability of results.” Let me explain.
The schools that participated in the Noell and Gleason VAM study were “self-selected”; that is, these schools were not randomly selected to participate but instead decided on their own to participate (or their district administration decided for them). This is poor research practice; in order to generalize the results to all Louisiana schools, the participating schools should have been randomly chosen from a pool of all Louisiana schools. Random selection is a foundation of statistics/probability. Games of chance follow the laws of probability, and gambling establishments are careful to guard these laws since laws of probability guarantee that the house will win overall. I cannot walk up to a poker table and ask to see the deck in order to intentionally remove certain cards, or put them in order to my liking, or even to see them in order to choose my hand. These actions violate the rules of probability, and the outcome of the game is “not fair” since I did not play with a complete and/or randomly shuffled deck.
The “outcome of the game” in the VAM study is the ranking and re-ranking of the teachers as represented in previously discussed Tables 5 and 6. The “tampered deck” is the group of self-selecting schools. The “randomly shuffled deck” should have been the correctly chosen, randomly selected schools from among all Louisiana schools. The outcome is affected because the results of the study cannot be extended to apply to all Louisiana schools but instead to only the specific schools participating in the study. In the language of statistics, the results of the study “cannot generalize” to all Louisiana schools.
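For contrast with self-selection, simple random selection can be sketched in a few lines. This is only an illustration: the school names, the pool size of 200, and the sample size of 20 are hypothetical placeholders, not figures from the study.

```python
import random


def draw_random_schools(all_schools, sample_size, seed=None):
    """Draw a simple random sample: every school has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(all_schools, sample_size)


# Hypothetical pool standing in for the full list of Louisiana schools.
all_schools = [f"School {i:03d}" for i in range(1, 201)]

# The seed only makes the draw repeatable; it is an arbitrary choice.
chosen = draw_random_schools(all_schools, 20, seed=2011)
print(chosen)
```

The point of the design is that no school, district, or researcher decides who participates; only the shuffle does.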
Even if this study had consistently ranked and re-ranked teachers (which it did not), the result should not have been applied to any schools except the ones included in this study. Given what I know about “reformer-speak,” let me add that it is unfounded speculation to conclude that VAM stability is “only poor for the tested schools but not for the untested schools.”
Regarding the results in Tables 5 and 6, Noell and Gleason comment, “It is anticipated that a full set of verified rosters may produce more stable results” (pg. 12). This brief comment is an amazing admission, one that connects to a previous statement on page 8: “The comparative analyses between years described below are based on unverified rosters for 2007-2008 and 2008-2009. It is the authors’ hypothesis that when two years of verified rosters are available, the relationship between consecutive years may be strengthened as error variance associated with inaccurate student-teacher links is removed.” (Emphasis added.) The researchers were uncertain whether teachers were correctly matched to students. Such uncertainty about data integrity renders the results of this study useless.
Up to this point in this discourse, the Noell and Gleason VAM study has three serious (and two fatal) limitations: Erratic classification (fatal); nonrandom selection of participating schools (limiting), and inaccurate teacher-student information (fatal).
A third fatal data limitation involves Noell and Gleason’s treatment of data on shared students, those students taught by more than one math or ELA teacher during the course of a school year. As Noell and Gleason note on page 10: “Students who had multiple teachers in a content area were retained in the dataset for their promotional path for each teacher, but were weighted in proportion to the number of teachers they had in that subject. So for example, if a student had two mathematics teachers, the student would have a 0.5 weight in contributing to each teacher’s assessment result.” (Emphasis added.)
In other words, Noell and Gleason divided a student’s score and “gave an equal part” to count towards each teacher’s evaluation. This decision defeats the goal of assessing each teacher: How is one to distinguish “good” from “poor” teachers if the score is equally shared? An equal sharing of the student’s score presumes that all teachers contribute equally to the student’s education, that all contributing teachers were “equally good.”
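The weighting rule quoted from page 10 can be sketched as follows. This is a simplified illustration that uses a weighted average in place of the study's full model; the student records, scores, and teacher labels are hypothetical.

```python
from collections import defaultdict


def student_weight(number_of_teachers):
    """Weight of one student's score for each teacher who shared that student."""
    return 1 / number_of_teachers


def weighted_teacher_means(records):
    """Weighted mean score per teacher; records are (student, score, teachers)."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    for _student, score, teachers in records:
        w = student_weight(len(teachers))
        for teacher in teachers:
            totals[teacher] += w * score
            weights[teacher] += w
    return {teacher: totals[teacher] / weights[teacher] for teacher in totals}


# Hypothetical data: student s2 had both Teacher A and Teacher B.
records = [
    ("s1", 300, ["Teacher A"]),
    ("s2", 340, ["Teacher A", "Teacher B"]),
    ("s3", 320, ["Teacher B"]),
]
means = weighted_teacher_means(records)
print(means)
```

Note that the shared student's score enters both teachers' results with the same 0.5 weight; nothing in the rule distinguishes which teacher contributed what.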
I teach a number of students who will have me for English II for only the first semester. They will have another English II teacher for the second semester. It is nonsensical to say on one hand that my colleague and I are being assessed as independent teachers and on the other to equally divide the measure that supposedly determines the quality of our teaching, the student’s EOC (End of Course) English II score, between the two of us.
Equally dividing the standardized test scores of shared students among two or more teachers who taught the students is yet another fatal flaw in the Noell and Gleason VAM study. And there’s more. In the next section, I address the final flaw for this paper: the selected analysis, HLM.
Hierarchical Linear Modeling (HLM)
HLM is the statistical analysis used in VAM studies. I would like to present a brief, basic explanation of why HLM cannot work as a means of determining teacher contribution to a student’s standardized test score. Please keep in mind my discussion under the previous section, “Misquoting Teacher Influence”; that section shows how the premise is faulty for attempting to connect student standardized test scores to teacher performance in the first place. That reminder in place, I offer this section as a means of understanding the limitation of the HLM analysis itself in determining teacher contribution based upon the student’s score.
HLM is a “nested” procedure. That is, it assumes a layering to the data. Think of it as concentric circles, one inside of the next. In an educational context, the student is “inside” of the class, which is “inside” of the school, which is “inside” of the district. HLM offers numeric info for each of these levels as an entire level. In VAM, part of a student’s score is unique to the student, and part is attributed to his/her class, and part to the school, and part to the district (if researchers choose to measure the school and district). This is key: The part of a student’s score attributed to the student’s class is not connected to any specific component of the classroom environment, including the teacher. HLM is not that precise. It can only offer numeric info on each level as a whole level.
In their analysis, Noell and Gleason presume that “teacher” and “class” are interchangeable. They are not. The teacher is not the “class.” Furthermore, attempts to isolate the teacher from all other possible influences that form the conglomerate “class” are futile since it is not possible using HLM to control for every other classroom influence and “leave” only the teacher influence as directly connected to the student test score.
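The distinction between “teacher” and “class” can be illustrated with a toy simulation. This is not HLM and not the study's model; it is a sketch under invented assumptions (three classes, a base score of 300, made-up effect sizes) showing that when scores are grouped only by class, a teacher effect and any other class-wide influence produce identical data.

```python
import random
from statistics import mean


def simulate_class_means(teacher_effects, other_effects, students_per_class=25, seed=0):
    """Simulate scores for each class and return the class means.

    Each student's score is a base score plus the class-level effect
    (teacher effect + every other class-wide influence) plus random noise.
    """
    rng = random.Random(seed)
    class_means = []
    for teacher, other in zip(teacher_effects, other_effects):
        class_effect = teacher + other  # only the sum reaches the scores
        scores = [300 + class_effect + rng.gauss(0, 10) for _ in range(students_per_class)]
        class_means.append(round(mean(scores), 6))
    return class_means


# Scenario A: all class-level differences come from the teacher.
scenario_a = simulate_class_means(teacher_effects=[5, -5, 0], other_effects=[0, 0, 0])
# Scenario B: the teachers are identical; the differences come from elsewhere.
scenario_b = simulate_class_means(teacher_effects=[0, 0, 0], other_effects=[5, -5, 0])

identical = scenario_a == scenario_b
print(scenario_a, scenario_b, identical)
```

The two scenarios yield the same class means, so the class-level numbers alone cannot say which scenario is true.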
When careers are on the line, precision of measurement and analysis is a must.
I can give myself a haircut using garden shears, but it just isn’t going to be that precise. I can say I’ve done it before; that the haircut must happen now; that the shears really will work because shears are blades and blades are able to cut and my hair needs cutting.
The teacher assessment is the haircut; HLM is the garden shears.
Teacher assessment based on student test scores: That’s an attempt to make my hair grow backwards. I can’t blame HLM for that one.
To the legislators and others who have read this paper: Thank you. Please seriously consider the content presented here. Please reread the contents so as to absorb their import. Yes, teachers influence their students. However, teachers are not “the” primary influence in student lives. Yes, teachers should be evaluated. However, attempting to connect teacher performance to student standardized test scores cannot work and will not improve education in America.
VAM does not work; it cannot work and needs to be discarded.
The necessary first step to truly improve American education is for policy makers to investigate the situation firsthand by involving themselves directly in the day-to-day operations of the classroom. This direct involvement can be accomplished by substitute teaching for even one week per year. Such experience would enable legislators to more critically and insightfully weigh that which might erroneously be labeled as “education reform.”