PROCEDURES FOR ENSURING THAT KEY ASSESSMENTS OF CANDIDATE PERFORMANCE AND EVALUATIONS OF UNIT OPERATIONS ARE FAIR, ACCURATE, CONSISTENT, AND FREE OF BIAS
The unit believes that assessment accuracy, consistency, and freedom from bias, among other concepts, are key components of validity and reliability. This is also an area in which FSEHD is moving to the Target Level. (Hence, the text below overlaps considerably with the unit's description of how it is moving to the Target Level.) The procedures FSEHD uses to ensure that key assessments of candidate performance and evaluations of unit operations are fair, accurate, consistent, and free of bias, as well as findings from sample validity and reliability studies, are presented in the sections below.
Fairness, accuracy, and freedom from bias are among the key components of validity. FSEHD utilizes a comprehensive approach to define and examine the validity and utility of its assessment system and the inferences that the system yields. Validity is a multifaceted concept and the most important technical consideration for assessments and assessment system design and use. Furthermore, the notion of usefulness, or utility, of assessment results is inherent in any examination of validity. Critical components of validity have been taken into account in the design and monitoring of the Initial and Advanced Programs Assessment Systems. They include:
- Content-related validity: Do assessment items/components adequately and representatively sample the content area(s) to be measured?
- Construct validity: Do assessments and the assessment system measure the content they purport to measure?
- Prediction: How well do assessment instruments predict how candidates will do in future situations?
- Fairness: Are all candidates afforded a fair opportunity to demonstrate their skills, knowledge, and dispositions?
- Utility: How useful are the data generated from unit assessments?
- Consequences: Are assessment uses and interpretations contributing to increased student achievement and not producing unintended negative consequences? (Linn, 1994)
Content-related validity. With regard to content-related validity, assessments are aligned with the RIPTS, Conceptual Framework elements, and/or FSEHD Advanced Competencies, and Cultural Competency Areas identified by the larger professional community as important. Detailed alignment documents attesting to this high degree of alignment exist. Additionally, new unit assessments were designed based on best practice in teacher candidate assessment as identified in the literature and on the professional knowledge, experience, and consensus of FSEHD faculty, many of whom are developers and definers of best practice in their professional areas. Input from experienced practitioners (i.e., cooperating teachers) was also sought to help establish the validity of these assessment measures.
A next step in the examination of the content-related validity of the assessments in the FSEHD Unit Assessment System concerned “balance of representation,” or whether the content of the assessments in the assessment system is balanced or, on the other hand, weighted to represent the relative importance of relevant standards (Webb, 2005). FSEHD has analyzed the balance of representation of assessment indicators in its exit assessments (TCWS and OPR) according to RIBTS, Conceptual Framework element, Cultural Competence Area, and Professional Disposition. The goal of this exercise was to ensure that every standard was represented among the exit assessments and that the balance of representation was weighted to represent the relative importance of standards. Results of this analysis, which can be found at [link], indicate that every RIBTS standard, Conceptual Framework element, Cultural Competence Area, and Professional Disposition is represented among the two exit assessments. Additionally, the balance of representation for each set of expectations was more heavily weighted in the following areas: RIBTS (#9-Teachers Use Appropriate Formal/Informal Assessment Strategies, 15%; #8-Teachers Use Effective Communication, 14%; #6-Teachers Create a Supportive Learning Environment, 13%; #4-Teachers Create Instructional Opportunities that Reflect Respect for Diversity of Learners, 11%); Conceptual Framework (Pedagogy, 52%; Professionalism, 29%); Cultural Competence Areas (Planning & Instruction, 46%; Communication, 25%); Professional Dispositions (Commitment to Equity, 27%; Work Ethic, 24%; Caring Nature, 20%). This balance of representation among the standards sets assures that multiple, important qualities/skills of candidates are assessed at the Exit transition point. Together, they provide a very comprehensive picture of an FSEHD candidate at the end of his/her program.
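The balance-of-representation percentages reported above amount to the share of assessment indicators aligned with each standard. The following is a minimal sketch of that arithmetic in Python, assuming each indicator is tagged with exactly one standard. The alignment list, the "Knowledge" and "Diversity" labels, and the function name are hypothetical illustrations, not FSEHD data or FSEHD's actual procedure.

```python
from collections import Counter

def balance_of_representation(indicator_standards):
    """Return {standard: percent of indicators aligned with that standard}."""
    counts = Counter(indicator_standards)
    total = sum(counts.values())
    return {standard: 100.0 * n / total for standard, n in counts.items()}

# Hypothetical alignment: one standard label per rubric indicator.
alignment = ["Pedagogy"] * 5 + ["Professionalism"] * 3 + ["Knowledge"] * 2
shares = balance_of_representation(alignment)

# Standards that no indicator addresses (hypothetical full list of standards).
all_standards = {"Pedagogy", "Professionalism", "Knowledge", "Diversity"}
unrepresented = all_standards - set(shares)
```

A non-empty `unrepresented` set would flag a standard that the exit assessments do not cover.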
The unit also consulted research-based evidence regarding the content validity of its new assessments. For example, findings from a validity study conducted by Denner, Norman, Salzman, Pankratz, and Evans (2004) revealed that a panel of experts judged the processes targeted by the Renaissance Teacher Work Sample (upon which FSEHD's TCWS is based) to be very similar to those addressed by the INTASC Standards. Similarly, this expert panel determined the Teacher Work Sample tasks to be authentic and critical to success as a classroom teacher.
Construct-related validity. An assessment has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of an assessment is worked out over a period of time on the basis of an accumulation of evidence. FSEHD is investigating the validity of its unit assessments and accumulating validity evidence in a number of ways. High internal consistency is one type of evidence used to establish construct-related validity. That is, if an assessment or scale has construct validity, scores on the individual items/indicators should correlate highly with the total test score. This is evidence that the test is measuring a single construct. Internal consistency of TCWS components is high and in fact has increased over the past two years: Fall 2008/Spring 2009, n=48; Fall 2009, n=120; and Spring 2010, n=253. Estimates of internal reliability (coefficient alpha) during these time periods for the seven TCWS constructs have been as follows: Contextual Factors, α=.89, .93, .94; Learning Goals & Objectives, α=.83, .96, .94; Assessment Plan, α=.75, .96, .94; Design for Instruction, α=.91, .94, .91; Instructional Decision Making, α=.87, .94, .95; Analysis of Student Learning, α=.87, .96, .94; Self Evaluation, α=.85 (Fall 2008/Spring 2009 only); Candidate Reflection on Student Teaching Experience, α=.61, .94, .87. These findings provide strong evidence for the construct validity of the constructs being assessed in FSEHD's TCWS.
Internal consistency of OPR scales is high and has been improving semester by semester (Fall 2008/Spring 2009, n=72; Fall 2009, n=120; and Spring 2010, n=667, n=1460), as the instrument and the scales have been refined. Internal consistency estimates during each time period for OPR scales have been as follows: Planning, α=.96, .98, .97; Implementation, α=.94, .97, .96; Content, α=.93, .95, .95; Climate, α=.94, .96, .96; Classroom Management, α=.93, .06, .96; Professionalism, α=.95, .98, .97; Reflection, α=.96, .97, .97; Technology Use, α=.98, .97 (Fall 2009/Spring 2010 only). These findings provide strong evidence for the construct validity of the constructs being assessed in FSEHD's OPR.
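For readers unfamiliar with coefficient alpha, the statistic reported above can be computed as α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items in a scale. The sketch below is a minimal Python illustration of that standard formula. The ratings are invented for the example and are not FSEHD data, and the function name is a placeholder rather than part of any software the unit uses.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for a scale.

    rows: one list of item scores per candidate, all the same length.
    """
    k = len(rows[0])                         # number of items in the scale
    items = list(zip(*rows))                 # transpose: one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical ratings: five candidates scored on a three-indicator scale.
ratings = [
    [4, 4, 3],
    [2, 3, 2],
    [3, 3, 3],
    [1, 2, 1],
    [4, 3, 4],
]
alpha = cronbach_alpha(ratings)
```

Alpha rises toward 1 as the items vary together across candidates, which is why high values are read as evidence that a scale measures a single construct.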
Second, FSEHD has conducted factor analyses to examine whether the theoretical framework of FSEHD unit assessments matches the factor representation yielded by confirmatory factor analysis. A confirmatory factor analysis of Spring 2010 TCWS data (principal components extraction, varimax rotation, n=253) revealed a seven-factor solution accounting for 75% of the variance in the data. All TCWS rubric criteria except for one loaded on the appropriate, hypothesized factor. (The fit of this single criterion is being discussed.) The results of this study provide evidence to support the seven teaching process constructs on which the TCWS is constructed.
In order to gather evidence as to the construct validity of the OPR, confirmatory factor analyses (principal components extraction, varimax rotation, n=1622 in Spring and n=726 in Fall) were also conducted on OPR data in Spring and Fall 2010. The goal was to determine whether the items in the instrument measured eight theoretical traits as intended: Planning, Implementation, Content, Classroom Management, Classroom Climate, Reflection, Professional Behavior, and Technology Use. Results suggest that Planning, Implementation, Content, Classroom Management, and Classroom Climate form a single, strong, classroom instruction construct accounting for 70-71% of the variance in the data collected each semester. Professional Behavior and Technology Use items also load appropriately on individual, separate constructs, each explaining four to five percent of the variance in the data. On the other hand, Reflection items do not consistently load on any other construct. These findings call into question the validity of reporting separate scores for Planning, Implementation, Content, Classroom Management, and Classroom Climate.
Another method of collecting construct-related validity evidence is to examine developmental changes. Assessments measuri candidate dispositions.
Prediction. FSEHD conducts ongoing checks of the predictive validity of assessments in the unit assessment system. While it is not feasible to investigate the predictive validity of every assessment each year, the unit conducts “spot checks” of predictive validity periodically. For example, following questions of whether the Career Commitment Essay (CCE) admissions requirement might be redundant to the admissions requirement that students pass Writing 100, the Director of Assessment conducted a study in 2009 examining the relationship between CCE scores and Writing 100 grades among applicants to FSEHD teacher preparation programs. The study found almost no correlation between the CCE scores and Writing 100 grades of students who applied to an FSEHD teacher preparation program between January 2006 and May 2008. The almost complete absence of a relationship between these two variables indicates that the CCE and Writing 100 entrance requirements are not redundant and that Writing 100 grades did not predict essay scores.
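A redundancy check of this kind rests on a correlation coefficient between two sets of scores for the same applicants. The sketch below shows the Pearson product-moment correlation in plain Python. It assumes Pearson's r was the statistic of interest, which the study description does not state, and the score lists are invented for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired scores for four applicants (not FSEHD data).
essay_scores = [1, 2, 3, 4]
writing_grades = [2, 4, 4, 2]
r = pearson_r(essay_scores, writing_grades)
```

A value of r near zero, as in this contrived example, means that knowing one score says almost nothing about the other, which is the pattern the study reported.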
In response to new state requirements for admission to teacher preparation programs, the Director of Assessment examined the validity of admissions-based PPST scores (math, reading, writing, and sum scores) and the Elementary Education Content Exercise in predicting subsequent candidate PLT (K-6 and 7-12) scores and GPA. Using data from 1200 candidates over the past 3 years, the analysis revealed low and statistically nonsignificant correlations between these mandatory admissions measures and subsequent GPA. Further analyses revealed that the new, higher admissions scores would prevent many minority candidates from being admitted to teacher education programs. Based on these two sets of findings, the unit recommended that the state not adopt the new admissions requirements.
Similarly, the Director of Assessment conducted a study of the utility of standardized test scores (MAT, GRE) as admission criteria in the prediction of subsequent program performance. Results revealed that the MAT is highly and significantly correlated with GPA among advanced program non-completers. Among program completers, the MAT, GRE Verbal, and GRE Analytical tests are significantly correlated with GPA. These findings were used to support the recommendation that the standardized testing requirement be retained at admission to an advanced program.
Finally, in revising and developing assessments, the unit has based its new assessments on models that have been shown through research to have predictive validity. For example, findings from a validity study conducted by Denner, Norman, Salzman, Pankratz, and Evans (2004) revealed that an expert panel judged the Teacher Work Sample tasks to be critical to success as a classroom teacher.
Fairness. The following components are included in the fairness criterion and contribute to the extent to which inferences and actions on the basis of assessment scores are appropriate and accurate.
- Freedom from bias: The language and form of assessments must be free of cultural and gender bias.
- Transparency of expectations: Assessment instructions and rubrics must clearly state what is expected for successful performance.
- Opportunity to learn: All candidates must have had learning experiences that prepare them to succeed on an assessment.
- Accommodations: Candidates with documented learning differences must be afforded accommodations in instruction and assessment.
- Multiple opportunities: Candidates must have the opportunity to demonstrate their learning in multiple ways and at different times. (Smith & Miller, 2003)
The unit has addressed the fairness criterion as follows:
As unit assessments have been designed and revised, the wording and design of assessment tasks were reviewed by the respective program faculty with a focus on whether the selected tasks were fair and accessible to all candidates. The design, format, wording, and presentation of unit assessments have been reviewed multiple times by internal and external constituents, and changes have been made in an effort to minimize any unintentional bias. For example, the new summative advanced program assessment, the Professional Impact Project, was renamed as such after faculty indicated that the previously suggested title, Professional Intervention Project, suggested a “deficit model” of education. The title and select other terms in the document were subsequently revised to reflect less biased language.
FSEHD has also made numerous efforts to make assessment expectations transparent and to keep faculty, cooperating teachers, and candidates informed. First, the Director of Assessment and the assessment committee have strived to keep faculty and cooperating teachers up to date and informed on all aspects of the revised assessment system. Information was shared with them during faculty retreats in February 2007, August 2008, August 2009, and March 2010. Information has also been shared with faculty through the Dean's Leadership Committee and electronic correspondence with faculty. The TCWS and Mini TCWS rubrics and prompts are highly descriptive and provide detailed guidance to the candidate and evaluators about expectations and process. The ILP and OPR contain multiple, observable indicators that make expectations explicit and yield granular information about multiple dimensions of a candidate's performance. The Assessment of Professional Dispositions in the College Classroom also contains clear, observable, behavioral indicators for evaluators to rate.
In order to foster greater transparency of expectations, FSEHD has engaged in training of faculty and cooperating teachers in FSEHD unit assessments. In Fall 2009, four workshops for the College Supervisors and Cooperating Teachers were successfully conducted. Approval was obtained from the Rhode Island Department of Education (RIDE) for participants to earn two continuing education units for attending one of the four workshops. Three hundred eight cooperating teachers attended the OPR workshop. The workshop taught the Cooperating Teachers how to use the new observation instrument and score appropriately with it when evaluating teacher candidate teaching behaviors. The specific objectives for the workshop were to:
- Introduce participants to an assessment instrument that analyzes teaching behaviors and documents the growth of these behaviors.
- Examine the scoring criteria and the six-point rubric of the defined assessment instrument.
- Teach participants to use the assessment instrument reliably with teacher candidates.
- Expose participants to the proper terminology of scoring (analytic holistic scoring, criterion versus norm referenced scoring, performance level rubric, scoring criteria, indicators, normative versus developmental).
- Discuss teaching behaviors, using the defined assessment rubric, in large and then small groups.
- Analyze teaching videos with respect to Implementation, Climate, and Classroom Management.
- Reflect on how exposure to this assessment will assist the classroom teachers with reflective teaching practices in their own classrooms.
An OPR and TCWS training session for faculty was held in October 2009. While all faculty were invited to attend, 20 attended. The session focused on understanding the components of the TCWS and scoring expectations, with opportunities for participants to score and discuss actual candidate work. Training evaluation data indicated that participants found the training valuable and useful.
In Fall 2010, the Assistant Dean for Partnerships and Placements and Assistant Director of Assessment developed an online course entitled EDU 580 Workshop: Professional Development for Cooperating Teachers. This 3-credit graduate workshop contains substantial content related to unit assessment in FSEHD and is anticipated to be instrumental in developing more competent assessors among FSEHD Cooperating Teachers. The course is being piloted in Spring 2011, with plans for full implementation in Fall 2011. In addition, face-to-face trainings on student teaching assessment are being discussed for Spring 2011. Feedback from pilot participants indicates that the course has been invaluable in helping them understand and apply unit assessment instruments and procedures appropriately.
Attempts to maintain transparency of expectations for candidates are ongoing. Program handbooks and web pages clearly delineate program expectations, as well as program and unit assessments. The unit also regularly holds orientation meetings and information sessions for students at each transition point (admission, preparing to teach, and exit). Additionally, the unit conducts specialized trainings aimed at helping candidates understand what is expected of them and how to best meet expectations. For example, the unit holds regular workshops for teacher candidates preparing for the Praxis II tests. The pass rates of candidates who have participated in the Praxis II workshops attest to the success of this particular intervention.
Additionally, in Spring 2008, FSEHD instituted a Tips for Writing a Successful Career Commitment Essay workshop. The goal of this workshop was to help prospective teacher candidates understand this component of the FSEHD admissions process. Approximately 40 students attended the Spring 2008 workshop. A similar number attended the Fall 2009 workshop. A study was conducted to examine the relationship between attendance at the Spring 2008 Career Commitment Essay Workshop and subsequent Career Commitment Essay performance. Findings revealed that mean Career Commitment Essay scores were higher for those who attended the workshop than for those who did not. The difference in mean essay scores between workshop attendees and non-attendees was statistically significant for students submitting their essays for the second or third time, students submitting their very first essay, and students overall. Additionally, higher proportions of students who attended the workshop passed the essay requirement as compared to students who did not attend the workshop.
During the assessment design and revision process, assessment tasks were reviewed by the respective program faculty in terms of opportunity to learn and succeed at the content/skills inherent in the tasks. Furthermore, during the first 3 semesters in which the TCWS was piloted, the unit was respectful of faculty members' willingness to take a risk with a new assessment and conscious that the instrument was indeed being piloted. Programs were also frank that the focus on analyzing student work and teacher impacts on student learning represented an area that many of them were in the beginning stages of implementing. Hence, FSEHD granted programs (and candidates) some leniency in these areas during the early pilot semesters. The unit did not set a cutoff score for the TCWS; rather, it provided faculty with general guidelines and then allowed faculty members to assign a TCWS as passing or failing based on their professional judgment and the status of the program in relation to emphasizing certain TCWS concepts and skills. Additionally, faculty in FSEHD programs have begun to engage in curriculum mapping and other processes to examine what is taught at different time points to ensure that candidates have opportunities to learn and succeed at the content and skills inherent in unit assessments. Examples of how they have done this are included in Exhibit 8.
Rhode Island College is committed to making reasonable efforts to assist individuals with documented disabilities. Candidates seeking reasonable classroom or assessment accommodations under the ADA of 1990 and/or Section 504 of the Rehabilitation Act of 1973 are required to register with Disability Services in the Student Life Office. To receive accommodations for any class or unit assessment, students must obtain a Request for Reasonable Accommodations form and submit it to their professor at the beginning of the semester. This information is shared with all students at the beginning of each course and is also provided in each course syllabus.
Multiple opportunities to demonstrate learning and growth are built into the very design of the FSEHD assessment system. The system includes many opportunities for candidates to demonstrate their learning—in multiple ways and at different times. Furthermore, candidates are afforded opportunities to retake or redo all or part of their unit assessments. The use of multiple assessments with multiple formats, as opposed to a single, “one-shot” assessment, increases the validity of the inferences subsequently made regarding the knowledge, skills, and dispositions of FSEHD candidates.
Utility. The revision/development of new FSEHD unit assessments stemmed largely from faculty. A goal, therefore, was that faculty would find the new assessment system to be of utility to them as professionals and as trainers of future teachers. For example, the TCWS was designed to address faculty concerns that the existing Exit Portfolio was not cohesive and did not place sufficient emphasis on student learning. The OPR was designed as it was to meet faculty requests that the existing Observation Report be replaced with an instrument with specific, observable behavioral indicators, a more precise rating scale, and less need to script an entire lesson. It was decided that the ILP and Mini TCWS at Preparing to Teach would consist of pieces of the OPR and TCWS in order to maintain consistency in candidate performance expectations over time. The Assessment of Professional Dispositions in the College Classroom was designed in response to faculty concerns that unit assessment of dispositions only took place in field settings despite the fact that faculty often recognized candidate dispositions while candidates were still enrolled in classes, long before going out into the field. Over the past three years, faculty and cooperating teachers have provided ample feedback suggesting that they find FSEHD's revised assessments to be useful, user-friendly, and superior to past unit assessments. Typical feedback on the utility of specific unit assessments is available.
The Director of Assessment and the assessment committee have also been informed on numerous occasions that the assessments are considerably easier to implement the second and subsequent times that faculty use them. (The first semester presents the largest learning curve.) Additionally, faculty have commented that the Exit assessments are much easier to implement when teacher candidates have already been exposed to them in their coursework and during the Preparing to Teach phase. These findings are to be expected and are a normal part of the change process.
Consequences. Linn (1994) states, “it is not enough to provide evidence that the assessments are measuring intended constructs. Evidence is also needed that the uses and interpretations are contributing to enhanced student achievement and, at the same time, not producing unintended negative consequences” (p. 8). Positive, intended consequences of the FSEHD unit assessments include improved learning on the part of candidates/graduates, as well as program and unit improvement based on the use of assessment data. Negative, unintended consequences might include a narrowing of the curriculum (to focus on preparation for assessments) or increased student dropout due to unanticipated burdens of the Assessment System. Positive, unintended consequences of the system may occur, as well, and these should be identified.
Follow-up surveys of graduates are conducted at the “post” transition point and include opportunities for graduates to provide open-ended feedback regarding the strengths and weaknesses of their programs and overall experiences at FSEHD. These qualitative data have been analyzed for clues as to consequences of the assessment system. At this time, data from program graduates do not reveal any negative unintended consequences of the unit assessment system at the initial or advanced levels. Feedback from faculty regarding the functioning of the assessment system and the consequences thereof is always welcome at FSEHD and is solicited on an ongoing basis. Feedback regarding the positive and negative intended and unintended consequences of the Advanced Program Assessment System is gathered and reflected upon, and none of it reveals unintended negative consequences of unit assessment.
Multiple Measures. FSEHD unit assessments draw on multiple formats, “traditional” and “alternative” alike. There are many methods for assessing learning; yet, no single assessment format is adequate for all purposes (American Educational Research Association, 2000). Consequently, the FSEHD assessment system allows candidates to demonstrate their knowledge, skills, and dispositions using a variety of methodologies. The various assessment methodologies include: Selected Response and Short Answers; Constructed Response; Performance Tasks; and Observation and Personal Communication.
As shown in the Assessment System blueprint, all four assessment formats are utilized throughout the four assessment transition checkpoints at FSEHD. This attempt to “balance” assessment in terms of assessment methods yields multiple forms of diverse and redundant types of evidence that can be used to check the validity and reliability of judgments and decisions (Wiggins, 1998).
As discussed above, it is essential that the School utilize assessment instruments and procedures that permit valid inferences regarding the dispositions of its advanced program candidates. Further, reliability, or the consistency of scores across raters, over time, or across different tasks or items that measure the same thing, is a necessary condition for validity. In the design phase of the Initial and Advanced Program Assessment Systems, detailed rubrics for most assessments were constructed to ensure the reliability of scoring the selected tasks and facilitate faculty training in the scoring system. Rubrics were developed that score the tasks according to relevant learning targets. Inter-rater reliability has been enhanced through the revision of existing assessment instruments.
FSEHD also routinely conducts studies to establish consistency, or reliability, of assessment procedures and unit operations. Beginning in Spring 2007, assessment data have been analyzed, and coefficients of internal consistency and/or inter-rater reliability (depending on the assessment) have been computed to determine how well this criterion was achieved. Additionally, the Director of Assessment conducts Many-Facet Rasch Measurement analyses to address the following research questions for performance assessments and rating scales in the system:
- Do the scorers differ in the levels of severity they exercise, or do both groups of raters function interchangeably?
- Do faculty and advanced program candidates rate in the same manner?
- Are there any inconsistent raters whose patterns of ratings show little systematic relationship to the scores that other raters give?
- Do some advanced program candidates exhibit unusual profiles of ratings across scale dimensions, receiving unexpectedly high (or low) ratings on certain dimensions, given the ratings the candidate received on other dimensions?
- Are there any raters who cannot effectively differentiate between scale dimensions, giving each teacher candidate very similar ratings across a number of conceptually distinct dimensions?
- Are the categories on the rubrics and rating scales appropriately ordered? Are the rubrics and rating scales functioning properly? Are all the scale categories clearly distinguishable?
The computer program FACETS (Linacre, 1988) is used to analyze the data and furnish answers to the research questions identified above. The findings from these analyses are used to refine/improve the Advanced Programs Assessment System, target faculty professional development needs, and serve as evidence of reliability of scoring and the validity of inferences made based on performance and rating scale assessments within the system. Investigations of this nature are conducted on an ongoing basis. The information from such studies is shared with program coordinators, department chairs, and dean's office staff. Suggestions for revising assessments, rubrics, and/or the system are then generated and implemented.
Research has demonstrated that the reliability coefficients for teacher-made assessments generally range from .60 to .85 (Linn & Gronlund, 2000), while standardized tests of achievement and aptitude tend to fall between the .80s and low .90s (Salvia & Ysseldyke, 1998). Assessment experts agree that the required level of reliability for assessment increases as the stakes attached to the assessments increase (i.e., when assessment-based decisions are important, permanent, or have lasting consequences) (Linn & Gronlund, 2000). Salvia and Ysseldyke (1998) specify a minimum reliability of .90 for assessments that are used for tracking and placement. FSEHD is working toward a goal of reliability coefficients of at least .85, given 1) the seriousness of the decisions made about students based on assessments in the Initial and Advanced Programs Assessment System and 2) the newness of many aspects of the assessment systems. With time, the school will aim toward even higher levels of reliability.
Results of FSEHD reliability studies. FSEHD routinely studies the internal consistency of its assessments, and results have generally been positive in this regard. As mentioned earlier, internal consistency of TCWS components is high and in fact has increased over the past two years: Fall 2008/Spring 2009, n=48; Fall 2009, n=120; and Spring 2010, n=253. Estimates of internal reliability (coefficient alpha) during these time periods for the seven TCWS constructs have been as follows: Contextual Factors, α=.89, .93, .94; Learning Goals & Objectives, α=.83, .96, .94; Assessment Plan, α=.75, .96, .94; Design for Instruction, α=.91, .94, .91; Instructional Decision Making, α=.87, .94, .95; Analysis of Student Learning, α=.87, .96, .94; Self Evaluation, α=.85 (Fall 2008/Spring 2009 only); Candidate Reflection on Student Teaching Experience, α=.61, .94, .87. Internal consistency of OPR scales is high and has been improving semester by semester (Fall 2008/Spring 2009, n=72; Fall 2009, n=120; and Spring 2010, n=667, n=1460), as the instrument and the scales have been refined. Internal consistency estimates during each time period for OPR scales have been as follows: Planning, α=.96, .98, .97; Implementation, α=.94, .97, .96; Content, α=.93, .95, .95; Climate, α=.94, .96, .96; Classroom Management, α=.93, .06, .96; Professionalism, α=.95, .98, .97; Reflection, α=.96, .97, .97; Technology Use, α=.98, .97 (Fall 2009/Spring 2010 only).
In 2009, the Director of Assessment conducted a series of Many-Facet Rasch Measurement analyses of the Career Commitment Essay assessment process in order to investigate rater consistency, differences in rater severity, and the functioning of the rubric's rating scale. These studies revealed that the four-point rating scales on the rubric were functioning as intended and that the dimensions in the rubric were well differentiated in terms of difficulty. Differences in rater leniency and severity were compensated for by the FSEHD practice of “bumping” average scores ending in 0.5 to the next highest whole number, which increases fairness to candidates. The full report can be accessed for more information.
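The bumping rule can be illustrated with a brief sketch. It assumes that each essay receives whole-number ratings from exactly two raters, which the description above does not state explicitly; the function name is hypothetical.

```python
def bumped_average(score_a: int, score_b: int) -> int:
    """Average two whole-number ratings, rounding an average ending in .5 up.

    Assumption: exactly two raters, each giving a non-negative whole number.
    """
    if score_a < 0 or score_b < 0:
        raise ValueError("ratings are assumed to be non-negative")
    # Adding 1 before integer division rounds x.5 up and leaves
    # whole-number averages unchanged.
    return (score_a + score_b + 1) // 2


print(bumped_average(2, 3))  # average 2.5 is bumped to 3
print(bumped_average(3, 3))  # average 3.0 stays 3
```

Because the rule always resolves a split decision in the candidate's favor, a one-point disagreement between a lenient and a severe rater never lowers the reported score below the more lenient rating.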
Analyses of scores on FSEHD's long-term Teacher Candidate Dispositions Assessment revealed low inter-rater reliability and provided further evidence supporting the need to revise unit dispositions and the unit dispositions process. Inter-rater reliability was measured in three ways: Pearson correlation, percent exact agreement, and Cohen's Kappa statistic. This study further highlighted differences in the way that faculty and students conceptualize and evaluate candidate dispositions, further calling into question the validity of inferences made on the basis of scores from this assessment. The full report can be accessed for more information.
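The three indices named above can be computed for a pair of raters as in the following sketch. It is a minimal illustration under the assumption that two raters score the same candidates on the same categorical scale; the function names and ratings are hypothetical and do not reproduce the study's data or software.

```python
from collections import Counter
from math import sqrt


def percent_exact_agreement(a, b):
    """Share of candidates given the identical rating by both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the raters' marginal proportions,
    # summed over rating categories.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson_r(a, b):
    """Linear correlation between the two raters' scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)


# Hypothetical ratings of six candidates by two raters.
rater_1 = [1, 2, 3, 4, 2, 3]
rater_2 = [1, 2, 3, 3, 2, 4]
print(percent_exact_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
print(pearson_r(rater_1, rater_2))
```

Reporting all three is useful because they answer different questions: the correlation reflects whether raters rank candidates similarly, exact agreement reflects whether they assign the same score, and kappa discounts agreement that would occur by chance.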
A 2010 faculty study that examined the internal consistency of Cooperating Teacher and College Supervisor data from the previous Observation Report supported the unit's move to a revised instrument. This study found the internal consistency (inter- and intra-rater agreement between open-ended and closed-ended items) of the old instrument to be tenuous. Questions about the instrument's purpose (formative or summative), the nature of the raters' task, and the raters' explication of the student teachers' progress also needed clarification. The full report can be accessed for more information.
A 2011 analysis of the inter-rater reliability of Cooperating Teacher and College Supervisor ratings of a lesson they observed together revealed a need for further training on the instrument and calibration among raters. OPR data from Spring 2010 and Fall 2010 were examined to assess the degree of inter-rater reliability between College Supervisors and Cooperating Teachers. Only those lessons observed on the same date by a candidate's College Supervisor and Cooperating Teacher were used for analysis. This yielded 239 observations (from 211 College Supervisors and Cooperating Teachers) for Spring 2010 and 126 observations (from 117 College Supervisors and Cooperating Teachers) for Fall 2010. Overall, inter-rater reliability between Cooperating Teachers and College Supervisors was not particularly high. Inter-rater reliability for average ratings on OPR sections ranged from .59 to .75. Inter-rater reliability for individual OPR indicators ranged from .45 to .70. Inter-rater reliability for OPR Capsule Ratings ranged from .68 to .69. Additionally, results suggested that individuals who had attended training on the instrument exhibited higher levels of inter-rater reliability. The full report can be accessed for more information.
The studies, practices, and related improvements to the unit assessment system noted in the above section all lead to more consistent data collection across programs. The thoughtful, planned, and evidence-based revision of the assessment system is leading to greater consistency in expectations and assessment practice across the unit. Emerging data indicate that the internal consistency of measurement with new assessments is quite high. Training of faculty and cooperating teachers is aimed at ensuring a common understanding of the design, purpose, and execution of unit assessments, as well as at improving inter-rater reliability. Inter-rater reliability analyses are revealing areas in which further orientation, training, and calibration are needed. All in all, the unit is moving in a very positive direction with regard to developing greater consistency in data collection.
Standard Setting. A final technical criterion for a high-quality assessment system is standard setting. In other words, programs and the unit must identify the amount and quality of evidence necessary to demonstrate proficiency on assessments. These are performance standards (Measured Measures, 2000). The standard-setting process cannot begin until criteria for levels of student performance (i.e., rubrics) are well articulated (Smith & Miller, 2003). This is why reliability must be established first.
FSEHD plans to train selected faculty in two approaches to standard setting by the end of 2011: the Angoff method and the Examination of Student Work method. The Angoff method is an assessment-based method in which faculty work collaboratively to determine a passing grade or acceptable performance on an assessment in a course. It can be used with traditional assessment types (such as selected response) that are frequently used in advanced coursework. The Examination of Student Work method of standard setting involves the review of student work (yielded through performance assessment) and results in the establishment of data-based cut-off scores and anchor papers/benchmark performances. Skill in using these standard-setting methods and implementation of these procedures within programs will yield more consistent scoring of student work samples at the formative and exit transition points, as well as within courses in programs, resulting in higher reliability in candidates' final course grades. A subsequent training will be offered to faculty in 2012. Additionally, faculty who are trained in standard-setting methods will be encouraged to share their knowledge with their peers in their departments. Standard-setting procedures will be applied to all unit performance assessments.
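The arithmetic behind an Angoff-style cut score can be sketched as follows. This assumes one commonly described formulation, in which each faculty judge estimates, item by item, the probability that a minimally competent candidate would answer correctly; the unit's planned procedure may differ in its details, and the function name and ratings are hypothetical.

```python
def angoff_cut_score(ratings):
    """Panel cut score from judges' item-level probability estimates.

    ratings[j][i] is judge j's estimate (0.0 to 1.0) that a minimally
    competent candidate answers item i correctly.
    """
    n_items = len(ratings[0])
    for judge in ratings:
        if len(judge) != n_items:
            raise ValueError("every judge must rate every item")
        if any(not 0.0 <= p <= 1.0 for p in judge):
            raise ValueError("estimates must be probabilities")

    # Each judge's implied passing score is the sum of their estimates.
    judge_totals = [sum(judge) for judge in ratings]
    # The panel's cut score is the mean of the judges' totals.
    return sum(judge_totals) / len(judge_totals)


# Hypothetical panel: three judges rating a four-item selected-response quiz.
panel = [
    [0.50, 0.75, 1.00, 0.25],
    [0.50, 0.50, 1.00, 0.50],
    [0.75, 0.75, 0.75, 0.75],
]
print(angoff_cut_score(panel))  # passing score out of 4 items
```

In practice, judges typically discuss items on which their estimates diverge and may revise them before the final average is taken, which is where the collaborative element of the method enters.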
An additional, useful benefit of the standard-setting process is that it often exposes flaws in scoring rubrics or the design of assessments. As such, it is part of an iterative process of ongoing revision and improvement.
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Authors.
- Baker, E. L., Linn, R. L., Herman, J. L., & Koretz, D. (2002). Standards for educational accountability (Policy Brief 5). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.
- Center for the Study of Evaluation & National Center for Research on Evaluation, Standards, and Student Testing. (1999). CRESST assessment glossary. Los Angeles, CA: CRESST/UCLA.
- Haessig, C. J., & LaPotin, A. S. (2007). Lessons Learned in the Assessment School of Hard Knocks. Irvine, CA: Electronic Educational Environment, UC Irvine.
- Linacre, J.M. (1988). FACETS. Chicago: Mesa.
- Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.
- Linn, R. L., & Gronlund, N. E. (2000). Measurement and evaluation in teaching (8th ed.). New York: Macmillan.
- McLeod, S. (2005). Data-driven teachers. Minneapolis: School Technology Leadership Initiative, University of Minnesota.
- Measured measures: Technical considerations for developing a local assessment system. (2005). Augusta, ME: Maine Department of Education.
- Smith, D. & Miller, L. (2003). Comprehensive local assessment systems (CLASs) primer: A guide to assessment system design and use. Gorham, ME: Southern Maine Partnership, University of Southern Maine.
- Salvia, J., & Ysseldyke, J. E. (1998). Assessment (7th ed.). Boston: Houghton Mifflin.
- Stiggins, R. J. (2001). Leadership for Excellence in Assessment: A Powerful New School District Planning Guide. Portland, OR: Assessment Training Institute.
- Webb, N. L. (2005). Issues related to judging the alignment of curriculum standards and assessments. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
- Wiggins, G. (1998). Educative assessment. San Francisco, CA: Jossey-Bass.