Research is conducted using scientific tests and measures, which yield observations and data. But for this data to be of any use, the tests must possess certain properties, such as reliability and validity, that ensure unbiased, accurate, and authentic results. This PsycholoGenie post explores these properties and explains them with the help of examples.
Reliability and validity are key concepts in the field of psychometrics, which is the study of theories and techniques involved in psychological measurement or assessment.
The science of psychometrics forms the basis of psychological testing and assessment, which involves obtaining an objective and standardized measure of the behavior and personality of the individual test taker. It is an exhaustive process that examines and measures all aspects of an individual’s identity. The data obtained via this process is then interlinked and integrated to form a rounded profile of the individual. Such profiles are often created in day-to-day life by various professionals, e.g., doctors create medical and lifestyle profiles of a patient in order to diagnose and treat health disorders, if any. Career counselors employ a similar approach to identify the field best suited to an individual. Such profiles are also constructed in courts to lend context and justification to legal cases, so that they can be resolved quickly, judiciously, and efficiently. However, to formulate accurate profiles, the method of assessment being employed must be accurate, unbiased, and relatively error-free. To ensure these qualities, each method or technique must possess certain essential properties.
◉ Standardization – All testing must be conducted under consistent and uniform parameters to avoid introducing any erroneous variation in test results.
◉ Objectivity – The evaluation of the test must be carried out in an objective manner such that no bias, either of the examiner or the examinee, is introduced or reflected in the obtained data.
◉ Test Norms – Each test must be designed in such a way that the results can be interpreted in a relative manner, i.e., it must establish a frame of reference or a point of comparison to compare the attributes of two or more individuals in a common setting.
◉ Reliability – The test must yield the same result each time it is administered to a particular entity or individual, i.e., the test results must be consistent.
◉ Validity – The test must measure what it is intended to measure, i.e., the results must satisfy and be in accordance with the objectives of the test.
Concept of Reliability
Reliability refers to the consistency and reproducibility of data produced by a given method, technique, or experiment. A form of assessment is said to be reliable if it repeatedly produces stable and similar results under consistent conditions. Consistency is partly ensured if the attribute being measured is stable and does not change suddenly. However, errors may be introduced by factors such as the physical and mental state of the examinee, inadequate attention, distractedness, response to visual and sensory stimuli in the environment, etc. When estimating the reliability of a measure, the examiner must be able to differentiate between the errors produced as a result of inefficient measurement and the actual variability of the true score. The true score is the part of a measured score that would recur consistently across various instances of testing in the absence of errors. Hence, the observed score produced by a test is a composite of the true score and the errors of measurement.
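The idea that an observed score is a composite of a true score and an error can be sketched in a few lines of Python. The numbers below are made up for illustration, and the ratio of true-score variance to observed-score variance is the classical test theory definition of reliability rather than something the passage above states. In practice true scores cannot be observed directly, so this is only a toy model.

```python
from statistics import pvariance

# Hypothetical data: five examinees' true scores and the error on one sitting.
# The errors are chosen to average zero and to be uncorrelated with the true scores.
true_scores = [50, 60, 70, 80, 90]
errors = [2, -4, 4, -4, 2]

# Observed score = true score + error of measurement.
observed_scores = [t + e for t, e in zip(true_scores, errors)]

def reliability_ratio(true_vals, observed_vals):
    """Share of the observed-score variance that comes from true-score variance."""
    return pvariance(true_vals) / pvariance(observed_vals)

ratio = reliability_ratio(true_scores, observed_scores)
print(observed_scores, round(ratio, 3))
```

The closer the ratio is to 1, the smaller the part played by measurement error.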
Types of Reliability
Test-retest Reliability
It is a measure of the consistency of test results when the test is administered to the same individual twice, where both instances are separated by a specific period of time, using the same testing instruments and conditions. The two scores are then evaluated to determine the true score and the stability of the test.
This type is used for attributes that are not expected to change within the given time period. It works well for measuring physical entities, but in the case of psychological constructs, it has a few drawbacks that may induce errors in the score. Firstly, the quality being studied may have undergone a change between the two instances of testing. Secondly, the experience of taking the test again could alter the way the examinee performs. And lastly, if the time interval between the two tests is not sufficient, the individual might answer based on the memory of his or her previous attempt rather than on the attribute being measured.
Example
Medical monitoring of “critical” patients works on this principle, since the vital statistics of the patient are compared and correlated over specific time intervals in order to determine whether the patient’s health is improving or deteriorating. Depending on the outcome, the medication and treatment of the patient is altered.
Parallel-forms Reliability
It measures reliability by either administering two similar forms of the same test, or conducting the same test in two similar settings. Despite the variability, both versions must focus on the same aspect of skill, personality, or intelligence of the individual. The two scores obtained are compared and correlated to determine if the results show consistency despite the introduction of alternate versions of environment or test. However, this leads to the question of whether the two similar but alternate forms are actually equivalent or not.
Example
If the problem-solving skills of an individual are being tested, one could generate a large set of suitable questions that can then be separated into two groups with the same level of difficulty, and then administered as two different tests. The comparison of the scores from both tests would help in eliminating errors, if any.
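The splitting step in this example can be sketched as follows. The difficulty ratings, the question ids, and the "A, B, B, A" dealing pattern are all assumptions made for illustration. The text only asks that the two groups have the same level of difficulty.

```python
def split_into_parallel_forms(difficulties):
    """Split a question pool into two forms of similar overall difficulty.

    `difficulties` maps a question id to a difficulty rating (hypothetical scale).
    Questions are ranked by difficulty and dealt out in an A, B, B, A pattern.
    """
    ranked = sorted(difficulties, key=difficulties.get)
    form_a, form_b = [], []
    for position, question in enumerate(ranked):
        # Within every group of four, positions 0 and 3 go to form A, 1 and 2 to form B.
        target = form_a if position % 4 in (0, 3) else form_b
        target.append(question)
    return form_a, form_b

pool = {"q1": 0.2, "q2": 0.9, "q3": 0.5, "q4": 0.4,
        "q5": 0.7, "q6": 0.3, "q7": 0.8, "q8": 0.6}
form_a, form_b = split_into_parallel_forms(pool)
print(form_a, form_b)
```

Once both forms have been administered, the two sets of scores can be correlated in the same way as for any other pair of measurements.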
Inter-rater Reliability
It measures the consistency of the scoring conducted by the evaluators of the test. It is important since not all individuals will perceive and interpret the answers in the same way, hence the deemed accuracy of the answers will vary according to the person evaluating them. Measuring it helps in identifying and reducing any errors that may be introduced by the subjectivity of the evaluator. If a majority of the evaluators are in agreement with regard to the answers, the test is accepted as being reliable. But if there is no consensus between the judges, it implies that the test is not reliable and has failed to actually test the desired quality. However, the judging of the test should be carried out without the influence of any personal bias. In other words, the judges should not agree or disagree with the other judges based on their personal perception of them.
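Agreement between two evaluators can be quantified in more than one way. The sketch below computes the simple proportion of matching verdicts and also Cohen's kappa, which corrects that proportion for the agreement expected by chance. Kappa is not mentioned in the text above and is offered here only as one widely used option. The verdicts are invented.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of answers on which two raters gave the same verdict."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label at random.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from two judges on eight answers.
judge_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_2 = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

agreement = percent_agreement(judge_1, judge_2)
kappa = cohens_kappa(judge_1, judge_2)
print(agreement, kappa)
```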
Example
This is often put into practice in the form of a panel of accomplished professionals, and can be witnessed in various contexts, such as the judging of a beauty pageant, a job interview, a scientific symposium, etc.
Internal Consistency Reliability
It refers to the ability of different parts of the test to probe the same aspect or construct of an individual. If two similar questions are posed to the examinee, the generation of similar answers implies that the test shows internal consistency. If the answers are dissimilar, the test is not consistent and needs to be refined. It is a statistical approach to determine reliability. It is of two types.
► Average Inter-item Correlation
It considers all the questions that probe the same construct, arranges them into all possible pairs, and then calculates the correlation coefficient of each pair of questions. Finally, an average is calculated of all the correlation coefficients to yield the final value for the average inter-item correlation. In other words, it ascertains how strongly, on average, the questions that probe a construct correlate with one another.
► Split-half Reliability
It splits the questions that probe the same construct into two sets of equal proportions, and the data obtained from both sets is compared and matched in order to determine the correlation, if any, between these two sets of data.
Concept of Validity
Validity refers to the ability of the test to measure data that satisfies and supports the objectives of the test. It also covers the extent to which the concept applies to the real world rather than only to an experimental setup. With respect to psychometrics, it is known as test validity and can be described as the degree to which the evidence supports a given theory. It is important since it helps researchers determine which test to implement in order to develop a measure that is ethical, efficient, cost-effective, and one that truly probes and measures the construct in question. Other non-psychological forms of validity include experimental validity and diagnostic validity. Experimental validity refers to whether a test will be supported by statistical evidence and if the test or theory has any real-life application. Diagnostic validity, on the other hand, is in the context of clinical medicine, and refers to the validity of diagnostic and screening tests.
Types of Validity
Construct Validity
It refers to the ability of the test to measure the construct or quality that it claims to measure, i.e., if a test claims to test intelligence, it is valid if it truly tests the intelligence of the individual. Establishing it involves conducting a statistical analysis of the internal structure of the test and having a panel of experts examine it to determine the suitability of each question. It also studies the relationship between the responses to the test questions and the ability of the individual to comprehend the questions and provide apt answers. For example, suppose a test is prepared with the intention of testing a student’s subject knowledge of science, but the language used to present the problems is highly sophisticated and difficult to comprehend. In such a case, the test, instead of gauging the knowledge, ends up testing language proficiency, and hence is not a valid measure of the student’s subject knowledge.
► Convergent Validity
This type of construct validity measures the degree to which two hypothetically related concepts are actually related in real life. For example, if a test that is designed to test the correlation between the emotions of joy, bliss, and happiness demonstrates that correlation with supporting data, then the test is said to possess convergent validity.
► Discriminant Validity
It is a measure of the degree to which two hypothetically unrelated concepts are actually unrelated in real life (evidenced by observed data). For example, if a certain test is designed to prove that happiness and despair are unrelated, and this is proved by the data obtained by conducting the test, then the test is said to have discriminant validity.
Content Validity
It is a non-statistical form of validity that involves the examination of the content of the test to determine whether it equally probes and measures all aspects of the given domain, i.e., if a specific domain has 4 subtypes, then an equal number of test questions must probe each of these 4 subtypes with equal intensity. This type of validity has to be taken into account while formulating the test itself, after conducting a thorough study of the construct to be measured. For example, if a test is designed to assess learning in the biology department, then that test must cover all aspects of the subject, including its various branches like zoology, botany, microbiology, biotechnology, genetics, ecology, etc., or at least appear to cover them.
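Although content validity is judged by examining the test rather than by statistics, the "equal number of questions per subtype" requirement can at least be checked mechanically. The sketch below is only a bookkeeping aid under that assumption. The question tags are hypothetical, and a real review would still need expert judgment about what each question actually probes.

```python
from collections import Counter

def coverage_report(question_branches, required_branches):
    """Count questions per branch and flag missing or unevenly covered branches."""
    counts = Counter(question_branches)
    per_branch = {branch: counts[branch] for branch in required_branches}
    missing = [branch for branch, n in per_branch.items() if n == 0]
    balanced = not missing and len(set(per_branch.values())) == 1
    return per_branch, missing, balanced

branches = ["zoology", "botany", "microbiology", "genetics"]

# Hypothetical draft test: each entry is the branch that one question probes.
draft_test = ["zoology", "zoology", "botany", "botany", "genetics", "genetics", "zoology"]

per_branch, missing, balanced = coverage_report(draft_test, branches)
print(per_branch, missing, balanced)
```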
► Representation Validity
It is also known as translation validity, and refers to the degree to which an abstract theoretical concept can be translated and implemented as a practical testable construct. For example, if one were to design a test to determine if comatose patients could communicate via some form of signals and if the test worked and produced appropriate supportive results, then the test would have representation validity.
► Face Validity
It is an estimate of whether a particular test appears to measure a construct. It in no way implies that the test actually measures the construct, but merely that it appears to. If a test appears to be measuring what it is supposed to, it has high face validity, but if it doesn’t, then it has low face validity. It is the least sophisticated form of validity and is also known as surface validity. Hence, if an intelligence test appears to be testing the intelligence of individuals, as observed by an evaluator, the test possesses face validity.
Criterion-related Validity
It measures the correlation between the outcomes of a test for a construct and the outcomes of pre-established tests that examine the individual criteria that form the overall construct. In other words, if a given construct has 3 criteria, the outcomes of the test are correlated with the outcomes of tests for each individual criterion that are already established as being valid. For example, if a company conducts an IQ test of a job applicant and matches it with his/her past academic record, any correlation that is observed will be an example of criterion-related validity. Depending on when the criterion is measured relative to the test, this validity is of two types.
► Concurrent Validity
It refers to the degree to which the results of a test correlate with the results obtained from a related test that has already been validated. The two tests are taken at the same time, and they provide a correlation between events that are on the same temporal plane (present). For example, if a batch of students is given an evaluative test, and on the same day, their teachers are asked to rate each one of those students and the results of both sets are compared, any correlation that is observed between the two sets of data will be evidence of concurrent validity.
► Predictive Validity
It refers to the degree to which the results of a test correlate with the results of a related test that is administered sometime in the future. The time difference between the administering of the two tests allows the correlation to possess a predictive quality. For example, if an evaluative test that claims to test the intelligence of students is administered, and the students with high scores go on to gain academic success later, while the ones with low scores do not do well academically, the test is said to possess predictive validity.
Although both concepts are essential for accurate psychological assessment, they are not interdependent. A test may be reliable without being valid, and vice versa. This is explained by considering the example of a weighing machine. If one puts a weight of 500g on the machine, and if it shows any other value than 500g, then it is not a valid measure. However, it may still be considered reliable if each time the weight is put on, the machine shows the same reading of say 250g. Hence, in terms of measurement, validity describes accuracy, whereas reliability describes precision.
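The weighing machine example can be made concrete by separating the average error of the readings from their spread. This is a minimal sketch with invented readings. Treating the mean error as the accuracy side and the standard deviation as the precision side is an illustrative simplification.

```python
from statistics import mean, pstdev

def describe_scale(readings, true_weight):
    """Return (bias, spread) for repeated readings of a known weight, in grams."""
    bias = mean(readings) - true_weight  # systematic error: the accuracy side
    spread = pstdev(readings)            # scatter across repeats: the precision side
    return bias, spread

TRUE_WEIGHT = 500  # grams

# Hypothetical machines: one always shows 250 g, the other scatters around 500 g.
consistent_but_wrong = [250, 250, 250, 250]
right_on_average_but_noisy = [480, 520, 510, 490]

print(describe_scale(consistent_but_wrong, TRUE_WEIGHT))
print(describe_scale(right_on_average_but_noisy, TRUE_WEIGHT))
```

The first machine has zero spread but a large bias, which matches the reliable-but-not-valid case described above.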
In the case of a test that is valid but unreliable, classical test theory provides the examiner or researcher with ways to improve the reliability of that test.